Compaction Package Documentation
The compaction package implements background processes that merge and optimize SSTable files in the Kevo engine. Compaction is a critical component of the LSM tree architecture, responsible for controlling read amplification, managing tombstones, and maintaining overall storage efficiency.
Overview
Compaction combines multiple SSTable files into fewer, larger, and more optimized files. This process is essential for maintaining good read performance and controlling disk usage in an LSM tree-based storage system.
Key responsibilities of the compaction package include:
- Selecting files for compaction based on configurable strategies
- Merging overlapping key ranges across multiple SSTables
- Managing tombstones and deleted data
- Organizing SSTables into a level-based hierarchy
- Coordinating background compaction operations
Architecture
Component Structure
The compaction package consists of several interrelated components that work together:
┌───────────────────────┐
│ CompactionCoordinator │
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐      ┌───────────────────────┐
│  CompactionStrategy   │─────▶│  CompactionExecutor   │
└───────────┬───────────┘      └───────────────────────┘
            │                              │
            ▼                              ▼
┌───────────────────────┐      ┌───────────────────────┐
│      FileTracker      │      │   TombstoneManager    │
└───────────────────────┘      └───────────────────────┘
- CompactionCoordinator: Orchestrates the compaction process
- CompactionStrategy: Determines which files to compact and when
- CompactionExecutor: Performs the actual merging of files
- FileTracker: Manages the lifecycle of SSTable files
- TombstoneManager: Tracks deleted keys and their lifecycle
Compaction Strategies
Tiered Compaction Strategy
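The primary strategy is named tiered in this package, but its structure (overlapping files in Level 0, non-overlapping key ranges and size thresholds in deeper levels) follows the leveled compaction design of LevelDB and RocksDB: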
-
Level Organization:
- Level 0: Contains files directly flushed from MemTables
- Level 1+: Contains files with non-overlapping key ranges
-
Compaction Triggers:
- L0→L1: When L0 accumulates too many files (L0 files can overlap, so each additional file adds to read amplification)
- Ln→Ln+1: When a level exceeds its size threshold
-
Size Ratio:
- Level L+1 can hold approximately 10 times as much data as level L
- This ratio is configurable (CompactionRatio in configuration)
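The two triggers above can be sketched as a single check. This is a minimal illustration, not Kevo's actual code: the `levelState` type, the `needsCompaction` function, and the `l0FileLimit` and `targetSize` parameters are all hypothetical names chosen for the example.

```go
package main

import "fmt"

// levelState is a hypothetical summary of one level, used only in this sketch.
type levelState struct {
	fileCount int
	sizeBytes int64
}

// needsCompaction reports whether the given level should be compacted.
// l0FileLimit and targetSize are assumed inputs, not part of Kevo's API.
func needsCompaction(level int, s levelState, l0FileLimit int, targetSize int64) bool {
	if level == 0 {
		// L0 is triggered by file count because its files may overlap.
		return s.fileCount >= l0FileLimit
	}
	// Deeper levels are triggered once they outgrow their size target.
	return s.sizeBytes > targetSize
}

func main() {
	// Five L0 files against a limit of four.
	fmt.Println(needsCompaction(0, levelState{fileCount: 5}, 4, 0))
	// 50 MiB in L1 against a 100 MiB target.
	fmt.Println(needsCompaction(1, levelState{sizeBytes: 50 << 20}, 4, 100<<20))
}
```

The first call reports that compaction is needed and the second that it is not.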
File Selection Algorithm
The strategy uses several criteria to select files for compaction:
-
L0 Compaction:
- Select all L0 files that overlap with the oldest L0 file
- Include overlapping files from L1
-
Level-N Compaction:
- Select a file from level N based on several possible criteria:
- Oldest file first
- File with most overlapping files in the next level
- File containing known tombstones
- Include all overlapping files from level N+1
-
Range Compaction:
- Select all files in a given key range across multiple levels
- Useful for manual compactions or hotspot optimization
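Every selection rule above relies on finding the files in the next level whose key range overlaps a chosen file. A minimal sketch of that test follows; `fileMeta`, `overlaps`, and `selectOverlapping` are hypothetical names, and the key bounds are assumed to be inclusive.

```go
package main

import (
	"bytes"
	"fmt"
)

// fileMeta is a hypothetical descriptor holding an SSTable's key bounds.
type fileMeta struct {
	name     string
	smallest []byte
	largest  []byte
}

// overlaps reports whether the inclusive key ranges of a and b intersect.
func overlaps(a, b fileMeta) bool {
	return bytes.Compare(a.smallest, b.largest) <= 0 &&
		bytes.Compare(b.smallest, a.largest) <= 0
}

// selectOverlapping returns the files in next whose range intersects target.
func selectOverlapping(target fileMeta, next []fileMeta) []fileMeta {
	var out []fileMeta
	for _, f := range next {
		if overlaps(target, f) {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	target := fileMeta{"L1-a", []byte("c"), []byte("m")}
	next := []fileMeta{
		{"L2-a", []byte("a"), []byte("b")},
		{"L2-b", []byte("b"), []byte("d")},
		{"L2-c", []byte("k"), []byte("p")},
		{"L2-d", []byte("q"), []byte("z")},
	}
	for _, f := range selectOverlapping(target, next) {
		fmt.Println(f.name) // prints L2-b, then L2-c
	}
}
```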
Implementation Details
Compaction Process
The compaction execution follows these steps:
-
File Selection:
- Strategy identifies files to compact
- Input files are grouped by level
-
Merge Process:
- Create merged iterators across all input files
- Write merged data to new output files
- Handle tombstones appropriately
-
File Management:
- Mark input files as obsolete
- Register new output files
- Clean up obsolete files
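The merge step can be illustrated with in-memory slices standing in for SSTable iterators. This is a sketch under stated assumptions: the `entry` type and `mergeNewestWins` function are hypothetical, each input is already sorted by key, and `inputs[0]` is assumed to be the newest file. A real implementation would stream from iterators rather than hold everything in memory.

```go
package main

import "fmt"

// entry is a hypothetical key/value record; each input is sorted by key.
type entry struct {
	key, value string
}

// mergeNewestWins merges sorted inputs into one sorted slice.
// inputs[0] is assumed to be the newest file; when several inputs contain
// the same key, only the value from the newest input is kept.
func mergeNewestWins(inputs [][]entry) []entry {
	pos := make([]int, len(inputs)) // read cursor per input
	var out []entry
	for {
		best := -1
		for i, in := range inputs {
			if pos[i] >= len(in) {
				continue
			}
			// Strict < keeps the lowest index (newest input) on equal keys.
			if best == -1 || in[pos[i]].key < inputs[best][pos[best]].key {
				best = i
			}
		}
		if best == -1 {
			return out // all inputs exhausted
		}
		winner := inputs[best][pos[best]]
		out = append(out, winner)
		// Advance every input past this key so older versions are dropped.
		for i, in := range inputs {
			for pos[i] < len(in) && in[pos[i]].key == winner.key {
				pos[i]++
			}
		}
	}
}

func main() {
	newer := []entry{{"a", "2"}, {"c", "9"}}
	older := []entry{{"a", "1"}, {"b", "5"}}
	fmt.Println(mergeNewestWins([][]entry{newer, older})) // [{a 2} {b 5} {c 9}]
}
```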
Tombstone Handling
Tombstones (deletion markers) require special treatment during compaction:
-
Tombstone Tracking:
- Recent deletions are tracked in the TombstoneManager
- Each tombstone carries a timestamp used to determine when it can be discarded
-
Tombstone Elimination:
- Basic rule: A tombstone can be discarded once every older SSTable that could still contain the deleted key has been compacted with it
- A tombstone can be dropped once it has been compacted past the deepest level that may hold an older version of the key
- Special case: A tombstone superseded by a newer write to the same key can be dropped immediately
-
Preservation Logic:
- Configurable MaxLevelWithTombstones controls how far tombstones propagate
- Required to ensure deleted data doesn't "resurface" from older files
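Read literally, the MaxLevelWithTombstones setting leads to a simple filter on the merged output. The sketch below assumes that interpretation; the `record` type and `dropTombstones` function are hypothetical and do not model the timestamp tracking done by the TombstoneManager.

```go
package main

import "fmt"

// record is a hypothetical merged entry; deleted marks a tombstone.
type record struct {
	key     string
	deleted bool
}

// dropTombstones filters the merged stream for a compaction that writes to
// outputLevel. maxLevel mirrors MaxLevelWithTombstones: markers are kept for
// output levels up to and including it, and dropped for deeper levels.
func dropTombstones(in []record, outputLevel, maxLevel int) []record {
	if outputLevel <= maxLevel {
		return in // still shallow enough that markers must be preserved
	}
	var out []record
	for _, r := range in {
		if !r.deleted {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	merged := []record{{"a", false}, {"b", true}, {"c", false}}
	fmt.Println(len(dropTombstones(merged, 1, 1))) // 3: marker kept at level 1
	fmt.Println(len(dropTombstones(merged, 2, 1))) // 2: marker for "b" dropped
}
```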
Background Processing
Compaction runs as a background process:
-
Worker Thread:
- Runs on a configurable interval (default 30 seconds)
- Selects and performs one compaction task per cycle
-
Concurrency Control:
- Lock mechanism ensures only one compaction runs at a time
- Avoids conflicts with other operations like flushing
-
Graceful Shutdown:
- Compaction can be stopped cleanly on engine shutdown
- Pending changes are completed before shutdown
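One common way to build such a loop in Go is a ticker combined with a stop channel. The sketch below is an assumption about the general shape, not the CompactionCoordinator's implementation; the `worker` type and its methods are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// worker is a hypothetical background loop, not Kevo's coordinator.
type worker struct {
	interval time.Duration
	task     func()        // one compaction cycle
	mu       sync.Mutex    // ensures only one cycle runs at a time
	stop     chan struct{} // closed to request shutdown
	done     chan struct{} // closed when the loop has exited
}

func newWorker(interval time.Duration, task func()) *worker {
	return &worker{
		interval: interval,
		task:     task,
		stop:     make(chan struct{}),
		done:     make(chan struct{}),
	}
}

// Start launches the loop in its own goroutine.
func (w *worker) Start() {
	go func() {
		defer close(w.done)
		ticker := time.NewTicker(w.interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				w.runOnce()
			case <-w.stop:
				return
			}
		}
	}()
}

// runOnce performs a single cycle while holding the lock.
func (w *worker) runOnce() {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.task()
}

// Stop requests shutdown and waits for any in-flight cycle to finish.
// It must be called at most once in this sketch.
func (w *worker) Stop() {
	close(w.stop)
	<-w.done
}

func main() {
	ran := make(chan struct{}, 1)
	w := newWorker(10*time.Millisecond, func() {
		select {
		case ran <- struct{}{}: // signal the first completed cycle
		default:
		}
	})
	w.Start()
	<-ran // wait for at least one cycle
	w.Stop()
	fmt.Println("stopped cleanly")
}
```

Because Stop blocks until the goroutine exits, a cycle that is already running finishes before shutdown returns, which matches the graceful shutdown behavior described above.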
File Tracking and Cleanup
The FileTracker component manages file lifecycles:
-
File States:
- Active: Current file in use
- Pending: Being compacted
- Obsolete: Ready for deletion
-
Safe Deletion:
- Files are only deleted when not in use
- Two-phase marking ensures no premature deletions
-
Cleanup Process:
- Runs after each compaction cycle
- Safely removes obsolete files from disk
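The three states and the two-phase marking can be modeled as a small state machine. This sketch is one plausible reading of the description, not the FileTracker's real API: the `tracker` type, its methods, and the reference count used to represent "in use" are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

type fileState int

const (
	active   fileState = iota // current file in use
	pending                   // selected as compaction input
	obsolete                  // replaced; deletable once unreferenced
)

// tracker is a hypothetical stand-in for the FileTracker described above.
type tracker struct {
	state map[string]fileState
	refs  map[string]int // open readers per file (assumed mechanism)
}

func newTracker() *tracker {
	return &tracker{state: map[string]fileState{}, refs: map[string]int{}}
}

// markPending is phase one: the file becomes a compaction input.
func (t *tracker) markPending(name string) { t.state[name] = pending }

// markObsolete is phase two; only pending files may become obsolete.
func (t *tracker) markObsolete(name string) error {
	if t.state[name] != pending {
		return errors.New("file is not pending: " + name)
	}
	t.state[name] = obsolete
	return nil
}

// deletable lists obsolete files that no reader still holds open.
func (t *tracker) deletable() []string {
	var out []string
	for name, s := range t.state {
		if s == obsolete && t.refs[name] == 0 {
			out = append(out, name)
		}
	}
	sort.Strings(out) // map iteration order is random
	return out
}

func main() {
	t := newTracker()
	t.refs["000001.sst"] = 1 // a reader has the file open
	t.markPending("000001.sst")
	if err := t.markObsolete("000001.sst"); err != nil {
		fmt.Println(err)
	}
	fmt.Println(len(t.deletable())) // 0: still referenced
	t.refs["000001.sst"] = 0
	fmt.Println(t.deletable()) // [000001.sst]
}
```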
Performance Considerations
Read Amplification
Compaction is crucial for controlling read amplification:
-
Level Strategy Impact:
- Without compaction, every SSTable would need to be checked on each read
- With leveling, a read checks the L0 files plus at most one file in each deeper level, because files within those levels do not overlap
-
Optimization for Point Queries:
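- Files within Level 1 and above do not overlap, so at most one file per level can contain a given key
- Binary search over the file key ranges within a level finds that file without scanning the others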
-
Range Query Optimization:
- Reduced file count improves range scan performance
- Sorted levels allow efficient merge iteration
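The per-level lookup can be sketched with `sort.Search` from the Go standard library. The `fileRange` type and `findFile` function are hypothetical, and the level is assumed to be sorted by key with no overlapping ranges.

```go
package main

import (
	"fmt"
	"sort"
)

// fileRange is a hypothetical descriptor with inclusive key bounds.
type fileRange struct {
	name              string
	smallest, largest string
}

// findFile returns the only file in a sorted, non-overlapping level that
// could contain key, or false when the key falls in a gap between files.
func findFile(level []fileRange, key string) (fileRange, bool) {
	// Index of the first file whose largest key is >= key.
	i := sort.Search(len(level), func(i int) bool {
		return level[i].largest >= key
	})
	if i < len(level) && level[i].smallest <= key {
		return level[i], true
	}
	return fileRange{}, false
}

func main() {
	level := []fileRange{
		{"a.sst", "a", "f"},
		{"b.sst", "h", "m"},
		{"c.sst", "p", "z"},
	}
	f, ok := findFile(level, "k")
	fmt.Println(f.name, ok) // b.sst true
	_, ok = findFile(level, "g")
	fmt.Println(ok) // false: "g" lies between a.sst and b.sst
}
```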
Write Amplification
The compaction process does introduce write amplification:
-
Cascading Rewrites:
- Data may be rewritten multiple times as it moves through levels
- Key factor in overall write amplification of the storage engine
-
Mitigation Strategies:
- Larger level size ratios reduce compaction frequency
- Careful file selection minimizes unnecessary rewrites
Space Amplification
Compaction also manages space amplification:
-
Duplicate Key Elimination:
- Compaction removes outdated versions of keys
- Critical for preventing unbounded growth
-
Tombstone Purging:
- Eventually removes deletion markers
- Prevents accumulation of "ghost" records
Tuning Parameters
Several parameters can be adjusted to optimize compaction behavior:
-
CompactionLevels (default: 7):
- Number of levels in the storage hierarchy
- More levels mean less write amplification but more read amplification
-
CompactionRatio (default: 10):
- Size ratio between adjacent levels
- Higher ratio means less frequent compaction but larger individual compactions
-
CompactionThreads (default: 2):
- Number of threads for compaction operations
- More threads can speed up compaction but increase resource usage
-
CompactionInterval (default: 30 seconds):
- Time between compaction checks
- Lower values make compaction more responsive but may cause more CPU usage
-
MaxLevelWithTombstones (default: 1):
- Highest level that preserves tombstones
- Controls how long deletion markers persist
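For reference, the parameters and defaults listed above could be collected as follows. The field names and default values come from the list; the `compactionConfig` struct, the Go types of its fields, and the bounds in `validate` are assumptions made for this sketch.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// compactionConfig is a hypothetical grouping of the documented parameters.
type compactionConfig struct {
	CompactionLevels       int
	CompactionRatio        int
	CompactionThreads      int
	CompactionInterval     time.Duration
	MaxLevelWithTombstones int
}

// defaultCompactionConfig returns the defaults listed in the documentation.
func defaultCompactionConfig() compactionConfig {
	return compactionConfig{
		CompactionLevels:       7,
		CompactionRatio:        10,
		CompactionThreads:      2,
		CompactionInterval:     30 * time.Second,
		MaxLevelWithTombstones: 1,
	}
}

// validate rejects values that cannot work; the exact bounds are assumed.
func (c compactionConfig) validate() error {
	switch {
	case c.CompactionLevels < 2:
		return errors.New("CompactionLevels must be at least 2")
	case c.CompactionRatio < 2:
		return errors.New("CompactionRatio must be at least 2")
	case c.CompactionThreads < 1:
		return errors.New("CompactionThreads must be at least 1")
	case c.CompactionInterval <= 0:
		return errors.New("CompactionInterval must be positive")
	case c.MaxLevelWithTombstones < 0 || c.MaxLevelWithTombstones >= c.CompactionLevels:
		return errors.New("MaxLevelWithTombstones must name an existing level")
	}
	return nil
}

func main() {
	cfg := defaultCompactionConfig()
	fmt.Println(cfg.validate()) // <nil>
	cfg.CompactionInterval = 5 * time.Second // more responsive, more CPU
	fmt.Println(cfg.validate()) // <nil>
}
```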
Common Usage Patterns
Default Configuration
Most users don't need to interact directly with compaction, as it's managed automatically by the storage engine. The default configuration provides a good balance between read and write performance.
Manual Compaction Trigger
For maintenance or after bulk operations, manual compaction can be triggered:
// Trigger compaction for the entire database
err := engine.GetCompactionManager().TriggerCompaction()
if err != nil {
	log.Fatal(err)
}

// Compact a specific key range
startKey := []byte("user:1000")
endKey := []byte("user:2000")
err = engine.GetCompactionManager().CompactRange(startKey, endKey)
if err != nil {
	log.Fatal(err)
}
Custom Compaction Strategy
For specialized workloads, a custom compaction strategy can be implemented:
// Example: Creating a coordinator with a custom strategy
customStrategy := NewMyCustomStrategy(config, sstableDir)
coordinator := NewCompactionCoordinator(config, sstableDir, CompactionCoordinatorOptions{
	Strategy: customStrategy,
})

// Start background compaction
coordinator.Start()
Trade-offs and Limitations
Compaction Pauses
Compaction can temporarily impact performance:
-
Disk I/O Spikes:
- Compaction involves significant disk I/O
- May affect concurrent read/write operations
-
Resource Sharing:
- Compaction competes with regular operations for system resources
- Tuning needed to balance background work against foreground performance
Size vs. Level Trade-offs
The level structure involves several trade-offs:
-
Few Levels:
- Less read amplification (fewer levels to check)
- More write amplification (more frequent compactions)
-
Many Levels:
- More read amplification (more levels to check)
- Less write amplification (less frequent compactions)
Full Compaction Limitations
Some limitations exist for full database compactions:
-
Resource Intensity:
- Full compaction requires significant I/O and CPU
- May need to be scheduled during low-usage periods
-
Space Requirements:
- Temporarily requires space for both old and new files
- May not be feasible with limited disk space
Advanced Concepts
Dynamic Level Sizing
The implementation uses dynamic level sizing:
-
Target Size Calculation:
- Level L target size = Base size × CompactionRatio^L
- Automatically adjusts as the database grows
-
Level-0 Special Case:
- Level 0 is managed by file count rather than size
- Controls read amplification from recent writes
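As a worked example of the target size formula, the sketch below evaluates it for the first few levels. The base size of 10 MiB is a hypothetical value chosen for round numbers, and `targetSize` is an illustrative helper rather than a function from the package.

```go
package main

import "fmt"

// targetSize returns base * ratio^level, following the formula above.
func targetSize(base, ratio int64, level int) int64 {
	size := base
	for i := 0; i < level; i++ {
		size *= ratio
	}
	return size
}

func main() {
	const baseMiB = 10 // hypothetical base size in MiB
	const ratio = 10   // documented default for CompactionRatio
	for level := 1; level <= 3; level++ {
		fmt.Printf("L%d: %d MiB\n", level, targetSize(baseMiB, ratio, level))
	}
}
```

With these inputs the targets are 100 MiB for L1, 1000 MiB for L2, and 10000 MiB for L3. Level 0 is left out because it is governed by file count rather than size.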
Compaction Priority
Compaction tasks are prioritized based on several factors:
- Level-0 Buildup: Highest priority to prevent read amplification
- Size Imbalance: Levels exceeding target size
- Tombstone Presence: Files with deletions that can be cleaned up
- File Age: Older files get priority for compaction
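If the factors above are taken as a strict ordering, which is an assumption since only Level-0 buildup is stated to be highest priority, task selection reduces to a stable sort. The `reason` and `task` types and the `prioritize` function are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// reason is a hypothetical enumeration; lower values sort first.
type reason int

const (
	l0Buildup     reason = iota // highest priority
	sizeImbalance               // level exceeds its target size
	tombstones                  // file has deletions to clean up
	fileAge                     // lowest priority in this sketch
)

// task is a hypothetical pending compaction for one level.
type task struct {
	level  int
	reason reason
}

// prioritize orders tasks by reason and keeps the existing order among
// tasks that share a reason.
func prioritize(tasks []task) {
	sort.SliceStable(tasks, func(i, j int) bool {
		return tasks[i].reason < tasks[j].reason
	})
}

func main() {
	tasks := []task{
		{3, fileAge},
		{1, sizeImbalance},
		{0, l0Buildup},
		{2, tombstones},
	}
	prioritize(tasks)
	for _, t := range tasks {
		fmt.Println(t.level) // prints 0, 1, 2, 3
	}
}
```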
Seek-Based Compaction
As a future enhancement, seek-based compaction could be added:
-
Tracking Hot Files:
- Monitor which files receive the most seek operations
- Prioritize these files for compaction
-
Adaptive Strategy:
- Adjust compaction based on observed workload patterns
- Optimize frequently accessed key ranges
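Since this feature is not implemented, the following is only a sketch of how hot-file tracking might look. The `seekTracker` type and its fixed threshold are hypothetical; LevelDB takes a related approach with a per-file seek allowance.

```go
package main

import "fmt"

// seekTracker is a hypothetical counter of seeks per SSTable file.
type seekTracker struct {
	threshold int            // seeks after which a file counts as hot
	seeks     map[string]int // seek count per file name
}

func newSeekTracker(threshold int) *seekTracker {
	return &seekTracker{threshold: threshold, seeks: map[string]int{}}
}

// record notes one seek and reports whether the file has become a
// compaction candidate.
func (s *seekTracker) record(file string) bool {
	s.seeks[file]++
	return s.seeks[file] >= s.threshold
}

func main() {
	st := newSeekTracker(3)
	for i := 0; i < 3; i++ {
		fmt.Println(st.record("000007.sst")) // false, false, true
	}
}
```

A production version would need synchronization, since reads happen concurrently, and would reset a file's counter once it has been compacted.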