Compaction Package Documentation

The compaction package implements background processes that merge and optimize SSTable files in the Kevo engine. Compaction is a critical component of the LSM tree architecture, responsible for controlling read amplification, managing tombstones, and maintaining overall storage efficiency.

Overview

Compaction combines multiple SSTable files into fewer, larger, and more optimized files. This process is essential for maintaining good read performance and controlling disk usage in an LSM tree-based storage system.

Key responsibilities of the compaction package include:

  • Selecting files for compaction based on configurable strategies
  • Merging overlapping key ranges across multiple SSTables
  • Managing tombstones and deleted data
  • Organizing SSTables into a level-based hierarchy
  • Coordinating background compaction operations

Architecture

Component Structure

The compaction package consists of several interrelated components that work together:

┌───────────────────────┐
│ CompactionCoordinator │
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐      ┌───────────────────────┐
│  CompactionStrategy   │─────▶│   CompactionExecutor  │
└───────────┬───────────┘      └───────────────────────┘
            │                              │
            ▼                              ▼
┌───────────────────────┐      ┌───────────────────────┐
│     FileTracker       │      │   TombstoneManager    │
└───────────────────────┘      └───────────────────────┘
  1. CompactionCoordinator: Orchestrates the compaction process
  2. CompactionStrategy: Determines which files to compact and when
  3. CompactionExecutor: Performs the actual merging of files
  4. FileTracker: Manages the lifecycle of SSTable files
  5. TombstoneManager: Tracks deleted keys and their lifecycle
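
To make the division of labor concrete, here is a rough sketch of how these components might be typed in Go. The method names and the CompactionTask struct are illustrative assumptions, not the package's exact API.

// Illustrative sketch only; the real types in the compaction package may
// differ in names and signatures.

// CompactionStrategy decides which files should be merged next.
type CompactionStrategy interface {
    // SelectCompaction returns the next task, or nil if nothing is needed.
    SelectCompaction() *CompactionTask
}

// CompactionExecutor carries out the merge described by a task.
type CompactionExecutor interface {
    // CompactFiles merges the task's inputs and returns the new file paths.
    CompactFiles(task *CompactionTask) ([]string, error)
}

// CompactionTask groups the input files for one compaction, keyed by level.
type CompactionTask struct {
    InputFiles  map[int][]string // level -> SSTable file paths
    TargetLevel int              // level that receives the merged output
}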

Compaction Strategies

Tiered Compaction Strategy

The primary strategy implemented is a tiered (or leveled) compaction strategy, inspired by LevelDB and RocksDB:

  1. Level Organization:

    • Level 0: Contains files directly flushed from MemTables
    • Level 1+: Contains files with non-overlapping key ranges
  2. Compaction Triggers:

    • L0→L1: When L0 has too many files (causes read amplification)
    • Ln→Ln+1: When a level exceeds its size threshold
  3. Size Ratio:

    • Level L+1 can hold approximately 10x more data than level L
    • This ratio is configurable (CompactionRatio in configuration)
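
As a rough illustration of both triggers, the check below uses a file-count threshold for level 0 and a size target for other levels. FileInfo, maxL0Files, and targetBytes are assumed names for illustration, not the engine's actual configuration fields.

// FileInfo is a minimal stand-in for per-SSTable metadata.
type FileInfo struct {
    Path string
    Size int64
}

// needsCompaction reports whether a level should be compacted: L0 is triggered
// by file count, higher levels by total size versus their target (the target
// grows geometrically with the level; see Dynamic Level Sizing below).
func needsCompaction(level int, files []FileInfo, maxL0Files int, targetBytes int64) bool {
    if level == 0 {
        return len(files) >= maxL0Files
    }
    var total int64
    for _, f := range files {
        total += f.Size
    }
    return total > targetBytes
}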

File Selection Algorithm

The strategy uses several criteria to select files for compaction:

  1. L0 Compaction:

    • Select all L0 files that overlap with the oldest L0 file
    • Include overlapping files from L1
  2. Level-N Compaction:

    • Select a file from level N based on several possible criteria:
      • Oldest file first
      • File with most overlapping files in the next level
      • File containing known tombstones
    • Include all overlapping files from level N+1
  3. Range Compaction:

    • Select all files in a given key range across multiple levels
    • Useful for manual compactions or hotspot optimization
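
At the heart of each of these cases is a key-range overlap test. A minimal sketch, with keyRange standing in for whatever per-file metadata the strategy actually keeps:

import "bytes"

// keyRange holds the smallest and largest keys stored in one SSTable.
type keyRange struct {
    Smallest, Largest []byte
}

// overlaps reports whether two key ranges intersect.
func overlaps(a, b keyRange) bool {
    return bytes.Compare(a.Smallest, b.Largest) <= 0 &&
        bytes.Compare(b.Smallest, a.Largest) <= 0
}

// overlappingFiles returns the files in the next level whose key ranges
// intersect the range being compacted; those files become additional inputs.
func overlappingFiles(target keyRange, nextLevel []keyRange) []keyRange {
    var out []keyRange
    for _, f := range nextLevel {
        if overlaps(target, f) {
            out = append(out, f)
        }
    }
    return out
}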

Implementation Details

Compaction Process

The compaction execution follows these steps:

  1. File Selection:

    • Strategy identifies files to compact
    • Input files are grouped by level
  2. Merge Process:

    • Create merged iterators across all input files
    • Write merged data to new output files
    • Handle tombstones appropriately
  3. File Management:

    • Mark input files as obsolete
    • Register new output files
    • Clean up obsolete files
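
The merge step can be pictured as a single pass over a merged iterator, as in the sketch below. Iterator and Writer are simplified stand-ins for the engine's real iterator and SSTable writer types.

// Iterator and Writer are simplified stand-ins for the engine's types.
type Iterator interface {
    SeekToFirst()
    Valid() bool
    Next()
    Key() []byte
    Value() []byte
    IsTombstone() bool
}

type Writer interface {
    Add(key, value []byte) error
    Finish() error
}

// mergeInputs walks the merged view of all input files in key order and
// writes surviving entries to the output table.
func mergeInputs(it Iterator, w Writer, dropTombstones bool) error {
    for it.SeekToFirst(); it.Valid(); it.Next() {
        if it.IsTombstone() && dropTombstones {
            // Safe only once no older file can still contain this key
            // (see Tombstone Handling below).
            continue
        }
        // In the real writer, tombstones are written as deletion markers
        // rather than plain values; this sketch does not distinguish them.
        if err := w.Add(it.Key(), it.Value()); err != nil {
            return err
        }
    }
    return w.Finish()
}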

Tombstone Handling

Tombstones (deletion markers) require special treatment during compaction:

  1. Tombstone Tracking:

    • Recent deletions are tracked in the TombstoneManager
    • Tracks tombstones with timestamps to determine when they can be discarded
  2. Tombstone Elimination:

    • Basic rule: A tombstone can be discarded if all older SSTables have been compacted
    • Tombstones in lower levels can be dropped once they've propagated to higher levels
    • Special case: Tombstones indicating overwritten keys can be dropped immediately
  3. Preservation Logic:

    • Configurable MaxLevelWithTombstones controls how far tombstones propagate
    • Required to ensure deleted data doesn't "resurface" from older files
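
The preservation rule reduces to a small predicate like the one below; maxLevelWithTombstones mirrors the configuration parameter of the same name, and keyMayExistInOlderFiles is an assumed input the real code would derive from level metadata.

// canDropTombstone sketches the decision made while compacting into
// targetLevel: at or below the tombstone-preserving levels the marker is
// always kept; above them it may be dropped if no older file can still
// contain the key.
func canDropTombstone(targetLevel, maxLevelWithTombstones int, keyMayExistInOlderFiles bool) bool {
    if targetLevel <= maxLevelWithTombstones {
        return false // keep the marker so the deletion keeps shadowing old data
    }
    return !keyMayExistInOlderFiles
}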

Background Processing

Compaction runs as a background process:

  1. Worker Thread:

    • Runs on a configurable interval (default 30 seconds)
    • Selects and performs one compaction task per cycle
  2. Concurrency Control:

    • Lock mechanism ensures only one compaction runs at a time
    • Avoids conflicts with other operations like flushing
  3. Graceful Shutdown:

    • Compaction can be stopped cleanly on engine shutdown
    • Pending changes are completed before shutdown
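
A typical shape for this worker in Go is a ticker-driven goroutine guarded by a mutex, with a stop channel and wait group for clean shutdown. The fields and the runOneCycle helper below are assumptions for illustration only.

import (
    "sync"
    "time"
)

// CompactionCoordinator fields shown here are illustrative.
type CompactionCoordinator struct {
    mu       sync.Mutex
    wg       sync.WaitGroup
    interval time.Duration // e.g. 30 * time.Second
    stopCh   chan struct{}
}

// Start launches the background worker loop.
func (c *CompactionCoordinator) Start() {
    c.stopCh = make(chan struct{})
    c.wg.Add(1)
    go func() {
        defer c.wg.Done()
        ticker := time.NewTicker(c.interval)
        defer ticker.Stop()
        for {
            select {
            case <-ticker.C:
                c.mu.Lock() // only one compaction runs at a time
                c.runOneCycle()
                c.mu.Unlock()
            case <-c.stopCh:
                return
            }
        }
    }()
}

// Stop signals the worker and waits for any in-flight cycle to finish.
func (c *CompactionCoordinator) Stop() {
    close(c.stopCh)
    c.wg.Wait()
}

func (c *CompactionCoordinator) runOneCycle() {
    // Ask the strategy for a task and hand it to the executor (omitted).
}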

File Tracking and Cleanup

The FileTracker component manages file lifecycles:

  1. File States:

    • Active: Current file in use
    • Pending: Being compacted
    • Obsolete: Ready for deletion
  2. Safe Deletion:

    • Files are only deleted when not in use
    • Two-phase marking ensures no premature deletions
  3. Cleanup Process:

    • Runs after each compaction cycle
    • Safely removes obsolete files from disk
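
The two-phase approach can be sketched as a small state map plus reference counts; the FileTracker fields here are assumptions rather than the package's actual layout.

import (
    "os"
    "sync"
)

// FileState models the lifecycle described above.
type FileState int

const (
    FileActive   FileState = iota // currently serving reads
    FilePending                   // input to an in-progress compaction
    FileObsolete                  // replaced; delete once unreferenced
)

// FileTracker fields are illustrative.
type FileTracker struct {
    mu        sync.Mutex
    states    map[string]FileState
    refCounts map[string]int
}

// MarkObsolete is phase one: flag the file without touching the disk.
func (t *FileTracker) MarkObsolete(path string) {
    t.mu.Lock()
    defer t.mu.Unlock()
    t.states[path] = FileObsolete
}

// Cleanup is phase two: remove files that are obsolete and no longer in use.
func (t *FileTracker) Cleanup() error {
    t.mu.Lock()
    defer t.mu.Unlock()
    for path, st := range t.states {
        if st == FileObsolete && t.refCounts[path] == 0 {
            if err := os.Remove(path); err != nil {
                return err
            }
            delete(t.states, path)
        }
    }
    return nil
}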

Performance Considerations

Read Amplification

Compaction is crucial for controlling read amplification:

  1. Level Strategy Impact:

    • Without compaction, all SSTables would need checking for each read
    • With leveling, reads typically check one file per level
  2. Optimization for Point Queries:

    • Higher levels have fewer overlaps
    • Binary search within levels reduces lookups
  3. Range Query Optimization:

    • Reduced file count improves range scan performance
    • Sorted levels allow efficient merge iteration

Write Amplification

The compaction process does introduce write amplification:

  1. Cascading Rewrites:

    • Data may be rewritten multiple times as it moves through levels
    • Key factor in overall write amplification of the storage engine
  2. Mitigation Strategies:

    • Larger level size ratios reduce compaction frequency
    • Careful file selection minimizes unnecessary rewrites

Space Amplification

Compaction also manages space amplification:

  1. Duplicate Key Elimination:

    • Compaction removes outdated versions of keys
    • Critical for preventing unbounded growth
  2. Tombstone Purging:

    • Eventually removes deletion markers
    • Prevents accumulation of "ghost" records

Tuning Parameters

Several parameters can be adjusted to optimize compaction behavior:

  1. CompactionLevels (default: 7):

    • Number of levels in the storage hierarchy
    • More levels mean less write amplification but more read amplification
  2. CompactionRatio (default: 10):

    • Size ratio between adjacent levels
    • Higher ratio means less frequent compaction but larger individual compactions
  3. CompactionThreads (default: 2):

    • Number of threads for compaction operations
    • More threads can speed up compaction but increase resource usage
  4. CompactionInterval (default: 30 seconds):

    • Time between compaction checks
    • Lower values make compaction more responsive but may cause more CPU usage
  5. MaxLevelWithTombstones (default: 1):

    • Highest level that preserves tombstones
    • Controls how long deletion markers persist
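
Grouped together, the defaults look roughly like this; compactionConfig is a hypothetical struct used only to summarize the parameters, not kevo's actual configuration type.

import "time"

// compactionConfig is a hypothetical grouping of the tuning knobs above.
type compactionConfig struct {
    CompactionLevels       int           // levels in the storage hierarchy
    CompactionRatio        float64       // size ratio between adjacent levels
    CompactionThreads      int           // worker threads for compaction
    CompactionInterval     time.Duration // time between compaction checks
    MaxLevelWithTombstones int           // highest level that keeps tombstones
}

// Documented defaults.
var defaultCompactionConfig = compactionConfig{
    CompactionLevels:       7,
    CompactionRatio:        10,
    CompactionThreads:      2,
    CompactionInterval:     30 * time.Second,
    MaxLevelWithTombstones: 1,
}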

Common Usage Patterns

Default Configuration

Most users don't need to interact directly with compaction, as it's managed automatically by the storage engine. The default configuration provides a good balance between read and write performance.

Manual Compaction Trigger

For maintenance or after bulk operations, manual compaction can be triggered:

// Trigger compaction for the entire database
err := engine.GetCompactionManager().TriggerCompaction()
if err != nil {
    log.Fatal(err)
}

// Compact a specific key range
startKey := []byte("user:1000")
endKey := []byte("user:2000")
err = engine.GetCompactionManager().CompactRange(startKey, endKey)
if err != nil {
    log.Fatal(err)
}

Custom Compaction Strategy

For specialized workloads, a custom compaction strategy can be implemented:

// Example: Creating a coordinator with a custom strategy
customStrategy := NewMyCustomStrategy(config, sstableDir)
coordinator := NewCompactionCoordinator(config, sstableDir, CompactionCoordinatorOptions{
    Strategy: customStrategy,
})

// Start background compaction
coordinator.Start()

Trade-offs and Limitations

Compaction Pauses

Compaction can temporarily impact performance:

  1. Disk I/O Spikes:

    • Compaction involves significant disk I/O
    • May affect concurrent read/write operations
  2. Resource Sharing:

    • Compaction competes with regular operations for system resources
    • Tuning needed to balance background work against foreground performance

Size vs. Level Trade-offs

The level structure involves several trade-offs:

  1. Few Levels:

    • Less read amplification (fewer levels to check)
    • More write amplification (more frequent compactions)
  2. Many Levels:

    • More read amplification (more levels to check)
    • Less write amplification (less frequent compactions)

Full Compaction Limitations

Some limitations exist for full database compactions:

  1. Resource Intensity:

    • Full compaction requires significant I/O and CPU
    • May need to be scheduled during low-usage periods
  2. Space Requirements:

    • Temporarily requires space for both old and new files
    • May not be feasible with limited disk space

Advanced Concepts

Dynamic Level Sizing

The implementation uses dynamic level sizing:

  1. Target Size Calculation:

    • Level L target size = Base size × CompactionRatio^L (see the sketch after this list)
    • Automatically adjusts as the database grows
  2. Level-0 Special Case:

    • Level 0 is managed by file count rather than size
    • Controls read amplification from recent writes
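
The target-size formula translates directly into a small helper; the 64 MB base used in the example comment is only an assumed value, not the engine's actual base size.

import "math"

// targetSizeBytes returns the target size for a level, growing geometrically
// with the configured ratio. Level 0 is excluded because it is managed by
// file count rather than size.
func targetSizeBytes(level int, baseBytes int64, ratio float64) int64 {
    return int64(float64(baseBytes) * math.Pow(ratio, float64(level)))
}

// Example with an assumed 64 MB base and ratio 10:
//   L1 ≈ 640 MB, L2 ≈ 6.4 GB, L3 ≈ 64 GB, ...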

Compaction Priority

Compaction tasks are prioritized based on several factors:

  1. Level-0 Buildup: Highest priority to prevent read amplification
  2. Size Imbalance: Levels exceeding target size
  3. Tombstone Presence: Files with deletions that can be cleaned up
  4. File Age: Older files get priority for compaction
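
One way to combine these factors is a single numeric score per level, as sketched below; the weights and inputs are illustrative, not the strategy's actual heuristics.

// compactionScore ranks a level for compaction: L0 is scored by file count
// (weighted highest), other levels by how far they exceed their size target,
// with a small boost for levels whose files are known to contain tombstones.
// (File age could be folded in as a further tie-breaker.)
func compactionScore(level, fileCount, maxL0Files int, totalBytes, targetBytes int64, hasTombstones bool) float64 {
    var score float64
    if level == 0 {
        score = 10 * float64(fileCount) / float64(maxL0Files)
    } else {
        score = float64(totalBytes) / float64(targetBytes)
    }
    if hasTombstones {
        score += 0.5
    }
    return score
}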

Seek-Based Compaction

For future enhancement, seek-based compaction could be implemented:

  1. Tracking Hot Files:

    • Monitor which files receive the most seek operations
    • Prioritize these files for compaction
  2. Adaptive Strategy:

    • Adjust compaction based on observed workload patterns
    • Optimize frequently accessed key ranges
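
Because this is only a possible future direction, the sketch below is purely speculative: a per-file atomic seek counter that such a strategy could consult when ranking candidates.

import "sync/atomic"

// seekStats is a speculative per-file counter for a future seek-based strategy.
type seekStats struct {
    seeks atomic.Int64
}

// RecordSeek would be called whenever a read has to consult this file.
func (s *seekStats) RecordSeek() { s.seeks.Add(1) }

// Hot reports whether the file has crossed a seek threshold and should be
// prioritized for compaction.
func (s *seekStats) Hot(threshold int64) bool { return s.seeks.Load() >= threshold }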