jer/kevo

feat: Initial release of kevo storage engine.

Adds a complete LSM-based storage engine with these features:
- Single-writer based architecture for the storage engine
- WAL for durability, and hey it's configurable
- MemTable with skip list implementation for fast read/writes
- SSTable with block-based structure for on-disk level-based storage
- Background compaction with tiered strategy
- ACID transactions
- Good documentation (I hope)

2025-04-20 14:06:50 -06:00

MemTable Package Documentation

The memtable package implements an in-memory data structure for the Kevo engine. MemTables are a key component of the LSM tree architecture, providing fast, sorted, in-memory storage for recently written data before it's flushed to disk as SSTables.

Overview

MemTables serve as the primary write buffer for the storage engine, allowing efficient processing of write operations before they are persisted to disk. The implementation uses a skiplist data structure to provide fast insertions, retrievals, and ordered iteration.

Key responsibilities of the MemTable include:

Providing fast in-memory writes
Supporting efficient key lookups
Offering ordered iteration for range scans
Tracking tombstones for deleted keys
Supporting atomic transitions between mutable and immutable states

Architecture

Core Components

The MemTable package consists of several interrelated components:

SkipList: The core data structure, providing expected O(log n) insert and lookup.
MemTable: A wrapper around SkipList with additional functionality.
MemTablePool: A manager for active and immutable MemTables.
Recovery: Mechanisms for rebuilding MemTables from WAL entries.

┌─────────────────┐
│  MemTablePool   │
└───────┬─────────┘
        │
┌───────┴─────────┐      ┌─────────────────┐
│ Active MemTable │      │   Immutable     │
└───────┬─────────┘      │   MemTables     │
        │                └─────────────────┘
┌───────┴─────────┐
│    SkipList     │
└─────────────────┘

Implementation Details

SkipList Data Structure

The SkipList is a probabilistic data structure that supports expected O(log n) search and insertion by maintaining multiple layers of linked lists:

Nodes: Each node contains:
- Entry data (key, value, sequence number, value type)
- Height information
- Next pointers at each level
Probabilistic Height: New nodes get a random height; each additional level is kept with probability 1/4, so:
- Height of at least 1: 100% of nodes
- Height of at least 2: 25% of nodes
- Height of at least 3: 6.25% of nodes, etc.
Search Algorithm:
- Starts at the highest level of the head node
- Moves forward while the next node's key is less than the target
- Drops down a level and continues
- This gives O(log n) expected time for operations
Concurrency Considerations:
- Uses atomic operations for pointer manipulation
- Cache-aligned node structure
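
The height distribution and search procedure described above can be sketched as follows. This is a minimal, single-threaded illustration and not the Kevo implementation: the type and function names (`skipList`, `randomHeight`, `findGreaterOrEqual`) are hypothetical, sequence numbers are left out, and the atomic pointer handling and cache alignment mentioned above are omitted for brevity. The maximum height of 12 and the branching factor of 4 are the defaults quoted elsewhere in this document.

```go
package main

import (
	"bytes"
	"math/rand"
)

const (
	maxHeight       = 12 // cap on node height (default quoted in this document)
	branchingFactor = 4  // each extra level is kept with probability 1/4
)

// node is a simplified skip list node: one key/value pair plus one
// forward pointer per level.
type node struct {
	key, value []byte
	next       []*node
}

// skipList is an illustrative, non-concurrent skip list.
type skipList struct {
	head   *node
	height int
	rnd    *rand.Rand
}

func newSkipList(seed int64) *skipList {
	return &skipList{
		head:   &node{next: make([]*node, maxHeight)},
		height: 1,
		rnd:    rand.New(rand.NewSource(seed)),
	}
}

// randomHeight returns h >= 1 such that P(h >= k) = (1/4)^(k-1),
// capped at maxHeight.
func (s *skipList) randomHeight() int {
	h := 1
	for h < maxHeight && s.rnd.Intn(branchingFactor) == 0 {
		h++
	}
	return h
}

// findGreaterOrEqual starts at the top level, moves forward while the next
// key is smaller than the target, then drops one level. If prev is non-nil
// it receives the last node visited on each level in use.
func (s *skipList) findGreaterOrEqual(key []byte, prev []*node) *node {
	x := s.head
	for level := s.height - 1; level >= 0; level-- {
		for x.next[level] != nil && bytes.Compare(x.next[level].key, key) < 0 {
			x = x.next[level]
		}
		if prev != nil {
			prev[level] = x
		}
	}
	return x.next[0]
}

// insert adds a key, or overwrites the value if the key already exists.
func (s *skipList) insert(key, value []byte) {
	prev := make([]*node, maxHeight)
	if n := s.findGreaterOrEqual(key, prev); n != nil && bytes.Equal(n.key, key) {
		n.value = value
		return
	}
	h := s.randomHeight()
	for level := s.height; level < h; level++ {
		prev[level] = s.head // levels not used so far start at the head
	}
	if h > s.height {
		s.height = h
	}
	n := &node{key: key, value: value, next: make([]*node, h)}
	for level := 0; level < h; level++ {
		n.next[level] = prev[level].next[level]
		prev[level].next[level] = n
	}
}

// get returns the value stored for key, if any.
func (s *skipList) get(key []byte) ([]byte, bool) {
	n := s.findGreaterOrEqual(key, nil)
	if n != nil && bytes.Equal(n.key, key) {
		return n.value, true
	}
	return nil, false
}
```

Walking `head.next[0]` visits every key in sorted order, which is what makes ordered iteration cheap once a position has been found.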

Memory Management

The MemTable implementation includes careful memory management:

Size Tracking:
- Each entry's size is estimated (key length + value length + overhead)
- Running total maintained using atomic operations
Resource Limits:
- Configurable maximum size (default 32MB)
- Age-based limits (configurable maximum age)
- When limits are reached, the MemTable becomes immutable
Memory Overhead:
- Skip list nodes add overhead (pointers at each level)
- Overhead is controlled by limiting maximum height (12 by default)
- Branching factor of 4 provides a good balance between height and width
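
As a rough sketch of the size and age checks described above: the `sizeTracker` type, its method names, and the 32-byte `entryOverhead` constant are all assumptions made for illustration; the actual overhead estimate and field layout in Kevo may differ.

```go
package main

import (
	"sync/atomic"
	"time"
)

// entryOverhead is a hypothetical per-entry constant standing in for the
// sequence number, value type, and node pointers.
const entryOverhead = 32

// sizeTracker keeps a running size estimate and decides when a MemTable
// should stop accepting writes.
type sizeTracker struct {
	size    atomic.Int64
	created time.Time
	maxSize int64
	maxAge  time.Duration
}

func newSizeTracker(maxSize int64, maxAge time.Duration) *sizeTracker {
	return &sizeTracker{created: time.Now(), maxSize: maxSize, maxAge: maxAge}
}

// add records one entry (key length + value length + overhead) and
// returns the new running total.
func (s *sizeTracker) add(key, value []byte) int64 {
	return s.size.Add(int64(len(key) + len(value) + entryOverhead))
}

// shouldFreeze reports whether the size limit or the age limit is reached.
func (s *sizeTracker) shouldFreeze() bool {
	return s.size.Load() >= s.maxSize || time.Since(s.created) >= s.maxAge
}
```

Because the total is an estimate, a table can overshoot the configured limit by up to one entry before the check fires.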

Entry Types and Tombstones

The MemTable supports two types of entries:

Value Entries (TypeValue):
- Normal key-value pairs
- Stored with their sequence number
Deletion Tombstones (TypeDeletion):
- Markers indicating a key has been deleted
- Value is nil, but the key and sequence number are preserved
- Essential for proper deletion semantics in the LSM tree architecture
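
To illustrate why tombstones are kept rather than simply removing the key, here is a small sketch. The names `TypeValue` and `TypeDeletion` come from the text above, but their numeric values, the `entry` layout, and the `lookup` helper are assumptions for this example only.

```go
package main

// ValueType distinguishes normal values from deletion markers.
// The numeric values are assumptions, not taken from the Kevo source.
type ValueType uint8

const (
	TypeValue ValueType = iota + 1
	TypeDeletion
)

// entry is a simplified MemTable entry.
type entry struct {
	key   string
	value []byte
	seq   uint64
	kind  ValueType
}

// lookupResult separates "never written here" from "explicitly deleted".
type lookupResult int

const (
	notFound lookupResult = iota
	foundValue
	foundTombstone
)

// lookup scans all versions of key and reports the state of the newest one.
func lookup(entries []entry, key string) ([]byte, lookupResult) {
	var newest *entry
	for i := range entries {
		e := &entries[i]
		if e.key == key && (newest == nil || e.seq > newest.seq) {
			newest = e
		}
	}
	switch {
	case newest == nil:
		return nil, notFound
	case newest.kind == TypeDeletion:
		return nil, foundTombstone
	default:
		return newest.value, foundValue
	}
}
```

The distinction matters because a `notFound` result means older data on disk may still hold the key, while `foundTombstone` means the search can stop: the key is deleted regardless of what older SSTables contain.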

MemTablePool

The MemTablePool manages multiple MemTables:

Active MemTable:
- Single mutable MemTable for current writes
- Becomes immutable when size/age thresholds are reached
Immutable MemTables:
- Former active MemTables waiting to be flushed to disk
- Read-only, no modifications allowed
- Still available for reads while awaiting flush
Lifecycle Management:
- Monitors size and age of active MemTable
- Triggers transitions from active to immutable
- Creates new active MemTable when needed
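
The lifecycle above might look roughly like this. It is a deliberately simplified sketch: the `pool` and `table` types are hypothetical, a map stands in for the skip list, age limits and locking are left out, and the size accounting counts overwrites twice.

```go
package main

// table is a stand-in for a MemTable; a map replaces the skip list.
type table struct {
	data map[string][]byte
	size int
}

func newTable() *table { return &table{data: make(map[string][]byte)} }

// pool holds one active table plus the immutable tables awaiting flush.
type pool struct {
	active     *table
	immutables []*table // oldest first
	maxSize    int
}

func newPool(maxSize int) *pool {
	return &pool{active: newTable(), maxSize: maxSize}
}

// put writes to the active table and rotates it once the size threshold
// is reached.
func (p *pool) put(key string, value []byte) {
	p.active.data[key] = value
	p.active.size += len(key) + len(value)
	if p.active.size >= p.maxSize {
		p.switchToNew()
	}
}

// switchToNew moves the active table to the immutable list and installs
// a fresh active table. The old table is returned for flushing.
func (p *pool) switchToNew() *table {
	old := p.active
	p.immutables = append(p.immutables, old)
	p.active = newTable()
	return old
}

// get checks the active table first, then immutable tables newest to oldest.
func (p *pool) get(key string) ([]byte, bool) {
	if v, ok := p.active.data[key]; ok {
		return v, true
	}
	for i := len(p.immutables) - 1; i >= 0; i-- {
		if v, ok := p.immutables[i].data[key]; ok {
			return v, true
		}
	}
	return nil, false
}

// markFlushed drops the oldest immutable table once it is safely on disk.
func (p *pool) markFlushed() {
	if len(p.immutables) > 0 {
		p.immutables = p.immutables[1:]
	}
}
```

Reading newest to oldest is what keeps lookups correct while a flush is pending: a newer write in the active table always wins over an older copy in an immutable one.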

Iterator Functionality

MemTables provide iterator interfaces for sequential access:

Forward Iteration:
- SeekToFirst(): Position at the first entry
- Seek(key): Position at or after the given key
- Next(): Move to the next entry
- Valid(): Check if the current position is valid
Entry Access:
- Key(): Get the current entry's key
- Value(): Get the current entry's value
- IsTombstone(): Check if the current entry is a deletion marker
Iterator Adapters:
- Adapters to the common iterator interface for the engine
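
One way to picture the iterator contract is as a Go interface with the methods listed above. The interface below is an assumption pieced together from those method names (the real signatures, for instance whether `Next` returns a value, may differ), and `sliceIterator` is a hypothetical slice-backed implementation used only to show the positioning semantics.

```go
package main

import (
	"bytes"
	"sort"
)

// Iterator mirrors the methods listed above; exact signatures are assumed.
type Iterator interface {
	SeekToFirst()
	Seek(key []byte)
	Next()
	Valid() bool
	Key() []byte
	Value() []byte
	IsTombstone() bool
}

type kv struct {
	key, value []byte
	tombstone  bool
}

// sliceIterator iterates over entries that are already sorted by key.
type sliceIterator struct {
	entries []kv
	pos     int
}

// newSliceIterator returns an iterator that is invalid until positioned.
func newSliceIterator(entries []kv) *sliceIterator {
	return &sliceIterator{entries: entries, pos: len(entries)}
}

func (it *sliceIterator) SeekToFirst() { it.pos = 0 }

// Seek positions at the first entry whose key is >= the given key.
func (it *sliceIterator) Seek(key []byte) {
	it.pos = sort.Search(len(it.entries), func(i int) bool {
		return bytes.Compare(it.entries[i].key, key) >= 0
	})
}

func (it *sliceIterator) Next() {
	if it.Valid() {
		it.pos++
	}
}

func (it *sliceIterator) Valid() bool       { return it.pos < len(it.entries) }
func (it *sliceIterator) Key() []byte       { return it.entries[it.pos].key }
func (it *sliceIterator) Value() []byte     { return it.entries[it.pos].value }
func (it *sliceIterator) IsTombstone() bool { return it.entries[it.pos].tombstone }

// Compile-time check that sliceIterator satisfies Iterator.
var _ Iterator = (*sliceIterator)(nil)

// liveKeys collects the keys of all entries that are not tombstones.
func liveKeys(it Iterator) []string {
	var keys []string
	for it.SeekToFirst(); it.Valid(); it.Next() {
		if !it.IsTombstone() {
			keys = append(keys, string(it.Key()))
		}
	}
	return keys
}
```

Code such as `liveKeys` depends only on the interface, which is the point of an adapter: the same loop can run over a MemTable or any other source that implements it.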

Concurrency and Isolation

MemTables employ a concurrency model suited for the storage engine's architecture:

Read Concurrency:
- Multiple readers can access MemTables concurrently
- Read locks are used for concurrent Get operations
Write Isolation:
- The single-writer architecture ensures only one writer at a time
- Writes to the active MemTable use write locks
Immutable State:
- Once a MemTable becomes immutable, no further modifications occur
- This provides a simple isolation model
Atomic Transitions:
- The transition from mutable to immutable is atomic
- Uses atomic boolean for immutable state flag
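
A minimal sketch of this locking scheme, assuming a read-write mutex around the data and an atomic flag for the immutable state. The `guardedTable` type and `errImmutable` error are hypothetical names; how the real MemTable reports a write to an immutable table is not specified here.

```go
package main

import (
	"errors"
	"sync"
	"sync/atomic"
)

// errImmutable is a hypothetical error for writes after the transition.
var errImmutable = errors.New("memtable is immutable")

// guardedTable combines a read-write lock with an atomic immutable flag.
type guardedTable struct {
	mu        sync.RWMutex
	data      map[string][]byte
	immutable atomic.Bool
}

func newGuardedTable() *guardedTable {
	return &guardedTable{data: make(map[string][]byte)}
}

// put takes the write lock and rejects writes once the table is immutable.
func (g *guardedTable) put(key string, value []byte) error {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.immutable.Load() {
		return errImmutable
	}
	g.data[key] = value
	return nil
}

// get takes only the read lock, so many readers can proceed in parallel.
func (g *guardedTable) get(key string) ([]byte, bool) {
	g.mu.RLock()
	defer g.mu.RUnlock()
	v, ok := g.data[key]
	return v, ok
}

// setImmutable flips the flag under the write lock, so no write can be
// in flight when the transition happens.
func (g *guardedTable) setImmutable() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.immutable.Store(true)
}

// isImmutable can be checked without taking any lock.
func (g *guardedTable) isImmutable() bool { return g.immutable.Load() }
```

Taking the write lock in `setImmutable` is a design choice of this sketch: it guarantees that every `put` either completes before the transition or observes the flag and fails.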

Recovery Process

The recovery functionality rebuilds MemTables from WAL data:

WAL Entries:
- Each WAL entry contains an operation type, key, value and sequence number
- Entries are processed in order to rebuild the MemTable state
Sequence Number Handling:
- Maximum sequence number is tracked during recovery
- Ensures future operations have larger sequence numbers
Batch Operations:
- Support for atomic batch operations from WAL
- Batch entries contain multiple operations with sequential sequence numbers
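
The replay loop can be sketched like this. The record layout is an assumption: here a `walRecord` carries the sequence number of its first operation and a batch is simply a record with several operations, each taking the next sequential number. The actual WAL format in Kevo may be organized differently.

```go
package main

// walOp is one logical operation inside a WAL record (hypothetical layout).
type walOp struct {
	deletion bool
	key      string
	value    []byte
}

// walRecord holds one operation, or several for a batch. seq is the
// sequence number of the first operation; later ones use seq+1, seq+2, ...
type walRecord struct {
	seq uint64
	ops []walOp
}

// recoveredEntry is the rebuilt state of one key.
type recoveredEntry struct {
	value   []byte
	seq     uint64
	deleted bool
}

// replay applies records in log order and returns the rebuilt state
// together with the largest sequence number seen.
func replay(records []walRecord) (map[string]recoveredEntry, uint64) {
	state := make(map[string]recoveredEntry)
	var maxSeq uint64
	for _, rec := range records {
		for i, op := range rec.ops {
			seq := rec.seq + uint64(i)
			e := recoveredEntry{seq: seq, deleted: op.deletion}
			if !op.deletion {
				e.value = op.value
			}
			state[op.key] = e
			if seq > maxSeq {
				maxSeq = seq
			}
		}
	}
	return state, maxSeq
}
```

After recovery, new writes would start at `maxSeq + 1`, so that they always sort as newer than anything replayed from the log.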

Performance Considerations

Time Complexity

The SkipList data structure offers favorable complexity for MemTable operations:

Operation    Average Case     Worst Case
Insert       O(log n)         O(n)
Lookup       O(log n)         O(n)
Delete       O(log n)         O(n)
Iteration    O(1) per step    O(1) per step
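
As a back-of-the-envelope check on these figures, the following estimate assumes each extra level is kept with probability 1/4 (the branching factor of 4 quoted above) and ignores the height cap:

```latex
\begin{align*}
\Pr[\text{height} \ge k] &= \left(\tfrac{1}{4}\right)^{k-1} \\
\mathbb{E}[\text{height}] &= \sum_{k \ge 1} \left(\tfrac{1}{4}\right)^{k-1}
  = \frac{1}{1 - \tfrac{1}{4}} = \frac{4}{3} \\
\text{levels in use for } n \text{ entries} &\approx \log_4 n
\end{align*}
```

So a node carries about 1.33 forward pointers on average. The cap of 12 levels only starts to bind around n = 4^12 = 16,777,216 entries. A 32 MB table would need entries averaging about 2 bytes to hold that many, so the cap should not be a limit at the default size.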

Memory Usage Optimization

Several optimizations are employed to improve memory efficiency:

Shared Memory Allocations:
- Node arrays allocated in contiguous blocks
- Reduces allocation overhead
Cache Awareness:
- Nodes aligned to cache lines (64 bytes)
- Improves CPU cache utilization
Appropriate Sizing:
- Default sizing (32MB) provides good balance
- Configurable based on workload needs

Write Amplification

MemTables help reduce write amplification in the LSM architecture:

Buffering Writes:
- Multiple key updates are consolidated in memory
- Only the latest value gets written to disk
Batching:
- Many small writes batched into larger disk operations
- Improves overall I/O efficiency

Common Usage Patterns

Basic Usage

// Create a new MemTable
memTable := memtable.NewMemTable()

// Add entries with incrementing sequence numbers
memTable.Put([]byte("key1"), []byte("value1"), 1)
memTable.Put([]byte("key2"), []byte("value2"), 2)
memTable.Delete([]byte("key3"), 3)

// Retrieve a value
value, found := memTable.Get([]byte("key1"))
if found {
    fmt.Printf("Value: %s\n", value)
}

// Check if the MemTable is too large
if memTable.ApproximateSize() > 32*1024*1024 {
    memTable.SetImmutable()
    // Write to disk...
}

Using MemTablePool

// Create a pool with configuration
// (named cfg so the variable does not shadow the config package)
cfg := config.NewDefaultConfig("/path/to/data")
pool := memtable.NewMemTablePool(cfg)

// Add entries
pool.Put([]byte("key1"), []byte("value1"), 1)
pool.Delete([]byte("key2"), 2)

// Check if flushing is needed
if pool.IsFlushNeeded() {
    // Switch to a new active MemTable and get the old one for flushing
    immutable := pool.SwitchToNewMemTable()

    // Flush the immutable table to disk as an SSTable
    // ...
    _ = immutable // placeholder so the snippet compiles until the flush code is added
}

Iterating Over Entries

// Create an iterator
iter := memTable.NewIterator()

// Iterate through all entries
for iter.SeekToFirst(); iter.Valid(); iter.Next() {
    fmt.Printf("%s: ", iter.Key())
    
    if iter.IsTombstone() {
        fmt.Println("<deleted>")
    } else {
        fmt.Printf("%s\n", iter.Value())
    }
}

// Or seek to a specific point
iter.Seek([]byte("key5"))
if iter.Valid() {
    fmt.Printf("Found: %s\n", iter.Key())
}

Configuration Options

The MemTable behavior can be tuned through several configuration parameters:

MemTableSize (default: 32MB):
- Maximum size before triggering a flush
- Larger sizes improve write throughput but increase memory usage
MaxMemTables (default: 4):
- Maximum number of MemTables in memory (active + immutable)
- Higher values allow more in-flight flushes
MaxMemTableAge (default: 600 seconds):
- Maximum age before forcing a flush
- Ensures data isn't held in memory too long
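
For orientation, the three parameters and their defaults could be grouped as shown below. The field names and default values come from the list above; the `memTableConfig` struct, the field types, and the validation rules are assumptions for illustration and do not describe the engine's actual configuration type.

```go
package main

import "errors"

// memTableConfig groups the MemTable-related settings (hypothetical type).
type memTableConfig struct {
	MemTableSize   int64 // bytes before a flush is triggered
	MaxMemTables   int   // active + immutable tables kept in memory
	MaxMemTableAge int64 // seconds before a flush is forced
}

// defaultMemTableConfig returns the defaults quoted in this document.
func defaultMemTableConfig() memTableConfig {
	return memTableConfig{
		MemTableSize:   32 * 1024 * 1024,
		MaxMemTables:   4,
		MaxMemTableAge: 600,
	}
}

// validate rejects settings that cannot work; the rules are illustrative.
func (c memTableConfig) validate() error {
	switch {
	case c.MemTableSize <= 0:
		return errors.New("MemTableSize must be positive")
	case c.MaxMemTables <= 0:
		return errors.New("MaxMemTables must be positive")
	case c.MaxMemTableAge <= 0:
		return errors.New("MaxMemTableAge must be positive")
	}
	return nil
}

// worstCaseMemory is a rough upper bound on MemTable memory: every
// table slot filled to the size limit.
func (c memTableConfig) worstCaseMemory() int64 {
	return c.MemTableSize * int64(c.MaxMemTables)
}
```

With the defaults, the bound works out to 4 × 32 MB = 128 MB, which is a useful number to keep in mind when raising either setting.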

Trade-offs and Limitations

Write Bursts and Flush Stalls

High write bursts can lead to multiple MemTables becoming immutable before the background flush process completes. The system handles this by:

Maintaining multiple immutable MemTables in memory
Tracking the number of immutable MemTables
Potentially slowing down writes if too many immutable MemTables accumulate
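
One common way to get this kind of backpressure in Go is a bounded channel between the writer and the flusher. The sketch below uses that technique for illustration and is not necessarily how Kevo does it: `flushQueue` and `frozenTable` are hypothetical names, and one slot of the MemTable budget is assumed to be reserved for the active table.

```go
package main

// frozenTable stands in for an immutable MemTable awaiting flush.
type frozenTable struct{ id int }

// flushQueue is a bounded queue of immutable tables.
type flushQueue struct {
	pending chan frozenTable
}

// newFlushQueue sizes the queue so that active + immutable tables never
// exceed maxMemTables.
func newFlushQueue(maxMemTables int) *flushQueue {
	capacity := maxMemTables - 1 // one slot is the active table
	if capacity < 1 {
		capacity = 1
	}
	return &flushQueue{pending: make(chan frozenTable, capacity)}
}

// enqueue blocks while the queue is full, which stalls the writer until
// the flusher catches up.
func (q *flushQueue) enqueue(t frozenTable) { q.pending <- t }

// tryEnqueue is the non-blocking variant; it reports whether the table
// was accepted.
func (q *flushQueue) tryEnqueue(t frozenTable) bool {
	select {
	case q.pending <- t:
		return true
	default:
		return false
	}
}

// next hands the oldest pending table to the flusher, if there is one.
func (q *flushQueue) next() (frozenTable, bool) {
	select {
	case t := <-q.pending:
		return t, true
	default:
		return frozenTable{}, false
	}
}

// backlog returns the number of immutable tables waiting to be flushed.
func (q *flushQueue) backlog() int { return len(q.pending) }
```

A writer that calls `enqueue` simply waits when the backlog is full, while `tryEnqueue` lets the caller choose a softer response such as delaying or rejecting the write.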

Memory Usage vs. Performance

The MemTable configuration involves balancing memory usage against performance:

Larger MemTables:
- Pro: Better write performance, fewer disk flushes
- Con: Higher memory usage, potentially longer recovery time
Smaller MemTables:
- Pro: Lower memory usage, faster recovery
- Con: More frequent flushes, potentially lower write throughput

Ordering and Consistency

The MemTable maintains ordering via:

Key Comparison: Primary ordering by key
Sequence Numbers: Secondary ordering to handle updates to the same key
Value Types: Distinguishing between values and deletion markers

This ensures consistent state even with concurrent reads while a background flush is occurring.
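
A comparator along these lines would give the ordering described above. The direction of the secondary ordering is an assumption: this sketch sorts higher sequence numbers first, so the newest version of a key is the first one a reader meets, as is common in LSM engines. `versionedEntry`, `compareEntries`, and `sortEntries` are hypothetical names.

```go
package main

import (
	"bytes"
	"sort"
)

// versionedEntry is a simplified entry carrying only what ordering needs.
type versionedEntry struct {
	key       []byte
	seq       uint64
	tombstone bool
}

// compareEntries returns a negative number if a sorts before b.
// Keys ascend; for equal keys, the higher sequence number comes first.
func compareEntries(a, b versionedEntry) int {
	if c := bytes.Compare(a.key, b.key); c != 0 {
		return c
	}
	switch {
	case a.seq > b.seq:
		return -1
	case a.seq < b.seq:
		return 1
	default:
		return 0
	}
}

// sortEntries orders entries in place using compareEntries.
func sortEntries(entries []versionedEntry) {
	sort.Slice(entries, func(i, j int) bool {
		return compareEntries(entries[i], entries[j]) < 0
	})
}
```

With this ordering, a reader that stops at the first entry for a key always sees its latest state, and the `tombstone` flag on that entry tells it whether the key is live or deleted.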
