# MemTable Package Documentation
The `memtable` package implements an in-memory data structure for the Kevo engine. MemTables are a key component of the LSM tree architecture, providing fast, sorted, in-memory storage for recently written data before it's flushed to disk as SSTables.
## Overview
MemTables serve as the primary write buffer for the storage engine, allowing efficient processing of write operations before they are persisted to disk. The implementation uses a skiplist data structure to provide fast insertions, retrievals, and ordered iteration.
Key responsibilities of the MemTable include:
- Providing fast in-memory writes
- Supporting efficient key lookups
- Offering ordered iteration for range scans
- Tracking tombstones for deleted keys
- Supporting atomic transitions between mutable and immutable states
## Architecture
### Core Components
The MemTable package consists of several interrelated components:
1. **SkipList**: The core data structure providing O(log n) operations.
2. **MemTable**: A wrapper around SkipList with additional functionality.
3. **MemTablePool**: A manager for active and immutable MemTables.
4. **Recovery**: Mechanisms for rebuilding MemTables from WAL entries.
```
      ┌─────────────────┐
      │  MemTablePool   │
      └───────┬─────────┘
┌───────┴─────────┐   ┌─────────────────┐
│ Active MemTable │   │    Immutable    │
└───────┬─────────┘   │    MemTables    │
        │             └─────────────────┘
┌───────┴─────────┐
│    SkipList     │
└─────────────────┘
```
## Implementation Details
### SkipList Data Structure
The SkipList is a probabilistic data structure that allows fast operations by maintaining multiple layers of linked lists:
1. **Nodes**: Each node contains:
- Entry data (key, value, sequence number, value type)
- Height information
- Next pointers at each level
2. **Probabilistic Height**: New nodes get a random height following a probabilistic distribution:
- Height 1: 100% of nodes
- Height 2: 25% of nodes
- Height 3: 6.25% of nodes, etc.
3. **Search Algorithm**:
- Starts at the highest level of the head node
- Moves forward until finding a node greater than the target
- Drops down a level and continues
- This gives O(log n) expected time for operations
4. **Concurrency Considerations**:
- Uses atomic operations for pointer manipulation
- Cache-aligned node structure
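The height distribution and search loop described above can be sketched as follows. This is a simplified, single-threaded illustration (no atomic pointers or cache alignment), and all names are hypothetical rather than kevo's actual API:

```go
package main

import (
	"bytes"
	"fmt"
	"math/rand"
)

const (
	maxHeight       = 12 // cap on node height, bounding pointer overhead
	branchingFactor = 4  // each extra level kept with probability 1/4
)

// node is a simplified skip-list node; the real implementation also
// carries sequence numbers and value types.
type node struct {
	key   []byte
	value []byte
	next  [maxHeight]*node // forward pointers, one per level
}

type skipList struct {
	head   *node
	height int
}

func newSkipList() *skipList {
	return &skipList{head: &node{}, height: 1}
}

// randomHeight yields height >= 2 with probability 1/4, >= 3 with
// probability 1/16, etc., matching the distribution described above.
func randomHeight() int {
	h := 1
	for h < maxHeight && rand.Intn(branchingFactor) == 0 {
		h++
	}
	return h
}

// findGreaterOrEqual starts at the highest level, moves forward while the
// next key is less than the target, then drops down a level -- the
// O(log n) search described above. If prev is non-nil it records the
// rightmost node visited at each level, for use during insertion.
func (s *skipList) findGreaterOrEqual(key []byte, prev *[maxHeight]*node) *node {
	x := s.head
	for level := s.height - 1; level >= 0; level-- {
		for x.next[level] != nil && bytes.Compare(x.next[level].key, key) < 0 {
			x = x.next[level]
		}
		if prev != nil {
			prev[level] = x
		}
	}
	return x.next[0]
}

func (s *skipList) put(key, value []byte) {
	var prev [maxHeight]*node
	s.findGreaterOrEqual(key, &prev)
	h := randomHeight()
	if h > s.height {
		for level := s.height; level < h; level++ {
			prev[level] = s.head
		}
		s.height = h
	}
	n := &node{key: key, value: value}
	for level := 0; level < h; level++ {
		n.next[level] = prev[level].next[level]
		prev[level].next[level] = n
	}
}

func (s *skipList) get(key []byte) ([]byte, bool) {
	n := s.findGreaterOrEqual(key, nil)
	if n != nil && bytes.Equal(n.key, key) {
		return n.value, true
	}
	return nil, false
}

func main() {
	sl := newSkipList()
	sl.put([]byte("b"), []byte("2"))
	sl.put([]byte("a"), []byte("1"))
	if v, ok := sl.get([]byte("a")); ok {
		fmt.Printf("a=%s\n", v)
	}
}
```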
### Memory Management
The MemTable implementation includes careful memory management:
1. **Size Tracking**:
- Each entry's size is estimated (key length + value length + overhead)
- Running total maintained using atomic operations
2. **Resource Limits**:
- Configurable maximum size (default 32MB)
- Age-based limits (configurable maximum age)
- When limits are reached, the MemTable becomes immutable
3. **Memory Overhead**:
- Skip list nodes add overhead (pointers at each level)
- Overhead is controlled by limiting maximum height (12 by default)
- Branching factor of 4 provides a good balance between height and width
### Entry Types and Tombstones
The MemTable supports two types of entries:
1. **Value Entries** (`TypeValue`):
- Normal key-value pairs
- Stored with their sequence number
2. **Deletion Tombstones** (`TypeDeletion`):
- Markers indicating a key has been deleted
- Value is nil, but the key and sequence number are preserved
- Essential for proper deletion semantics in the LSM tree architecture
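A minimal sketch of the entry layout, using the `TypeValue`/`TypeDeletion` names from above (the struct and field names are otherwise hypothetical):

```go
package main

import "fmt"

// ValueType distinguishes live values from deletion tombstones.
type ValueType uint8

const (
	TypeValue ValueType = iota
	TypeDeletion
)

// entry holds the fields described above: key, value, sequence number,
// and value type. For a tombstone the value is nil but the key and
// sequence number are preserved.
type entry struct {
	key    []byte
	value  []byte
	seqNum uint64
	vtype  ValueType
}

// isTombstone reports whether the entry marks a deletion.
func (e entry) isTombstone() bool { return e.vtype == TypeDeletion }

func main() {
	put := entry{key: []byte("k"), value: []byte("v"), seqNum: 1, vtype: TypeValue}
	del := entry{key: []byte("k"), seqNum: 2, vtype: TypeDeletion}
	fmt.Println(put.isTombstone(), del.isTombstone()) // false true
}
```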
### MemTablePool
The MemTablePool manages multiple MemTables:
1. **Active MemTable**:
- Single mutable MemTable for current writes
- Becomes immutable when size/age thresholds are reached
2. **Immutable MemTables**:
- Former active MemTables waiting to be flushed to disk
- Read-only, no modifications allowed
- Still available for reads while awaiting flush
3. **Lifecycle Management**:
- Monitors size and age of active MemTable
- Triggers transitions from active to immutable
- Creates new active MemTable when needed
### Iterator Functionality
MemTables provide iterator interfaces for sequential access:
1. **Forward Iteration**:
- `SeekToFirst()`: Position at the first entry
- `Seek(key)`: Position at or after the given key
- `Next()`: Move to the next entry
- `Valid()`: Check if the current position is valid
2. **Entry Access**:
- `Key()`: Get the current entry's key
- `Value()`: Get the current entry's value
- `IsTombstone()`: Check if the current entry is a deletion marker
3. **Iterator Adapters**:
- Adapters to the common iterator interface for the engine
## Concurrency and Isolation
MemTables employ a concurrency model suited for the storage engine's architecture:
1. **Read Concurrency**:
- Multiple readers can access MemTables concurrently
- Read locks are used for concurrent Get operations
2. **Write Isolation**:
- The single-writer architecture ensures only one writer at a time
- Writes to the active MemTable use write locks
3. **Immutable State**:
- Once a MemTable becomes immutable, no further modifications occur
- This provides a simple isolation model
4. **Atomic Transitions**:
- The transition from mutable to immutable is atomic
- Uses atomic boolean for immutable state flag
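The atomic mutable-to-immutable transition can be sketched with `sync/atomic`; this is a stripped-down illustration, not the engine's actual type:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// memTable shows only the immutability flag; the real type also wraps
// the skip list and size tracking.
type memTable struct {
	immutable atomic.Bool
}

// setImmutable atomically marks the table read-only; concurrent readers
// observe the flip without taking a lock.
func (m *memTable) setImmutable() { m.immutable.Store(true) }

// put rejects writes once the transition has happened.
func (m *memTable) put(key, value []byte) error {
	if m.immutable.Load() {
		return fmt.Errorf("memtable is immutable")
	}
	// ... insert into the skip list ...
	return nil
}

func main() {
	var m memTable
	fmt.Println(m.put([]byte("k"), []byte("v"))) // <nil>
	m.setImmutable()
	fmt.Println(m.put([]byte("k"), []byte("v"))) // memtable is immutable
}
```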
## Recovery Process
The recovery functionality rebuilds MemTables from WAL data:
1. **WAL Entries**:
- Each WAL entry contains an operation type, key, value and sequence number
- Entries are processed in order to rebuild the MemTable state
2. **Sequence Number Handling**:
- Maximum sequence number is tracked during recovery
- Ensures future operations have larger sequence numbers
3. **Batch Operations**:
- Support for atomic batch operations from WAL
- Batch entries contain multiple operations with sequential sequence numbers
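The recovery loop described above can be sketched as follows, with hypothetical names (the real WAL entry format is richer):

```go
package main

import "fmt"

const (
	opPut uint8 = iota
	opDelete
)

// walEntry mirrors the fields described above: operation type, key,
// value, and sequence number.
type walEntry struct {
	op     uint8
	key    []byte
	value  []byte
	seqNum uint64
}

// replay applies WAL entries in order and returns the highest sequence
// number seen, so future operations can be numbered after it.
func replay(entries []walEntry, apply func(walEntry)) uint64 {
	var maxSeq uint64
	for _, e := range entries {
		apply(e)
		if e.seqNum > maxSeq {
			maxSeq = e.seqNum
		}
	}
	return maxSeq
}

func main() {
	state := map[string]string{}
	entries := []walEntry{
		{op: opPut, key: []byte("a"), value: []byte("1"), seqNum: 1},
		{op: opDelete, key: []byte("a"), seqNum: 2},
	}
	maxSeq := replay(entries, func(e walEntry) {
		if e.op == opDelete {
			delete(state, string(e.key)) // the real MemTable inserts a tombstone instead
		} else {
			state[string(e.key)] = string(e.value)
		}
	})
	fmt.Println(maxSeq, len(state)) // 2 0
}
```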
## Performance Considerations
### Time Complexity
The SkipList data structure offers favorable complexity for MemTable operations:
| Operation | Average Case | Worst Case |
|-----------|--------------|------------|
| Insert | O(log n) | O(n) |
| Lookup | O(log n) | O(n) |
| Delete | O(log n) | O(n) |
| Iteration | O(1) per step| O(1) per step |
### Memory Usage Optimization
Several optimizations are employed to improve memory efficiency:
1. **Shared Memory Allocations**:
- Node arrays allocated in contiguous blocks
- Reduces allocation overhead
2. **Cache Awareness**:
- Nodes aligned to cache lines (64 bytes)
- Improves CPU cache utilization
3. **Appropriate Sizing**:
- Default sizing (32MB) provides good balance
- Configurable based on workload needs
### Write Amplification
MemTables help reduce write amplification in the LSM architecture:
1. **Buffering Writes**:
- Multiple key updates are consolidated in memory
- Only the latest value gets written to disk
2. **Batching**:
- Many small writes batched into larger disk operations
- Improves overall I/O efficiency
## Common Usage Patterns
### Basic Usage
```go
// Create a new MemTable
memTable := memtable.NewMemTable()

// Add entries with incrementing sequence numbers
memTable.Put([]byte("key1"), []byte("value1"), 1)
memTable.Put([]byte("key2"), []byte("value2"), 2)
memTable.Delete([]byte("key3"), 3)

// Retrieve a value
value, found := memTable.Get([]byte("key1"))
if found {
	fmt.Printf("Value: %s\n", value)
}

// Check if the MemTable is too large
if memTable.ApproximateSize() > 32*1024*1024 {
	memTable.SetImmutable()
	// Write to disk...
}
```
### Using MemTablePool
```go
// Create a pool with configuration
config := config.NewDefaultConfig("/path/to/data")
pool := memtable.NewMemTablePool(config)

// Add entries
pool.Put([]byte("key1"), []byte("value1"), 1)
pool.Delete([]byte("key2"), 2)

// Check if flushing is needed
if pool.IsFlushNeeded() {
	// Switch to a new active MemTable and get the old one for flushing
	immutable := pool.SwitchToNewMemTable()

	// Flush the immutable table to disk as an SSTable
	// ...
}
```
### Iterating Over Entries
```go
// Create an iterator
iter := memTable.NewIterator()

// Iterate through all entries
for iter.SeekToFirst(); iter.Valid(); iter.Next() {
	fmt.Printf("%s: ", iter.Key())
	if iter.IsTombstone() {
		fmt.Println("<deleted>")
	} else {
		fmt.Printf("%s\n", iter.Value())
	}
}

// Or seek to a specific point
iter.Seek([]byte("key5"))
if iter.Valid() {
	fmt.Printf("Found: %s\n", iter.Key())
}
```
## Configuration Options
The MemTable behavior can be tuned through several configuration parameters:
1. **MemTableSize** (default: 32MB):
- Maximum size before triggering a flush
- Larger sizes improve write throughput but increase memory usage
2. **MaxMemTables** (default: 4):
- Maximum number of MemTables in memory (active + immutable)
- Higher values allow more in-flight flushes
3. **MaxMemTableAge** (default: 600 seconds):
- Maximum age before forcing a flush
- Ensures data isn't held in memory too long
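Assuming the configuration type exposes fields named after the parameters above (hypothetical field names; verify against the actual `config` package), tuning might look like:

```go
// Hypothetical tuning sketch; field names are assumed from the
// parameter list above and may not match the config package exactly.
cfg := config.NewDefaultConfig("/path/to/data")
cfg.MemTableSize = 64 * 1024 * 1024 // 64MB: favor write throughput over memory
cfg.MaxMemTables = 8                // allow more in-flight flushes
cfg.MaxMemTableAge = 300            // force a flush after 5 minutes at the latest
pool := memtable.NewMemTablePool(cfg)
```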
## Trade-offs and Limitations
### Write Bursts and Flush Stalls
High write bursts can lead to multiple MemTables becoming immutable before the background flush process completes. The system handles this by:
1. Maintaining multiple immutable MemTables in memory
2. Tracking the number of immutable MemTables
3. Potentially slowing down writes if too many immutable MemTables accumulate
### Memory Usage vs. Performance
The MemTable configuration involves balancing memory usage against performance:
1. **Larger MemTables**:
- Pro: Better write performance, fewer disk flushes
- Con: Higher memory usage, potentially longer recovery time
2. **Smaller MemTables**:
- Pro: Lower memory usage, faster recovery
- Con: More frequent flushes, potentially lower write throughput
### Ordering and Consistency
The MemTable maintains ordering via:
1. **Key Comparison**: Primary ordering by key
2. **Sequence Numbers**: Secondary ordering to handle updates to the same key
3. **Value Types**: Distinguishing between values and deletion markers
This ensures consistent state even with concurrent reads while a background flush is occurring.