# MemTable Package Documentation

The `memtable` package implements an in-memory data structure for the Kevo engine. MemTables are a key component of the LSM tree architecture, providing fast, sorted, in-memory storage for recently written data before it's flushed to disk as SSTables.

## Overview

MemTables serve as the primary write buffer for the storage engine, allowing efficient processing of write operations before they are persisted to disk. The implementation uses a skiplist data structure to provide fast insertions, retrievals, and ordered iteration.

Key responsibilities of the MemTable include:
- Providing fast in-memory writes
- Supporting efficient key lookups
- Offering ordered iteration for range scans
- Tracking tombstones for deleted keys
- Supporting atomic transitions between mutable and immutable states

## Architecture

### Core Components

The MemTable package consists of several interrelated components:

1. **SkipList**: The core data structure providing O(log n) operations.
2. **MemTable**: A wrapper around SkipList with additional functionality.
3. **MemTablePool**: A manager for active and immutable MemTables.
4. **Recovery**: Mechanisms for rebuilding MemTables from WAL entries.

```
┌─────────────────┐
│  MemTablePool   │
└───────┬─────────┘
        │
┌───────┴─────────┐    ┌─────────────────┐
│ Active MemTable │    │   Immutable     │
└───────┬─────────┘    │   MemTables     │
        │              └─────────────────┘
┌───────┴─────────┐
│    SkipList     │
└─────────────────┘
```

## Implementation Details

### SkipList Data Structure

The SkipList is a probabilistic data structure that allows fast operations by maintaining multiple layers of linked lists:

1. **Nodes**: Each node contains:
   - Entry data (key, value, sequence number, value type)
   - Height information
   - Next pointers at each level

2. **Probabilistic Height**: New nodes get a random height. With a branching factor of 4, each additional level is kept with probability 1/4:
   - Height 1 or more: 100% of nodes
   - Height 2 or more: 25% of nodes
   - Height 3 or more: 6.25% of nodes, etc.

3. **Search Algorithm**:
   - Starts at the highest level of the head node
   - Moves forward while the next node's key is less than the target
   - Drops down a level and continues
   - This gives O(log n) expected time for operations

4. **Concurrency Considerations**:
   - Uses atomic operations for pointer manipulation
   - Cache-aligned node structure

### Memory Management

The MemTable implementation includes careful memory management:

1. **Size Tracking**:
   - Each entry's size is estimated (key length + value length + overhead)
   - Running total maintained using atomic operations

2. **Resource Limits**:
   - Configurable maximum size (default 32MB)
   - Age-based limits (configurable maximum age)
   - When limits are reached, the MemTable becomes immutable

3. **Memory Overhead**:
   - Skip list nodes add overhead (pointers at each level)
   - Overhead is controlled by limiting maximum height (12 by default)
   - Branching factor of 4 provides good balance between height and width

### Entry Types and Tombstones

The MemTable supports two types of entries:

1. **Value Entries** (`TypeValue`):
   - Normal key-value pairs
   - Stored with their sequence number

2. **Deletion Tombstones** (`TypeDeletion`):
   - Markers indicating a key has been deleted
   - Value is nil, but the key and sequence number are preserved
   - Essential for proper deletion semantics in the LSM tree architecture

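To illustrate why tombstones matter, the sketch below distinguishes "deleted here" from "not present here". Only the names `TypeValue` and `TypeDeletion` come from this document; the numeric values, the `entry` struct, and the `get` helper are assumptions for the example.

```go
package main

import "fmt"

// ValueType tags an entry as a value or a tombstone. The numeric values
// are an assumption; only the names appear in the documentation.
type ValueType uint8

const (
	TypeDeletion ValueType = iota
	TypeValue
)

type entry struct {
	key   []byte
	value []byte
	seq   uint64
	typ   ValueType
}

// get finds the newest entry for key (highest sequence number) and reports
// whether the key was found and whether that newest entry is a tombstone.
func get(entries []entry, key string) (value []byte, deleted, found bool) {
	var best entry
	for _, e := range entries {
		if string(e.key) == key && (!found || e.seq > best.seq) {
			best, found = e, true
		}
	}
	if !found {
		return nil, false, false
	}
	if best.typ == TypeDeletion {
		return nil, true, true
	}
	return best.value, false, true
}

func main() {
	entries := []entry{
		{key: []byte("a"), value: []byte("1"), seq: 1, typ: TypeValue},
		{key: []byte("a"), seq: 2, typ: TypeDeletion},
		{key: []byte("b"), value: []byte("2"), seq: 3, typ: TypeValue},
	}
	for _, k := range []string{"a", "b", "c"} {
		v, deleted, found := get(entries, k)
		fmt.Printf("%s: value=%q deleted=%v found=%v\n", k, v, deleted, found)
	}
}
```

A lookup that hits a tombstone can stop immediately, whereas a key that is simply absent would still have to be looked up in older MemTables and SSTables.
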
### MemTablePool

The MemTablePool manages multiple MemTables:

1. **Active MemTable**:
   - Single mutable MemTable for current writes
   - Becomes immutable when size/age thresholds are reached

2. **Immutable MemTables**:
   - Former active MemTables waiting to be flushed to disk
   - Read-only, no modifications allowed
   - Still available for reads while awaiting flush

3. **Lifecycle Management**:
   - Monitors size and age of active MemTable
   - Triggers transitions from active to immutable
   - Creates new active MemTable when needed

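The lifecycle above can be sketched with a toy pool that rotates its active table once a size threshold is reached. The `pool` and `table` types here are hypothetical stand-ins, not the package's `MemTablePool`, and the age check is left out to keep the example short.

```go
package main

import "fmt"

// table is a stand-in for a MemTable: only its size and state matter here.
type table struct {
	id        int
	size      int
	immutable bool
}

// pool holds one active table and a queue of immutable tables awaiting flush.
type pool struct {
	active     *table
	immutables []*table
	maxSize    int
	nextID     int
}

func newPool(maxSize int) *pool {
	return &pool{active: &table{id: 0}, maxSize: maxSize, nextID: 1}
}

// put adds n bytes to the active table and rotates it once it is full.
func (p *pool) put(n int) {
	p.active.size += n
	if p.active.size >= p.maxSize {
		p.rotate()
	}
}

// rotate freezes the active table, queues it for flushing, and installs a
// fresh active table. It returns the table that was frozen.
func (p *pool) rotate() *table {
	old := p.active
	old.immutable = true
	p.immutables = append(p.immutables, old)
	p.active = &table{id: p.nextID}
	p.nextID++
	return old
}

func main() {
	p := newPool(100)
	for i := 0; i < 5; i++ {
		p.put(60)
	}
	fmt.Printf("active=%d immutables=%d\n", p.active.id, len(p.immutables)) // prints "active=2 immutables=2"
}
```

Every second 60-byte write crosses the 100-byte limit, so five writes leave two frozen tables queued and a third table accepting writes.
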
### Iterator Functionality

MemTables provide iterator interfaces for sequential access:

1. **Forward Iteration**:
   - `SeekToFirst()`: Position at the first entry
   - `Seek(key)`: Position at or after the given key
   - `Next()`: Move to the next entry
   - `Valid()`: Check if the current position is valid

2. **Entry Access**:
   - `Key()`: Get the current entry's key
   - `Value()`: Get the current entry's value
   - `IsTombstone()`: Check if the current entry is a deletion marker

3. **Iterator Adapters**:
   - Adapters to the common iterator interface for the engine

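To make the iterator contract concrete, here is a minimal sketch that implements the same method names over a sorted slice instead of a skip list. The `sliceIterator` and `item` types are illustrative assumptions; only the method names come from the list above.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

type item struct {
	key, value []byte
	tombstone  bool
}

// sliceIterator walks a slice that is already sorted by key.
type sliceIterator struct {
	items []item
	pos   int
}

// SeekToFirst positions the iterator at the first entry.
func (it *sliceIterator) SeekToFirst() { it.pos = 0 }

// Seek positions the iterator at the first entry whose key is >= target.
func (it *sliceIterator) Seek(target []byte) {
	it.pos = sort.Search(len(it.items), func(i int) bool {
		return bytes.Compare(it.items[i].key, target) >= 0
	})
}

func (it *sliceIterator) Next()             { it.pos++ }
func (it *sliceIterator) Valid() bool       { return it.pos >= 0 && it.pos < len(it.items) }
func (it *sliceIterator) Key() []byte       { return it.items[it.pos].key }
func (it *sliceIterator) Value() []byte     { return it.items[it.pos].value }
func (it *sliceIterator) IsTombstone() bool { return it.items[it.pos].tombstone }

func main() {
	it := &sliceIterator{items: []item{
		{key: []byte("key1"), value: []byte("v1")},
		{key: []byte("key3"), tombstone: true},
		{key: []byte("key7"), value: []byte("v7")},
	}}
	it.Seek([]byte("key5"))
	if it.Valid() {
		fmt.Printf("Seek(key5) landed on %s\n", it.Key()) // prints "Seek(key5) landed on key7"
	}
}
```

Note that `Seek` lands on the next larger key when the target is absent, and leaves the iterator invalid when no such key exists, so callers should always check `Valid()` before reading.
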
## Concurrency and Isolation

MemTables employ a concurrency model suited for the storage engine's architecture:

1. **Read Concurrency**:
   - Multiple readers can access MemTables concurrently
   - Read locks are used for concurrent Get operations

2. **Write Isolation**:
   - The single-writer architecture ensures only one writer at a time
   - Writes to the active MemTable use write locks

3. **Immutable State**:
   - Once a MemTable becomes immutable, no further modifications occur
   - This provides a simple isolation model

4. **Atomic Transitions**:
   - The transition from mutable to immutable is atomic
   - Uses atomic boolean for immutable state flag

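The locking model above might look roughly like the following, using a `sync.RWMutex` for reads and writes and an `atomic.Bool` for the immutable flag. The `guardedTable` type, its map-backed storage, and the `errImmutable` error are assumptions for the sketch, not the package's actual types.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

var errImmutable = errors.New("memtable is immutable")

// guardedTable pairs a read/write lock with an atomic immutable flag.
type guardedTable struct {
	mu        sync.RWMutex
	immutable atomic.Bool
	data      map[string]string // stand-in for the skip list
}

func newGuardedTable() *guardedTable {
	return &guardedTable{data: make(map[string]string)}
}

// Put takes the write lock and rejects writes once the flag is set.
func (t *guardedTable) Put(key, value string) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.immutable.Load() {
		return errImmutable
	}
	t.data[key] = value
	return nil
}

// Get takes the read lock, so any number of readers can run concurrently.
func (t *guardedTable) Get(key string) (string, bool) {
	t.mu.RLock()
	defer t.mu.RUnlock()
	v, ok := t.data[key]
	return v, ok
}

// SetImmutable flips the flag with a single atomic store.
func (t *guardedTable) SetImmutable() { t.immutable.Store(true) }

func main() {
	t := newGuardedTable()
	_ = t.Put("key1", "value1")

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			t.Get("key1") // concurrent reads share the read lock
		}()
	}
	wg.Wait()

	t.SetImmutable()
	fmt.Println(t.Put("key2", "value2")) // prints "memtable is immutable"
}
```

In a single-writer design the writer itself flips the flag, so no write can be in flight at the moment of the transition; with multiple writers the flag change would need to happen under the write lock as well.
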
## Recovery Process

The recovery functionality rebuilds MemTables from WAL data:

1. **WAL Entries**:
   - Each WAL entry contains an operation type, key, value, and sequence number
   - Entries are processed in order to rebuild the MemTable state

2. **Sequence Number Handling**:
   - Maximum sequence number is tracked during recovery
   - Ensures future operations have larger sequence numbers

3. **Batch Operations**:
   - Support for atomic batch operations from WAL
   - Batch entries contain multiple operations with sequential sequence numbers

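A replay loop along these lines captures the first two points. The `walEntry` layout, the `opPut`/`opDelete` constants, and the map-based `recovered` state are hypothetical simplifications; the real WAL format and recovery API are not shown in this document.

```go
package main

import "fmt"

type opType uint8

const (
	opPut opType = iota
	opDelete
)

// walEntry mirrors the fields listed above; the layout is an assumption.
type walEntry struct {
	op    opType
	key   string
	value string
	seq   uint64
}

// recovered is a simplified view of the rebuilt MemTable state.
type recovered struct {
	live       map[string]string
	tombstones map[string]bool
	maxSeq     uint64
}

// replay applies entries in log order and tracks the largest sequence number.
func replay(entries []walEntry) recovered {
	r := recovered{live: map[string]string{}, tombstones: map[string]bool{}}
	for _, e := range entries {
		switch e.op {
		case opPut:
			r.live[e.key] = e.value
			delete(r.tombstones, e.key)
		case opDelete:
			delete(r.live, e.key)
			r.tombstones[e.key] = true
		}
		if e.seq > r.maxSeq {
			r.maxSeq = e.seq
		}
	}
	return r
}

func main() {
	r := replay([]walEntry{
		{op: opPut, key: "a", value: "1", seq: 1},
		{op: opPut, key: "b", value: "2", seq: 2},
		{op: opDelete, key: "a", seq: 3},
	})
	// New writes after recovery must start above the recovered maximum.
	fmt.Printf("live=%v tombstones=%v nextSeq=%d\n", r.live, r.tombstones, r.maxSeq+1)
}
```

After replaying the three entries, `a` is a tombstone, `b` is live, and the next write would be assigned sequence number 4.
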
## Performance Considerations

### Time Complexity

The SkipList data structure offers favorable complexity for MemTable operations:

| Operation | Average Case  | Worst Case    |
|-----------|---------------|---------------|
| Insert    | O(log n)      | O(n)          |
| Lookup    | O(log n)      | O(n)          |
| Delete    | O(log n)      | O(n)          |
| Iteration | O(1) per step | O(1) per step |

### Memory Usage Optimization

Several optimizations are employed to improve memory efficiency:

1. **Shared Memory Allocations**:
   - Node arrays allocated in contiguous blocks
   - Reduces allocation overhead

2. **Cache Awareness**:
   - Nodes aligned to cache lines (64 bytes)
   - Improves CPU cache utilization

3. **Appropriate Sizing**:
   - The 32MB default balances flush frequency against memory footprint
   - Configurable based on workload needs

### Write Amplification

MemTables help reduce write amplification in the LSM architecture:

1. **Buffering Writes**:
   - Multiple key updates are consolidated in memory
   - Only the latest value gets written to disk

2. **Batching**:
   - Many small writes batched into larger disk operations
   - Improves overall I/O efficiency

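The effect of buffering can be shown with a toy counter: many updates to a few hot keys collapse into one entry per key at flush time. The `buffer` type below is a deliberately simplified stand-in that keeps only the latest value per key; it is not how the MemTable stores versions internally.

```go
package main

import "fmt"

// buffer consolidates updates per key, as a simplified model of a MemTable.
type buffer struct {
	latest map[string]string
	writes int
}

// put overwrites any earlier value for the key and counts the write.
func (b *buffer) put(key, value string) {
	b.latest[key] = value
	b.writes++
}

// flush returns how many entries would be written to disk.
func (b *buffer) flush() int { return len(b.latest) }

func main() {
	b := &buffer{latest: map[string]string{}}
	// 1000 updates spread over 10 hot keys.
	for i := 0; i < 1000; i++ {
		b.put(fmt.Sprintf("key%d", i%10), fmt.Sprintf("v%d", i))
	}
	fmt.Printf("writes=%d flushed=%d\n", b.writes, b.flush()) // prints "writes=1000 flushed=10"
}
```

The more skewed the workload toward a small set of keys, the larger the gap between writes accepted and entries flushed.
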
## Common Usage Patterns

### Basic Usage

```go
// Create a new MemTable
memTable := memtable.NewMemTable()

// Add entries with incrementing sequence numbers
memTable.Put([]byte("key1"), []byte("value1"), 1)
memTable.Put([]byte("key2"), []byte("value2"), 2)
memTable.Delete([]byte("key3"), 3)

// Retrieve a value
value, found := memTable.Get([]byte("key1"))
if found {
    fmt.Printf("Value: %s\n", value)
}

// Check if the MemTable is too large
if memTable.ApproximateSize() > 32*1024*1024 {
    memTable.SetImmutable()
    // Write to disk...
}
```

### Using MemTablePool

```go
// Create a pool with configuration.
// The variable is named cfg so it does not shadow the config package.
cfg := config.NewDefaultConfig("/path/to/data")
pool := memtable.NewMemTablePool(cfg)

// Add entries
pool.Put([]byte("key1"), []byte("value1"), 1)
pool.Delete([]byte("key2"), 2)

// Check if flushing is needed
if pool.IsFlushNeeded() {
    // Switch to a new active MemTable and get the old one for flushing
    immutable := pool.SwitchToNewMemTable()

    // Flush the immutable table to disk as an SSTable
    // ...
    _ = immutable // placeholder so the snippet compiles until the flush is wired in
}
```

### Iterating Over Entries

```go
// Create an iterator
iter := memTable.NewIterator()

// Iterate through all entries
for iter.SeekToFirst(); iter.Valid(); iter.Next() {
    fmt.Printf("%s: ", iter.Key())

    if iter.IsTombstone() {
        fmt.Println("<deleted>")
    } else {
        fmt.Printf("%s\n", iter.Value())
    }
}

// Or seek to a specific point
iter.Seek([]byte("key5"))
if iter.Valid() {
    fmt.Printf("Found: %s\n", iter.Key())
}
```

## Configuration Options

The MemTable behavior can be tuned through several configuration parameters:

1. **MemTableSize** (default: 32MB):
   - Maximum size before triggering a flush
   - Larger sizes improve write throughput but increase memory usage

2. **MaxMemTables** (default: 4):
   - Maximum number of MemTables in memory (active + immutable)
   - Higher values allow more in-flight flushes

3. **MaxMemTableAge** (default: 600 seconds):
   - Maximum age before forcing a flush
   - Ensures data isn't held in memory too long

## Trade-offs and Limitations

### Write Bursts and Flush Stalls

High write bursts can lead to multiple MemTables becoming immutable before the background flush process completes. The system handles this by:

1. Maintaining multiple immutable MemTables in memory
2. Tracking the number of immutable MemTables
3. Potentially slowing down writes if too many immutable MemTables accumulate

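One way to picture this backpressure is a bounded queue of immutable tables: when the queue is full, the writer gets a signal to stall or slow down. The `flushQueue` type and the capacity rule below are assumptions for illustration and do not describe how the engine actually throttles writes.

```go
package main

import "fmt"

// flushQueue holds IDs of immutable MemTables waiting for the background
// flusher. Its capacity bounds how many may accumulate.
type flushQueue chan int

// enqueue hands over an immutable table without blocking. It returns false
// when the queue is full, which is the signal to stall or slow the writer.
func (q flushQueue) enqueue(id int) bool {
	select {
	case q <- id:
		return true
	default:
		return false
	}
}

func main() {
	const maxMemTables = 4
	// One slot is reserved for the active table (assumed sizing rule).
	q := make(flushQueue, maxMemTables-1)

	for id := 1; id <= 4; id++ {
		fmt.Printf("table %d queued: %v\n", id, q.enqueue(id))
	}

	<-q // the background flusher finishes one table
	fmt.Printf("table 4 queued after flush: %v\n", q.enqueue(4))
}
```

The fourth table is refused until the flusher drains one slot, which models the stall a writer would see during a sustained burst.
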
### Memory Usage vs. Performance

The MemTable configuration involves balancing memory usage against performance:

1. **Larger MemTables**:
   - Pro: Better write performance, fewer disk flushes
   - Con: Higher memory usage, potentially longer recovery time

2. **Smaller MemTables**:
   - Pro: Lower memory usage, faster recovery
   - Con: More frequent flushes, potentially lower write throughput

### Ordering and Consistency

The MemTable maintains ordering via:

1. **Key Comparison**: Primary ordering by key
2. **Sequence Numbers**: Secondary ordering to handle updates to the same key
3. **Value Types**: Distinguishing between values and deletion markers

This ensures consistent state even with concurrent reads while a background flush is occurring.

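The ordering rules can be sketched as a comparator. The direction of the secondary ordering is not stated above; this example assumes newer entries (higher sequence numbers) sort first within a key, which is a common LSM convention. The `internalKey` type and `compare` function are hypothetical names.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

type internalKey struct {
	key       []byte
	seq       uint64
	tombstone bool
}

// compare orders by key ascending, then by sequence number descending so
// the newest version of a key sorts first (assumed direction).
func compare(a, b internalKey) int {
	if c := bytes.Compare(a.key, b.key); c != 0 {
		return c
	}
	switch {
	case a.seq > b.seq:
		return -1
	case a.seq < b.seq:
		return 1
	}
	return 0
}

func main() {
	keys := []internalKey{
		{key: []byte("b"), seq: 1},
		{key: []byte("a"), seq: 1},
		{key: []byte("a"), seq: 3, tombstone: true},
		{key: []byte("a"), seq: 2},
	}
	sort.Slice(keys, func(i, j int) bool { return compare(keys[i], keys[j]) < 0 })
	for _, k := range keys {
		fmt.Printf("%s@%d tombstone=%v\n", k.key, k.seq, k.tombstone)
	}
}
```

Under this ordering the first entry a reader meets for key `a` is the tombstone at sequence 3, so older values of `a` are never returned.
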