Engine Package Documentation
The engine package provides the core storage engine functionality for the Kevo project. It integrates all components (WAL, MemTable, SSTables, Compaction) into a unified storage system with a simple interface.
Overview
The Engine is the main entry point for interacting with the storage system. It implements a Log-Structured Merge (LSM) tree architecture, which provides efficient writes and reasonable read performance for key-value storage.
Key responsibilities of the Engine include:
- Managing the write path (WAL, MemTable, flush to SSTable)
- Coordinating the read path across multiple storage layers
- Handling concurrency with a single-writer design
- Providing transaction support
- Coordinating background operations like compaction
Architecture
Components and Data Flow
The engine orchestrates a multi-layered storage hierarchy:
┌───────────────────┐
│  Client Request   │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐     ┌───────────────────┐
│      Engine       │◄────┤   Transactions    │
└─────────┬─────────┘     └───────────────────┘
          │
          ▼
┌───────────────────┐     ┌───────────────────┐
│  Write-Ahead Log  │     │    Statistics     │
└─────────┬─────────┘     └───────────────────┘
          │
          ▼
┌───────────────────┐
│     MemTable      │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐     ┌───────────────────┐
│   Immutable MTs   │◄────┤    Background     │
└─────────┬─────────┘     │       Flush       │
          │               └───────────────────┘
          ▼
┌───────────────────┐     ┌───────────────────┐
│     SSTables      │◄────┤    Compaction     │
└───────────────────┘     └───────────────────┘
Key Sequence
Write Path:
- Client calls Put() or Delete()
- Operation is logged in the WAL for durability
- Data is added to the active MemTable
- When the MemTable reaches its size threshold, it becomes immutable
- A background process flushes immutable MemTables to SSTables
- Periodically, compaction merges SSTables for better read performance
Read Path:
- Client calls Get()
- Engine searches for the key in this order:
  a. Active MemTable
  b. Immutable MemTables (if any)
  c. SSTables (from newest to oldest)
- First occurrence of the key determines the result
- Tombstones (deletion markers) cause a "key not found" result
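The lookup order above can be sketched in a few lines of Go. This is an illustration only, not Kevo's code: the `entry` and `layer` types and the `lookup` function are hypothetical, and plain maps stand in for MemTables and SSTables.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrKeyNotFound is redefined here so the sketch is self-contained.
var ErrKeyNotFound = errors.New("key not found")

// entry is a hypothetical stored record; a tombstone marks a deleted key.
type entry struct {
	value     []byte
	tombstone bool
}

// layer stands in for one MemTable or SSTable (assumption: a real layer
// would be a skip list or an on-disk file, not a map).
type layer map[string]entry

// lookup walks the layers in priority order: active MemTable, immutable
// MemTables, then SSTables from newest to oldest. The first layer that
// holds the key decides the result.
func lookup(key string, layers []layer) ([]byte, error) {
	for _, l := range layers {
		e, ok := l[key]
		if !ok {
			continue // not here, try the next (older) layer
		}
		if e.tombstone {
			// Deleted: older layers must not be consulted.
			return nil, ErrKeyNotFound
		}
		return e.value, nil
	}
	return nil, ErrKeyNotFound
}

func main() {
	active := layer{"b": {tombstone: true}}
	immutable := layer{"a": {value: []byte("new")}}
	sstable := layer{"a": {value: []byte("old")}, "b": {value: []byte("stale")}}
	layers := []layer{active, immutable, sstable}

	v, err := lookup("a", layers)
	fmt.Println(string(v), err) // new <nil>

	_, err = lookup("b", layers)
	fmt.Println(err) // key not found
}
```

Stopping at the first tombstone is what keeps a deleted key from being "resurrected" by a stale value in an older SSTable.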
Implementation Details
Engine Structure
The Engine struct contains several important fields:
- Configuration: The engine's configuration and paths
- Storage Components: WAL, MemTable pool, and SSTable readers
- Concurrency Control: Locks for coordination
- State Management: Tracking variables for file numbers, sequence numbers, etc.
- Background Processes: Channels and goroutines for background tasks
Key Operations
Initialization
The NewEngine() function initializes a storage engine by:
- Creating required directories
- Loading or creating configuration
- Initializing the WAL
- Creating a MemTable pool
- Loading existing SSTables
- Recovering data from WAL if necessary
- Starting background tasks for flushing and compaction
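The WAL recovery step can be pictured as replaying logged operations, in order, into an empty MemTable. The sketch below assumes a made-up record layout (one op byte followed by length-prefixed key and value); Kevo's real WAL format, and the names `appendRecord`, `readField`, and `replay`, are not taken from the source.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// Hypothetical record types for this sketch.
const (
	opPut    byte = 1
	opDelete byte = 2
)

// appendRecord encodes one record: op byte, then key and value, each
// prefixed with a 4-byte little-endian length.
func appendRecord(buf []byte, op byte, key, value []byte) []byte {
	buf = append(buf, op)
	for _, field := range [][]byte{key, value} {
		buf = binary.LittleEndian.AppendUint32(buf, uint32(len(field)))
		buf = append(buf, field...)
	}
	return buf
}

// readField reads one length-prefixed field.
func readField(r io.Reader) ([]byte, error) {
	var lenBuf [4]byte
	if _, err := io.ReadFull(r, lenBuf[:]); err != nil {
		return nil, err
	}
	field := make([]byte, binary.LittleEndian.Uint32(lenBuf[:]))
	if _, err := io.ReadFull(r, field); err != nil {
		return nil, err
	}
	return field, nil
}

// replay applies every record in log order to mem. A nil value is used
// as the tombstone for a deleted key.
func replay(wal []byte, mem map[string][]byte) error {
	r := bytes.NewReader(wal)
	for {
		op, err := r.ReadByte()
		if err == io.EOF {
			return nil // clean end of log
		}
		if err != nil {
			return err
		}
		key, err := readField(r)
		if err != nil {
			return fmt.Errorf("truncated WAL record: %w", err)
		}
		value, err := readField(r)
		if err != nil {
			return fmt.Errorf("truncated WAL record: %w", err)
		}
		switch op {
		case opPut:
			mem[string(key)] = value
		case opDelete:
			mem[string(key)] = nil
		default:
			return fmt.Errorf("unknown WAL op %d", op)
		}
	}
}

func main() {
	var wal []byte
	wal = appendRecord(wal, opPut, []byte("a"), []byte("1"))
	wal = appendRecord(wal, opDelete, []byte("a"), nil)
	wal = appendRecord(wal, opPut, []byte("b"), []byte("2"))

	mem := map[string][]byte{}
	if err := replay(wal, mem); err != nil {
		fmt.Println("recovery failed:", err)
		return
	}
	fmt.Printf("a deleted: %v, b = %s\n", mem["a"] == nil, mem["b"])
}
```

A production log would normally also carry checksums and sequence numbers so that a torn final record can be told apart from corruption in the middle of the file.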
Write Operations
The Put() and Delete() methods follow a similar pattern:
- Acquire a write lock
- Append the operation to the WAL
- Update the active MemTable
- Check if the MemTable needs to be flushed
- Release the lock
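The five steps above map onto code roughly as follows. This is a simplified stand-in under stated assumptions, not the engine's implementation: `miniEngine` and `apply` are hypothetical names, a `bytes.Buffer` replaces the on-disk WAL, a map replaces the skip list, and a "flush" just counts and resets.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// miniEngine is a hypothetical in-memory stand-in for the engine.
type miniEngine struct {
	mu        sync.Mutex
	wal       bytes.Buffer      // stands in for the WAL file
	mem       map[string][]byte // stands in for the active MemTable
	memBytes  int               // approximate MemTable size
	threshold int               // size at which a flush is requested
	flushes   int               // number of flushes requested so far
}

func newMiniEngine(threshold int) *miniEngine {
	return &miniEngine{mem: map[string][]byte{}, threshold: threshold}
}

// apply is the shared write path for Put and Delete.
func (e *miniEngine) apply(op byte, key, value []byte) error {
	e.mu.Lock()         // 1. acquire the write lock
	defer e.mu.Unlock() // 5. release it on return

	// 2. Log first, so the operation survives a crash. A real record
	// would carry length prefixes and a checksum.
	record := append([]byte{op}, key...)
	record = append(record, value...)
	if _, err := e.wal.Write(record); err != nil {
		return fmt.Errorf("wal append: %w", err)
	}

	// 3. Update the active MemTable; a nil value acts as a tombstone.
	e.mem[string(key)] = value
	e.memBytes += len(key) + len(value)

	// 4. Request a flush once the size threshold is reached.
	if e.memBytes >= e.threshold {
		e.flushes++
		e.mem = map[string][]byte{}
		e.memBytes = 0
	}
	return nil
}

func (e *miniEngine) Put(key, value []byte) error { return e.apply('P', key, value) }
func (e *miniEngine) Delete(key []byte) error     { return e.apply('D', key, nil) }

func main() {
	e := newMiniEngine(16)
	_ = e.Put([]byte("k1"), []byte("value1"))
	_ = e.Put([]byte("k2"), []byte("value2"))
	_ = e.Delete([]byte("k1"))
	fmt.Println("flushes:", e.flushes, "wal bytes:", e.wal.Len())
}
```

Writing to the log before touching the MemTable is the ordering that matters: if the process dies between the two steps, recovery can still replay the operation.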
Read Operations
The Get() method:
- Acquires a read lock
- Checks the active MemTable, then any immutable MemTables, for the key
- If not found, checks SSTables in order from newest to oldest
- Handles tombstones (deletion markers) appropriately
- Returns the value or a "key not found" error
MemTable Flushing
When a MemTable becomes full:
- The scheduleFlush() method switches to a new active MemTable
- The filled MemTable becomes immutable
- A background process flushes the immutable MemTable to an SSTable
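One common way to structure this hand-off in Go is a channel feeding a background goroutine. The sketch below assumes that design; apart from the name scheduleFlush, every identifier is hypothetical, and an "SSTable" is reduced to a sorted slice of `key=value` strings.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

type memTable map[string]string

// flusher is a hypothetical holder for the active MemTable and the
// tables produced by the background flush.
type flusher struct {
	mu       sync.Mutex
	active   memTable
	sstables [][]string    // each flushed table, sorted by key
	flushCh  chan memTable // filled MemTables waiting to be flushed
	done     chan struct{} // closed when the background goroutine exits
}

func newFlusher() *flusher {
	f := &flusher{
		active:  memTable{},
		flushCh: make(chan memTable, 4),
		done:    make(chan struct{}),
	}
	go f.run()
	return f
}

func (f *flusher) put(key, value string) {
	f.mu.Lock()
	f.active[key] = value
	f.mu.Unlock()
}

// scheduleFlush swaps in a fresh active MemTable and queues the filled
// one, which is never written to again.
func (f *flusher) scheduleFlush() {
	f.mu.Lock()
	full := f.active
	f.active = memTable{}
	f.mu.Unlock()
	f.flushCh <- full
}

// run is the background process: it turns each immutable MemTable into
// a sorted table.
func (f *flusher) run() {
	defer close(f.done)
	for mt := range f.flushCh {
		keys := make([]string, 0, len(mt))
		for k := range mt {
			keys = append(keys, k)
		}
		sort.Strings(keys)
		table := make([]string, 0, len(keys))
		for _, k := range keys {
			table = append(table, k+"="+mt[k])
		}
		f.mu.Lock()
		f.sstables = append(f.sstables, table)
		f.mu.Unlock()
	}
}

// close stops accepting flushes and waits for pending ones to finish.
func (f *flusher) close() {
	close(f.flushCh)
	<-f.done
}

func main() {
	f := newFlusher()
	f.put("b", "2")
	f.put("a", "1")
	f.scheduleFlush()
	f.put("c", "3") // lands in the new active MemTable
	f.close()
	fmt.Println(f.sstables, len(f.active))
}
```

Because the swap happens under the lock and the slow work happens in the goroutine, writers are blocked only for the duration of a pointer exchange.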
SSTable Management
SSTables are organized by level for compaction:
- Level 0 contains SSTables directly flushed from MemTables
- Higher levels are created through compaction
- Keys may overlap between SSTables in Level 0
- Keys are non-overlapping between SSTables in higher levels
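The overlap rules determine how many files a point lookup has to open. The following sketch assumes each SSTable is described by its smallest and largest key; `tableMeta` and `candidates` are hypothetical names introduced for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// tableMeta is a hypothetical descriptor for one SSTable.
type tableMeta struct {
	name              string
	smallest, largest string
}

func (t tableMeta) mayContain(key string) bool {
	return t.smallest <= key && key <= t.largest
}

// candidates returns the SSTables that must be checked for key.
// levels[0] is Level 0, ordered newest first. Tables in higher levels
// are assumed sorted by key range and non-overlapping.
func candidates(levels [][]tableMeta, key string) []string {
	var out []string
	for i, level := range levels {
		if i == 0 {
			// Level 0 ranges may overlap: every table must be tested.
			for _, t := range level {
				if t.mayContain(key) {
					out = append(out, t.name)
				}
			}
			continue
		}
		// Higher levels: binary search for the first table whose
		// largest key is >= key. At most one table can match.
		j := sort.Search(len(level), func(j int) bool {
			return level[j].largest >= key
		})
		if j < len(level) && level[j].mayContain(key) {
			out = append(out, level[j].name)
		}
	}
	return out
}

func main() {
	levels := [][]tableMeta{
		{{"l0-b", "a", "m"}, {"l0-a", "f", "z"}},
		{{"l1-a", "a", "f"}, {"l1-b", "g", "p"}, {"l1-c", "q", "z"}},
	}
	fmt.Println(candidates(levels, "h")) // [l0-b l0-a l1-b]
}
```

This is why a growing Level 0 hurts reads: each additional overlapping file is one more candidate, whereas a higher level contributes at most one.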
Transaction Support
The engine provides ACID-compliant transactions through:
- Atomicity: WAL logging and atomic batch operations
- Consistency: Single-writer architecture
- Isolation: Reader-writer concurrency control (similar to SQLite)
- Durability: WAL ensures operations are persisted before being considered committed
Transactions are created using the BeginTransaction() method, which returns a Transaction interface with these key methods:
- Get(), Put(), Delete(): For data operations
- NewIterator(), NewRangeIterator(): For scanning data
- Commit(), Rollback(): For transaction control
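A small helper that commits on success and rolls back on error keeps transaction control in one place. The sketch below is built on assumptions: the `Transaction` interface shown is a reduced, guessed subset (the real signatures may differ), and `bufferedTx` and `withTx` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Transaction is an assumed subset of the interface described above.
type Transaction interface {
	Put(key, value []byte) error
	Commit() error
	Rollback() error
}

// errAbort is a sample application error used to force a rollback.
var errAbort = errors.New("abort")

// bufferedTx is an in-memory stand-in: writes are buffered and reach
// the store only on Commit, which gives all-or-nothing behavior.
type bufferedTx struct {
	store   map[string]string
	pending map[string]string
}

func (t *bufferedTx) Put(key, value []byte) error {
	t.pending[string(key)] = string(value)
	return nil
}

func (t *bufferedTx) Commit() error {
	for k, v := range t.pending {
		t.store[k] = v
	}
	t.pending = map[string]string{}
	return nil
}

func (t *bufferedTx) Rollback() error {
	t.pending = map[string]string{}
	return nil
}

// withTx runs fn inside tx, committing if fn succeeds and rolling back
// if it fails.
func withTx(tx Transaction, fn func(Transaction) error) error {
	if err := fn(tx); err != nil {
		if rbErr := tx.Rollback(); rbErr != nil {
			return fmt.Errorf("%w (rollback also failed: %v)", err, rbErr)
		}
		return err
	}
	return tx.Commit()
}

func main() {
	store := map[string]string{}

	err := withTx(&bufferedTx{store: store, pending: map[string]string{}},
		func(tx Transaction) error {
			if err := tx.Put([]byte("k"), []byte("v")); err != nil {
				return err
			}
			return errAbort // simulate a failure after the write
		})
	fmt.Println(err, len(store)) // abort 0

	err = withTx(&bufferedTx{store: store, pending: map[string]string{}},
		func(tx Transaction) error {
			return tx.Put([]byte("k"), []byte("v"))
		})
	fmt.Println(err, store["k"]) // <nil> v
}
```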
Error Handling
The engine handles various error conditions:
- File system errors during WAL and SSTable operations
- Memory limitations
- Concurrency issues
- Recovery from crashes
Key errors that may be returned include:
- ErrEngineClosed: When operations are attempted on a closed engine
- ErrKeyNotFound: When a key is not found during retrieval
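Callers usually want to treat a missing key as an ordinary outcome and everything else as a failure. The sketch below shows that pattern with `errors.Is`. The two error variables are redefined locally so the example compiles on its own, and `getter`, `fakeEngine`, and `getOrDefault` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins for the engine's sentinel errors.
var (
	ErrEngineClosed = errors.New("engine closed")
	ErrKeyNotFound  = errors.New("key not found")
)

// getter is the slice of the engine API this sketch needs (assumed to
// match the Get() usage shown in this document).
type getter interface {
	Get(key []byte) ([]byte, error)
}

// getOrDefault returns def when the key is absent and wraps any other
// error with context.
func getOrDefault(g getter, key, def []byte) ([]byte, error) {
	v, err := g.Get(key)
	switch {
	case err == nil:
		return v, nil
	case errors.Is(err, ErrKeyNotFound):
		return def, nil
	default:
		return nil, fmt.Errorf("get %q: %w", key, err)
	}
}

// fakeEngine is a minimal in-memory implementation for the demo.
type fakeEngine struct {
	data   map[string][]byte
	closed bool
}

func (f *fakeEngine) Get(key []byte) ([]byte, error) {
	if f.closed {
		return nil, ErrEngineClosed
	}
	v, ok := f.data[string(key)]
	if !ok {
		return nil, ErrKeyNotFound
	}
	return v, nil
}

func main() {
	eng := &fakeEngine{data: map[string][]byte{"a": []byte("1")}}

	v, err := getOrDefault(eng, []byte("missing"), []byte("fallback"))
	fmt.Println(string(v), err) // fallback <nil>

	eng.closed = true
	_, err = getOrDefault(eng, []byte("a"), nil)
	fmt.Println(errors.Is(err, ErrEngineClosed)) // true
}
```

Wrapping with `%w` keeps the sentinel reachable through `errors.Is` while still adding the key to the message.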
Performance Considerations
Statistics
The engine maintains detailed statistics for monitoring:
- Operation counters (puts, gets, deletes)
- Hit and miss rates
- Bytes read and written
- Flush counts and MemTable sizes
- Error tracking
These statistics can be accessed via the GetStats() method.
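Counters of this kind are typically kept with atomic operations so the hot path never takes a lock. The sketch below assumes that approach; the `stats` type, its methods, and the map-shaped snapshot are hypothetical and say nothing about what GetStats() actually returns.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// stats holds hypothetical operation counters, updated lock-free.
type stats struct {
	puts, gets, hits, misses atomic.Uint64
}

func (s *stats) recordPut() { s.puts.Add(1) }

func (s *stats) recordGet(found bool) {
	s.gets.Add(1)
	if found {
		s.hits.Add(1)
	} else {
		s.misses.Add(1)
	}
}

// snapshot copies the counters into a plain map for reporting.
func (s *stats) snapshot() map[string]uint64 {
	return map[string]uint64{
		"puts":   s.puts.Load(),
		"gets":   s.gets.Load(),
		"hits":   s.hits.Load(),
		"misses": s.misses.Load(),
	}
}

// hitRate derives the fraction of gets that found a value.
func hitRate(snap map[string]uint64) float64 {
	if snap["gets"] == 0 {
		return 0
	}
	return float64(snap["hits"]) / float64(snap["gets"])
}

func main() {
	var s stats
	var wg sync.WaitGroup
	for w := 0; w < 4; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 100; i++ {
				s.recordGet(i%4 == 0) // every fourth get is a hit
			}
		}()
	}
	wg.Wait()

	snap := s.snapshot()
	fmt.Printf("gets=%d hit rate=%.2f\n", snap["gets"], hitRate(snap)) // gets=400 hit rate=0.25
}
```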
Tuning Parameters
Performance can be tuned through the configuration parameters:
- MemTable size
- WAL sync mode
- SSTable block size
- Compaction settings
Resource Management
The engine manages resources to prevent excessive memory usage:
- MemTables are flushed when they reach a size threshold
- Background processing prevents memory buildup
- File descriptors for SSTables are managed carefully
Common Usage Patterns
Basic Usage
// Create an engine
eng, err := engine.NewEngine("/path/to/data")
if err != nil {
	log.Fatal(err)
}
defer eng.Close()

// Store and retrieve data
err = eng.Put([]byte("key"), []byte("value"))
if err != nil {
	log.Fatal(err)
}

value, err := eng.Get([]byte("key"))
if err != nil {
	log.Fatal(err)
}
fmt.Printf("Value: %s\n", value)
Using Transactions
// Begin a transaction
tx, err := eng.BeginTransaction(false) // false = read-write transaction
if err != nil {
	log.Fatal(err)
}

// Perform operations in the transaction
err = tx.Put([]byte("key1"), []byte("value1"))
if err != nil {
	tx.Rollback()
	log.Fatal(err)
}

// Commit the transaction
err = tx.Commit()
if err != nil {
	log.Fatal(err)
}
Iterating Over Keys
// Get an iterator for all keys
iter, err := eng.GetIterator()
if err != nil {
	log.Fatal(err)
}

// Iterate from the first key
for iter.SeekToFirst(); iter.Valid(); iter.Next() {
	fmt.Printf("%s: %s\n", iter.Key(), iter.Value())
}

// Get an iterator for a specific range
rangeIter, err := eng.GetRangeIterator([]byte("start"), []byte("end"))
if err != nil {
	log.Fatal(err)
}

// Iterate through the range
for rangeIter.SeekToFirst(); rangeIter.Valid(); rangeIter.Next() {
	fmt.Printf("%s: %s\n", rangeIter.Key(), rangeIter.Value())
}
Comparison with Other Storage Engines
Unlike production storage engines such as RocksDB or LevelDB, the Kevo engine prioritizes:
- Simplicity: Clear Go implementation with minimal dependencies
- Educational Value: Code readability over absolute performance
- Composability: Clean interfaces for higher-level abstractions
- Single-Node Focus: No distributed features to complicate the design
Features missing compared to production engines:
- Bloom filters (optional enhancement)
- Advanced caching systems
- Complex compression schemes
- Multi-node distribution capabilities
Limitations and Trade-offs
- Write Amplification: LSM-trees involve multiple writes of the same data
- Read Amplification: May need to check multiple layers for a single key
- Space Amplification: Some space overhead for tombstones and overlapping keys
- Background Compaction: Read and write latency may fluctuate while compaction competes for disk I/O
However, the design mitigates these issues:
- Efficient in-memory structures minimize disk accesses
- Hierarchical iterators optimize range scans
- Compaction strategies reduce read amplification over time
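To make the last point concrete, the sketch below shows the core of a compaction step: merging two key-sorted runs into one so that later reads consult a single table. It is a simplified illustration with hypothetical names (`kv`, `compact`), not Kevo's compaction code, and it assumes the output is the oldest data for its key range so that tombstones can be dropped.

```go
package main

import "fmt"

// kv is a hypothetical SSTable entry.
type kv struct {
	key, value string
	tombstone  bool
}

// compact merges two key-sorted runs. When both contain a key, the
// entry from newer wins. Tombstones are dropped, which is only safe
// when no older data for the key exists below the output (assumed).
func compact(newer, older []kv) []kv {
	out := make([]kv, 0, len(newer)+len(older))
	emit := func(e kv) {
		if !e.tombstone {
			out = append(out, e)
		}
	}

	i, j := 0, 0
	for i < len(newer) && j < len(older) {
		switch {
		case newer[i].key < older[j].key:
			emit(newer[i])
			i++
		case newer[i].key > older[j].key:
			emit(older[j])
			j++
		default: // same key: keep the newer entry, discard the older
			emit(newer[i])
			i++
			j++
		}
	}
	for ; i < len(newer); i++ {
		emit(newer[i])
	}
	for ; j < len(older); j++ {
		emit(older[j])
	}
	return out
}

func main() {
	newer := []kv{{key: "a", value: "1-new"}, {key: "c", tombstone: true}}
	older := []kv{{key: "a", value: "1-old"}, {key: "b", value: "2"}, {key: "c", value: "3"}}

	for _, e := range compact(newer, older) {
		fmt.Printf("%s=%s\n", e.key, e.value)
	}
	// a=1-new
	// b=2
}
```

Five input entries shrink to two: the overwritten value and the deleted key disappear, which is also how compaction reclaims the space overhead listed above.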