kevo/docs/engine.md
Jeremy Tregunna 6fc3be617d
Some checks failed
Go Tests / Run Tests (1.24.2) (push) Has been cancelled
feat: Initial release of kevo storage engine.
Adds a complete LSM-based storage engine with these features:
- Single-writer based architecture for the storage engine
- WAL for durability, and hey it's configurable
- MemTable with skip list implementation for fast read/writes
- SSTable with block-based structure for on-disk level-based storage
- Background compaction with tiered strategy
- ACID transactions
- Good documentation (I hope)
2025-04-20 14:06:50 -06:00

283 lines
9.2 KiB
Markdown

# Engine Package Documentation
The `engine` package provides the core storage engine functionality for the Kevo project. It integrates all components (WAL, MemTable, SSTables, Compaction) into a unified storage system with a simple interface.
## Overview
The Engine is the main entry point for interacting with the storage system. It implements a Log-Structured Merge (LSM) tree architecture, which provides efficient writes and reasonable read performance for key-value storage.
Key responsibilities of the Engine include:
- Managing the write path (WAL, MemTable, flush to SSTable)
- Coordinating the read path across multiple storage layers
- Handling concurrency with a single-writer design
- Providing transaction support
- Coordinating background operations like compaction
## Architecture
### Components and Data Flow
The engine orchestrates a multi-layered storage hierarchy:
```
┌───────────────────┐
│ Client Request │
└─────────┬─────────┘
┌───────────────────┐ ┌───────────────────┐
│ Engine │◄────┤ Transactions │
└─────────┬─────────┘ └───────────────────┘
┌───────────────────┐ ┌───────────────────┐
│ Write-Ahead Log │ │ Statistics │
└─────────┬─────────┘ └───────────────────┘
┌───────────────────┐
│ MemTable │
└─────────┬─────────┘
┌───────────────────┐ ┌───────────────────┐
│ Immutable MTs │◄────┤ Background │
└─────────┬─────────┘ │ Flush │
│ └───────────────────┘
┌───────────────────┐ ┌───────────────────┐
│ SSTables │◄────┤ Compaction │
└───────────────────┘ └───────────────────┘
```
### Key Sequence
1. **Write Path**:
- Client calls `Put()` or `Delete()`
- Operation is logged in WAL for durability
- Data is added to the active MemTable
- When the MemTable reaches its size threshold, it becomes immutable
- A background process flushes immutable MemTables to SSTables
- Periodically, compaction merges SSTables for better read performance
2. **Read Path**:
- Client calls `Get()`
- Engine searches for the key in this order:
a. Active MemTable
b. Immutable MemTables (if any)
c. SSTables (from newest to oldest)
- First occurrence of the key determines the result
- Tombstones (deletion markers) cause key not found results
## Implementation Details
### Engine Structure
The Engine struct contains several important fields:
- **Configuration**: The engine's configuration and paths
- **Storage Components**: WAL, MemTable pool, and SSTable readers
- **Concurrency Control**: Locks for coordination
- **State Management**: Tracking variables for file numbers, sequence numbers, etc.
- **Background Processes**: Channels and goroutines for background tasks
### Key Operations
#### Initialization
The `NewEngine()` function initializes a storage engine by:
1. Creating required directories
2. Loading or creating configuration
3. Initializing the WAL
4. Creating a MemTable pool
5. Loading existing SSTables
6. Recovering data from WAL if necessary
7. Starting background tasks for flushing and compaction
#### Write Operations
The `Put()` and `Delete()` methods follow a similar pattern:
1. Acquire a write lock
2. Append the operation to the WAL
3. Update the active MemTable
4. Check if the MemTable needs to be flushed
5. Release the lock
#### Read Operations
The `Get()` method:
1. Acquires a read lock
2. Checks the MemTable for the key
3. If not found, checks SSTables in order from newest to oldest
4. Handles tombstones (deletion markers) appropriately
5. Returns the value or a "key not found" error
#### MemTable Flushing
When a MemTable becomes full:
1. The `scheduleFlush()` method switches to a new active MemTable
2. The filled MemTable becomes immutable
3. A background process flushes the immutable MemTable to an SSTable
#### SSTable Management
SSTables are organized by level for compaction:
- Level 0 contains SSTables directly flushed from MemTables
- Higher levels are created through compaction
- Keys may overlap between SSTables in Level 0
- Keys are non-overlapping between SSTables in higher levels
## Transaction Support
The engine provides ACID-compliant transactions through:
1. **Atomicity**: WAL logging and atomic batch operations
2. **Consistency**: Single-writer architecture
3. **Isolation**: Reader-writer concurrency control (similar to SQLite)
4. **Durability**: WAL ensures operations are persisted before being considered committed
Transactions are created using the `BeginTransaction()` method, which returns a `Transaction` interface with these key methods:
- `Get()`, `Put()`, `Delete()`: For data operations
- `NewIterator()`, `NewRangeIterator()`: For scanning data
- `Commit()`, `Rollback()`: For transaction control
## Error Handling
The engine handles various error conditions:
- File system errors during WAL and SSTable operations
- Memory limitations
- Concurrency issues
- Recovery from crashes
Key errors that may be returned include:
- `ErrEngineClosed`: When operations are attempted on a closed engine
- `ErrKeyNotFound`: When a key is not found during retrieval
## Performance Considerations
### Statistics
The engine maintains detailed statistics for monitoring:
- Operation counters (puts, gets, deletes)
- Hit and miss rates
- Bytes read and written
- Flush counts and MemTable sizes
- Error tracking
These statistics can be accessed via the `GetStats()` method.
### Tuning Parameters
Performance can be tuned through the configuration parameters:
- MemTable size
- WAL sync mode
- SSTable block size
- Compaction settings
### Resource Management
The engine manages resources to prevent excessive memory usage:
- MemTables are flushed when they reach a size threshold
- Background processing prevents memory buildup
- File descriptors for SSTables are managed carefully
## Common Usage Patterns
### Basic Usage
```go
// Create an engine
eng, err := engine.NewEngine("/path/to/data")
if err != nil {
log.Fatal(err)
}
defer eng.Close()
// Store and retrieve data
err = eng.Put([]byte("key"), []byte("value"))
if err != nil {
log.Fatal(err)
}
value, err := eng.Get([]byte("key"))
if err != nil {
log.Fatal(err)
}
fmt.Printf("Value: %s\n", value)
```
### Using Transactions
```go
// Begin a transaction
tx, err := eng.BeginTransaction(false) // false = read-write transaction
if err != nil {
log.Fatal(err)
}
// Perform operations in the transaction
err = tx.Put([]byte("key1"), []byte("value1"))
if err != nil {
tx.Rollback()
log.Fatal(err)
}
// Commit the transaction
err = tx.Commit()
if err != nil {
log.Fatal(err)
}
```
### Iterating Over Keys
```go
// Get an iterator for all keys
iter, err := eng.GetIterator()
if err != nil {
log.Fatal(err)
}
// Iterate from the first key
for iter.SeekToFirst(); iter.Valid(); iter.Next() {
fmt.Printf("%s: %s\n", iter.Key(), iter.Value())
}
// Get an iterator for a specific range
rangeIter, err := eng.GetRangeIterator([]byte("start"), []byte("end"))
if err != nil {
log.Fatal(err)
}
// Iterate through the range
for rangeIter.SeekToFirst(); rangeIter.Valid(); rangeIter.Next() {
fmt.Printf("%s: %s\n", rangeIter.Key(), rangeIter.Value())
}
```
## Comparison with Other Storage Engines
Unlike many production storage engines like RocksDB or LevelDB, the Kevo engine prioritizes:
1. **Simplicity**: Clear Go implementation with minimal dependencies
2. **Educational Value**: Code readability over absolute performance
3. **Composability**: Clean interfaces for higher-level abstractions
4. **Single-Node Focus**: No distributed features to complicate the design
Features missing compared to production engines:
- Bloom filters (optional enhancement)
- Advanced caching systems
- Complex compression schemes
- Multi-node distribution capabilities
## Limitations and Trade-offs
- **Write Amplification**: LSM-trees involve multiple writes of the same data
- **Read Amplification**: May need to check multiple layers for a single key
- **Space Amplification**: Some space overhead for tombstones and overlapping keys
- **Background Compaction**: Performance may be affected by background compaction
However, the design mitigates these issues:
- Efficient in-memory structures minimize disk accesses
- Hierarchical iterators optimize range scans
- Compaction strategies reduce read amplification over time