kevo/docs/engine.md
Jeremy Tregunna 6fc3be617d
Some checks failed
Go Tests / Run Tests (1.24.2) (push) Has been cancelled
feat: Initial release of kevo storage engine.
Adds a complete LSM-based storage engine with these features:
- Single-writer based architecture for the storage engine
- WAL for durability, and hey it's configurable
- MemTable with skip list implementation for fast read/writes
- SSTable with block-based structure for on-disk level-based storage
- Background compaction with tiered strategy
- ACID transactions
- Good documentation (I hope)
2025-04-20 14:06:50 -06:00

9.2 KiB

Engine Package Documentation

The engine package provides the core storage engine functionality for the Kevo project. It integrates all components (WAL, MemTable, SSTables, Compaction) into a unified storage system with a simple interface.

Overview

The Engine is the main entry point for interacting with the storage system. It implements a Log-Structured Merge (LSM) tree architecture, which provides efficient writes and reasonable read performance for key-value storage.

Key responsibilities of the Engine include:

  • Managing the write path (WAL, MemTable, flush to SSTable)
  • Coordinating the read path across multiple storage layers
  • Handling concurrency with a single-writer design
  • Providing transaction support
  • Coordinating background operations like compaction

Architecture

Components and Data Flow

The engine orchestrates a multi-layered storage hierarchy:

┌───────────────────┐
│  Client Request   │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐     ┌───────────────────┐
│      Engine       │◄────┤   Transactions    │
└─────────┬─────────┘     └───────────────────┘
          │
          ▼
┌───────────────────┐     ┌───────────────────┐
│  Write-Ahead Log  │     │    Statistics     │
└─────────┬─────────┘     └───────────────────┘
          │
          ▼
┌───────────────────┐
│     MemTable      │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐     ┌───────────────────┐
│  Immutable MTs    │◄────┤    Background     │
└─────────┬─────────┘     │      Flush        │
          │               └───────────────────┘
          ▼
┌───────────────────┐     ┌───────────────────┐
│     SSTables      │◄────┤     Compaction    │
└───────────────────┘     └───────────────────┘

Key Sequence

  1. Write Path:

    • Client calls Put() or Delete()
    • Operation is logged in WAL for durability
    • Data is added to the active MemTable
    • When the MemTable reaches its size threshold, it becomes immutable
    • A background process flushes immutable MemTables to SSTables
    • Periodically, compaction merges SSTables for better read performance
  2. Read Path:

    • Client calls Get()
    • Engine searches for the key in this order: a. Active MemTable b. Immutable MemTables (if any) c. SSTables (from newest to oldest)
    • First occurrence of the key determines the result
    • Tombstones (deletion markers) cause key not found results

Implementation Details

Engine Structure

The Engine struct contains several important fields:

  • Configuration: The engine's configuration and paths
  • Storage Components: WAL, MemTable pool, and SSTable readers
  • Concurrency Control: Locks for coordination
  • State Management: Tracking variables for file numbers, sequence numbers, etc.
  • Background Processes: Channels and goroutines for background tasks

Key Operations

Initialization

The NewEngine() function initializes a storage engine by:

  1. Creating required directories
  2. Loading or creating configuration
  3. Initializing the WAL
  4. Creating a MemTable pool
  5. Loading existing SSTables
  6. Recovering data from WAL if necessary
  7. Starting background tasks for flushing and compaction

Write Operations

The Put() and Delete() methods follow a similar pattern:

  1. Acquire a write lock
  2. Append the operation to the WAL
  3. Update the active MemTable
  4. Check if the MemTable needs to be flushed
  5. Release the lock

Read Operations

The Get() method:

  1. Acquires a read lock
  2. Checks the MemTable for the key
  3. If not found, checks SSTables in order from newest to oldest
  4. Handles tombstones (deletion markers) appropriately
  5. Returns the value or a "key not found" error

MemTable Flushing

When a MemTable becomes full:

  1. The scheduleFlush() method switches to a new active MemTable
  2. The filled MemTable becomes immutable
  3. A background process flushes the immutable MemTable to an SSTable

SSTable Management

SSTables are organized by level for compaction:

  • Level 0 contains SSTables directly flushed from MemTables
  • Higher levels are created through compaction
  • Keys may overlap between SSTables in Level 0
  • Keys are non-overlapping between SSTables in higher levels

Transaction Support

The engine provides ACID-compliant transactions through:

  1. Atomicity: WAL logging and atomic batch operations
  2. Consistency: Single-writer architecture
  3. Isolation: Reader-writer concurrency control (similar to SQLite)
  4. Durability: WAL ensures operations are persisted before being considered committed

Transactions are created using the BeginTransaction() method, which returns a Transaction interface with these key methods:

  • Get(), Put(), Delete(): For data operations
  • NewIterator(), NewRangeIterator(): For scanning data
  • Commit(), Rollback(): For transaction control

Error Handling

The engine handles various error conditions:

  • File system errors during WAL and SSTable operations
  • Memory limitations
  • Concurrency issues
  • Recovery from crashes

Key errors that may be returned include:

  • ErrEngineClosed: When operations are attempted on a closed engine
  • ErrKeyNotFound: When a key is not found during retrieval

Performance Considerations

Statistics

The engine maintains detailed statistics for monitoring:

  • Operation counters (puts, gets, deletes)
  • Hit and miss rates
  • Bytes read and written
  • Flush counts and MemTable sizes
  • Error tracking

These statistics can be accessed via the GetStats() method.

Tuning Parameters

Performance can be tuned through the configuration parameters:

  • MemTable size
  • WAL sync mode
  • SSTable block size
  • Compaction settings

Resource Management

The engine manages resources to prevent excessive memory usage:

  • MemTables are flushed when they reach a size threshold
  • Background processing prevents memory buildup
  • File descriptors for SSTables are managed carefully

Common Usage Patterns

Basic Usage

// Create an engine
eng, err := engine.NewEngine("/path/to/data")
if err != nil {
    log.Fatal(err)
}
defer eng.Close()

// Store and retrieve data
err = eng.Put([]byte("key"), []byte("value"))
if err != nil {
    log.Fatal(err)
}

value, err := eng.Get([]byte("key"))
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Value: %s\n", value)

Using Transactions

// Begin a transaction
tx, err := eng.BeginTransaction(false) // false = read-write transaction
if err != nil {
    log.Fatal(err)
}

// Perform operations in the transaction
err = tx.Put([]byte("key1"), []byte("value1"))
if err != nil {
    tx.Rollback()
    log.Fatal(err)
}

// Commit the transaction
err = tx.Commit()
if err != nil {
    log.Fatal(err)
}

Iterating Over Keys

// Get an iterator for all keys
iter, err := eng.GetIterator()
if err != nil {
    log.Fatal(err)
}

// Iterate from the first key
for iter.SeekToFirst(); iter.Valid(); iter.Next() {
    fmt.Printf("%s: %s\n", iter.Key(), iter.Value())
}

// Get an iterator for a specific range
rangeIter, err := eng.GetRangeIterator([]byte("start"), []byte("end"))
if err != nil {
    log.Fatal(err)
}

// Iterate through the range
for rangeIter.SeekToFirst(); rangeIter.Valid(); rangeIter.Next() {
    fmt.Printf("%s: %s\n", rangeIter.Key(), rangeIter.Value())
}

Comparison with Other Storage Engines

Unlike many production storage engines like RocksDB or LevelDB, the Kevo engine prioritizes:

  1. Simplicity: Clear Go implementation with minimal dependencies
  2. Educational Value: Code readability over absolute performance
  3. Composability: Clean interfaces for higher-level abstractions
  4. Single-Node Focus: No distributed features to complicate the design

Features missing compared to production engines:

  • Bloom filters (optional enhancement)
  • Advanced caching systems
  • Complex compression schemes
  • Multi-node distribution capabilities

Limitations and Trade-offs

  • Write Amplification: LSM-trees involve multiple writes of the same data
  • Read Amplification: May need to check multiple layers for a single key
  • Space Amplification: Some space overhead for tombstones and overlapping keys
  • Background Compaction: Performance may be affected by background compaction

However, the design mitigates these issues:

  • Efficient in-memory structures minimize disk accesses
  • Hierarchical iterators optimize range scans
  • Compaction strategies reduce read amplification over time