# MemTable Package Documentation
The `memtable` package implements an in-memory data structure for the Kevo engine. MemTables are a key component of the LSM tree architecture, providing fast, sorted, in-memory storage for recently written data before it's flushed to disk as SSTables.
## Overview
MemTables serve as the primary write buffer for the storage engine, allowing efficient processing of write operations before they are persisted to disk. The implementation uses a skiplist data structure to provide fast insertions, retrievals, and ordered iteration.
Key responsibilities of the MemTable include:
- Providing fast in-memory writes
- Supporting efficient key lookups
- Offering ordered iteration for range scans
- Tracking tombstones for deleted keys
- Supporting atomic transitions between mutable and immutable states
## Architecture
### Core Components
The MemTable package consists of several interrelated components:
1. **SkipList**: The core data structure providing O(log n) operations.
2. **MemTable**: A wrapper around SkipList with additional functionality.
3. **MemTablePool**: A manager for active and immutable MemTables.
4. **Recovery**: Mechanisms for rebuilding MemTables from WAL entries.
```
      ┌─────────────────┐
      │  MemTablePool   │
      └───────┬─────────┘
┌───────┴─────────┐   ┌─────────────────┐
│ Active MemTable │   │    Immutable    │
└───────┬─────────┘   │    MemTables    │
        │             └─────────────────┘
┌───────┴─────────┐
│    SkipList     │
└─────────────────┘
```
## Implementation Details
### SkipList Data Structure
The SkipList is a probabilistic data structure that allows fast operations by maintaining multiple layers of linked lists:
1. **Nodes**: Each node contains:
- Entry data (key, value, sequence number, value type)
- Height information
- Next pointers at each level
2. **Probabilistic Height**: New nodes get a random height following a probabilistic distribution:
- Height 1: 100% of nodes
- Height 2: 25% of nodes
- Height 3: 6.25% of nodes, etc.
3. **Search Algorithm**:
- Starts at the highest level of the head node
- Moves forward until finding a node greater than the target
- Drops down a level and continues
- This gives O(log n) expected time for operations
4. **Concurrency Considerations**:
- Uses atomic operations for pointer manipulation
- Cache-aligned node structure
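The height distribution and search loop described above can be sketched as follows. This is a simplified, single-threaded illustration (no atomic pointers or cache alignment), and all names are hypothetical rather than kevo's actual API:

```go
package main

import (
	"bytes"
	"fmt"
	"math/rand"
)

const (
	maxHeight       = 12 // cap on node height, bounding pointer overhead
	branchingFactor = 4  // each extra level kept with probability 1/4
)

// node is a simplified skip-list node; the real implementation also
// carries sequence numbers and value types.
type node struct {
	key   []byte
	value []byte
	next  [maxHeight]*node // forward pointers, one per level
}

type skipList struct {
	head   *node
	height int
}

func newSkipList() *skipList {
	return &skipList{head: &node{}, height: 1}
}

// randomHeight yields height >= 2 with probability 1/4, >= 3 with
// probability 1/16, etc., matching the distribution described above.
func randomHeight() int {
	h := 1
	for h < maxHeight && rand.Intn(branchingFactor) == 0 {
		h++
	}
	return h
}

// findGreaterOrEqual starts at the highest level, moves forward while the
// next key is less than the target, then drops down a level -- the
// O(log n) search described above. If prev is non-nil it records the
// rightmost node visited at each level, for use during insertion.
func (s *skipList) findGreaterOrEqual(key []byte, prev *[maxHeight]*node) *node {
	x := s.head
	for level := s.height - 1; level >= 0; level-- {
		for x.next[level] != nil && bytes.Compare(x.next[level].key, key) < 0 {
			x = x.next[level]
		}
		if prev != nil {
			prev[level] = x
		}
	}
	return x.next[0]
}

func (s *skipList) put(key, value []byte) {
	var prev [maxHeight]*node
	s.findGreaterOrEqual(key, &prev)
	h := randomHeight()
	if h > s.height {
		for level := s.height; level < h; level++ {
			prev[level] = s.head
		}
		s.height = h
	}
	n := &node{key: key, value: value}
	for level := 0; level < h; level++ {
		n.next[level] = prev[level].next[level]
		prev[level].next[level] = n
	}
}

func (s *skipList) get(key []byte) ([]byte, bool) {
	n := s.findGreaterOrEqual(key, nil)
	if n != nil && bytes.Equal(n.key, key) {
		return n.value, true
	}
	return nil, false
}

func main() {
	sl := newSkipList()
	sl.put([]byte("b"), []byte("2"))
	sl.put([]byte("a"), []byte("1"))
	if v, ok := sl.get([]byte("a")); ok {
		fmt.Printf("a=%s\n", v)
	}
}
```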
### Memory Management
The MemTable implementation includes careful memory management:
1. **Size Tracking**:
- Each entry's size is estimated (key length + value length + overhead)
- Running total maintained using atomic operations
2. **Resource Limits**:
- Configurable maximum size (default 32MB)
- Age-based limits (configurable maximum age)
- When limits are reached, the MemTable becomes immutable
3. **Memory Overhead**:
- Skip list nodes add overhead (pointers at each level)
- Overhead is controlled by limiting maximum height (12 by default)
- Branching factor of 4 provides a good balance between height and width
### Entry Types and Tombstones
The MemTable supports two types of entries:
1. **Value Entries** (`TypeValue`):
- Normal key-value pairs
- Stored with their sequence number
2. **Deletion Tombstones** (`TypeDeletion`):
- Markers indicating a key has been deleted
- Value is nil, but the key and sequence number are preserved
- Essential for proper deletion semantics in the LSM tree architecture
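A minimal sketch of the entry layout, using the `TypeValue`/`TypeDeletion` names from above (the struct and field names are otherwise hypothetical):

```go
package main

import "fmt"

// ValueType distinguishes live values from deletion tombstones.
type ValueType uint8

const (
	TypeValue ValueType = iota
	TypeDeletion
)

// entry holds the fields described above: key, value, sequence number,
// and value type. For a tombstone the value is nil but the key and
// sequence number are preserved.
type entry struct {
	key    []byte
	value  []byte
	seqNum uint64
	vtype  ValueType
}

// isTombstone reports whether the entry marks a deletion.
func (e entry) isTombstone() bool { return e.vtype == TypeDeletion }

func main() {
	put := entry{key: []byte("k"), value: []byte("v"), seqNum: 1, vtype: TypeValue}
	del := entry{key: []byte("k"), seqNum: 2, vtype: TypeDeletion}
	fmt.Println(put.isTombstone(), del.isTombstone()) // false true
}
```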
### MemTablePool
The MemTablePool manages multiple MemTables:
1. **Active MemTable**:
- Single mutable MemTable for current writes
- Becomes immutable when size/age thresholds are reached
2. **Immutable MemTables**:
- Former active MemTables waiting to be flushed to disk
- Read-only, no modifications allowed
- Still available for reads while awaiting flush
3. **Lifecycle Management**:
- Monitors size and age of active MemTable
- Triggers transitions from active to immutable
- Creates new active MemTable when needed
### Iterator Functionality
MemTables provide iterator interfaces for sequential access:
1. **Forward Iteration**:
- `SeekToFirst()`: Position at the first entry
- `Seek(key)`: Position at or after the given key
- `Next()`: Move to the next entry
- `Valid()`: Check if the current position is valid
2. **Entry Access**:
- `Key()`: Get the current entry's key
- `Value()`: Get the current entry's value
- `IsTombstone()`: Check if the current entry is a deletion marker
3. **Iterator Adapters**:
- Adapters to the common iterator interface for the engine
## Concurrency and Isolation
MemTables employ a concurrency model suited for the storage engine's architecture:
1. **Read Concurrency**:
- Multiple readers can access MemTables concurrently
- Read locks are used for concurrent Get operations
2. **Write Isolation**:
- The single-writer architecture ensures only one writer at a time
- Writes to the active MemTable use write locks
3. **Immutable State**:
- Once a MemTable becomes immutable, no further modifications occur
- This provides a simple isolation model
4. **Atomic Transitions**:
- The transition from mutable to immutable is atomic
- Uses atomic boolean for immutable state flag
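The atomic mutable-to-immutable transition can be sketched with `sync/atomic`; this is a stripped-down illustration, not the engine's actual type:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// memTable shows only the immutability flag; the real type also wraps
// the skip list and size tracking.
type memTable struct {
	immutable atomic.Bool
}

// setImmutable atomically marks the table read-only; concurrent readers
// observe the flip without taking a lock.
func (m *memTable) setImmutable() { m.immutable.Store(true) }

// put rejects writes once the transition has happened.
func (m *memTable) put(key, value []byte) error {
	if m.immutable.Load() {
		return fmt.Errorf("memtable is immutable")
	}
	// ... insert into the skip list ...
	return nil
}

func main() {
	var m memTable
	fmt.Println(m.put([]byte("k"), []byte("v"))) // <nil>
	m.setImmutable()
	fmt.Println(m.put([]byte("k"), []byte("v"))) // memtable is immutable
}
```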
## Recovery Process
The recovery functionality rebuilds MemTables from WAL data:
1. **WAL Entries**:
- Each WAL entry contains an operation type, key, value and sequence number
- Entries are processed in order to rebuild the MemTable state
2. **Sequence Number Handling**:
- Maximum sequence number is tracked during recovery
- Ensures future operations have larger sequence numbers
3. **Batch Operations**:
- Support for atomic batch operations from WAL
- Batch entries contain multiple operations with sequential sequence numbers
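The recovery loop described above can be sketched as follows, with hypothetical names (the real WAL entry format is richer):

```go
package main

import "fmt"

const (
	opPut uint8 = iota
	opDelete
)

// walEntry mirrors the fields described above: operation type, key,
// value, and sequence number.
type walEntry struct {
	op     uint8
	key    []byte
	value  []byte
	seqNum uint64
}

// replay applies WAL entries in order and returns the highest sequence
// number seen, so future operations can be numbered after it.
func replay(entries []walEntry, apply func(walEntry)) uint64 {
	var maxSeq uint64
	for _, e := range entries {
		apply(e)
		if e.seqNum > maxSeq {
			maxSeq = e.seqNum
		}
	}
	return maxSeq
}

func main() {
	state := map[string]string{}
	entries := []walEntry{
		{op: opPut, key: []byte("a"), value: []byte("1"), seqNum: 1},
		{op: opDelete, key: []byte("a"), seqNum: 2},
	}
	maxSeq := replay(entries, func(e walEntry) {
		if e.op == opDelete {
			delete(state, string(e.key)) // the real MemTable inserts a tombstone instead
		} else {
			state[string(e.key)] = string(e.value)
		}
	})
	fmt.Println(maxSeq, len(state)) // 2 0
}
```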
## Performance Considerations
### Time Complexity
The SkipList data structure offers favorable complexity for MemTable operations:
| Operation | Average Case | Worst Case |
|-----------|--------------|------------|
| Insert | O(log n) | O(n) |
| Lookup | O(log n) | O(n) |
| Delete | O(log n) | O(n) |
| Iteration | O(1) per step| O(1) per step |
### Memory Usage Optimization
Several optimizations are employed to improve memory efficiency:
1. **Shared Memory Allocations**:
- Node arrays allocated in contiguous blocks
- Reduces allocation overhead
2. **Cache Awareness**:
- Nodes aligned to cache lines (64 bytes)
- Improves CPU cache utilization
3. **Appropriate Sizing**:
- Default sizing (32MB) provides good balance
- Configurable based on workload needs
### Write Amplification
MemTables help reduce write amplification in the LSM architecture:
1. **Buffering Writes**:
- Multiple key updates are consolidated in memory
- Only the latest value gets written to disk
2. **Batching**:
- Many small writes batched into larger disk operations
- Improves overall I/O efficiency
## Common Usage Patterns
### Basic Usage
```go
// Create a new MemTable
memTable := memtable.NewMemTable()

// Add entries with incrementing sequence numbers
memTable.Put([]byte("key1"), []byte("value1"), 1)
memTable.Put([]byte("key2"), []byte("value2"), 2)
memTable.Delete([]byte("key3"), 3)

// Retrieve a value
value, found := memTable.Get([]byte("key1"))
if found {
	fmt.Printf("Value: %s\n", value)
}

// Check if the MemTable is too large
if memTable.ApproximateSize() > 32*1024*1024 {
	memTable.SetImmutable()
	// Write to disk...
}
```
### Using MemTablePool
```go
// Create a pool with configuration
config := config.NewDefaultConfig("/path/to/data")
pool := memtable.NewMemTablePool(config)

// Add entries
pool.Put([]byte("key1"), []byte("value1"), 1)
pool.Delete([]byte("key2"), 2)

// Check if flushing is needed
if pool.IsFlushNeeded() {
	// Switch to a new active MemTable and get the old one for flushing
	immutable := pool.SwitchToNewMemTable()

	// Flush the immutable table to disk as an SSTable
	// ...
}
```
### Iterating Over Entries
```go
// Create an iterator
iter := memTable.NewIterator()

// Iterate through all entries
for iter.SeekToFirst(); iter.Valid(); iter.Next() {
	fmt.Printf("%s: ", iter.Key())
	if iter.IsTombstone() {
		fmt.Println("<deleted>")
	} else {
		fmt.Printf("%s\n", iter.Value())
	}
}

// Or seek to a specific point
iter.Seek([]byte("key5"))
if iter.Valid() {
	fmt.Printf("Found: %s\n", iter.Key())
}
```
## Configuration Options
The MemTable behavior can be tuned through several configuration parameters:
1. **MemTableSize** (default: 32MB):
- Maximum size before triggering a flush
- Larger sizes improve write throughput but increase memory usage
2. **MaxMemTables** (default: 4):
- Maximum number of MemTables in memory (active + immutable)
- Higher values allow more in-flight flushes
3. **MaxMemTableAge** (default: 600 seconds):
- Maximum age before forcing a flush
- Ensures data isn't held in memory too long
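Assuming the configuration type exposes fields named after the parameters above (hypothetical field names; verify against the actual `config` package), tuning might look like:

```go
// Hypothetical tuning sketch; field names are assumed from the
// parameter list above and may not match the config package exactly.
cfg := config.NewDefaultConfig("/path/to/data")
cfg.MemTableSize = 64 * 1024 * 1024 // 64MB: favor write throughput over memory
cfg.MaxMemTables = 8                // allow more in-flight flushes
cfg.MaxMemTableAge = 300            // force a flush after 5 minutes at the latest
pool := memtable.NewMemTablePool(cfg)
```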
## Trade-offs and Limitations
### Write Bursts and Flush Stalls
High write bursts can lead to multiple MemTables becoming immutable before the background flush process completes. The system handles this by:
1. Maintaining multiple immutable MemTables in memory
2. Tracking the number of immutable MemTables
3. Potentially slowing down writes if too many immutable MemTables accumulate
### Memory Usage vs. Performance
The MemTable configuration involves balancing memory usage against performance:
1. **Larger MemTables**:
- Pro: Better write performance, fewer disk flushes
- Con: Higher memory usage, potentially longer recovery time
2. **Smaller MemTables**:
- Pro: Lower memory usage, faster recovery
- Con: More frequent flushes, potentially lower write throughput
### Ordering and Consistency
The MemTable maintains ordering via:
1. **Key Comparison**: Primary ordering by key
2. **Sequence Numbers**: Secondary ordering to handle updates to the same key
3. **Value Types**: Distinguishing between values and deletion markers
This ensures consistent state even with concurrent reads while a background flush is occurring.