SSTable Package Documentation
The sstable package implements the Sorted String Table (SSTable) persistent storage format for the Kevo engine. SSTables are immutable, ordered files that store key-value pairs and are optimized for efficient reading, particularly for range scans.
Overview
SSTables form the persistent storage layer of the LSM tree architecture in the Kevo engine. They store key-value pairs in sorted order, with a hierarchical structure that allows efficient retrieval with minimal disk I/O.
Key responsibilities of the SSTable package include:
- Writing sorted key-value pairs to immutable files
- Reading and searching data efficiently
- Providing iterators for sequential access
- Ensuring data integrity with checksums
- Supporting efficient binary search through block indexing
File Format Specification
The SSTable file format is designed for efficient storage and retrieval of sorted key-value pairs. It follows a structured layout with multiple layers of organization:
┌────────────────────────────────────────┐
│              Data Blocks               │
├────────────────────────────────────────┤
│              Index Block               │
├────────────────────────────────────────┤
│                 Footer                 │
└────────────────────────────────────────┘
1. Data Blocks
The bulk of an SSTable consists of data blocks, each containing a series of key-value entries:
- Keys are sorted lexicographically within and across blocks
- Keys are compressed using a prefix compression technique
- Each block has restart points where full keys are stored
- Data blocks have a default target size of 16KB
- Each block includes:
  - Entry data (compressed keys and values)
  - Restart point offsets
  - Restart point count
  - Checksum for data integrity
2. Index Block
The index block is a special block that allows efficient location of data blocks:
- Contains one entry per data block
- Each entry includes:
  - First key in the data block
  - Offset of the data block in the file
  - Size of the data block
- Allows binary search to locate the appropriate data block for a key
3. Footer
The footer is a fixed-size section at the end of the file containing metadata:
- Index block offset
- Index block size
- Total entry count
- Min/max key offsets (for future use)
- Magic number for file format verification
- Footer checksum
Block Format
Each block (both data and index) has the following internal format:
┌──────────────────────┬─────────────────┬──────────┬──────────┐
│      Entry Data      │ Restart Points  │  Count   │ Checksum │
└──────────────────────┴─────────────────┴──────────┴──────────┘
Entry data consists of a series of entries, each with:
- For restart points: full key length, full key
- For other entries: shared prefix length, unshared length, unshared key bytes
- Value length, value data
Implementation Details
Core Components
Writer
The Writer handles creating new SSTable files:
- FileManager: Handles file I/O and atomic file creation
- BlockManager: Manages building and serializing data blocks
- IndexBuilder: Constructs the index block from data block metadata
The write process follows these steps:
- Collect sorted key-value pairs
- Build data blocks when they reach target size
- Track index information as blocks are written
- Build and write the index block
- Write the footer
- Finalize the file with atomic rename
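The final step above can be sketched as follows. This is a minimal illustration of the write-to-temp-then-rename pattern, not the package's actual FileManager; the helper names `writeFileAtomically` and `roundTrip` and the file name are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeFileAtomically is a hypothetical helper, not part of the sstable API.
// It writes data to a temporary file next to finalPath, syncs it, and then
// renames it into place so readers never observe a half-written table.
func writeFileAtomically(finalPath string, data []byte) (err error) {
	tmp, err := os.CreateTemp(filepath.Dir(finalPath), filepath.Base(finalPath)+".tmp-*")
	if err != nil {
		return err
	}
	defer func() {
		if err != nil {
			tmp.Close()           // harmless if the file is already closed
			os.Remove(tmp.Name()) // do not leave a partial file behind
		}
	}()
	if _, err = tmp.Write(data); err != nil {
		return err
	}
	if err = tmp.Sync(); err != nil { // flush to stable storage before renaming
		return err
	}
	if err = tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), finalPath)
}

// roundTrip writes a file atomically into a scratch directory and reads it back.
func roundTrip() (string, error) {
	dir, err := os.MkdirTemp("", "sstable-sketch")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "000001.sst")
	if err := writeFileAtomically(path, []byte("table bytes")); err != nil {
		return "", err
	}
	got, err := os.ReadFile(path)
	return string(got), err
}

func main() {
	got, err := roundTrip()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(got)
}
```

Creating the temporary file in the destination directory keeps the rename on a single filesystem, which is what makes it atomic on POSIX systems.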
Reader
The Reader provides access to data in SSTable files:
- File handling: Memory-maps the file for efficient access
- Footer parsing: Reads metadata to locate index and blocks
- Block cache: Optionally caches recently accessed blocks
- Search algorithm: Binary search through the index, then within blocks
The read process follows these steps:
- Parse the footer to locate the index block
- Binary search the index to find the appropriate data block
- Read and parse the data block
- Binary search within the block for the specific key
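The index lookup in the second step can be sketched like this. It assumes an in-memory slice of index entries holding the first key, offset, and size of each data block; the type `indexEntry` and the functions `findBlock` and `sampleIndex` are illustrative names, not the package's API.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// indexEntry mirrors the documented index contents: the first key of a data
// block plus where that block lives in the file. The type is hypothetical.
type indexEntry struct {
	firstKey []byte
	offset   uint64
	size     uint32
}

// findBlock returns the position of the only block that can contain key:
// the last block whose first key is <= key. It returns -1 when key sorts
// before the first block, meaning the key cannot be in this table.
func findBlock(index []indexEntry, key []byte) int {
	// sort.Search finds the first block whose first key is > key.
	i := sort.Search(len(index), func(i int) bool {
		return bytes.Compare(index[i].firstKey, key) > 0
	})
	return i - 1
}

func sampleIndex() []indexEntry {
	return []indexEntry{
		{firstKey: []byte("apple"), offset: 0, size: 4096},
		{firstKey: []byte("grape"), offset: 4096, size: 4000},
		{firstKey: []byte("peach"), offset: 8096, size: 3900},
	}
}

func main() {
	index := sampleIndex()
	for _, k := range []string{"aardvark", "banana", "grape", "zebra"} {
		fmt.Printf("%-8s -> block %d\n", k, findBlock(index, []byte(k)))
	}
}
```

With the sample index, "banana" maps to block 0, "grape" to block 1, "zebra" to block 2, and "aardvark" to -1. A hit in the index only narrows the search to one block; the key may still be absent from it.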
Block Handling
The block system includes several specialized components:
- Block Builder: Constructs blocks with prefix compression
- Block Reader: Parses serialized blocks
- Block Iterator: Provides sequential access to entries in a block
Key Features
Prefix Compression
To reduce storage space, keys are stored using prefix compression:
- Blocks have "restart points" at regular intervals (default every 16 keys)
- At restart points, full keys are stored
- Between restart points, keys store:
  - Length of shared prefix with previous key
  - Length of unshared suffix
  - Unshared suffix bytes
This provides significant space savings for keys with common prefixes.
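To make the scheme concrete, here is a small sketch of prefix compression with restart points. It works on strings rather than byte slices for readability, and the names `compressedKey`, `compressKeys`, and `decompressKeys` are hypothetical; the actual Block Builder may be organized differently.

```go
package main

import "fmt"

// compressedKey holds what is stored for one key: how many leading bytes it
// shares with the previous key, and the remaining suffix.
type compressedKey struct {
	shared int
	suffix string
}

// sharedPrefixLen returns the length of the common prefix of a and b.
func sharedPrefixLen(a, b string) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

// compressKeys encodes sorted keys. Every restartInterval-th key is a restart
// point and is stored in full (shared == 0).
func compressKeys(keys []string, restartInterval int) []compressedKey {
	out := make([]compressedKey, 0, len(keys))
	for i, k := range keys {
		shared := 0
		if i%restartInterval != 0 {
			shared = sharedPrefixLen(keys[i-1], k)
		}
		out = append(out, compressedKey{shared: shared, suffix: k[shared:]})
	}
	return out
}

// decompressKeys rebuilds the full keys by replaying entries in order.
func decompressKeys(entries []compressedKey) []string {
	out := make([]string, 0, len(entries))
	prev := ""
	for _, e := range entries {
		k := prev[:e.shared] + e.suffix
		out = append(out, k)
		prev = k
	}
	return out
}

func main() {
	keys := []string{"user:1000", "user:1001", "user:1002", "user:2000"}
	entries := compressKeys(keys, 2) // small interval so the restart is visible
	for _, e := range entries {
		fmt.Printf("shared=%d suffix=%q\n", e.shared, e.suffix)
	}
	fmt.Println(decompressKeys(entries))
}
```

With a restart interval of 2, the second key is stored as 8 shared bytes plus the suffix "1", while the third key is a restart point and is stored in full. Because decoding a key needs the previous key, a lookup inside a block has to start from a restart point, which is why the interval trades compression against lookup speed.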
Memory Mapping
For efficient reading, SSTable files are memory-mapped:
- File data is mapped into virtual memory
- OS handles paging and read-ahead
- Reduces system call overhead
- Allows direct access to file data without explicit reads
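As an illustration of the idea, the sketch below maps a file read-only with the standard library's `syscall.Mmap`. It assumes a Unix-like platform, and the helpers `mapFile` and `readViaMmap` are hypothetical; the package's Reader may map files through a different mechanism.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// mapFile maps the whole file read-only and returns the bytes plus a function
// that releases the mapping. Assumes a Unix-like OS (syscall.Mmap).
func mapFile(path string) (data []byte, unmap func() error, err error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, nil, err
	}
	defer f.Close() // the mapping remains valid after the descriptor is closed

	info, err := f.Stat()
	if err != nil {
		return nil, nil, err
	}
	if info.Size() == 0 {
		return nil, nil, fmt.Errorf("%s is empty; nothing to map", path)
	}
	data, err = syscall.Mmap(int(f.Fd()), 0, int(info.Size()),
		syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		return nil, nil, err
	}
	return data, func() error { return syscall.Munmap(data) }, nil
}

// readViaMmap writes a small scratch file, maps it, and slices into the
// mapping directly, with no explicit read calls.
func readViaMmap() (string, error) {
	f, err := os.CreateTemp("", "sketch-*.sst")
	if err != nil {
		return "", err
	}
	defer os.Remove(f.Name())
	if _, err := f.WriteString("hello sstable"); err != nil {
		f.Close()
		return "", err
	}
	if err := f.Close(); err != nil {
		return "", err
	}

	data, unmap, err := mapFile(f.Name())
	if err != nil {
		return "", err
	}
	defer unmap()
	return string(data[6:]), nil // the conversion copies the bytes before unmapping
}

func main() {
	s, err := readViaMmap()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(s)
}
```

Slices into the mapping must not be used after it is released, so anything that outlives the reader has to be copied out first.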
Tombstones
SSTables support deletion through tombstone markers:
- Tombstones are stored as entries with nil values
- They indicate a key has been deleted
- Compaction eventually removes tombstones and deleted keys
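The sketch below shows how a nil-valued tombstone can shadow an older value and then be discarded during a merge. It is a simplified two-way merge with hypothetical names (`entry`, `mergeDroppingTombstones`, `sampleMerge`), and it assumes no older table could still hold the deleted keys, which is the condition under which dropping a tombstone is safe.

```go
package main

import "fmt"

// entry is a key with its value; a nil value marks a tombstone.
type entry struct {
	key   string
	value []byte
}

func (e entry) isTombstone() bool { return e.value == nil }

// mergeDroppingTombstones merges two key-sorted runs. When both runs contain
// a key, the entry from newer wins. Tombstones are dropped from the output,
// which assumes nothing older than these two runs can contain the key.
func mergeDroppingTombstones(newer, older []entry) []entry {
	var out []entry
	i, j := 0, 0
	for i < len(newer) || j < len(older) {
		var e entry
		switch {
		case j >= len(older) || (i < len(newer) && newer[i].key < older[j].key):
			e = newer[i]
			i++
		case i >= len(newer) || older[j].key < newer[i].key:
			e = older[j]
			j++
		default: // same key in both runs: keep the newer entry, skip the older
			e = newer[i]
			i++
			j++
		}
		if !e.isTombstone() {
			out = append(out, e)
		}
	}
	return out
}

func sampleMerge() []entry {
	newer := []entry{{"a", []byte("1")}, {"b", nil}, {"d", nil}}
	older := []entry{{"b", []byte("old")}, {"c", []byte("3")}}
	return mergeDroppingTombstones(newer, older)
}

func main() {
	for _, e := range sampleMerge() {
		fmt.Printf("%s=%s\n", e.key, e.value)
	}
}
```

In the sample, the tombstone for "b" hides the older value and then disappears itself, so only "a" and "c" survive the merge.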
Checksum Verification
Data integrity is ensured through checksums:
- Each block has a 64-bit xxHash checksum
- The footer also has a checksum
- Checksums are verified when blocks are read
- Corrupted blocks trigger appropriate error handling
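A verify-on-read check can be sketched as follows. xxHash is not in the Go standard library, so this example swaps in 64-bit FNV-1a from `hash/fnv` purely as a stand-in; the little-endian trailer and the names `sealBlock`, `verifyBlock`, and `errChecksumMismatch` are also assumptions.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/fnv"
)

var errChecksumMismatch = errors.New("block checksum mismatch")

// checksum64 is a stand-in for xxHash64: FNV-1a from the standard library.
func checksum64(data []byte) uint64 {
	h := fnv.New64a()
	h.Write(data) // hash.Hash writes never return an error
	return h.Sum64()
}

// sealBlock appends an 8-byte checksum of payload (little-endian, assumed).
func sealBlock(payload []byte) []byte {
	out := make([]byte, len(payload)+8)
	copy(out, payload)
	binary.LittleEndian.PutUint64(out[len(payload):], checksum64(payload))
	return out
}

// verifyBlock recomputes the checksum and returns the payload only if it
// matches the stored trailer.
func verifyBlock(block []byte) ([]byte, error) {
	if len(block) < 8 {
		return nil, errors.New("block too short to hold a checksum")
	}
	payload := block[:len(block)-8]
	stored := binary.LittleEndian.Uint64(block[len(block)-8:])
	if checksum64(payload) != stored {
		return nil, errChecksumMismatch
	}
	return payload, nil
}

func main() {
	block := sealBlock([]byte("entry data"))
	if _, err := verifyBlock(block); err == nil {
		fmt.Println("intact block verified")
	}
	block[0] ^= 0xff // simulate on-disk corruption
	if _, err := verifyBlock(block); err != nil {
		fmt.Println("corrupted block rejected:", err)
	}
}
```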
Block Structure and Index Format
Data Block Structure
Data blocks are the primary storage units in an SSTable:
┌────────┬────────┬─────────────┐ ┌────────┬────────┬─────────────┐
│Entry 1 │Entry 2 │     ...     │ │Restart │ Count  │  Checksum   │
│        │        │             │ │ Points │        │             │
└────────┴────────┴─────────────┘ └────────┴────────┴─────────────┘
   Entry Data (Variable Size)         Block Footer (Fixed Size)
Each entry in a data block has the following format:
For restart points:
┌────────────┬────────────┬──────────────┬────────────┐
│ Key Length │    Key     │ Value Length │   Value    │
│ (2 bytes)  │ (variable) │  (4 bytes)   │ (variable) │
└────────────┴────────────┴──────────────┴────────────┘
For non-restart points (using prefix compression):
┌───────────┬───────────┬───────────┬───────────┬───────────┐
│  Shared   │ Unshared  │ Unshared  │   Value   │   Value   │
│  Length   │  Length   │    Key    │  Length   │           │
│ (2 bytes) │ (2 bytes) │(variable) │ (4 bytes) │(variable) │
└───────────┴───────────┴───────────┴───────────┴───────────┘
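The two entry layouts above can be serialized as in the following sketch. The field widths come from the diagrams; the little-endian byte order and the function names `appendEntry`, `decodeEntry`, and `encodeSample` are assumptions made for illustration.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

var errTruncated = errors.New("truncated entry")

// appendEntry serializes one entry. Restart entries store the full key and
// have no shared-length field; other entries store only the unshared suffix.
// Byte order is assumed to be little-endian.
func appendEntry(dst []byte, restart bool, shared int, key, value []byte) []byte {
	var hdr [4]byte
	if restart {
		shared = 0
	} else {
		binary.LittleEndian.PutUint16(hdr[:2], uint16(shared))
		dst = append(dst, hdr[:2]...)
	}
	binary.LittleEndian.PutUint16(hdr[:2], uint16(len(key)-shared))
	dst = append(dst, hdr[:2]...)
	dst = append(dst, key[shared:]...)
	binary.LittleEndian.PutUint32(hdr[:], uint32(len(value)))
	dst = append(dst, hdr[:]...)
	return append(dst, value...)
}

// decodeEntry reads one entry from src, rebuilding the key from prevKey for
// non-restart entries. n is the number of bytes consumed.
func decodeEntry(src, prevKey []byte, restart bool) (key, value []byte, n int, err error) {
	shared := 0
	if !restart {
		if len(src) < 2 {
			return nil, nil, 0, errTruncated
		}
		shared = int(binary.LittleEndian.Uint16(src))
		n = 2
	}
	if len(src) < n+2 {
		return nil, nil, 0, errTruncated
	}
	unshared := int(binary.LittleEndian.Uint16(src[n:]))
	n += 2
	if shared > len(prevKey) || len(src) < n+unshared+4 {
		return nil, nil, 0, errTruncated
	}
	key = append(append([]byte{}, prevKey[:shared]...), src[n:n+unshared]...)
	n += unshared
	valueLen := int(binary.LittleEndian.Uint32(src[n:]))
	n += 4
	if len(src) < n+valueLen {
		return nil, nil, 0, errTruncated
	}
	return key, src[n : n+valueLen], n + valueLen, nil
}

// encodeSample writes a restart entry followed by a prefix-compressed entry.
func encodeSample() []byte {
	buf := appendEntry(nil, true, 0, []byte("user:1000"), []byte("alice"))
	return appendEntry(buf, false, 8, []byte("user:1001"), []byte("bob"))
}

func main() {
	buf := encodeSample()
	k1, v1, n1, _ := decodeEntry(buf, nil, true)
	k2, v2, n2, _ := decodeEntry(buf[n1:], k1, false)
	fmt.Printf("%s=%s (%d bytes)\n", k1, v1, n1)
	fmt.Printf("%s=%s (%d bytes)\n", k2, v2, n2)
}
```

The restart entry takes 2 + 9 + 4 + 5 = 20 bytes, while the compressed entry takes 2 + 2 + 1 + 4 + 3 = 12 bytes because only the one-byte suffix of its key is stored.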
Index Block Structure
The index block has a similar structure to data blocks but contains entries that point to data blocks:
┌─────────────────┬─────────────────┬──────────┬──────────┐
│  Index Entries  │ Restart Points  │  Count   │ Checksum │
└─────────────────┴─────────────────┴──────────┴──────────┘
Each index entry contains:
- Key: First key in the corresponding data block
- Value: Block offset (8 bytes) + block size (4 bytes)
Footer Format
The footer is a fixed-size structure at the end of the file:
┌─────────────┬────────────┬────────────┬────────────┬────────────┬──────────┐
│    Index    │   Index    │   Entry    │    Min     │    Max     │ Checksum │
│   Offset    │    Size    │   Count    │ Key Offset │ Key Offset │          │
│  (8 bytes)  │ (4 bytes)  │ (4 bytes)  │ (8 bytes)  │ (8 bytes)  │(8 bytes) │
└─────────────┴────────────┴────────────┴────────────┴────────────┴──────────┘
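A fixed-size footer like this can be encoded and decoded with `encoding/binary`, as in the sketch below. It follows the field widths in the diagram (40 bytes in total, with no magic number field), assumes little-endian byte order, and treats the checksum as an opaque value supplied by the caller. The `footer` type and its function names are hypothetical.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// footerSize follows the diagram: 8 + 4 + 4 + 8 + 8 + 8 bytes.
const footerSize = 40

// footer mirrors the documented fields; the struct itself is hypothetical.
type footer struct {
	IndexOffset  uint64
	IndexSize    uint32
	EntryCount   uint32
	MinKeyOffset uint64
	MaxKeyOffset uint64
	Checksum     uint64 // stored as given; computing it is out of scope here
}

// encode lays the fields out in diagram order (little-endian, assumed).
func (f footer) encode() []byte {
	b := make([]byte, footerSize)
	le := binary.LittleEndian
	le.PutUint64(b[0:], f.IndexOffset)
	le.PutUint32(b[8:], f.IndexSize)
	le.PutUint32(b[12:], f.EntryCount)
	le.PutUint64(b[16:], f.MinKeyOffset)
	le.PutUint64(b[24:], f.MaxKeyOffset)
	le.PutUint64(b[32:], f.Checksum)
	return b
}

// decodeFooter reads the footer from the last footerSize bytes of the file.
func decodeFooter(file []byte) (footer, error) {
	if len(file) < footerSize {
		return footer{}, errors.New("file too small to contain a footer")
	}
	b := file[len(file)-footerSize:]
	le := binary.LittleEndian
	return footer{
		IndexOffset:  le.Uint64(b[0:]),
		IndexSize:    le.Uint32(b[8:]),
		EntryCount:   le.Uint32(b[12:]),
		MinKeyOffset: le.Uint64(b[16:]),
		MaxKeyOffset: le.Uint64(b[24:]),
		Checksum:     le.Uint64(b[32:]),
	}, nil
}

func main() {
	f := footer{IndexOffset: 8096, IndexSize: 120, EntryCount: 3}
	file := append([]byte("...blocks and index..."), f.encode()...)
	got, err := decodeFooter(file)
	fmt.Println(got == f, err)
}
```

Because the footer has a known size and sits at the end of the file, a reader can find it without any other metadata and then follow the index offset from there.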
Performance Considerations
Read Optimization
SSTables are heavily optimized for read operations:
- Block Structure: The block-based approach minimizes I/O
- Block Size Tuning: Default 16KB balances random vs. sequential access
- Memory Mapping: Efficient OS-level caching
- Two-level Search: Index search followed by block search
- Restart Points: Balance between compression and lookup speed
Space Efficiency
Several techniques reduce storage requirements:
- Prefix Compression: Reduces space for similar keys
- Delta Encoding: Used in the index for block offsets
- Configurable Block Size: Can be tuned for specific workloads
I/O Patterns
Understanding I/O patterns helps optimize performance:
- Sequential Writes: SSTables are written sequentially
- Random Reads: Point lookups may access arbitrary blocks
- Range Scans: Sequential reading of multiple blocks
- Index Loading: Always loaded first for any operation
Iterators and Range Scans
Iterator Types
The SSTable package provides several iterators:
- Block Iterator: Iterates within a single block
- SSTable Iterator: Iterates across all blocks in an SSTable
- Iterator Adapter: Adapts to the common engine iterator interface
Range Scan Functionality
Range scans are efficient operations in SSTables:
- Use the index to find the starting block
- Iterate through entries in that block
- Continue to subsequent blocks as needed
- Respect range boundaries (start/end keys)
Implementation Notes
The iterator implementation includes:
- Lazy Loading: Blocks are loaded only when needed
- Positioning Methods: Seek, SeekToFirst, Next
- Validation: Bounds checking and state validation
- Key/Value Access: Direct access to current entry data
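The following sketch shows how an iterator can cross block boundaries while loading blocks lazily, and how a bounded range scan sits on top of it. Blocks are plain in-memory slices here, and the `tableIterator` type, its `load` callback, and `scanRange` are hypothetical stand-ins for the package's iterator types. The sketch assumes every block holds at least one entry.

```go
package main

import (
	"fmt"
	"sort"
)

type kv struct{ key, value string }

// tableIterator walks entries across blocks, loading each block on demand.
type tableIterator struct {
	firstKeys []string             // index: first key of each block
	load      func(block int) []kv // hypothetical block loader
	block     int
	entries   []kv
	pos       int
}

// loadBlock positions the iterator at the start of block b, or leaves it
// invalid when b is past the last block.
func (it *tableIterator) loadBlock(b int) {
	it.block, it.pos, it.entries = b, 0, nil
	if b < len(it.firstKeys) {
		it.entries = it.load(b)
	}
}

func (it *tableIterator) SeekToFirst() { it.loadBlock(0) }

// Seek positions the iterator at the first entry whose key is >= key.
func (it *tableIterator) Seek(key string) {
	// Last block whose first key is <= key (or block 0 if there is none).
	b := sort.Search(len(it.firstKeys), func(i int) bool { return it.firstKeys[i] > key }) - 1
	if b < 0 {
		b = 0
	}
	it.loadBlock(b)
	it.pos = sort.Search(len(it.entries), func(i int) bool { return it.entries[i].key >= key })
	if it.pos == len(it.entries) { // key falls in the gap after this block
		it.loadBlock(b + 1)
	}
}

func (it *tableIterator) Valid() bool   { return it.pos < len(it.entries) }
func (it *tableIterator) Key() string   { return it.entries[it.pos].key }
func (it *tableIterator) Value() string { return it.entries[it.pos].value }

func (it *tableIterator) Next() {
	it.pos++
	if it.pos >= len(it.entries) {
		it.loadBlock(it.block + 1)
	}
}

// scanRange collects entries with start <= key < end.
func scanRange(it *tableIterator, start, end string) []string {
	var out []string
	for it.Seek(start); it.Valid() && it.Key() < end; it.Next() {
		out = append(out, it.Key()+"="+it.Value())
	}
	return out
}

func newDemoIterator() *tableIterator {
	blocks := [][]kv{
		{{"a", "1"}, {"c", "2"}},
		{{"e", "3"}, {"g", "4"}},
		{{"i", "5"}, {"k", "6"}},
	}
	return &tableIterator{
		firstKeys: []string{"a", "e", "i"},
		load:      func(b int) []kv { return blocks[b] },
	}
}

func main() {
	fmt.Println(scanRange(newDemoIterator(), "d", "j"))
}
```

Scanning the range from "d" up to but not including "j" yields e=3, g=4, and i=5: the seek lands in the gap after the first block, moves on to the second, and the scan stops inside the third block once the end key is reached.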
Common Usage Patterns
Writing an SSTable
// Create a new SSTable writer
writer, err := sstable.NewWriter("/path/to/output.sst")
if err != nil {
	log.Fatal(err)
}

// Add key-value pairs in sorted order, checking each write
for _, kv := range [][2]string{{"key1", "value1"}, {"key2", "value2"}, {"key3", "value3"}} {
	if err := writer.Add([]byte(kv[0]), []byte(kv[1])); err != nil {
		log.Fatal(err)
	}
}

// Add a tombstone (deletion marker)
if err := writer.AddTombstone([]byte("key4")); err != nil {
	log.Fatal(err)
}

// Finalize the SSTable
if err := writer.Finish(); err != nil {
	log.Fatal(err)
}
Reading from an SSTable
// Open an SSTable for reading
reader, err := sstable.OpenReader("/path/to/table.sst")
if err != nil {
	log.Fatal(err)
}
defer reader.Close()

// Get a specific value, treating a missing key as a normal outcome
value, err := reader.Get([]byte("key1"))
switch {
case errors.Is(err, sstable.ErrNotFound):
	fmt.Println("Key not found")
case err != nil:
	log.Fatal(err)
default:
	fmt.Printf("Value: %s\n", value)
}
Iterating Through an SSTable
// Create an iterator
iter := reader.NewIterator()

// Iterate through all entries
for iter.SeekToFirst(); iter.Valid(); iter.Next() {
	fmt.Printf("%s: ", iter.Key())
	if iter.IsTombstone() {
		fmt.Println("<deleted>")
	} else {
		fmt.Printf("%s\n", iter.Value())
	}
}

// Or iterate over a specific range: startKey inclusive, endKey exclusive
rangeIter := reader.NewIterator()
startKey := []byte("key2")
endKey := []byte("key4")
for rangeIter.Seek(startKey); rangeIter.Valid(); rangeIter.Next() {
	if bytes.Compare(rangeIter.Key(), endKey) >= 0 {
		break
	}
	fmt.Printf("%s: %s\n", rangeIter.Key(), rangeIter.Value())
}
Configuration Options
The SSTable behavior can be tuned through several configuration parameters:
- Block Size (default: 16KB):
  - Controls the target size for data blocks
  - Larger blocks improve compression and sequential reads
  - Smaller blocks improve random access performance
- Restart Interval (default: 16 entries):
  - Controls how often restart points occur in blocks
  - Affects the balance between compression and lookup speed
- Index Key Interval (default: ~64KB):
  - Controls how frequently keys are indexed
  - Affects the size of the index and lookup performance
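One way to carry these parameters in code is a small options struct with defaults and validation, sketched below. The `Options` type, its field names, and `DefaultOptions` are hypothetical and are not taken from the package; only the default values come from the list above.

```go
package main

import (
	"errors"
	"fmt"
)

// Options groups the documented tuning parameters. The type and field names
// are illustrative only.
type Options struct {
	BlockSize        int // target data block size, in bytes
	RestartInterval  int // number of entries between restart points
	IndexKeyInterval int // approximate bytes of data between indexed keys
}

// DefaultOptions returns the documented defaults.
func DefaultOptions() Options {
	return Options{
		BlockSize:        16 * 1024,
		RestartInterval:  16,
		IndexKeyInterval: 64 * 1024,
	}
}

// Validate rejects values that cannot produce a usable table.
func (o Options) Validate() error {
	switch {
	case o.BlockSize <= 0:
		return errors.New("block size must be positive")
	case o.RestartInterval <= 0:
		return errors.New("restart interval must be positive")
	case o.IndexKeyInterval <= 0:
		return errors.New("index key interval must be positive")
	}
	return nil
}

func main() {
	opts := DefaultOptions()
	opts.BlockSize = 4 * 1024 // smaller blocks favor random point lookups
	fmt.Println(opts, opts.Validate())
}
```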
Trade-offs and Limitations
Immutability
SSTables are immutable, which brings benefits and challenges:
- Benefits:
  - Simplifies concurrent read access
  - No locking required for reads
  - Enables efficient merging during compaction
- Challenges:
  - Updates require rewriting
  - Deletes are implemented as tombstones
  - Space amplification until compaction
Size vs. Performance Trade-offs
Several design decisions involve balancing size against performance:
- Block Size: Larger blocks improve compression but may result in reading unnecessary data
- Restart Points: More frequent restarts improve random lookup but reduce compression
- Index Density: Denser indices improve lookup speed but increase memory usage
Specialized Use Cases
The SSTable format is optimized for:
- Append-only workloads: Where data is written once and read many times
- Range scans: Where sequential access to sorted data is common
- Batch processing: Where data can be sorted before writing
It is less well suited to:
- Frequent updates: Due to immutability
- Very large keys or values: Which can cause inefficient storage
- Random writes: Which require external sorting