# SSTable Package Documentation

The `sstable` package implements the Sorted String Table (SSTable) persistent storage format for the Kevo engine. SSTables are immutable, ordered files that store key-value pairs and are optimized for efficient reading, particularly for range scans.

## Overview

SSTables form the persistent storage layer of the LSM tree architecture in the Kevo engine. They store key-value pairs in sorted order, with a hierarchical structure that allows efficient retrieval with minimal disk I/O.

Key responsibilities of the SSTable package include:

- Writing sorted key-value pairs to immutable files
- Reading and searching data efficiently
- Providing iterators for sequential access
- Ensuring data integrity with checksums
- Supporting efficient binary search through block indexing

## File Format Specification

The SSTable file format is designed for efficient storage and retrieval of sorted key-value pairs. It follows a structured layout with multiple layers of organization:

```
┌────────────────────────────────┐
│          Data Blocks           │
├────────────────────────────────┤
│          Index Block           │
├────────────────────────────────┤
│             Footer             │
└────────────────────────────────┘
```

### 1. Data Blocks

The bulk of an SSTable consists of data blocks, each containing a series of key-value entries:

- Keys are sorted lexicographically within and across blocks
- Keys are compressed using a prefix compression technique
- Each block has restart points where full keys are stored
- Data blocks have a default target size of 16KB
- Each block includes:
  - Entry data (compressed keys and values)
  - Restart point offsets
  - Restart point count
  - Checksum for data integrity

### 2. Index Block

The index block is a special block that allows efficient location of data blocks:

- Contains one entry per data block
- Each entry includes:
  - First key in the data block
  - Offset of the data block in the file
  - Size of the data block
- Allows binary search to locate the appropriate data block for a key

### 3. Footer

The footer is a fixed-size section at the end of the file containing metadata:

- Index block offset
- Index block size
- Total entry count
- Min/max key offsets (for future use)
- Magic number for file format verification
- Footer checksum

### Block Format

Each block (both data and index) has the following internal format:

```
┌────────────┬────────────────┬───────┬──────────┐
│ Entry Data │ Restart Points │ Count │ Checksum │
└────────────┴────────────────┴───────┴──────────┘
```

Entry data consists of a series of entries, each with:

1. For restart points: full key length, full key
2. For other entries: shared prefix length, unshared length, unshared key bytes
3. Value length, value data

## Implementation Details

### Core Components

#### Writer

The `Writer` handles creating new SSTable files. It is built from three components:

1. **FileManager**: Handles file I/O and atomic file creation
2. **BlockManager**: Manages building and serializing data blocks
3. **IndexBuilder**: Constructs the index block from data block metadata

The write process follows these steps:

1. Collect sorted key-value pairs
2. Build data blocks and flush each one when it reaches the target size
3. Track index information as blocks are written
4. Build and write the index block
5. Write the footer
6. Finalize the file with an atomic rename
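
The first three write steps can be sketched as a small in-memory loop. This is only an illustration under stated assumptions: `sketchWriter`, `indexEntry`, and their fields are hypothetical names, the "file" is a `bytes.Buffer`, and entries are written as raw key and value bytes rather than in the real block format.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// indexEntry records where one flushed data block lives (hypothetical type).
type indexEntry struct {
	FirstKey []byte
	Offset   int
	Size     int
}

// sketchWriter is a simplified, in-memory stand-in for an SSTable writer.
type sketchWriter struct {
	targetBlockSize int
	file            bytes.Buffer // stands in for the output file
	block           bytes.Buffer // data block currently being built
	firstKey        []byte       // first key of the current block
	lastKey         []byte       // last key added, used for the ordering check
	hasLast         bool
	index           []indexEntry
}

// Add appends one pair and flushes the block once it reaches the target size.
func (w *sketchWriter) Add(key, value []byte) error {
	if w.hasLast && bytes.Compare(key, w.lastKey) < 0 {
		return errors.New("keys must be added in sorted order")
	}
	if w.block.Len() == 0 {
		w.firstKey = append([]byte(nil), key...)
	}
	w.block.Write(key) // a real block would also store lengths and restart points
	w.block.Write(value)
	w.lastKey, w.hasLast = append(w.lastKey[:0], key...), true
	if w.block.Len() >= w.targetBlockSize {
		w.flushBlock()
	}
	return nil
}

// flushBlock writes the current block out and records its index entry.
func (w *sketchWriter) flushBlock() {
	w.index = append(w.index, indexEntry{
		FirstKey: w.firstKey,
		Offset:   w.file.Len(),
		Size:     w.block.Len(),
	})
	w.file.Write(w.block.Bytes())
	w.block.Reset()
}

// Finish flushes any partial block and returns the collected index entries.
// A real writer would go on to write the index block and footer, then rename
// the temporary file into place.
func (w *sketchWriter) Finish() []indexEntry {
	if w.block.Len() > 0 {
		w.flushBlock()
	}
	return w.index
}

func main() {
	w := &sketchWriter{targetBlockSize: 10}
	w.Add([]byte("a"), []byte("12345"))
	w.Add([]byte("b"), []byte("12345"))
	w.Add([]byte("c"), []byte("1"))
	for _, e := range w.Finish() {
		fmt.Printf("first=%s offset=%d size=%d\n", e.FirstKey, e.Offset, e.Size)
	}
}
```

Because the index entry is recorded at flush time, the index can be built in a single pass without re-reading any data block.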

#### Reader

The `Reader` provides access to data in SSTable files:

1. **File handling**: Memory-maps the file for efficient access
2. **Footer parsing**: Reads metadata to locate index and blocks
3. **Block cache**: Optionally caches recently accessed blocks
4. **Search algorithm**: Binary search through the index, then within blocks

The read process follows these steps:

1. Parse the footer to locate the index block
2. Binary search the index to find the appropriate data block
3. Read and parse the data block
4. Binary search within the block for the specific key
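
Step 2 of the read process can be sketched with the standard library's `sort.Search`. The sketch assumes the index has already been decoded into a slice; `indexEntry` and `findBlock` are hypothetical names, not the package's API. Since each index entry holds the first key of its block, the candidate block is the last entry whose first key is less than or equal to the target.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// indexEntry is a hypothetical decoded index entry.
type indexEntry struct {
	FirstKey []byte
	Offset   uint64
	Size     uint32
}

// findBlock returns the position of the only data block that could contain
// key, or -1 if key sorts before the first key of the table.
func findBlock(index []indexEntry, key []byte) int {
	// Find the first block whose first key is strictly greater than key;
	// the block just before it is the candidate.
	i := sort.Search(len(index), func(i int) bool {
		return bytes.Compare(index[i].FirstKey, key) > 0
	})
	return i - 1
}

func main() {
	index := []indexEntry{
		{FirstKey: []byte("apple"), Offset: 0, Size: 4096},
		{FirstKey: []byte("mango"), Offset: 4096, Size: 4096},
		{FirstKey: []byte("zebra"), Offset: 8192, Size: 1024},
	}
	for _, k := range []string{"aaa", "banana", "mango", "zzz"} {
		fmt.Printf("%s -> block %d\n", k, findBlock(index, []byte(k)))
	}
}
```

A result of `-1` means the key cannot be in this table, so the lookup can stop without reading any data block.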

#### Block Handling

The block system includes several specialized components:

1. **Block Builder**: Constructs blocks with prefix compression
2. **Block Reader**: Parses serialized blocks
3. **Block Iterator**: Provides sequential access to entries in a block

### Key Features

#### Prefix Compression

To reduce storage space, keys are stored using prefix compression:

1. Blocks have "restart points" at regular intervals (default every 16 keys)
2. At restart points, full keys are stored
3. Between restart points, keys store:
   - Length of shared prefix with previous key
   - Length of unshared suffix
   - Unshared suffix bytes

This provides significant space savings for keys with common prefixes.
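
The scheme above can be illustrated with a minimal sketch that works on in-memory keys only. The names `compressedKey`, `compressKeys`, and `decompressKeys` are hypothetical, and the sketch expects `restartInterval` to be positive.

```go
package main

import "fmt"

// sharedPrefixLen returns the number of leading bytes a and b have in common.
func sharedPrefixLen(a, b []byte) int {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	i := 0
	for i < n && a[i] == b[i] {
		i++
	}
	return i
}

// compressedKey is a hypothetical in-memory form of one prefix-compressed key.
type compressedKey struct {
	Shared   int    // bytes shared with the previous key (0 at restart points)
	Unshared []byte // remaining suffix bytes
}

// compressKeys prefix-compresses sorted keys, storing a full key at every
// restartInterval-th position.
func compressKeys(keys [][]byte, restartInterval int) []compressedKey {
	out := make([]compressedKey, 0, len(keys))
	var prev []byte
	for i, k := range keys {
		shared := 0
		if i%restartInterval != 0 {
			shared = sharedPrefixLen(prev, k)
		}
		out = append(out, compressedKey{Shared: shared, Unshared: k[shared:]})
		prev = k
	}
	return out
}

// decompressKeys rebuilds the full keys by replaying entries in order.
func decompressKeys(entries []compressedKey) [][]byte {
	out := make([][]byte, 0, len(entries))
	var prev []byte
	for _, e := range entries {
		k := make([]byte, 0, e.Shared+len(e.Unshared))
		k = append(k, prev[:e.Shared]...)
		k = append(k, e.Unshared...)
		out = append(out, k)
		prev = k
	}
	return out
}

func main() {
	keys := [][]byte{[]byte("user:1000"), []byte("user:1001"), []byte("user:1002")}
	for _, e := range compressKeys(keys, 16) {
		fmt.Printf("shared=%d unshared=%q\n", e.Shared, e.Unshared)
	}
}
```

Decoding a key requires replaying entries from the preceding restart point, which is why a shorter restart interval speeds up lookups at the cost of compression.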

#### Memory Mapping

For efficient reading, SSTable files are memory-mapped:

1. File data is mapped into virtual memory
2. OS handles paging and read-ahead
3. Reduces system call overhead
4. Allows direct access to file data without explicit reads

#### Tombstones

SSTables support deletion through tombstone markers:

1. Tombstones are stored as entries with nil values
2. They indicate a key has been deleted
3. Compaction eventually removes tombstones and deleted keys
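
How a tombstone shadows older data during a merge can be sketched as follows. This is not the engine's compaction code: `entry` and `mergeRuns` are hypothetical, keys are plain strings for brevity, and the `dropTombstones` flag reflects the general LSM rule that a tombstone can only be discarded once no older table can still hold the key.

```go
package main

import "fmt"

// entry is a hypothetical key-value pair; a nil Value marks a tombstone.
type entry struct {
	Key   string
	Value []byte
}

// mergeRuns merges two sorted runs. When both contain a key, the entry from
// newer wins. Tombstones are kept unless dropTombstones is set.
func mergeRuns(newer, older []entry, dropTombstones bool) []entry {
	var out []entry
	emit := func(e entry) {
		if dropTombstones && e.Value == nil {
			return
		}
		out = append(out, e)
	}
	i, j := 0, 0
	for i < len(newer) && j < len(older) {
		switch {
		case newer[i].Key < older[j].Key:
			emit(newer[i])
			i++
		case newer[i].Key > older[j].Key:
			emit(older[j])
			j++
		default: // same key: the newer entry shadows the older one
			emit(newer[i])
			i++
			j++
		}
	}
	for ; i < len(newer); i++ {
		emit(newer[i])
	}
	for ; j < len(older); j++ {
		emit(older[j])
	}
	return out
}

func main() {
	newer := []entry{{Key: "a"}, {Key: "c", Value: []byte("3")}} // "a" is deleted
	older := []entry{{Key: "a", Value: []byte("1")}, {Key: "b", Value: []byte("2")}}
	for _, e := range mergeRuns(newer, older, true) {
		fmt.Printf("%s=%s\n", e.Key, e.Value)
	}
}
```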

#### Checksum Verification

Data integrity is ensured through checksums:

1. Each block has a 64-bit xxHash checksum
2. The footer also has a checksum
3. Checksums are verified when blocks are read
4. Corrupted blocks trigger appropriate error handling
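
The verify-on-read pattern can be sketched with the standard library alone. Note the substitution: xxHash is not in Go's standard library, so this sketch uses 64-bit FNV-1a from `hash/fnv` as a stand-in. The function names, the trailing 8-byte checksum position, and the little-endian encoding are assumptions for illustration.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/fnv"
)

var errChecksumMismatch = errors.New("block checksum mismatch")

// checksum64 is a stand-in for the 64-bit xxHash used by the real format.
func checksum64(data []byte) uint64 {
	h := fnv.New64a()
	h.Write(data) // Write on a hash.Hash never returns an error
	return h.Sum64()
}

// sealBlock returns payload followed by its 8-byte checksum.
func sealBlock(payload []byte) []byte {
	out := make([]byte, 0, len(payload)+8)
	out = append(out, payload...)
	return binary.LittleEndian.AppendUint64(out, checksum64(payload))
}

// openBlock verifies the trailing checksum and returns the payload.
func openBlock(block []byte) ([]byte, error) {
	if len(block) < 8 {
		return nil, errors.New("block too short to hold a checksum")
	}
	payload := block[:len(block)-8]
	want := binary.LittleEndian.Uint64(block[len(block)-8:])
	if checksum64(payload) != want {
		return nil, errChecksumMismatch
	}
	return payload, nil
}

func main() {
	block := sealBlock([]byte("entry data"))
	block[0] ^= 0xFF // simulate on-disk corruption
	if _, err := openBlock(block); err != nil {
		fmt.Println("read failed:", err)
	}
}
```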

## Block Structure and Index Format

### Data Block Structure

Data blocks are the primary storage units in an SSTable:

```
┌─────────┬─────────┬─────────┐ ┌─────────┬─────────┬──────────┐
│ Entry 1 │ Entry 2 │   ...   │ │ Restart │  Count  │ Checksum │
│         │         │         │ │ Points  │         │          │
└─────────┴─────────┴─────────┘ └─────────┴─────────┴──────────┘
  Entry Data (Variable Size)       Block Footer (Fixed Size)
```

Each entry in a data block has the following format:

For restart points:
```
┌──────────────┬──────────────┬──────────────┬──────────────┐
│ Key Length   │ Key          │ Value Length │ Value        │
│ (2 bytes)    │ (variable)   │ (4 bytes)    │ (variable)   │
└──────────────┴──────────────┴──────────────┴──────────────┘
```

For non-restart points (using prefix compression):
```
┌────────────┬────────────┬────────────┬────────────┬────────────┐
│ Shared     │ Unshared   │ Unshared   │ Value      │ Value      │
│ Length     │ Length     │ Key        │ Length     │            │
│ (2 bytes)  │ (2 bytes)  │ (variable) │ (4 bytes)  │ (variable) │
└────────────┴────────────┴────────────┴────────────┴────────────┘
```
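
The two entry layouts can be written out byte by byte as in the sketch below. The field widths come from the diagrams above; the little-endian byte order and the function names are assumptions, since the byte order is not stated here.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// appendRestartEntry encodes: key length (2) | key | value length (4) | value.
// Little-endian is an assumption made for this sketch.
func appendRestartEntry(dst, key, value []byte) []byte {
	dst = binary.LittleEndian.AppendUint16(dst, uint16(len(key)))
	dst = append(dst, key...)
	dst = binary.LittleEndian.AppendUint32(dst, uint32(len(value)))
	return append(dst, value...)
}

// appendSharedEntry encodes a prefix-compressed entry:
// shared length (2) | unshared length (2) | unshared key | value length (4) | value.
func appendSharedEntry(dst []byte, shared int, unshared, value []byte) []byte {
	dst = binary.LittleEndian.AppendUint16(dst, uint16(shared))
	dst = binary.LittleEndian.AppendUint16(dst, uint16(len(unshared)))
	dst = append(dst, unshared...)
	dst = binary.LittleEndian.AppendUint32(dst, uint32(len(value)))
	return append(dst, value...)
}

func main() {
	var block []byte
	block = appendRestartEntry(block, []byte("key1"), []byte("value1"))
	// "key2" shares the 3-byte prefix "key" with the previous key.
	block = appendSharedEntry(block, 3, []byte("2"), []byte("value2"))
	fmt.Printf("%d bytes: % x\n", len(block), block)
}
```

With 2-byte length fields, a key (or its shared and unshared parts) can be at most 65,535 bytes; the sketch does not check for overflow.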

### Index Block Structure

The index block has a similar structure to data blocks but contains entries that point to data blocks:

```
┌───────────────┬────────────────┬───────┬──────────┐
│ Index Entries │ Restart Points │ Count │ Checksum │
└───────────────┴────────────────┴───────┴──────────┘
```

Each index entry contains:

- Key: First key in the corresponding data block
- Value: Block offset (8 bytes) + block size (4 bytes)
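
The 12-byte index value can be packed and unpacked as shown in this sketch. `blockHandle` is a hypothetical name, and little-endian byte order is again an assumption.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// blockHandle is a hypothetical name for the value stored in an index entry.
type blockHandle struct {
	Offset uint64 // position of the data block in the file
	Size   uint32 // length of the data block in bytes
}

const blockHandleSize = 8 + 4

// encode packs the handle as offset (8 bytes) followed by size (4 bytes).
func (h blockHandle) encode() []byte {
	buf := make([]byte, blockHandleSize)
	binary.LittleEndian.PutUint64(buf[0:8], h.Offset)
	binary.LittleEndian.PutUint32(buf[8:12], h.Size)
	return buf
}

// decodeBlockHandle is the inverse of encode.
func decodeBlockHandle(buf []byte) (blockHandle, error) {
	if len(buf) != blockHandleSize {
		return blockHandle{}, fmt.Errorf("index value is %d bytes, want %d", len(buf), blockHandleSize)
	}
	return blockHandle{
		Offset: binary.LittleEndian.Uint64(buf[0:8]),
		Size:   binary.LittleEndian.Uint32(buf[8:12]),
	}, nil
}

func main() {
	h := blockHandle{Offset: 16384, Size: 16000}
	got, err := decodeBlockHandle(h.encode())
	fmt.Println(got, err)
}
```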

### Footer Format

The footer is a fixed-size structure at the end of the file:

```
┌────────────┬────────────┬────────────┬────────────┬────────────┬────────────┐
│ Index      │ Index      │ Entry      │ Min        │ Max        │ Checksum   │
│ Offset     │ Size       │ Count      │ Key Offset │ Key Offset │            │
│ (8 bytes)  │ (4 bytes)  │ (4 bytes)  │ (8 bytes)  │ (8 bytes)  │ (8 bytes)  │
└────────────┴────────────┴────────────┴────────────┴────────────┴────────────┘
```
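
Reading the footer from the tail of a file can be sketched as below. The sketch follows the diagram only, which gives 40 bytes; the magic number mentioned in the footer description is left out because its width and position are not given here. The type name, little-endian order, and treating the checksum as an opaque field are assumptions.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// footerSize follows the field widths in the diagram: 8+4+4+8+8+8.
const footerSize = 40

// footer is a hypothetical in-memory form of the on-disk footer.
type footer struct {
	IndexOffset  uint64
	IndexSize    uint32
	EntryCount   uint32
	MinKeyOffset uint64
	MaxKeyOffset uint64
	Checksum     uint64 // computed by the caller; opaque to this sketch
}

// encode lays the fields out in diagram order.
func (f footer) encode() []byte {
	buf := make([]byte, footerSize)
	le := binary.LittleEndian
	le.PutUint64(buf[0:], f.IndexOffset)
	le.PutUint32(buf[8:], f.IndexSize)
	le.PutUint32(buf[12:], f.EntryCount)
	le.PutUint64(buf[16:], f.MinKeyOffset)
	le.PutUint64(buf[24:], f.MaxKeyOffset)
	le.PutUint64(buf[32:], f.Checksum)
	return buf
}

// decodeFooter reads the footer from the last footerSize bytes of file.
func decodeFooter(file []byte) (footer, error) {
	if len(file) < footerSize {
		return footer{}, errors.New("file too small to contain a footer")
	}
	buf := file[len(file)-footerSize:]
	le := binary.LittleEndian
	return footer{
		IndexOffset:  le.Uint64(buf[0:]),
		IndexSize:    le.Uint32(buf[8:]),
		EntryCount:   le.Uint32(buf[12:]),
		MinKeyOffset: le.Uint64(buf[16:]),
		MaxKeyOffset: le.Uint64(buf[24:]),
		Checksum:     le.Uint64(buf[32:]),
	}, nil
}

func main() {
	f := footer{IndexOffset: 32768, IndexSize: 512, EntryCount: 1000}
	file := append(make([]byte, 100), f.encode()...) // fake file body + footer
	got, err := decodeFooter(file)
	fmt.Printf("%+v %v\n", got, err)
}
```

Because the footer has a fixed size, a reader can locate it from the file length alone and then follow `IndexOffset` and `IndexSize` to the index block.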

## Performance Considerations

### Read Optimization

SSTables are heavily optimized for read operations:

1. **Block Structure**: The block-based approach minimizes I/O
2. **Block Size Tuning**: Default 16KB balances random vs. sequential access
3. **Memory Mapping**: Efficient OS-level caching
4. **Two-level Search**: Index search followed by block search
5. **Restart Points**: Balance between compression and lookup speed

### Space Efficiency

Several techniques reduce storage requirements:

1. **Prefix Compression**: Reduces space for similar keys
2. **Delta Encoding**: Used in the index for block offsets
3. **Configurable Block Size**: Can be tuned for specific workloads

### I/O Patterns

Understanding I/O patterns helps optimize performance:

1. **Sequential Writes**: SSTables are written sequentially
2. **Random Reads**: Point lookups may access arbitrary blocks
3. **Range Scans**: Sequential reading of multiple blocks
4. **Index Loading**: Always loaded first for any operation

## Iterators and Range Scans

### Iterator Types

The SSTable package provides several iterators:

1. **Block Iterator**: Iterates within a single block
2. **SSTable Iterator**: Iterates across all blocks in an SSTable
3. **Iterator Adapter**: Adapts to the common engine iterator interface

### Range Scan Functionality

Range scans are efficient operations in SSTables:

1. Use the index to find the starting block
2. Iterate through entries in that block
3. Continue to subsequent blocks as needed
4. Respect range boundaries (start/end keys)

### Implementation Notes

The iterator implementation includes:

1. **Lazy Loading**: Blocks are loaded only when needed
2. **Positioning Methods**: Seek, SeekToFirst, Next
3. **Validation**: Bounds checking and state validation
4. **Key/Value Access**: Direct access to current entry data

## Common Usage Patterns

### Writing an SSTable

```go
// Create a new SSTable writer
writer, err := sstable.NewWriter("/path/to/output.sst")
if err != nil {
	log.Fatal(err)
}

// Add key-value pairs in sorted order
writer.Add([]byte("key1"), []byte("value1"))
writer.Add([]byte("key2"), []byte("value2"))
writer.Add([]byte("key3"), []byte("value3"))

// Add a tombstone (deletion marker)
writer.AddTombstone([]byte("key4"))

// Finalize the SSTable
if err := writer.Finish(); err != nil {
	log.Fatal(err)
}
```

### Reading from an SSTable

```go
// Open an SSTable for reading
reader, err := sstable.OpenReader("/path/to/table.sst")
if err != nil {
	log.Fatal(err)
}
defer reader.Close()

// Get a specific value
value, err := reader.Get([]byte("key1"))
if err != nil {
	// errors.Is also matches when ErrNotFound has been wrapped
	if errors.Is(err, sstable.ErrNotFound) {
		fmt.Println("Key not found")
	} else {
		log.Fatal(err)
	}
} else {
	fmt.Printf("Value: %s\n", value)
}
```

### Iterating Through an SSTable

```go
// Create an iterator
iter := reader.NewIterator()

// Iterate through all entries
for iter.SeekToFirst(); iter.Valid(); iter.Next() {
	fmt.Printf("%s: ", iter.Key())

	if iter.IsTombstone() {
		fmt.Println("<deleted>")
	} else {
		fmt.Printf("%s\n", iter.Value())
	}
}

// Or iterate over a specific range
rangeIter := reader.NewIterator()
startKey := []byte("key2")
endKey := []byte("key4")

for rangeIter.Seek(startKey); rangeIter.Valid() && bytes.Compare(rangeIter.Key(), endKey) < 0; rangeIter.Next() {
	fmt.Printf("%s: %s\n", rangeIter.Key(), rangeIter.Value())
}
```

## Configuration Options

The SSTable behavior can be tuned through several configuration parameters:

1. **Block Size** (default: 16KB):
   - Controls the target size for data blocks
   - Larger blocks improve compression and sequential reads
   - Smaller blocks improve random access performance

2. **Restart Interval** (default: 16 entries):
   - Controls how often restart points occur in blocks
   - Affects the balance between compression and lookup speed

3. **Index Key Interval** (default: ~64KB):
   - Controls how frequently keys are indexed
   - Affects the size of the index and lookup performance
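
One way to group these parameters is a plain options struct with defaults, sketched below. The `Options` type, its field names, and `DefaultOptions` are hypothetical and may not match how the package actually exposes its configuration; treating the index key interval as a byte count is also an assumption based on its `~64KB` default.

```go
package main

import (
	"errors"
	"fmt"
)

// Options is a hypothetical grouping of the tunables listed above.
type Options struct {
	BlockSize        int // target data block size in bytes
	RestartInterval  int // entries between restart points
	IndexKeyInterval int // assumed: approximate bytes of data between indexed keys
}

// DefaultOptions returns the defaults stated in this document.
func DefaultOptions() Options {
	return Options{
		BlockSize:        16 * 1024,
		RestartInterval:  16,
		IndexKeyInterval: 64 * 1024,
	}
}

// Validate rejects values that would make block building impossible.
func (o Options) Validate() error {
	if o.BlockSize <= 0 || o.RestartInterval <= 0 || o.IndexKeyInterval <= 0 {
		return errors.New("all options must be positive")
	}
	return nil
}

func main() {
	opts := DefaultOptions()
	opts.BlockSize = 4 * 1024 // smaller blocks favor random point lookups
	if err := opts.Validate(); err != nil {
		fmt.Println("invalid options:", err)
		return
	}
	fmt.Printf("%+v\n", opts)
}
```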

## Trade-offs and Limitations

### Immutability

SSTables are immutable, which brings benefits and challenges:

1. **Benefits**:
   - Simplifies concurrent read access
   - No locking required for reads
   - Enables efficient merging during compaction

2. **Challenges**:
   - Updates require rewriting
   - Deletes are implemented as tombstones
   - Space amplification until compaction

### Size vs. Performance Trade-offs

Several design decisions involve balancing size against performance:

1. **Block Size**: Larger blocks improve compression but may result in reading unnecessary data
2. **Restart Points**: More frequent restarts improve random lookup but reduce compression
3. **Index Density**: Denser indices improve lookup speed but increase memory usage

### Specialized Use Cases

The SSTable format is optimized for:

1. **Append-only workloads**: Where data is written once and read many times
2. **Range scans**: Where sequential access to sorted data is common
3. **Batch processing**: Where data can be sorted before writing

It's less optimal for:

1. **Frequent updates**: Due to immutability
2. **Very large keys or values**: Which can cause inefficient storage
3. **Random writes**: Which require external sorting