Adds a complete LSM-based storage engine with these features:
- Single-writer architecture for the storage engine
- WAL for durability, and hey, it's configurable
- MemTable with skip list implementation for fast reads/writes
- SSTable with block-based structure for on-disk, level-based storage
- Background compaction with tiered strategy
- ACID transactions
- Good documentation (I hope)
Write-Ahead Log (WAL) Package Documentation
The wal package implements a durable, crash-resistant Write-Ahead Log for the Kevo engine. It serves as the primary mechanism for ensuring data durability and atomicity, especially during system crashes or power failures.
Overview
The Write-Ahead Log records all database modifications before they are applied to the main database structures. This follows the "write-ahead logging" principle: all changes must be logged before being applied to the database, ensuring that if a system crash occurs, the database can be recovered to a consistent state by replaying the log.
Key responsibilities of the WAL include:
- Recording database operations in a durable manner
- Supporting atomic batch operations
- Providing crash recovery mechanisms
- Managing log file rotation and cleanup
File Format and Record Structure
WAL File Format
WAL files use a .wal extension and are named with a timestamp:
<timestamp>.wal (e.g., 01745172985771529746.wal)
The timestamp-based naming allows for chronological ordering during recovery.
Record Format
Records in the WAL have a consistent structure:
┌──────────────┬──────────────┬──────────────┬──────────────────────┐
│    CRC-32    │    Length    │     Type     │       Payload        │
│  (4 bytes)   │  (2 bytes)   │   (1 byte)   │    (Length bytes)    │
└──────────────┴──────────────┴──────────────┴──────────────────────┘
        Header (7 bytes)                               Data
- CRC-32: A checksum of the payload for data integrity verification
- Length: The payload length (up to 32KB)
- Type: The record type:
  - RecordTypeFull (1): A complete record
  - RecordTypeFirst (2): First fragment of a large record
  - RecordTypeMiddle (3): Middle fragment of a large record
  - RecordTypeLast (4): Last fragment of a large record
Records larger than the maximum size (32KB) are automatically split into multiple fragments.
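As a rough illustration of this layout, the sketch below builds a single physical record. The byte order (little-endian) and the CRC-32 polynomial (IEEE) are assumptions made for the example, not details confirmed by the wal package.

import (
    "encoding/binary"
    "hash/crc32"
)

// Record types as documented above.
const (
    RecordTypeFull   = 1
    RecordTypeFirst  = 2
    RecordTypeMiddle = 3
    RecordTypeLast   = 4
)

// encodeRecord builds one physical record: [CRC-32 | Length | Type | Payload].
// Byte order and checksum polynomial are illustrative assumptions.
func encodeRecord(recType byte, payload []byte) []byte {
    buf := make([]byte, 7+len(payload))
    binary.LittleEndian.PutUint32(buf[0:4], crc32.ChecksumIEEE(payload)) // CRC-32 of the payload
    binary.LittleEndian.PutUint16(buf[4:6], uint16(len(payload)))        // payload length (<= 32KB)
    buf[6] = recType                                                     // record type
    copy(buf[7:], payload)
    return buf
}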
Operation Payload Format
For standard operations (Put/Delete), the payload format is:
┌──────────────┬──────────────┬──────────────┬──────────────┬──────────────┬──────────────┐
│   Op Type    │   Sequence   │   Key Len    │     Key      │  Value Len   │    Value     │
│   (1 byte)   │  (8 bytes)   │  (4 bytes)   │  (Key Len)   │  (4 bytes)   │ (Value Len)  │
└──────────────┴──────────────┴──────────────┴──────────────┴──────────────┴──────────────┘
- Op Type: The operation type:
  - OpTypePut (1): Key-value insertion
  - OpTypeDelete (2): Key deletion
  - OpTypeMerge (3): Value merging (reserved for future use)
  - OpTypeBatch (4): Batch of operations
- Sequence: A monotonically increasing sequence number
- Key Len / Key: The length and bytes of the key
- Value Len / Value: The length and bytes of the value (omitted for delete operations)
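For a Put, assembling the payload might look like the sketch below. Little-endian byte order and the helper name are assumptions; delete payloads simply omit the value length and value.

import "encoding/binary"

// Operation types as documented above.
const (
    OpTypePut    = 1
    OpTypeDelete = 2
    OpTypeMerge  = 3
    OpTypeBatch  = 4
)

// encodePutPayload lays out [op type | sequence | key len | key | value len | value].
func encodePutPayload(seq uint64, key, value []byte) []byte {
    buf := make([]byte, 0, 1+8+4+len(key)+4+len(value))
    buf = append(buf, OpTypePut)
    buf = binary.LittleEndian.AppendUint64(buf, seq) // sequence number
    buf = binary.LittleEndian.AppendUint32(buf, uint32(len(key)))
    buf = append(buf, key...)
    buf = binary.LittleEndian.AppendUint32(buf, uint32(len(value)))
    buf = append(buf, value...)
    return buf
}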
Implementation Details
Core Components
WAL Writer
The WAL struct manages writing to the log file and includes:
- Buffered writing for efficiency
- CRC32 checksums for data integrity
- Sequence number management
- Synchronization control based on configuration
WAL Reader
The Reader struct handles reading and validating records:
- Verifies CRC32 checksums
- Reconstructs fragmented records
- Presents a logical view of entries to consumers
Batch Processing
The Batch struct handles atomic multi-operation groups:
- Collects multiple operations (Put/Delete)
- Writes them as a single atomic unit
- Tracks operation counts and sizes
Key Operations
Writing Operations
The Append method writes a single operation to the log:
- Assigns a sequence number
- Computes the required size
- Determines if fragmentation is needed
- Writes the record(s) with appropriate headers
- Syncs to disk based on configuration
Batch Operations
The AppendBatch method handles writing multiple operations atomically:
- Writes a batch header with operation count
- Assigns sequential sequence numbers to operations
- Writes all operations with the same basic format
- Syncs to disk based on configuration
Record Fragmentation
For records larger than 32KB:
- The record is split into fragments
- First fragment (RecordTypeFirst) contains metadata and part of the key
- Middle fragments (RecordTypeMiddle) contain continuing data
- Last fragment (RecordTypeLast) contains the final portion
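A minimal sketch of that splitting step, assuming the 32KB limit and reusing the record type constants from the earlier sketch (the package's internal logic may differ):

const maxPayloadSize = 32 * 1024 // assumed per-record payload limit

// fragment splits a payload into chunks and assigns each its record type.
func fragment(payload []byte) (types []byte, chunks [][]byte) {
    if len(payload) <= maxPayloadSize {
        return []byte{RecordTypeFull}, [][]byte{payload}
    }
    for len(payload) > 0 {
        n := len(payload)
        if n > maxPayloadSize {
            n = maxPayloadSize
        }
        chunks = append(chunks, payload[:n])
        payload = payload[n:]
    }
    for i := range chunks {
        switch {
        case i == 0:
            types = append(types, RecordTypeFirst)
        case i == len(chunks)-1:
            types = append(types, RecordTypeLast)
        default:
            types = append(types, RecordTypeMiddle)
        }
    }
    return types, chunks
}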
Reading and Recovery
The ReadEntry method reads entries from the log:
- Reads a physical record
- Validates the checksum
- If it's a fragmented record, collects all fragments
- Parses the entry data into an Entry struct
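Conceptually, the reassembly loop looks like the sketch below; readPhysical stands in for a hypothetical helper that returns one checksum-validated record, and none of these names are the package's actual internals.

// readLogicalRecord keeps consuming physical records until a standalone
// RecordTypeFull record or a terminating RecordTypeLast fragment is seen.
func readLogicalRecord(readPhysical func() (recType byte, payload []byte, err error)) ([]byte, error) {
    recType, payload, err := readPhysical()
    if err != nil {
        return nil, err
    }
    if recType == RecordTypeFull {
        return payload, nil // complete record, nothing to reassemble
    }
    // Fragmented record: append fragments until the last one arrives.
    entry := append([]byte(nil), payload...)
    for recType != RecordTypeLast {
        recType, payload, err = readPhysical()
        if err != nil {
            return nil, err
        }
        entry = append(entry, payload...)
    }
    return entry, nil
}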
Durability Guarantees
The WAL provides configurable durability through three sync modes:
- Immediate Sync Mode (SyncImmediate):
  - Every write is immediately synced to disk
  - Highest durability, lowest performance
  - Data safe even in case of system crash or power failure
  - Suitable for critical data where durability is paramount
- Batch Sync Mode (SyncBatch):
  - Syncs after a configurable amount of data is written
  - Balances durability and performance
  - May lose very recent transactions in case of crash
  - Default setting for most workloads
- No Sync Mode (SyncNone):
  - Relies on OS caching and background flushing
  - Highest performance, lowest durability
  - Data may be lost in case of crash
  - Suitable for non-critical or easily reproducible data
The application can choose the appropriate sync mode based on its durability requirements.
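Selecting a sync mode is a configuration decision made when the WAL is created. The snippet below shows the idea; the exact field and constant names (WALSyncMode, config.SyncBatch, and so on) are assumptions, so consult the config package for the real ones.

cfg := config.NewDefaultConfig("/path/to/data")

// Hypothetical field/constant names, shown for illustration only.
cfg.WALSyncMode = config.SyncBatch       // balance durability and throughput
// cfg.WALSyncMode = config.SyncImmediate // sync every write: safest, slowest
// cfg.WALSyncMode = config.SyncNone      // rely on the OS: fastest, least safe

myWAL, err := wal.NewWAL(cfg, "/path/to/data/wal")
if err != nil {
    log.Fatal(err)
}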
Recovery Process
WAL recovery happens during engine startup:
- WAL File Discovery:
  - Scan for all .wal files in the WAL directory
  - Sort files by timestamp (filename)
- Sequential Replay:
  - Process each file in chronological order
  - For each file, read and validate all records
  - Apply valid operations to rebuild the MemTable
- Error Handling:
  - Skip corrupted records when possible
  - If a file is heavily corrupted, move to the next file
  - As long as one file is processed successfully, recovery continues
- Sequence Number Recovery:
  - Track the highest sequence number seen
  - Update the next sequence number for future operations
- WAL Reset:
  - After recovery, either reuse the last WAL file (if not full) or create a new WAL file for future operations
The recovery process is designed to be robust against partial corruption and to recover as much data as possible.
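File discovery and ordering need nothing beyond a directory listing and a lexicographic sort, since the timestamp filenames sort chronologically. A standard-library sketch, independent of the package's actual code:

import (
    "os"
    "path/filepath"
    "sort"
    "strings"
)

// discoverWALFiles returns all .wal files in dir, ordered chronologically.
// Timestamp-based filenames make a lexicographic sort a chronological one.
func discoverWALFiles(dir string) ([]string, error) {
    entries, err := os.ReadDir(dir)
    if err != nil {
        return nil, err
    }
    var files []string
    for _, e := range entries {
        if !e.IsDir() && strings.HasSuffix(e.Name(), ".wal") {
            files = append(files, filepath.Join(dir, e.Name()))
        }
    }
    sort.Strings(files)
    return files, nil
}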
Corruption Handling
The WAL implements several mechanisms to handle and recover from corruption:
- CRC32 Checksums:
  - Every record includes a CRC32 checksum
  - Corrupted records are detected and skipped
- Scanning Recovery:
  - When corruption is detected, the reader can scan ahead
  - Tries to find the next valid record header
- Progressive Recovery:
  - Even if some records are lost, subsequent valid records are processed
  - Files with too many errors are skipped, but recovery continues with later files
- Backup Mechanism:
  - Problematic WAL files can be moved to a backup directory
  - This allows recovery to proceed with a clean slate if needed
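One generic way to resynchronize after corruption is to slide forward until a plausible header whose checksum validates is found. The sketch below reuses the constants and imports from the earlier record-format sketch; it illustrates the idea and is not the package's actual scanning code.

// scanToNextRecord looks for a plausible record in buf: a 7-byte header
// whose type is valid, whose length fits in the remaining bytes, and whose
// CRC matches the payload that follows. Returns the offset of the next
// valid record, or -1 if none is found.
func scanToNextRecord(buf []byte) int {
    for off := 0; off+7 <= len(buf); off++ {
        length := int(binary.LittleEndian.Uint16(buf[off+4 : off+6]))
        recType := buf[off+6]
        if recType < RecordTypeFull || recType > RecordTypeLast {
            continue
        }
        if off+7+length > len(buf) {
            continue
        }
        crc := binary.LittleEndian.Uint32(buf[off : off+4])
        if crc == crc32.ChecksumIEEE(buf[off+7:off+7+length]) {
            return off
        }
    }
    return -1
}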
Performance Considerations
Buffered Writing
The WAL uses buffered I/O to reduce the number of system calls:
- Writes go through a 64KB buffer
- The buffer is flushed when sync is called
- This significantly improves write throughput
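The pattern is the usual pairing of a bufio.Writer with an fsync on the underlying file; a simplified sketch, not the package's actual writer:

import (
    "bufio"
    "os"
)

// syncingWriter buffers writes in a 64KB bufio.Writer and exposes a Sync
// that flushes the buffer before forcing the data to stable storage.
type syncingWriter struct {
    file *os.File
    buf  *bufio.Writer
}

func newSyncingWriter(path string) (*syncingWriter, error) {
    f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
    if err != nil {
        return nil, err
    }
    return &syncingWriter{file: f, buf: bufio.NewWriterSize(f, 64*1024)}, nil
}

func (w *syncingWriter) Write(p []byte) (int, error) { return w.buf.Write(p) }

// Sync flushes buffered bytes, then fsyncs the file descriptor.
func (w *syncingWriter) Sync() error {
    if err := w.buf.Flush(); err != nil {
        return err
    }
    return w.file.Sync()
}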
Sync Frequency Trade-offs
The sync frequency directly impacts performance:
- SyncImmediate: 1 sync per write operation (slowest, safest)
- SyncBatch: 1 sync per N bytes written (configurable balance)
- SyncNone: No explicit syncs (fastest, least safe)
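In SyncBatch mode this reduces to a byte counter checked after each write; a sketch building on the syncingWriter above, with the threshold treated as a configurable assumption:

// maybeSync syncs once the bytes written since the last sync cross the
// configured threshold; otherwise durability is deferred.
func maybeSync(w *syncingWriter, bytesSinceSync *int64, justWritten int, threshold int64) error {
    *bytesSinceSync += int64(justWritten)
    if *bytesSinceSync < threshold {
        return nil
    }
    *bytesSinceSync = 0
    return w.Sync()
}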
File Size Management
WAL files have a configurable maximum size (default 64MB):
- Full files are closed and new ones created
- This prevents individual files from growing too large
- Facilitates easier backup and cleanup
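Rotation is then just a size check at append time; a sketch assuming the 64MB default:

const defaultMaxWALSize = 64 * 1024 * 1024 // 64MB default, configurable

// shouldRotate reports whether appending recordSize more bytes would push
// the current WAL file past its limit, in which case a new timestamp-named
// file should be opened.
func shouldRotate(currentSize int64, recordSize int, maxSize int64) bool {
    if maxSize <= 0 {
        maxSize = defaultMaxWALSize
    }
    return currentSize+int64(recordSize) > maxSize
}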
Common Usage Patterns
Basic Usage
// Create a new WAL
cfg := config.NewDefaultConfig("/path/to/data")
myWAL, err := wal.NewWAL(cfg, "/path/to/data/wal")
if err != nil {
    log.Fatal(err)
}

// Append operations
seqNum, err := myWAL.Append(wal.OpTypePut, []byte("key"), []byte("value"))
if err != nil {
    log.Fatal(err)
}

// Ensure durability
if err := myWAL.Sync(); err != nil {
    log.Fatal(err)
}

// Close the WAL when done
if err := myWAL.Close(); err != nil {
    log.Fatal(err)
}
Using Batches for Atomicity
// Create a batch
batch := wal.NewBatch()
batch.Put([]byte("key1"), []byte("value1"))
batch.Put([]byte("key2"), []byte("value2"))
batch.Delete([]byte("key3"))

// Write the batch atomically
startSeq, err := myWAL.AppendBatch(batch.ToEntries())
if err != nil {
    log.Fatal(err)
}
WAL Recovery
// Handler function for each recovered entry
handler := func(entry *wal.Entry) error {
    switch entry.Type {
    case wal.OpTypePut:
        // Apply Put operation
        memTable.Put(entry.Key, entry.Value, entry.SequenceNumber)
    case wal.OpTypeDelete:
        // Apply Delete operation
        memTable.Delete(entry.Key, entry.SequenceNumber)
    }
    return nil
}

// Replay all WAL files in a directory
if err := wal.ReplayWALDir("/path/to/data/wal", handler); err != nil {
    log.Fatal(err)
}
Trade-offs and Limitations
Write Amplification
The WAL doubles write operations (once to WAL, once to final storage):
- This is a necessary trade-off for durability
- Can be mitigated through batching and appropriate sync modes
Recovery Time
Recovery time is proportional to the size of the WAL:
- Large WAL files or many operations increase startup time
- Mitigated by regular compaction that makes old WAL files obsolete
Corruption Resilience
While the WAL can recover from some corruption:
- Severe corruption at the start of a file may render it unreadable
- Header corruption can cause loss of subsequent records
- Partial sync before crash can lead to truncated records
These limitations are managed through:
- Regular WAL rotation
- Multiple independent WAL files
- Robust error handling during recovery