docs: added idea, plan, and todo docs
commit ee23a47a74

IDEA.md (new file, +52)
@@ -0,0 +1,52 @@
# Go Storage: A Minimalist LSM Storage Engine

## Vision

Build a clean, composable, and educational storage engine in Go that follows Log-Structured Merge Tree (LSM) principles, focusing on simplicity while providing the building blocks needed for higher-level database implementations.

## Goals

### 1. Extreme Simplicity

- Create minimal but complete primitives that can support various database paradigms (KV, relational, graph)
- Prioritize readability and educational value over hyper-optimization
- Use idiomatic Go with clear interfaces and documentation
- Implement a single-writer architecture for simplicity and reduced concurrency complexity

### 2. Durability + Performance

- Implement the LSM architecture pattern: Write-Ahead Log → MemTable → SSTables
- Provide configurable durability guarantees (sync vs. batched fsync)
- Optimize for both point lookups and range scans

### 3. Configurability

- Store all configuration parameters in a versioned, persistent manifest
- Allow tuning of memory usage, compaction behavior, and durability settings
- Support reproducible startup states across restarts

### 4. Composable Primitives

- Design clean interfaces for fundamental operations (reads, writes, snapshots, iteration)
- Enable building of higher-level abstractions (SQL, Gremlin, custom query languages)
- Support both transactional and analytical workloads
- Provide simple atomic write primitives that can be built upon:
  - Leverage read snapshots from immutable LSM structure
  - Support basic atomic batch operations
  - Ensure crash recovery through proper WAL handling
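The atomic write primitive described above could surface as a small batch type that higher layers build on. A minimal sketch (type and method names are illustrative, not a committed API):

```go
package main

import "fmt"

// Reader is a point-lookup view; both the live engine and an
// immutable read snapshot could satisfy it (illustrative sketch).
type Reader interface {
	Get(key []byte) ([]byte, error)
}

// Batch collects writes that must become visible atomically; the
// engine would commit the whole batch to the WAL in one record.
type Batch struct {
	ops []op
}

type op struct {
	del        bool
	key, value []byte
}

func (b *Batch) Put(key, value []byte) { b.ops = append(b.ops, op{key: key, value: value}) }
func (b *Batch) Delete(key []byte)     { b.ops = append(b.ops, op{del: true, key: key}) }
func (b *Batch) Len() int              { return len(b.ops) }

func main() {
	var b Batch
	b.Put([]byte("a"), []byte("1"))
	b.Delete([]byte("b"))
	fmt.Println(b.Len()) // 2
}
```

Because the batch is just buffered operations until commit, a transaction layer can later be layered on top without changing this surface.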

## Target Use Cases

1. **Educational Tool**: Learn and teach storage engine internals
2. **Embedded Storage**: Applications needing local, durable storage with predictable performance
3. **Prototype Foundation**: Base layer for experimenting with novel database designs
4. **Go Ecosystem Component**: Reusable storage layer for Go applications and services

## Non-Goals

1. **Feature Parity with Production Engines**: Not trying to compete with RocksDB, LevelDB, etc.
2. **Multi-Node Distribution**: Focusing on single-node operation
3. **Complex Query Planning**: Leaving higher-level query features to layers built on top

## Success Criteria

1. **Correctness**: Data is never lost or corrupted, even during crashes
2. **Understandability**: Code is clear enough to serve as an educational reference
3. **Performance**: Reasonable throughput and latency for common operations
4. **Extensibility**: Can be built upon to create specialized database engines
PLAN.md (new file, +154)
@@ -0,0 +1,154 @@
# Implementation Plan for Go Storage Engine

## Architecture Overview

```
┌─────────────┐     ┌─────────────┐     ┌─────────────────────────┐
│ Client API  │────▶│  MemTable   │────▶│ Immutable SSTable Files │
└─────────────┘     └─────────────┘     └─────────────────────────┘
       │                   ▲                         ▲
       │                   │                         │
       ▼                   │                         │
┌─────────────┐            │            ┌─────────────────────────┐
│   Write-    │────────────┘            │  Background Compaction  │
│  Ahead Log  │                         │         Process         │
└─────────────┘                         └─────────────────────────┘
       │                                             │
       │                                             │
       ▼                                             ▼
┌─────────────────────────────────────────────────────────────────┐
│                       Persistent Storage                        │
└─────────────────────────────────────────────────────────────────┘
```

## Package Structure

```
go-storage/
├── cmd/
│   └── storage-bench/    # Benchmarking tool
│
├── pkg/
│   ├── config/           # Configuration and manifest
│   ├── wal/              # Write-ahead logging with transaction markers
│   ├── memtable/         # In-memory table implementation
│   ├── sstable/          # SSTable read/write
│   │   ├── block/        # Block format implementation
│   │   └── footer/       # File footer and metadata
│   ├── compaction/       # Compaction strategies
│   ├── iterator/         # Merged iterator implementation
│   ├── transaction/      # Transaction management with Snapshot + WAL
│   │   ├── snapshot/     # Read snapshot implementation
│   │   └── txbuffer/     # Transaction write buffer
│   └── engine/           # Main engine implementation with single-writer architecture
│
└── internal/
    ├── checksum/         # Checksum utilities (xxHash64)
    └── utils/            # Shared internal utilities
```

## Development Phases

### Phase A: Foundation (1-2 weeks)

1. Set up project structure and Go module
2. Implement config package with serialization/deserialization
3. Build basic WAL with:
   - Append operations (Put/Delete)
   - Replay functionality
   - Configurable fsync modes
4. Write comprehensive tests for WAL durability
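One way to frame WAL records for step 3 is a checksum followed by an operation type and length-prefixed key/value. The layout below is a sketch, not a committed format, and it uses the stdlib CRC-32 as a placeholder checksum (length validation beyond the minimum is elided):

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

const (
	opPut    byte = 1
	opDelete byte = 2
)

// encodeRecord frames one WAL entry as:
//   crc32(4) | op(1) | keyLen(4) | key | valLen(4) | val
// The checksum covers everything after the checksum field itself.
func encodeRecord(op byte, key, val []byte) []byte {
	var body bytes.Buffer
	body.WriteByte(op)
	binary.Write(&body, binary.LittleEndian, uint32(len(key)))
	body.Write(key)
	binary.Write(&body, binary.LittleEndian, uint32(len(val)))
	body.Write(val)

	out := make([]byte, 4, 4+body.Len())
	binary.LittleEndian.PutUint32(out, crc32.ChecksumIEEE(body.Bytes()))
	return append(out, body.Bytes()...)
}

// decodeRecord verifies the checksum and unpacks one record; replay
// would stop (or truncate) at the first record that fails this check.
func decodeRecord(rec []byte) (op byte, key, val []byte, err error) {
	if len(rec) < 13 {
		return 0, nil, nil, fmt.Errorf("wal: record too short")
	}
	want := binary.LittleEndian.Uint32(rec[:4])
	if crc32.ChecksumIEEE(rec[4:]) != want {
		return 0, nil, nil, fmt.Errorf("wal: checksum mismatch")
	}
	op = rec[4]
	klen := binary.LittleEndian.Uint32(rec[5:9])
	key = rec[9 : 9+klen]
	vlen := binary.LittleEndian.Uint32(rec[9+klen : 13+klen])
	val = rec[13+klen : 13+klen+vlen]
	return op, key, val, nil
}

func main() {
	rec := encodeRecord(opPut, []byte("k1"), []byte("v1"))
	op, key, val, err := decodeRecord(rec)
	fmt.Println(op, string(key), string(val), err)
}
```

The fsync modes then become a policy choice layered on top: call `File.Sync` after every append (synchronous mode) or after every N records or T milliseconds (batched mode).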
### Phase B: In-Memory Layer (1 week)

1. Implement MemTable with:
   - Skip list data structure
   - Sorted key iteration
   - Size tracking for flush threshold
2. Connect WAL replay to MemTable restore
3. Test concurrent read/write scenarios
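The MemTable's interface surface (insert, lookup, sorted iteration, size tracking) can be pinned down before the skip list exists. The sketch below is a deliberately simple sorted-slice stand-in for the planned skip list, with the same surface; the threshold value is illustrative:

```go
package main

import (
	"fmt"
	"sort"
)

// memTable is a stand-in for the planned skip list: a sorted slice
// offering the same operations (insert, lookup, in-order data,
// approximate size tracking for flush triggering).
type memTable struct {
	keys, vals [][]byte
	size       int // approximate bytes used
}

func (m *memTable) Put(key, val []byte) {
	i := sort.Search(len(m.keys), func(i int) bool { return string(m.keys[i]) >= string(key) })
	if i < len(m.keys) && string(m.keys[i]) == string(key) {
		m.size += len(val) - len(m.vals[i]) // overwrite in place
		m.vals[i] = val
		return
	}
	m.keys = append(m.keys, nil)
	m.vals = append(m.vals, nil)
	copy(m.keys[i+1:], m.keys[i:])
	copy(m.vals[i+1:], m.vals[i:])
	m.keys[i], m.vals[i] = key, val
	m.size += len(key) + len(val)
}

func (m *memTable) Get(key []byte) ([]byte, bool) {
	i := sort.Search(len(m.keys), func(i int) bool { return string(m.keys[i]) >= string(key) })
	if i < len(m.keys) && string(m.keys[i]) == string(key) {
		return m.vals[i], true
	}
	return nil, false
}

// ShouldFlush reports whether the table has grown past the
// configured flush threshold.
func (m *memTable) ShouldFlush(threshold int) bool { return m.size >= threshold }

func main() {
	var m memTable
	m.Put([]byte("b"), []byte("2"))
	m.Put([]byte("a"), []byte("1"))
	v, ok := m.Get([]byte("a"))
	fmt.Println(string(v), ok) // 1 true
}
```

Swapping the sorted slice for the skip list later changes only the internals; WAL replay can target this same interface in step 2.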
### Phase C: Persistent Storage (2 weeks)

1. Design and implement SSTable format:
   - Block-based layout with restart points
   - Checksummed blocks
   - Index and metadata in footer
2. Build SSTable writer:
   - Convert MemTable to blocks
   - Generate sparse index
   - Write footer with checksums
3. Implement SSTable reader:
   - Block loading and validation
   - Binary search through index
   - Iterator interface
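The footer is the anchor of the whole format: the reader loads it first (from the end of the file) to find the index. A fixed-size footer sketch follows; the field set, widths, and magic number are placeholders, and the stdlib CRC-64 stands in for the planned xxHash64:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc64"
)

// footer is an illustrative fixed-size SSTable footer.
type footer struct {
	IndexOffset uint64 // file offset of the sparse index block
	IndexLen    uint64
	KeyCount    uint64
	Magic       uint64
}

const footerMagic = 0x676F2D73746F7267 // placeholder magic number

var crcTable = crc64.MakeTable(crc64.ISO)

// marshal lays the footer out as four little-endian u64s plus a CRC.
func (f footer) marshal() []byte {
	buf := make([]byte, 40)
	binary.LittleEndian.PutUint64(buf[0:], f.IndexOffset)
	binary.LittleEndian.PutUint64(buf[8:], f.IndexLen)
	binary.LittleEndian.PutUint64(buf[16:], f.KeyCount)
	binary.LittleEndian.PutUint64(buf[24:], f.Magic)
	binary.LittleEndian.PutUint64(buf[32:], crc64.Checksum(buf[:32], crcTable))
	return buf
}

// parseFooter validates the checksum and magic before trusting any
// of the offsets, per the "validate at every read" rule.
func parseFooter(buf []byte) (footer, error) {
	if len(buf) != 40 {
		return footer{}, fmt.Errorf("sstable: bad footer size %d", len(buf))
	}
	if crc64.Checksum(buf[:32], crcTable) != binary.LittleEndian.Uint64(buf[32:]) {
		return footer{}, fmt.Errorf("sstable: footer checksum mismatch")
	}
	f := footer{
		IndexOffset: binary.LittleEndian.Uint64(buf[0:]),
		IndexLen:    binary.LittleEndian.Uint64(buf[8:]),
		KeyCount:    binary.LittleEndian.Uint64(buf[16:]),
		Magic:       binary.LittleEndian.Uint64(buf[24:]),
	}
	if f.Magic != footerMagic {
		return footer{}, fmt.Errorf("sstable: bad magic")
	}
	return f, nil
}

func main() {
	f := footer{IndexOffset: 4096, IndexLen: 512, KeyCount: 100, Magic: footerMagic}
	got, err := parseFooter(f.marshal())
	fmt.Println(got.KeyCount, err)
}
```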
### Phase D: Basic Engine Integration (1 week)

1. Implement Level 0 flush mechanism:
   - MemTable to SSTable conversion
   - File management and naming
2. Create read path that merges:
   - Current MemTable
   - Immutable MemTables awaiting flush
   - Level 0 SSTable files
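The merged read path above reduces to one rule: consult sources newest-first and stop at the first hit, treating a tombstone as a definitive "not found". A sketch with map-backed stand-ins for the real sources (the tombstone sentinel is illustrative):

```go
package main

import "fmt"

// source is anything that can answer a point lookup; ok reports
// whether the source holds an entry (possibly a tombstone) for key.
type source interface {
	get(key string) (val string, tombstone, ok bool)
}

const tombstone = "\x00DEL" // illustrative deletion marker

// mapSource is an in-memory stand-in for a MemTable or L0 SSTable.
type mapSource map[string]string

func (m mapSource) get(key string) (string, bool, bool) {
	v, ok := m[key]
	return v, ok && v == tombstone, ok
}

// lookup consults sources newest-first (current MemTable, immutable
// MemTables, then Level 0 files); the first hit wins, and a
// tombstone hides any older version of the key.
func lookup(key string, newestFirst []source) (string, bool) {
	for _, s := range newestFirst {
		if v, del, ok := s.get(key); ok {
			if del {
				return "", false
			}
			return v, true
		}
	}
	return "", false
}

func main() {
	mem := mapSource{"a": tombstone}
	sst := mapSource{"a": "old", "b": "1"}
	v, ok := lookup("b", []source{mem, sst})
	fmt.Println(v, ok) // 1 true
	_, ok = lookup("a", []source{mem, sst})
	fmt.Println(ok) // false: the tombstone hides the older value
}
```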
### Phase E: Compaction (2 weeks)

1. Implement a single, efficient compaction strategy:
   - Simple tiered compaction approach
2. Handle tombstones and key deletion
3. Manage file obsolescence and cleanup
4. Build background compaction scheduling
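The core of step 2 is the merge rule during compaction: for equal keys the newer run wins, and a tombstone may only be discarded when no older data could still hold a shadowed version of the key. A sketch over two in-memory sorted runs (file I/O elided):

```go
package main

import "fmt"

type entry struct {
	key, val  string
	tombstone bool
}

// mergeRuns merges two sorted runs; newer shadows older for equal
// keys, and tombstones are emitted unless dropTombstones is true
// (safe only when compacting into the oldest level).
func mergeRuns(newer, older []entry, dropTombstones bool) []entry {
	var out []entry
	emit := func(e entry) {
		if !(e.tombstone && dropTombstones) {
			out = append(out, e)
		}
	}
	i, j := 0, 0
	for i < len(newer) && j < len(older) {
		switch {
		case newer[i].key < older[j].key:
			emit(newer[i])
			i++
		case newer[i].key > older[j].key:
			emit(older[j])
			j++
		default: // same key: newer wins, older version is discarded
			emit(newer[i])
			i++
			j++
		}
	}
	for ; i < len(newer); i++ {
		emit(newer[i])
	}
	for ; j < len(older); j++ {
		emit(older[j])
	}
	return out
}

func main() {
	newer := []entry{{key: "a", tombstone: true}, {key: "c", val: "3"}}
	older := []entry{{key: "a", val: "1"}, {key: "b", val: "2"}}
	for _, e := range mergeRuns(newer, older, true) {
		fmt.Println(e.key, e.val)
	}
	// "a" was deleted and its tombstone garbage-collected
}
```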
### Phase F: Basic Atomicity and Advanced Features (2-3 weeks)

1. Implement merged iterator across all levels
2. Add snapshot capability for reads:
   - Point-in-time view of the database
   - Consistent reads across MemTable and SSTables
3. Implement simple atomic batch operations:
   - Support atomic multi-key writes
   - Ensure proper crash recovery for batch operations
   - Design interfaces that can be extended for full transactions
4. Add basic statistics and metrics
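The merged iterator in step 1 is typically a min-heap of per-source cursors, where a lower rank (newer source) wins ties on equal keys. A sketch using `container/heap` over string-slice cursors standing in for real level iterators:

```go
package main

import (
	"container/heap"
	"fmt"
)

// cursor walks one sorted source (a MemTable or SSTable level);
// lower rank means newer source, which wins ties on equal keys.
type cursor struct {
	keys []string
	pos  int
	rank int
}

type cursorHeap []*cursor

func (h cursorHeap) Len() int { return len(h) }
func (h cursorHeap) Less(i, j int) bool {
	if h[i].keys[h[i].pos] != h[j].keys[h[j].pos] {
		return h[i].keys[h[i].pos] < h[j].keys[h[j].pos]
	}
	return h[i].rank < h[j].rank
}
func (h cursorHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h *cursorHeap) Push(x any)   { *h = append(*h, x.(*cursor)) }
func (h *cursorHeap) Pop() any {
	old := *h
	c := old[len(old)-1]
	*h = old[:len(old)-1]
	return c
}

// mergedKeys yields each key once, in sorted order, preferring the
// newest source; duplicates from older sources are skipped.
func mergedKeys(sources ...*cursor) []string {
	h := &cursorHeap{}
	for _, c := range sources {
		if len(c.keys) > 0 {
			heap.Push(h, c)
		}
	}
	var out []string
	for h.Len() > 0 {
		c := heap.Pop(h).(*cursor)
		k := c.keys[c.pos]
		if len(out) == 0 || out[len(out)-1] != k {
			out = append(out, k)
		}
		if c.pos++; c.pos < len(c.keys) {
			heap.Push(h, c)
		}
	}
	return out
}

func main() {
	mem := &cursor{keys: []string{"b", "d"}, rank: 0}
	l0 := &cursor{keys: []string{"a", "b", "c"}, rank: 1}
	fmt.Println(mergedKeys(mem, l0)) // [a b c d]
}
```

The same heap, seeded with cursors positioned by a seek, gives the efficient seeking capability listed for this phase.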
### Phase G: Optimization and Benchmarking (1 week)

1. Develop benchmark suite for:
   - Random vs. sequential writes
   - Point reads vs. range scans
   - Compaction overhead and pauses
2. Optimize critical paths based on profiling
3. Tune default configuration parameters

### Phase H: Optional Enhancements (as needed)

1. Add Bloom filters to reduce disk reads
2. Create monitoring hooks and detailed metrics
3. Add crash recovery testing

## Testing Strategy

1. **Unit Tests**: Each component thoroughly tested in isolation
2. **Integration Tests**: End-to-end tests for complete workflows
3. **Property Tests**: Generate randomized operations and verify correctness
4. **Crash Tests**: Simulate crashes and verify recovery
5. **Benchmarks**: Measure performance across different workloads

## Implementation Notes

### Error Handling

- Use descriptive error types and wrap errors with context
- Implement recovery mechanisms for all critical operations
- Validate checksums at every read opportunity

### Concurrency

- Implement single-writer architecture for the main write path
- Allow concurrent readers (snapshots) to proceed without blocking
- Use appropriate synchronization for reader-writer coordination
- Ensure proper isolation between transactions
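One way to realize the single-writer architecture is to funnel all mutations through one goroutine, so writers never contend with each other, while readers coordinate with that one writer via an `RWMutex`. A sketch against a plain map standing in for the MemTable:

```go
package main

import (
	"fmt"
	"sync"
)

// engine sketches the single-writer pattern: every mutation is a
// closure sent to writeCh and applied by exactly one goroutine.
type engine struct {
	mu      sync.RWMutex
	data    map[string]string
	writeCh chan func(map[string]string)
	done    chan struct{}
}

func newEngine() *engine {
	e := &engine{
		data:    make(map[string]string),
		writeCh: make(chan func(map[string]string)),
		done:    make(chan struct{}),
	}
	go func() { // the single writer goroutine
		defer close(e.done)
		for apply := range e.writeCh {
			e.mu.Lock()
			apply(e.data)
			e.mu.Unlock()
		}
	}()
	return e
}

// Put blocks until the writer goroutine has applied the mutation.
func (e *engine) Put(key, val string) {
	ack := make(chan struct{})
	e.writeCh <- func(m map[string]string) { m[key] = val; close(ack) }
	<-ack
}

// Get takes only a read lock, so readers proceed concurrently.
func (e *engine) Get(key string) (string, bool) {
	e.mu.RLock()
	defer e.mu.RUnlock()
	v, ok := e.data[key]
	return v, ok
}

func (e *engine) Close() { close(e.writeCh); <-e.done }

func main() {
	e := newEngine()
	e.Put("a", "1")
	v, ok := e.Get("a")
	fmt.Println(v, ok) // 1 true
	e.Close()
}
```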
### Batch Operation Management

- Use WAL for atomic batch operation durability
- Leverage LSM's natural versioning for snapshots
- Provide simple interfaces that can be built upon for transactions
- Ensure proper crash recovery for batch operations

### Go Idioms

- Follow standard Go project layout
- Use interfaces for component boundaries
- Rely on Go's GC but manage large memory allocations carefully
- Use context for cancellation where appropriate
TODO.md (new file, +198)
@@ -0,0 +1,198 @@
# Go Storage Engine Todo List

This document outlines the implementation tasks for the Go Storage Engine, organized by development phases. Follow these guidelines:

- Work on tasks in the order they appear
- Check off exactly one item (✓) before moving to the next unchecked item
- Each phase must be completed before starting the next phase
- Test thoroughly before marking an item complete

## Phase A: Foundation

- [ ] Set up project structure and Go module
  - [ ] Create directory structure following the package layout in PLAN.md
  - [ ] Initialize Go module and dependencies
  - [ ] Set up testing framework

- [ ] Implement config package
  - [ ] Define configuration struct with serialization/deserialization
  - [ ] Include configurable parameters for durability, compaction, memory usage
  - [ ] Create manifest loading/saving functionality
  - [ ] Add versioning support for config changes
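The versioned manifest could be as simple as a JSON document with a version field that load-time code checks before trusting the rest. A sketch with an illustrative field set (the real manifest would also be saved atomically, e.g. write-temp-then-rename):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Manifest sketches a versioned config record; the fields shown
// here are illustrative and would grow with the engine.
type Manifest struct {
	Version          int    `json:"version"`
	SyncWrites       bool   `json:"sync_writes"`
	MemTableFlushMiB int    `json:"memtable_flush_mib"`
	CompactionStyle  string `json:"compaction_style"`
}

// load decodes a manifest and rejects versions newer than this
// build understands, keeping startup state reproducible.
func load(data []byte, maxVersion int) (Manifest, error) {
	var m Manifest
	if err := json.Unmarshal(data, &m); err != nil {
		return Manifest{}, fmt.Errorf("config: parse manifest: %w", err)
	}
	if m.Version > maxVersion {
		return Manifest{}, fmt.Errorf("config: manifest version %d newer than supported %d", m.Version, maxVersion)
	}
	return m, nil
}

func main() {
	m := Manifest{Version: 1, SyncWrites: true, MemTableFlushMiB: 32, CompactionStyle: "tiered"}
	data, _ := json.Marshal(m)
	got, err := load(data, 1)
	fmt.Println(got.CompactionStyle, err)
}
```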
- [ ] Build Write-Ahead Log (WAL)
  - [ ] Implement append-only file with atomic operations
  - [ ] Add Put/Delete operation encoding
  - [ ] Create replay functionality with error recovery
  - [ ] Implement both synchronous (default) and batched fsync modes
  - [ ] Add checksumming for entries

- [ ] Write WAL tests
  - [ ] Test durability with simulated crashes
  - [ ] Verify replay correctness
  - [ ] Benchmark write performance with different sync options
  - [ ] Test error handling and recovery

## Phase B: In-Memory Layer

- [ ] Implement MemTable
  - [ ] Create skip list data structure aligned to 64-byte cache lines
  - [ ] Add key/value insertion and lookup operations
  - [ ] Implement sorted key iteration
  - [ ] Add size tracking for flush threshold detection

- [ ] Connect WAL replay to MemTable
  - [ ] Create recovery logic to rebuild MemTable from WAL
  - [ ] Implement consistent snapshot reads during recovery
  - [ ] Handle errors during replay with appropriate fallbacks

- [ ] Test concurrent read/write scenarios
  - [ ] Verify reader isolation during writes
  - [ ] Test snapshot consistency guarantees
  - [ ] Benchmark read/write performance under load

## Phase C: Persistent Storage

- [ ] Design SSTable format
  - [ ] Define 16KB block structure with restart points
  - [ ] Create checksumming for blocks (xxHash64)
  - [ ] Define index structure with entries every ~64KB
  - [ ] Design file footer with metadata (version, timestamp, key count, etc.)

- [ ] Implement SSTable writer
  - [ ] Add functionality to convert MemTable to blocks
  - [ ] Create sparse index generator
  - [ ] Implement footer writing with checksums
  - [ ] Add atomic file creation for crash safety

- [ ] Build SSTable reader
  - [ ] Implement block loading with validation
  - [ ] Create binary search through index
  - [ ] Develop iterator interface for scanning
  - [ ] Add error handling for corrupted files

## Phase D: Basic Engine Integration

- [ ] Implement Level 0 flush mechanism
  - [ ] Create MemTable to SSTable conversion process
  - [ ] Implement file management and naming scheme
  - [ ] Add background flush triggering based on size

- [ ] Create read path that merges data sources
  - [ ] Implement read from current MemTable
  - [ ] Add reads from immutable MemTables awaiting flush
  - [ ] Create mechanism to read from Level 0 SSTable files
  - [ ] Build priority-based lookup across all sources

## Phase E: Compaction

- [ ] Implement tiered compaction strategy
  - [ ] Create file selection algorithm based on overlap/size
  - [ ] Implement merge-sorted reading from input files
  - [ ] Add atomic output file generation
  - [ ] Create size-ratio and file-count based triggering

- [ ] Handle tombstones and key deletion
  - [ ] Implement tombstone markers
  - [ ] Create logic for tombstone garbage collection
  - [ ] Test deletion correctness across compactions

- [ ] Manage file obsolescence and cleanup
  - [ ] Implement safe file deletion after compaction
  - [ ] Create consistent file tracking
  - [ ] Add error handling for cleanup failures

- [ ] Build background compaction
  - [ ] Implement worker pool for compaction tasks
  - [ ] Add rate limiting to prevent I/O saturation
  - [ ] Create metrics for monitoring compaction progress
  - [ ] Implement priority scheduling for urgent compactions

## Phase F: Basic Atomicity and Features

- [ ] Implement merged iterator across all levels
  - [ ] Create priority merging iterator
  - [ ] Add efficient seeking capabilities
  - [ ] Implement proper cleanup for resources

- [ ] Add snapshot capability
  - [ ] Create point-in-time view mechanism
  - [ ] Implement consistent reads across all data sources
  - [ ] Add resource tracking and cleanup
  - [ ] Test isolation guarantees

- [ ] Implement atomic batch operations
  - [ ] Create batch data structure for multiple operations
  - [ ] Implement atomic batch commit to WAL
  - [ ] Add crash recovery for batches
  - [ ] Design extensible interfaces for future transaction support

- [ ] Add basic statistics and metrics
  - [ ] Implement counters for operations
  - [ ] Add timing measurements for critical paths
  - [ ] Create exportable metrics interface
  - [ ] Test accuracy of metrics

## Phase G: Optimization and Benchmarking

- [ ] Develop benchmark suite
  - [ ] Create random/sequential write benchmarks
  - [ ] Implement point read and range scan benchmarks
  - [ ] Add compaction overhead measurements
  - [ ] Build reproducible benchmark harness

- [ ] Optimize critical paths
  - [ ] Profile and identify bottlenecks
  - [ ] Optimize memory usage patterns
  - [ ] Improve cache efficiency in hot paths
  - [ ] Reduce GC pressure for large operations

- [ ] Tune default configuration
  - [ ] Benchmark with different parameters
  - [ ] Determine optimal defaults for general use cases
  - [ ] Document configuration recommendations

## Phase H: Optional Enhancements

- [ ] Add Bloom filters
  - [ ] Implement configurable Bloom filter
  - [ ] Add to SSTable format
  - [ ] Create adaptive sizing based on false positive rates
  - [ ] Benchmark improvement in read performance

- [ ] Create monitoring hooks
  - [ ] Add detailed internal event tracking
  - [ ] Implement exportable metrics
  - [ ] Create health check mechanisms
  - [ ] Add performance alerts

- [ ] Add crash recovery testing
  - [ ] Build fault injection framework
  - [ ] Create randomized crash scenarios
  - [ ] Implement validation for post-recovery state
  - [ ] Test edge cases in recovery

## API Implementation

- [ ] Implement Engine interface
  - [ ] `Put(ctx context.Context, key, value []byte, opts ...WriteOption) error`
  - [ ] `Get(ctx context.Context, key []byte, opts ...ReadOption) ([]byte, error)`
  - [ ] `Delete(ctx context.Context, key []byte, opts ...WriteOption) error`
  - [ ] `Batch(ctx context.Context, ops []Operation, opts ...WriteOption) error`
  - [ ] `NewIterator(opts IteratorOptions) Iterator`
  - [ ] `Snapshot() Snapshot`
  - [ ] `Close() error`
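The checklist signatures above assemble into a compilable interface. In the sketch below the interface itself is taken from the checklist verbatim; the option, `Operation`, `Iterator`, and `Snapshot` declarations are illustrative stubs whose real definitions belong to the engine package:

```go
package main

import "context"

// Stub supporting types; shapes here are illustrative only.
type (
	WriteOption     func(*writeConfig)
	ReadOption      func(*readConfig)
	writeConfig     struct{ sync bool }
	readConfig      struct{ verifyChecksums bool }
	Operation       struct{ Key, Value []byte } // nil Value could mean delete
	IteratorOptions struct{ Start, End []byte }
)

type Iterator interface {
	Next() bool
	Key() []byte
	Value() []byte
	Close() error
}

type Snapshot interface {
	Get(ctx context.Context, key []byte, opts ...ReadOption) ([]byte, error)
	Close() error
}

// Engine collects the checklist signatures into one interface.
type Engine interface {
	Put(ctx context.Context, key, value []byte, opts ...WriteOption) error
	Get(ctx context.Context, key []byte, opts ...ReadOption) ([]byte, error)
	Delete(ctx context.Context, key []byte, opts ...WriteOption) error
	Batch(ctx context.Context, ops []Operation, opts ...WriteOption) error
	NewIterator(opts IteratorOptions) Iterator
	Snapshot() Snapshot
	Close() error
}

func main() {}
```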
- [ ] Implement error types
  - [ ] `ErrIO` - I/O errors with recovery procedures
  - [ ] `ErrCorruption` - Data integrity issues
  - [ ] `ErrConfig` - Configuration errors
  - [ ] `ErrResource` - Resource exhaustion
  - [ ] `ErrConcurrency` - Race conditions
  - [ ] `ErrNotFound` - Key not found
- [ ] Create comprehensive documentation
  - [ ] API usage examples
  - [ ] Configuration guidelines
  - [ ] Performance characteristics
  - [ ] Error handling recommendations