From ee23a47a74bf5f9ea14044d874800450b092388d Mon Sep 17 00:00:00 2001
From: Jeremy Tregunna
Date: Sat, 19 Apr 2025 14:06:53 -0600
Subject: [PATCH] docs: added idea, plan, and todo docs

---
 IDEA.md |  52 +++++++++++++++
 PLAN.md | 154 +++++++++++++++++++++++++++++++++++++++++++
 TODO.md | 198 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 go.mod  |   3 +
 4 files changed, 407 insertions(+)
 create mode 100644 IDEA.md
 create mode 100644 PLAN.md
 create mode 100644 TODO.md
 create mode 100644 go.mod

diff --git a/IDEA.md b/IDEA.md
new file mode 100644
index 0000000..2c69200
--- /dev/null
+++ b/IDEA.md
@@ -0,0 +1,52 @@
+# Go Storage: A Minimalist LSM Storage Engine
+
+## Vision
+
+Build a clean, composable, and educational storage engine in Go that follows Log-Structured Merge Tree (LSM) principles, focusing on simplicity while providing the building blocks needed for higher-level database implementations.
+
+## Goals
+
+### 1. Extreme Simplicity
+- Create minimal but complete primitives that can support various database paradigms (KV, relational, graph)
+- Prioritize readability and educational value over hyper-optimization
+- Use idiomatic Go with clear interfaces and documentation
+- Implement a single-writer architecture for simplicity and reduced concurrency complexity
+
+### 2. Durability + Performance
+- Implement the LSM architecture pattern: Write-Ahead Log → MemTable → SSTables
+- Provide configurable durability guarantees (sync vs. batched fsync)
+- Optimize for both point lookups and range scans
+
+### 3. Configurability
+- Store all configuration parameters in a versioned, persistent manifest
+- Allow tuning of memory usage, compaction behavior, and durability settings
+- Support reproducible startup states across restarts
+
+### 4. Composable Primitives
+- Design clean interfaces for fundamental operations (reads, writes, snapshots, iteration)
+- Enable building of higher-level abstractions (SQL, Gremlin, custom query languages)
+- Support both transactional and analytical workloads
+- Provide simple atomic write primitives that can be built upon:
+  - Leverage read snapshots from immutable LSM structure
+  - Support basic atomic batch operations
+  - Ensure crash recovery through proper WAL handling
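+
+To make the primitives concrete, they roughly imply an interface of the following shape. This is an illustration only; the names and signatures below are placeholders (the working API surface is sketched in TODO.md), not a committed design:
+
+```go
+package storage
+
+import "context"
+
+// Operation is a single write in an atomic batch; a nil Value marks a delete.
+type Operation struct {
+    Key   []byte
+    Value []byte
+}
+
+// Iterator walks keys in sorted order.
+type Iterator interface {
+    Seek(key []byte) bool
+    Next() bool
+    Key() []byte
+    Value() []byte
+    Close() error
+}
+
+// Snapshot is a consistent, point-in-time read view of the store.
+type Snapshot interface {
+    Get(ctx context.Context, key []byte) ([]byte, error)
+    NewIterator() Iterator
+    Close() error
+}
+
+// Engine is the top-level primitive that higher layers build on.
+type Engine interface {
+    Put(ctx context.Context, key, value []byte) error
+    Get(ctx context.Context, key []byte) ([]byte, error)
+    Delete(ctx context.Context, key []byte) error
+    Batch(ctx context.Context, ops []Operation) error
+    NewIterator() Iterator
+    Snapshot() Snapshot
+    Close() error
+}
+```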
+
+## Target Use Cases
+
+1. **Educational Tool**: Learn and teach storage engine internals
+2. **Embedded Storage**: Applications needing local, durable storage with predictable performance
+3. **Prototype Foundation**: Base layer for experimenting with novel database designs
+4. **Go Ecosystem Component**: Reusable storage layer for Go applications and services
+
+## Non-Goals
+
+1. **Feature Parity with Production Engines**: Not trying to compete with RocksDB, LevelDB, etc.
+2. **Multi-Node Distribution**: Focusing on single-node operation
+3. **Complex Query Planning**: Leaving higher-level query features to layers built on top
+
+## Success Criteria
+
+1. **Correctness**: Data is never lost or corrupted, even during crashes
+2. **Understandability**: Code is clear enough to serve as an educational reference
+3. **Performance**: Reasonable throughput and latency for common operations
+4. **Extensibility**: Can be built upon to create specialized database engines
\ No newline at end of file
diff --git a/PLAN.md b/PLAN.md
new file mode 100644
index 0000000..7b34e37
--- /dev/null
+++ b/PLAN.md
@@ -0,0 +1,154 @@
+# Implementation Plan for Go Storage Engine
+
+## Architecture Overview
+
+```
+┌─────────────┐     ┌─────────────┐     ┌─────────────────────────┐
+│ Client API  │────▶│  MemTable   │────▶│ Immutable SSTable Files │
+└─────────────┘     └─────────────┘     └─────────────────────────┘
+       │                   ▲                         ▲
+       │                   │                         │
+       ▼                   │                         │
+┌─────────────┐            │             ┌─────────────────────────┐
+│   Write-    │────────────┘             │  Background Compaction  │
+│  Ahead Log  │                          │  Process                │
+└─────────────┘                          └─────────────────────────┘
+       │                                             │
+       │                                             │
+       ▼                                             ▼
+┌─────────────────────────────────────────────────────────────────┐
+│                       Persistent Storage                        │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+## Package Structure
+
+```
+go-storage/
+├── cmd/
+│   └── storage-bench/     # Benchmarking tool
+│
+├── pkg/
+│   ├── config/            # Configuration and manifest
+│   ├── wal/               # Write-ahead logging with transaction markers
+│   ├── memtable/          # In-memory table implementation
+│   ├── sstable/           # SSTable read/write
+│   │   ├── block/         # Block format implementation
+│   │   └── footer/        # File footer and metadata
+│   ├── compaction/        # Compaction strategies
+│   ├── iterator/          # Merged iterator implementation
+│   ├── transaction/       # Transaction management with Snapshot + WAL
+│   │   ├── snapshot/      # Read snapshot implementation
+│   │   └── txbuffer/      # Transaction write buffer
+│   └── engine/            # Main engine implementation with single-writer architecture
+│
+└── internal/
+    ├── checksum/          # Checksum utilities (xxHash64)
+    └── utils/             # Shared internal utilities
+```
+
+## Development Phases
+
+### Phase A: Foundation (1-2 weeks)
+1. Set up project structure and Go module
+2. Implement config package with serialization/deserialization
+3. Build basic WAL with:
+   - Append operations (Put/Delete)
+   - Replay functionality
+   - Configurable fsync modes
+4. Write comprehensive tests for WAL durability
+
+### Phase B: In-Memory Layer (1 week)
+1. Implement MemTable with:
+   - Skip list data structure
+   - Sorted key iteration
+   - Size tracking for flush threshold
+2. Connect WAL replay to MemTable restore
+3. Test concurrent read/write scenarios
+
+### Phase C: Persistent Storage (2 weeks)
+1. Design and implement SSTable format:
+   - Block-based layout with restart points
+   - Checksummed blocks
+   - Index and metadata in footer
+2. Build SSTable writer:
+   - Convert MemTable to blocks
+   - Generate sparse index
+   - Write footer with checksums
+3. Implement SSTable reader:
+   - Block loading and validation
+   - Binary search through index
+   - Iterator interface
+
+### Phase D: Basic Engine Integration (1 week)
+1. Implement Level 0 flush mechanism:
+   - MemTable to SSTable conversion
+   - File management and naming
+2. Create read path that merges:
+   - Current MemTable
+   - Immutable MemTables awaiting flush
+   - Level 0 SSTable files
+
+### Phase E: Compaction (2 weeks)
+1. Implement a single, efficient compaction strategy:
+   - Simple tiered compaction approach
+2. Handle tombstones and key deletion
+3. Manage file obsolescence and cleanup
+4. Build background compaction scheduling
+
+### Phase F: Basic Atomicity and Advanced Features (2-3 weeks)
+1. Implement merged iterator across all levels
+2. Add snapshot capability for reads:
+   - Point-in-time view of the database
+   - Consistent reads across MemTable and SSTables
+3. Implement simple atomic batch operations:
+   - Support atomic multi-key writes
+   - Ensure proper crash recovery for batch operations
+   - Design interfaces that can be extended for full transactions
+4. Add basic statistics and metrics
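+
+As a rough illustration of item 3, the sketch below shows how a batch can be made atomic by writing it to the WAL as a single record before touching the MemTable. Every name here is an assumption for illustration, not the planned API:
+
+```go
+package storage
+
+// BatchOp is one write in an atomic batch; a nil Value marks a delete (tombstone).
+type BatchOp struct {
+    Key   []byte
+    Value []byte
+}
+
+// walWriter and memTable are stand-ins for the wal and memtable packages above.
+type walWriter interface {
+    // AppendBatch persists all ops as a single durable record.
+    AppendBatch(ops []BatchOp) error
+}
+
+type memTable interface {
+    Put(key, value []byte)
+    Delete(key []byte)
+}
+
+// commitBatch writes the whole batch to the WAL as one record before applying
+// it to the MemTable, so crash recovery replays the batch entirely or not at all.
+func commitBatch(w walWriter, mt memTable, ops []BatchOp) error {
+    if err := w.AppendBatch(ops); err != nil {
+        return err
+    }
+    for _, op := range ops {
+        if op.Value == nil {
+            mt.Delete(op.Key)
+        } else {
+            mt.Put(op.Key, op.Value)
+        }
+    }
+    return nil
+}
+```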
+
+### Phase G: Optimization and Benchmarking (1 week)
+1. Develop benchmark suite for:
+   - Random vs sequential writes
+   - Point reads vs range scans
+   - Compaction overhead and pauses
+2. Optimize critical paths based on profiling
+3. Tune default configuration parameters
+
+### Phase H: Optional Enhancements (as needed)
+1. Add Bloom filters to reduce disk reads
+2. Create monitoring hooks and detailed metrics
+3. Add crash recovery testing
+
+## Testing Strategy
+
+1. **Unit Tests**: Each component thoroughly tested in isolation
+2. **Integration Tests**: End-to-end tests for complete workflows
+3. **Property Tests**: Generate randomized operations and verify correctness
+4. **Crash Tests**: Simulate crashes and verify recovery
+5. **Benchmarks**: Measure performance across different workloads
+
+## Implementation Notes
+
+### Error Handling
+- Use descriptive error types and wrap errors with context
+- Implement recovery mechanisms for all critical operations
+- Validate checksums at every read opportunity
+
+### Concurrency
+- Implement single-writer architecture for the main write path
+- Allow concurrent readers (snapshots) to proceed without blocking
+- Use appropriate synchronization for reader-writer coordination
+- Ensure proper isolation between transactions
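+
+A minimal sketch of the single-writer pattern, assuming all mutations are funneled through one goroutine while readers work against snapshots (names are illustrative, not the real engine types):
+
+```go
+package storage
+
+type writeReq struct {
+    key, value []byte
+    done       chan error
+}
+
+// singleWriter serializes all mutations through one goroutine, so the write
+// path needs no locking; readers use snapshots and never block it.
+type singleWriter struct {
+    reqs  chan writeReq
+    apply func(key, value []byte) error // e.g. WAL append followed by MemTable insert
+}
+
+func newSingleWriter(apply func(key, value []byte) error) *singleWriter {
+    w := &singleWriter{reqs: make(chan writeReq, 64), apply: apply}
+    go w.loop()
+    return w
+}
+
+// loop is the only goroutine that ever mutates engine state.
+func (w *singleWriter) loop() {
+    for r := range w.reqs {
+        r.done <- w.apply(r.key, r.value)
+    }
+}
+
+// Put hands the write to the writer goroutine and waits for the outcome.
+func (w *singleWriter) Put(key, value []byte) error {
+    done := make(chan error, 1)
+    w.reqs <- writeReq{key: key, value: value, done: done}
+    return <-done
+}
+```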
+
+### Batch Operation Management
+- Use WAL for atomic batch operation durability
+- Leverage LSM's natural versioning for snapshots
+- Provide simple interfaces that can be built upon for transactions
+- Ensure proper crash recovery for batch operations
+
+### Go Idioms
+- Follow standard Go project layout
+- Use interfaces for component boundaries
+- Rely on Go's GC but manage large memory allocations carefully
+- Use context for cancellation where appropriate
\ No newline at end of file
diff --git a/TODO.md b/TODO.md
new file mode 100644
index 0000000..f9487a7
--- /dev/null
+++ b/TODO.md
@@ -0,0 +1,198 @@
+# Go Storage Engine Todo List
+
+This document outlines the implementation tasks for the Go Storage Engine, organized by development phases. Follow these guidelines:
+
+- Work on tasks in the order they appear
+- Check off exactly one item (✓) before moving to the next unchecked item
+- Each phase must be completed before starting the next phase
+- Test thoroughly before marking an item complete
+
+## Phase A: Foundation
+
+- [ ] Set up project structure and Go module
+  - [ ] Create directory structure following the package layout in PLAN.md
+  - [ ] Initialize Go module and dependencies
+  - [ ] Set up testing framework
+
+- [ ] Implement config package
+  - [ ] Define configuration struct with serialization/deserialization
+  - [ ] Include configurable parameters for durability, compaction, memory usage
+  - [ ] Create manifest loading/saving functionality
+  - [ ] Add versioning support for config changes
+
+- [ ] Build Write-Ahead Log (WAL)
+  - [ ] Implement append-only file with atomic operations
+  - [ ] Add Put/Delete operation encoding
+  - [ ] Create replay functionality with error recovery
+  - [ ] Implement both synchronous (default) and batched fsync modes
+  - [ ] Add checksumming for entries
+
+- [ ] Write WAL tests
+  - [ ] Test durability with simulated crashes
+  - [ ] Verify replay correctness
+  - [ ] Benchmark write performance with different sync options
+  - [ ] Test error handling and recovery
+
+## Phase B: In-Memory Layer
+
+- [ ] Implement MemTable
+  - [ ] Create skip list data structure aligned to 64-byte cache lines
+  - [ ] Add key/value insertion and lookup operations
+  - [ ] Implement sorted key iteration
+  - [ ] Add size tracking for flush threshold detection
+
+- [ ] Connect WAL replay to MemTable
+  - [ ] Create recovery logic to rebuild MemTable from WAL
+  - [ ] Implement consistent snapshot reads during recovery
+  - [ ] Handle errors during replay with appropriate fallbacks
+
+- [ ] Test concurrent read/write scenarios
+  - [ ] Verify reader isolation during writes
+  - [ ] Test snapshot consistency guarantees
+  - [ ] Benchmark read/write performance under load
+
+## Phase C: Persistent Storage
+
+- [ ] Design SSTable format (see the sketch after this list)
+  - [ ] Define 16KB block structure with restart points
+  - [ ] Create checksumming for blocks (xxHash64)
+  - [ ] Define index structure with entries every ~64KB
+  - [ ] Design file footer with metadata (version, timestamp, key count, etc.)
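+
+A possible shape for that format, written out only to make the checklist concrete. The constants and footer fields beyond those named above (16KB blocks, xxHash64 checksums, ~64KB index granularity, version/timestamp/key count) are assumptions, not a frozen on-disk layout:
+
+```go
+package sstable
+
+const (
+    TargetBlockSize  = 16 * 1024 // target size of a data block before it is cut
+    IndexKeyInterval = 64 * 1024 // approx. bytes of data per sparse index entry
+    RestartInterval  = 16        // keys between restart points within a block
+    ChecksumSize     = 8         // xxHash64 appended to each block
+)
+
+// Footer sits at the end of the file and is read first when opening it.
+type Footer struct {
+    Version     uint32
+    Timestamp   int64  // creation time
+    KeyCount    uint64
+    IndexOffset uint64 // where the sparse index starts (assumed field)
+    IndexSize   uint32 // length of the sparse index (assumed field)
+    Checksum    uint64 // xxHash64 of the preceding footer fields
+}
+```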
+
+- [ ] Implement SSTable writer
+  - [ ] Add functionality to convert MemTable to blocks
+  - [ ] Create sparse index generator
+  - [ ] Implement footer writing with checksums
+  - [ ] Add atomic file creation for crash safety
+
+- [ ] Build SSTable reader
+  - [ ] Implement block loading with validation
+  - [ ] Create binary search through index
+  - [ ] Develop iterator interface for scanning
+  - [ ] Add error handling for corrupted files
+
+## Phase D: Basic Engine Integration
+
+- [ ] Implement Level 0 flush mechanism
+  - [ ] Create MemTable to SSTable conversion process
+  - [ ] Implement file management and naming scheme
+  - [ ] Add background flush triggering based on size
+
+- [ ] Create read path that merges data sources
+  - [ ] Implement read from current MemTable
+  - [ ] Add reads from immutable MemTables awaiting flush
+  - [ ] Create mechanism to read from Level 0 SSTable files
+  - [ ] Build priority-based lookup across all sources
+
+## Phase E: Compaction
+
+- [ ] Implement tiered compaction strategy
+  - [ ] Create file selection algorithm based on overlap/size
+  - [ ] Implement merge-sorted reading from input files
+  - [ ] Add atomic output file generation
+  - [ ] Create triggering based on size ratio and file count
+
+- [ ] Handle tombstones and key deletion
+  - [ ] Implement tombstone markers
+  - [ ] Create logic for tombstone garbage collection
+  - [ ] Test deletion correctness across compactions
+
+- [ ] Manage file obsolescence and cleanup
+  - [ ] Implement safe file deletion after compaction
+  - [ ] Create consistent file tracking
+  - [ ] Add error handling for cleanup failures
+
+- [ ] Build background compaction
+  - [ ] Implement worker pool for compaction tasks
+  - [ ] Add rate limiting to prevent I/O saturation
+  - [ ] Create metrics for monitoring compaction progress
+  - [ ] Implement priority scheduling for urgent compactions
+
+## Phase F: Basic Atomicity and Features
+
+- [ ] Implement merged iterator across all levels
+  - [ ] Create priority merging iterator
+  - [ ] Add efficient seeking capabilities
+  - [ ] Implement proper cleanup for resources
+
+- [ ] Add snapshot capability
+  - [ ] Create point-in-time view mechanism
+  - [ ] Implement consistent reads across all data sources
+  - [ ] Add resource tracking and cleanup
+  - [ ] Test isolation guarantees
+
+- [ ] Implement atomic batch operations
+  - [ ] Create batch data structure for multiple operations
+  - [ ] Implement atomic batch commit to WAL
+  - [ ] Add crash recovery for batches
+  - [ ] Design extensible interfaces for future transaction support
+
+- [ ] Add basic statistics and metrics
+  - [ ] Implement counters for operations
+  - [ ] Add timing measurements for critical paths
+  - [ ] Create exportable metrics interface
+  - [ ] Test accuracy of metrics
+
+## Phase G: Optimization and Benchmarking
+
+- [ ] Develop benchmark suite
+  - [ ] Create random/sequential write benchmarks
+  - [ ] Implement point read and range scan benchmarks
+  - [ ] Add compaction overhead measurements
+  - [ ] Build reproducible benchmark harness
+
+- [ ] Optimize critical paths
+  - [ ] Profile and identify bottlenecks
+  - [ ] Optimize memory usage patterns
+  - [ ] Improve cache efficiency in hot paths
+  - [ ] Reduce GC pressure for large operations
+
+- [ ] Tune default configuration
+  - [ ] Benchmark with different parameters
+  - [ ] Determine optimal defaults for general use cases
+  - [ ] Document configuration recommendations
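+
+For orientation, a write benchmark in the suite might look roughly like the sketch below (it would live in a `_test.go` file under `cmd/storage-bench`); `engine.Open`, the `Put` signature, and the import path are assumptions based on the planned but not yet implemented API:
+
+```go
+package bench
+
+import (
+    "context"
+    "math/rand"
+    "testing"
+
+    "git.canoozie.net/jer/go-storage/pkg/engine" // assumed path: module path + pkg/engine
+)
+
+// BenchmarkRandomPut measures random-key write throughput.
+func BenchmarkRandomPut(b *testing.B) {
+    eng, err := engine.Open(b.TempDir()) // assumed constructor
+    if err != nil {
+        b.Fatal(err)
+    }
+    defer eng.Close()
+
+    ctx := context.Background()
+    key := make([]byte, 16)
+    value := make([]byte, 128)
+    b.ResetTimer()
+    for i := 0; i < b.N; i++ {
+        rand.Read(key) // random keys so writes are not accidentally sequential
+        if err := eng.Put(ctx, key, value); err != nil {
+            b.Fatal(err)
+        }
+    }
+}
+```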
+
+## Phase H: Optional Enhancements
+
+- [ ] Add Bloom filters
+  - [ ] Implement configurable Bloom filter
+  - [ ] Add to SSTable format
+  - [ ] Create adaptive sizing based on false positive rates
+  - [ ] Benchmark improvement in read performance
+
+- [ ] Create monitoring hooks
+  - [ ] Add detailed internal event tracking
+  - [ ] Implement exportable metrics
+  - [ ] Create health check mechanisms
+  - [ ] Add performance alerts
+
+- [ ] Add crash recovery testing
+  - [ ] Build fault injection framework
+  - [ ] Create randomized crash scenarios
+  - [ ] Implement validation for post-recovery state
+  - [ ] Test edge cases in recovery
+
+## API Implementation
+
+- [ ] Implement Engine interface
+  - [ ] `Put(ctx context.Context, key, value []byte, opts ...WriteOption) error`
+  - [ ] `Get(ctx context.Context, key []byte, opts ...ReadOption) ([]byte, error)`
+  - [ ] `Delete(ctx context.Context, key []byte, opts ...WriteOption) error`
+  - [ ] `Batch(ctx context.Context, ops []Operation, opts ...WriteOption) error`
+  - [ ] `NewIterator(opts IteratorOptions) Iterator`
+  - [ ] `Snapshot() Snapshot`
+  - [ ] `Close() error`
+
+- [ ] Implement error types
+  - [ ] `ErrIO` - I/O errors with recovery procedures
+  - [ ] `ErrCorruption` - Data integrity issues
+  - [ ] `ErrConfig` - Configuration errors
+  - [ ] `ErrResource` - Resource exhaustion
+  - [ ] `ErrConcurrency` - Race conditions
+  - [ ] `ErrNotFound` - Key not found
+
+- [ ] Create comprehensive documentation
+  - [ ] API usage examples
+  - [ ] Configuration guidelines
+  - [ ] Performance characteristics
+  - [ ] Error handling recommendations
\ No newline at end of file
diff --git a/go.mod b/go.mod
new file mode 100644
index 0000000..55d6233
--- /dev/null
+++ b/go.mod
@@ -0,0 +1,3 @@
+module git.canoozie.net/jer/go-storage
+
+go 1.24.2