# Write-Ahead Log (WAL) Package Documentation
The `wal` package implements a durable, crash-resistant Write-Ahead Log for the Kevo engine. It serves as the primary mechanism for ensuring data durability and atomicity, especially during system crashes or power failures.

## Overview

The Write-Ahead Log records all database modifications before they are applied to the main database structures. This follows the "write-ahead logging" principle: all changes must be logged before being applied to the database, ensuring that if a system crash occurs, the database can be recovered to a consistent state by replaying the log.

Key responsibilities of the WAL include:

- Recording database operations in a durable manner
- Supporting atomic batch operations
- Providing crash recovery mechanisms
- Managing log file rotation and cleanup

## File Format and Record Structure

### WAL File Format

WAL files use a `.wal` extension and are named with a timestamp:

```
<timestamp>.wal (e.g., 01745172985771529746.wal)
```

The timestamp-based naming allows for chronological ordering during recovery.

### Record Format

Records in the WAL have a consistent structure:

```
┌──────────────┬──────────────┬──────────────┬──────────────────────┐
│    CRC-32    │    Length    │     Type     │       Payload        │
│  (4 bytes)   │  (2 bytes)   │   (1 byte)   │    (Length bytes)    │
└──────────────┴──────────────┴──────────────┴──────────────────────┘
              Header (7 bytes)                        Data
```

- **CRC-32**: A checksum of the payload for data integrity verification
- **Length**: The payload length (up to 32KB)
- **Type**: The record type:
  - `RecordTypeFull (1)`: A complete record
  - `RecordTypeFirst (2)`: First fragment of a large record
  - `RecordTypeMiddle (3)`: Middle fragment of a large record
  - `RecordTypeLast (4)`: Last fragment of a large record

Records larger than the maximum size (32KB) are automatically split into multiple fragments.
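
To make the framing concrete, here is a minimal sketch of encoding a single full record. The field order and 7-byte header come from the diagram above; the little-endian byte order and IEEE CRC-32 polynomial are assumptions for illustration, not confirmed details of the package.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

const headerSize = 7 // 4-byte CRC + 2-byte length + 1-byte type

// encodeRecord frames a payload as one full record: checksum, length,
// type, then the payload itself, matching the diagram above.
func encodeRecord(recordType byte, payload []byte) []byte {
	buf := make([]byte, headerSize+len(payload))
	binary.LittleEndian.PutUint32(buf[0:4], crc32.ChecksumIEEE(payload))
	binary.LittleEndian.PutUint16(buf[4:6], uint16(len(payload)))
	buf[6] = recordType
	copy(buf[headerSize:], payload)
	return buf
}

func main() {
	rec := encodeRecord(1 /* RecordTypeFull */, []byte("example payload"))
	fmt.Printf("%d-byte record (7-byte header + %d-byte payload)\n",
		len(rec), len(rec)-headerSize)
}
```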

### Operation Payload Format

For standard operations (Put/Delete), the payload format is shown below; a sketch of the encoding follows the field list:

```
┌──────────────┬──────────────┬──────────────┬──────────────┬──────────────┬──────────────┐
│   Op Type    │   Sequence   │   Key Len    │     Key      │  Value Len   │    Value     │
│   (1 byte)   │  (8 bytes)   │  (4 bytes)   │  (Key Len)   │  (4 bytes)   │ (Value Len)  │
└──────────────┴──────────────┴──────────────┴──────────────┴──────────────┴──────────────┘
```

- **Op Type**: The operation type:
  - `OpTypePut (1)`: Key-value insertion
  - `OpTypeDelete (2)`: Key deletion
  - `OpTypeMerge (3)`: Value merging (reserved for future use)
  - `OpTypeBatch (4)`: Batch of operations
- **Sequence**: A monotonically increasing sequence number
- **Key Len / Key**: The length and bytes of the key
- **Value Len / Value**: The length and bytes of the value (omitted for delete operations)
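
As an illustration of this layout, the sketch below encodes a Put operation's payload. Only the field order and widths come from the diagram; the byte order and the helper name are assumptions for the example.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodePutPayload lays out a Put operation as described above: op type,
// sequence number, key length, key bytes, value length, value bytes.
func encodePutPayload(seq uint64, key, value []byte) []byte {
	buf := make([]byte, 1+8+4+len(key)+4+len(value))
	buf[0] = 1 // OpTypePut
	binary.LittleEndian.PutUint64(buf[1:9], seq)
	binary.LittleEndian.PutUint32(buf[9:13], uint32(len(key)))
	n := 13 + copy(buf[13:], key)
	binary.LittleEndian.PutUint32(buf[n:n+4], uint32(len(value)))
	copy(buf[n+4:], value)
	return buf
}

func main() {
	p := encodePutPayload(42, []byte("key"), []byte("value"))
	fmt.Printf("payload is %d bytes\n", len(p)) // 1+8+4+3+4+5 = 25
}
```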

## Implementation Details

### Core Components

#### WAL Writer

The `WAL` struct manages writing to the log file and includes:

- Buffered writing for efficiency
- CRC32 checksums for data integrity
- Sequence number management
- Synchronization control based on configuration

#### WAL Reader

The `Reader` struct handles reading and validating records:

- Verifies CRC32 checksums
- Reconstructs fragmented records
- Presents a logical view of entries to consumers

#### Batch Processing

The `Batch` struct handles atomic multi-operation groups:

- Collects multiple operations (Put/Delete)
- Writes them as a single atomic unit
- Tracks operation counts and sizes

### Key Operations

#### Writing Operations

The `Append` method writes a single operation to the log:

1. Assigns a sequence number
2. Computes the required size
3. Determines if fragmentation is needed
4. Writes the record(s) with appropriate headers
5. Syncs to disk based on configuration

#### Batch Operations

The `AppendBatch` method handles writing multiple operations atomically (see the sketch after this list):

1. Writes a batch header with operation count
2. Assigns sequential sequence numbers to operations
3. Writes all operations with the same basic format
4. Syncs to disk based on configuration
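
To picture what such a batch payload could look like, here is a sketch under the assumption that the header is simply an operation count (4 bytes here) followed by the individually encoded operations; the actual batch header may carry additional fields.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// encodeBatch sketches an atomic batch payload: an operation count,
// then each operation in the standard payload format described above.
func encodeBatch(encodedOps [][]byte) []byte {
	var buf bytes.Buffer
	var count [4]byte
	binary.LittleEndian.PutUint32(count[:], uint32(len(encodedOps)))
	buf.Write(count[:])
	for _, op := range encodedOps {
		buf.Write(op)
	}
	return buf.Bytes()
}

func main() {
	batch := encodeBatch([][]byte{{1, 0, 0}, {2, 0}})
	fmt.Printf("batch payload: %d bytes\n", len(batch)) // 4 + 3 + 2 = 9
}
```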

#### Record Fragmentation

For records larger than 32KB (the splitting is sketched after this list):

1. The record is split into fragments
2. First fragment (`RecordTypeFirst`) contains metadata and part of the key
3. Middle fragments (`RecordTypeMiddle`) contain continuing data
4. Last fragment (`RecordTypeLast`) contains the final portion
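
A simplified sketch of the splitting step, ignoring header overhead and the exact metadata placement, which the real writer also has to handle:

```go
package main

import "fmt"

const maxPayload = 32 * 1024 // 32KB maximum record payload

// Record types, as listed in the record format section.
const (
	recordTypeFull   = 1
	recordTypeFirst  = 2
	recordTypeMiddle = 3
	recordTypeLast   = 4
)

type fragment struct {
	recType byte
	data    []byte
}

// fragmentPayload splits data into record-sized chunks, tagging each one
// First/Middle/Last, or Full when no split is needed.
func fragmentPayload(data []byte) []fragment {
	if len(data) <= maxPayload {
		return []fragment{{recordTypeFull, data}}
	}
	var frags []fragment
	for start := 0; start < len(data); start += maxPayload {
		end := start + maxPayload
		if end > len(data) {
			end = len(data)
		}
		var t byte
		switch {
		case start == 0:
			t = recordTypeFirst
		case end == len(data):
			t = recordTypeLast
		default:
			t = recordTypeMiddle
		}
		frags = append(frags, fragment{t, data[start:end]})
	}
	return frags
}

func main() {
	frags := fragmentPayload(make([]byte, 80*1024)) // 80KB → 3 fragments
	fmt.Println("fragments:", len(frags))           // First, Middle, Last
}
```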

#### Reading and Recovery

The `ReadEntry` method reads entries from the log (steps 1 and 2 are sketched after this list):

1. Reads a physical record
2. Validates the checksum
3. If it's a fragmented record, collects all fragments
4. Parses the entry data into an `Entry` struct
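
Steps 1 and 2 amount to reading a fixed-size header and checking the payload checksum. Here is a standalone sketch assuming the 7-byte header from the record format section; the reader shape and error values are illustrative, not the package's actual API.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
	"io"
)

// readRecord reads one physical record from r and verifies its checksum.
func readRecord(r io.Reader) (recType byte, payload []byte, err error) {
	var header [7]byte
	if _, err = io.ReadFull(r, header[:]); err != nil {
		return 0, nil, err
	}
	crc := binary.LittleEndian.Uint32(header[0:4])
	length := binary.LittleEndian.Uint16(header[4:6])
	recType = header[6]

	payload = make([]byte, length)
	if _, err = io.ReadFull(r, payload); err != nil {
		return 0, nil, err
	}
	if crc32.ChecksumIEEE(payload) != crc {
		return 0, nil, errors.New("wal: checksum mismatch (corrupted record)")
	}
	return recType, payload, nil
}

func main() {
	// Build one valid record in memory, then read it back.
	payload := []byte("hello")
	rec := make([]byte, 7+len(payload))
	binary.LittleEndian.PutUint32(rec[0:4], crc32.ChecksumIEEE(payload))
	binary.LittleEndian.PutUint16(rec[4:6], uint16(len(payload)))
	rec[6] = 1 // RecordTypeFull
	copy(rec[7:], payload)

	t, p, err := readRecord(bytes.NewReader(rec))
	fmt.Println(t, string(p), err) // 1 hello <nil>
}
```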

## Durability Guarantees

The WAL provides configurable durability through three sync modes:

1. **Immediate Sync Mode (`SyncImmediate`)**:
   - Every write is immediately synced to disk
   - Highest durability, lowest performance
   - Data is safe even in the event of a system crash or power failure
   - Suitable for critical data where durability is paramount

2. **Batch Sync Mode (`SyncBatch`)**:
   - Syncs after a configurable amount of data has been written
   - Balances durability and performance
   - May lose the most recent writes in a crash
   - Default setting for most workloads

3. **No Sync Mode (`SyncNone`)**:
   - Relies on OS caching and background flushing
   - Highest performance, lowest durability
   - Data may be lost in a crash
   - Suitable for non-critical or easily reproducible data

The application can choose the appropriate sync mode based on its durability requirements.
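
For example, a caller might select the mode at startup along these lines. This is a hypothetical sketch: the `WALSyncMode` field name is a stand-in, so check the `config` package for the actual option.

```go
// Hypothetical configuration sketch; the field name is illustrative.
cfg := config.NewDefaultConfig("/path/to/data")
cfg.WALSyncMode = wal.SyncBatch // or wal.SyncImmediate / wal.SyncNone

myWAL, err := wal.NewWAL(cfg, "/path/to/data/wal")
if err != nil {
    log.Fatal(err)
}
defer myWAL.Close()
```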

## Recovery Process

WAL recovery happens during engine startup:

1. **WAL File Discovery**:
   - Scan for all `.wal` files in the WAL directory
   - Sort files by timestamp (filename)

2. **Sequential Replay**:
   - Process each file in chronological order
   - For each file, read and validate all records
   - Apply valid operations to rebuild the MemTable

3. **Error Handling**:
   - Skip corrupted records when possible
   - If a file is heavily corrupted, move on to the next file
   - As long as one file is processed successfully, recovery continues

4. **Sequence Number Recovery**:
   - Track the highest sequence number seen
   - Update the next sequence number for future operations

5. **WAL Reset**:
   - After recovery, either reuse the last WAL file (if it is not full) or create a new one for future operations

The recovery process is designed to be robust against partial corruption and to recover as much data as possible.
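
Discovery and ordering (steps 1 and 2) reduce to a glob and a sort: because the timestamp filenames are fixed width (as in the example `01745172985771529746.wal`), lexicographic order matches chronological order. A standalone sketch:

```go
package main

import (
	"fmt"
	"log"
	"path/filepath"
	"sort"
)

// discoverWALFiles lists the .wal files under dir, oldest first.
func discoverWALFiles(dir string) ([]string, error) {
	files, err := filepath.Glob(filepath.Join(dir, "*.wal"))
	if err != nil {
		return nil, err
	}
	sort.Strings(files) // fixed-width timestamp names sort chronologically
	return files, nil
}

func main() {
	files, err := discoverWALFiles("/path/to/data/wal")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(files) // replay these in order
}
```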

## Corruption Handling

The WAL implements several mechanisms to handle and recover from corruption:

1. **CRC32 Checksums**:
   - Every record includes a CRC32 checksum
   - Corrupted records are detected and skipped

2. **Scanning Recovery**:
   - When corruption is detected, the reader can scan ahead (see the sketch after this list)
   - It tries to find the next valid record header

3. **Progressive Recovery**:
   - Even if some records are lost, subsequent valid records are processed
   - Files with too many errors are skipped, but recovery continues with later files

4. **Backup Mechanism**:
   - Problematic WAL files can be moved to a backup directory
- This allows recovery to proceed with a clean slate if needed
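
The scan-ahead idea can be illustrated as follows: after a checksum failure, advance one byte at a time and try to reparse a plausible header until a record whose checksum verifies is found. This is a simplified in-memory sketch of the technique, not the reader's actual code.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

const headerSize = 7

// resync scans buf from off for the next offset where a header parses,
// the record type is plausible (1-4), and the payload checksum verifies.
func resync(buf []byte, off int) (int, bool) {
	for ; off+headerSize <= len(buf); off++ {
		if t := buf[off+6]; t < 1 || t > 4 {
			continue // not a plausible record type
		}
		length := int(binary.LittleEndian.Uint16(buf[off+4 : off+6]))
		end := off + headerSize + length
		if end > len(buf) {
			continue // length field runs past the buffer
		}
		crc := binary.LittleEndian.Uint32(buf[off : off+4])
		if crc32.ChecksumIEEE(buf[off+headerSize:end]) == crc {
			return off, true // found the next valid record
		}
	}
	return 0, false
}

func main() {
	// Four bytes of garbage followed by one valid empty-payload record.
	buf := []byte{0xde, 0xad, 0xbe, 0xef}
	var hdr [headerSize]byte
	binary.LittleEndian.PutUint32(hdr[0:4], crc32.ChecksumIEEE(nil))
	hdr[6] = 1 // RecordTypeFull with a zero-length payload
	buf = append(buf, hdr[:]...)

	off, ok := resync(buf, 0)
	fmt.Println(off, ok) // 4 true
}
```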

## Performance Considerations

### Buffered Writing

The WAL uses buffered I/O to reduce the number of system calls (the pattern is sketched after this list):

- Writes go through a 64KB buffer
- The buffer is flushed when sync is called
- This significantly improves write throughput
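
The pattern is standard `bufio` layering, sketched below with only the standard library: writes accumulate in a 64KB user-space buffer, and a "sync" first flushes that buffer, then asks the OS to persist the file.

```go
package main

import (
	"bufio"
	"log"
	"os"
)

func main() {
	f, err := os.OpenFile("example.wal", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// 64KB buffer, matching the size described above.
	w := bufio.NewWriterSize(f, 64*1024)

	record := []byte("encoded record bytes")
	if _, err := w.Write(record); err != nil { // buffered: no syscall yet
		log.Fatal(err)
	}

	// "Sync": flush the user-space buffer, then persist via fsync.
	if err := w.Flush(); err != nil {
		log.Fatal(err)
	}
	if err := f.Sync(); err != nil {
		log.Fatal(err)
	}
}
```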

### Sync Frequency Trade-offs

The sync frequency directly impacts performance:

- `SyncImmediate`: one sync per write operation (slowest, safest)
- `SyncBatch`: one sync per N bytes written (configurable balance)
- `SyncNone`: no explicit syncs (fastest, least safe)

### File Size Management

WAL files have a configurable maximum size (default 64MB); the rotation check is sketched after this list:

- Full files are closed and new ones created
- This prevents individual files from growing too large
- Facilitates easier backup and cleanup
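
Rotation amounts to a size check before each append, sketched below with a hypothetical `walFile` type and `rotate` helper (the package's own types will differ):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

const maxWALSize = 64 << 20 // 64MB default maximum file size

// walFile is a hypothetical stand-in for the real writer state.
type walFile struct {
	w    io.Writer
	size int64
}

// rotate would close the current file and open a fresh timestamp-named
// one; those details are omitted from this sketch.
func (f *walFile) rotate() error {
	f.w = &bytes.Buffer{}
	f.size = 0
	return nil
}

// append writes a record, rotating first if it would exceed the limit.
func (f *walFile) append(record []byte) error {
	if f.size+int64(len(record)) > maxWALSize {
		if err := f.rotate(); err != nil {
			return err
		}
	}
	n, err := f.w.Write(record)
	f.size += int64(n)
	return err
}

func main() {
	f := &walFile{w: &bytes.Buffer{}}
	if err := f.append([]byte("record")); err != nil {
		fmt.Println(err)
	}
	fmt.Println("current file size:", f.size)
}
```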

## Common Usage Patterns

### Basic Usage

```go
// Create a new WAL
cfg := config.NewDefaultConfig("/path/to/data")
myWAL, err := wal.NewWAL(cfg, "/path/to/data/wal")
if err != nil {
    log.Fatal(err)
}

// Append an operation; the returned sequence number identifies this write
seqNum, err := myWAL.Append(wal.OpTypePut, []byte("key"), []byte("value"))
if err != nil {
    log.Fatal(err)
}
log.Printf("appended at sequence %d", seqNum)

// Ensure durability
if err := myWAL.Sync(); err != nil {
    log.Fatal(err)
}

// Close the WAL when done
if err := myWAL.Close(); err != nil {
    log.Fatal(err)
}
```


### Using Batches for Atomicity

```go
// Create a batch of operations
batch := wal.NewBatch()
batch.Put([]byte("key1"), []byte("value1"))
batch.Put([]byte("key2"), []byte("value2"))
batch.Delete([]byte("key3"))

// Write the batch atomically; startSeq is the sequence number assigned
// to the first operation in the batch
startSeq, err := myWAL.AppendBatch(batch.ToEntries())
if err != nil {
    log.Fatal(err)
}
log.Printf("batch starts at sequence %d", startSeq)
```


### WAL Recovery

```go
// Handler function invoked for each recovered entry
handler := func(entry *wal.Entry) error {
    switch entry.Type {
    case wal.OpTypePut:
        // Apply the Put operation
        memTable.Put(entry.Key, entry.Value, entry.SequenceNumber)
    case wal.OpTypeDelete:
        // Apply the Delete operation
        memTable.Delete(entry.Key, entry.SequenceNumber)
    }
    return nil
}

// Replay all WAL files in the directory in chronological order
if err := wal.ReplayWALDir("/path/to/data/wal", handler); err != nil {
    log.Fatal(err)
}
```


## Trade-offs and Limitations

### Write Amplification

The WAL doubles write work: every change is written once to the log and once to the final storage structures.

- This is a necessary trade-off for durability
- It can be mitigated through batching and appropriate sync modes

### Recovery Time

Recovery time is proportional to the size of the WAL:

- Large WAL files or many operations increase startup time
- This is mitigated by regular compaction, which makes old WAL files obsolete

### Corruption Resilience

While the WAL can recover from some corruption:

- Severe corruption at the start of a file may render it unreadable
- Header corruption can cause loss of subsequent records
- A partial sync before a crash can leave truncated records

These limitations are managed through:

- Regular WAL rotation
- Multiple independent WAL files
- Robust error handling during recovery