kevo/docs/stats.md
Jeremy Tregunna 0637c40a40
Some checks failed
Go Tests / Run Tests (1.24.2) (push) Failing after 5m4s
feat: big refactor cleaning up the engine code
2025-04-23 22:45:16 -06:00

12 KiB

Statistics Package Documentation

The stats package implements a comprehensive, atomic, thread-safe statistics collection system for the Kevo engine. It provides a centralized way to track metrics across all components with minimal performance impact and contention.

Overview

Statistics collection is a critical aspect of database monitoring, performance tuning, and debugging. The stats package is designed to collect and provide access to various metrics with minimal overhead, even in highly concurrent environments.

Key responsibilities of the stats package include:

  • Tracking operation counts (puts, gets, deletes, etc.)
  • Measuring operation latencies (min, max, average)
  • Recording byte counts for I/O operations
  • Tracking error occurrences by category
  • Maintaining timestamps for the last operations
  • Collecting WAL recovery statistics
  • Providing a thread-safe, unified interface for all metrics

Architecture

Core Components

The statistics system consists of several well-defined components:

┌───────────────────────────────────────────┐
│            AtomicCollector                │
├───────────────┬──────────────┬────────────┤
│  Operation    │  Latency     │  Error     │
│  Counters     │  Trackers    │  Counters  │
└───────────────┴──────────────┴────────────┘
  1. AtomicCollector: Thread-safe implementation of the Collector interface
  2. OperationType: Type definition for various operation categories
  3. LatencyTracker: Component for tracking operation latencies
  4. RecoveryStats: Specialized structure for WAL recovery metrics

Implementation Details

AtomicCollector

The AtomicCollector is the core component and implements the Collector interface:

type AtomicCollector struct {
    // Operation counters using atomic values
    counts       map[OperationType]*atomic.Uint64
    countsMu     sync.RWMutex // Only used when creating new counter entries

    // Timing measurements for last operation timestamps
    lastOpTime   map[OperationType]time.Time
    lastOpTimeMu sync.RWMutex // Only used for timestamp updates

    // Usage metrics
    memTableSize      atomic.Uint64
    totalBytesRead    atomic.Uint64
    totalBytesWritten atomic.Uint64

    // Error tracking
    errors     map[string]*atomic.Uint64
    errorsMu   sync.RWMutex // Only used when creating new error entries

    // Performance metrics
    flushCount        atomic.Uint64
    compactionCount   atomic.Uint64

    // Recovery statistics
    recoveryStats RecoveryStats
    
    // Latency tracking 
    latencies    map[OperationType]*LatencyTracker
    latenciesMu  sync.RWMutex // Only used when creating new latency trackers
}

The collector uses atomic variables and minimal locking to ensure thread safety while maintaining high performance.

Operation Types

The package defines standard operation types as constants:

type OperationType string

const (
    OpPut        OperationType = "put"
    OpGet        OperationType = "get"
    OpDelete     OperationType = "delete"
    OpTxBegin    OperationType = "tx_begin"
    OpTxCommit   OperationType = "tx_commit"
    OpTxRollback OperationType = "tx_rollback"
    OpFlush      OperationType = "flush"
    OpCompact    OperationType = "compact"
    OpSeek       OperationType = "seek"
    OpScan       OperationType = "scan"
    OpScanRange  OperationType = "scan_range"
)

These standardized types enable consistent tracking across all engine components.

Latency Tracking

The LatencyTracker maintains runtime statistics about operation latencies:

type LatencyTracker struct {
    count   atomic.Uint64
    sum     atomic.Uint64 // sum in nanoseconds
    max     atomic.Uint64 // max in nanoseconds
    min     atomic.Uint64 // min in nanoseconds (initialized to max uint64)
}

It tracks:

  • Count of operations
  • Sum of all latencies (for calculating averages)
  • Maximum latency observed
  • Minimum latency observed

All fields use atomic operations to ensure thread safety.

Recovery Statistics

Recovery statistics are tracked in a specialized structure:

type RecoveryStats struct {
    WALFilesRecovered    atomic.Uint64
    WALEntriesRecovered  atomic.Uint64
    WALCorruptedEntries  atomic.Uint64
    WALRecoveryDuration  atomic.Int64 // nanoseconds
}

These metrics provide insights into the recovery process after engine startup.

Key Operations

Operation Tracking

The TrackOperation method increments the counter for the specified operation type:

func (c *AtomicCollector) TrackOperation(op OperationType) {
    counter := c.getOrCreateCounter(op)
    counter.Add(1)

    // Update last operation time
    c.lastOpTimeMu.Lock()
    c.lastOpTime[op] = time.Now()
    c.lastOpTimeMu.Unlock()
}

This method is used for basic operation counting without latency tracking.

Latency Tracking

The TrackOperationWithLatency method not only counts operations but also records their duration:

func (c *AtomicCollector) TrackOperationWithLatency(op OperationType, latencyNs uint64) {
    // Track operation count
    counter := c.getOrCreateCounter(op)
    counter.Add(1)

    // Update last operation time
    c.lastOpTimeMu.Lock()
    c.lastOpTime[op] = time.Now()
    c.lastOpTimeMu.Unlock()

    // Update latency statistics
    tracker := c.getOrCreateLatencyTracker(op)
    tracker.count.Add(1)
    tracker.sum.Add(latencyNs)

    // Update max (using compare-and-swap pattern)
    // ...

    // Update min (using compare-and-swap pattern)
    // ...
}

This provides detailed timing metrics for performance analysis.

Error Tracking

Errors are tracked by category using the TrackError method:

func (c *AtomicCollector) TrackError(errorType string) {
    // Get or create error counter
    // ...
    
    counter.Add(1)
}

This helps identify problematic areas in the engine.

Byte Tracking

Data volumes are tracked with the TrackBytes method:

func (c *AtomicCollector) TrackBytes(isWrite bool, bytes uint64) {
    if isWrite {
        c.totalBytesWritten.Add(bytes)
    } else {
        c.totalBytesRead.Add(bytes)
    }
}

This distinguishes between read and write operations.

Recovery Tracking

Recovery statistics are managed through specialized methods:

func (c *AtomicCollector) StartRecovery() time.Time {
    // Reset recovery stats
    c.recoveryStats.WALFilesRecovered.Store(0)
    c.recoveryStats.WALEntriesRecovered.Store(0)
    c.recoveryStats.WALCorruptedEntries.Store(0)
    c.recoveryStats.WALRecoveryDuration.Store(0)
    
    return time.Now()
}

func (c *AtomicCollector) FinishRecovery(startTime time.Time, filesRecovered, entriesRecovered, corruptedEntries uint64) {
    c.recoveryStats.WALFilesRecovered.Store(filesRecovered)
    c.recoveryStats.WALEntriesRecovered.Store(entriesRecovered)
    c.recoveryStats.WALCorruptedEntries.Store(corruptedEntries)
    c.recoveryStats.WALRecoveryDuration.Store(time.Since(startTime).Nanoseconds())
}

These provide structured insight into the startup recovery process.

Retrieving Statistics

Full Statistics Retrieval

The GetStats method returns a complete map of all collected statistics:

func (c *AtomicCollector) GetStats() map[string]interface{} {
    stats := make(map[string]interface{})

    // Add operation counters
    c.countsMu.RLock()
    for op, counter := range c.counts {
        stats[string(op)+"_ops"] = counter.Load()
    }
    c.countsMu.RUnlock()

    // Add timing information
    c.lastOpTimeMu.RLock()
    for op, timestamp := range c.lastOpTime {
        stats["last_"+string(op)+"_time"] = timestamp.UnixNano()
    }
    c.lastOpTimeMu.RUnlock()

    // Add performance metrics
    stats["memtable_size"] = c.memTableSize.Load()
    stats["total_bytes_read"] = c.totalBytesRead.Load()
    stats["total_bytes_written"] = c.totalBytesWritten.Load()
    stats["flush_count"] = c.flushCount.Load()
    stats["compaction_count"] = c.compactionCount.Load()

    // Add error statistics
    c.errorsMu.RLock()
    errorStats := make(map[string]uint64)
    for errType, counter := range c.errors {
        errorStats[errType] = counter.Load()
    }
    c.errorsMu.RUnlock()
    stats["errors"] = errorStats

    // Add recovery statistics
    // ...

    // Add latency statistics
    // ...

    return stats
}

This provides a comprehensive view of the engine's operations and performance.

Filtered Statistics

For targeted analysis, the GetStatsFiltered method allows retrieving only statistics with a specific prefix:

func (c *AtomicCollector) GetStatsFiltered(prefix string) map[string]interface{} {
    allStats := c.GetStats()
    filtered := make(map[string]interface{})

    for key, value := range allStats {
        // Add entries that start with the prefix
        if len(prefix) == 0 || startsWith(key, prefix) {
            filtered[key] = value
        }
    }

    return filtered
}

This is useful for examining specific types of operations or components.

Performance Considerations

Atomic Operations

The statistics collector uses atomic operations extensively to minimize contention:

  1. Lock-Free Counters:

    • Most increments and reads use atomic operations
    • No locking during normal operation
  2. Limited Lock Scope:

    • Locks are only used when creating new entries
    • Read locks for retrieving complete statistics
  3. Read-Write Locks:

    • Uses sync.RWMutex to allow concurrent reads
    • Writes (rare in this context) obtain exclusive access

Memory Efficiency

The collector is designed to be memory-efficient:

  1. Lazy Initialization:

    • Counters are created only when needed
    • No pre-allocation of unused statistics
  2. Map-Based Storage:

    • Only tracks operations that actually occur
    • Compact representation for sparse metrics
  3. Fixed Overhead:

    • Predictable memory usage regardless of operation volume
    • Low per-operation overhead

Integration with the Engine

The statistics collector is integrated throughout the engine's operations:

  1. EngineFacade Integration:

    • Central collector instance in the EngineFacade
    • All operations tracked through the facade
  2. Manager-Specific Statistics:

    • Each manager contributes component-specific stats
    • Combined by the facade for a complete view
  3. Centralized Reporting:

    • The GetStats() method merges all statistics
    • Provides a unified view for monitoring

Common Usage Patterns

Tracking Operations

// Track a basic operation
collector.TrackOperation(stats.OpPut)

// Track an operation with latency
startTime := time.Now()
// ... perform operation ...
latencyNs := uint64(time.Since(startTime).Nanoseconds())
collector.TrackOperationWithLatency(stats.OpGet, latencyNs)

// Track bytes processed
collector.TrackBytes(true, uint64(len(key)+len(value))) // write
collector.TrackBytes(false, uint64(len(value))) // read

// Track errors
if err != nil {
    collector.TrackError("read_error")
}

Retrieving Statistics

// Get all statistics
allStats := collector.GetStats()
fmt.Printf("Put operations: %d\n", allStats["put_ops"])
fmt.Printf("Total bytes written: %d\n", allStats["total_bytes_written"])

// Get filtered statistics
txStats := collector.GetStatsFiltered("tx_")
for k, v := range txStats {
    fmt.Printf("%s: %v\n", k, v)
}

Limitations and Future Enhancements

Current Limitations

  1. Fixed Metric Types:

    • Predefined operation types
    • No dynamic metric definition at runtime
  2. Simple Aggregation:

    • Basic counters and min/max/avg latencies
    • No percentiles or histograms
  3. In-Memory Only:

    • No persistence of historical metrics
    • Resets on engine restart

Potential Enhancements

  1. Advanced Metrics:

    • Latency percentiles (e.g., p95, p99)
    • Histograms for distribution analysis
    • Moving averages for trend detection
  2. Time Series Support:

    • Time-bucketed statistics
    • Historical metrics retention
    • Rate calculations (operations per second)
  3. Metric Export:

    • Prometheus integration
    • Structured logging with metrics
    • Periodic stat dumping to files