Statistics Package Documentation
The stats package implements a thread-safe statistics collection system for the Kevo engine, built on atomic counters. It provides a centralized way to track metrics across all components with minimal performance impact and contention.
Overview
Statistics collection is a critical aspect of database monitoring, performance tuning, and debugging. The stats package is designed to collect and provide access to various metrics with minimal overhead, even in highly concurrent environments.
Key responsibilities of the stats package include:
- Tracking operation counts (puts, gets, deletes, etc.)
- Measuring operation latencies (min, max, average)
- Recording byte counts for I/O operations
- Tracking error occurrences by category
- Maintaining timestamps for the last operations
- Collecting WAL recovery statistics
- Providing a thread-safe, unified interface for all metrics
Architecture
Core Components
The statistics system consists of several well-defined components:
┌───────────────────────────────────────────┐
│              AtomicCollector              │
├───────────────┬──────────────┬────────────┤
│   Operation   │   Latency    │   Error    │
│   Counters    │   Trackers   │  Counters  │
└───────────────┴──────────────┴────────────┘
- AtomicCollector: Thread-safe implementation of the Collector interface
- OperationType: Type definition for various operation categories
- LatencyTracker: Component for tracking operation latencies
- RecoveryStats: Specialized structure for WAL recovery metrics
Implementation Details
AtomicCollector
The AtomicCollector is the core component and implements the Collector interface:
    type AtomicCollector struct {
        // Operation counters using atomic values
        counts   map[OperationType]*atomic.Uint64
        countsMu sync.RWMutex // Only used when creating new counter entries

        // Timing measurements for last operation timestamps
        lastOpTime   map[OperationType]time.Time
        lastOpTimeMu sync.RWMutex // Only used for timestamp updates

        // Usage metrics
        memTableSize      atomic.Uint64
        totalBytesRead    atomic.Uint64
        totalBytesWritten atomic.Uint64

        // Error tracking
        errors   map[string]*atomic.Uint64
        errorsMu sync.RWMutex // Only used when creating new error entries

        // Performance metrics
        flushCount      atomic.Uint64
        compactionCount atomic.Uint64

        // Recovery statistics
        recoveryStats RecoveryStats

        // Latency tracking
        latencies   map[OperationType]*LatencyTracker
        latenciesMu sync.RWMutex // Only used when creating new latency trackers
    }
The collector uses atomic variables and minimal locking to ensure thread safety while maintaining high performance.
Operation Types
The package defines standard operation types as constants:
    type OperationType string

    const (
        OpPut        OperationType = "put"
        OpGet        OperationType = "get"
        OpDelete     OperationType = "delete"
        OpTxBegin    OperationType = "tx_begin"
        OpTxCommit   OperationType = "tx_commit"
        OpTxRollback OperationType = "tx_rollback"
        OpFlush      OperationType = "flush"
        OpCompact    OperationType = "compact"
        OpSeek       OperationType = "seek"
        OpScan       OperationType = "scan"
        OpScanRange  OperationType = "scan_range"
    )
These standardized types enable consistent tracking across all engine components.
Latency Tracking
The LatencyTracker maintains runtime statistics about operation latencies:

    type LatencyTracker struct {
        count atomic.Uint64
        sum   atomic.Uint64 // sum in nanoseconds
        max   atomic.Uint64 // max in nanoseconds
        min   atomic.Uint64 // min in nanoseconds (initialized to max uint64)
    }
It tracks:
- Count of operations
- Sum of all latencies (for calculating averages)
- Maximum latency observed
- Minimum latency observed
All fields use atomic operations to ensure thread safety.
Recovery Statistics
Recovery statistics are tracked in a specialized structure:
    type RecoveryStats struct {
        WALFilesRecovered   atomic.Uint64
        WALEntriesRecovered atomic.Uint64
        WALCorruptedEntries atomic.Uint64
        WALRecoveryDuration atomic.Int64 // nanoseconds
    }
These metrics provide insights into the recovery process after engine startup.
Key Operations
Operation Tracking
The TrackOperation method increments the counter for the specified operation type:

    func (c *AtomicCollector) TrackOperation(op OperationType) {
        counter := c.getOrCreateCounter(op)
        counter.Add(1)

        // Update last operation time
        c.lastOpTimeMu.Lock()
        c.lastOpTime[op] = time.Now()
        c.lastOpTimeMu.Unlock()
    }
This method is used for basic operation counting without latency tracking.
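The getOrCreateCounter helper is called above but its body is not shown. One common way to write such a helper is a read-locked lookup followed by a write-locked re-check, which matches the "only used when creating new counter entries" comment on countsMu. The sketch below is an assumption about its shape, not the package's actual code. The trimmed collector type and newCollector are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type OperationType string

// collector holds only the two fields this sketch needs.
type collector struct {
	counts   map[OperationType]*atomic.Uint64
	countsMu sync.RWMutex
}

func newCollector() *collector {
	return &collector{counts: make(map[OperationType]*atomic.Uint64)}
}

// getOrCreateCounter returns the counter for op, creating it on first use.
func (c *collector) getOrCreateCounter(op OperationType) *atomic.Uint64 {
	// Fast path: the entry usually exists, so a shared read lock suffices.
	c.countsMu.RLock()
	counter, ok := c.counts[op]
	c.countsMu.RUnlock()
	if ok {
		return counter
	}

	// Slow path: take the write lock and look again, because another
	// goroutine may have created the entry between the two locks.
	c.countsMu.Lock()
	defer c.countsMu.Unlock()
	if counter, ok = c.counts[op]; ok {
		return counter
	}
	counter = new(atomic.Uint64)
	c.counts[op] = counter
	return counter
}

func main() {
	c := newCollector()
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				c.getOrCreateCounter("put").Add(1)
			}
		}()
	}
	wg.Wait()
	fmt.Println(c.getOrCreateCounter("put").Load()) // prints 8000
}
```

The second lookup under the write lock is what keeps two goroutines from each installing their own counter for the same operation type and losing increments.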
Latency Tracking
The TrackOperationWithLatency method not only counts operations but also records their duration:

    func (c *AtomicCollector) TrackOperationWithLatency(op OperationType, latencyNs uint64) {
        // Track operation count
        counter := c.getOrCreateCounter(op)
        counter.Add(1)

        // Update last operation time
        c.lastOpTimeMu.Lock()
        c.lastOpTime[op] = time.Now()
        c.lastOpTimeMu.Unlock()

        // Update latency statistics
        tracker := c.getOrCreateLatencyTracker(op)
        tracker.count.Add(1)
        tracker.sum.Add(latencyNs)

        // Update max (using compare-and-swap pattern)
        // ...

        // Update min (using compare-and-swap pattern)
        // ...
    }
This provides detailed timing metrics for performance analysis.
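The max and min updates are elided in the listing above and described only as a compare-and-swap pattern. A generic version of that pattern, written as a sketch rather than as the package's code, could look like this. The names updateMax, updateMin and minMax are hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"sync/atomic"
)

// updateMax raises the stored value to v unless it is already at least v.
func updateMax(target *atomic.Uint64, v uint64) {
	for {
		cur := target.Load()
		if v <= cur {
			return // nothing to do
		}
		if target.CompareAndSwap(cur, v) {
			return
		}
		// Another goroutine changed the value; reload and retry.
	}
}

// updateMin lowers the stored value to v unless it is already at most v.
func updateMin(target *atomic.Uint64, v uint64) {
	for {
		cur := target.Load()
		if v >= cur {
			return
		}
		if target.CompareAndSwap(cur, v) {
			return
		}
	}
}

// minMax feeds samples through both helpers. The minimum starts at
// MaxUint64, mirroring the initialization noted on LatencyTracker.min.
func minMax(samples []uint64) (lo, hi uint64) {
	var minNs, maxNs atomic.Uint64
	minNs.Store(math.MaxUint64)
	for _, s := range samples {
		updateMin(&minNs, s)
		updateMax(&maxNs, s)
	}
	return minNs.Load(), maxNs.Load()
}

func main() {
	lo, hi := minMax([]uint64{500, 120, 900, 300})
	fmt.Println(lo, hi) // prints 120 900
}
```

The loop retries only when a concurrent writer has changed the value in between, so contention costs an extra iteration instead of a blocked goroutine.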
Error Tracking
Errors are tracked by category using the TrackError method:

    func (c *AtomicCollector) TrackError(errorType string) {
        // Get or create error counter
        // ...
        counter.Add(1)
    }
This helps identify problematic areas in the engine.
Byte Tracking
Data volumes are tracked with the TrackBytes method:

    func (c *AtomicCollector) TrackBytes(isWrite bool, bytes uint64) {
        if isWrite {
            c.totalBytesWritten.Add(bytes)
        } else {
            c.totalBytesRead.Add(bytes)
        }
    }
This distinguishes between read and write operations.
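One way to feed TrackBytes without scattering calls through I/O code is to wrap the writer. The sketch below is illustrative only: countingWriter, byteTracker, fakeCollector and newDiscardWriter are hypothetical names, and fakeCollector stands in for AtomicCollector.

```go
package main

import (
	"fmt"
	"io"
	"sync/atomic"
)

// byteTracker is the single method this sketch needs from the collector.
type byteTracker interface {
	TrackBytes(isWrite bool, bytes uint64)
}

// countingWriter reports every successful write to the collector.
type countingWriter struct {
	w     io.Writer
	stats byteTracker
}

func (cw *countingWriter) Write(p []byte) (int, error) {
	n, err := cw.w.Write(p)
	if n > 0 {
		// Count the bytes actually written, even on a partial write.
		cw.stats.TrackBytes(true, uint64(n))
	}
	return n, err
}

// fakeCollector stands in for AtomicCollector in this sketch.
type fakeCollector struct {
	totalBytesRead    atomic.Uint64
	totalBytesWritten atomic.Uint64
}

func (c *fakeCollector) TrackBytes(isWrite bool, n uint64) {
	if isWrite {
		c.totalBytesWritten.Add(n)
	} else {
		c.totalBytesRead.Add(n)
	}
}

// newDiscardWriter wraps io.Discard so the example needs no real file.
func newDiscardWriter(stats byteTracker) *countingWriter {
	return &countingWriter{w: io.Discard, stats: stats}
}

func main() {
	c := &fakeCollector{}
	cw := newDiscardWriter(c)
	fmt.Fprint(cw, "hello")
	fmt.Println(c.totalBytesWritten.Load()) // prints 5
}
```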
Recovery Tracking
Recovery statistics are managed through specialized methods:
    func (c *AtomicCollector) StartRecovery() time.Time {
        // Reset recovery stats
        c.recoveryStats.WALFilesRecovered.Store(0)
        c.recoveryStats.WALEntriesRecovered.Store(0)
        c.recoveryStats.WALCorruptedEntries.Store(0)
        c.recoveryStats.WALRecoveryDuration.Store(0)
        return time.Now()
    }

    func (c *AtomicCollector) FinishRecovery(startTime time.Time, filesRecovered, entriesRecovered, corruptedEntries uint64) {
        c.recoveryStats.WALFilesRecovered.Store(filesRecovered)
        c.recoveryStats.WALEntriesRecovered.Store(entriesRecovered)
        c.recoveryStats.WALCorruptedEntries.Store(corruptedEntries)
        c.recoveryStats.WALRecoveryDuration.Store(time.Since(startTime).Nanoseconds())
    }
These provide structured insight into the startup recovery process.
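Because FinishRecovery stores totals instead of adding to them, a caller would typically accumulate counts locally and report once at the end. The following sketch shows that calling pattern. The two methods are copied from the listing above onto a trimmed collector type, while walFile and recoverWAL are hypothetical stand-ins for the real recovery code.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

type RecoveryStats struct {
	WALFilesRecovered   atomic.Uint64
	WALEntriesRecovered atomic.Uint64
	WALCorruptedEntries atomic.Uint64
	WALRecoveryDuration atomic.Int64 // nanoseconds
}

// collector carries only the recovery fields for this sketch.
type collector struct {
	recoveryStats RecoveryStats
}

func (c *collector) StartRecovery() time.Time {
	c.recoveryStats.WALFilesRecovered.Store(0)
	c.recoveryStats.WALEntriesRecovered.Store(0)
	c.recoveryStats.WALCorruptedEntries.Store(0)
	c.recoveryStats.WALRecoveryDuration.Store(0)
	return time.Now()
}

func (c *collector) FinishRecovery(startTime time.Time, filesRecovered, entriesRecovered, corruptedEntries uint64) {
	c.recoveryStats.WALFilesRecovered.Store(filesRecovered)
	c.recoveryStats.WALEntriesRecovered.Store(entriesRecovered)
	c.recoveryStats.WALCorruptedEntries.Store(corruptedEntries)
	c.recoveryStats.WALRecoveryDuration.Store(time.Since(startTime).Nanoseconds())
}

// walFile is a hypothetical stand-in: each element marks whether an
// entry in the file passed validation.
type walFile []bool

// recoverWAL accumulates counts locally and reports them once at the end.
func recoverWAL(c *collector, files []walFile) {
	start := c.StartRecovery()
	var entries, corrupted uint64
	for _, f := range files {
		for _, valid := range f {
			if valid {
				entries++
			} else {
				corrupted++
			}
		}
	}
	c.FinishRecovery(start, uint64(len(files)), entries, corrupted)
}

func main() {
	c := &collector{}
	recoverWAL(c, []walFile{{true, true, false}, {true}})
	fmt.Println(
		c.recoveryStats.WALFilesRecovered.Load(),
		c.recoveryStats.WALEntriesRecovered.Load(),
		c.recoveryStats.WALCorruptedEntries.Load(),
	) // prints 2 3 1
}
```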
Retrieving Statistics
Full Statistics Retrieval
The GetStats method returns a complete map of all collected statistics:

    func (c *AtomicCollector) GetStats() map[string]interface{} {
        stats := make(map[string]interface{})

        // Add operation counters
        c.countsMu.RLock()
        for op, counter := range c.counts {
            stats[string(op)+"_ops"] = counter.Load()
        }
        c.countsMu.RUnlock()

        // Add timing information
        c.lastOpTimeMu.RLock()
        for op, timestamp := range c.lastOpTime {
            stats["last_"+string(op)+"_time"] = timestamp.UnixNano()
        }
        c.lastOpTimeMu.RUnlock()

        // Add performance metrics
        stats["memtable_size"] = c.memTableSize.Load()
        stats["total_bytes_read"] = c.totalBytesRead.Load()
        stats["total_bytes_written"] = c.totalBytesWritten.Load()
        stats["flush_count"] = c.flushCount.Load()
        stats["compaction_count"] = c.compactionCount.Load()

        // Add error statistics
        c.errorsMu.RLock()
        errorStats := make(map[string]uint64)
        for errType, counter := range c.errors {
            errorStats[errType] = counter.Load()
        }
        c.errorsMu.RUnlock()
        stats["errors"] = errorStats

        // Add recovery statistics
        // ...

        // Add latency statistics
        // ...

        return stats
    }
This provides a comprehensive view of the engine's operations and performance.
Filtered Statistics
For targeted analysis, the GetStatsFiltered method allows retrieving only statistics with a specific prefix:

    func (c *AtomicCollector) GetStatsFiltered(prefix string) map[string]interface{} {
        allStats := c.GetStats()
        filtered := make(map[string]interface{})

        for key, value := range allStats {
            // Add entries that start with the prefix
            if len(prefix) == 0 || startsWith(key, prefix) {
                filtered[key] = value
            }
        }

        return filtered
    }
This is useful for examining specific types of operations or components.
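The startsWith helper is not shown in this document. Assuming it behaves like the standard library's strings.HasPrefix, the filtering step can be sketched as a standalone function. filterByPrefix is a hypothetical name, and the sample keys follow the "_ops" and "last_..._time" naming used by GetStats.

```go
package main

import (
	"fmt"
	"strings"
)

// filterByPrefix keeps only the entries whose key starts with prefix.
// strings.HasPrefix returns true for an empty prefix, so an empty
// prefix returns every entry without a special case.
func filterByPrefix(all map[string]interface{}, prefix string) map[string]interface{} {
	filtered := make(map[string]interface{})
	for key, value := range all {
		if strings.HasPrefix(key, prefix) {
			filtered[key] = value
		}
	}
	return filtered
}

func main() {
	all := map[string]interface{}{
		"tx_begin_ops":       uint64(3),
		"tx_commit_ops":      uint64(2),
		"put_ops":            uint64(9),
		"last_tx_begin_time": int64(1),
	}
	fmt.Println(len(filterByPrefix(all, "tx_"))) // prints 2
}
```

Prefix matching is literal: a "tx_" filter returns tx_begin_ops and tx_commit_ops but not last_tx_begin_time, since that key starts with "last_".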
Performance Considerations
Atomic Operations
The statistics collector uses atomic operations extensively to minimize contention:
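- Lock-Free Counters:
  - Most increments and reads use atomic operations
  - Incrementing an existing counter takes no exclusive lock
- Limited Lock Scope:
  - Write locks on the counter, error, and latency maps are taken only when a new entry is created
  - The last-operation timestamp is the exception: TrackOperation and TrackOperationWithLatency take lastOpTimeMu on every call
  - Read locks are taken when retrieving complete statistics
- Read-Write Locks:
  - Uses sync.RWMutex to allow concurrent reads
  - Writes (rare in this context) obtain exclusive access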
Memory Efficiency
The collector is designed to be memory-efficient:
- Lazy Initialization:
  - Counters are created only when needed
  - No pre-allocation of unused statistics
- Map-Based Storage:
  - Only tracks operations that actually occur
  - Compact representation for sparse metrics
- Fixed Overhead:
  - Predictable memory usage regardless of operation volume
  - Low per-operation overhead
Integration with the Engine
The statistics collector is integrated throughout the engine's operations:
- EngineFacade Integration:
  - Central collector instance in the EngineFacade
  - All operations tracked through the facade
- Manager-Specific Statistics:
  - Each manager contributes component-specific stats
  - Combined by the facade for a complete view
- Centralized Reporting:
  - The GetStats() method merges all statistics
  - Provides a unified view for monitoring
Common Usage Patterns
Tracking Operations
    // Track a basic operation
    collector.TrackOperation(stats.OpPut)

    // Track an operation with latency
    startTime := time.Now()
    // ... perform operation ...
    latencyNs := uint64(time.Since(startTime).Nanoseconds())
    collector.TrackOperationWithLatency(stats.OpGet, latencyNs)

    // Track bytes processed
    collector.TrackBytes(true, uint64(len(key)+len(value))) // write
    collector.TrackBytes(false, uint64(len(value)))         // read

    // Track errors
    if err != nil {
        collector.TrackError("read_error")
    }
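The start/measure/report sequence for latency tracking can be folded into a deferred call so that every return path is timed. The sketch below is one possible shape. timeOp, latencyRecorder, recorder and get are hypothetical names, and recorder stands in for the real collector.

```go
package main

import (
	"fmt"
	"time"
)

type OperationType string

const OpGet OperationType = "get"

// latencyRecorder is the one collector method this sketch relies on.
type latencyRecorder interface {
	TrackOperationWithLatency(op OperationType, latencyNs uint64)
}

// timeOp captures the start time immediately and returns a function that
// reports the elapsed time when called, which makes it suitable for defer.
func timeOp(r latencyRecorder, op OperationType) func() {
	start := time.Now()
	return func() {
		r.TrackOperationWithLatency(op, uint64(time.Since(start).Nanoseconds()))
	}
}

// recorder is a stand-in collector that remembers what it was told.
type recorder struct {
	ops       []OperationType
	latencies []uint64
}

func (r *recorder) TrackOperationWithLatency(op OperationType, latencyNs uint64) {
	r.ops = append(r.ops, op)
	r.latencies = append(r.latencies, latencyNs)
}

// get simulates an engine read that takes about a millisecond.
func get(r latencyRecorder, key string) string {
	defer timeOp(r, OpGet)() // timeOp runs now; the returned func runs on return
	time.Sleep(time.Millisecond)
	return "value-for-" + key
}

func main() {
	r := &recorder{}
	fmt.Println(get(r, "k"))
	fmt.Println(len(r.ops), r.ops[0], r.latencies[0] > 0)
}
```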
Retrieving Statistics
    // Get all statistics
    allStats := collector.GetStats()
    fmt.Printf("Put operations: %d\n", allStats["put_ops"])
    fmt.Printf("Total bytes written: %d\n", allStats["total_bytes_written"])

    // Get filtered statistics
    txStats := collector.GetStatsFiltered("tx_")
    for k, v := range txStats {
        fmt.Printf("%s: %v\n", k, v)
    }
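Values in the returned map are typed as interface{}, so arithmetic on them needs a type assertion first. Going by the GetStats listing, counters are stored as uint64, timestamps as int64 (UnixNano), and "errors" as a map[string]uint64. A sketch of reading them safely follows. uintStat and errorCount are hypothetical helpers, and the map literal imitates GetStats output.

```go
package main

import "fmt"

// uintStat returns the uint64 stored under key. ok is false when the key
// is missing or holds a different type.
func uintStat(stats map[string]interface{}, key string) (value uint64, ok bool) {
	value, ok = stats[key].(uint64)
	return value, ok
}

// errorCount looks up one category inside the nested "errors" map and
// returns 0 when the map or the category is absent.
func errorCount(stats map[string]interface{}, errType string) uint64 {
	byType, ok := stats["errors"].(map[string]uint64)
	if !ok {
		return 0
	}
	return byType[errType]
}

func main() {
	// Stand-in for collector.GetStats().
	stats := map[string]interface{}{
		"put_ops":       uint64(7),
		"last_put_time": int64(1700000000000000000),
		"errors":        map[string]uint64{"read_error": 2},
	}

	if puts, ok := uintStat(stats, "put_ops"); ok {
		fmt.Println("puts:", puts) // prints puts: 7
	}
	fmt.Println("read errors:", errorCount(stats, "read_error")) // prints read errors: 2
}
```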
Limitations and Future Enhancements
Current Limitations
- Fixed Metric Types:
  - Predefined operation types
  - No dynamic metric definition at runtime
- Simple Aggregation:
  - Basic counters and min/max/avg latencies
  - No percentiles or histograms
- In-Memory Only:
  - No persistence of historical metrics
  - Resets on engine restart
Potential Enhancements
- Advanced Metrics:
  - Latency percentiles (e.g., p95, p99)
  - Histograms for distribution analysis
  - Moving averages for trend detection
- Time Series Support:
  - Time-bucketed statistics
  - Historical metrics retention
  - Rate calculations (operations per second)
- Metric Export:
  - Prometheus integration
  - Structured logging with metrics
  - Periodic stat dumping to files