WIP: Primary-Replica Replication experiment #1

Closed
jer wants to merge 13 commits from replication into master
Owner

I'm experimenting with adding:

  • Hooks in key areas (WAL entry, batch append, memtable flush, sstable compaction success, etc.)
  • Primary-replica replication
  • A Lamport clock to order events

The goal is to let users hook into key events in the system with their own custom logic. For replication, ferrying data to other nodes in the cluster is the priority.

Since Kevo uses a single-writer model, the primary-replica model fits best: writes go to the primary and reads can be served by a replica, so reads are not delayed by writes. I know that Kevo supports reading data during a write if the data is in the WAL, but the data won't always be in the WAL. That's also not a great solution, just a handy one for short-term unavailability due to writes.

I'm going to avoid actually sending any data over a network for now and just print things to the screen for the experiment; that is enough to understand and validate whether it's working correctly.
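
As a rough illustration of the print-only experiment described above, here is a minimal sketch of a hook registry. All names (`EventKind`, `Event`, `Hook`, `Registry`, `FormatEvent`) are hypothetical and chosen for this example; they are not taken from Kevo's code.

```go
package main

import "fmt"

// EventKind identifies a hook point. These names are hypothetical.
type EventKind string

const (
	WALEntry      EventKind = "wal_entry"
	BatchAppend   EventKind = "batch_append"
	MemtableFlush EventKind = "memtable_flush"
)

// Event is what a hook receives. Clock stands in for a Lamport timestamp.
type Event struct {
	Kind  EventKind
	Clock uint64
	Key   string
}

// Hook is user-supplied logic that runs when an event fires.
type Hook func(Event)

// Registry maps hook points to the hooks registered on them.
type Registry struct {
	hooks map[EventKind][]Hook
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{hooks: make(map[EventKind][]Hook)}
}

// On registers a hook for one kind of event.
func (r *Registry) On(kind EventKind, h Hook) {
	r.hooks[kind] = append(r.hooks[kind], h)
}

// Fire runs every hook registered for the event's kind, in registration order.
func (r *Registry) Fire(e Event) {
	for _, h := range r.hooks[e.Kind] {
		h(e)
	}
}

// FormatEvent renders an event as a single log line.
func FormatEvent(e Event) string {
	return fmt.Sprintf("[%d] %s key=%s", e.Clock, e.Kind, e.Key)
}

func main() {
	r := NewRegistry()
	// Instead of sending to a replica, just print what would be sent.
	r.On(WALEntry, func(e Event) { fmt.Println(FormatEvent(e)) })
	r.Fire(Event{Kind: WALEntry, Clock: 1, Key: "a"})
}
```

Swapping the printing hook for one that writes to a network stream would be the natural next step, without touching the code that fires events.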

jer added 2 commits 2025-04-21 04:30:33 +00:00
jer added 1 commit 2025-04-21 05:19:13 +00:00
feat: add hooks for transaction commit, sstable creation, compaction, engine lifecycle
All checks were successful
Go Tests / Run Tests (1.24.2) (pull_request) Successful in 9m36s
141b132719
jer added 1 commit 2025-04-21 05:37:12 +00:00
test: add tests around hooks
All checks were successful
Go Tests / Run Tests (1.24.2) (pull_request) Successful in 9m38s
82b4d7b338
jer added 1 commit 2025-04-21 06:48:25 +00:00
feat: remove old replication implementation and refactor the architecture
All checks were successful
Go Tests / Run Tests (1.24.2) (pull_request) Successful in 9m37s
acccb4969e
jer force-pushed replication from acccb4969e to 9fb40779c7 2025-04-26 16:03:33 +00:00 Compare
jer added 2 commits 2025-04-26 17:43:21 +00:00
feat: implement replication hook point in WAL and Lamport clocks…
Some checks failed
Go Tests / Run Tests (1.24.2) (pull_request) Failing after 5m1s
c0bfd835f7
- Add Lamport clock implementation for logical timestamps
- Define ReplicationHook interface in WAL package
- Extend WAL to use Lamport clock for timestamps
- Add notification hooks for WAL entries and batches
- Update WAL initialization to support replication
- Add tests for replication hooks and clocks
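
For readers unfamiliar with Lamport clocks, the following is a minimal sketch of the general technique, not the implementation in this commit. The type and method names (`LamportClock`, `Tick`, `Observe`) are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// LamportClock is a logical clock that is safe for concurrent use.
type LamportClock struct {
	mu   sync.Mutex
	time uint64
}

// Tick advances the clock for a local event and returns the new timestamp.
func (c *LamportClock) Tick() uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.time++
	return c.time
}

// Observe merges a timestamp received from another node.
// The local clock becomes max(local, remote) + 1.
func (c *LamportClock) Observe(remote uint64) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	if remote > c.time {
		c.time = remote
	}
	c.time++
	return c.time
}

func main() {
	var primary, replica LamportClock
	primary.Tick()       // first local write on the primary
	ts := primary.Tick() // second local write
	// The replica's clock jumps past the timestamp it received.
	fmt.Println(ts, replica.Observe(ts))
}
```

The property that matters for replication is that any event that causally follows another always carries a larger timestamp, regardless of wall-clock skew between nodes.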
jer added 6 commits 2025-04-26 19:06:38 +00:00
- Add WAL replicator component with entry capture, buffering, and subscriptions
- Implement WAL entry serialization with checksumming
- Add batch serialization for network-efficient transfers
- Implement proper concurrency control with mutex protection
- Add utility functions for entry size estimation
- Create comprehensive test suite
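
To make "entry serialization with checksumming" concrete, here is a small sketch under an assumed wire layout: a CRC-32 (IEEE) checksum, then the timestamp, the key length, the key, and the value. The layout, the `Entry` fields, and the `Encode`/`Decode` names are all hypothetical; the actual format used in this branch is not shown in the PR.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
)

// Entry is a simplified WAL entry used only for this sketch.
type Entry struct {
	Timestamp uint64
	Key       []byte
	Value     []byte
}

var (
	ErrShort    = errors.New("entry too short")
	ErrChecksum = errors.New("checksum mismatch")
)

// Assumed layout: [crc32:4][timestamp:8][keyLen:4][key][value].
const headerSize = 4 + 8 + 4

// Encode serializes an entry and prefixes it with a checksum
// computed over everything after the checksum field.
func Encode(e Entry) []byte {
	buf := make([]byte, headerSize+len(e.Key)+len(e.Value))
	binary.BigEndian.PutUint64(buf[4:12], e.Timestamp)
	binary.BigEndian.PutUint32(buf[12:16], uint32(len(e.Key)))
	copy(buf[headerSize:], e.Key)
	copy(buf[headerSize+len(e.Key):], e.Value)
	binary.BigEndian.PutUint32(buf[0:4], crc32.ChecksumIEEE(buf[4:]))
	return buf
}

// Decode verifies the checksum before trusting any other field.
// The returned Key and Value alias the input buffer.
func Decode(buf []byte) (Entry, error) {
	if len(buf) < headerSize {
		return Entry{}, ErrShort
	}
	if crc32.ChecksumIEEE(buf[4:]) != binary.BigEndian.Uint32(buf[0:4]) {
		return Entry{}, ErrChecksum
	}
	keyLen := int(binary.BigEndian.Uint32(buf[12:16]))
	if keyLen > len(buf)-headerSize {
		return Entry{}, ErrShort
	}
	return Entry{
		Timestamp: binary.BigEndian.Uint64(buf[4:12]),
		Key:       buf[headerSize : headerSize+keyLen],
		Value:     buf[headerSize+keyLen:],
	}, nil
}

func main() {
	buf := Encode(Entry{Timestamp: 7, Key: []byte("k"), Value: []byte("v")})
	e, err := Decode(buf)
	fmt.Println(e.Timestamp, string(e.Key), string(e.Value), err)

	buf[len(buf)-1] ^= 0xFF // simulate corruption in transit
	_, err = Decode(buf)
	fmt.Println(err)
}
```

Checking the checksum first means a corrupted length field can never be used to slice the buffer.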
- Add WALApplier component for applying entries on replica nodes
- Implement logical timestamp ordering with Lamport clocks
- Add support for handling out-of-order entry delivery
- Add error handling and recovery mechanisms
- Implement comprehensive testing for all applier functions
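
One common way to handle out-of-order delivery on a replica is to buffer entries that arrive early and apply them once the gap is filled. The sketch below assumes entries carry contiguous sequence numbers starting at 1; that assumption, and the `Applier` and `Receive` names, are illustrative and not taken from the WALApplier in this branch.

```go
package main

import "fmt"

// Entry is a simplified replicated WAL entry.
type Entry struct {
	Seq uint64
	Key string
}

// Applier applies entries strictly in sequence order,
// buffering any that arrive ahead of the next expected one.
type Applier struct {
	next    uint64
	pending map[uint64]Entry
	apply   func(Entry)
}

// NewApplier returns an applier that expects sequence number 1 first.
func NewApplier(apply func(Entry)) *Applier {
	return &Applier{next: 1, pending: make(map[uint64]Entry), apply: apply}
}

// Receive accepts an entry in any order and applies as many
// consecutive entries as are now available.
func (a *Applier) Receive(e Entry) {
	if e.Seq < a.next {
		return // already applied: drop the duplicate
	}
	a.pending[e.Seq] = e
	for {
		n, ok := a.pending[a.next]
		if !ok {
			return // gap: wait for the missing entry
		}
		delete(a.pending, a.next)
		a.apply(n)
		a.next++
	}
}

func main() {
	a := NewApplier(func(e Entry) { fmt.Println("applied", e.Seq, e.Key) })
	a.Receive(Entry{Seq: 2, Key: "b"}) // buffered, nothing applied yet
	a.Receive(Entry{Seq: 1, Key: "a"}) // applies 1, then 2
	a.Receive(Entry{Seq: 3, Key: "c"}) // applies 3
}
```

A real applier would also need to bound the pending buffer and request retransmission when a gap persists; both are left out here.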
Ensure proper Lamport clock integration across batch operations and sequence number handling:
- Add Lamport clock support to Batch.Write for consistent replication timestamps
- Fix potential sequence number inconsistencies in WAL operations
- Update WAL tests to properly verify sync mode behaviors
- Fix sequence numbering to appropriately handle Lamport clock timestamps
- Add replication-specific interfaces to pkg/transport
- Create ReplicationClient and ReplicationServer interfaces
- Add replication message types for WAL entries and bootstrap
- Create Protobuf schema for replication in proto/kevo/replication.proto
- Update transport registry to support replication components
feat: implement replication transport layer
All checks were successful
Go Tests / Run Tests (1.24.2) (pull_request) Successful in 9m49s
5963538bc5
This commit implements the replication transport layer as part of Phase 2 of the replication plan.
Key components include:

- Add protocol buffer definitions for replication services
- Implement WALReplicator extension for processor management
- Create replication service server implementation
- Add replication client and server transport implementations
- Implement storage snapshot interface for bootstrap operations
- Standardize package naming across replication components
jer added 4 commits 2025-04-26 21:56:35 +00:00
This commit adds comprehensive reliability features to the replication transport layer:

- Add retry logic with exponential backoff for all network operations
- Implement circuit breaker pattern to prevent cascading failures
- Add reconnection handling with automatic recovery
- Implement proper timeout handling for all network operations
- Add comprehensive logging for connection issues
- Improve error handling with temporary error classification
- Enhance stream processing with automatic recovery
- Add checksums for WAL entries and WAL entry batches
- Implement robust retry and circuit breaker patterns for reliability
- Add comprehensive tests for message processing and reliability features
- Enhance error handling and timeout management
- Add access control system for replica authorization
- Implement persistence of replica information
- Add stale replica detection
- Create comprehensive tests for replica registration
- Update ReplicationServiceServer to use new components
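
The retry behaviour listed in the reliability commits above can be illustrated with a minimal sketch of exponential backoff. The `Retry` signature, the injectable `sleep` function, and the `Recorder` helper are assumptions for this example; jitter, error classification, and the circuit breaker are deliberately left out.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrTemporary stands in for a retryable network error.
var ErrTemporary = errors.New("temporary failure")

// Recorder captures requested delays instead of sleeping,
// which keeps the example fast and deterministic.
type Recorder struct {
	Delays []time.Duration
}

// Sleep records the delay that would have been slept.
func (r *Recorder) Sleep(d time.Duration) {
	r.Delays = append(r.Delays, d)
}

// Retry calls op up to attempts times, doubling the delay after
// each failure. It returns nil on the first success, or the last
// error once all attempts are used.
func Retry(attempts int, base time.Duration, sleep func(time.Duration), op func() error) error {
	var err error
	delay := base
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i < attempts-1 {
			sleep(delay)
			delay *= 2
		}
	}
	return err
}

func main() {
	calls := 0
	rec := &Recorder{}
	err := Retry(5, 100*time.Millisecond, rec.Sleep, func() error {
		calls++
		if calls < 3 {
			return ErrTemporary // fail the first two attempts
		}
		return nil
	})
	fmt.Println(calls, rec.Delays, err)
}
```

In real use, `time.Sleep` would be passed as the `sleep` argument; injecting it is only a convenience for testing.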
refactor: improve bootstrap API with proper interface-based design for testing
All checks were successful
Go Tests / Run Tests (1.24.2) (pull_request) Successful in 9m50s
374d0dde65
- Convert anonymous interface dependency in BootstrapManager to the new EntryApplier interface
- Update service layer code to use interfaces instead of concrete types
- Fix tests to properly verify bootstrap behavior
- Extend test coverage with proper root cause analysis for failing tests
- Fix persistence tests in replica_registration to explicitly handle delayed persistence
Author
Owner

Closing this, will favour a new PR with a different approach as this one got very unwieldy.
jer closed this pull request 2025-04-27 22:18:02 +00:00
jer deleted branch replication 2025-04-27 22:18:05 +00:00

Reference: jer/kevo#1