Day 12/21: System Design (Database Replication)

What is Database Replication?

4 min read · Mar 21, 2025


Database replication is the process of copying and maintaining database records across multiple servers to improve availability, fault tolerance, and performance.

Why is Replication Important?

  • High Availability — If one database fails, another replica can take over.
  • Scalability — Distribute read queries across multiple replicas.
  • Fault Tolerance — Prevents data loss by keeping multiple copies.
  • Lower Latency — Users in different regions can access a nearby replica for faster responses.

Types of Database Replication

A. Master-Slave (Primary-Replica) Replication:

  • A single master (primary) handles all write operations.
  • Read queries can be distributed across multiple slave (replica) nodes.
  • MySQL replication uses a master to write data and replicas to handle read queries.
  • If the master fails, manual failover or automatic election is needed.
  • Replication lag — Slaves might not always have the latest data.
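
The read/write split above can be sketched as a toy router. `ReplicatedDB`, `primary`, and `replicas` are hypothetical names, and plain dicts stand in for real database connections; this is a minimal illustration, not a driver API:

```python
import random

class ReplicatedDB:
    """Toy read/write splitter for master-slave replication."""

    def __init__(self, primary, replicas):
        self.primary = primary      # the single write node (master)
        self.replicas = replicas    # read-only copies (slaves)

    def write(self, key, value):
        # All writes go through the single primary.
        self.primary[key] = value
        # Replication to the replicas; done inline here for simplicity,
        # whereas real asynchronous replication would lag behind.
        for replica in self.replicas:
            replica[key] = value

    def read(self, key):
        # Reads are load-balanced across replicas to scale throughput.
        return random.choice(self.replicas).get(key)

db = ReplicatedDB(primary={}, replicas=[{}, {}])
db.write("user:1", "Ankit")
print(db.read("user:1"))  # "Ankit"
```

In a real deployment the router must also handle replication lag: a read issued right after a write may hit a replica that has not applied it yet.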

B. Master-Master Replication:

  • Multiple masters handle both read and write operations.
  • Used in multi-region deployments.
  • Cassandra and CockroachDB support master-master replication.
  • Data conflicts — If two masters update the same row, conflict resolution is needed.
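
One common conflict-resolution policy is last-write-wins (LWW), which Cassandra applies per column. A minimal sketch, where each version is a hypothetical `(timestamp, value)` pair:

```python
def resolve_last_write_wins(version_a, version_b):
    """Pick the version with the later timestamp (last-write-wins).

    Simple and deterministic, but concurrent updates can be silently
    lost -- the losing write disappears.
    """
    return version_a if version_a[0] >= version_b[0] else version_b

# Two masters accepted conflicting updates to the same row:
us_write = (1710000005, "alice@new.example")  # later write
eu_write = (1710000002, "alice@old.example")  # earlier write
winner = resolve_last_write_wins(us_write, eu_write)
print(winner[1])  # alice@new.example
```

Systems that cannot tolerate lost updates use richer schemes instead, such as vector clocks or application-level merge functions.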

C. Peer-to-Peer Replication:

  • All nodes are equal and can handle read and write requests.
  • No single point of failure.
  • DynamoDB, Apache Cassandra use peer-to-peer replication.
  • Higher network overhead since every node must sync changes.

D. Logical vs. Physical Replication

  • Logical Replication — Individual changes (rows, statements) are replicated, so you can replicate specific tables.
  • Examples: PostgreSQL logical replication, MySQL binlog replication (row- or statement-based).
  • Physical Replication — The database files are replicated byte-for-byte, typically by shipping the write-ahead log.
  • Example: PostgreSQL streaming (WAL) replication.

Replication Strategies

A. Synchronous vs. Asynchronous Replication

  • Synchronous: A write is confirmed only after all replicas have applied it.
  • Stronger consistency but higher write latency.
  • Example: Financial systems where data accuracy is critical.
  • Asynchronous: The primary confirms the write immediately and replicates in the background.
  • Faster, but replicas may lag behind (replication lag).
  • Example: Social media platforms like Twitter (eventual consistency is acceptable).
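
The difference is easiest to see in a toy sketch. The function names and the dict-based "replicas" are illustrative assumptions, not a real API:

```python
def synchronous_write(primary, replicas, key, value):
    """Acknowledge only after every replica has applied the write."""
    primary[key] = value
    for replica in replicas:
        replica[key] = value      # wait for each replica (slower, consistent)
    return "ack"

def asynchronous_write(primary, replicas, key, value, queue):
    """Acknowledge after the primary alone; replicate in the background."""
    primary[key] = value
    queue.append((key, value))    # deferred -- replicas lag behind
    return "ack"

primary, replicas, queue = {}, [{}], []
asynchronous_write(primary, replicas, "x", 1, queue)
print(replicas[0].get("x"))       # None: the replica hasn't caught up yet
synchronous_write(primary, replicas, "y", 2)
print(replicas[0].get("y"))       # 2: the replica is current before the ack
```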

B. Quorum-Based Replication

  • Used in distributed databases (e.g., Cassandra, MongoDB).
  • Requires a majority of nodes to acknowledge a write before considering it successful.
  • Example: If a system has 5 nodes and the quorum is set to 3, at least 3 nodes must confirm a write.
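
The standard Dynamo-style formulation is: with N replicas, write quorum W, and read quorum R, choosing R + W > N guarantees that every read quorum overlaps every write quorum. A minimal check of that property:

```python
def write_succeeds(acks, w):
    """A quorum write succeeds once `w` replicas acknowledge it."""
    return acks >= w

def quorums_overlap(n, w, r):
    """With R + W > N, every read quorum overlaps every write quorum,
    so a read is guaranteed to reach at least one up-to-date replica."""
    return r + w > n

# 5 nodes with write quorum 3 (the example above) and read quorum 3:
assert write_succeeds(acks=3, w=3)
assert quorums_overlap(n=5, w=3, r=3)
# W=1, R=1 on 5 nodes is fast but does NOT guarantee overlap:
assert not quorums_overlap(n=5, w=1, r=1)
```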

Challenges in Database Replication

  • Replication Lag — Slaves may fall behind the master.
  • Data Conflicts — In multi-master replication, conflicting writes need resolution.
  • Network Overhead — More replicas mean more data synchronization traffic.
  • Failover Handling — When a master fails, automatic failover should promote a replica.

Strong Consistency vs. Eventual Consistency

  • Strong Consistency: Every read receives the most recent write.
  • Ensured by synchronous replication and techniques like Paxos, Raft.
  • Used in banking transactions, stock trading where correctness is critical.
  • Eventual Consistency: Reads may return stale data, but all replicas eventually converge.
  • Used in social media feeds, caching layers, NoSQL databases (Cassandra, DynamoDB).

Advanced Replication Techniques

A. Multi-Region Replication

  • Reduces latency for global users by having replicas in different regions.
  • Improves disaster recovery (if a region fails, another takes over).
  • Examples:
  • Google Spanner — Uses TrueTime API to synchronize global replicas.
  • AWS Aurora Global Databases — Replicates data across multiple AWS regions.

B. Change Data Capture (CDC)

  • Captures every change in the database (INSERT, UPDATE, DELETE) and propagates it to other systems (analytics, caching, data warehouses).
  • Used for: streaming architectures (Kafka, Debezium), keeping NoSQL stores in sync with SQL databases, and real-time analytics.
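
A CDC consumer replays each captured change, in order, into a downstream store. The event shape below is a simplified assumption loosely modeled on Debezium-style change records, not Debezium's actual schema:

```python
# Hypothetical change events captured from the source database.
events = [
    {"op": "INSERT", "key": 1, "after": {"name": "Ankit"}},
    {"op": "UPDATE", "key": 1, "after": {"name": "Ankit K"}},
    {"op": "DELETE", "key": 1, "after": None},
]

def apply_cdc_event(store, event):
    """Keep a downstream copy (cache, NoSQL store, warehouse) in sync
    by applying each captured change in commit order."""
    if event["op"] == "DELETE":
        store.pop(event["key"], None)
    else:  # INSERT and UPDATE both upsert the latest row image
        store[event["key"]] = event["after"]

downstream = {}
for e in events:
    apply_cdc_event(downstream, e)
print(downstream)  # {} -- the final DELETE removed the row
```

Ordering matters: replaying the same events out of order would leave the downstream copy inconsistent with the source.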

C. Leaderless Replication (Dynamo-Style)

  • No single master; nodes coordinate writes using quorums (the N/W/R model).
  • Used in highly available distributed NoSQL systems like Amazon DynamoDB, Apache Cassandra.
  • Trade-offs: Better availability but requires conflict resolution strategies.

D. Read-Your-Writes Consistency

  • Ensures that a user always reads their latest writes, even in eventually consistent systems.
  • Implemented using session consistency or client-side tracking.
  • Example: Facebook’s post updates — You always see your own post instantly.
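
One way to implement this is client-side version tracking: the session remembers the highest version it wrote and refuses to read from a replica that is still behind it. All names here (`SessionConsistentStore`, `last_written`, `replica_version`) are illustrative assumptions:

```python
class SessionConsistentStore:
    """Read-your-writes via client-side version tracking."""

    def __init__(self):
        self.version = 0
        self.data = {}

    def write(self, session, key, value):
        self.version += 1
        self.data[key] = (self.version, value)
        session["last_written"] = self.version  # client remembers its write

    def read(self, session, key, replica_version):
        # Reject replicas that haven't caught up to this session's writes.
        if replica_version < session.get("last_written", 0):
            raise RuntimeError("replica too stale; retry on another node")
        return self.data[key][1]

store = SessionConsistentStore()
session = {}
store.write(session, "post:1", "hello")
print(store.read(session, "post:1", replica_version=1))  # "hello"
```

A read routed to a replica at version 0 would raise and be retried elsewhere, so the user never sees a feed missing their own post.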

Advanced Database Partitioning (Sharding)

A. Sharding Strategies

  • Hash-Based Sharding — Data is split based on a hash function.
  • Range-Based Sharding — Data is divided based on value ranges (e.g., user IDs 1–1000, 1001–2000).
  • Directory-Based Sharding — A lookup table is used to find the correct shard.
  • Geographical Sharding — Different regions have their own data partitions.
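
The first two strategies can be sketched in a few lines; the function names and shard layout are illustrative assumptions:

```python
import hashlib

def hash_shard(key, num_shards):
    """Hash-based sharding: a stable hash spreads keys evenly."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % num_shards

def range_shard(user_id, ranges):
    """Range-based sharding: each shard owns a contiguous ID range."""
    for shard, (low, high) in enumerate(ranges):
        if low <= user_id <= high:
            return shard
    raise ValueError("no shard covers this id")

# User 1500 lands in the 1001-2000 range, i.e. shard 1:
print(range_shard(1500, [(1, 1000), (1001, 2000)]))  # 1
print(hash_shard("user:42", 4))                      # a shard in 0..3
```

Hash sharding balances load but makes range scans expensive; range sharding keeps related keys together but risks hotspots on popular ranges.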

B. Resharding Challenges

  • Data movement overhead when adding/removing shards.
  • Skewed workloads if some shards are overloaded.
  • Rebalancing techniques such as consistent hashing minimize how much data must move.
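
A minimal consistent-hash ring, sketched without virtual nodes (real implementations add those to smooth the distribution); class and shard names are illustrative:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Each key is owned by the first node clockwise from its hash, so
    adding or removing a shard only remaps the keys in one slice of the
    ring instead of rehashing the whole keyspace."""

    def __init__(self, nodes=()):
        self.ring = []  # sorted (hash, node) pairs
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add(self, node):
        bisect.insort(self.ring, (self._hash(node), node))

    def get(self, key):
        h = self._hash(key)
        # First node at or after the key's position, wrapping around.
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["shard-a", "shard-b", "shard-c"])
owner = ring.get("user:42")
ring.add("shard-d")  # only keys in shard-d's slice of the ring move
```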

Distributed Transactions and Concurrency Control

A. Two-Phase Commit (2PC)

  • Used in distributed databases to ensure ACID properties.
  • Prepare Phase — All nodes confirm they are ready to commit.
  • Commit Phase — If all nodes agree, the transaction is committed.
  • Slow, because the coordinator must wait for every participant to respond.
  • The coordinator is a single point of failure: if it crashes between phases, participants stay blocked.
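
The two phases can be sketched as a toy coordinator; `Node`, `prepare`, `commit`, and `abort` are hypothetical names, not a real transaction-manager API:

```python
def two_phase_commit(participants):
    """Toy 2PC coordinator: commit only if every participant votes yes."""
    # Phase 1 (prepare): each participant votes whether it can commit.
    if all(p.prepare() for p in participants):
        # Phase 2 (commit): everyone voted yes, so commit on all nodes.
        for p in participants:
            p.commit()
        return "committed"
    # Any "no" vote aborts the whole transaction on every node.
    for p in participants:
        p.abort()
    return "aborted"

class Node:
    def __init__(self, can_commit=True):
        self.can_commit, self.state = can_commit, "idle"
    def prepare(self):
        self.state = "prepared" if self.can_commit else "abort-voted"
        return self.can_commit
    def commit(self):
        self.state = "committed"
    def abort(self):
        self.state = "aborted"

print(two_phase_commit([Node(), Node()]))                  # committed
print(two_phase_commit([Node(), Node(can_commit=False)]))  # aborted
```

The sketch also shows the blocking problem: if the coordinator died between the two phases, prepared participants would be stuck holding locks with no one to tell them the outcome.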

B. Three-Phase Commit (3PC)

  • Similar to 2PC but avoids blocking in case of failures by adding a pre-commit phase.
  • Better fault tolerance than 2PC but more network overhead.

C. Distributed Concurrency Control

  • MVCC (Multi-Version Concurrency Control) — Used in PostgreSQL, ensures non-blocking reads.
  • Pessimistic vs. Optimistic Locking — Trade-offs between avoiding conflicts vs. reducing lock contention.
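
Optimistic locking is usually implemented as a version check (compare-and-swap): an update succeeds only if the row's version is unchanged since the client read it. A minimal sketch with an illustrative `VersionedRow` class:

```python
class VersionedRow:
    """Optimistic concurrency control via a per-row version counter."""

    def __init__(self, value):
        self.value, self.version = value, 1

    def update(self, new_value, expected_version):
        # Reject the write if another transaction committed first.
        if self.version != expected_version:
            return False  # caller must re-read and retry
        self.value = new_value
        self.version += 1
        return True

row = VersionedRow("balance=100")
v = row.version  # two clients both read version 1
assert row.update("balance=90", expected_version=v)      # first writer wins
assert not row.update("balance=80", expected_version=v)  # stale; rejected
```

This avoids holding locks during the read-modify-write cycle, at the cost of retries under contention; pessimistic locking makes the opposite trade.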

High-Availability Database Architectures

A. Active-Active vs. Active-Passive Failover

  • Active-Active: Multiple database instances handle read & write requests.
  • Used in Cassandra, Google Spanner.
  • Active-Passive: Only the primary handles writes, secondary is a standby.
  • Used in MySQL replication, PostgreSQL streaming replication.

B. Raft and Paxos (Consensus Algorithms)

  • Used for leader election and ensuring consistency in distributed databases.
  • Raft is simpler and more commonly used (etcd, CockroachDB).
  • Paxos is harder to implement but more robust (Google Chubby, Spanner).

C. Read Replicas for High Throughput

  • Scaling reads using read-only replicas.
  • Amazon RDS allows multiple read replicas to distribute traffic.
  • PostgreSQL allows streaming replication for read-heavy workloads.

Understanding advanced replication, consistency models, and high-availability architectures is critical for designing scalable and reliable database systems. Choosing the right trade-offs between performance, availability, and consistency depends on use case requirements and business needs.

I’ll keep posting daily, staying consistent in both my learning and my daily pushups. Thank you!

Follow my journey:
Medium: https://ankittk.medium.com/
Instagram: https://www.instagram.com/ankitengram/

Ankit Kumar