Day 5/21: System Design (Sharding)
Database sharding is a technique for splitting a large database into smaller, faster, and more manageable parts. As a system grows, a single database server eventually struggles with high traffic, large storage, and heavy queries, so sharding distributes the load across multiple servers.
1. What is Database Sharding?
- Sharding is the process of breaking a large database into smaller pieces (shards).
- Each shard contains a subset of data and is stored on a separate database instance.
- Sharding improves performance and scalability, and limits the impact of a single failure, in large-scale applications.
Example:
- Instagram stores user profiles in multiple shards to handle billions of users efficiently.
2. Why is Sharding Needed?
Performance Bottlenecks
- A single database server becomes a bottleneck once it has to serve millions of read/write operations.
- Sharding reduces query execution time by distributing the load.
Scalability Issues
- Traditional databases have hardware limitations (CPU, RAM, storage).
- Sharding scales horizontally by adding more database servers instead of upgrading a single machine.
High Availability & Fault Tolerance
- If one shard fails, only the data on that shard is affected.
- The rest of the system remains operational during the failure.
Cost Optimization
- Instead of upgrading to one expensive high-performance machine, multiple cheaper servers can be used.
3. Types of Sharding
Range-Based Sharding (a form of Horizontal Sharding)
- Data is divided based on ranges of values (e.g., User IDs 1–10,000, 10,001–20,000).
- Each shard contains a subset of rows from the table.
Example:
- User data in a social media app:
- Shard 1: Users 1–100,000
- Shard 2: Users 100,001–200,000
Pros:
- Simple to implement.
- Queries can be optimized for a specific range.
Cons:
- Load is uneven if some ranges are more popular (e.g., newer users tend to be more active).
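Here is a minimal sketch of range-based routing in Python. The shard names and upper bounds are made up for illustration; a real system would load them from configuration.

```python
from bisect import bisect_left

# Hypothetical layout: each entry is the highest user ID a shard holds.
RANGE_UPPER_BOUNDS = [100_000, 200_000, 300_000]
SHARD_NAMES = ["shard_1", "shard_2", "shard_3"]


def shard_for_user(user_id: int) -> str:
    """Return the name of the shard whose ID range contains user_id."""
    if user_id < 1:
        raise ValueError("user IDs start at 1")
    # bisect_left finds the first upper bound that is >= user_id.
    index = bisect_left(RANGE_UPPER_BOUNDS, user_id)
    if index == len(RANGE_UPPER_BOUNDS):
        raise KeyError(f"no shard covers user ID {user_id}")
    return SHARD_NAMES[index]


print(shard_for_user(100_000))  # last ID in the first range
print(shard_for_user(100_001))  # first ID in the second range
```

Because the ranges are sorted, the lookup is a binary search, and a range query such as "users 150,000 to 160,000" only needs to touch one shard.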
Vertical Sharding (Feature-Based Sharding)
- Data is split by feature, table, or group of columns, so different kinds of data live on different database instances.
Example:
- An e-commerce website stores:
- Shard 1: User details
- Shard 2: Order history
- Shard 3: Payment transactions
Pros:
- Queries that touch a single feature stay simple and hit only one shard.
- Optimized storage and indexing.
Cons:
- Joins between shards are complex because they cannot run inside one database and usually move into application code.
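A rough Python sketch of the e-commerce example above. The table-to-shard mapping and the in-memory dictionaries standing in for database instances are assumptions for illustration only.

```python
# Hypothetical mapping from feature (table) to the shard that owns it.
TABLE_TO_SHARD = {
    "users": "shard_1",
    "orders": "shard_2",
    "payments": "shard_3",
}

# In-memory stand-ins for the three database instances.
SHARDS = {
    "shard_1": {"users": {7: {"name": "Asha"}}},
    "shard_2": {"orders": {7: [{"order_id": 501, "total": 42.0}]}},
    "shard_3": {"payments": {7: [{"order_id": 501, "status": "paid"}]}},
}


def fetch(table: str, user_id: int):
    """Read one user's rows for a table from the shard that owns that table."""
    shard = SHARDS[TABLE_TO_SHARD[table]]
    return shard[table].get(user_id)


def user_with_orders(user_id: int) -> dict:
    """Application-side 'join': two separate reads, merged in code."""
    return {
        "user": fetch("users", user_id),
        "orders": fetch("orders", user_id) or [],
    }


print(user_with_orders(7))
```

The `user_with_orders` function shows the cost listed under Cons: what would be one SQL join on a single database becomes two reads plus a merge in the application.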
Hash-Based Sharding
- A hash function assigns data to shards.
- Spreads keys roughly evenly across shards, which reduces hotspots (a single very hot key can still overload its shard).
Example:
- A service like Twitter sharding users by a hash of the user ID.
Pros:
- Balances load across shards.
- Avoids overloading specific servers.
Cons:
- Range queries are harder because neighboring keys end up on different shards.
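A minimal sketch of hash-based placement in Python, assuming 4 shards and string keys. The function name and shard count are illustrative, not taken from any particular system.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4  # assumed shard count


def shard_for_key(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard index using a stable hash."""
    # Python's built-in hash() is randomized per process for strings,
    # so a digest is used to keep the mapping stable across restarts.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Count how 10,000 sequential user IDs spread over the shards.
counts = Counter(shard_for_key(f"user:{i}") for i in range(10_000))
print(sorted(counts.items()))
```

Sequential IDs scatter across all shards, which is what balances the load, and also why a query for "users 1 to 1,000" has to ask every shard.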
Directory-Based Sharding
- A lookup table maps records to shards instead of using fixed logic.
Example:
- Custom routing for specific business customers in a SaaS product.
Pros:
- Full control over data placement.
Cons:
- The directory can become a single point of failure.
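A small Python sketch of a shard directory for the SaaS example. The class, tenant IDs, and shard names are hypothetical; in practice the table would live in a replicated store so it does not become the single point of failure mentioned above.

```python
class ShardDirectory:
    """Lookup table from tenant ID to shard name, with a default shard."""

    def __init__(self, default_shard: str) -> None:
        self._default_shard = default_shard
        self._assignments = {}  # tenant_id -> shard name

    def assign(self, tenant_id: str, shard: str) -> None:
        """Pin a tenant to a specific shard."""
        self._assignments[tenant_id] = shard

    def lookup(self, tenant_id: str) -> str:
        """Return the tenant's shard, or the default if it was never pinned."""
        return self._assignments.get(tenant_id, self._default_shard)


directory = ShardDirectory(default_shard="shared_1")
directory.assign("acme-corp", "dedicated_1")  # large customer gets its own shard

print(directory.lookup("acme-corp"))
print(directory.lookup("small-startup"))
```

Moving a tenant is just a matter of copying its data and then calling `assign` again, which is the "full control over data placement" listed under Pros.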
4. Challenges of Sharding
Complex Querying
- Queries that need data from multiple shards must fan out to each shard and merge the results (or perform cross-shard joins), which adds latency.
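To make the cost concrete, here is a sketch of a "top N by score" query over three shards in Python. The shard contents are invented in-memory lists; a real system would issue one network query per shard.

```python
import heapq

# In-memory stand-ins for three shards, each holding (user_id, score) rows.
SHARDS = [
    [(1, 50), (4, 90)],
    [(2, 75), (5, 20)],
    [(3, 95), (6, 60)],
]


def top_n_on_shard(rows, n):
    """What a single shard can answer on its own."""
    return heapq.nlargest(n, rows, key=lambda row: row[1])


def top_n_across_shards(shards, n):
    """Scatter the query to every shard, then merge the partial results."""
    partial = []
    for rows in shards:
        partial.extend(top_n_on_shard(rows, n))
    return heapq.nlargest(n, partial, key=lambda row: row[1])


print(top_n_across_shards(SHARDS, 2))  # [(3, 95), (4, 90)]
```

Every shard has to be asked, and the slowest shard sets the response time, which is why cross-shard queries are best kept off the hot path.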
Rebalancing Issues
- If one shard grows too large, resharding is difficult and expensive.
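One way to see why resharding is expensive: assuming keys are placed with a simple `key % shard_count` rule (an assumption for this sketch, not the only option), changing the shard count relocates most of the data.

```python
def fraction_moved(keys, old_count: int, new_count: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes,
    assuming placement by key % shard_count."""
    keys = list(keys)
    moved = sum(1 for key in keys if key % old_count != key % new_count)
    return moved / len(keys)


print(fraction_moved(range(100_000), 4, 5))  # 0.8
print(fraction_moved(range(100_000), 4, 8))  # 0.5
```

Going from 4 to 5 shards moves 80% of the keys, while doubling from 4 to 8 moves only half, which is one reason shard counts are often grown by doubling.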
Increased Maintenance
- More shards mean more operational complexity in managing failures and backups.
Referential Integrity Issues
- Foreign key constraints become difficult across shards.
5. Sharding in Large-Scale Systems
Sharding in MySQL
- Standard MySQL (InnoDB) does not have built-in sharding, but it can be implemented using proxy or middleware layers (for example, Vitess) or application logic.
- Example: E-commerce applications using MySQL for order history sharding.
Sharding in MongoDB
- MongoDB supports automatic sharding, distributing documents across multiple nodes.
- Example: Healthcare systems storing patient records in MongoDB shards.
Sharding in PostgreSQL
- PostgreSQL does not shard automatically, but table partitioning combined with foreign data wrappers (postgres_fdw) can place partitions on different servers.
- Example: Fintech companies storing transaction data in PostgreSQL shards.
6. Resharding: Handling Growth
When a shard becomes too large, resharding is required to rebalance the load.
Resharding Strategies:
- Add New Shards & Reassign Data: Move some data to a new shard.
- Split Large Shards: Divide an oversized shard into smaller ones.
- Migrate Data to a Different Database Architecture: Switch to a more scalable database solution.
Example:
- Amazon DynamoDB automatically splits its partitions as data size and throughput grow.
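The "Split Large Shards" strategy can be sketched in a few lines of Python. This assumes a range-based shard whose rows are keyed by integer ID and splits it at the median key; the function and data are illustrative only.

```python
def split_shard(rows: dict):
    """Split a shard's rows (keyed by integer ID) into two halves at the median key.

    Returns (lower_rows, upper_rows, split_key). Keys <= split_key stay in the
    lower shard; the rest move to the new upper shard.
    """
    if len(rows) < 2:
        raise ValueError("need at least two rows to split")
    ordered = sorted(rows)
    split_key = ordered[(len(ordered) - 1) // 2]
    lower = {k: v for k, v in rows.items() if k <= split_key}
    upper = {k: v for k, v in rows.items() if k > split_key}
    return lower, upper, split_key


big_shard = {i: f"user{i}" for i in range(1, 11)}
lower, upper, split_key = split_shard(big_shard)
print(split_key, len(lower), len(upper))  # 5 5 5
```

Splitting at the median key keeps the two new shards the same size by row count; a production split would also have to copy the data while writes keep arriving and then update the routing rules.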
7. Best Practices for Sharding
- Use a Sharding Strategy that Fits Your Data Model. Choose range-based, hash-based, or directory-based sharding based on query patterns.
- Monitor and Balance Shards Regularly. Prevent overloading a single shard by distributing data evenly.
- Optimize Indexing Per Shard. Each shard should have its own optimized indexes to speed up queries.
- Implement a Query Routing Layer. Use a middleware or proxy layer to direct queries to the correct shard.
- Back Up Shards Separately. Each shard should have independent backup and recovery plans.
- Plan for Resharding from the Start. Design the system to easily move data between shards when needed.
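For the monitoring point, a simple check is to flag any shard whose row count is well above the average. The 1.5x threshold and the shard names below are arbitrary assumptions for this sketch.

```python
def overloaded_shards(row_counts: dict, tolerance: float = 1.5) -> list:
    """Return names of shards holding more than tolerance x the mean row count."""
    mean = sum(row_counts.values()) / len(row_counts)
    return sorted(
        name for name, count in row_counts.items() if count > tolerance * mean
    )


counts = {"shard_1": 100, "shard_2": 120, "shard_3": 400}
print(overloaded_shards(counts))  # ['shard_3']
```

Row count is only one signal; the same check can be run on query rate or disk usage per shard.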
A well-designed sharding strategy balances data distribution, avoids bottlenecks, and ensures system reliability.
I’ll be posting daily to stay consistent with both my learning and my daily pushups. Thank you!
Follow my journey:
Medium: https://ankittk.medium.com/
Instagram: https://www.instagram.com/ankitengram/