Database Sharding

Database sharding is a technique used to horizontally partition a database into smaller, more manageable pieces called shards. Each shard contains a subset of the data and is stored on a separate server. This approach can improve performance and scalability of large databases.

Written by Perlego with AI-assistance

2 Key excerpts on "Database Sharding"

eBook - ePub
Parallel Computing Architectures and APIs
IoT Big Data Stream Processing
- Vivek Kale (Author)
- 2019 (Publication Date)
- Chapman and Hall/CRC (Publisher)
To increase the transaction throughput of a database, it is possible to keep multiple copies of the database. A common replication method is master–slave replication. The master and slave databases are replicas of each other. All writes go to the master, and the master keeps the slaves in sync. Reads, however, can be distributed to any database. Since this configuration distributes the reads among multiple databases, it is a good technology for read-intensive workloads. For write-intensive workloads, it is possible to have multiple masters, but ensuring consistency when multiple processes update different replicas simultaneously is a complex problem. Additionally, because every write must reach all masters and the masters must be synchronized rapidly, the time to write increases, and this synchronization overhead becomes a limiting factor.
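The read/write split described above can be sketched in a few lines of Python. This is a minimal illustration, not a real database driver: the class name `ReplicatedStore`, the use of in-memory dictionaries as stand-ins for database servers, and the synchronous copy to every slave are all assumptions made for the example.

```python
import itertools
from typing import Dict, List, Optional, Tuple


class ReplicatedStore:
    """Toy master-slave setup; plain dicts stand in for database servers."""

    def __init__(self, num_slaves: int) -> None:
        self.master: Dict[str, str] = {}
        self.slaves: List[Dict[str, str]] = [{} for _ in range(num_slaves)]
        # Rotate over slave indexes so that reads are spread evenly.
        self._next_slave = itertools.cycle(range(num_slaves))

    def write(self, key: str, value: str) -> None:
        # All writes go to the master, which then brings every slave in sync.
        self.master[key] = value
        for slave in self.slaves:
            slave[key] = value  # synchronous copy: a simplifying assumption

    def read(self, key: str) -> Tuple[int, Optional[str]]:
        # Reads can be served by any replica; here they rotate over the slaves.
        index = next(self._next_slave)
        return index, self.slaves[index].get(key)


store = ReplicatedStore(num_slaves=2)
store.write("order:1", "paid")
first_read = store.read("order:1")   # served by slave 0
second_read = store.read("order:1")  # served by slave 1
```

Every write touches the master and all slaves, while each read touches a single slave, which is why the arrangement favors read-intensive workloads.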

19.2.3 Row Partitioning or Sharding

In cloud technology, sharding is used to refer to the technique of partitioning a table among multiple independent databases by row. However, the partitioning of data by row in relational databases is not new and is referred to as horizontal partitioning in parallel database technology. The distinction between sharding and horizontal partitioning is that horizontal partitioning is done transparently to the application by the database, whereas sharding is explicit partitioning done by the application. However, the two techniques have started converging, since traditional database vendors have started offering support for more sophisticated partitioning strategies. Since sharding is similar to horizontal partitioning, we will first discuss different horizontal partitioning techniques. It can be seen that a good sharding technique depends on both the organization of the data and the type of queries expected.
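To make "explicit partitioning done by the application" concrete, the following sketch shows application code choosing the target database for a row itself. The range-based rule, the boundary values, and the database names `db0` to `db3` are hypothetical and chosen only for illustration.

```python
import bisect

# Hypothetical exclusive upper bounds of the customer-id range in each database.
RANGE_UPPER_BOUNDS = [1000, 2000, 3000]
# One more database than bounds: the last one takes every id >= 3000.
DATABASES = ["db0", "db1", "db2", "db3"]


def database_for(customer_id: int) -> str:
    """Return the database the application should send this row to."""
    # bisect_right gives the index of the first bound greater than customer_id.
    return DATABASES[bisect.bisect_right(RANGE_UPPER_BOUNDS, customer_id)]


targets = [database_for(cid) for cid in (42, 1000, 2999, 3000)]
```

With horizontal partitioning handled by the database, the application would issue the same query regardless of `customer_id`; with sharding, a routing step like `database_for` lives in application code.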
The different techniques of sharding are as follows:

1. Round-robin partitioning: The round-robin method distributes the rows in a round-robin fashion over different databases. As an example, we could partition the transaction table into multiple databases so that the first transaction is stored in the first database, the second in the second database, and so on. The advantage of round-robin partitioning is its simplicity. Its disadvantage is that it loses the association between related records: rows that belong together (say, all transactions of one customer) end up scattered, so a query for them must be sent to all databases. Hash partitioning and range partitioning do not suffer from the disadvantage of losing record associations.
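The difference between round-robin and hash partitioning can be seen in a small sketch. The row layout `(transaction_id, customer)`, the function names, and the choice of `zlib.crc32` as the hash function are assumptions made for this example; the excerpt does not prescribe any of them.

```python
import zlib
from typing import List, Tuple

Row = Tuple[int, str]  # (transaction_id, customer), a hypothetical layout


def round_robin_partition(rows: List[Row], num_dbs: int) -> List[List[Row]]:
    """Send the first row to database 0, the second to database 1, and so on."""
    shards: List[List[Row]] = [[] for _ in range(num_dbs)]
    for position, row in enumerate(rows):
        shards[position % num_dbs].append(row)
    return shards


def shard_for(customer: str, num_dbs: int) -> int:
    # crc32 is stable across runs, unlike Python's built-in hash() on strings.
    return zlib.crc32(customer.encode("utf-8")) % num_dbs


def hash_partition(rows: List[Row], num_dbs: int) -> List[List[Row]]:
    """Send each row to the database selected by hashing its customer."""
    shards: List[List[Row]] = [[] for _ in range(num_dbs)]
    for row in rows:
        shards[shard_for(row[1], num_dbs)].append(row)
    return shards


def shards_holding(shards: List[List[Row]], customer: str) -> List[int]:
    """List the databases that contain at least one row for this customer."""
    return [i for i, shard in enumerate(shards)
            if any(name == customer for _, name in shard)]


rows = [(1, "alice"), (2, "bob"), (3, "alice"),
        (4, "bob"), (5, "alice"), (6, "carol")]
rr_shards = round_robin_partition(rows, 3)
hash_shards = hash_partition(rows, 3)
rr_hits = shards_holding(rr_shards, "alice")      # alice is on every database
hash_hits = shards_holding(hash_shards, "alice")  # alice is on exactly one
```

Under round-robin, a query for one customer's transactions has to visit every database. Under hash partitioning, `shard_for` identifies the single database to query.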
eBook - ePub
Big Data
Concepts, Technology, and Architecture
- Balamurugan Balusamy, Nandhini Abirami R, Seifedine Kadry, Amir H. Gandomi (Authors)
- 2021 (Publication Date)
- Wiley (Publisher)
Sharding is the process of partitioning very large data sets into smaller and easily manageable chunks called shards. The partitioned shards are stored by distributing them across multiple machines called nodes. No two shards of the same file are stored on the same node; each shard occupies a separate node, and the shards spread across multiple nodes collectively constitute the data set.

Figure 2.6a shows a 1 GB data block split up into four chunks of 256 MB each. When the size of the data increases, a single node may be insufficient to store the data. With sharding, more nodes are added to meet the demands of the massive data growth. Sharding reduces the number of transactions each node handles and increases throughput. It also reduces the data each node needs to store.
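The arithmetic behind the 1 GB example can be checked with a short sketch. The function name `shard_sizes` is hypothetical, and 1 GB is taken to be 1024 MB so that four 256 MB chunks add up exactly.

```python
import math
from typing import List


def shard_sizes(total_mb: int, shard_mb: int) -> List[int]:
    """Split total_mb into shards of at most shard_mb each."""
    if total_mb <= 0 or shard_mb <= 0:
        raise ValueError("sizes must be positive")
    count = math.ceil(total_mb / shard_mb)
    # Every shard is full except possibly the last, which takes the remainder.
    sizes = [shard_mb] * (count - 1)
    sizes.append(total_mb - shard_mb * (count - 1))
    return sizes


one_gb = shard_sizes(1024, 256)   # the 1 GB block from the example
uneven = shard_sizes(1000, 256)   # a size that does not divide evenly
```

With one shard per node, each node stores 256 MB rather than the full 1024 MB.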

Figure 2.5
Distribution model.

Figure 2.6
(a) Sharding. (b) Sharding example.

Figure 2.6b shows an example of how a data block is split up into shards across multiple nodes. A data set with employee details is split up into four small blocks (shard A, shard B, shard C, and shard D) that are stored across four different nodes: node A, node B, node C, and node D. Sharding improves the fault tolerance of the system, as the failure of a node affects only the block of data stored on that particular node.
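The employee example can be sketched as follows. The employee names, the contiguous-block split, and the helper names `place_shards` and `surviving_records` are assumptions for illustration; the excerpt does not say how the rows are assigned to shards.

```python
import math
from typing import Dict, List


def place_shards(records: List[str], nodes: List[str]) -> Dict[str, List[str]]:
    """Cut the records into contiguous blocks and put one block on each node."""
    block = math.ceil(len(records) / len(nodes))
    return {node: records[i * block:(i + 1) * block]
            for i, node in enumerate(nodes)}


def surviving_records(placement: Dict[str, List[str]], failed_node: str) -> List[str]:
    """Records that can still be read after one node fails."""
    return [record
            for node, shard in placement.items() if node != failed_node
            for record in shard]


employees = ["emp%d" % i for i in range(1, 9)]  # eight hypothetical employees
nodes = ["node A", "node B", "node C", "node D"]
placement = place_shards(employees, nodes)
after_failure = surviving_records(placement, "node B")
```

Losing node B makes only the records of shard B unavailable; the other three shards remain readable.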

2.2.2 Data Replication

Replication is the process of creating copies of the same set of data across multiple servers. When a node crashes, the data stored on that node is lost. Also, when a node is down for maintenance, it is not available until the maintenance process is over. To overcome these issues, the data block is copied across multiple nodes. This process is called data replication, and each copy of a block is called a replica. Figure 2.7 shows data replication.
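A minimal sketch of block replication follows. The placement rule (each block goes to consecutive nodes, wrapping around), the replication factor of 2, and the names `place_replicas` and `readable_blocks` are assumptions for the example, not something the excerpt specifies.

```python
from typing import Dict, List, Set


def place_replicas(blocks: List[str], nodes: List[str],
                   replication_factor: int) -> Dict[str, Set[str]]:
    """Store each block on replication_factor distinct, consecutive nodes."""
    if replication_factor > len(nodes):
        raise ValueError("not enough nodes for distinct replicas")
    placement: Dict[str, Set[str]] = {node: set() for node in nodes}
    for index, block in enumerate(blocks):
        for offset in range(replication_factor):
            placement[nodes[(index + offset) % len(nodes)]].add(block)
    return placement


def readable_blocks(placement: Dict[str, Set[str]], failed: Set[str]) -> Set[str]:
    """Blocks that still have at least one replica on a live node."""
    live: Set[str] = set()
    for node, blocks in placement.items():
        if node not in failed:
            live |= blocks
    return live


placement = place_replicas(["A", "B", "C", "D"], ["n0", "n1", "n2", "n3"], 2)
one_down = readable_blocks(placement, {"n1"})        # every block survives
two_down = readable_blocks(placement, {"n0", "n1"})  # block A had both copies there
```

With two replicas per block, any single node failure leaves all blocks readable. Losing both nodes that hold a block's replicas still loses that block.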

Replication makes the system fault tolerant: because the data is redundant across the nodes, it is not lost when an individual node fails. Replication also increases data availability, as the same copy of the data is available on multiple nodes. Figure 2.8