Balancing even distribution and query performance
When using distributed databases, we generally encounter recommendations saying to distribute the data as evenly as possible. Imagine that we use a hash function to choose which server a record should be stored in. If we use a UUID as the primary key, that would be easy, and the records will (in general) be evenly distributed. That works really well, and when we need, for example, to get the record #1, the database will only reach a few nodes to get that data (depending on the consistency level, some systems may query multiple nodes to detect newer versions).
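As a heavily simplified sketch of that idea (real systems like Cassandra and ScyllaDB use consistent hashing over a token ring rather than a plain modulo, and the node names and function below are made up for illustration), placement can be thought of as hashing the primary key and mapping it to a node:

```python
import hashlib
import uuid

# Hypothetical cluster of 4 nodes. Real clusters use consistent hashing
# over a token ring instead of a simple modulo, but the intuition is the same.
NODES = ["node-a", "node-b", "node-c", "node-d"]

def pick_node(primary_key: str) -> str:
    """Map a primary key to the node that should store the record."""
    digest = hashlib.md5(primary_key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# Random UUID primary keys spread (roughly) evenly across the nodes.
for _ in range(5):
    record_id = str(uuid.uuid4())
    print(record_id, "->", pick_node(record_id))
```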
That’s an ideal scenario, and it would translate to a user story along the lines of “Get the article with ID #1”, but in most cases we need queries such as “Get all posts from author #2”. In that case, we won’t be able to get the data by its ID; we would have to run a query that reaches every single record on every node, compares the author_id field and returns the matches. That’s a slow operation when compared to a lookup by the primary key, and here we reach a very important fact about distributed databases: data needs to be evenly distributed, but queries must reach as few nodes as possible. That may seem contradictory, but let’s detail it.
In contrast to traditional relational databases, we generally list our query requirements first and then architect our tables and keys. We should do it in a way that the data is evenly distributed (if, for example, we partition the data by year, the node holding the current year would receive practically all of the accesses, and the other nodes almost none) and close together for querying (for example, if we want to get all of the posts of a user, ideally all of those posts would be in the same node).
To fix this, one solution is to use the column relevant for the query as the partition key. Most distributed database providers let us choose multiple columns to compose the partition key and the primary key and, in general, the primary key contains the partition key. In our example, we could create a table posts_by_author with the primary key being (author_id, id) and the partition key being the first part of the primary key (author_id). We could then query all posts in the partition identified by the author_id, and that query will reach a small number of nodes (since all of the data will be in just a few partitions and nodes). This also works for the case where we want to get one specific post: we could get a record by its author_id and id.
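A minimal sketch of that table using the Python driver for Cassandra/ScyllaDB (the contact point, keyspace, column names and values are all assumptions for illustration):

```python
import uuid
from cassandra.cluster import Cluster

# Hypothetical contact point and keyspace.
session = Cluster(["127.0.0.1"]).connect("blog")

# author_id is the partition key; id is the clustering part of the primary key.
session.execute("""
    CREATE TABLE IF NOT EXISTS posts_by_author (
        author_id uuid,
        id        timeuuid,
        title     text,
        body      text,
        PRIMARY KEY ((author_id), id)
    )
""")

author_id = uuid.uuid4()   # placeholder values for the example
post_id = uuid.uuid1()     # version-1 UUID works as a timeuuid

session.execute(
    "INSERT INTO posts_by_author (author_id, id, title, body) VALUES (%s, %s, %s, %s)",
    (author_id, post_id, "Hello", "First post"),
)

# "Get all posts from author X": touches only the nodes owning this partition.
rows = session.execute(
    "SELECT id, title FROM posts_by_author WHERE author_id = %s", (author_id,)
)

# "Get one specific post": still efficient, but needs both parts of the key.
row = session.execute(
    "SELECT title, body FROM posts_by_author WHERE author_id = %s AND id = %s",
    (author_id, post_id),
).one()
```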
Denormalization and multiple tables with duplicated data
Even with the database architecture from the previous section, if we want to get a specific post, we would have to know both its author_id and its id, which is not always convenient. In that case, it’s actually impossible to get a record only by its ID using the primary key, and the solution to that is the nightmare of all software engineers and architects in the world: having 2+ tables with 90%+ duplication just to fit those two user stories. And that’s an actual solution that’s recommended in some cases. It’s gonna be very efficient, since we reach just a few nodes for our requests and the data will be evenly distributed, and - yes - that means you’re gonna have to update the same post data in multiple tables.
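Continuing the hypothetical sketch above, a minimal version of that duplication could look like this: a second table, posts_by_id, keyed only by the post id, with every write going to both tables.

```python
# Same columns, different primary key: one partition per post id.
session.execute("""
    CREATE TABLE IF NOT EXISTS posts_by_id (
        id        timeuuid PRIMARY KEY,
        author_id uuid,
        title     text,
        body      text
    )
""")

# Every insert (and update) now has to be applied to both tables.
title, body = "Hello", "First post"
session.execute(
    "INSERT INTO posts_by_author (author_id, id, title, body) VALUES (%s, %s, %s, %s)",
    (author_id, post_id, title, body),
)
session.execute(
    "INSERT INTO posts_by_id (id, author_id, title, body) VALUES (%s, %s, %s, %s)",
    (post_id, author_id, title, body),
)

# "Get the post with ID X" is now a single-partition read again.
row = session.execute(
    "SELECT title, body FROM posts_by_id WHERE id = %s", (post_id,)
).one()
```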
Sorting
In most distributed databases you should also use part of (or the entire) primary key as a sorting index that matches your user stories’ queries while keeping data retrieval and distribution efficient.
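A hedged sketch of what that looks like in CQL-style syntax (the table and column names are made up): the clustering columns define the sort order inside each partition, so “the latest posts by an author” comes back already ordered.

```python
# The clustering columns (created_at, id) define the on-disk sort order
# inside each author_id partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS posts_by_author_desc (
        author_id  uuid,
        created_at timestamp,
        id         timeuuid,
        title      text,
        PRIMARY KEY ((author_id), created_at, id)
    ) WITH CLUSTERING ORDER BY (created_at DESC, id DESC)
""")

# Newest-first with no extra sorting step; LIMIT keeps the read small.
rows = session.execute(
    "SELECT created_at, title FROM posts_by_author_desc WHERE author_id = %s LIMIT 10",
    (author_id,),
)
```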
(Non-primary) indexes
If you’ve worked with any database, while reading the previous section you might have thought about using indexes on non-primary columns that we still query against. That would be a good solution for the problem, but it may add costs or reduce performance. The implementation of non-primary indexes differs between database providers, but some of them actually duplicate the original table, prepending the indexed column’s content to the primary key.
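As a sketch of the trade-off (reusing the hypothetical posts_by_id table from above): in Cassandra/ScyllaDB this is a secondary index, and in DynamoDB a global secondary index plays a similar role by maintaining another copy of the items keyed by the indexed attribute.

```python
# A secondary index on a non-primary column.
session.execute("CREATE INDEX IF NOT EXISTS ON posts_by_id (author_id)")

# This query is now possible without knowing the partition key,
# but it may fan out to many nodes and cost more than posts_by_author.
rows = session.execute(
    "SELECT id, title FROM posts_by_id WHERE author_id = %s", (author_id,)
)
```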
Partitioning data based on tenants
If you separate the data of each customer based on a column in the database, such as the Tenant ID or Customer ID, that column may seem like a good partition key, since the data might be evenly distributed. That’s true if you have a lot of customers with low usage each. If you have one huge customer (in terms of resource consumption) and multiple small customers, you’ll end up with hot partitions.
Bucketing
When we are sure that the partition key we chose is the right one for our queries, but it doesn’t by itself allow an even distribution, we can use a technique called bucketing. With this technique, we create a partition key composed of two columns, one being the actual column that we chose, and the other one being derived from the date of the record. You can choose the granularity of the date range to prevent hot spots. Discord, for example, uses buckets to group records by date and improve data distribution.
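A minimal sketch of bucketing, loosely inspired by the Discord example (the table, column names and the monthly granularity are all assumptions): the partition key becomes a pair of the original column and a coarse time bucket.

```python
import uuid
from datetime import datetime, timezone

def month_bucket(ts: datetime) -> str:
    """Derive a coarse time bucket (one per month) from a timestamp."""
    return ts.strftime("%Y-%m")

# (channel_id, bucket) is the composite partition key, so a single busy
# channel is spread across many partitions instead of one hot partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS messages_by_channel (
        channel_id uuid,
        bucket     text,
        message_id timeuuid,
        body       text,
        PRIMARY KEY ((channel_id, bucket), message_id)
    )
""")

channel_id = uuid.uuid4()  # placeholder
bucket = month_bucket(datetime.now(timezone.utc))
rows = session.execute(
    "SELECT message_id, body FROM messages_by_channel WHERE channel_id = %s AND bucket = %s",
    (channel_id, bucket),
)
```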
The types of keys
While studying distributed databases, the hardest part for me, other than architecting the data to fit the queries, was understanding the multiple key names. Every record must have a primary key, but there are some other very important keys existing in most distributed databases, which are:
- Primary Key: One (Simple Key) or more columns (Composite Key) that we use to uniquely identify a record in a table. That value must be unique within the table. In general, we use it when we want to get a single record from its unique identifier.
- Simple Primary Key: A primary key composed of a single column, for example, the ID of a record. Even though the Primary Key is a single column, that column also acts as the Partition Key.
- Composite Primary Key: When we use more than one field to compose the Primary Key, generally the first field is the Partition Key, and the other fields form the Clustering or Sort key.
- Partition Key: Part of (or the entire, when using Simple Primary Keys) the Primary Key that’s used to determine in which node that record will be stored.
- Clustering Key or Sort Key: The rest of the Primary Key (the columns after the Partition Key) that sorts the records within a partition; with a Simple Primary Key there is no Clustering Key.
Those names and specifications may differ between database providers, but - in general - most providers use this structure. The sketch below maps those names onto a single hypothetical table definition.
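```python
# PRIMARY KEY ((tenant_id, bucket), created_at, order_id)
#              └── Partition Key ──┘ └─ Clustering / Sort Key ─┘
# The whole thing is a Composite Primary Key; a Simple Primary Key would
# be a single column, e.g. PRIMARY KEY (order_id).
session.execute("""
    CREATE TABLE IF NOT EXISTS orders_by_tenant (
        tenant_id  uuid,
        bucket     text,
        created_at timestamp,
        order_id   timeuuid,
        total      decimal,
        PRIMARY KEY ((tenant_id, bucket), created_at, order_id)
    )
""")
```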
I consider that completely understanding all of the keys is half of the knowledge necessary to design distributed databases, and if you’re new to distributed databases, this explanation will probably not be enough to completely understand it, so I recommend searching for examples of actual data modeling using distributed databases.
Remember that, for efficient queries, you’ll either:
- Query a single record by providing the complete Primary Key
- Query multiple records by filtering (before any other filters) by the Partition Key
- Query multiple records by filtering (before any other filters) by the Partition Key and sorting them by the Clustering Key or Sort Key (as in the sketch below).
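Using the hypothetical tables from the earlier sketches, those three query shapes could look like this:

```python
# 1) Single record: the complete primary key is provided.
session.execute(
    "SELECT title FROM posts_by_author WHERE author_id = %s AND id = %s",
    (author_id, post_id),
)

# 2) Multiple records: filter by the partition key first.
session.execute(
    "SELECT id, title FROM posts_by_author WHERE author_id = %s", (author_id,)
)

# 3) Multiple records, ordered by the clustering key within the partition.
session.execute(
    "SELECT id, title FROM posts_by_author WHERE author_id = %s ORDER BY id DESC LIMIT 10",
    (author_id,),
)
```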
General recommendations when working with distributed databases
- Don’t use JOINs
- Denormalize the data
- Query as few tables as possible. Ideally, one.
- In place of complex queries, use wide tables.
Reads, writes and replication
When accessing distributed databases from a client, we commonly list some servers as possible entry points. In some systems, any server can receive and respond to a request, and the server that does so is called the Coordinator. That server, upon receiving a request, determines where that data is stored (it could be stored on the server itself) and handles that communication. In a distributed system, there’s no central server, and any node can operate for reads and writes.
Most applications use a replication factor of 3+, meaning that at least 3 servers must store a copy of the data. That makes sense if we want to reduce the risk of data loss or big inconsistencies. For example, if we chose to write with a replication factor of 3, the coordinator server would choose 3 servers to write that data (one of them could be the coordinator itself), coordinate the requests and wait for an acknowledgement from each one. After all of them acknowledge, the row is defined as inserted.
To prevent inconsistencies in the data, it’s also possible to set a similar factor (a consistency level) for queries, and the same strategy mentioned before applies. That option is useful, for example, if we’re reading sensitive information that could have changed on another server but was not yet replicated to the server that was chosen to serve the query.
It should be noted that replication and consistency are handled per partition, so if you require that a read is consistent across all possible versions, the database checks all of the nodes that hold that partition.
Having multiple replicas of the data, we could replicate a change to 3 nodes but only require that 2 of them confirm the insertion. That’s why, when reading, one might want to query multiple nodes for consistency. Of course, not all of the information needs to be returned from all nodes in a read, just a checksum of that data. The data is eventually consistent, since at some point all of the replicas will have the correct and updated data available.
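As one concrete (and hedged) example of how a client expresses this, the Python driver for Cassandra/ScyllaDB lets you pick a consistency level per statement. With a replication factor of 3, QUORUM writes and QUORUM reads overlap in at least one replica, which is what guarantees the read sees the latest acknowledged write. Table and variable names are carried over from the earlier hypothetical sketches.

```python
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

# QUORUM with a replication factor of 3 means 2 of the 3 replicas
# must acknowledge the write before it is considered successful.
write = SimpleStatement(
    "INSERT INTO posts_by_id (id, author_id, title, body) VALUES (%s, %s, %s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(write, (post_id, author_id, title, body))

# A QUORUM read is guaranteed to include at least one replica that took
# part in the QUORUM write, so it returns the up-to-date row.
read = SimpleStatement(
    "SELECT title FROM posts_by_id WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
row = session.execute(read, (post_id,)).one()
```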
Extra
- To write this article, I’ve studied the following distributed databases: Apache Cassandra, ScyllaDB and AWS DynamoDB.
- This post is a knowledge dump that I did while studying that specific topic, so it’s not as well structured as an article.