Designing an effective schema for Apache HBase is a non-trivial problem. KijiSchema is a new framework that makes it easier to build applications on top of HBase. It provides a schema management layer that takes a lot of the guesswork out of designing HBase schemas. Kiji is optimized for ease of access for entity-centric schemas and tuned for good performance. In the rest of this article, we’ll learn about some of the features that KijiSchema provides.
As with most database management systems, HBase suffers from what is known as a leaky abstraction. In an ideal world, a DBMS would provide a perfect abstraction: the DBA could design a schema optimized for efficient access by application developers without knowing how the data is physically stored. In the real world, however, effectively optimizing a DBMS requires an intimate understanding of the physical properties of the database. In a relational database, a leaky abstraction means that when a DBA selects indexes and determines how to shard or partition a table, these choices impact the performance of the database. HBase requires similar upfront planning when selecting a row key or naming columns and column families.
One of the more serious consequences of HBase’s leaky abstraction is the appearance of hotspots. When one server receives a disproportionate number of requests compared to the rest of the cluster, we call it a hotspot. The result is extremely poor cluster utilization, often with painfully slow response times from the hot server, and sometimes a cascade of failures as servers become overloaded.
What Causes Hotspots?
To better understand hotspots, we have to look more closely at how data is assigned to particular Region Servers. Each Region Server serves a number of Regions, which are assigned to it by the HBase Master. A particular Region contains all the data for a contiguous range of row keys. Periodically, HBase checks whether Regions have grown too large and splits any that exceed the size limit. Each Region produced by a split is responsible for a portion of the original Region’s key range.
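The mapping from row key to Region can be illustrated with a small sketch. This is not HBase code: the start keys, the function names `region_index` and `split_region`, and the split point are all hypothetical, chosen only to show how sorted key ranges determine which Region owns a row.

```python
from bisect import bisect_right, insort

def region_index(start_keys, row_key):
    """Return the index of the Region whose key range contains row_key.

    start_keys is a sorted list of Region start keys (inclusive); each
    Region covers keys from its start key up to the next start key.
    """
    return bisect_right(start_keys, row_key) - 1

def split_region(start_keys, split_key):
    """Return a new boundary list after splitting a Region at split_key."""
    if split_key in start_keys:
        raise ValueError("split key is already a Region boundary")
    new_keys = list(start_keys)
    insort(new_keys, split_key)  # keep the boundaries sorted
    return new_keys

# Hypothetical table with four Regions; the first starts at the empty key.
start_keys = [b"", b"g", b"n", b"t"]

print(region_index(start_keys, b"apple"))  # falls in the first Region
print(region_index(start_keys, b"tiger"))  # falls in the last Region

# Splitting the last Region at b"w" gives keys >= b"w" their own Region.
after_split = split_region(start_keys, b"w")
print(region_index(after_split, b"zebra"))
```

Because keys are compared as sorted bytes, every key that shares a prefix lands in the same Region or in adjacent ones, which is what makes some key designs prone to hotspots.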
Consider an HBase table designed to store Wikipedia articles, which uses the article name as its row key. In the English language, words tend to start with certain letters disproportionately to the rest of the alphabet. For example, 16.671% of English words begin with the letter ‘T’, while only 0.034% of English words begin with ‘Z’. For the purposes of this discussion, assume that the total disk size of the articles for each letter is exactly the same, as would be the case if the ‘Z’ articles were very large and the ‘T’ articles were all very small. The Region containing ‘T’ articles would then hold a far larger number of row keys than the ‘Z’ Region.
Now assume that a frequently run MapReduce job scans all the row keys that start with ‘T’. The Region Server that serves the ‘T’ Region will end up receiving a huge number of read requests, while other Region Servers sit idle.
Another example of a hotspot occurs when writing data. This time, consider a schema with a transaction table that uses a numeric transaction ID as the row key. Every day, we perform a bulk load of all the new transactions that were received during the previous day. As is often the case, our data all comes in sorted by transaction ID. When we load the data into HBase each day, the new data will always have a transaction ID greater than any other ID in the table.
The consequence of this is that our loads always write to one Region at a time. If we try to write a large amount of data, this can cause the entire process to bottleneck on a single server.
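The write hotspot can be simulated with a short sketch. The Region boundaries, the six-digit zero-padded key format, and the ID range of the daily load are assumptions made for illustration; they are not taken from a real cluster.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical start keys for four Regions of the transaction table.
START_KEYS = ["", "250000", "500000", "750000"]

def row_key(transaction_id: int) -> str:
    # Zero-pad so that lexicographic order matches numeric order.
    return f"{transaction_id:06d}"

def writes_per_region(transaction_ids):
    """Count how many writes each Region (by index) would receive."""
    counts = Counter()
    for tid in transaction_ids:
        region = bisect_right(START_KEYS, row_key(tid)) - 1
        counts[region] += 1
    return counts

# One day's load: IDs continue from where the table left off.
daily_load = range(900_000, 910_000)
print(writes_per_region(daily_load))
```

All 10,000 writes in this load fall into the last Region (index 3), so a single Region Server absorbs the entire bulk load.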
Dealing with Hotspots
The root cause of the hotspot in both of these cases is a poorly chosen row key. Because HBase assigns each Region a contiguous range of keys, it is important to distribute row keys evenly across the key space. In the Wikipedia example, the optimal distribution would place a similar number of the articles that start with the letter ‘T’ in each Region. Had that been the case, the MapReduce jobs would send read requests to all of the Region Servers, rather than just one.
How do we achieve an even distribution of keys among the Regions? Considering the transaction table example, if we knew that our keyspace had 1,000,000 possible keys and that our cluster had 10 Region Servers, we could split our keyspace into 10 segments by taking the transaction ID modulo 10 and using that result as a prefix for the key.
However, this solution is only possible because we know that the keyspace is monotonically increasing up to a maximum number and we know the number of Region Servers. This same approach doesn’t work nearly as well for the Wikipedia example.
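The modulo prefix described above can be sketched as follows. The bucket count, the `salted_key` name, and the `bucket-id` key layout are illustrative assumptions rather than an HBase or KijiSchema convention.

```python
from collections import Counter

NUM_BUCKETS = 10  # assumed to match the number of Region Servers

def salted_key(transaction_id: int) -> str:
    """Prefix the zero-padded transaction ID with its ID modulo NUM_BUCKETS."""
    bucket = transaction_id % NUM_BUCKETS
    return f"{bucket}-{transaction_id:06d}"

# The same kind of sequential daily load that previously hit one Region.
daily_load = range(900_000, 910_000)
writes_per_bucket = Counter(salted_key(tid)[0] for tid in daily_load)
print(sorted(writes_per_bucket.items()))
```

Consecutive IDs now cycle through all ten prefixes, so each bucket receives exactly 1,000 of the 10,000 writes. The cost is that reading a contiguous range of transaction IDs requires one scan per prefix.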
Hash It Out
The best overall approach for achieving an even distribution of the keyspace is to use a hash function. A good hash function is deterministic, so that applying the function to a row key returns the same result every time. The hash function should also provide uniformity, so that every possible result of the hash function occurs with a similar probability. Another important feature of a good hash function is that it minimizes hash collisions, where two row keys hash to the same result. Uniformity and minimal collisions result in the best distribution of row keys.
One downside to hashing row keys is that it impairs an application that needs to scan a range of rows. The MapReduce job that reads all articles starting with the letter ‘T’ would need to scan the entire table if the records were stored using hashed keys. However, for applications that don’t need to scan ranges of keys, row key hashing is an extremely effective and efficient way to avoid hotspots. The next release of KijiSchema includes support for composite row keys, allowing developers to specify hierarchical row keys, making it possible to scan over ranges of rows.
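As a rough illustration of these properties, the sketch below prefixes each row key with the first few bytes of its MD5 digest. The function name `hash_prefixed_key` and the exact byte layout are assumptions for this example; it is not a description of how KijiSchema encodes its keys internally.

```python
import hashlib
from collections import Counter

def hash_prefixed_key(row_key: str, prefix_bytes: int = 4) -> bytes:
    """Return the first prefix_bytes of the MD5 digest followed by the raw key."""
    raw = row_key.encode("utf-8")
    digest = hashlib.md5(raw).digest()
    return digest[:prefix_bytes] + raw

# Deterministic: the same row key always produces the same stored key.
assert hash_prefixed_key("Tiger") == hash_prefixed_key("Tiger")

# Roughly uniform: bucket 16,000 sequential IDs by the top four bits of the
# first hash byte, giving 16 buckets with about 1,000 keys each.
buckets = Counter(
    hash_prefixed_key(f"{tid:06d}")[0] >> 4 for tid in range(16_000)
)
print(min(buckets.values()), max(buckets.values()))
```

Keeping the original key after the hash prefix means two different row keys can never produce the same stored key, even if their hash prefixes collide.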
KijiSchema has built-in functionality for hashing row keys using MD5 hashing. This feature makes it easy for developers to avoid the performance impacts of hotspots. The following KijiSchema DDL demonstrates how to enable row key hashing on a Kiji table:
    CREATE TABLE articles WITH DESCRIPTION 'Wikipedia articles'
    ROW KEY FORMAT HASH PREFIXED(4)
    WITH LOCALITY GROUP default WITH DESCRIPTION 'main storage' (
      FAMILY info WITH DESCRIPTION 'basic information' (
        name "string" WITH DESCRIPTION 'article name',
        text "string")
    );
In the current version of KijiSchema, if an application needs to perform scans, ROW KEY FORMAT RAW can be used instead, in which case developers must manage row keys in the application.
HBase performance varies greatly depending on how developers design a table. When selecting row keys, it is important to consider both how HBase distributes Regions and whether the application requires range scans. KijiSchema makes it easier to design schemas by providing built-in row key hashing, which helps applications avoid hotspots and achieve better cluster utilization.
Jon Natkins (@nattyice) is a Field Engineer at WibiData. Formerly, he was a software engineer at Cloudera, and has contributed to a variety of projects in the Apache Hadoop ecosystem. He holds an Sc.B in Computer Science from Brown University.