Timestamps in HBase are powerful, but they tend to trip up developers and users. While working on a variety of projects on HBase and Kiji, I’ve seen many ways that naively interacting with versioning in HBase can have unexpected consequences. This blog post will help others avoid some of the most common pitfalls that I have found, particularly when explicitly setting timestamps for data.
Fully specifying a cell in HBase requires four components: rowkey, column family, column qualifier, and version (a.k.a. the timestamp). When writing to a cell, you have the option of explicitly setting the version or letting HBase manage it for you. Though managing timestamps yourself is allowed, the HBase reference manual cautions against it:
Caution: the version timestamp is used internally by HBase for things like time-to-live calculations. It’s usually best to avoid setting this timestamp yourself. Prefer using a separate timestamp attribute of the row, or have the timestamp a part of the rowkey, or both.
If you choose to manage version timestamps yourself, it is important to understand how timestamps are used internally by HBase.
The default method of versioning cells in HBase is for the RegionServer to assign a version using a UNIX epoch timestamp measured in milliseconds. If HBase is backing a live application, this version is very convenient because it is both the time that HBase recorded the data and, to a close approximation, the time the user generated it. Sometimes though, particularly when importing historical data into HBase, you may want to explicitly set the version of a cell to the time at which the data was originally recorded.
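The choice between the two versioning modes can be sketched in a few lines. This is a toy illustration in Python, not HBase code: `resolve_version` and its parameters are hypothetical names, and the `clock` argument stands in for the RegionServer's clock.

```python
import time


def resolve_version(explicit_ts_ms=None, clock=time.time):
    """Pick the version for a write, in milliseconds since the UNIX epoch.

    If the caller supplies a timestamp (for example, the original event time
    of an imported historical record), use it. Otherwise fall back to the
    current time, which mimics the default server-assigned version.
    """
    if explicit_ts_ms is not None:
        return explicit_ts_ms
    return int(clock() * 1000)
```

A live application would call `resolve_version()` with no arguments, while a historical import would pass the recorded event time, such as `resolve_version(1262304000000)`.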
There are a number of potential gotchas when using timestamps in HBase/Kiji. Here are the ones at the top of my list, with accompanying solutions:
Problem: Setting the time-to-live for an HBase column family (Kiji locality group) too small.
Example: I once saw a project where a bulk import of historical data into a Kiji table seemed to fail even though the MapReduce job succeeded. The time-to-live property of the Kiji locality group had been converted to Java timestamps incorrectly, so data in the table had only seconds to live.
Solution: When setting the time-to-live for a locality group, it is important to verify that your defined time-to-live is consistent with the timestamps you are setting.
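One way to make that consistency check concrete is to compute, before the import, whether the oldest timestamp you plan to write would already be expired. The sketch below is plain Python rather than HBase or Kiji code. It assumes the time-to-live is given in seconds and version timestamps in milliseconds, and that a cell expires once its age exceeds the time-to-live; the function names are hypothetical.

```python
MS_PER_SECOND = 1000


def is_expired(cell_ts_ms, ttl_seconds, now_ms):
    """True if a cell versioned at cell_ts_ms is older than the TTL at now_ms.

    Assumption: TTL is in seconds, timestamps are in milliseconds.
    """
    return now_ms - cell_ts_ms > ttl_seconds * MS_PER_SECOND


def min_ttl_seconds(oldest_ts_ms, now_ms):
    """Smallest TTL in seconds that keeps a cell versioned at oldest_ts_ms alive."""
    age_ms = max(0, now_ms - oldest_ts_ms)
    return -(-age_ms // MS_PER_SECOND)  # ceiling division


# Hypothetical import: records are one year old, TTL was set to 30 days.
now_ms = 1_400_000_000_000
one_year_ago_ms = now_ms - 365 * 24 * 60 * 60 * MS_PER_SECOND
thirty_days_s = 30 * 24 * 60 * 60

if is_expired(one_year_ago_ms, thirty_days_s, now_ms):
    needed = min_ttl_seconds(one_year_ago_ms, now_ms)
    print("TTL too small; need at least %d seconds" % needed)
```

Running a check like this against the oldest and newest timestamps in an import also tends to expose unit mix-ups, since a time-to-live that is off by a factor of 1000 produces an obviously wrong answer.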
Problem: A delete masks any value written afterwards whose timestamp is older than the time the delete was recorded. The delete stops masking values only once a major compaction has completed.
Example: While working on backing up and restoring metadata in Kiji, I discovered that after deleting the original metadata and restoring it from a backup, I could not read any values back from the table. The delete operation was masking values that were written after the delete but carried timestamps earlier than it.
Solution: Do not delete data unless you *really* need to. If you perform a delete, trigger a major compaction, and wait for it to finish. A problem here is that there is no way to check if the major compaction you have triggered is complete. The solution I came to for restoring metadata from a backup was to create an entirely new HBase table to write the meta information back into. This is clearly not a universal solution, but in the case of a developer needing to restore from backup, it is reasonable to assume we can drop and recreate their old table.
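The masking behavior is easier to see in a toy model. The class below is a deliberately simplified, in-memory stand-in for a single column (one row, family, and qualifier) written in Python. It is not the HBase API, the class and method names are hypothetical, and it assumes a delete marker hides every version at or below its timestamp until a major compaction discards both the marker and the cells it masks.

```python
class ToyColumn:
    """Toy model of one HBase column with versions and a delete marker."""

    def __init__(self):
        self.cells = {}        # version timestamp -> value
        self.tombstone = None  # timestamp of the newest delete marker, if any

    def put(self, ts, value):
        self.cells[ts] = value

    def delete(self, ts):
        # The marker masks every version with a timestamp <= ts.
        self.tombstone = ts if self.tombstone is None else max(self.tombstone, ts)

    def get(self):
        # Return the newest version that is not masked by the delete marker.
        visible = [ts for ts in self.cells
                   if self.tombstone is None or ts > self.tombstone]
        return self.cells[max(visible)] if visible else None

    def major_compact(self):
        # Drop masked cells together with the marker that masked them.
        if self.tombstone is not None:
            self.cells = {ts: v for ts, v in self.cells.items()
                          if ts > self.tombstone}
            self.tombstone = None


col = ToyColumn()
col.put(100, "original")
col.delete(200)             # delete recorded at time 200
col.put(150, "restored")    # written later, but with an older timestamp
masked_read = col.get()     # None: the marker at 200 hides version 150

col.major_compact()         # marker and masked cells are discarded
col.put(150, "restored")    # the same write now succeeds
visible_read = col.get()    # "restored"
```

Note that in this model the first restored write is lost for good: the compaction removes it along with the marker, so the restore has to be repeated after compaction finishes.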
Problem: You can overwrite data unknowingly by writing to the same row, family, qualifier, and version.
Example: While working with Wikipedia editing data, I accidentally clobbered around 100,000 records because these editing events were recorded at the exact same millisecond. It isn’t likely, but it happens.
Solution: Distinguish the values being written to the same timestamp by altering either the row key or column qualifier to make the (r, f, q, t) coordinates distinct.
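A small sketch shows both the collision and the fix. It models a table as a Python dictionary keyed by (row, family, qualifier, timestamp), which is only an approximation of HBase's storage, and the event fields (`page`, `rev_id`, `ts`) and the `info`/`edit` column names are hypothetical.

```python
def write_all(events, make_key):
    """Write each event into a dict keyed by its (r, f, q, t) coordinates."""
    table = {}
    for event in events:
        table[make_key(event)] = event["value"]  # same key silently overwrites
    return table


# Two edits to the same page recorded in the same millisecond.
events = [
    {"page": "HBase", "rev_id": 1, "ts": 1000, "value": "edit A"},
    {"page": "HBase", "rev_id": 2, "ts": 1000, "value": "edit B"},
]


def naive_key(event):
    # Row, family, qualifier, and timestamp are identical for both edits.
    return (event["page"], "info", "edit", event["ts"])


def distinct_key(event):
    # Fold a distinguishing field into the qualifier to separate the cells.
    return (event["page"], "info", "edit:%d" % event["rev_id"], event["ts"])


clobbered = write_all(events, naive_key)    # one cell survives
preserved = write_all(events, distinct_key)  # both cells survive
```

Folding the distinguishing field into the row key instead would work the same way; which one fits better depends on how the data is read back.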
Timestamps are powerful but can be tricky. Knowing how versions are used internally by HBase can help you avoid common pitfalls in working with versioned data. The common problems pointed out here are a good start for getting a handle on versioning in HBase. For more information on how HBase uses versions internally, check out HBase: The Definitive Guide.