One of the biggest challenges in building a stable software system is maintaining backward compatibility with previous data formats.
Systems like Kiji can help you maintain backward compatibility with your own data, but there are still some challenges involved in using all the features of Apache Avro while maintaining backward compatibility. This blog post will describe a common problem associated with storage and retrieval of Avro data, how we get around this problem in Kiji’s own data structures, and how you can use this technique as well.
A Subtle Problem with SpecificRecords
Like many applications that depend on Avro, Kiji uses Avro’s “SpecificRecord” API to generate custom classes representing data structures based on Avro IDL files. A custom SpecificRecord class contains a Java member field for each field declared in the IDL file, along with getters, setters, and the Avro schema associated with the record.
For example, the following Avro IDL defines a record that holds a point:
record Point {
  float x;
  float y;
  string label;
}
This will be compiled to a Java class named Point that can deserialize, manipulate, and serialize Point instances.
Suppose you’ve used a system like Kiji to save some Point instances that you can manipulate later, and that you had the following method in your code:
/** Create a new point with x and y fields doubled. */
public Point doubleXYDistance(Point input) {
  // Initialize with all fields of the input Point
  Point.Builder builder = Point.newBuilder(input);
  // x and y are floats, so multiply by a float literal; "* 2.0" would yield a double and fail to compile
  builder.setX(builder.getX() * 2.0f);
  builder.setY(builder.getY() * 2.0f);
  return builder.build();
}
The builder returned by Point.newBuilder(input) is initialized with every field of the input Point, so fields such as label keep whatever value the input record had.
Then suppose that in the next version of your program, you change the Avro IDL like so:
record Point {
  float x;
  float y;
  float z = 0.0;
  string label;
}
When this is recompiled you’ll have a “z” field, which will be gracefully added to existing records as you read them (when resolving the writer schema with this new reader schema). No problem. The doubleXYDistance() function will do the correct thing with z values as well: it initializes the output Point instance with a z value of 0 if the input Point lacks a z value, and uses the input Point’s z value otherwise.
The main compatibility challenge occurs when you write out new records (with x, y, and z) and then an older application (compiled with the “2-d” version of Point) wants to manipulate them.
The older application can gracefully read 3-d points as 2-d points, discarding the z value. But consider this function:
void modifyPoint() {
  Point in = loadPointCellFromKiji();
  Point out = doubleXYDistance(in);
  persistPointToKijiCell(out);
}
When this writes a new Point object back to Kiji, it will write out a 2-d point! While the best thing would be for the system to accommodate additional newer fields gracefully in this process, the next best thing is to at least detect that you are operating on data that is too new, and refuse to do so before you corrupt the persistent information store.
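The failure mode can be reproduced without Avro at all. The following is a minimal sketch in plain Java, in which a Map stands in for the persisted record and Point2D is a hypothetical stand-in for the class the older application was compiled against. None of these names come from Avro or Kiji; they exist only to illustrate the read-modify-write cycle.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of how a read-modify-write through an older class drops newer fields. */
final class LossyRoundTrip {

    /** Hypothetical stand-in for the "2-d" Point class compiled into the older application. */
    static final class Point2D {
        final float x;
        final float y;
        final String label;

        Point2D(float x, float y, String label) {
            this.x = x;
            this.y = y;
            this.label = label;
        }

        /** Reads only the fields this class knows about; anything else is silently ignored. */
        static Point2D read(Map<String, Object> stored) {
            return new Point2D((Float) stored.get("x"), (Float) stored.get("y"),
                    (String) stored.get("label"));
        }

        /** Writes only the fields this class knows about. */
        Map<String, Object> write() {
            Map<String, Object> out = new LinkedHashMap<>();
            out.put("x", x);
            out.put("y", y);
            out.put("label", label);
            return out;
        }
    }

    /** Old application logic: load, double x and y, persist. */
    static Map<String, Object> modifyWithOldCode(Map<String, Object> stored) {
        Point2D in = Point2D.read(stored);
        Point2D out = new Point2D(in.x * 2.0f, in.y * 2.0f, in.label);
        return out.write();
    }

    /** A record as the newer "3-d" application would have stored it. */
    static Map<String, Object> sampleStoredPoint() {
        Map<String, Object> stored = new LinkedHashMap<>();
        stored.put("x", 1.0f);
        stored.put("y", 2.0f);
        stored.put("z", 3.0f);
        stored.put("label", "corner");
        return stored;
    }

    public static void main(String[] args) {
        Map<String, Object> after = modifyWithOldCode(sampleStoredPoint());
        // The z value written by the newer application did not survive the round trip.
        System.out.println("z still present: " + after.containsKey("z"));
    }
}
```

Nothing in the old code is incorrect on its own terms, which is what makes the problem subtle: the loss only appears when old code and new data meet.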
Protocol Versioning
Kiji has this problem with TableLayoutDesc objects that represent Kiji table layouts as an Avro data structure. The KijiSchema shell modifies existing layouts by reading the TableLayoutDesc from the Kiji meta table, creating a modified copy, and persisting it back. We use Avro’s SpecificRecord interface to capture this.
One option to deal with this would be to use Avro’s GenericRecord API; it would allow us to faithfully copy all the fields of the input record into the output. If the new schema is backward compatible with the old schema (as is the case in those two versions of the Point record), we could work with the writer schema interpretation of the data directly.
But by using GenericRecord, we’d lose type-safe functions like TableLayoutDesc.setName(String tableName) (or Point.setX(float x)); we’d have to use a string-keyed setter that takes a field name and an untyped Object value, which is a much more error-prone API to use.
Instead, Kiji uses something we call protocol versioning to get around this issue. Each record type that we intend to operate on has a field named version which is a string. The Kiji TableLayoutDesc Avro record (what your JSON table layout files interact with) includes a version field which today you should set to "layout-1.1".
In Kiji, a protocol version includes a protocol name and a version number in major.minor.revision format. The protocol name is a sanity check on what kind of record this version pertains to. For example, Kiji table layouts all contain a protocol name of “layout”; this prevents the most basic error of trying to parse a similar but unrelated JSON record object in a place where it shouldn’t be.
The version number is a standard version number that follows semantic versioning: the major version changes when an incompatible change is introduced; the minor version for a compatible new feature; and the revision for a bug fix.
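To make the format concrete, here is a minimal sketch of parsing and ordering such version strings in plain Java. SimpleProtocolVersion is a hypothetical class written for this illustration, not Kiji's ProtocolVersion, and it assumes that omitted trailing components (as in "layout-1.1") count as 0.

```java
import java.util.Objects;

/** Hypothetical "name-major.minor.revision" version; a sketch, not Kiji's ProtocolVersion class. */
final class SimpleProtocolVersion implements Comparable<SimpleProtocolVersion> {
    final String name;
    final int major;
    final int minor;
    final int revision;

    private SimpleProtocolVersion(String name, int major, int minor, int revision) {
        this.name = name;
        this.major = major;
        this.minor = minor;
        this.revision = revision;
    }

    /** Parses "layout-1.1" or "layout-1.1.0"; omitted components are assumed to be 0. */
    static SimpleProtocolVersion parse(String text) {
        int dash = text.lastIndexOf('-');
        if (dash <= 0 || dash == text.length() - 1) {
            throw new IllegalArgumentException("Expected name-major[.minor[.revision]]: " + text);
        }
        String[] parts = text.substring(dash + 1).split("\\.", -1);
        if (parts.length > 3) {
            throw new IllegalArgumentException("Too many version components: " + text);
        }
        int[] nums = new int[3];
        for (int i = 0; i < parts.length; i++) {
            // Integer.parseInt throws NumberFormatException (an IllegalArgumentException) on bad input.
            nums[i] = Integer.parseInt(parts[i]);
        }
        return new SimpleProtocolVersion(text.substring(0, dash), nums[0], nums[1], nums[2]);
    }

    /** Orders by protocol name first, then numerically by major, minor, and revision. */
    @Override
    public int compareTo(SimpleProtocolVersion other) {
        int byName = name.compareTo(other.name);
        if (byName != 0) return byName;
        if (major != other.major) return Integer.compare(major, other.major);
        if (minor != other.minor) return Integer.compare(minor, other.minor);
        return Integer.compare(revision, other.revision);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof SimpleProtocolVersion && compareTo((SimpleProtocolVersion) o) == 0;
    }

    @Override
    public int hashCode() {
        return Objects.hash(name, major, minor, revision);
    }

    public static void main(String[] args) {
        SimpleProtocolVersion supported = parse("layout-1.1.0");
        SimpleProtocolVersion incoming = parse("layout-1.5.0");
        // A negative result means the incoming record is newer than what is supported.
        System.out.println(supported.compareTo(incoming));
    }
}
```

Comparing the components as integers matters: a plain string comparison would rank "1.10.0" below "1.9.0".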
When we parse a JSON-formatted Avro record from a file, or load a serialized object into a SpecificRecord, we use the ProtocolVersion class to parse the version field of the record. We can then compare the record’s version against the minimum and maximum protocol versions that this version of Kiji supports.
For example, the KijiSchema shell version 1.0.0 supports a maximum ProtocolVersion of layout-1.1.0 for table layouts. It would refuse to operate on a Kiji TableLayoutDesc with version = "layout-1.5.0". This TableLayoutDesc object might contain new fields that the SpecificRecord subclass compiled into KijiSchema Shell 1.0.0 would discard. The user would be directed to use a newer version of the shell to operate on the newer TableLayoutDesc record.
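A range check of this kind can be sketched in a few lines of plain Java. LayoutVersionGate is a hypothetical name, and the minimum of layout-1.0.0 is an assumption made for the example (the text above only states the maximum). The sketch uses Arrays.compare, which requires Java 9 or later.

```java
import java.util.Arrays;

/** Sketch of a gate that refuses layout records outside the versions this build understands. */
final class LayoutVersionGate {
    static final String PROTOCOL = "layout";
    // The maximum comes from the text above; the minimum is an assumption made for this sketch.
    static final int[] MIN = {1, 0, 0};
    static final int[] MAX = {1, 1, 0};

    /** Returns {major, minor, revision} from "name-major.minor.revision"; omitted parts become 0. */
    static int[] numbers(String version) {
        String[] parts = version.substring(version.lastIndexOf('-') + 1).split("\\.");
        int[] nums = new int[3];
        for (int i = 0; i < parts.length && i < nums.length; i++) {
            nums[i] = Integer.parseInt(parts[i]);
        }
        return nums;
    }

    /** True when the record names the expected protocol and its version lies within [MIN, MAX]. */
    static boolean isSupported(String version) {
        if (!version.startsWith(PROTOCOL + "-")) {
            return false; // a different kind of record entirely
        }
        int[] v = numbers(version);
        // Arrays.compare orders int arrays lexicographically, which matches version ordering here.
        return Arrays.compare(MIN, v) <= 0 && Arrays.compare(v, MAX) <= 0;
    }

    /** Fails fast, before any read-modify-write can drop fields this build does not know about. */
    static void checkSupported(String version) {
        if (!isSupported(version)) {
            throw new IllegalStateException("Unsupported record version " + version
                    + "; this tool handles layout-1.0.0 through layout-1.1.0.");
        }
    }

    public static void main(String[] args) {
        checkSupported("layout-1.1");                    // within range, returns normally
        System.out.println(isSupported("layout-1.5.0")); // newer than MAX
    }
}
```

The important design point is that the check runs before any modification, so an unsupported record is never rewritten.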
Data Interpretation Semantics
The notion of a protocol version goes beyond controlling the literal names and types of fields read and written by a given object. A protocol version might also govern the semantics of how an object is used.
For example, in Kiji, table layouts represented by TableLayoutDesc SpecificRecord objects are parsed by the KijiTableLayout class. This class performs a number of validity checks on the underlying TableLayoutDesc. As we want to guide users to stronger validity constraints, we need to make sure that newer versions of Kiji don’t refuse to operate with older persisted table layouts that fail to meet the new validity standards; a backwards-compatible mode must be employed to “grandfather in” any existing table layouts.
KijiTableLayout parses the ProtocolVersion from TableLayoutDesc.version and uses this to guide what semantics it imposes on table layouts. New table layouts that take advantage of the latest features might declare a higher-numbered protocol version that requires the new validity constraints. Old table layouts that use an older protocol version (like the current standard layout-1.1) would not be affected by these new rules. Of course, if you want to upgrade your table layout in a way that takes advantage of new features in the future, you’ll need to increment your protocol version — and possibly make other necessary adjustments to continue to conform to the current layout validation rules.
In the example above, you could imagine changing a protocol version in the Point example to denote that you now regard x and y values as being in meters instead of inches.
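As a rough sketch of version-driven interpretation, the plain Java below reads the same stored number differently depending on the declared version. The class name PointUnits and the choice of "point-2.0" as the version that switches to meters are assumptions for illustration; a major version bump is used because changing units is an incompatible change.

```java
/** Sketch: the version string selects how stored numbers are interpreted. */
final class PointUnits {
    static final double METERS_PER_INCH = 0.0254;

    /** Extracts the major version from strings such as "point-1.0". */
    static int majorVersion(String version) {
        String numbers = version.substring(version.lastIndexOf('-') + 1);
        int dot = numbers.indexOf('.');
        return Integer.parseInt(dot < 0 ? numbers : numbers.substring(0, dot));
    }

    /** Assumed convention: major version 1 stores inches; major version 2 and later store meters. */
    static double xInMeters(float storedX, String version) {
        return majorVersion(version) >= 2 ? storedX : storedX * METERS_PER_INCH;
    }

    public static void main(String[] args) {
        // The same stored value of 100 means different distances under the two versions.
        System.out.println(xInMeters(100f, "point-1.0"));
        System.out.println(xInMeters(100f, "point-2.0"));
    }
}
```

The field names and types are identical in both versions; only the version string tells the reader which interpretation applies.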
Protocol Versions Elsewhere in Kiji
We use the ProtocolVersion class in a number of places. For instance, if you write a JSON file that controls how bulk imports map fields of an input CSV file into columns of a Kiji table, this JSON file also contains the protocol version "import-1.0".
You can use this same JSON file to perform recurring imports from a data source indefinitely, while you periodically upgrade to new versions of KijiMR and the KijiMR library. While we might add new features that improve the bulk import process, an existing JSON import control file that declares a protocol version of import-1.0 will be parsed and interpreted consistently.
A Versioned Point
How do you avoid falling into a similar trap yourself? Let’s return to the Point example from earlier. If from the start you had declared:
record Point {
  float x;
  float y;
  string label;
  string version; // must be point-1.0
}
Then your application could rule out a large class of errors if you write:
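private static final ProtocolVersion MAX_VER = ProtocolVersion.parse("point-1.0");

void modifyPoint() {
  Point in = loadPointCellFromKiji();
  // Refuse to modify records written with a newer protocol version than this code understands.
  // toString() is used because Avro string getters return a CharSequence by default.
  if (MAX_VER.compareTo(ProtocolVersion.parse(in.getVersion().toString())) < 0) {
    throw new RuntimeException("Input point is too new! Please upgrade the point modifier.");
  }
  Point out = doubleXYDistance(in);
  persistPointToKijiCell(out);
}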
The ProtocolVersion class implements Comparable<ProtocolVersion>, so you can compare them easily without needing to worry about string parsing or comparison yourself.
Then when you declare your 3-d point:
record Point {
  float x;
  float y;
  float z = 0.0;
  string label;
  string version; // must be point-1.1
}
... you would also change MAX_VER to "point-1.1". The newer version of your software would work correctly with 3-d points, and the older version of your software would prevent itself from modifying them.
Note that in the Avro schema above, we don't declare the version field with a default value. We don't want Avro to automatically use the reader's default value here: we must use whatever version was persisted.
If you're retrofitting this behavior onto an older Avro data structure, you may need to gracefully handle a "null" default value for a new version field and treat it as the same as your "1.0.0" version.
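One way to handle the retrofit case is to normalize the version once, right after reading the record. The sketch below assumes the new field was added as a nullable string (in Avro IDL, something like `union { null, string } version = null;`) and that records without it should be treated as "point-1.0.0"; the class name VersionDefaults and that legacy version string are illustrative choices, not part of Kiji.

```java
/** Sketch: treat a missing version on pre-versioning records as the original version. */
final class VersionDefaults {
    // Assumed version for records written before the version field existed.
    static final String LEGACY_VERSION = "point-1.0.0";

    /** Returns the stored version, or LEGACY_VERSION when the record predates the field. */
    static String effectiveVersion(CharSequence storedVersion) {
        return storedVersion == null ? LEGACY_VERSION : storedVersion.toString();
    }

    public static void main(String[] args) {
        System.out.println(effectiveVersion(null));
        System.out.println(effectiveVersion("point-1.1"));
    }
}
```

The parameter is a CharSequence because Avro's generated getters for string fields return that type by default.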
Conclusions
Data compatibility is a difficult challenge to manage, and systems that make use of convenient and type-safe precompiled classes may accidentally discard or misinterpret data if they are not written carefully.
With Kiji, data compatibility is one of our primary concerns. We believe Kiji makes it easier to safely store large volumes of data in an evolvable fashion. But how applications manipulate and interpret that data can have a big impact on the final result. Avro's Generic and SpecificRecord interfaces offer different tradeoffs in the same flexibility/safety space, with GenericRecords offering you greater flexibility, while SpecificRecords offer a type-safe Java-friendly API.
Kiji uses SpecificRecord instances combined with this versioned protocol pattern to address this concern with its own data structures.
Curious to learn more about ProtocolVersion?