Posted by & filed under BentoBox, Releases.

Announcing the “Albacore” BentoBox v1.0.4

We are pleased to announce an updated version of the BentoBox SDK for Kiji. This system includes several upgraded components and will make it easier to deliver new releases of Kiji software to you as well. See the previous announcement here.

Albacore v1.0.4 includes the following software:

  • KijiSchema 1.0.3
  • KijiSchema shell 1.0.0
  • KijiMR 1.0.0-rc62
  • KijiMR Library 1.0.0-rc61
  • Kiji Hive Adapter 0.3.0 *UPDATED*
  • KijiExpress 0.3.0 *UPDATED*
  • KijiREST 0.1.0 *NEW*
  • Example code: phonebook and music recommendation tutorials

This BentoBox is powered by Hadoop and HBase via CDH 4.1. It is built around the latest version of KijiSchema.

New: KijiREST

This version of the BentoBox is the first to contain KijiREST, a REST interface for interacting with KijiSchema.
KijiREST is in the rest/ directory of the BentoBox. Its README includes instructions on configuring and running a REST service.

Notable updates to Kiji Hive Adapter and KijiExpress

New releases of Kiji Hive Adapter and KijiExpress have been provided.

  • There is now a KijiExpress Shell. A Scala shell preloaded for KijiExpress can be run with the command “express shell --local” or “express shell --hdfs”. Once a pipe is fully specified from input to output, it can be run with “pipe.run”.
  • The Kiji Hive Adapter can now read EntityIds in Hive. Thanks to Jeff Kolesky for the patch!
  • Various additional bug fixes.

Conclusions

We’re excited about the momentum building behind the BentoBox. New components, improved software stability, and a smoother upgrade process will help enable more powerful big data applications to be built with the BentoBox SDK.

Ready to get started? Download the BentoBox today!

Posted by & filed under BentoBox, Releases.

Announcing the “Albacore” BentoBox v1.0.3

We are pleased to announce an updated version of the BentoBox SDK for Kiji. This system includes several upgraded components and will make it easier to deliver new releases of Kiji software to you as well. See the previous announcement here.

Albacore v1.0.3 includes the following software:

  • KijiSchema 1.0.3
  • KijiSchema shell 1.0.0
  • KijiMR 1.0.0-rc62
  • KijiMR Library 1.0.0-rc61
  • Kiji Hive adapter 0.2.0
  • KijiExpress 0.2.0 *NEW*
  • Example code: phonebook and music recommendation tutorials

This BentoBox is powered by Hadoop and HBase via CDH 4.1. It is built around the latest version of KijiSchema.

New: KijiExpress

This version of the BentoBox is the first to contain KijiExpress, a Scala domain-specific language (DSL) for analyzing and modeling data stored in tables managed by KijiSchema. KijiExpress can be used to author complex flows of MapReduce jobs using a concise and expressive API.

KijiExpress is in the express/ directory of the BentoBox. Its README includes instructions on running express jobs and scripts. A KijiExpress version of the music recommendation tutorial is also available in the examples/express-music directory.

Notable updates to KijiSchema, KijiMR, the KijiMR Library, and the Kiji Hive Adapter

New releases of KijiSchema, KijiMR, the KijiMR Library, and the Kiji Hive Adapter have been provided. They include the following updates and changes:

  • bin/kiji supports loading Hadoop distro-specific dependencies by inferring from "bin/hadoop version" or $KIJI_HADOOP_DISTRO_VER.
  • Added the ability to retrieve entity id from a KijiTableKeyValueStore reader. Call KijiTableKeyValueStore.getTableForReader(reader) to get the underlying KijiTable object.
  • Reorganized the KijiMR project structure for multiple Hadoop distributions. KijiMR jars are now placed in $KIJI_MR_HOME/lib/distribution/hadoop2/. You can now optionally specify hadoop2 on the KijiMR artifact to make your distribution requirement explicit.
  • Added a graph library to the KijiMR Library for developing item-category association mining models within Kiji.
  • Kiji Hive Adapter now properly decodes complex Avro types, Avro unions, and map type families.

Conclusions

We’re excited about the momentum building behind the BentoBox. New components, improved software stability, and a smoother upgrade process will help enable more powerful big data applications to be built with the BentoBox SDK.

Ready to get started? Download the BentoBox today!

Posted by & filed under Uncategorized.

One of the biggest challenges in building a stable software system is maintaining backward compatibility with previous data formats.

Systems like Kiji can help you maintain backward compatibility with your own data, but there are still some challenges involved in using all the features of Apache Avro while maintaining backward compatibility. This blog post will describe a common problem associated with storage and retrieval of Avro data, how we get around this problem in Kiji’s own data structures, and how you can use this technique as well.

A Subtle Problem with SpecificRecords

Like many applications that depend on Avro, Kiji uses Avro’s “SpecificRecord” API to generate custom classes representing data structures based on Avro IDL files. A custom SpecificRecord class contains Java member fields for each field declared in the IDL file, along with getters, setters, and the Avro schema associated with the record.

For example, the following Avro IDL defines a record that holds a point:

record Point {
  float x;
  float y;
  string label;
}

This will be compiled to a Java class named Point that can deserialize, manipulate, and serialize Point instances.

Suppose you’ve used a system like Kiji to save some Point instances that you can manipulate later, and that you had the following method in your code:

/** Create a new point with x and y fields doubled. */
public Point doubleXYDistance(Point input) {
  // Initialize with all fields of the input Point
  Point.Builder builder = Point.newBuilder(input);
  // x and y are floats, so multiply by a float literal; 2.0 would yield a double
  builder.setX(builder.getX() * 2.0f);
  builder.setY(builder.getY() * 2.0f);
  return builder.build();
}

The call to Point.newBuilder() is initialized with fields from the input Point, so fields such as label are set to whatever the input record had.

Then suppose that in the next version of your program, you change the Avro IDL like so:

record Point {
  float x;
  float y;
  float z = 0.0;
  string label;
}

When this is recompiled you’ll have a “z” field, which will be gracefully added to existing records as you read them (when resolving the writer schema with this new reader schema). No problem. The doubleXYDistance() function will do the correct thing with z values as well: it initializes the output Point instance with a z value of 0 if the input Point lacks a z value, and uses the input Point’s z value otherwise.

The main compatibility challenge occurs when you write out new records (with x, y, and z) and then an older application (compiled with the “2-d” version of Point) wants to manipulate them.

The older application can gracefully read 3-d points as 2-d points, discarding the z value. But consider this function:

void modifyPoint() {
  Point in = loadPointCellFromKiji();
  Point out = doubleXYDistance(in);
  persistPointToKijiCell(out);
}

When this writes a new Point object back to Kiji, it will write out a 2-d point! While the best thing would be for the system to accommodate additional newer fields gracefully in this process, the next best thing is to at least detect that you are operating on data that is too new, and refuse to do so before you corrupt the persistent information store.
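
The failure mode can be reproduced without Avro at all. The following is a simulation under stated assumptions, not Avro code: records are modeled as plain maps, and readWithSchema is a hypothetical stand-in for schema resolution that keeps only the fields the reader declares.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simulation only: maps stand in for Avro records; no Avro classes are used. */
final class FieldLossSketch {
  /** Fields known to the older, 2-d application. */
  static final List<String> OLD_READER_FIELDS = List.of("x", "y", "label");

  /** Hypothetical stand-in for schema resolution: keep only fields the reader declares. */
  static Map<String, Object> readWithSchema(Map<String, Object> stored, List<String> readerFields) {
    Map<String, Object> view = new LinkedHashMap<>();
    for (String field : readerFields) {
      if (stored.containsKey(field)) {
        view.put(field, stored.get(field));
      }
    }
    return view;
  }

  /** Mirrors doubleXYDistance(): copy the record, then double x and y. */
  static Map<String, Object> doubleXY(Map<String, Object> in) {
    Map<String, Object> out = new LinkedHashMap<>(in);
    out.put("x", (Float) in.get("x") * 2.0f);
    out.put("y", (Float) in.get("y") * 2.0f);
    return out;
  }

  public static void main(String[] args) {
    // A record written by the newer, 3-d application.
    Map<String, Object> stored = new LinkedHashMap<>();
    stored.put("x", 1.0f);
    stored.put("y", 2.0f);
    stored.put("z", 3.0f);
    stored.put("label", "p");

    // The older application reads, modifies, and would then write this back.
    Map<String, Object> rewritten = doubleXY(readWithSchema(stored, OLD_READER_FIELDS));
    System.out.println("z survived the round trip: " + rewritten.containsKey("z"));
  }
}
```

Running the sketch reports that z did not survive: the value was dropped at read time, so the copy made by the builder never had a chance to preserve it.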

Protocol Versioning

Kiji has this problem with TableLayoutDesc objects that represent Kiji table layouts as an Avro data structure. The KijiSchema shell modifies existing layouts by reading the TableLayoutDesc from the Kiji meta table, creating a modified copy, and persisting it back. We use Avro’s SpecificRecord interface to capture this.

One option to deal with this would be to use Avro’s GenericRecord API; it would allow us to faithfully copy all the fields of the input record into the output. If the new schema is backward compatible with the old schema (as is the case in those two versions of the Point record), we could work with the writer schema interpretation of the data directly.

But by using GenericRecord, we’d lose type-safe functions like TableLayoutDesc.setName(String tableName) (or Point.setX(float x)); we’d have to use the GenericRecord.set(String fieldName, Object value) method, which is a much more error-prone API to operate.

Instead, Kiji uses something we call protocol versioning to get around this issue. Each record type that we intend to operate on has a field named version which is a string. The Kiji TableLayoutDesc Avro record (what your JSON table layout files interact with) includes a version field which today you should set to "layout-1.1".

In Kiji, a protocol version includes a protocol name and a version number in major.minor.revision format. The protocol name is a sanity check on what kind of record this version pertains to. For example, Kiji table layouts all contain a protocol name of “layout”; this prevents the most basic error of trying to parse a similar but unrelated JSON record object in a place where it shouldn’t be.

The version number is a standard version number that follows semantic versioning: the major version changes when an incompatible change is introduced; the minor version for a compatible new feature; and the revision for a bug fix.

When we parse a JSON-formatted Avro record from a file, or load a serialized object into a SpecificRecord, we use the ProtocolVersion class to parse the version field of the record. We can then compare the record’s version against the minimum and maximum protocol versions tolerated by this version of Kiji.

For example, the KijiSchema shell version 1.0.0 supports a maximum ProtocolVersion of layout-1.1.0 for table layouts. It would refuse to operate on a Kiji TableLayoutDesc with version = "layout-1.5.0". This TableLayoutDesc object might contain new fields that the SpecificRecord subclass compiled into KijiSchema Shell 1.0.0 would discard. The user would be directed to use a newer version of the shell to operate on the newer TableLayoutDesc record.
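
To make the check concrete, here is a minimal sketch of parsing and comparing protocol versions. It is not Kiji's own ProtocolVersion class: VersionSketch and isSupported are hypothetical names, and treating omitted components as 0 (so that layout-1.1 equals layout-1.1.0) is an assumption drawn from the examples above.

```java
/** Illustrative stand-in; this is not Kiji's ProtocolVersion class. */
final class VersionSketch implements Comparable<VersionSketch> {
  final String protocol;
  final int major;
  final int minor;
  final int revision;

  private VersionSketch(String protocol, int major, int minor, int revision) {
    this.protocol = protocol;
    this.major = major;
    this.minor = minor;
    this.revision = revision;
  }

  /** Parses "name-major[.minor[.revision]]"; omitted components are assumed to be 0. */
  static VersionSketch parse(String text) {
    int dash = text.lastIndexOf('-');
    if (dash <= 0) {
      throw new IllegalArgumentException("Expected name-major.minor.revision: " + text);
    }
    String[] parts = text.substring(dash + 1).split("\\.");
    if (parts.length > 3) {
      throw new IllegalArgumentException("Too many version components: " + text);
    }
    int[] nums = new int[3];
    for (int i = 0; i < parts.length; i++) {
      nums[i] = Integer.parseInt(parts[i]);
    }
    return new VersionSketch(text.substring(0, dash), nums[0], nums[1], nums[2]);
  }

  /** Orders by protocol name first, then major, minor, and revision. */
  @Override
  public int compareTo(VersionSketch other) {
    int c = protocol.compareTo(other.protocol);
    if (c == 0) c = Integer.compare(major, other.major);
    if (c == 0) c = Integer.compare(minor, other.minor);
    if (c == 0) c = Integer.compare(revision, other.revision);
    return c;
  }

  /** True when a record at version actual is safe for a tool capped at version max. */
  static boolean isSupported(VersionSketch max, VersionSketch actual) {
    return max.protocol.equals(actual.protocol) && actual.compareTo(max) <= 0;
  }

  public static void main(String[] args) {
    VersionSketch max = parse("layout-1.1.0");
    System.out.println(isSupported(max, parse("layout-1.1")));   // same version as the cap
    System.out.println(isSupported(max, parse("layout-1.5.0"))); // newer than the cap
  }
}
```

The protocol name is checked separately from the numeric comparison so that a record from an unrelated protocol is rejected outright rather than ordered against the cap.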

Data Interpretation Semantics

The notion of a protocol version goes beyond controlling the literal names and types of fields read and written by a given object. A protocol version might also govern the semantics of how an object is used.

For example, in Kiji, table layouts represented by TableLayoutDesc SpecificRecord objects are parsed by the KijiTableLayout class. This class performs a number of validity checks on the underlying TableLayoutDesc. As we want to guide users to stronger validity constraints, we need to make sure that newer versions of Kiji don’t refuse to operate with older persisted table layouts that fail to meet the new validity standards; a backwards-compatible mode must be employed to “grandfather in” any existing table layouts.

KijiTableLayout parses the ProtocolVersion from TableLayoutDesc.version and uses this to guide what semantics it imposes on table layouts. New table layouts that take advantage of the latest features might declare a higher-numbered protocol version that requires the new validity constraints. Old table layouts that use an older protocol version (like the current standard layout-1.1) would not be affected by these new rules. Of course, if you want to upgrade your table layout in a way that takes advantage of new features in the future, you’ll need to increment your protocol version — and possibly make other necessary adjustments to continue to conform to the current layout validation rules.
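
Version-gated semantics of this kind can be sketched as follows. Everything specific here is invented for illustration: the "description is required" rule, the layout-1.2 threshold, and the validate method are hypothetical and do not describe KijiTableLayout's actual checks.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical example: the stricter rule and its 1.2 threshold are invented. */
final class GatedValidationSketch {
  /** Extracts {major, minor} from strings such as "layout-1.1"; simplified for this sketch. */
  static int[] majorMinor(String version) {
    String[] parts = version.substring(version.lastIndexOf('-') + 1).split("\\.");
    int minor = parts.length > 1 ? Integer.parseInt(parts[1]) : 0;
    return new int[] {Integer.parseInt(parts[0]), minor};
  }

  /** Applies the stricter rule only to layouts that declare a newer protocol version. */
  static List<String> validate(String version, String tableName, String description) {
    List<String> errors = new ArrayList<>();
    if (tableName == null || tableName.isEmpty()) {
      errors.add("table name is required"); // enforced for every version
    }
    int[] v = majorMinor(version);
    boolean strict = v[0] > 1 || (v[0] == 1 && v[1] >= 2);
    if (strict && (description == null || description.isEmpty())) {
      errors.add("description is required as of layout-1.2"); // invented rule
    }
    return errors;
  }

  public static void main(String[] args) {
    // The same layout content passes under the old version and fails under the new one.
    System.out.println(validate("layout-1.1", "users", ""));
    System.out.println(validate("layout-1.2", "users", ""));
  }
}
```

An existing layout keeps validating until its author raises the declared version, at which point the new constraints apply.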

In the example above, you could imagine changing a protocol version in the Point example to denote that you now regard x and y values as being in meters instead of inches.

Protocol versions elsewhere in Kiji

We use the ProtocolVersion class in a number of places. For instance, if you write a JSON file that controls how bulk imports map fields of an input CSV file into columns of a Kiji table, this JSON file also contains the protocol version "import-1.0".

You can use this same JSON file to perform recurring imports from a data source indefinitely, while you periodically upgrade to new versions of KijiMR and the KijiMR library. While we might add new features that improve the bulk import process, an existing JSON import control file that declares a protocol version of import-1.0 will be parsed and interpreted consistently.

A Versioned Point

How do you prevent yourself from falling into a similar trap? Let’s return to the Point example from earlier. If from the start you had declared:

record Point {
  float x;
  float y;
  string label;
  string version; //  must be point-1.0
}

Then your application could rule out a large class of errors if you write:

private static final ProtocolVersion MAX_VER = ProtocolVersion.parse("point-1.0");

void modifyPoint() {
  Point in = loadPointCellFromKiji();
  if (MAX_VER.compareTo(ProtocolVersion.parse(in.getVersion().toString())) < 0) {
    throw new RuntimeException("Input point is too new! Please upgrade the point modifier.");
  }
  Point out = doubleXYDistance(in);
  persistPointToKijiCell(out);
}

The ProtocolVersion class implements Comparable<ProtocolVersion>, so you can compare them easily without needing to worry about string parsing or comparison yourself.

Then when you declare your 3-d point:

record Point {
  float x;
  float y;
  float z = 0.0;
  string label;
  string version; //  must be point-1.1
}

... you would also change MAX_VER to "point-1.1". The newer version of your software would work correctly with 3-d points, and the older version of your software would prevent itself from modifying them.

Note that in the Avro schema above, we don't declare the version field with a default value. We don't want Avro to automatically use the reader's default value here: we must use whatever version was persisted.

If you're retrofitting this behavior onto an older Avro data structure, you may need to gracefully handle a "null" default value for a new version field and treat it as the same as your "1.0.0" version.
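
One way to handle the retrofit case is to normalize the version before parsing it. This is a small sketch under assumptions: VersionDefaultSketch, effectiveVersion, and the choice of "point-1.0.0" as the legacy version are illustrative names and values, not part of Kiji or Avro.

```java
/** Sketch of defaulting a missing version field; the names here are illustrative. */
final class VersionDefaultSketch {
  /** Version assumed for records written before the version field existed. */
  static final String LEGACY_VERSION = "point-1.0.0";

  /** Returns the persisted version, or the legacy version when the field is absent. */
  static String effectiveVersion(CharSequence persisted) {
    return persisted == null ? LEGACY_VERSION : persisted.toString();
  }

  public static void main(String[] args) {
    System.out.println(effectiveVersion(null));        // record predates the version field
    System.out.println(effectiveVersion("point-1.1")); // record carries its own version
  }
}
```

The parameter is a CharSequence because Avro-generated string getters commonly return that type rather than String.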

Conclusions

Data compatibility is a difficult challenge to manage, and systems that make use of convenient and type-safe precompiled classes may be vulnerable to accidentally discarding or misinterpreting data if they are not written carefully.

With Kiji, data compatibility is one of our primary concerns. We believe Kiji makes it easier to safely store large volumes of data in an evolvable fashion. But how applications manipulate and interpret that data can have a big impact on the final result. Avro's Generic and SpecificRecord interfaces offer different tradeoffs in the same flexibility/safety space, with GenericRecords offering you greater flexibility, while SpecificRecords offer a type-safe Java-friendly API.

Kiji uses SpecificRecord instances combined with this versioned protocol pattern to address this concern with its own data structures.

Curious to learn more about ProtocolVersion?

Posted by & filed under KijiMR, KijiSchema, Releases.

New versions of KijiSchema (1.0.2) and KijiMR (rc61) have been released.

This release consists primarily of incremental improvements and bug fixes to the APIs of both modules. In the next few weeks, users can expect another version of BentoBox incorporating both artifacts along with updated tutorials. Users who want to use them right now can update the versions in their pom.xml files (ask on the kiji user mailing list for help).

Notable changes

  • Row filter improvements have been made in both modules. RowFilters are now correctly serialized into KijiMR jobs (thanks to Jeff Kolesky for contributing to this).
  • KijiTableKeyValueStore reader now uses EntityId keys instead of strings. This is an incompatible change from previous versions of the class in KijiMR.
  • Bug fixes for KijiPager and KijiMR job set-up.
  • The deprecated getEntityId() method has been removed from ProducerContext. Producers needing the entity id should retrieve it with KijiRowData.getEntityId() in their produce() methods instead.

As always, full release notes can be found in RELEASE_NOTES.txt at the top level of both modules.

Posted by & filed under Uncategorized.

Announcing the “Albacore” BentoBox

We are pleased to announce that you can now download a new version of the BentoBox SDK for Kiji. This system includes several upgraded components and will make it easier to deliver new releases of Kiji software to you as well.

We are shifting how we version and release BentoBoxes: approximately each quarter, a new minor version (1.0 -> 1.1 -> 1.2) of the BentoBox will be released. Each such version has a name — in keeping with our overall theme, they’ll be named after sushi. This BentoBox release is called “Albacore.”

Albacore includes the following software:

  • KijiSchema 1.0.1
  • KijiSchema shell 1.0.0
  • KijiMR 1.0.0-rc6
  • KijiMR Library 1.0.0-rc6
  • Kiji Hive adapter 0.1.1
  • Example code: phonebook and music recommendation tutorials

This BentoBox is powered by Hadoop and HBase via CDH 4.1. It is built around the latest version of KijiSchema, which is effectively the same as the KijiSchema 1.0.0 release made last week.

BentoBox improvements

The BentoBox system has been revamped with a new bento upgrade command. When a new version of the BentoBox is available, just type bento upgrade and the new SDK will be downloaded and installed in place (don’t worry, we make a backup of your old BentoBox and data first).

BentoBox instances will periodically check for available updates and notify you when they’re available. You can then do an in-place upgrade with a single command.

The BentoBox now contains several components. To help you keep track of which version of each you’re running, the bento-version.txt file will provide you with a manifest.

Component upgrades

Going forward, each minor version of the BentoBox may include new minor versions of stable software we release (e.g., BentoBox 1.1 may contain KijiSchema 1.1 or 1.2, not 1.0).

Periodic revisions of the BentoBox (1.0.1, 1.0.2, etc.) will include only revisions of stable components like KijiSchema (1.0.1, 1.0.2, 1.0.3, etc.). Experimental components like the Kiji Hive adapter, or “release candidate” components like KijiMR, may be upgraded more aggressively.

New: A Hive Adapter for Kiji

This version of the BentoBox is the first to contain the new Kiji Hive adapter. This plugin for Hive allows you to run HiveQL queries against tables managed by KijiSchema.

The Kiji Hive adapter is in the hive-adapter/ directory of the BentoBox. Its README includes some example queries you can run against the table generated by the music recommendation tutorial.

Updates to KijiMR and KijiMR Library

A new release of KijiMR and the KijiMR library has been provided. This includes the following updates and changes:

  • A new XML bulk importer is included (XMLBulkImporter).
  • You can run bulk imports from the KijiSchema shell now. See the KijiMR bulk importers user guide for more information.
  • The MapReduceJob class has been renamed to KijiMapReduceJob.
  • Your implementations of KijiProducer should now use KijiTableContext.getEntityIdFactory() to create entity ids; ProducerContext.getEntityId() is now deprecated and will be removed soon.
  • Some other minor improvements or changes to the API have been made; see the release notes for full details.

Conclusions

We’re excited about the momentum building behind the BentoBox. New components, improved software stability, and a smoother upgrade process will help enable more powerful big data applications to be built with the BentoBox SDK.

Ready to get started? Download the BentoBox today!

Posted by & filed under Uncategorized.

KijiSchema 1.0.0 is the first version of KijiSchema to drop the “rc” label. From this point forward, we expect to impose significantly more strict requirements on our commitment to API compatibility.

KijiSchema is the first such module to be subject to these requirements; over the coming months, we expect to go through a similar process of more volatile changes followed by a stabilization period for KijiMR. The result of that process will be KijiMR 1.0.0. The next version of KijiMR to be released is 1.0.0-rc6.

There are several important considerations for compatibility. This document will discuss:

  • API version compatibility within KijiSchema and other post-1.0.0 modules
  • Version compatibility between modules (e.g., KijiSchema and KijiMR)
  • Our plan for new module release versioning
  • The future of Kiji Bento Box releases

This document is somewhat long, but forms a contract of sorts, so you should read it. We want to be upfront about our expectations for everyone. This document is written both for the developers of Kiji, and for its users; by adhering to the standards and requirements stated here, we hope to minimize disruption in existing projects within and depending on Kiji while continuing to enable new features, and empower users to determine for themselves when they want to upgrade between versions of individual Kiji modules.

Read on to learn more.

Posted by & filed under Uncategorized.

We are proud to announce the immediate availability of KijiSchema 1.0.0.

You can use this in your Maven projects, or download it from http://www.kiji.org/getstarted/.

A BentoBox containing KijiSchema 1.0.0 (and the next release of KijiMR and some other goodies) will be available within a couple of weeks.

KijiSchema 1.0.0 represents the first stable API for this component; subsequent versions of KijiSchema will contain new features or improvements but compatibility with existing releases will be preserved. A longer message about this subject and the precise specifications of “compatible” and “stable” will be released tomorrow.

KijiSchema 1.0.0 includes a few important new features and changes since rc5. The degree of incompatible changes should be very modest.

  • The close() method (deprecated in rc5) has been removed from KijiTable. You must now use KijiTable.release(). This is the biggest incompatible change in this release that most users will see.
  • A new KijiBufferedWriter class allows you to buffer and transmit a number of writes to a Kiji table in batch for improved performance.
  • Documentation (especially Javadoc) has been improved in a number of cases. We have attempted to specify more precisely how most of our core classes work.
  • Stability annotations (Experimental, Evolving, or Stable) have been added to all classes.
  • As a bug fix, get() and scan() operations on empty tables with no column families now return empty responses/scanners, rather than null.
  • A few other bug fixes and tune-ups throughout.
  • Note that many classes marked @Deprecated in the KijiSchema release candidates have been removed, but others persist. These classes may be removed soon, in particular the org.kiji.schema.mapreduce package. Please use KijiMR for your MapReduce needs. (If you upgrade to KijiSchema 1.0.0, your code should compile without any deprecation warnings.)

All users are encouraged to promptly upgrade to KijiSchema 1.0.0. Usage of any “release candidate” editions of KijiSchema is heavily discouraged, as they are not compatible with current and future releases.

Posted by & filed under KijiMR.

One of the most important considerations when using any kind of data store is how to get your data into the store. While abstraction layers like KijiSchema are useful for avoiding the need to understand the underlying implementation details of HBase in the random access use case, bulk importing data often involves huge datasets that warrant special consideration from both a performance and an operational perspective. This post describes two ways to run a bulk import and walks through a real-world use case of bulk importing some actual data.

While we have a variety of canned bulk importers, described in the bulk importer user guide, for parsing and importing various common file formats, there are other important job configuration parameters that control how the imports are executed.

Option 1: Using the KijiSchema PUT API

In direct writing mode, KijiMR will create a MapReduce job that reads over the input and generates the desired put(s) associated with each entry in the data. These puts are directly applied to a Kiji table as the job progresses.

Benefits:

  • Ability to run this safely on an active cluster.
  • The data that has been loaded is available immediately while the job is in progress.
Drawbacks:

  • If this job fails partway for any reason, we’ll have partially loaded data.
  • This would involve generating writes for every entity, which may result in potentially significant write load on the cluster.
  • Since Kiji doesn’t yet support atomic writes to the same row (but this is coming soon), bulk importers will generate a write for each individual cell, even if they are associated with the same row. This would result in greater disk utilization.

Option 2: Bulk Loading using HFiles

For bulk loading, Kiji can create a MapReduce job that reads over the input and generates HFiles containing the rows that can be loaded directly into HBase using its bulk load functionality. Once the job completes, the HFiles can be loaded into the Kiji table to allow them to be accessed.

Benefits:

  • Atomicity for committing on the job. Either we get everything or we get nothing.
  • Orders of magnitude faster (see comparison below).

Drawbacks:

  • As a result of the ordering of the StoreFiles, this could trigger a compaction on running clusters, which could be painful on large clusters. See the commentary in HBASE-3404 for more information.

My Recommendation

For a new cluster, the recommended practice is to bulk load via HFiles. This allows for the initial backlog of data to be imported quickly, while avoiding the compaction issues above (since there’s no data in the cluster). If this bulk import job is for an existing Kiji instance, using the direct puts will allow the existing cluster to continue to respond to requests while still accepting the new data.

    The Test Drive

    As part of my testing of this new functionality, I tried bulk importing some of the sample data from the Kaggle Blue Book for Bulldozers challenge. This data came in a CSV file of 401,126 lines, including a header line that contains the names of the 53 fields. The rest of this example assumes that you have a BentoBox and a Kiji instance installed.

    Specifying the layout to create a Kiji table with this many columns would have required a rather long-winded DDL. Since we need at least one line for each column family:qualifier, this would involve a lot of tedious copying and pasting. To accelerate this process, I wrote a little script, generate-ddl.sh, that takes in said CSV file, parses the header, and auto-generates a default DDL assuming that every field is a string. The user can then simply modify this generated DDL to produce the layout that they are looking for. Once we are happy with this, we can create the table using the kiji-schema-shell:

    ./generate-ddl.sh Train.csv > Train.ddl
    kiji-schema-shell --file=Train.ddl

    The CSVBulkImporter requires an import descriptor that defines the mapping from the source fields in the CSV to the destination Kiji columns. Since this also depends on the fields, I’ve written another little script, generate-import-descriptor.sh, that takes in said CSV file, parses the header, and auto-generates a default import descriptor JSON file.

    ./generate-import-descriptor.sh Train.csv > Train.json
    hadoop fs -copyFromLocal Train.json /

    Finally we need to trim the header from the data file that we wish to bulk-import, and copy it over to HDFS so that the MapReduce job can get at it.

    tail -n +2 Train.csv > Train-no-header.csv
    hadoop fs -copyFromLocal Train-no-header.csv /

    With Option 1: Using the KijiSchema PUT API

    We can use Kiji to create a bulk importer job whose output is a Kiji table (note the --output parameter).

    kiji bulk-import \
      -Dkiji.import.text.column.header_row="$(head -1 Train.csv)" \
      -Dkiji.import.text.input.descriptor.path=/Train.json \
      --importer=org.kiji.mapreduce.lib.bulkimport.CSVBulkImporter \
      --output="format=kiji table=kiji://.env/default/train nsplits=1" \
      --input="format=text file=/Train-no-header.csv"

    This took 78 minutes on a bento cluster running on my local machine, and consumed 3.2 GB of disk space.

    With Option 2: Bulk loading via HFiles

    Alternatively, we can use Kiji to create a bulk importer job whose output is an HFile. The main difference here is that the --output parameter now uses the hfile format and specifies the destination HFile. Note: we still need a table to know what layout the HFile should use.

    kiji bulk-import \
      -Dkiji.import.text.column.header_row="$(head -1 Train.csv)" \
      -Dkiji.import.text.input.descriptor.path=/Train.json \
      --importer=org.kiji.mapreduce.lib.bulkimport.CSVBulkImporter \
      --output="format=hfile table=kiji://.env/default/train nsplits=1 file=hdfs://localhost:8020/train.bulkload" \
      --input="format=text file=/Train-no-header.csv"

    Finally, once these files have been created, they can be bulk loaded with the bulk-load tool:

    kiji bulk-load \
      --hfile=hdfs://localhost:8020/train.bulkload/part-r-00000.hfile \
      --table=kiji://.env/default/train

    This took 2 minutes on a bento cluster running on my local machine, and consumed 483.4 MB of disk space. This is vastly (roughly 40x) faster than the individual PUTs on my little laptop.

    Now all of this data has been loaded into Kiji to do whatever you might like for post processing!

    Performance Stats

    Method                      | Processing Time (min.) | Disk Usage
    Using KijiSchema PUTs       | 78                     | 3.20 GB
    Bulk Loading using HFiles   | 2                      | 0.47 GB

    Above are the results for bulk importing data into uncompressed Kiji tables. As you can see, there’s nearly a 40x difference in bulk import time, and a more than 6x difference in disk utilization. Using HFiles to bulk import data is more performant than doing direct writes, but there are potential operational difficulties on a running cluster, with the possibility of an (expensive) compaction looming when the data files are loaded.
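
The ratios quoted here follow directly from the measurements in the table. As a quick check, the arithmetic can be written out; ImportRatioSketch and ratio are illustrative names, and the inputs are the figures reported above.

```java
import java.util.Locale;

/** Recomputes the speedup and disk ratios from the measurements reported above. */
final class ImportRatioSketch {
  /** Ratio of the larger measurement to the smaller one. */
  static double ratio(double larger, double smaller) {
    return larger / smaller;
  }

  public static void main(String[] args) {
    double timeRatio = ratio(78.0, 2.0);   // minutes: PUTs vs. HFiles
    double diskRatio = ratio(3.20, 0.47);  // GB: PUTs vs. HFiles
    System.out.println(String.format(Locale.ROOT,
        "%.1fx faster, %.1fx less disk", timeRatio, diskRatio));
  }
}
```

This prints "39.0x faster, 6.8x less disk", which matches the rounded figures in the text.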

    Addendum: Timestamps

    In Common Pitfalls of Timestamps in HBase, we discuss many of the potential pain points of dealing with timestamps within HBase. While in general we don’t recommend manually setting timestamps, in the case of the initial bulk importing of data, it’s often beneficial to backfill the timestamps based on the data to be imported. This way initial imported data can have the same behavior as newly added data in your application.

    Posted by & filed under KijiSchema, Releases.

    A new version of KijiSchema (1.0.0-rc5) has been released!

    This version of KijiSchema continues the maturation process of the KijiSchema framework’s API. The improvements from the previous version are primarily internal and incremental. Wherever possible, previous functionality has been kept and marked as deprecated rather than removed. Users should see fewer dramatic user-facing alterations than the rc4 release. Further, this will be the final release candidate before the stable 1.0.0 release. Between rc5 and 1.0.0 changes should be limited to critical changes, improving functionality of the DDL shell and CLI, and tidying up the code base.

    This release is also synchronized with our first official release of the KijiMR module, which lets users write MapReduce jobs to bulk load and analyze data stored in KijiSchema tables. See the associated blogpost and userguide for more information about KijiMR. KijiMR is a younger module and can be expected to undergo a few iterations before its own 1.0.0 release.

    Major new features

    We added several new features to KijiSchema in this version:

    • Formatted EntityIds (i.e. composite keys of multiple components) are now usable in all KijiSchema command line tools. Programmatic support for formatted EntityIds was added in rc4.
    • Support for formatted EntityIds in the DDL shell has also been added.
    • kiji get and kiji scan have been added. They take over the row-reading functionality that previously existed as part of the kiji ls command.
    • kiji ls has changed: it can now list the columns in a table, while the ability to extract data stored in rows has moved to the get and scan tools mentioned above.
    • A CLI tool for manipulating the KijiSchema SystemTable has been added. The SystemTable stores instance-wide settings and is now included in metadata backups.
    • CLI tools now accept relative KijiURIs if the user is using kiji://.env as their zookeeper quorum. Instance, table, and column names must still be specified. For example, default/table_foo is shorthand for kiji://.env/default/table_foo.
    • AtomicKijiPutter has been added, providing the ability to apply multiple puts atomically to a single EntityId. Use table.getWriterFactory().openAtomicWriter() to obtain one.
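
The relative KijiURI shorthand in the list above can be pictured with a small shell sketch. The helper name expand_kiji_uri is hypothetical and is not part of the Kiji CLI; it only mirrors the rule that a relative URI is resolved against kiji://.env, and it assumes an empty argument refers to the .env quorum itself.

```shell
# Hypothetical helper (not Kiji's own code): expand a relative KijiURI
# into its absolute form, assuming kiji://.env is the ZooKeeper quorum.
expand_kiji_uri() {
  case "$1" in
    kiji://*) printf '%s\n' "$1" ;;              # already absolute: keep as-is
    "")       printf '%s\n' "kiji://.env" ;;     # assumption: no argument means the quorum itself
    *)        printf '%s\n' "kiji://.env/$1" ;;  # relative: resolve against kiji://.env
  esac
}

expand_kiji_uri default/table_foo
expand_kiji_uri kiji://localhost:2181/default
```

Absolute URIs pass through unchanged, so scripts can accept either form from their callers.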

    We made a few API changes that application developers should be aware of:

    • New backup format — The API remains the same, but the metadata backup and its associated command line tool now also save information from the SystemTable (containing instance-wide settings) and use a new format. You should re-backup your tables using the new version of the API.
    • The kiji ls command can no longer perform get and scan — As mentioned above, the new tools kiji get and kiji scan can be used for reading data stored in KijiSchema tables from scripts and the command line.
    • KijiDataRequestValidator’s constructor has been hidden and validate() changed — Framework developers should use the factory method validatorForLayout() to get a validator which can validate KijiDataRequests against a table layout.
    • KijiRowKeySplitter is now a non-static class — Use KijiRowKeySplitter.get() to obtain an instance.
    • KijiRowData.getValues("family") and .getMostValues("family") now only work on map-type column families — This change is important to be able to return something more expressive than a map of Objects.
    • KijiTable.close() has been deprecated — Applications should use KijiTable.release() instead.
    • Kiji.modifyTableLayout() no longer takes a table name — Instead, the name is read from the TableLayoutDesc parameter. The older version of modifyTableLayout() still exists but has been deprecated.
    • Kiji.createTable() now takes a TableLayoutDesc object — The older version of createTable() that takes a KijiTableLayout still exists but has been deprecated.

      Using new Kiji CLI features

      As mentioned above, the functionality of listing row data has been migrated out of kiji ls and into the get and scan tools. kiji ls has also gained the ability to list columns. Here are some example commands using these tools:

      List all kiji instances:

      kiji ls kiji://.env
      kiji ls kiji://localhost:2181
      kiji ls kiji://{host1,host2}:2181
      kiji ls # Relative KijiURI
      

      List all kiji tables in kiji://.env/default:

      kiji ls kiji://.env/default
      kiji ls default # Relative KijiURI
      

      List all columns in a kiji table ‘table_foo’:

      kiji ls default/table_foo
      kiji ls kiji://.env/default/table_foo
      

      To retrieve data from a single row whose EntityId is known, use the kiji get tool. This command retrieves and displays the columns info:email and derived:domain from the row identified by EntityId “bar” in table table_foo:

      kiji get default/table_foo/info:email,derived:domain \
          --entity-id=bar
      

      To get more than one row, use the scan tool. This command scans through up to 10 rows, starting from the first row, and prints the columns info:email and derived:domain of table table_foo:

      kiji scan \
          kiji://.env/default/table_foo/info:email,derived:domain \
          --max-rows=10
      

      Scans also support start and limit ranges. This command scans through table table_foo from start-row 0x50 (included) to limit-row 0xe0 (excluded):

      kiji scan \
          kiji://.env/default/table_foo/info:email,derived:domain \
          --start-row=hex:50 \
          --limit-row=hex:e0
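
The included/excluded boundaries can be illustrated with a short shell sketch. The helper in_scan_range is hypothetical and not part of the Kiji CLI. It assumes one-byte row keys written as two hex digits, for which numeric order matches byte order; longer keys are not handled here.

```shell
# Hypothetical helper: succeed if a one-byte row key (two hex digits)
# lies in the half-open range [start, limit).
in_scan_range() {
  key=$((0x$1)); start=$((0x$2)); limit=$((0x$3))
  [ "$key" -ge "$start" ] && [ "$key" -lt "$limit" ]
}

# Check keys just outside and just inside both boundaries.
for k in 4f 50 df e0; do
  if in_scan_range "$k" 50 e0; then
    echo "$k: scanned"
  else
    echo "$k: skipped"
  fi
done
```

Under these assumptions, key 50 falls inside the range and key e0 falls outside it, matching the start-row and limit-row semantics described above.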
      

      Command line tools will now accept formatted EntityIds for tables whose layout uses them. For example, the following command inserts a value into the users table, assuming that the table uses formatted EntityIds consisting of two strings (first and last name in this case):

      kiji put --target=default/users/info:state --entity-id="[ 'Kimball', 'Aaron' ]" --value='"CA"'
      

      In summary…

      KijiSchema’s rc5 release is an important step in the continued maturation of its functionality and API stabilization. Check back in a few weeks for the stable 1.0.0 release.

      Ready to try Kiji? Go download the BentoBox today!

    Posted by & filed under Releases.

    The Kiji ecosystem has grown with the addition of a new module, KijiMR.

    The Kiji framework is a collection of components that offer developers a handle on building Big Data Applications. In addition to the first release, KijiSchema, we are now proud to announce the availability of a second component: KijiMR.

    KijiMR allows KijiSchema users to apply MapReduce techniques, including machine learning algorithms and complex analytics, to develop many kinds of applications using data in KijiSchema.

    Read on to learn more about the major features included in KijiMR and how you can use them.