
Building a Hive Adapter

Traditional Business Intelligence tools often rely on SQL to connect to their data stores. Kiji, which is built on top of Apache Hadoop and Apache HBase, provides substantial capabilities for storing and processing data. However, because that data is exposed via a Java API, it can be time consuming and cumbersome to execute the kinds of analysis (particularly ad hoc queries) that businesses need to perform to stay competitive and glean insight from their data. The Kiji Hive Adapter allows HiveQL queries to be run against Kiji tables in conjunction with Apache Hive, which provides a SQL-like interface on top of MapReduce.

The Kiji Hive Adapter provides an implementation of a StorageHandler, which dictates to Hive which SerDe (or Serializer/Deserializer) to use for processing records. In Hive, a SerDe is an implementation of two interfaces:

  • The Deserializer interface handles decoding data that is written to HDFS and converting it into Hive objects.
  • The Serializer interface handles encoding data from Hive objects and writing it into the backing data store.

Currently the Hive adapter only supports the Deserializer portion of the interface, providing read access to Kiji tables. In the near future, a Serializer implementation will be added to allow for write access (KIJIHIVE-20).

    Considerations

    Normally, when tables are created within Hive from raw data, Hive assumes control over the data itself. While this is mostly transparent to users, it has the side effect that the underlying data is deleted from the data warehouse if/when the table is dropped in Hive. For this Hive adapter, that behavior is undesirable, so we use external tables instead to express that the data should continue to be maintained by Kiji.

    Kiji Row Expressions

    HBase’s data model is a multi-map structure, which doesn’t map cleanly onto many SQL-style use cases. So, in order to make relevant data more accessible, we introduce the concept of a Kiji Row Expression to address data within Kiji. As Kiji tables are addressed via a family and an (optional) qualifier, this naturally translates into the row expressions.

    Row Expression        Hive Type                                      Description
    family                MAP<STRING, ARRAY<STRUCT<TIMESTAMP, cell>>>    All values in a map-type family
    family[n]             MAP<STRING, STRUCT<TIMESTAMP, cell>>           nth most recent value from a map-type family
    family:qualifier      ARRAY<STRUCT<TIMESTAMP, cell>>                 All values from a family:qualifier combo
    family:qualifier[n]   STRUCT<TIMESTAMP, cell>                        nth most recent value from a family:qualifier combo

    Beyond the base family and family:qualifier expressions, there is also the ability to address cell indexes ordered by timestamp via [n]. For example, if we want only the most recent version of a cell, we can append [0] to the expression. Alternately, if we want the oldest version of a cell, we can use the special index [-1] to signify that we want the original version of the cell.
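As a quick sketch of how the indexing forms pair with the Hive types in the table above (the info:name and info:visits columns here are hypothetical, not part of any shipped schema):

```sql
-- Hypothetical columns on a Kiji table; each row expression and the
-- Hive type it would surface as, per the table above:
--   info:name[0]    -- most recent version:  STRUCT<TIMESTAMP, cell>
--   info:name[-1]   -- oldest version:       STRUCT<TIMESTAMP, cell>
--   info:visits     -- all versions:         ARRAY<STRUCT<TIMESTAMP, cell>>
```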

    Types within the Kiji Hive Adapter

    As data in Kiji is stored using Avro serialization, we must incorporate a mapping from the Avro types into the equivalent Hive types. For primitives this generally follows the obvious mapping: an Avro “string” maps to a Hive STRING. It is worth noting that a nullable Avro type does not need to be specifically nullable in Hive, as Hive types are already nullable by default.

    For complex types, we build a mapping (within the AvroTypeAdapter class) from the Avro data to the Hive type recursively until we reach primitives, as follows:

    Avro Type                  Hive Type
    record { T f1; U f2; … }   STRUCT<f1: T, f2: U, …>
    map<T>                     MAP<STRING, T>
    array<T>                   ARRAY<T>
    union {T, U, …}            UNIONTYPE<T, U, …>
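To make the recursion concrete, here is a sketch for a hypothetical Avro record (the SongCount record below is illustrative, not taken from an actual schema):

```sql
-- Sketch: a hypothetical Avro record
--
--   record SongCount { string song_id; long count; }
--
-- would map to the Hive type
--
--   STRUCT<song_id: STRING, count: BIGINT>
--
-- and an Avro array<SongCount> would map, recursively, to
--
--   ARRAY<STRUCT<song_id: STRING, count: BIGINT>>
```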

    To use these row expressions within Hive, we specify them as a comma-separated list in the kiji.columns property of the SERDEPROPERTIES clause of a CREATE EXTERNAL TABLE statement.

    Creating a Hive view of a Kiji table

    Using the row expressions above, we can thus create a materialization of a Kiji table within Hive via a CREATE EXTERNAL TABLE statement such as (for the kiji-music example):

    CREATE EXTERNAL TABLE users (
      user STRING,
      track_plays ARRAY<STRUCT<ts: TIMESTAMP, value: STRING>>,
      next_song_rec STRUCT<ts: TIMESTAMP, value: STRING>
    )
    STORED BY 'org.kiji.hive.KijiTableStorageHandler'
    WITH SERDEPROPERTIES (
      'kiji.columns' = ':entity_id[0],info:track_plays,info:next_song_rec[0]'
    )
    TBLPROPERTIES (
      'kiji.table.uri' = 'kiji://.env/kiji_music/users'
    );
    

    Tables materialized through the Kiji Hive Adapter should be able to utilize all of the read functionality of the HiveQL language. The Kiji Hive Adapter README.md file contains several examples of queries that can be run on top of the data written by the music recommendation example project. For cases where HiveQL is insufficient or inelegant, Hive UDFs can be written to extend the functionality.
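For instance, one query one might run against the users table defined above (this particular aggregation is an illustrative sketch, not one of the README examples):

```sql
-- Count track plays per user by exploding the ARRAY<STRUCT<...>> column
-- produced by the info:track_plays row expression.
SELECT user, COUNT(*) AS num_plays
FROM users
LATERAL VIEW explode(track_plays) plays AS play
GROUP BY user
ORDER BY num_plays DESC
LIMIT 10;
```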

    Next time: Exploratory data analysis using Hive.
