Posted by & filed under KijiSchema.

In this economic landscape, many companies are striving to create a 360º view of the entities their organization is built on. Companies need to better understand the past, present and future of each entity (a user, customer, account, machine or sensor) in order to make their business successful.

Building a 360º view is very difficult with traditional relational systems, but fortunately new strategies have emerged. Systems like Kiji aim to simplify the development of 360º views. We can now store data in an entity-centric manner, making it possible to create a dynamic profile and enable the delivery of personalized experiences in real time.


What’s wrong with traditional data management?

Snowflake ERD

Entity-relationship diagrams, or ERDs, are used to describe databases by graphically representing the entities stored, along with their attributes, and how they relate to each other. The relationships connecting the entities within an ERD represent the complex series of joins that have to be executed in order to describe all of an entity’s attributes. These costly joins are performed at query time to extract a full view of the entity.

A fairly common pattern in data warehousing systems is to organize data into a star schema. Star schemas have one central fact table containing transactional information. In the context of an e-commerce website, these transactions are likely purchases or shopping cart manipulations. For a retail bank, they include credits to and debits from accounts. Surrounding the central fact table are dimensions such as product SKU information or geographic location data. These dimension tables are joined to the fact table to provide a more complete view of the details of a transaction. There are a few challenges in using this method. First, the requisite joins are often very expensive to compute. Second, this type of schema centers around transactions, rather than the users generating those transactions.
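To make the cost of the star schema concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (fact_purchase, dim_user, dim_product, dim_geo) are hypothetical and chosen only for illustration. Note that answering even a simple question about one user requires joining the fact table to every dimension at query time.

```python
import sqlite3

# Hypothetical star schema: one fact table surrounded by three dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_user (user_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_product (sku TEXT PRIMARY KEY, title TEXT);
CREATE TABLE dim_geo (geo_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE fact_purchase (
    purchase_id INTEGER PRIMARY KEY,
    user_id INTEGER REFERENCES dim_user(user_id),
    sku TEXT REFERENCES dim_product(sku),
    geo_id INTEGER REFERENCES dim_geo(geo_id),
    amount_cents INTEGER
);
INSERT INTO dim_user VALUES (1, 'alice'), (2, 'bob');
INSERT INTO dim_product VALUES ('SKU-1', 'kettle'), ('SKU-2', 'teapot');
INSERT INTO dim_geo VALUES (10, 'Seattle'), (20, 'Austin');
INSERT INTO fact_purchase VALUES
    (100, 1, 'SKU-1', 10, 2500),
    (101, 2, 'SKU-2', 20, 1800),
    (102, 1, 'SKU-2', 10, 1800);
""")


def purchases_for_user(conn, user_id):
    """Rebuild one user's purchase history by joining every dimension."""
    query = """
        SELECT u.name, p.title, g.city, f.amount_cents
        FROM fact_purchase AS f
        JOIN dim_user AS u ON u.user_id = f.user_id
        JOIN dim_product AS p ON p.sku = f.sku
        JOIN dim_geo AS g ON g.geo_id = f.geo_id
        WHERE f.user_id = ?
        ORDER BY f.purchase_id
    """
    return conn.execute(query, (user_id,)).fetchall()


alice_view = purchases_for_user(conn, 1)
print(alice_view)
```

The schema is organized around purchases, so the user's view has to be reassembled from the transactions on every read.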

Another major challenge is that these data warehouses store only historical data, which encompasses just a portion of the entire entity profile. In most modern enterprise data architectures, the information about the current state of an entity resides in a separate operational datastore. Achieving the 360º view requires combining data from both systems. Maintaining these disparate datastores results in millions of dollars being spent on ETL solutions to cleanse data and move it back and forth between operational and historical data storage systems.


Kiji provides Entity-Centric Storage

Kiji Row

To provide the entire and current profile of an entity, a different type of storage system is necessary. It is imperative that this system can store all of the data about a single entity, both its current state and its historical events, in one central, convenient location. The system should be able to store complex, rich data types in order to avoid having to build additional database tables for the purpose of storing different types of information. Furthermore, the system needs to be able to store and query that data in real time to ensure the application can act on the information and provide recommendations in a timeframe that is still relevant to the end user.

Kiji builds upon Apache Avro, Apache HBase and Apache Hadoop to address precisely these requirements. Building on these open-source projects gives Kiji the massive scalability of Hadoop's shared-nothing design. MapReduce is used to train and score predictive models, as well as to extract features for those models. Kiji uses Avro to serialize rich, application-defined data into a format that HBase can store effectively, and to let application schemas evolve over time. Using HBase, Kiji also makes it easy to both update and query entities.
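Schema evolution is easier to picture with a small example. The sketch below is plain Python and does not use the Avro library or any Kiji API. It only imitates two of Avro's schema resolution rules: a field the reader expects but the writer never wrote is filled from the reader's default, and a field the writer wrote but the reader no longer declares is ignored. The User record and its fields are hypothetical.

```python
# Version 1 of a hypothetical record, used when the data was written.
WRITER_SCHEMA = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "email", "type": "string"},
    ],
}

# Version 2, used by a newer application: adds an optional field with a default.
READER_SCHEMA = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "email", "type": "string"},
        {"name": "zip_code", "type": ["null", "string"], "default": None},
    ],
}


def resolve(record, writer_schema, reader_schema):
    """Project a record written with writer_schema onto reader_schema."""
    writer_fields = {field["name"] for field in writer_schema["fields"]}
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            # Field exists in both schemas: keep the stored value.
            resolved[name] = record[name]
        elif "default" in field:
            # Field is new to the reader: fall back to its declared default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved


old_record = {"name": "alice", "email": "alice@example.com"}
upgraded = resolve(old_record, WRITER_SCHEMA, READER_SCHEMA)
print(upgraded)
```

Because old records stay readable under the new schema, an application can add fields without rewriting the data it has already stored.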

One of the critical features that HBase provides is an additional time dimension along which data can be stored. HBase allows applications to store and query multiple versions of a particular data cell. Kiji leverages this functionality to store event streams and timeseries data in the same row as an entity’s current state. For example, a column in a Kiji table might be a stream of all the purchases, clicks and cart modifications a user has performed on a website.
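As a rough illustration of the time dimension, the sketch below models a single row in memory as a mapping from column name to timestamped versions. This is a toy model and not the HBase or Kiji API. The EntityRow class, its methods and the column names are all hypothetical.

```python
from collections import defaultdict


class EntityRow:
    """Toy model of one row: column name -> {timestamp: value}."""

    def __init__(self, entity_id):
        self.entity_id = entity_id
        self._cells = defaultdict(dict)

    def put(self, column, timestamp, value):
        """Write one version of a cell; older versions are kept."""
        self._cells[column][timestamp] = value

    def latest(self, column):
        """Return the most recent value in a column, or None if empty."""
        versions = self._cells.get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def versions(self, column, max_versions=None):
        """Return (timestamp, value) pairs, newest first."""
        cells = sorted(self._cells.get(column, {}).items(), reverse=True)
        return cells if max_versions is None else cells[:max_versions]


row = EntityRow("user:alice")

# Current state: only the newest version is usually of interest.
row.put("info:email", 1000, "alice@example.com")
row.put("info:email", 2000, "alice@new.example.com")

# Event stream: every version is a distinct event in the same row.
row.put("events:clicks", 1100, "viewed kettle")
row.put("events:clicks", 1200, "added kettle to cart")
row.put("events:clicks", 1300, "purchased kettle")

print(row.latest("info:email"))
print(row.versions("events:clicks", max_versions=2))
```

The same read path serves both needs: asking for the latest version gives the entity's current state, while asking for many versions gives its history.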

Kiji completes the 360º view by enabling the deployment of predictive models that can be scored in real time. This allows applications to use machine learning techniques to predict future states and events for the entities stored in Kiji, which can be used to provide extremely relevant recommendations, offers or other types of personalization. Now websites can deliver the content users want when they want it. E-commerce platforms can recommend the items their customers are actively searching for, increasing profit margins and customer satisfaction.
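The idea of scoring at read time can be sketched in a few lines. The example below is not a Kiji API and not a trained model. It uses a deliberately simple frequency count as a stand-in for a real predictive model, and the recommend function, the event format and the catalog are all hypothetical. The point is only that the score is computed from the entity's most recent events at the moment the application asks for it.

```python
from collections import Counter


def recommend(events, catalog, now, window):
    """Recommend the featured item of the category viewed most often recently.

    events  -- list of (timestamp, category) pairs for one entity
    catalog -- mapping from category to a featured item
    now     -- current timestamp
    window  -- how far back to look, in the same units as the timestamps
    """
    recent = [category for ts, category in events if now - window <= ts <= now]
    if not recent:
        return None  # No fresh signal, so make no recommendation.
    top_category, _count = Counter(recent).most_common(1)[0]
    return catalog.get(top_category)


events = [(1100, "kitchen"), (1200, "kitchen"), (1300, "garden")]
catalog = {"kitchen": "teapot", "garden": "trowel"}

wide = recommend(events, catalog, now=1300, window=250)
narrow = recommend(events, catalog, now=1300, window=50)
stale = recommend(events, catalog, now=5000, window=100)
print(wide, narrow, stale)
```

Shrinking the window changes the answer, which is why the freshness of the stored events matters as much as the model itself.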


Conclusion

These days, there is a breadth of information available on every entity an organization interacts with, and 360º views are critical to better understanding an application’s individual entities. Pioneers like Amazon, Facebook, Netflix and Google have successfully demonstrated the power and competitive advantage of real-time personalization via entity-centric storage.

Frameworks like Kiji provide the tools and libraries necessary to build Big Data Applications that can store the diverse data required for a 360º view, analyze the data in a meaningful way and use that analysis to improve the application experience. The rise of entity-centric storage has made it economically viable and efficient for organizations to build Big Data Applications that provide a competitive advantage and effectively execute predictive analysis to provide relevant content to end users in a timely manner.
