
Announcing Kylin: Extreme OLAP Engine for Big Data

eBay Tech Blog

We are very excited to announce that eBay has released to the open-source community our distributed analytics engine: Kylin (http://kylin.io). Designed to accelerate analytics on Hadoop and allow the use of SQL-compatible tools, Kylin provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop to support extremely large datasets.

Kylin is currently used in production by various business units at eBay. In addition to open-sourcing Kylin, we are proposing Kylin as an Apache Incubator project.

Background

The challenge we face at eBay is that our data volume keeps growing while our user base becomes more diverse. Our users – in analytics and business units, for example – consistently ask for minimal latency but want to continue using their favorite tools, such as Tableau and Excel.

So, we worked closely with our internal analytics community and outlined requirements for a successful product at eBay:

  1. Sub-second query latency on billions of rows
  2. ANSI-standard SQL availability for those using SQL-compatible tools
  3. Full OLAP capability to offer advanced functionality
  4. Support for high cardinality and very large dimensions
  5. High concurrency for thousands of users
  6. Distributed and scale-out architecture for analysis in the TB to PB size range

We quickly realized that nothing external met our exact requirements – especially in the open-source Hadoop community. To meet our emerging business needs, we decided to build a platform from scratch. With an excellent team and several pilot customers, we have been able to bring the Kylin platform into production as well as open-source it.

Feature highlights

Kylin is a platform offering the following features for big data analytics:

  • Job management and monitoring
  • Compression and encoding to reduce storage
  • Incremental refresh of cubes
  • Leveraging of the HBase coprocessor for lower query latency
  • Approximate query capability for distinct counts via HyperLogLog (see the sketch after this list)
  • Easy-to-use Web interface to manage, build, monitor, and query cubes
  • Security capability to set ACL at the cube/project level
  • Support for LDAP integration
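
To illustrate the approximate distinct-count idea behind the HyperLogLog feature above, here is a minimal HyperLogLog-style estimator in Java. It is only a sketch, using the raw estimate without small- or large-range corrections, and it is not Kylin's implementation; the register count and hash choice are arbitrary.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/**
 * Minimal HyperLogLog-style estimator (raw estimate only, no small- or
 * large-range corrections). Illustrative sketch, not Kylin's implementation.
 */
public class HllSketch {
    private final int b;            // number of index bits
    private final int m;            // number of registers = 2^b
    private final byte[] registers;

    public HllSketch(int b) {
        this.b = b;
        this.m = 1 << b;
        this.registers = new byte[m];
    }

    public void add(String value) {
        long hash = hash64(value);
        int idx = (int) (hash >>> (64 - b));             // top b bits select a register
        long rest = hash << b;                           // remaining bits
        int rank = Long.numberOfLeadingZeros(rest) + 1;  // position of the first 1-bit
        if (rank > registers[idx]) {
            registers[idx] = (byte) rank;
        }
    }

    public double estimate() {
        double alpha = 0.7213 / (1 + 1.079 / m);         // bias constant, valid for m >= 128
        double sum = 0.0;
        for (byte r : registers) {
            sum += Math.pow(2.0, -r);
        }
        return alpha * m * m / sum;
    }

    private static long hash64(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(digest).getLong();    // first 8 bytes of the digest
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        HllSketch hll = new HllSketch(10);               // 1,024 registers
        for (int i = 0; i < 100_000; i++) {
            hll.add("user-" + (i % 20_000));             // 20,000 distinct values
        }
        System.out.printf("estimated distinct count: %.0f%n", hll.estimate());
    }
}
```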

The fundamental idea

The idea behind Kylin is not brand new. Many technologies over the past 30 years have used the same theory to accelerate analytics: store pre-calculated results to serve analysis queries, generate the cuboids at each level with every possible combination of dimensions, and compute all metrics at those levels.

For reference, here is the cuboid topology:

[Figure: cuboid topology]
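
To make the cuboid idea concrete, the following sketch enumerates every cuboid for a hypothetical set of four dimensions (the dimension names are invented for illustration). Each bitmask selects one combination of dimensions, mirroring the 2^n nodes of the lattice in the diagram above.

```java
import java.util.ArrayList;
import java.util.List;

/** Enumerates every cuboid (combination of dimensions) for a small dimension set. */
public class CuboidEnumerator {
    public static void main(String[] args) {
        String[] dims = {"time", "item", "location", "supplier"}; // hypothetical dimensions
        int n = dims.length;
        // Each bitmask selects one cuboid: the full mask is the base cuboid,
        // and mask 0 is the apex cuboid with every dimension aggregated away.
        for (int mask = (1 << n) - 1; mask >= 0; mask--) {
            List<String> cuboid = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) != 0) {
                    cuboid.add(dims[i]);
                }
            }
            System.out.println(cuboid.isEmpty()
                    ? "[apex: all dimensions aggregated]"
                    : cuboid.toString());
        }
    }
}
```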

As data grows, this pre-calculation becomes impractical – even with powerful hardware. However, with the benefit of Hadoop’s distributed computing power, the calculation jobs can be spread across hundreds or thousands of nodes. This allows Kylin to perform the calculations in parallel and merge the final results, significantly reducing the processing time.

From relational to key-value

As an example, suppose several records stored in Hive tables represent a relational structure. When the data volume grows very large – tens or even hundreds of billions of rows – a question like “how many units were sold in the technology category in 2010 on the US site?” requires a huge table scan and a long wait for the answer. Since the answer is the same every time the query is run, it makes sense to calculate and store those values for later use. This technique is called Relational to Key-Value (K-V) processing. The process generates all of the dimension combinations and the measured values shown at the right side of the diagram below. The middle columns of the diagram, from left to right, show how the data is calculated by leveraging MapReduce for the large-volume processing.

[Figure: relational to key-value processing]
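
Here is a hedged sketch of that idea in Java, using invented fact rows and column values: the fact rows are grouped once by a dimension combination, the aggregated measure is stored under a composite key, and the example question then becomes a single key lookup instead of a full table scan.

```java
import java.util.HashMap;
import java.util.Map;

/** Pre-aggregates fact rows into key-value pairs for one cuboid: (year, category, site) -> sum(units). */
public class RelationalToKeyValue {
    public static void main(String[] args) {
        // (year, category, site, unitsSold) -- hypothetical fact rows
        String[][] rows = {
            {"2010", "Technology", "US", "4"},
            {"2010", "Technology", "US", "6"},
            {"2010", "Fashion",    "US", "3"},
            {"2011", "Technology", "UK", "5"},
        };
        Map<String, Long> cube = new HashMap<>();
        for (String[] row : rows) {
            String key = row[0] + "|" + row[1] + "|" + row[2];   // dimension values form the key
            cube.merge(key, Long.parseLong(row[3]), Long::sum);  // the measure is the value
        }
        // "How many units were sold in Technology in 2010 on the US site?"
        // is now a key lookup against the pre-calculated result.
        System.out.println(cube.get("2010|Technology|US"));      // -> 10
    }
}
```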

Kylin is built on this theory and leverages the Hadoop ecosystem to do the job for huge volumes of data:

  1. Read data from Hive (which is stored on HDFS)
  2. Run MapReduce jobs to pre-calculate
  3. Store cube data in HBase (see the rowkey sketch after this list)
  4. Leverage Zookeeper for job coordination
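
As a hedged illustration of step 3, the sketch below shows one way a pre-aggregated record could be laid out as an HBase row: a cuboid id plus dictionary-encoded dimension values form the rowkey, and the measure is stored as the cell value. The codes are invented for illustration, and Kylin's actual storage format is richer than this.

```java
import java.nio.ByteBuffer;

/**
 * Hedged sketch of a cube record laid out as an HBase row:
 * rowkey = cuboid id + dictionary-encoded dimension values, value = measure bytes.
 */
public class RowKeySketch {
    public static void main(String[] args) {
        long cuboidId = 0b111;       // bitmask naming which dimensions this cuboid keeps
        int yearCode = 3;            // dictionary codes standing in for "2010", "Technology", "US"
        int categoryCode = 17;
        int siteCode = 1;
        long unitsSold = 10L;        // the pre-computed measure

        byte[] rowKey = ByteBuffer.allocate(8 + 4 + 4 + 4)
                .putLong(cuboidId).putInt(yearCode).putInt(categoryCode).putInt(siteCode)
                .array();
        byte[] value = ByteBuffer.allocate(8).putLong(unitsSold).array();

        System.out.println("rowkey: " + toHex(rowKey));
        System.out.println("value : " + toHex(value));
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte x : bytes) {
            sb.append(String.format("%02x", x));
        }
        return sb.toString();
    }
}
```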

Architecture

The following diagram shows the high-level architecture of Kylin.

[Figure: Kylin high-level architecture]

This diagram illustrates how relational data becomes key-value data through the Cube Build Engine’s offline process. The yellow lines illustrate the online analysis data flow. Data requests can originate from SQL submitted by a SQL-based tool, or from third-party applications via Kylin’s RESTful services. The RESTful services call the Query Engine, which determines whether the target dataset exists. If it does, the engine directly accesses the target data and returns the result with sub-second latency. Otherwise, it routes the query to a SQL-on-Hadoop engine on the Hadoop cluster, such as Hive.
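
A hedged sketch of that routing decision is shown below; the interfaces and names are illustrative stand-ins, not Kylin's actual API.

```java
/**
 * Illustrative sketch of the routing described above: serve a query from a
 * pre-built cube when one covers it, otherwise fall back to SQL on Hadoop (e.g. Hive).
 */
public class QueryRouter {
    interface CubeStore { boolean covers(String sql); String query(String sql); }
    interface HiveStore { String query(String sql); }

    private final CubeStore cubes;
    private final HiveStore hive;

    QueryRouter(CubeStore cubes, HiveStore hive) {
        this.cubes = cubes;
        this.hive = hive;
    }

    String execute(String sql) {
        if (cubes.covers(sql)) {
            return cubes.query(sql);   // matching dataset: answer from the cube in HBase
        }
        return hive.query(sql);        // non-matching dataset: route to SQL on Hadoop
    }

    public static void main(String[] args) {
        QueryRouter router = new QueryRouter(
            new CubeStore() {
                public boolean covers(String sql) { return sql.contains("fact_sales"); }
                public String query(String sql)  { return "answered from cube (sub-second)"; }
            },
            sql -> "answered by Hive (full scan)");
        System.out.println(router.execute("SELECT SUM(units) FROM fact_sales"));
        System.out.println(router.execute("SELECT * FROM raw_clicks"));
    }
}
```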

The Kylin platform is made up of several components; its SQL layer, in particular, is worth describing here.

In Kylin, we leverage an open-source dynamic data management framework called Apache Calcite to parse SQL and to plug in our own code. The Calcite architecture is illustrated below. (Calcite, written by Julian Hyde, was previously called Optiq and is now an Apache Incubator project.)

[Figure: Apache Calcite architecture]
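
As a small, hedged example of this parsing layer (assuming the calcite-core dependency is on the classpath; the table and column names are made up), Calcite’s SqlParser turns a SQL string into an AST that a query planner can then work with:

```java
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.parser.SqlParseException;
import org.apache.calcite.sql.parser.SqlParser;

/** Parses a SQL query into a Calcite AST. Kylin's own planning rules plug in above this layer. */
public class CalciteParseExample {
    public static void main(String[] args) throws SqlParseException {
        String sql = "SELECT category, SUM(units_sold) FROM fact_sales "
                   + "WHERE year = 2010 GROUP BY category";
        SqlParser parser = SqlParser.create(sql);
        SqlNode ast = parser.parseQuery();
        System.out.println(ast.getKind());   // SELECT
        System.out.println(ast.toString());  // re-emits the parsed query
    }
}
```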

Kylin usage at eBay

At the time of open-sourcing Kylin, several eBay business units were already using it in production. Our largest use case analyzes 12+ billion source records, generating 14+ TB of cubes, with a 90th-percentile query latency of less than 5 seconds. Our use cases target analysts and business users, who can easily access analytics and get results through a Tableau dashboard – no more Hive queries, shell commands, and so on.

What’s next

Open source

Kylin has already been open-sourced to the community. To develop and grow an even stronger ecosystem around Kylin, we are currently working on proposing Kylin as an Apache Incubator project. With distinguished sponsors from the Hadoop developer community supporting Kylin, such as Owen O'Malley (Hortonworks co-founder and Apache member) and Julian Hyde (original author of Apache Calcite, also with Hortonworks), we believe that the greater open-source community can take Kylin to the next level.

We welcome everyone to contribute to Kylin. Please visit the Kylin web site for more details: http://kylin.io.

To begin with, we are looking for open-source contributions not only in the core code base, but also in the following areas:

  1. Shell Client
  2. RPC Server
  3. Job Scheduler
  4. Tools

For more details and to discuss these topics further, please follow us on Twitter @KylinOLAP and join our Google group: https://groups.google.com/forum/#!forum/kylin-olap

Summary

Kylin has been deployed in production at eBay and is processing extremely large datasets. The platform has demonstrated great performance benefits and has proven to give analysts a more convenient way to leverage data on Hadoop using their favorite tools. We are pleased to open-source Kylin. We welcome feedback and suggestions, and we look forward to the involvement of the open-source community.

Author: eBay Tech Blog (Where e-commerce meets world-class technology)
Original article: Announcing Kylin: Extreme OLAP Engine for Big Data