How to get started with Apache Druid

Apache Druid is a powerful, scalable Online Analytical Processing (OLAP) database system. However, getting started with it can be a little intimidating. Here are some suggestions on how to begin.


What is Apache Druid good for?

Apache Druid has several key advantages as an OLAP solution:

It is excellent for dealing with event-like data with timestamps
It has SQL-like query syntax, making for easier adoption
It can ingest both data at rest (from Parquet, CSV, JSON, Avro and ORC) and streaming data (from Apache Kafka or Amazon Kinesis), even as part of one dataset. This allows you to merge historic data with real-time analytics easily
It is based around a set of microservices, making it highly scalable and resilient
Its flat table structure, column orientation and optimised indexing give it high performance and low latency
It is a “batteries included” solution which includes a user-friendly web front end and an API alongside its highly performant database
It supports Apache DataSketches, allowing for extremely rapid approximate summarisation
It has excellent documentation
It is open source, maintained by the Apache Software Foundation
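
To give a flavour of the SQL-like syntax and the API mentioned above, here is a minimal Python sketch that prepares a query for Druid's SQL endpoint (`/druid/v2/sql`). Several things here are assumptions rather than anything from this article: the address `localhost:8888` (the usual Router port in a local install), the `wikipedia` datasource and its `channel` and `user` columns (borrowed from the Druid tutorial data), and the helper names. The `APPROX_COUNT_DISTINCT_DS_HLL` function only works if the DataSketches extension is loaded.

```python
import json
import urllib.request

# Assumed address of a local Druid Router; adjust for your own cluster.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"

# Hourly event counts per channel, plus an approximate distinct-user count
# computed with a DataSketches HLL sketch. Datasource and columns are assumed.
QUERY = """
SELECT
  TIME_FLOOR(__time, 'PT1H') AS hour_bucket,
  channel,
  COUNT(*) AS events,
  APPROX_COUNT_DISTINCT_DS_HLL("user") AS approx_users
FROM wikipedia
GROUP BY 1, 2
ORDER BY events DESC
LIMIT 10
"""


def build_sql_request(query: str, url: str = DRUID_SQL_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for Druid's SQL API."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(query: str, url: str = DRUID_SQL_URL) -> list:
    """Send the query to a running cluster and return the decoded JSON rows."""
    with urllib.request.urlopen(build_sql_request(query, url)) as response:
        return json.loads(response.read())
```

With a cluster running and the tutorial data loaded, `run_query(QUERY)` should return one dictionary per result row. The same query can also be pasted straight into the query view of the web console.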

What are Apache Druid’s limitations?

The key limitation of Druid is that written segments are designed to be immutable. This allows for indexing optimisation, but it means that Druid is entirely unsuitable for transactional processing. It is also unsuitable for mutable data, since entire segments would continually need to be deleted and replaced. Another minor quibble about data ingestion is that while Parquet is supported, Databricks Delta files are not.

Druid is at its strongest if you design your data schema so that you can retrieve the information you want with simple single-table queries. While it does support common table expressions, joins and many other SQL features, these forms of processing tend to come at a performance penalty. Also, because the intermediate results need to be stored in memory, you will likely need a more powerful Druid cluster to process such queries than you would with a flatter structure.

Druid favours flat, denormalised data, which can mean it has a higher storage requirement than other solutions. Also, its schema needs to be planned in advance, unlike a document database. Druid schemas can be modified, but this will require reprocessing of existing data.
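
As an illustration of what "flat and denormalised" means in practice, here is a small Python sketch that copies attributes from a lookup table onto each event before ingestion, so that no join is needed at query time. The product table, field names and the `denormalise` helper are all hypothetical and exist purely for the example.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical dimension data that would sit in its own table
# in a normalised schema.
PRODUCTS: Dict[str, Dict[str, str]] = {
    "p1": {"product_name": "Kettle", "category": "Kitchen"},
    "p2": {"product_name": "Desk lamp", "category": "Lighting"},
}


def denormalise(
    events: Iterable[dict], products: Dict[str, Dict[str, str]]
) -> Iterator[dict]:
    """Copy product attributes onto each event to produce flat rows."""
    for event in events:
        # Unknown product ids simply leave the event unchanged.
        attributes = products.get(event["product_id"], {})
        # The event's own fields take priority if a key appears in both.
        yield {**attributes, **event}


events = [
    {"timestamp": "2024-01-01T10:00:00Z", "product_id": "p1", "quantity": 2},
    {"timestamp": "2024-01-01T10:05:00Z", "product_id": "p2", "quantity": 1},
]

flat_rows = list(denormalise(events, PRODUCTS))
```

Each row in `flat_rows` now repeats the product name and category. That repetition is the source of the extra storage cost, and it is also what lets a simple single-table query answer questions about categories.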

For more complex data there are other solutions which might be more performant, such as ClickHouse. However, ClickHouse has been moving from an open source model to a hybrid open core model, which may be off-putting if you wish to deploy independently. Another open source possibility is StarRocks, a new, highly performant OLAP solution.

The best way to get started with Apache Druid

Apache Druid can be a little tricky to get to grips with. For a start, Druid has quite a complex architecture. There are five core services: the Overlord, Coordinator, Historical, MiddleManager, and Broker. In addition there is an optional Router. It also uses an external file store (generally blob storage), a metadata store (PostgreSQL, MySQL or Derby) and Apache ZooKeeper for coordination. This is a lot of moving parts to get started with!
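
One way to keep track of these moving parts is to poll each service's health endpoint. The sketch below is a rough illustration rather than a recommended monitoring setup. It assumes every service runs on `localhost` with the default ports listed in the Druid documentation, and it uses the `/status/health` endpoint that Druid processes expose. The helper names are made up for this example. Some single-server configurations run the Coordinator and Overlord in one process, so your ports may differ.

```python
import urllib.error
import urllib.request

# Default ports per the Druid documentation; your deployment may differ.
DEFAULT_PORTS = {
    "coordinator": 8081,
    "broker": 8082,
    "historical": 8083,
    "overlord": 8090,
    "middlemanager": 8091,
    "router": 8888,
}


def health_url(service: str, host: str = "localhost") -> str:
    """Return the health-check URL for one Druid service."""
    return f"http://{host}:{DEFAULT_PORTS[service]}/status/health"


def is_healthy(service: str, host: str = "localhost", timeout: float = 2.0) -> bool:
    """Return True if the service answers its health check with `true`."""
    try:
        with urllib.request.urlopen(health_url(service, host), timeout=timeout) as response:
            return response.read().strip() == b"true"
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure and similar all count as down.
        return False


def cluster_status(host: str = "localhost") -> dict:
    """Check every known service and map its name to a healthy flag."""
    return {service: is_healthy(service, host) for service in DEFAULT_PORTS}
```

Calling `cluster_status()` against a running local cluster gives a quick view of which services have come up and which have not.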


Fortunately, there is an excellent set of training materials available from a company called Imply, located at https://learn.imply.io/. These training materials cover the basics of setup, data ingestion (both bulk and streaming), data modeling, and use of Druid’s built-in metrics and logging. Best of all, they provide a containerised training environment with a data generator and a Kafka streaming service. You can find this at this GitHub repo. This allows you to get practical hands-on experience. Better still, Imply provide a comprehensive set of workbooks illustrating common use cases. You can access all of this for free, with certifications supported by Skilljar. The folks at Imply have put considerable thought into this training material. It is a great way to get up to speed with Apache Druid.

Published by justinmatters
