Most people shouldn't. Clickhouse or other column-store data warehouses (redshif...

DevKoala · on April 15, 2020

You are right about Clickhouse, but other data warehouses are not optimized for the same use case of Druid and Clickhouse, OLAP.

For example, RedShift and BigQuery cannot be put behind a user facing backend. BigQuery has a default limit of 50 concurrent queries, if that's your user limit, perfect. RedShift takes seconds for queries that Druid and Clickhouse can answer in milliseconds.

manigandham · on April 15, 2020

All data warehouses are designed for OLAP, that's their purpose. It doesn't require low latency though.

Redshift is an always running cluster of scale-out distributed postgres forked by AWS so it can and does return results in milliseconds, very similar to Clickhouse although still not as advanced in performance techniques.

Bigquery is a completely managed model that uses far greater scale-out architecture designed for througput (petabytes in seconds) rather than latency, although it has real-time streaming, BI Engine (memory cache) and materialized views so you can get pretty close today.

Snowflake is another option that runs on top of major clouds using instances to access object storage and also has low latency when your cluster is running.

DevKoala · on April 15, 2020

Not all data warehouses are designed for OLAP. As the performance of these systems increase, it is possible to satisfy OLAP use cases with some warehouses such a BQ, but they are still not ideal as your number of concurrent queries scale and your response time requirements remain constant. The BQ team certainly doesn't consider their product an OLAP solution, neither the RedShift team. AWS is actually working on their own OLAP solution, Timestream; that is how they pitch it, as their OLAP solution.

BigQuery's petabyte aggregations in seconds is a false claim. I just launched a query to count the number of records of a 1.4 PB dataset and it has taken over a minute to just schedule the query, I have 1000+ compute units available too. I have been able to perform a similar query in seconds on top of Druid; the cluster had a run rate of $25k/month at the time, while I believe we pay 40k+/month just on our BQ deal. Maybe under ideal conditions BigQuery can run such a query, but I have never seen it do such things.

I have experience with Snowflake, it is great if you are in AWS, but I wouldn't choose it over BigQuery. I feel that Snowflake's strength is in their data sharing protocol, which allows us to tap into the datasets of partners without major data transfer costs. We use it for that reason only.

manigandham · on April 15, 2020

You're using a strange definition of OLAP here when it's really not tied to performance or concurrency. It only means you can send an ad-hoc query and get back results directly in response, rather than waiting on pre-computed or batched processing happening somewhere else.

Redshift and BigQuery are both OLAP products and their documentation and whitepapers state the same. The performance profiles of both are very different but that's due to the architecture. Redshift is an always running cluster of EC2 instances running their forked scale-out Postgres. BigQuery is an entirely different engine based on Dremel which focuses on parallelization as much as possible along with Google's proprietary datacenters that have petabits of internal bandwidth to support it. AWS Timestream is an entirely different product for time-series specific data, and basically

The only reason Druid is faster is because it pre-aggregates and indexes data. You can aggregate tables yourself in Redshift and use the right primary keys to get most of the same performance. BigQuery just launched materialized views and is working on scheduling improvements and in-memory caching as well. Snowflake now runs in Azure and GCP and can do sub-second responses when the warehouse is running and data is cached or the tables are smaller.

DevKoala · on April 16, 2020

I wasn’t able to achieve the same performance with RedShift aggregates, I tried that first before I decided to migrate from RedShift to Druid back in 2014. We deal with dozens of dimensions per event and no combination of distribution keys in Redshift was able to give up the same performance over arbitrary scans+ aggregations.

Druid is not only fast because it pre-aggregates, but the memory structure is designed for scans over hundreds of dimensions.

Materializing views in BigQuery is just one DAG task. Unless you don’t have something like Airflow on your stack, I don’t see how it is worth mentioning. We are talking about denormalized data, time series data.

I am speaking from experience with each one of these products. Perhaps I did it all wrong, but we certainly achieved the objectives we were after.

esprehn · on April 15, 2020

I think the limit on BigQuery is 100 concurrent queries. Also you can use batch mode if you're willing to wait a little bit.

https://cloud.google.com/bigquery/quotas#query_jobs

DevKoala · on April 15, 2020

We have negotiated a flat rate deal with Google and have access to a higher concurrent query limit than 100, and even then, it is not the right tool for my use case. BigQuery is also far more expensive.

Our users expect responses in the milliseconds. The user experience is seriously degraded when you approach tens of seconds or batch queries.

leetbulb · on April 14, 2020

1) Most people handling large data streams are already using Kafka or similar. If you aren't, Druid's has pretty wide support for event ingress, including HTTP.

2) If you're looking to provide this type of analytics in the first place, you probably do want this. Being able to execute extremely fast, granular, ad-hoc queries over highly dimensional data is very powerful. I designed a reporting frontend that really took advantage of this and I always felt guilty when people would complimented me how fast it drilled down in to 10+ dimensions.

3) There are plenty of mature libraries for the Druid API.

4) This is almost inherent with any OLAP system. Although, even with high-cardinality data, Druid performs extremely well. Either way, you should have a backing data warehouse and offline, batch jobs if you need to perform BI analytics on row-by-row / non-aggregated data. Remember, OLAP sits on top of a data warehouse.

5) Yes, but technically you just need a key that increments.

6) Not sure if this is an argument against Druid. I've found the S3 deep storage / tiering to be very efficient and powerful, especially because you can create and push segments directly to storage and not run it through an indexer. S3 is also just an object store protocol specification now. There are lots of people who run S3-compatible object stores in-house. HDFS is also natively supported and another widely used storage backend in this space. Also, there are plenty of community extensions for other object stores.

7) Again, OLAP. Data flowing in to your OLAP layer should already be denormalized and ready for indexing. Also, you can join data in to your existing indices with Hadoop, etc. Druid support joins and lookups, although I've never used them. ClickHouse and other similar systems also don't do very well with joins. Maybe we have different definitions of "complex analysis," but in my experience, you can do some pretty crazy stuff with queries including writing your own JS functions, and if you're really dedicated, you can write your own extensions.

One thing that I feel like a lot of people miss is that Druid is specifically an OLAP layer designed for large scale, and I mean beyond Netflix size scale (they use Druid). Every individual component of Druid is designed to scale out independently and play nicely with your existing infrastructure. Similar to ES, you have nodes that have specific roles such as event ingress / real-time compute, query handlers, query brokers, historical data / cache / compute, etc. Then you also have a bunch of supporting architecture provided for you for (re)indexing data, coordinating the entire cluster, etc. Druid is huge, not an AIO (OLAP, OLTP, DWH, etc) analytics solution, and it takes more than one person to run a larger cluster, even though I did it for a few years.

manigandham · on April 14, 2020

Relational databases with joins and full SQL support are still unmatched in flexibility, and functionality like materialized views, aggregation pipelines (and table engines for Clickhouse) allows you to do everything that Druid does with aggregated summaries while still having all the other querying abilities.

Druid has a slight edge in data scale-out and indexed seeks but modern data warehouses are adding similar tiering features, along with with field indexing, full-text search, nested data records, and even in-memory rowstores for OLTP support.

They're all converging on the same feature set and eventually Druid will just become another data warehouse option, although I'd still recommend Clickhouse or MemSQL at that point.

sanderjd · on April 15, 2020

Question for everyone in this thread: what resources do you use to follow this stuff? Are there blogs or conferences or ... that I should follow if I'm interested in this topic?

manigandham · on April 15, 2020

Distributed SQL data warehouses? Start with the papers:

- C-Store (column-oriented database storage and processing): http://db.lcs.mit.edu/projects/cstore/vldb.pdf and https://sookocheff.com/post/databases/c-store/

- BigQuery (the original Dremel-based query engine): https://cloud.google.com/files/BigQueryTechnicalWP.pdf

- Snowflake paper: https://www.snowflake.com/resource/sigmod-2016-paper-snowfla...

- Andy Pavlo's CMU distributed OLAP databases lecture: https://www.youtube.com/watch?v=dPMc7FZ3Gqo&t=1s

- MemSQL architecture: https://www.youtube.com/watch?v=HoompXCG5Mk

Start there and just read more about architecture of the various products. You'll learn the basics of distributed nodes that store data in column-oriented tables partitioned by a key and how to run queries over them using various performance techniques. Then you can dive into the unique differences of the various databases.

sanderjd · on April 15, 2020

No sorry, I didn't mean where to read about how these things are designed (though these are useful and interesting links for that too, thanks!). What I'm asking about is what to follow (blogs, conferences, etc.) to keep up with new developments in the space: new products, experiences with them, reviews of their trade offs in practical use, that kind of thing. But I do appreciate the response!

manigandham · on April 16, 2020

HN is a great resource with plenty of important news posted, along with the HighScalability blog [1] and SoftwareEngineeringDaily podcast [2].

1. http://highscalability.com/

2. https://softwareengineeringdaily.com/

Other than that, reading the blogs of the various vendors is how I keep up (using Feedly with RSS). The modern projects are Redshift, BigQuery, Snowflake, Azure DW, MemSQL, Clickhouse, YellowBrick with older projects being Vertica, Teradata, Greenplum. It's also useful to follow the "new" distributed SQL projects like CockroachDB, Citus, TiDB, Vitess, Yugabyte.

sanderjd · on April 16, 2020

Thank you this is super useful to me!

polskibus · on April 28, 2020

I recommend watching Andy Pavlo's courses:

https://15445.courses.cs.cmu.edu/fall2019/

https://15721.courses.cs.cmu.edu/spring2020/

qiuyesuifeng · on April 16, 2020

An architecture of having both row-based database and column-based data might also be interesting to explore. This post explains how they work together. https://pingcap.com/blog/delivering-real-time-analytics-and-...

shaklee3 · on April 15, 2020

Clickhouse has had native Kafka ingest support for over a year.