Hacker News

People who've used Clickhouse or other OLAP databases in production & at scale, how do you "interconnect" it with relational data?

I'm currently experimenting with Clickhouse, because my dataflow is increasing in size (40M rows right now, doubling every month or so) and my current setup (MongoDB) is at its limits. I would like to migrate the 40M rows to CH, but I also need the metadata for the rows to be in something more robust like Postgres or Mongo. Would you have a microservice that does the queries between the OLAP and relational DBs and does the join manually, exposing that as some low-level API? Or is using the various FDW options (remote tables in Clickhouse, Postgres clickhouse_fdw, etc.) realistic in production?

Sorry for the ramble, it's late here. But I can clarify if necessary.



Really depends what you're trying to do with your OLAP database.

If you're using it purely for reporting purposes then you really don't want to interconnect it with your OLTP database to support real-time queries. Reason number one is that you don't want some unexpected analytics workload to suddenly impact your production Postgres. Reason number two is that your analytics data model and your OLTP data model are frequently different. Usually your OLAP model needs to know what the value of a given dimension was at the time the event occurred, but if you're linking directly to Postgres then you can only see what the values are right now.

You can also go the other direction if you need the current values of OLTP data joined with your Clickhouse data: schedule an ETL out of Clickhouse and back into your OLTP database and aggregate the data to a reasonable level such that Postgres can handle it without a problem.

What's your actual use case? I'd normally consider a requirement to join OLAP & OLTP data in real time to be a "design smell". I don't mean that there's no value in things like fdw or easier ways to move data around, but you should consider using it to help with the ETL process and not as a real-time interconnect. Keep OLTP & OLAP workloads separate and both of your DBs will be happier.


Alright, here is an example. My data is a stream of events, each row has the values (event_id, person_id, date). This is the table with 40M rows, with inserts between 2/s and 10/s.

person_id is a foreign ID for a "person table" that fits well in the relational model. A "person" has various attributes (name, DOB, email) as well as a one-to-many relation to "groups". Groups are a collection of users with additional attributes (e.g. group name).

Now, what if I want to answer questions like:

   - how many total events for all persons in group X
   - who are the top 10 users in number of events in group X
   - which are the top 10 groups in number of events
In this case, the person/group tables are part of the core business logic; they aren't specific to the events table, so it doesn't make sense to store them in Clickhouse. Also, this person/group data gets updated infrequently, but staleness should be kept to a minimum (changes visible within < 30 secs).

The simple approach to the first question would be

   - Get all the user IDs in group X
   - Filter events by those user IDs
But what if there are tens of thousands of users in group X? And hundreds of groups? Are megabyte-long queries supported in Clickhouse?
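To make the naive approach concrete: it doesn't have to mean inlining every ID into the query text. A sketch of question 1, assuming the group membership has been mirrored into a ClickHouse table (all table/column names here are hypothetical):

```sql
-- Hypothetical schema: events(event_id, person_id, date) in ClickHouse,
-- person_group(person_id, group_id) mirrored from Postgres.
-- Instead of inlining tens of thousands of IDs into the SQL text,
-- use a subquery over the mirrored membership table:
SELECT count() AS total_events
FROM events
WHERE person_id IN
(
    SELECT person_id
    FROM person_group
    WHERE group_id = 42
);
```

ClickHouse does accept very large IN lists, but a subquery over a local (or remote-engine) table avoids shipping megabytes of SQL per request.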


Some things to think about

1. Are the questions you're asking completely ad hoc? Or can you mostly define them ahead of time? If it's the former then you should be looking at getting your OLTP data into Clickhouse. If it's the latter then you should be looking to aggregate the data to various levels and get it out of Clickhouse.

All three of your sample questions lend themselves quite nicely to pre-aggregation. I'm sure your actual questions are more complex, but what I'd do to address all three of your examples is every night I'd roll up the raw events into (person_id, date, event_count) and send it back over to Postgres. Then every week you roll up the previous seven days into (person_id, week, event_count). Each month you roll the weeks up and each year you roll the months up. If you need the data more frequently than daily then you can go down to hourly or whatever it is you need.
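A sketch of that nightly/weekly roll-up in ClickHouse SQL (table and column names are made up to match the example):

```sql
-- Nightly: roll yesterday's raw events up to one row per (person_id, date).
INSERT INTO daily_event_counts
SELECT person_id, date, count() AS event_count
FROM events
WHERE date = yesterday()
GROUP BY person_id, date;

-- Weekly: roll the previous seven daily rows up per person.
INSERT INTO weekly_event_counts
SELECT person_id, toStartOfWeek(date) AS week, sum(event_count) AS event_count
FROM daily_event_counts
WHERE date >= today() - 7 AND date < today()
GROUP BY person_id, week;
```

The resulting aggregate tables are what your ETL job then ships back over to Postgres.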

Now you've got your data back into Postgres but at a reasonable granularity. Depending on the cardinality of the user-group relationship you might have to do some magic to pre-aggregate that if the join is big as well, which could turn into a challenge as group membership changes (you'd need to re-aggregate all your group metrics any time group membership changed) but it's still better than trying to join across the Clickhouse/Postgres boundary.

If you really do need to support totally ad hoc questions all the time then you should figure out how to get your Postgres data into Clickhouse. If the data really gets updated infrequently then it shouldn't be a problem to get changes in user/group membership into Clickhouse quickly, then you can do all your joins and analysis completely in Clickhouse.

2. Do you really want the current group memberships? Or do you want the group memberships at the time the event occurred? It's a subtle difference and there's not usually one right answer (or the answer is "I need both").


If you have something like Kafka in front of your CH inserts, you could consider enriching the event data by joining it with your Postgres data before it lands in CH. That way you still retain a flat CH table, which should work just fine.


Take a look at query engines like Trino (formerly PrestoSQL) [https://trino.io/]. (Disclaimer: I'm a contributor to Trino).

I used it at a previous job to combine data from MongoDB, Kafka, S3 and Postgres to great effect. It tries to push down as many operations as possible to the source to improve performance.

Full ANSI SQL support over a number of backends (Kafka, Cassandra, Postgres, ClickHouse, S3 and many more).

The best part is it has a plugin ecosystem, so you can very easily implement your own connectors: all the heavy lifting gets done by the core engine while your plugin only has to map your backend onto concepts the engine understands.


The problem with Trino is that it is not that easy to scale it to the possible RPS of Clickhouse, it introduces a ton of latency, and pushdowns are far from perfect. Uber has a smart solution for Pinot, where they run it as single-node proxies.


I wouldn't use Trino if you are looking to ONLY query a single database like Clickhouse or Postgres etc (unless you want an ANSI SQL abstraction over your choice of database). Obviously ClickHouse and Postgres will have lower latency when hit directly because you can bypass the analysis, planning, optimization and scheduling that Trino does.

It does federation better than ClickHouse and that's where it shines. Joins across disparate systems - even between relational and non-relational systems. And obviously for the MPP queries on distributed filesystems.


I know. I worked on Presto :)


Offtopic, but I didn't realize there are prestosql (now trino) [1] and prestodb [2]. I was only aware of the prestodb side. I love the paper [3], which was apparently written before they left FB to form the new company.

  [1] https://trino.io/blog/2020/12/27/announcing-trino.html
  [2] https://prestodb.io/
  [3] https://trino.io/Presto_SQL_on_Everything.pdf


Came here to say the same great things about Trino


ClickHouse can select and insert directly from/to remote MySQL [1] and PostgreSQL [2] tables. See MySQL and PostgreSQL database engines. It's a common way to access mutable dimension data as well as to pull data into ClickHouse for analysis.

I have not used PostgreSQL myself but the MySQL database engine works great. In some cases queries from ClickHouse to MySQL run faster than they do on MySQL itself. There are other engines as well, e.g., MongoDB.

[1] https://clickhouse.tech/docs/en/engines/table-engines/integr...

[2] https://clickhouse.tech/docs/en/engines/table-engines/integr...
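To make the table-engine approach concrete, a minimal sketch (connection details, table and column names here are all hypothetical):

```sql
-- A ClickHouse table backed by a live MySQL table; reads hit MySQL at query time.
CREATE TABLE mysql_persons
(
    person_id UInt64,
    name String
)
ENGINE = MySQL('mysql-host:3306', 'mydb', 'persons', 'user', 'password');

-- Join mutable dimension data against the local events table:
SELECT p.name, count() AS events
FROM events AS e
INNER JOIN mysql_persons AS p ON p.person_id = e.person_id
GROUP BY p.name;
```

Because the remote table is read at query time, you always see the current dimension values, at the cost of a round trip to MySQL per query.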


There is a new ClickHouse database engine: "PostgreSQL".

"Allows to connect to databases on a remote PostgreSQL server. Supports read and write operations (SELECT and INSERT queries) to exchange data between ClickHouse and PostgreSQL."

https://clickhouse.tech/docs/en/engines/database-engines/pos...
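A sketch of what that looks like for the person/group example (connection details and names hypothetical; the database engine exposes all tables of the remote schema):

```sql
-- All tables in the remote Postgres database become queryable under pg_meta.
CREATE DATABASE pg_meta
ENGINE = PostgreSQL('pg-host:5432', 'mydb', 'user', 'password');

-- Top 10 groups by event count, joining live Postgres dimension tables:
SELECT g.name, count() AS events
FROM events AS e
INNER JOIN pg_meta.person_group AS m ON m.person_id = e.person_id
INNER JOIN pg_meta.groups AS g ON g.group_id = m.group_id
GROUP BY g.name
ORDER BY events DESC
LIMIT 10;
```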


> I also need the metadata for the rows to be in something more robust like Postgres or Mongo.

How big is that metadata? If it's less than tens of millions of rows, you can look at external dictionaries support in ClickHouse.

https://clickhouse.tech/docs/en/sql-reference/dictionaries/e...

https://altinity.com/blog/2020/5/19/clickhouse-dictionaries-...
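A sketch of a dictionary backed by a Postgres table, per the DDL syntax in the linked docs (names and connection details are hypothetical):

```sql
-- In-memory dictionary refreshed from Postgres on an interval.
CREATE DICTIONARY person_dict
(
    person_id UInt64,
    name String
)
PRIMARY KEY person_id
SOURCE(POSTGRESQL(host 'pg-host' port 5432 db 'mydb' table 'persons'
                  user 'user' password 'password'))
LIFETIME(MIN 10 MAX 30)  -- re-pull every 10-30s, matching a <30s freshness need
LAYOUT(HASHED());

-- Cheap in-memory lookup at query time, no join needed:
SELECT dictGet('person_dict', 'name', person_id) AS name,
       count() AS events
FROM events
GROUP BY name
ORDER BY events DESC
LIMIT 10;
```

Dictionaries keep the metadata resident in RAM on the ClickHouse side, which is why the size of the metadata is the main constraint.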


I ran across this recently: https://eng.uber.com/logging/

Uber's use of clickhouse to handle semi-structured data from logging.


Don't know about CH, but with Apache Druid you can "join" with an RDBMS query via a LOOKUP SQL function.


How about using CDC with Apache Kafka?



