People who've used Clickhouse or other OLAP databases in production & at scale, how do you "interconnect" it with relational data?
I'm currently experimenting with Clickhouse, because my dataflow is increasing in size (40M rows right now, doubling every month or so) and my current setup (MongoDB) is at its limits. I would like to migrate the 40M rows to CH, but I also need the metadata for the rows to be in something more robust like Postgres or Mongo. Would you have a microservice that does the queries between the OLAP and relational DBs and does the join manually, exposing that as some low-level API? Or is using the various FDW options (remote tables in Clickhouse, Postgres clickhouse_fdw, etc.) realistic in production?
Sorry for the ramble, it's late here. But I can clarify if necessary.
Really depends what you're trying to do with your OLAP database.
If you're using it purely for reporting purposes then you really don't want to interconnect it with your OLTP database to support real-time queries. Reason number one is that you don't want some unexpected analytics workload to suddenly impact your production Postgres. Reason number two is that your analytics data model and your OLTP data model are frequently different. Usually your OLAP model needs to know what the value of a given dimension was at the time the event occurred, but if you're linking directly to Postgres then you can only see what the values are right now.
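To make the point-in-time issue concrete: one common pattern is to denormalize the dimension value into the fact row at insert time, instead of joining live. A sketch with made-up names (not a prescription):

```sql
CREATE TABLE events
(
    event_time    DateTime,
    user_id       UInt64,
    plan_at_event String   -- the user's plan *when the event fired*,
                           -- not whatever it happens to be today
)
ENGINE = MergeTree
ORDER BY (user_id, event_time);
```

A live join back to Postgres can only ever give you the current plan; the snapshot column preserves history.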
You can also go the other direction if you need the current values of OLTP data joined with your Clickhouse data: schedule an ETL out of Clickhouse and back into your OLTP database, and aggregate the data to a reasonable level such that Postgres can handle it without a problem.
What's your actual use case? I'd normally consider a requirement to join OLAP & OLTP data in real time to be a "design smell". I don't mean that there's no value in things like fdw or easier ways to move data around, but you should consider using it to help with the ETL process and not as a real-time interconnect. Keep OLTP & OLAP workloads separate and both of your DBs will be happier.
Alright, here is an example. My data is a stream of events, each row has the values (event_id, person_id, date). This is the table with 40M rows, with inserts between 2/s and 10/s.
person_id is a foreign ID for a "person table" that fits well in the relational model. A "person" has various attributes (name, DOB, email) as well as a one-to-many relation to "groups". Groups are a collection of users with additional attributes (e.g. group name).
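For reference, the events table described above could be laid out in ClickHouse roughly like this (the column types are assumptions, since only the names were given):

```sql
CREATE TABLE events
(
    event_id  UInt64,
    person_id UInt64,
    date      DateTime
)
ENGINE = MergeTree
ORDER BY (person_id, date);  -- sorting by person_id keeps per-person scans cheap
```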
Now, what if I want to answer questions like:
- how many total events for all persons in group X
- who are the top 10 users in number of events in group X
- which are the top 10 groups in number of events
In this case, the person/group tables are part of the core business logic; they aren't specific to the events table, so it doesn't make sense to store them in Clickhouse. Also, this person/group data is updated sparsely, but staleness should be kept to a minimum (< 30 secs).
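One ClickHouse-side option that fits that freshness budget, for what it's worth: an external dictionary backed by the Postgres table and refreshed on an interval. A sketch (all connection details and names here are made up):

```sql
CREATE DICTIONARY person_dict
(
    person_id UInt64,
    group_id  UInt32
)
PRIMARY KEY person_id
SOURCE(POSTGRESQL(
    port 5432
    host 'pg-host'
    user 'pg_user'
    password 'pg_password'
    db 'mydb'
    table 'person'
))
LAYOUT(HASHED())
LIFETIME(MIN 0 MAX 30);

-- Then enrich/filter at query time:
SELECT dictGet('person_dict', 'group_id', person_id) AS group_id, count()
FROM events
GROUP BY group_id;
```

ClickHouse re-polls the source within the LIFETIME window, so the dictionary stays at most ~30s stale while person/group remain owned by Postgres.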
The simple approach to the first question would be
- Get all the user IDs in group X
- Filter events by those user IDs
But what if there are tens of thousands of users in group X? And hundreds of groups? Are megabyte-long queries supported in Clickhouse?
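You generally don't need to inline tens of thousands of IDs into the query text. If the person/group tables are reachable from ClickHouse (e.g. via the PostgreSQL database engine), the ID list can stay server-side as a subquery (names illustrative):

```sql
SELECT count() AS total_events
FROM events
WHERE person_id IN
(
    SELECT person_id
    FROM pg.person_group   -- a table in a PostgreSQL-engine database
    WHERE group_id = 42
);
```

(For what it's worth, ClickHouse's default max_query_size is 256 KiB and configurable, so the literal-list approach doesn't hit a hard wall immediately; it's just ugly and slow to parse.)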
1. Are the questions you're asking completely ad hoc? Or can you mostly define them ahead of time? If it's the former then you should be looking at getting your OLTP data into Clickhouse. If it's the latter then you should be looking to aggregate the data to various levels and get it out of Clickhouse.
All three of your sample questions lend themselves quite nicely to pre-aggregation. I'm sure your actual questions are more complex, but what I'd do to address all three of your examples is every night I'd roll up the raw events into (person_id, date, event_count) and send it back over to Postgres. Then every week you roll up the previous seven days into (person_id, week, event_count). Each month you roll the weeks up and each year you roll the months up. If you need the data more frequently than daily then you can go down to hourly or whatever it is you need.
Now you've got your data back in Postgres but at a reasonable granularity. Depending on the cardinality of the user-group relationship you might have to do some magic to pre-aggregate that if the join is big as well, which could turn into a challenge as group membership changes (you'd need to re-aggregate all your group metrics any time group membership changed), but it's still better than trying to join across the Clickhouse/Postgres boundary.
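The nightly roll-up plus write-back can even be a single ClickHouse statement if the Postgres target is exposed as a PostgreSQL-engine table (database/table names here are made up):

```sql
INSERT INTO pg.daily_event_counts   -- a Postgres table exposed in ClickHouse
SELECT
    person_id,
    toDate(date) AS day,
    count()      AS event_count
FROM events
WHERE toDate(date) = yesterday()
GROUP BY person_id, day;
```

The weekly/monthly/yearly roll-ups can then run entirely inside Postgres over this much smaller table.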
If you really do need to support totally ad hoc questions all the time then you should figure out how to get your Postgres data into Clickhouse. If the data really gets updated infrequently then it shouldn't be a problem to get changes in user/group membership into Clickhouse quickly, and then you can do all your joins and analysis completely in Clickhouse.
2. Do you really want the current group memberships? Or do you want the group memberships at the time the event occurred? It's a subtle difference and there's not usually one right answer (or the answer is "I need both").
If you have something like Kafka before you insert your events into CH, you could consider enriching the event data by connecting with your Postgres data. That way, you still retain a flat CH table which should work just fine.
Take a look at query engines like Trino (formerly PrestoSQL) [https://trino.io/]. (Disclaimer: I'm a contributor to Trino).
I used it at a previous job to combine data from MongoDB, Kafka, S3 and Postgres to great effect. It also tries to push down as many operations as possible to the source to improve performance.
Full ANSI SQL support over a number of backends (Kafka, Cassandra, Postgres, ClickHouse, S3 and many more).
The best part is it has a plugin ecosystem, so you can very easily implement your own connectors; all the heavy lifting gets done by the core engine while your plugin only has to abstract your backend into concepts that the engine can understand.
The problem with Trino is that it is not that easy to scale to the RPS Clickhouse can handle: it introduces tons of latency, and pushdowns are far from perfect.
Uber has a smart solution for Pinot, where they run it as single-node proxies.
I wouldn't use Trino if you are looking to ONLY query a single database like Clickhouse or Postgres etc (unless you want an ANSI SQL abstraction over your choice of database). Obviously ClickHouse and Postgres will have lower latency when hit directly because you can bypass the analysis, planning, optimization and scheduling that Trino does.
It does federation better than ClickHouse and that's where it shines. Joins across disparate systems - even between relational and non-relational systems.
And obviously for the MPP queries on distributed filesystems.
Offtopic, but I didn't realize there are prestosql (now Trino) [1] and prestodb. I was only aware of the prestodb side. I love the paper [3], which was apparently written before they left FB to form the new company.
ClickHouse can select and insert directly from/to remote MySQL [1] and PostgreSQL [2] tables. See MySQL and PostgreSQL database engines. It's a common way to access mutable dimension data as well as to pull data into ClickHouse for analysis.
I have not used PostgreSQL myself but the MySQL database engine works great. In some cases queries from ClickHouse to MySQL run faster than they do on MySQL itself. There are other engines as well, e.g., MongoDB.
There is a new Clickhouse database engine: "PostgreSQL"
"Allows to connect to databases on a remote PostgreSQL server. Supports read and write operations (SELECT and INSERT queries) to exchange data between ClickHouse and PostgreSQL."
I'm not sure why this got posted again on HN. It's been posted numerous times, though in the past the url used to be
https://github.com/yandex/ClickHouse so it wouldn't have been caught as a duplicate.
Just searching on HN Algolia there are numerous posts.
eg:
Because analytics wasn't in the title and like you said, it's an often repeated post, I thought they were repositioning/retargeting themselves as more general purpose, or something noteworthy, which had me curious.
FWIW, I don't see a problem with that level of posting. They are simply sharing things they find interesting with the community. If the community disagrees, it does not get upvotes. I might think it spammy at 100 a day, but 10-20 just feels like an active participant. And it balances out those of us who never submit links, but do comment.
I'm not very familiar with baseball so the concept of a "stolen base", the rules around it and how they changed over time were a fascinating read for me personally and I thought it might resonate here as well. It seems like it did not.
My heuristic for submitting is when I come across something (in this case ClickHouse) that I find useful or fascinating (a-ha moment) where I also think this might be of interest to others here (filter).
Great software. We are managing terabytes of stocks data and realtime market scanners queries across all market (billions of books and timesales) with hundreds of concurrent requests.
We were using kdb before, but clickhouse is more scalable, way cheaper and much easier to grasp for a newbie.
I know it's been a few days but what was the driver behind this question? My company uses CH and has near-real-time data being added (~5min delay from live), is this a problem and/or is there something about CH that doesn't play nice with this method? Or were you just interested in this particular case of inserting real time data?
Actually not, but we are ingesting a whole day of EOD market data (thousands of symbols, millions of timesales, billions of books events) in less than a couple of hours. This is 4x faster than the pace of the realtime market (not counting the extended hours due to the limited volume).
Years of daily intraday market data. We have an ingestion engine written in c# and a nodejs backend to run the queries (with an in-house driver built on top of the HTTP APIs)
Their docs intro page is really nice and has a cool gif explanation of row vs column databases. I have never used a column db before and while I kind of knew some of that theory this made it really click for me.
A really nice database. We use it to aggregate web analytics data in real time for Pirsch [0] and it performs really well. Writing queries for it is close to relational databases, which made the switch from Postgres easier. User data is still managed in Postgres and loosely coupled through an ID field.
My team is (over|ab)using Elasticsearch and I've had my eye on ClickHouse for a while. However we're going to migrate everything to AWS and I wonder if RedShift could be a good alternative too, since it's now supporting JSON and semi-structured data apparently.
Having used Redshift considerably in the past and experienced significant pain with many aspects of it, I would stay away. There are other much better options, ideally Snowflake.
My company has/in-in-the-process of moving from ES to CH and from my (limited) perspective it's been going well. I greatly prefer writing CH queries over ES. I can't speak to RedShift but I wanted to throw in the data point of an ES->CH move.
Thanks but my employer is becoming an AWS partner and we'll contractually need to spend a lot of money, meaning it's better if we use their SaaS offerings directly rather than going through the marketplace or a 3rd party.
Is there anyone using it in a schemaless way as in you provide a SaaS service and your customers can send you JSON with various different fields? As far as I remember Sentry uses it this way and people leverage arrays for such things?
wish someone would do a comparo between recent versions of CH, Druid and Pinot. (There's a pretty detailed one on the web but it is a couple years old and there's new features in all of them)
> Different orders for storing data are better suited to different scenarios. The data access scenario refers to what queries are made, how often, and in what proportion; how much data is read for each type of query – rows, columns, and bytes; the relationship between reading and updating data; the working size of the data and how locally it is used; whether transactions are used, and how isolated they are; requirements for data replication and logical integrity; requirements for latency and throughput for each type of query, and so on.