People who've used Clickhouse or other OLAP databases in production & at scale, how do you "interconnect" it with relational data?
I'm currently experimenting with Clickhouse, because my dataflow is increasing in size (40M rows right now, doubling every month or so) and my current setup (MongoDB) is at its limits. I would like to migrate the 40M rows to CH, but I also need the metadata for the rows to be in something more robust like Postgres or Mongo. Would you have a microservice that does the queries between the OLAP and relational DBs and does the join manually, exposing that as some low-level API? Or is using the various FDW options (remote tables in Clickhouse, Postgres clickhouse_fdw, etc.) realistic in production?
Sorry for the ramble, it's late here. But I can clarify if necessary.
Really depends what you're trying to do with your OLAP database.
If you're using it purely for reporting purposes then you really don't want to interconnect it with your OLTP database to support real-time queries. Reason number one is that you don't want some unexpected analytics workload to suddenly impact your production Postgres. Reason number two is that your analytics data model and your OLTP data model are frequently different. Usually your OLAP model needs to know what the value of a given dimension was at the time the event occurred, but if you're linking directly to Postgres then you can only see what the values are right now.
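To make the point-in-time issue concrete: one common pattern is to denormalize the dimension value into the fact row at insert time, instead of joining live. A sketch with made-up names (not a prescription):

```sql
CREATE TABLE events
(
    event_time    DateTime,
    user_id       UInt64,
    plan_at_event String   -- the user's plan *when the event fired*,
                           -- not whatever it happens to be today
)
ENGINE = MergeTree
ORDER BY (user_id, event_time);
```

A live join back to Postgres can only ever give you the current plan; the snapshot column preserves history.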
You can also go the other direction if you need the current values of OLTP data joined with your Clickhouse data: schedule an ETL out of Clickhouse and back into your OLTP database, and aggregate the data to a reasonable level such that Postgres can handle it without a problem.
What's your actual use case? I'd normally consider a requirement to join OLAP & OLTP data in real time to be a "design smell". I don't mean that there's no value in things like fdw or easier ways to move data around, but you should consider using it to help with the ETL process and not as a real-time interconnect. Keep OLTP & OLAP workloads separate and both of your DBs will be happier.
Alright, here is an example. My data is a stream of events, each row has the values (event_id, person_id, date). This is the table with 40M rows, with inserts between 2/s and 10/s.
person_id is a foreign ID for a "person table" that fits well in the relational model. A "person" has various attributes (name, DOB, email) as well as a one-to-many relation to "groups". Groups are a collection of users with additional attributes (e.g. group name).
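For reference, the events table described above could be laid out in ClickHouse roughly like this (the column types are assumptions, since only the names were given):

```sql
CREATE TABLE events
(
    event_id  UInt64,
    person_id UInt64,
    date      DateTime
)
ENGINE = MergeTree
ORDER BY (person_id, date);  -- sorting by person_id keeps per-person scans cheap
```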
Now, what if I want to answer questions like:
- how many total events for all persons in group X
- who are the top 10 users in number of events in group X
- which are the top 10 groups in number of events
In this case, the person/group tables are part of the core business logic; they aren't specific to the events table, so it doesn't make sense to store them in Clickhouse. Also, this person/group data is updated sparsely, but staleness should be kept to a minimum (< 30 secs).
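One ClickHouse-side option that fits that freshness budget, for what it's worth: an external dictionary backed by the Postgres table and refreshed on an interval. A sketch (all connection details and names here are made up):

```sql
CREATE DICTIONARY person_dict
(
    person_id UInt64,
    group_id  UInt32
)
PRIMARY KEY person_id
SOURCE(POSTGRESQL(
    port 5432
    host 'pg-host'
    user 'pg_user'
    password 'pg_password'
    db 'mydb'
    table 'person'
))
LAYOUT(HASHED())
LIFETIME(MIN 0 MAX 30);

-- Then enrich/filter at query time:
SELECT dictGet('person_dict', 'group_id', person_id) AS group_id, count()
FROM events
GROUP BY group_id;
```

ClickHouse re-polls the source within the LIFETIME window, so the dictionary stays at most ~30s stale while person/group remain owned by Postgres.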
The simple approach to the first question would be
- Get all the user IDs in group X
- Filter events by those user IDs
But what if there are tens of thousands of users in group X? And hundreds of groups? Are megabyte-long queries supported in Clickhouse?
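You generally don't need to inline tens of thousands of IDs into the query text. If the person/group tables are reachable from ClickHouse (e.g. via the PostgreSQL database engine), the ID list can stay server-side as a subquery (names illustrative):

```sql
SELECT count() AS total_events
FROM events
WHERE person_id IN
(
    SELECT person_id
    FROM pg.person_group   -- a table in a PostgreSQL-engine database
    WHERE group_id = 42
);
```

(For what it's worth, ClickHouse's default max_query_size is 256 KiB and configurable, so the literal-list approach doesn't hit a hard wall immediately; it's just ugly and slow to parse.)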
1. Are the questions you're asking completely ad hoc? Or can you mostly define them ahead of time? If it's the former then you should be looking at getting your OLTP data into Clickhouse. If it's the latter then you should be looking to aggregate the data to various levels and get it out of Clickhouse.
All three of your sample questions lend themselves quite nicely to pre-aggregation. I'm sure your actual questions are more complex, but what I'd do to address all three of your examples is every night I'd roll up the raw events into (person_id, date, event_count) and send it back over to Postgres. Then every week you roll up the previous seven days into (person_id, week, event_count). Each month you roll the weeks up and each year you roll the months up. If you need the data more frequently than daily then you can go down to hourly or whatever it is you need.
Now you've got your data back in Postgres but at a reasonable granularity. Depending on the cardinality of the user-group relationship you might have to do some magic to pre-aggregate that if the join is big as well, which could turn into a challenge as group membership changes (you'd need to re-aggregate all your group metrics any time group membership changed), but it's still better than trying to join across the Clickhouse/Postgres boundary.
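The nightly roll-up plus write-back can even be a single ClickHouse statement if the Postgres target is exposed as a PostgreSQL-engine table (database/table names here are made up):

```sql
INSERT INTO pg.daily_event_counts   -- a Postgres table exposed in ClickHouse
SELECT
    person_id,
    toDate(date) AS day,
    count()      AS event_count
FROM events
WHERE toDate(date) = yesterday()
GROUP BY person_id, day;
```

The weekly/monthly/yearly roll-ups can then run entirely inside Postgres over this much smaller table.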
If you really do need to support totally ad hoc questions all the time then you should figure out how to get your Postgres data into Clickhouse. If the data really gets updated infrequently then it shouldn't be a problem to get changes in user/group membership into Clickhouse quickly, and then you can do all your joins and analysis completely in Clickhouse.
2. Do you really want the current group memberships? Or do you want the group memberships at the time the event occurred? It's a subtle difference and there's not usually one right answer (or the answer is "I need both").
If you have something like Kafka before you insert your events into CH, you could consider enriching the event data by connecting with your Postgres data. That way, you still retain a flat CH table which should work just fine.
Take a look at query engines like Trino (formerly PrestoSQL) [https://trino.io/]. (Disclaimer: I'm a contributor to Trino).
I used it at a previous job to combine data from MongoDB, Kafka, S3 and Postgres to great effect. It also tries to push down as many operations as possible to the source to improve performance.
Full ANSI SQL support over a number of backends (Kafka, Cassandra, Postgres, ClickHouse, S3 and many more).
The best part is it has a plugin ecosystem, so you can very easily implement your own connectors; all the heavy lifting gets done by the core engine while your plugin only has to abstract your backend into concepts that the engine can understand.
The problem with Trino is that it is not that easy to scale to the RPS Clickhouse can handle: it introduces tons of latency, and pushdowns are far from perfect.
Uber has a smart solution for Pinot, where they run it as single-node proxies.
I wouldn't use Trino if you are looking to ONLY query a single database like Clickhouse or Postgres etc (unless you want an ANSI SQL abstraction over your choice of database). Obviously ClickHouse and Postgres will have lower latency when hit directly because you can bypass the analysis, planning, optimization and scheduling that Trino does.
It does federation better than ClickHouse and that's where it shines. Joins across disparate systems - even between relational and non-relational systems.
And obviously for the MPP queries on distributed filesystems.
Offtopic, but I didn't realize there are prestosql (now Trino) [1] and prestodb. I was only aware of the prestodb side. I love the paper [3], which was apparently written before they left FB to form the new company.
ClickHouse can select and insert directly from/to remote MySQL [1] and PostgreSQL [2] tables. See MySQL and PostgreSQL database engines. It's a common way to access mutable dimension data as well as to pull data into ClickHouse for analysis.
I have not used PostgreSQL myself but the MySQL database engine works great. In some cases queries from ClickHouse to MySQL run faster than they do on MySQL itself. There are other engines as well, e.g., MongoDB.
There is a new Clickhouse database engine: "PostgreSQL"
"Allows to connect to databases on a remote PostgreSQL server. Supports read and write operations (SELECT and INSERT queries) to exchange data between ClickHouse and PostgreSQL."
I'm not sure why this got posted again on HN. It's been posted numerous times, though in the past the url used to be
https://github.com/yandex/ClickHouse so it wouldn't have been caught as a duplicate.
Just searching on HN Algolia there are numerous posts.
eg:
Because analytics wasn't in the title and like you said, it's an often repeated post, I thought they were repositioning/retargeting themselves as more general purpose, or something noteworthy, which had me curious.
FWIW, I don't see a problem with that level of posting. They are simply sharing things they find interesting with the community. If the community disagrees, it does not get upvotes. I might think it spammy at 100 a day, but 10-20 just feels like an active participant. And it balances out those of us who never submit links, but do comment.
I'm not very familiar with baseball so the concept of a "stolen base", the rules around it and how they changed over time were a fascinating read for me personally and I thought it might resonate here as well. It seems like it did not.
My heuristic for submitting is when I come across something (in this case ClickHouse) that I find useful or fascinating (a-ha moment) where I also think this might be of interest to others here (filter).
Great software. We are managing terabytes of stocks data and realtime market scanners queries across all market (billions of books and timesales) with hundreds of concurrent requests.
We were using kdb before, but clickhouse is more scalable, way cheaper and much easier to grasp for a newbie.
I know it's been a few days but what was the driver behind this question? My company uses CH and has near-real-time data being added (~5min delay from live), is this a problem and/or is there something about CH that doesn't play nice with this method? Or were you just interested in this particular case of inserting real time data?
Actually not, but we are ingesting a whole day of EOD market data (thousands of symbols, millions of timesales, billions of books events) in less than a couple of hours. This is 4x faster than the pace of the realtime market (not counting the extended hours due to the limited volume).
Years of daily intraday market data. We have an ingestion engine written in c# and a nodejs backend to run the queries (with an in-house driver built on top of the HTTP APIs)
Their docs intro page is really nice and has a cool gif explanation of row vs column databases. I have never used a column db before and while I kind of knew some of that theory this made it really click for me.
A really nice database. We use it to aggregate web analytics data in real time for Pirsch [0] and it performs really well. Writing queries for it is close to relational databases, which made the switch from Postgres easier. User data is still managed in Postgres and loosely coupled through an ID field.
My team is (over|ab)using Elasticsearch and I've had my eye on ClickHouse for a while. However we're going to migrate everything to AWS and I wonder if RedShift could be a good alternative too, since it's now supporting JSON and semi-structured data apparently.
Having used Redshift considerably in the past and experienced significant pain with many aspects of it, I would stay away. There are other much better options, ideally Snowflake.
My company has/in-in-the-process of moving from ES to CH and from my (limited) perspective it's been going well. I greatly prefer writing CH queries over ES. I can't speak to RedShift but I wanted to throw in the data point of an ES->CH move.
Thanks but my employer is becoming an AWS partner and we'll contractually need to spend a lot of money, meaning it's better if we use their SaaS offerings directly rather than going through the marketplace or a 3rd party.
Is there anyone using it in a schemaless way as in you provide a SaaS service and your customers can send you JSON with various different fields? As far as I remember Sentry uses it this way and people leverage arrays for such things?
wish someone would do a comparo between recent versions of CH, Druid and Pinot. (There's a pretty detailed one on the web but it is a couple years old and there's new features in all of them)
> Different orders for storing data are better suited to different scenarios. The data access scenario refers to what queries are made, how often, and in what proportion; how much data is read for each type of query – rows, columns, and bytes; the relationship between reading and updating data; the working size of the data and how locally it is used; whether transactions are used, and how isolated they are; requirements for data replication and logical integrity; requirements for latency and throughput for each type of query, and so on.