Upload data to ClickHouse

Hello everyone,

We would like to switch storage from RedShift to ClickHouse.
From the user's perspective, the main difference between these two storages is the table count: there will be one wide table in ClickHouse, while RedShift has multiple custom tables.
From the SnowPlow pipeline perspective, that means we need to skip the shredding step but keep other functionality like deduplication.

The current solution is to upload data to RedShift, export it in Parquet format, then JOIN the data and upload it to ClickHouse. But this workflow is too complicated, so I'm looking for another solution.
Is there any way to load data into ClickHouse right now? Or should we create our own loader to work with ClickHouse?

Best Regards,
Vasily

Hi @Vasily_Pribytkov,

I think your slightly easier option would be to use the Snowflake Transformer. Despite its name, it doesn't produce data in a Snowflake-specific format, but in plain JSON, like:

{"app_id": "angry-birds", "useragent": "Firefox", "contexts_com_acme_status_1": [...]}

So your pipeline would look like:

S3 → Snowflake Transformer (without Loader) → ClickHouse raw import (s3 table function) → ClickHouse structured
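
For the raw import step, a call to the s3 table function could look roughly like this (a sketch only: the bucket path and table name are placeholders, and I'm reading each JSON line as a single string with the JSONAsString format):

-- Raw landing table: one JSON object per row, stored as a plain string
CREATE TABLE snowplow_raw
(
    json String
)
ENGINE = MergeTree
ORDER BY tuple();

-- Pull the transformer's output straight from S3 (the path is a placeholder)
INSERT INTO snowplow_raw
SELECT json
FROM s3(
    'https://my-bucket.s3.amazonaws.com/transformed/run=2021-12-01/*',
    'JSONAsString',
    'json String'
);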

In the last step you basically need to extract all your self-describing JSON blobs into structured ClickHouse columns.
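
Something along these lines, for example (again just a sketch: the target columns, and the code/message fields inside the com.acme status context, are made up for illustration):

-- Structured table with real ClickHouse columns (pick a proper sort key for production)
CREATE TABLE snowplow_structured
(
    app_id String,
    useragent String,
    status_code UInt16,
    status_message String
)
ENGINE = MergeTree
ORDER BY app_id;

INSERT INTO snowplow_structured
SELECT
    JSONExtractString(json, 'app_id')    AS app_id,
    JSONExtractString(json, 'useragent') AS useragent,
    -- take the first entity of the context array; 'code' and 'message' are hypothetical fields
    JSONExtractUInt(json, 'contexts_com_acme_status_1', 1, 'code')      AS status_code,
    JSONExtractString(json, 'contexts_com_acme_status_1', 1, 'message') AS status_message
FROM snowplow_raw;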

You can do the same thing with plain enriched data (without Snowflake Transformer), but in-ClickHouse transformations would probably look much more complicated.

That's all the bad news. Here is some good news:

A native ClickHouse Loader is on our roadmap for 2022. We've already built a very small proof-of-concept (pre-alpha, don't use it in production!), but it's incapable of doing many basic things: it loads only atomic columns (no contexts or self-describing events) and it loads data only from the local filesystem (not Kinesis or PubSub). If you still want to give it a go, I can build a fatjar for you.
