Upload data to ClickHouse

Hello everyone,

We would like to switch storage from RedShift to ClickHouse.
From the user's perspective, the main difference between these two storages is the table count: there will be one wide table in ClickHouse, while RedShift has multiple custom tables.
From the SnowPlow pipeline perspective, that means we need to skip the shredding step but keep other functionality like deduplication.

The current solution is to upload data to RedShift, export it in Parquet format, then JOIN the data and upload it to ClickHouse. But this workflow is too complicated, so I'm looking for another solution.
Is there any way to load data into ClickHouse right now? Or should we create our own loader to work with ClickHouse?

Best Regards,
Vasily

Hi @Vasily_Pribytkov,

I think your slightly easier option would be to use the Snowflake Transformer. Despite its name, it doesn't produce data in a Snowflake-specific format, but in plain JSON, like:

{"app_id": "angry-birds", "useragent": "Firefox", "contexts_com_acme_status_1": [...]}

So your pipeline would look like:

S3 → Snowflake Transformer (without Loader) → ClickHouse raw import (s3 table function) → ClickHouse structured
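
For the raw import step, a call to the s3 table function could look roughly like this (a sketch only: the bucket path and table name are placeholders, and I'm reading each JSON line as a single string with the JSONAsString format):

-- Raw landing table: one JSON object per row, stored as a plain string
CREATE TABLE snowplow_raw
(
    json String
)
ENGINE = MergeTree
ORDER BY tuple();

-- Pull the transformer's output straight from S3 (the path is a placeholder)
INSERT INTO snowplow_raw
SELECT json
FROM s3(
    'https://my-bucket.s3.amazonaws.com/transformed/run=2021-12-01/*',
    'JSONAsString',
    'json String'
);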

In the last step you basically need to extract all your self-describing JSON blobs into structured ClickHouse columns.
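
Something along these lines, for example (again just a sketch: the target columns, and the code/message fields inside the com.acme status context, are made up for illustration):

-- Structured table with real ClickHouse columns (pick a proper sort key for production)
CREATE TABLE snowplow_structured
(
    app_id String,
    useragent String,
    status_code UInt16,
    status_message String
)
ENGINE = MergeTree
ORDER BY app_id;

INSERT INTO snowplow_structured
SELECT
    JSONExtractString(json, 'app_id')    AS app_id,
    JSONExtractString(json, 'useragent') AS useragent,
    -- take the first entity of the context array; 'code' and 'message' are hypothetical fields
    JSONExtractUInt(json, 'contexts_com_acme_status_1', 1, 'code')      AS status_code,
    JSONExtractString(json, 'contexts_com_acme_status_1', 1, 'message') AS status_message
FROM snowplow_raw;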

You can do the same thing with plain enriched data (without Snowflake Transformer), but in-ClickHouse transformations would probably look much more complicated.

That's all the bad news. Here is some good news:

A native ClickHouse Loader is on our roadmap for 2022. We've already built a very small proof-of-concept (pre-alpha, don't use it in production!), but it's incapable of doing many basic things: it loads only atomic columns (no contexts or self-describing events) and it loads data only from the local filesystem (not Kinesis or PubSub). If you still want to give it a go, I can build a fatjar for you.
