On-Premise PostgreSQL storage. Still requires S3?


#1

Hello,

I’m interested in running as much of Snowplow on-premise as possible. Currently, I’m stuck on the step of configuring picking up of Enriched events and putting them into PostgreSQL storage.

Some of the documentation I’ve seen seems to imply that S3 is required for persisting the events to Storage.

Is there a way to swap-out using Amazon S3 buckets for an on-premise option, for communicating Enriched Events to persist them to long-term storage in PostgreSQL?

If s3 is required, has anyone tried using use a local Minio instance instead?

FYI, I’m using Kafka as the back-end, and have simple steps configured for

  • Javascript Tracker
  • Snowplow collector (scala), writing to Kafka
  • Snowplow stream enrich, reading from Kafka

Currently, the company I work for uses Mesosphere DC/OS. There is considerable investment in running their systems in DC/OS (locally hosted) and keeping data within their own data centers.

Snowplow use cases are as follows

  1. Web-click tracking and analysis - most traditional Snowplow use case.
    1a. Track/analyze users’ clicks, page navigation, changing of selected options, timings on pages and page items, etc. Perform analytics on this info.
  2. Serve as a “Messaging Framework” - Hence the interest in Kafka.
    2a. Audit Only Events - capture for audit only events from various applications and systems, for auditing and further analytics.
    2b. Serve as Central Messaging Hub / Broker, so for Actionable Events (events from one or more systems/apps that require further applications/systems to perform some processing ). These would also be audited. Here, minimal latency between putting an event in a collector and having the enriched event available in a Kafka topic will be critical.

I appreciate any comments and guidance.


#3

In theory you should be able to switch out S3 for Minio/Openstack though in practice I’ve never seen anyone do it. Out of interest what is your use case for on premise?


#4

Hello @dbh,

Few moments here.

  1. There’s no way right now you can put enriched data from S3 into any relational database. In vanilla batch pipeline we have additional Shred step that prepares enriched data for loading into Redshift and Postgres.
  2. Even with Shred step, Postgres right now lacks support of self-describing JSON - it loads only atomic.events table, which is most likely less than you want.
  3. S3 right now is hardcoded into RDB Loader, so it simply doesn’t know how to fetch data from other sources. This is obviously not going to remain in this state forever - we’re planning to add new cloud providers and storage targets, which inevitable will also open opportunities for on-premise solutions. But considering previous points this one is least of our problems.

All above make Postgres load with any object storage apart from S3 hardly feasible right now. But still we saw many efforts (1, 2, 3) on this forum to build on-premise pipeline using Kafka. I believe people usually end up with Kafka JDBC Connect, which is less persistent than object storage, but looking very promising.

Hope that helps.


#5

Thank you! I’ve added use case descriptions to the original post.
FYI - Usage of Mesosphere DC/OS may be similar to OpenStack.


#6

Thanks very much for the feedback! This is good to know and may shape the direction for the analytics part of the project.

FYI, I added some use case info to the original post.


#7

Check this link : https://www.confluent.io/blog/kafka-connect-sink-for-postgresql-from-justone-database/
https://github.com/justonedb/kafka-sink-pg-json