Is it bad to sink data from a Kinesis stream directly to Postgres?

ChocoPowwwa · August 19, 2016, 9:15am

What I’m trying to do right now: I am trying to build a real-time pipeline.

A few days ago I tried sinking events to Elasticsearch but ran into a problem.
Elasticsearch returns error code 429 (Too Many Requests), and I assume it is having trouble keeping up with indexing. (If you can help me solve this Elasticsearch problem, that would be nice.)
So now I have come up with an alternative: storing the data in Postgres. I still want to use the real-time pipeline, but I saw that Snowplow doesn’t have a Kinesis-to-Postgres sink.

So what I would have to do is sink the Kinesis stream to S3 (using LZO) and then use the StorageLoader to push the data into Postgres, am I right? But isn’t that a waste of resources? It would be nice if we could eliminate the S3 step.

So, all of that aside, is it bad to sink data from a Kinesis stream directly to Postgres?

Thank you,

alex · August 19, 2016, 4:55pm

Hey @ChocoPowwwa - sinking enriched events into Postgres from the Snowplow Kinesis pipeline isn’t officially supported, but you are welcome to give it a try. There’s a lot of interesting community experimentation with different storage targets for Snowplow right now.

Just a note that the enriched event format doesn’t exactly correspond to the format loaded into Redshift or Postgres via the batch pipeline - and the only component we have which can do the transformation is Scala Hadoop Shred; this is why we recommend a Lambda architecture for loading Redshift or Postgres currently: How to setup a Lambda architecture for Snowplow

mike · August 21, 2016, 11:55pm

There are several ways to get real-time data, but as @alex mentioned above, there are some caveats to doing so. It’s not a bad idea to try building real-time applications on top of Elasticsearch if possible, so solving the 429 issue may be the easiest path. How are you currently scaling your Elasticsearch nodes?
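Independent of how the cluster is sized, a common way to cope with 429 responses is to retry the rejected batch with exponential backoff instead of dropping it or resending immediately. Below is a minimal Python sketch of that idea. It is not how the Snowplow Elasticsearch sink works internally; `TooManyRequests`, `send_with_backoff`, and the `send` callable are hypothetical names made up for illustration, and `send` is assumed to raise `TooManyRequests` whenever the cluster answers with HTTP 429.

```python
import random
import time


class TooManyRequests(Exception):
    """Hypothetical error: raised by `send` when the cluster answers HTTP 429."""


def send_with_backoff(send, batch, max_retries=5, base_delay=0.5, sleep=time.sleep):
    """Call send(batch); on a 429, wait and retry with exponentially growing delays.

    The wait before retry number `attempt` is base_delay * 2**attempt seconds,
    plus up to 10% random jitter so that parallel writers do not retry in lockstep.
    """
    for attempt in range(max_retries + 1):
        try:
            return send(batch)
        except TooManyRequests:
            if attempt == max_retries:
                raise  # retries exhausted: surface the error to the caller
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 10))
```

With the defaults, a batch that keeps getting rejected is retried after roughly 0.5, 1, 2, 4, and 8 seconds before the error is raised. Backoff only smooths out bursts; if the cluster is rejecting requests continuously, it still needs more indexing capacity or smaller batches.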

ChocoPowwwa · August 24, 2016, 8:58am

@alex thanks, I’ve been looking through that Lambda architecture thread, but I still can’t see the use case of a Lambda architecture for me (I guess I need time to learn about Lambda architecture).

@mike I haven’t figured out how to scale properly. Right now I’m using the AWS Elasticsearch Service with 2 m3.medium instances and 2 dedicated master nodes, and I run Gatling (a slightly modified Snowplow Avalanche exponentialpeak.scala script with SimulationTime = 2 minutes, BaselineUsers = 400, PeakUsers = 1500) to test how many requests the cluster can handle. The index has 25 primary shards and 1 replica.

I haven’t been able to run the test to completion because of a network issue, so here is my Gatling output so far ( https://drive.google.com/file/d/0Bx2_Ied-yt-_bEIyNldSZzFEUk0/view?usp=sharing ). I’m really upset with the result: is Elasticsearch really this slow at indexing?

So, how do you set up and scale your Elasticsearch cluster?
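For reference, the shard layout described above can be sanity-checked with a little arithmetic. The Python sketch below assumes the two m3.medium instances are the only data nodes (dedicated master nodes do not hold shards); `shards_per_node` is a hypothetical helper written for this example, not part of any Elasticsearch client.

```python
def shards_per_node(primary_shards, replicas, data_nodes):
    """Return (total shard copies, average shard copies per data node)."""
    # Each primary shard exists once, plus once more for every replica.
    total = primary_shards * (1 + replicas)
    return total, total / data_nodes


# Values taken from the post: 25 primary shards, 1 replica, 2 data nodes.
total, per_node = shards_per_node(primary_shards=25, replicas=1, data_nodes=2)
print(f"{total} shard copies in total, {per_node:.0f} per data node")
```

Under those assumptions each m3.medium node holds 25 shard copies of this one index. Every shard carries its own overhead, so a layout like this on small nodes may contribute to rejected requests; treat that as a possibility to test, not a diagnosis.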
