Is it bad to sink data from a Kinesis stream directly to Postgres?


#1

Right now I’m trying to build a real-time pipeline.

A few days ago I tried to sink events to Elasticsearch, but I ran into a problem.
Elasticsearch returns error code 429 (too many requests), so I assume it is having trouble keeping up with indexing the data. (If you can help me get past this Elasticsearch problem, that would be nice.)
As an alternative, I’m now thinking of storing the data in Postgres. I still want to use the real-time pipeline, but I saw that Snowplow doesn’t have a Kinesis-to-Postgres sink.

So what I have to do is sink the Kinesis stream to S3 (using LZO) and then use the storage loader to push the data to Postgres, am I right? But isn’t that a waste of resources? It would be nice if we could eliminate S3, wouldn’t it?

So, all of that aside, is it bad to sink data from a Kinesis stream directly to Postgres?

Thank you,


#2

Hey @ChocoPowwwa - sinking enriched events into Postgres from the Snowplow Kinesis pipeline isn’t officially supported, but you are welcome to give it a try. There’s a lot of interesting community experimentation with different storage targets for Snowplow right now.

Just a note that the enriched event format doesn’t exactly correspond to the format loaded into Redshift or Postgres via the batch pipeline - and the only component we have which can do the transformation is Scala Hadoop Shred; this is why we recommend a Lambda architecture for loading Redshift or Postgres currently: How to setup a Lambda architecture for Snowplow


#3

There are several ways to go about getting real-time data, but as @alex mentioned above, there are some caveats to doing so. It’s not a bad idea to try to build some real-time applications on top of Elasticsearch if possible, so attempting to solve the 429 issue may be the easiest path. How are you currently scaling your Elasticsearch nodes?
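
On the 429 itself: Elasticsearch typically returns it when it is rejecting requests it cannot keep up with, so the usual client-side response is to slow down and retry rather than drop the events. Here is a minimal Python sketch of retrying with exponential backoff. It assumes a hypothetical `send` function that raises `TooManyRequests` when the response status is 429; neither name comes from Snowplow or from any Elasticsearch client.

```python
import random
import time


class TooManyRequests(Exception):
    """Raised by the (hypothetical) sender when Elasticsearch answers 429."""


def send_with_backoff(send, batch, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call send(batch); on a 429, wait and retry with exponential backoff.

    Returns the number of attempts used. Re-raises after max_attempts.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            send(batch)
            return attempt
        except TooManyRequests:
            if attempt == max_attempts:
                raise
            # Double the wait on each retry and add random jitter so that
            # several writers do not all retry at the same instant.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))
```

Retrying only smooths over short bursts. If the cluster is rejecting requests for the whole test, it probably needs more capacity or a lower write rate.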


#4

@alex thanks, I’ve been looking through that Lambda architecture thread, but I still can’t understand the use case of a Lambda architecture for me (I guess I need time to learn about Lambda architecture).

@mike I haven’t figured out how to scale properly. All I’m doing right now is using the AWS Elasticsearch service with 2 m3.medium instances and 2 dedicated master nodes. To test how many requests the cluster can handle, I run Gatling (a slightly modified version of the Snowplow Avalanche exponentialpeak.scala script, with SimulationTime = 2 minutes, BaselineUsers = 400, PeakUsers = 1500). The index has 25 primary shards and 1 replica.
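
To put that shard setup in numbers, here is a quick Python sketch of the arithmetic. It assumes the shards are balanced evenly across the 2 data instances and that the dedicated masters hold no data; the function name is only for illustration.

```python
def shards_per_data_node(primary_shards, replicas, data_nodes):
    """Return (total shard copies, shard copies per data node)."""
    # Every primary shard has `replicas` additional copies.
    total = primary_shards * (1 + replicas)
    return total, total / data_nodes


total, per_node = shards_per_data_node(primary_shards=25, replicas=1, data_nodes=2)
print(total, per_node)  # 50 25.0
```

So that is 50 shard copies in total, or 25 shards of this one index on each m3.medium.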

I haven’t been able to run the test to completion because of a network issue, so here is my Gatling output so far ( https://drive.google.com/file/d/0Bx2_Ied-yt-_bEIyNldSZzFEUk0/view?usp=sharing ). I’m really disappointed with the result. Is Elasticsearch really this slow at indexing?

So, how do you set up / scale your Elasticsearch cluster?