Scala Kinesis Enrich

nakories · April 7, 2018, 7:08pm

I keep reading that you can drip-feed data into Redshift.

"In 2014, Snowplow added an Amazon Kinesis stream to its service to capture and store data from client systems. The data is then drip-fed into Redshift for continuous real-time processing. " https://aws.amazon.com/solutions/case-studies/snowplow/

Yet I can’t find any documentation on this. At the moment I can only find Elasticsearch, so I am a bit confused: can you feed directly into Redshift from the Kinesis stream or not?

Alex

anton · April 9, 2018, 4:09am

Hello @nakories,

Unfortunately, we don’t support direct drip-feeding, but you might find our Introduction to Lambda architecture interesting, as well as the recent R102 Afontova Gora release, which brings significant improvements to the architecture described in the first post.

Long story short: you can set up a real-time pipeline with Amazon Kinesis and have Snowplow S3 Loader dump enriched data to S3. Then you can set up EmrEtlRunner to only shred the enriched data and load it straight into Redshift.

nakories · April 9, 2018, 10:38am

Thank you. I now have the S3 Loader running, which takes the stream from Kinesis and writes it to gzipped files on S3.

Just fighting with EmrEtlRunner, as it’s not playing ball. Well, it’s running, just not picking up any files to push to Redshift. I am not sure how it knows where to pick the files up from.

Do I need to shred before pushing to Redshift?

anton · April 9, 2018, 10:52am

Yes, you do. Shredding is a step dedicated to preparing enriched data (which can be considered the canonical format) for loading into Redshift. As described in the R102 release notes, you need to add a new enriched.stream bucket to your config.yml, pointing to the Kinesis output directory. EmrEtlRunner will stage this data for shredding.
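For illustration, here is a minimal sketch of what that config.yml change might look like. It assumes the new key sits under aws, s3, buckets, enriched, as the R102 release notes describe; the bucket path is a made-up placeholder, and all other keys in the file are omitted.

```yaml
# Fragment of EmrEtlRunner's config.yml (R102 or later); other keys omitted.
# Assumption: the bucket path below is a placeholder, not a real location.
aws:
  s3:
    buckets:
      enriched:
        # Directory where Snowplow S3 Loader writes the enriched events
        # it reads from the Kinesis stream
        stream: s3://my-s3-loader-output-bucket/enriched/
```

With this key present, EmrEtlRunner should stage the files found under that path before shredding them.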

nakories · April 9, 2018, 11:23am

Hi Anton,

I have a stream-collector pushing data to S3 - this works.
I have a stream-enricher pulling data from the above stream and pushing data to a new stream.
I have the snowplow-s3-loader-0.6.0.jar pulling data from the above stream and gzipping the data to an S3 bucket.

This is the step I am unclear on now.

I now need to run the snowplow-emr-etl-runner. I have a targets folder with the redshift database target.

So I need to shred the data.

I have added stream: s3://pf-dol-my-out-bucket/enriched/good to my config, as per the release notes.

Should the ETL runner pick up the gzipped enriched files and shred them, storing them in the S3 shredded.good bucket, from which they are then loaded into Redshift?

./snowplow-emr-etl-runner run -x staging,enrich,elasticsearch,archive_raw,analyze,archive_enriched,archive_shredded --config config.yml --resolver iglu_resolver.json --target targets

This is the ETL command I am running.

anton · April 9, 2018, 11:30am

Yes, this is all correct. One point however:

Once you have added the enriched.stream bucket, you don’t need to explicitly skip the enrich step anymore: in “Stream Enrich mode”, EmrEtlRunner simply “forgets” about Spark Enrich. I actually wouldn’t recommend skipping any of those steps unless you fully understand what they mean.

nakories · April 9, 2018, 12:27pm

Hi Anton,

Thank you, I have removed the -x arguments.

One more question: is there a way of seeing what the errors are for the Redshift storage step? At the moment it’s failing on that step, but I am not sure why.

Alex

nakories · April 9, 2018, 1:06pm

An update: I can see the COPY query in Redshift, so it must be connecting correctly. However, the query is now being aborted.

anton · April 9, 2018, 1:22pm

EmrEtlRunner should fetch RDB Loader’s output and print it to stdout. If it didn’t fetch it, then I doubt it really connected, and there was probably a configuration error. You can also check RDB Loader’s stdout in the EMR console. Most likely this is something like a non-existent Redshift table or a JSONPaths mismatch.

nakories · April 9, 2018, 2:12pm

Hi Anton,

Thanks again for your help. It’s now failing on a step that worked before, so I am reviewing what I might have changed.

Thanks,

Alex
