Scala Kinesis Enrich


#1

I keep reading that you can drip-feed data into Redshift.

"In 2014, Snowplow added an Amazon Kinesis stream to its service to capture and store data from client systems. The data is then drip-fed into Redshift for continuous real-time processing. " https://aws.amazon.com/solutions/case-studies/snowplow/

Yet I can’t find any documentation on this. At the moment I can only find Elasticsearch, so I am a bit confused: can you feed directly into Redshift from the Kinesis stream or not?

Alex


#2

Hello @nakories,

Unfortunately, we don’t support direct drip-feeding, but you might be interested in our Introduction to Lambda architecture and the recent R102 Afontova Gora release, which brings significant improvements to the architecture described in the first post.

Long story short: you can set up a real-time pipeline with Amazon Kinesis and Snowplow S3 Loader dumping enriched data to S3. Then you can set up EmrEtlRunner to only shred the enriched data and load it straight into Redshift.


#3

Thank you. I now have the s3-loader running, which takes the stream from Kinesis and writes it to gzipped files on S3.

I’m just fighting with EmrEtlRunner, as it’s not playing ball. It runs, but it doesn’t pick up any files to push to Redshift, and I am not sure how it knows where to pick the files up from.

Do I need to shred before pushing to Redshift?


#4

Yes, you do. Shredding is a step dedicated to preparing enriched data (which can be considered the canonical format) for loading into Redshift. As described in the R102 release notes, you need to add a new enriched.stream bucket to your config.yml, pointing to the Kinesis output dir. EmrEtlRunner will stage this data for shredding.
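
As a rough illustration of where that setting lives, here is a minimal sketch of the relevant part of config.yml. It assumes the usual aws.s3.buckets layout of the EmrEtlRunner config; all bucket names are hypothetical placeholders, and the other required keys are left out for brevity.

```yaml
# Fragment of config.yml only; bucket names below are hypothetical placeholders.
aws:
  s3:
    buckets:
      enriched:
        # Where EmrEtlRunner stages enriched data before shredding
        good: s3://my-etl-bucket/enriched/good
        archive: s3://my-etl-bucket/enriched/archive
        # New in R102: the directory Snowplow S3 Loader writes enriched events to
        stream: s3://my-s3-loader-bucket/enriched/
```

With enriched.stream present, EmrEtlRunner reads from the stream location and stages those files for shredding, so the stream bucket should be the S3 Loader's output directory.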


#5

Hi Anton,

I have a stream-collector pushing data to s3 :- this works
I have a stream-enricher pulling data from the above stream and pushing data to a new stream
I have the snowplow-s3-loader-0.6.0.jar pulling data from the above stream and gzipping the data to an S3 bucket.

This is the step I am unclear on now.

I now need to run the snowplow-emr-etl-runner. I have a targets folder with the redshift database target.

So I need to shred the data.

I have added stream: s3://pf-dol-my-out-bucket/enriched/good to my config, as per the release notes.

Should the ETL runner pick up the gzipped enriched files and then shred them, storing them in the S3 shredded.good bucket, from which they are then loaded into Redshift?

./snowplow-emr-etl-runner run -x staging,enrich,elasticsearch,archive_raw,analyze,archive_enriched,archive_shredded --config config.yml --resolver iglu_resolver.json --target targets

This is the ETL command I am running.


#6

Yes, this is all correct. One point however:

Once you have added the enriched.stream bucket, you don’t need to explicitly skip the enrich step anymore: in “Stream Enrich mode”, EmrEtlRunner simply “forgets” about Spark Enrich. I actually wouldn’t recommend skipping any of those steps unless you fully understand what they mean.


#7

Hi Anton,

Thank you, I have removed the -x arguments.

One more question: is there a way of seeing what the errors are for the Redshift storage step? At the moment it’s failing on that step, but I am not sure why.

Alex


#8

An update: I can see the COPY query in Redshift, so it must be connecting correctly; however, the query is now aborted.


#9

EmrEtlRunner should fetch RDB Loader’s output and print it to stdout. If it didn’t fetch it, then I doubt it really connected, and there was probably a configuration error. You can also check RDB Loader’s stdout in the EMR console. Most likely this is something like a non-existent Redshift table or a JSONPaths mismatch.


#10

Hi Anton,

Thanks again for your help. It’s now failing on a step that worked before, so I am reviewing what I might have changed.

Thanks,

Alex