Loading data from s3 to Redshift after EmrEtlRunner

oruchovets · November 15, 2018, 6:41am

Hello. It is my first try with a snowplow.
I spent some time to get EmrEtlRunner working. I do see the results in the s3 bucker <bucket_name>/enriched/ archive (based on this post)

I am using the command for EmrEtlRunner and skip shred step (otherwise it keep on crashing)
./snowplow-emr-etl-runner -d --config config/config.yml --resolver iglu_resolver.json --skip shred

Question:
Can I load data from s3 enriched/archive to Redshift and not from shredding
It is not so clear what command/flag should I use to copy data from s3 to redshift (it is not clear from the documentation)
Configuring storage targets · snowplow/snowplow Wiki · GitHub

Thanks
Oleg.

mike · November 15, 2018, 8:03am

You can load archived data from a certain bucket but it needs to be shredded.

RDB loader is responsible for getting the shredded data from S3 to Redshift. It’s the rdb_load step of EMR ETL runner.

oruchovets · November 15, 2018, 8:32am

Ok.

in the previous post,

I had an issue to run EmrEtlRunner. Emr jobs are failed with

java.io.FileNotFoundException: File does not exist: hdfs:/local/snowplow/shredded-events
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517)

Shredding is failing with File does not exist: hdfs:/local/snowplow/shredded-event

    The Spark job step that is failing is the copy (using Amazon's S3DistCp utility) of shredded JSONs from your EMR cluster's HDFS file system back to Amazon S3, ready for loading into Redshift. Due to an unfortunate attribute of S3DistCp, it will fail if no files were output for shredding. Possible reasons for this:

You are not generating any custom contexts, nor unstructured events and have not enabled link click tracking. Solution: run EmrEtlRunner with --skip shred. Remove this --skip as/when you know that you do have JSONs to shred.

I used --skip shred but it looks I need to use shredding.

Here is my configuration:

;(function(p,l,o,w,i,n,g){if(!p[i]){p.GlobalSnowplowNamespace=p.GlobalSnowplowNamespace||[];
p.GlobalSnowplowNamespace.push(i);p[i]=function(){(p[i].q=p[i].q||[]).push(arguments)
};p[i].q=p[i].q||[];n=l.createElement(o);g=l.getElementsByTagName(o)[0];n.async=1;
n.src=w;g.parentNode.insertBefore(n,g)}}(window,document,"script","//d1fc8wv8zag5ca.cloudfront.net/2.9.2/sp.js","snowplow"));

window.snowplow('newTracker', 'cf', 'd2z6pco7ayls3z.cloudfront.net', { // Initialise a tracker - point to cloudfront that serves S3 bucket w/ pixel
  appId: 'web',
  cookieDomain: null,
  gaCookies: true
});

window.snowplow(‘enableActivityTracking’, 30, 10);
window.snowplow(‘enableLinkClickTracking’, null, true, true);
window.snowplow(‘trackPageView’);

What is the way to troubleshoot why there are no files for shredding?

Thanks
Oleg.

mike · November 15, 2018, 9:48pm

Have a look at your enriched data in S3. Are there contexts being attached in either the contexts or derived_contexts column? If you have gaCookies enabled in the tracker (and there is a GA cookie) you should have some, or certain enrichments (like user-agent-utils) will also generate their own context.

oruchovets · November 16, 2018, 5:28am

Hi @mike

here are the steps for my EmrEtlRunner
s-1VKPE4COF2TF9 Elasticity S3DistCp Step: Shredded S3 -> S3 Shredded Archive Completed 2018-11-16 00:11 (UTC+8) 1 minute
s-3CJFXC5M2XACH Elasticity S3DistCp Step: Enriched S3 -> S3 Enriched Archive Completed 2018-11-16 00:09 (UTC+8) 1 minute
s-YJA1ITPODNOE Elasticity S3DistCp Step: Raw S3 Staging -> S3 Archive Completed 2018-11-16 00:08 (UTC+8) 1 minute
s-15NIXQ4MPBEUG Elasticity S3DistCp Step: Shredded HDFS -> S3 Completed 2018-11-16 00:06 (UTC+8) 1 minute
s-NEUGVMSAKCR0 Elasticity Spark Step: Shred Enriched Events Completed 2018-11-15 23:37 (UTC+8) 28 minutes
s-J3SUND1NO7UV Elasticity S3DistCp Step: Enriched HDFS _SUCCESS -> S3 Completed 2018-11-15 23:36 (UTC+8) 1 minute
s-8HOWTZOUWYYD Elasticity S3DistCp Step: Enriched HDFS -> S3 Completed 2018-11-15 23:34 (UTC+8) 1 minute
s-H07FYDSTOAJO Elasticity Spark Step: Enrich Raw Events Completed 2018-11-15 23:09 (UTC+8) 25 minutes
s-13XSMKSLM5ZCY Elasticity S3DistCp Step: Raw S3 -> HDFS Completed 2018-11-15 23:07 (UTC+8) 1 minute
s-3S0BIC2C1II4D Elasticity Setup Hadoop Debugging Completed 2018-11-15 23:07 (UTC+8) 14 seconds

Question:
1) how to add rdb_load step. I created json configuration for Redshift and configured the RedShift on AWS. Can you please share the command line example of EMR ETL with rdb_load steps

Here is the command I am using but there is no rdb_load step.
2) There are 2 steps it took more than 20 minutes to execute and data volume was 5Kb :-). What is the reason for such a long time execution?

Thanks
Oleg.

ihor · November 16, 2018, 6:57pm

@oruchovets, what version of Snowplow pipeline are you running? There are differences between the versions. If in older versions you had to specify the target within the EmrEtlRunner configuration file itself in the later version the target configuration file is a separate entity and it has to be in the correct format according to the version used. That also means referencing the targets file in your CLI command with --targets option. Omitting the targets would indicate skipping the data load step: https://github.com/snowplow/snowplow/wiki/2-Using-EmrEtlRunner.

The enriched and shredded steps are expected to be the longest. You also need to bear in mind that EMR cluster processing is designed for large volumes of data and you might encounter a very small volume processed as long as the large volume. Having said that, ~25 minutes for 5KB still sounds too long. You might have misconfigured your cluster for that amount of data.

oruchovets · November 19, 2018, 5:00pm

Thanks, Ihor. it was helpful, using --target is working for me to populate data to Redshift. The only thing, I spent a couple of hours rerunning emr and keep on failing on missing table exception. I had to create a bunch of tables.
The list is here:
https://github.com/snowplow/iglu-central/tree/a65bd9574c3bb34f1699afda5a22c3e717df3f78/sql/com.snowplowanalytics.monitoring.batch

also, a problem with log running emr is solved. I choose a stronger machine for the cluster and not it takes instead of ~30 min around one minute.

Questions
how do I know what tables should I create? I understand it depends on the tracking purposes but still - is there documentation which tables should be used for which activity?

another question - to how to track properly enableLinkClickTracking (should I create a new post for this question?)

;(function(p,l,o,w,i,n,g){if(!p[i]){p.GlobalSnowplowNamespace=p.GlobalSnowplowNamespace||;
p.GlobalSnowplowNamespace.push(i);p[i]=function(){(p[i].q=p[i].q||).push(arguments)
};p[i].q=p[i].q||;n=l.createElement(o);g=l.getElementsByTagName(o)[0];n.async=1;
n.src=w;g.parentNode.insertBefore(n,g)}}(window,document,“script”,"//**********.cloudfront.net/2.9.2/sp.js",“snowplow”));

window.snowplow(‘newTracker’, ‘cf’, ‘*********.cloudfront.net’, { // Initialise a tracker - point to cloudfront that serves S3 bucket w/ pixel
appId: ‘web’,
cookieDomain: null,
gaCookies: true
});
window.snowplow(‘enableActivityTracking’, 30, 10);
window.snowplow(‘enableLinkClickTracking’, null, true, true);
window.snowplow(‘trackPageView’);

Redshift, all entries are page_ping. What is the way to configure the above tracker to track page view and link clicks?

Thanks
Oleg.

ihor · November 19, 2018, 5:58pm

@oruchovets,

how do I know what tables should I create? I understand it depends on the tracking purposes but still - is there documentation which tables should be used for which activity?

You are right. It is not well documented in the wiki. Although, the following Discourse post should help you out to some extend: Redshift tables mapping. You are welcome to contribute to this list when you find a missing table.

how to track properly enableLinkClickTracking

You should be able to find a few posts in this forum you could follow. The corresponding wiki page is here: 2 Specific event tracking with the Javascript tracker · snowplow/snowplow Wiki · GitHub.

Redshift, all entries are page_ping . What is the way to configure the above tracker to track page view and link clicks?

Could you swap link tracking with pageviews and see it that helps

window.snowplow('enableActivityTracking', 30, 10);
window.snowplow('trackPageView');
window.snowplow('enableLinkClickTracking', null, true, true);

I also noticed non-ASCII quote (‘ vs '). Could be just a copy/paste issue but might be the reason the events did not fire.

If you still have a problem do create a new post.

Topic		Replies	Views
Should I run rdb_load only? For engineers	7	989	February 11, 2020
EmrEtlRunner not loading data into RedShift For engineers	22	1865	November 11, 2019
RDB Loader, Storage Loader, EmrEtlRunner Storage targets	14	2052	October 22, 2019
EmrEtlRunner sink Shredded data into S3 bucket For engineers	0	589	November 11, 2019
Loading from S3's enriched events to PostgreSQL	4	1254	May 20, 2019

Loading data from s3 to Redshift after EmrEtlRunner

Shredding is failing with File does not exist: hdfs:/local/snowplow/shredded-event

Related Topics