Storage to Data Modeling to Analytics

Hi everyone!
Can anyone help us out?

Basically, I have already set up the trackers > collectors > enricher > storage.
(I'm just not sure whether I set it up correctly.)

Our current setup uses Kinesis for the collector and S3 for storage.

The first problem I encountered is that the “Storage” component doesn’t process anything (I believe I set it up correctly and it is running, but nothing gets processed), so as a workaround I handled that part on AWS with Firehose.

Now we have data in S3.
I'm not sure whether this is the correct format either, but this is what some of the files look like:

```
Your test site, web 2020-12-17 01:29:10.899 2020-12-17 01:29:03.855 2020-12-17 01:29:01.116 page_ping 270a93ae-b622-442e-8e05-ae544236537b sp js-2.16.3 ssc-2.1.0-kinesis stream-enrich-1.4.2-common-1.4.2 124.22.31.62 097a8a8e-3e67-4270-9391-3b115a2773db 8 0b5a7251-e0fb-46d2-8fe9-fbe9c654130d https://yoursite.test/product/glock-26/ Glock 26 For Sale, Reviews, Price - $587.82 - In Stock http://yoursite.test/ https yoursite.test 443 /product/glock-26/ http yoursite.test 80 / {"schema":"iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-0","data":[{"schema":"iglu:com.google.analytics.measurement-protocol/user/jsonschema/1-0-0","data":{"userId":""}},{"schema":"iglu:com.snowplowanalytics.snowplow/web_page/jsonschema/1-0-0","data":{"id":"b578da3a-109b-4843-8b2b-34854c2463bf"}}]} 0 0 4965 6265 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 en-US 1 0 0 0 0 0 0 0 0 1 24 1536 698 Asia/Shanghai 1366 768 UTF-8 1519 6962 2020-12-17 01:29:01.123 84ad3654-d152-43c8-9719-3690ac85e11e 2020-12-17 01:29:03.848 com.snowplowanalytics.snowplow page_ping jsonschema 1-0-0
```

How do we take this through to data modeling?

@Aron_Quiray, it does look like enriched data to me. Each enriched record should be a TSV line conforming to this structure.
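If you want a quick sanity check, the Snowplow Python Analytics SDK can parse such a line into JSON. The sketch below is only an illustration (the file name is a placeholder, and it assumes one enriched event per line):

```python
# Quick sanity check on a downloaded S3 object containing enriched events.
# pip install snowplow-analytics-sdk
from snowplow_analytics_sdk.event_transformer import transform
from snowplow_analytics_sdk.snowplow_event_transformation_exception import (
    SnowplowEventTransformationException,
)

with open("enriched-events.tsv") as f:  # placeholder file name
    for line in f:
        try:
            # transform() turns one enriched TSV line into a dict keyed by
            # the canonical field names (app_id, event_name, collector_tstamp, ...)
            event = transform(line.rstrip("\n"))
            print(event["event_name"], event["collector_tstamp"])
        except SnowplowEventTransformationException as e:
            for message in e.error_messages:
                print("Not a valid enriched event:", message)
```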

To load that data into Redshift (another possibility is Snowflake DB), you need to run a batch job that consumes the data from the S3 bucket containing your enriched data, using EmrEtlRunner in Stream Enrich mode.
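Roughly speaking, that batch job is a scheduled run of EmrEtlRunner along these lines (just a sketch; the file names are placeholders and the exact flags depend on your release):

```
./snowplow-emr-etl-runner run \
  --config config.yml \
  --resolver iglu_resolver.json
```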

Thanks for the reply, @ihor.
Let me rephrase my question, as I overlooked one detail: the component that wasn't firing was the “Storage” jar, so the alternative route we took was Firehose (to store the data in S3).

The data I provided above is from S3.
What steps do we need to take in order to load it into a database (or something similar) so we can start modeling it?

Sorry for all the questions; I'm really new to Snowplow.

@Aron_Quiray, I'm not sure I follow you. Here's the typical architecture you would build to get your data into a data store other than S3. In your case, you enrich data in real time, so you would follow the 2nd picture.

The post is a bit outdated, but the same idea still applies. I'm not sure Firehose fits here, as the files are expected to be in a certain format. The S3 Loader does that job for you: it prepares the files for batch processing with EmrEtlRunner (to load the data into Redshift), unless you want to analyze the data by some other means (in S3), for example with Athena.
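For reference, the S3 Loader is driven by a small HOCON file roughly along these lines. Treat this purely as a sketch from memory of the 0.6.x example config; all stream, bucket and region names are placeholders, so base your real file on the example shipped with your release:

```
source = "kinesis"               # read enriched records from Kinesis
sink   = "kinesis"               # failed records are written to a bad stream

aws {
  accessKey = "iam"
  secretKey = "iam"
}

kinesis {
  initialPosition = "TRIM_HORIZON"
  maxRecords      = 500
  region          = "us-east-1"
  appName         = "snowplow-s3-loader"     # also used for checkpointing
}

streams {
  inStreamName  = "enriched-good"            # the Stream Enrich output stream
  outStreamName = "s3-loader-bad"
  buffer {
    byteLimit   = 1048576
    recordLimit = 500
    timeLimit   = 60000
  }
}

s3 {
  region     = "us-east-1"
  bucket     = "your-enriched-archive"
  format     = "gzip"                        # "lzo" is the other option
  maxTimeout = 300000
}
```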

Thanks for the help again, @ihor.
I'm slowly starting to understand it. At this point I believe we're at the “Kinesis S3” step.

If I'm right, what are the next steps to make the data more usable? We're planning to put it into Snowflake DB; how do we import it there?

@Aron_Quiray, you would use the Snowplow Snowflake Loader for that purpose.

You should be aware, however, that the enriched data (files) have to be placed in folders like “run=2020-12-01-16-30-50” (run=YYYY-MM-DD-hh-mm-ss) for the Loader to process them.
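In other words, the bucket the Loader reads from should end up looking something like this (bucket and file names are placeholders):

```
s3://your-enriched-archive/
├── run=2020-12-17-01-30-00/
│   ├── part-00000.gz
│   └── part-00001.gz
└── run=2020-12-17-02-00-00/
    └── part-00000.gz
```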

@ihor
Can you give me an example of the config.yml for your storage (just leave out the credentials and change the bucket names)? I can't seem to make it work :frowning: Please…

@Aron_Quiray, did you mean config.json? The config.yml is used with EmrEtlRunner, but the Snowflake Loader takes a JSON configuration file. You can find an example of the configuration file here.
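To give a rough idea of its shape, it is a self-describing JSON along these lines. This is only a sketch: the schema version and exact fields depend on your Loader release, and every value below is a placeholder, so follow the linked example rather than this one:

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow.storage/snowflake_config/jsonschema/1-0-2",
  "data": {
    "name": "Snowflake storage target",
    "awsRegion": "us-east-1",
    "auth": {
      "accessKeyId": "xxx",
      "secretAccessKey": "xxx"
    },
    "manifest": "snowflake-event-manifest",
    "snowflakeRegion": "us-west-2",
    "database": "snowplow",
    "input": "s3://your-enriched-archive/",
    "stage": "snowplow_stage",
    "stageUrl": "s3://your-snowflake-transformed/",
    "warehouse": "snowplow_wh",
    "schema": "atomic",
    "account": "youraccount",
    "username": "snowplow_loader",
    "password": "xxx",
    "purpose": "ENRICHED_EVENTS"
  }
}
```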

Your (erroneous) example, however, indicates that you did try to run EmrEtlRunner. This contradicts your statement “we’re planning to put it into snowflake db”. You do not need to run EmrEtlRunner to load the enriched data into Snowflake DB. You would run EmrEtlRunner to shred the data, which is required to load the data into Redshift, as shown in this doc. Note that workflow is for the older version of RDB Loader; the latest releases produce TSV shredded files (as opposed to JSON), as explained here.

Are you trying to load to Redshift first before switching to Snowflake DB?

Sorry @ihor, all good now.