Enriched file in enriched/good schema?


#1

Hi Snowplow,
I have a question regarding EmrEtlRunner. I successfully got the raw data files from the tracker enriched and parsed. I noticed that the raw files include a header line describing what each field is, like this:

#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id x-host-header cs-protocol cs-bytes time-taken x-forwarded-for ssl-protocol ssl-cipher x-edge-response-result-type cs-protocol-version fle-status fle-encrypted-fields

In the enriched version, there is no description of each field. I am using the r88 version and the generic iglu_resolver.json file. I am trying to work out what each field in a line is, and have looked at the http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0# and iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-1 schemas.

Some help would be excellent on this.

As always thanks,
Morris


#2

@morris206, https://github.com/snowplow/snowplow/blob/master/3-enrich/scala-common-enrich/src/main/scala/com.snowplowanalytics.snowplow.enrich/common/outputs/EnrichedEvent.scala#L41-L249. Is that what you are looking for?


#3

It looks close but not exact. Are there any others that might be it? I’m guessing some of the fields may be left out, so it’s not going to be an exact match, right?


#4

Indeed, quite a few fields are expected to be left out. I assume your enriched data is in TSV format. A missing value is indicated by nothing between two adjacent tabs.
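
To make the "nothing between adjacent tabs" point concrete, here is a minimal Python sketch that maps one enriched TSV line onto field names and turns empty columns into None. The field list is an assumption: it holds only what are believed to be the first few columns, and the full ordered list should be taken from the EnrichedEvent.scala file linked above. The sample line is made up for illustration.

```python
from typing import Dict, List, Optional

# Assumption: only the leading columns are listed, for illustration.
# Take the full, ordered list from EnrichedEvent.scala before relying on it.
LEADING_FIELDS: List[str] = [
    "app_id",
    "platform",
    "etl_tstamp",
    "collector_tstamp",
    "dvce_created_tstamp",
    "event",
    "event_id",
    "txn_id",
]


def parse_enriched_line(line: str, field_names: List[str]) -> Dict[str, Optional[str]]:
    """Map one tab-separated enriched event onto the given field names.

    An empty string between two tabs is a missing value and becomes None.
    Columns beyond len(field_names) are ignored.
    """
    values = line.rstrip("\n").split("\t")
    return {
        name: (value if value != "" else None)
        for name, value in zip(field_names, values)
    }


# Made-up sample: dvce_created_tstamp and txn_id are left empty.
sample = (
    "myapp\tweb\t2017-06-01 10:00:00.000\t2017-06-01 09:59:58.000"
    "\t\tpage_view\tabc-123\t\n"
)
event = parse_enriched_line(sample, LEADING_FIELDS)
print(event["app_id"])               # myapp
print(event["dvce_created_tstamp"])  # None
```

Splitting on the tab character by hand (rather than using a CSV reader with quoting) keeps empty columns in place, so positions stay aligned with the field list.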


#5

It is a TSV. Thank you so much for the help.

Thanks,
Morris


#6

@morris206 You might find the DDL in this post helpful as well:


#7

Another question: how do I properly set the target? I want my target to be a separate S3 bucket, from which I plan to get the data into my data warehouse. For example, here is my command for running emr-etl-runner: ./r88-emr-etl-runner --c configr88.yml --r iglu_resolver.json, and I’m trying to add something like --t s://target-snowplow. Do I need to point it to a file, since obviously setting the target explicitly this way is not going to work? Is it one of the JSON schemas in 4-storage? I guess I’m wondering whether I should make my own JSON file to do this.

Thank you,
Morris


#8

@morris206, if you do not intend to load the data into any other target (Redshift, Snowflake, Postgres, Elasticsearch), then you do not need to use the -t option at all. The last step in the ETL is archiving the data to S3. You might need to skip the data load step, though. Also, do you need the files/events in S3 raw, enriched and/or shredded?

Here’s the dataflow diagram explaining how ETL works (for R88 you are using): https://github.com/snowplow/snowplow/wiki/Batch-pipeline-steps-r87


#9

Without the --t option, I cannot run the runner more than once. Because there are files left in the enriched and shredded folders etc., it will not let me run it again. This is really my only issue. If I can get past this and get it to run again, I can take it from there.


#10

@morris206, you can use the --skip (-x) option to skip the loading step: https://github.com/snowplow/snowplow/wiki/2-Using-EmrEtlRunner#21-run-command. You can also skip the shredding step if you need only enriched events in S3 (which speeds up the ETL). Do note that you are using a very old version, too.
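
As a rough illustration of how the --skip value is put together, the Python sketch below joins step names into the single comma-separated argument and rejects names it does not know. The set of step names is an assumption: it is copied from the help text quoted in the next post, and the names vary by release (this thread notes the load step is called load in r88 and rdb_load later), so check it against the help output of your own version. The helper name build_skip_args is hypothetical.

```python
from typing import Iterable, List

# Assumption: step names copied from the --skip help text quoted in this thread.
# They differ between releases, so verify against your own EmrEtlRunner version.
KNOWN_STEPS = {
    "staging", "enrich", "shred", "elasticsearch", "archive_raw", "rdb_load",
    "consistency_check", "analyze", "load_manifest_check", "archive_enriched",
    "archive_shredded", "staging_stream_enrich",
}


def build_skip_args(steps: Iterable[str]) -> List[str]:
    """Return ['--skip', 'a,b,c'] for the given steps, or [] if none are skipped.

    Raises ValueError for a step name that is not in KNOWN_STEPS, which
    catches typos before a cluster is launched.
    """
    steps = list(steps)
    unknown = [step for step in steps if step not in KNOWN_STEPS]
    if unknown:
        raise ValueError(f"unknown step(s): {', '.join(unknown)}")
    if not steps:
        return []
    return ["--skip", ",".join(steps)]


# Skipping only the load step, as suggested above for a pipeline without a target.
print(build_skip_args(["rdb_load"]))  # ['--skip', 'rdb_load']
```

The returned list can be appended to whatever argument list is used to launch the runner, for example from a scheduler script.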


#11

Thanks for the help. The r88 version is working fine; is there a reason why I should upgrade? This is the help text for the skip option:

-x {staging,enrich,shred,elasticsearch,archive_raw,rdb_load,consistency_check,analyze,load_manifest_check,archive_enriched,archive_shredded,staging_stream_enrich}, --skip skip the specified step(s)

I don’t see an option for target unless it’s named something else.


#12

What I really want is to send the output to an S3 bucket, not skip it. Do you know of a way to do this? The reason is that I want to keep it as batches rather than real-time; this is how my company and I decided we wanted to do it. Ideally I would keep it in the AWS pipeline this way, and write a custom script to get the data from the target bucket into BigQuery. Thank you.
Morris


#13

@morris206, you are missing the point here. You do not need targets at all if you do not load the data (step load for r88 and rdb_load for later versions) into Redshift (or other data stores as per my comments earlier). Having the data in S3 is achieved by archiving it (see step 12 of the diagram, archive_enriched). In any case, if the data is not archived, you cannot run EmrEtlRunner again.


#14

Okay, well, I don’t understand. Isn’t the data archived automatically, as set in the config.yml file? If not, how can I get it to archive automatically?


#15

I upgraded to the r90 version because it gives the option of skipping rdb_load, but it’s still not letting me run. Could you be a bit more explicit about what you’re trying to say?
What part am I missing where the data doesn’t automatically get archived?


#16

@morris206, you cannot run EmrEtlRunner if the files have not been archived. You need to follow the dataflow diagram to recover the pipeline correctly (see instructions under the diagram - for R90 the diagram is here). Once you have recovered the pipeline you need to use the appropriate skip options to keep the batch pipeline running.

What is your current pipeline architecture and what are you trying to achieve?


#17

Right now, I have the runner working properly. I just upgraded to r90 to be able to use the --skip rdb_load option. From here, I want to schedule the runner to run every 2 hours, because we want the final product to be a batch process, not a real-time process. So my only concern at this point is getting it to run. Obviously the data is not being archived properly, because I cannot run the emr-etl-runner more than once. This is my main concern. I am eventually going to write a script to get the data into our Google environment. That’s all. So once again, I just want to be able to run the runner more than once. I get that the data needs to be archived. How can I achieve this? Here is my config file:

```
aws:
  # Credentials can be hardcoded or set in environment variables
  access_key_id: <%= ENV['AWS_ACCESS_KEY'] %>
  secret_access_key: <%= ENV['AWS_SECRET_KEY'] %>
  s3:
    region: us-west-2
    buckets:
      assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
      jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
      log: s3://xxx/logs
      raw:
        in:                  # This is a YAML array of one or more in buckets - you MUST use hyphens before each entry in the array, as below
          - s3://xxx        # e.g. s3://my-new-collector-bucket
        processing: s3://xxx/processing_data
        archive: s3://xxx/archive_data    # e.g. s3://my-archive-bucket/raw
      enriched:
        good: s3://xxx/enriched/good       # e.g. s3://my-out-bucket/enriched/good
        bad: s3://xxx/enriched/bad        # e.g. s3://my-out-bucket/enriched/bad
        errors: s3://xxx/enriched/errors     # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://xxx/enriched/archive    # Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched
      shredded:
        good: s3://xxx/shredded/good       # e.g. s3://my-out-bucket/shredded/good
        bad: s3://xxx/shredded/bad        # e.g. s3://my-out-bucket/shredded/bad
        errors: s3://xxx/shredded/errors     # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://sxxx/shredded/archive    # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded
  emr:
    ami_version: 5.5.0
    region: us-west-2        # Always set this
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole     # Created using $ aws emr create-default-roles
    placement:      # Set this if not running in VPC. Leave blank otherwise
    ec2_subnet_id: subnet-xxx  # Set this if running in VPC. Leave blank otherwise
    ec2_key_name: xxx_track
    bootstrap: []           # Set this to specify custom bootstrap actions. Leave empty otherwise
    software:
      hbase:                # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
      lingual:              # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
    # Adjust your Hadoop cluster below
    jobflow:
      job_name: Snowplow ETL # Give your job a name
      master_instance_type: m1.medium
      core_instance_count: 2
      core_instance_type: m1.medium
      core_instance_ebs:    # Optional. Attach an EBS volume to each core instance.
        volume_size: 100    # Gigabytes
        volume_type: "gp2"
        volume_iops: 400    # Optional. Will only be used if volume_type is "io1"
        ebs_optimized: false # Optional. Will default to true
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.medium
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    configuration:
      yarn-site:
        yarn.resourcemanager.am.max-attempts: "1"
      spark:
        maximizeResourceAllocation: "true"
    additional_info:        # Optional JSON string for selecting additional features
collectors:
  format: cloudfront # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
enrich:
  versions:
    spark_enrich: 1.9.0 # Version of the Spark Enrichment process
  continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  versions:
    rdb_loader: 0.12.0
    rdb_shredder: 0.12.0        # Version of the Spark Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
  snowplow:
    method: get
    app_id: atwork  # e.g. snowplow
    collector: xxx.cloudfront.net
```

#18

@morris206, have you followed the instructions on the dataflow wiki? Have you recovered your pipeline?

As long as any of the processing, enriched/good or shredded/good buckets is not empty, you cannot run EmrEtlRunner. Depending on where you have files present, use the appropriate recovery step (or even archive/move the files manually). Once they are empty, you need to ensure your CLI command is correct for your scenario. If you do not intend to use Redshift, the skip options for your EmrEtlRunner would be --skip shred,rdb_load,archive_shredded (after you have recovered the pipeline from the current state). Please do review the wiki I pointed to in order to understand which steps to skip.
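
The emptiness condition above can be checked before each scheduled run. The following Python sketch is one way to do it, under some assumptions: the helper names are hypothetical, the client passed in is expected to behave like a boto3 S3 client (only its list_objects_v2 call is used), and any object under a prefix, including a zero-byte folder placeholder, is treated as "not empty". How EmrEtlRunner itself performs its check is not covered here.

```python
from typing import Dict, List, Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix/')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URI: {uri!r}")
    prefix = parsed.path.lstrip("/")
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    return parsed.netloc, prefix


def non_empty_locations(s3_client, locations: Dict[str, str]) -> List[str]:
    """Return the names of the locations that still hold at least one object.

    s3_client is assumed to expose list_objects_v2 like a boto3 S3 client.
    Note: a folder placeholder object under the prefix also counts as content.
    """
    blocked = []
    for name, uri in locations.items():
        bucket, prefix = split_s3_uri(uri)
        # MaxKeys=1 is enough: we only need to know whether anything exists.
        response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
        if response.get("KeyCount", 0) > 0:
            blocked.append(name)
    return blocked


# Usage sketch with the paths from the config above (requires boto3 and credentials):
#
#   import boto3
#   blocked = non_empty_locations(boto3.client("s3"), {
#       "processing": "s3://xxx/processing_data",
#       "enriched/good": "s3://xxx/enriched/good",
#       "shredded/good": "s3://xxx/shredded/good",
#   })
#   if blocked:
#       print("Recover the pipeline first; files remain in:", ", ".join(blocked))
```

Passing the client in as an argument keeps the check testable without AWS access and leaves credential handling to the caller.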