Events after enrichment ending in bad bucket


#1

Below is the command used to run the enrichment step

./snowplow-emr-etl-runner run --config snowplow/4-storage/config/emretlrunner.yml --resolver snowplow/3-enrich/config/iglu_resolver.json --enrichments snowplow/3-enrich/config/enrichments/

Below is my config.yml file

aws:
  # Credentials can be hardcoded or set in environment variables
  access_key_id: "xxxxxx"
  secret_access_key: "xxxxxxx"
  s3:
    region: us-east-1
    buckets:
      assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
      jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
      log: s3://dataobjecteventsstorage/logs
      raw:
        in:                  # This is a YAML array of one or more in buckets - you MUST use hyphens before each entry in the array, as below
          - s3://dataobjecteventsstorage/      # e.g. s3://my-old-collector-bucket
        processing: s3://dataobjecteventsstorage/raw/processing
        archive: s3://dataobjecteventsstorage/raw/archive   # e.g. s3://my-archive-bucket/raw
      enriched:
        good: s3://dataobjecteventsstorage/enriched/good        # e.g. s3://my-out-bucket/enriched/good
        bad: s3://dataobjecteventsstorage/enriched/bad       # e.g. s3://my-out-bucket/enriched/bad
        errors: s3://dataobjecteventsstorage/enriched/errors     # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://dataobjecteventsstorage/enriched/archive    # Where to archive enriched events to, e.g. s3://my-archive-bucket/enriched
      shredded:
        good: s3://dataobjecteventsstorage/shredded/good        # e.g. s3://my-out-bucket/shredded/good
        bad: s3://dataobjecteventsstorage/shredded/bad        # e.g. s3://my-out-bucket/shredded/bad
        errors: s3://dataobjecteventsstorage/shredded/errors     # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://dataobjecteventsstorage/shredded/archive     # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded
  emr:
    ami_version: 5.9.0
    region: us-east-1       # Always set this
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole     # Created using $ aws emr create-default-roles
    placement: us-east-1c      # Set this if not running in VPC. Leave blank otherwise
    ec2_subnet_id:  # Set this if running in VPC. Leave blank otherwise
    ec2_key_name: Snowplowkeypair
    bootstrap: []           # Set this to specify custom bootstrap actions. Leave empty otherwise
    software:
      hbase:              # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
      lingual:              # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
    # Adjust your Hadoop cluster below
    jobflow:
      job_name: Snowplow Unilog # Give your job a name
      master_instance_type: m1.medium
      core_instance_count: 2
      core_instance_type: m1.medium
      core_instance_ebs:    # Optional. Attach an EBS volume to each core instance.
        volume_size: 100    # Gigabytes
        volume_type: "gp2"
        volume_iops: 400    # Optional. Will only be used if volume_type is "io1"
        ebs_optimized: false # Optional. Will default to true
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m2.medium
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    configuration:
      yarn-site:
        yarn.resourcemanager.am.max-attempts: "1"
      spark:
        maximizeResourceAllocation: "true"
    additional_info:        # Optional JSON string for selecting additional features
collectors:
  format: thrift # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
enrich:
  versions:
    spark_enrich: 1.10.0 # Version of the Spark Enrichment process
  continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  versions:
    rdb_loader: 0.14.0
    rdb_shredder: 0.13.0        # Version of the Spark Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
  #snowplow:
    #method: get
    #app_id: unilog # e.g. snowplow
    #collector: 172.31.38.39:8082 # e.g. d3rkrsqld9gmqf.cloudfront.net

Below are the messages I got in the enriched bad bucket folder:

{"line":"CwBkAAAADDE3Mi4zMS4zOC4zOQoAyAAAAWAHUMapCwDSAAAABVVURi04CwDcAAAAEXNzYy0wLjkuMC1raW5lc2lzCwEsAAAAck1vemlsbGEvNS4wIChXaW5kb3dzIE5UIDEwLjA7IFdpbjY0OyB4NjQpIEFwcGxlV2ViS2l0LzUzNy4zNiAoS0hUTUwsIGxpa2UgR2Vja28pIENocm9tZS82Mi4wLjMyMDIuOTQgU2FmYXJpLzUzNy4zNgsBQAAAABgvc25vd3Bsb3duZXcvc2FtcGxlLmh0bWwLAUoAAAAADwFeCwAAAAgAAAAUSG9zdDogbG9jYWxob3N0OjgwODAAAAAWQ29ubmVjdGlvbjoga2VlcC1hbGl2ZQAAAH5Vc2VyLUFnZW50OiBNb3ppbGxhLzUuMCAoV2luZG93cyBOVCAxMC4wOyBXaW42NDsgeDY0KSBBcHBsZVdlYktpdC81MzcuMzYgKEtIVE1MLCBsaWtlIEdlY2tvKSBDaHJvbWUvNjIuMC4zMjAyLjk0IFNhZmFyaS81MzcuMzYAAAAcVXBncmFkZS1JbnNlY3VyZS1SZXF1ZXN0czogMQAAAGJBY2NlcHQ6IHRleHQvaHRtbCwgYXBwbGljYXRpb24veGh0bWwreG1sLCBhcHBsaWNhdGlvbi94bWw7cT0wLjksIGltYWdlL3dlYnAsIGltYWdlL2FwbmcsICovKjtxPTAuOAAAACJBY2NlcHQtRW5jb2Rpbmc6IGd6aXAsIGRlZmxhdGUsIGJyAAAAIEFjY2VwdC1MYW5ndWFnZTogZW4tVVMsIGVuO3E9MC45AAABNUNvb2tpZTogcnhWaXNpdG9yPTE0OTU4NjQ4Mjk2NzhJMDgzSDFNTTNVVklQUVJFSVFTTkRHNVYxRlM2MzQ0VjsgbG9naW5NZXNzYWdlPWxvZ291dDsgRjQ3RjRBMEYzMEZCN0E3NT1zYW5kZXNoLnBAdW5pbG9nY29ycC5jb207IDc1MEIyRTAzMzNBMjhDMUQ9dGVzdDEyMzQ7IEYzMEZCMzNBMj10cnVlOyBhZnRlckxvZ2luVXJsPTsgX3NwX2lkLjFmZmY9YTMzMTA5MDMtMTA5NC00Y2I4LWExNzktZTIwOWNiNDE5OGY5LjE1MDAzODY2OTIuMjYuMTUxMTg3Nzk5Ni4xNTExODYxOTYxLjFhZWM0ZTVlLTYzYjYtNGIwYy04MWNiLTYzODIzZTIyMDBlZgsBkAAAAAlsb2NhbGhvc3QLAZoAAAAkOWJjZjk5ZDYtMzliMy00NDZjLTg5NDUtZTVjNWZkYTZhM2ExC3ppAAAAQWlnbHU6Y29tLnNub3dwbG93YW5hbHl0aWNzLnNub3dwbG93L0NvbGxlY3RvclBheWxvYWQvdGhyaWZ0LzEtMC0wAA==","errors":[{"level":"error","message":"Payload with vendor snowplownew and version sample.html not supported by this version of Scala Common Enrich"}],"failure_tstamp":"2017-11-29T11:09:19.729Z"}
{"line":"CwBkAAAADDE3Mi4zMS4zOC4zOQoAyAAAAWAHUOsYCwDSAAAABVVURi04CwDcAAAAEXNzYy0wLjkuMC1raW5lc2lzCwEsAAAAck1vemlsbGEvNS4wIChXaW5kb3dzIE5UIDEwLjA7IFdpbjY0OyB4NjQpIEFwcGxlV2ViS2l0LzUzNy4zNiAoS0hUTUwsIGxpa2UgR2Vja28pIENocm9tZS82Mi4wLjMyMDIuOTQgU2FmYXJpLzUzNy4zNgsBQAAAABgvc25vd3Bsb3duZXcvc2FtcGxlLmh0bWwLAUoAAAAADwFeCwAAAAkAAAAUSG9zdDogbG9jYWxob3N0OjgwODAAAAAWQ29ubmVjdGlvbjoga2VlcC1hbGl2ZQAAABhDYWNoZS1Db250cm9sOiBtYXgtYWdlPTAAAAB+VXNlci1BZ2VudDogTW96aWxsYS81LjAgKFdpbmRvd3MgTlQgMTAuMDsgV2luNjQ7IHg2NCkgQXBwbGVXZWJLaXQvNTM3LjM2IChLSFRNTCwgbGlrZSBHZWNrbykgQ2hyb21lLzYyLjAuMzIwMi45NCBTYWZhcmkvNTM3LjM2AAAAHFVwZ3JhZGUtSW5zZWN1cmUtUmVxdWVzdHM6IDEAAABiQWNjZXB0OiB0ZXh0L2h0bWwsIGFwcGxpY2F0aW9uL3hodG1sK3htbCwgYXBwbGljYXRpb24veG1sO3E9MC45LCBpbWFnZS93ZWJwLCBpbWFnZS9hcG5nLCAqLyo7cT0wLjgAAAAiQWNjZXB0LUVuY29kaW5nOiBnemlwLCBkZWZsYXRlLCBicgAAACBBY2NlcHQtTGFuZ3VhZ2U6IGVuLVVTLCBlbjtxPTAuOQAAATVDb29raWU6IHJ4VmlzaXRvcj0xNDk1ODY0ODI5Njc4STA4M0gxTU0zVVZJUFFSRUlRU05ERzVWMUZTNjM0NFY7IGxvZ2luTWVzc2FnZT1sb2dvdXQ7IEY0N0Y0QTBGMzBGQjdBNzU9c2FuZGVzaC5wQHVuaWxvZ2NvcnAuY29tOyA3NTBCMkUwMzMzQTI4QzFEPXRlc3QxMjM0OyBGMzBGQjMzQTI9dHJ1ZTsgYWZ0ZXJMb2dpblVybD07IF9zcF9pZC4xZmZmPWEzMzEwOTAzLTEwOTQtNGNiOC1hMTc5LWUyMDljYjQxOThmOS4xNTAwMzg2NjkyLjI2LjE1MTE4Nzc5OTYuMTUxMTg2MTk2MS4xYWVjNGU1ZS02M2I2LTRiMGMtODFjYi02MzgyM2UyMjAwZWYLAZAAAAAJbG9jYWxob3N0CwGaAAAAJGVmMWYxZmM4LTcwN2UtNDE3MS05YTMyLTMwYTY2MDY2MDQ2NQt6aQAAAEFpZ2x1OmNvbS5zbm93cGxvd2FuYWx5dGljcy5zbm93cGxvdy9Db2xsZWN0b3JQYXlsb2FkL3RocmlmdC8xLTAtMAA=","errors":[{"level":"error","message":"Payload with vendor snowplownew and version sample.html not supported by this version of Scala Common Enrich"}],"failure_tstamp":"2017-11-29T11:09:24.308Z"}

Please suggest what changes I need to make.
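
For anyone reading bad rows like these: the `line` field is the raw collector payload, Base64-encoded. Below is a minimal Node.js sketch that decodes it and lists the readable strings, which is a quick way to see the request path and headers the collector received without a full Thrift decoder. The name `printableStrings` is hypothetical and not part of any Snowplow tool.

```javascript
// Hypothetical helper, not part of any Snowplow tool.
// Decodes a Base64 bad-row "line" value and returns every run of
// printable ASCII characters that is at least minLen characters long.
function printableStrings(base64Line, minLen = 4) {
  const bytes = Buffer.from(base64Line, "base64");
  const found = [];
  let run = "";
  for (const byte of bytes) {
    if (byte >= 0x20 && byte <= 0x7e) {
      // Printable ASCII: extend the current run.
      run += String.fromCharCode(byte);
    } else {
      // Binary byte (Thrift framing, length prefix, ...): close the run.
      if (run.length >= minLen) found.push(run);
      run = "";
    }
  }
  if (run.length >= minLen) found.push(run);
  return found;
}

// Usage, assuming oneLine holds one JSON line from the bad bucket:
// const badRow = JSON.parse(oneLine);
// console.log(printableStrings(badRow.line));
```

Run against the rows above, the output should include the request path `/snowplownew/sample.html` that the error message refers to, alongside the request headers.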


#2

Hi,

You are sending events to the /snowplownew/sample.html endpoint. Send them to /i or /tp2 (for POST) instead. As far as I can see in the records, you are using direct hooks instead of a standard tracker. Please follow the protocol and everything should work.
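
To make the error message concrete: the enrichment step appears to read the request path as `/<vendor>/<version>`, which is why `/snowplownew/sample.html` is reported as vendor `snowplownew` and version `sample.html`. The sketch below only illustrates that reading. `splitCollectorPath`, `isSupported` and the `SUPPORTED` list are hypothetical names, and the list covers only the two tracker endpoints mentioned in this thread, not every adapter in Scala Common Enrich.

```javascript
// Illustration only: hypothetical names, not Snowplow's implementation.
const SNOWPLOW_VENDOR = "com.snowplowanalytics.snowplow";

// The two tracker endpoints discussed in this thread:
// /i (GET) and /com.snowplowanalytics.snowplow/tp2 (POST).
const SUPPORTED = new Set([
  `${SNOWPLOW_VENDOR}/tp1`,
  `${SNOWPLOW_VENDOR}/tp2`,
]);

function splitCollectorPath(path) {
  // Assumption: the short /i path stands for Snowplow's own tracker protocol v1.
  if (path === "/i") return { vendor: SNOWPLOW_VENDOR, version: "tp1" };
  const parts = path.split("/").filter((part) => part.length > 0);
  if (parts.length !== 2) return null; // not of the form /<vendor>/<version>
  return { vendor: parts[0], version: parts[1] };
}

function isSupported(path) {
  const api = splitCollectorPath(path);
  return api !== null && SUPPORTED.has(`${api.vendor}/${api.version}`);
}

console.log(splitCollectorPath("/snowplownew/sample.html"));
console.log(isSupported("/snowplownew/sample.html"));            // false
console.log(isSupported("/com.snowplowanalytics.snowplow/tp2")); // true
```

The first call returns vendor `snowplownew` and version `sample.html`, the same pair quoted in the bad row error.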


#3

Hey @grzegorzewald thanks for the reply.

Below is my sample.html page with the snowplow tracker script integrated inside the html page.

 <!DOCTYPE html>
<!-- Template by html.am -->
<html>
<head>
		<!-- Snowplow starts plowing -->
	<script type="text/javascript">
	 ;(function(p,l,o,w,i,n,g){if(!p[i]){p.GlobalSnowplowNamespace=p.GlobalSnowplowNamespace||[];
	p.GlobalSnowplowNamespace.push(i);p[i]=function(){(p[i].q=p[i].q||[]).push(arguments)
	};p[i].q=p[i].q||[];n=l.createElement(o);g=l.getElementsByTagName(o)[0];n.async=1;
	n.src=w;g.parentNode.insertBefore(n,g)}}(window,document,"script","http://d1fc8wv8zag5ca.cloudfront.net/2.8.0/sp.js","snowplow_tracker"));

	//alert('ok');
	
	  // callbacks
	  snowplow_tracker(function () {
		console.log("sp.js has loaded");
	  });
	  snowplow_tracker(function (x) {
		console.log(x);
	  }, "sp.js has loaded");

		
		// Configure a tracker instance named "unilogsnowplow"
	  snowplow_tracker('newTracker', 'unilogsnowplow', 'localhost:8080', {
		  appId: '21',
		  platform: 'web',
		  encodeBase64: false, // Default is true
		  cookieDomain: 'com.unilog.analytics1'
	  });

	  // Access the tracker instance inside a callback
	  snowplow_tracker(function () {
		  var cf = this.unilogsnowplow; // trackers are keyed by the namespace passed to 'newTracker'
		  var userFingerprint = cf.getUserFingerprint();
		   var domainUserId = cf.getDomainUserId();
		   var domainUserInfo = cf.getDomainUserInfo();
		   var userId = cf.getUserId();
		   var cookieName = cf.getCookieName('id'); // cookie: 'id' for the domain cookie, 'ses' for the session cookie.
		   var pageViewId = cf.getPageViewId();
		  console.debug(userId);
		  console.debug(domainUserInfo);
		  console.debug(userFingerprint);
		  console.debug(domainUserId);
		 // console.log(userId);
		 // console.log(domainUserInfo);
		 // console.log(userFingerprint);
		 // console.log(domainUserId);
	  })

	   snowplow_tracker('enableActivityTracking', 30, 10);
	  snowplow_tracker('enableLinkClickTracking');
	  snowplow_tracker('trackPageView');
	</script>
	<!-- Snowplow stops plowing -->
	<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
	<title>Fixed Width 2 Blue</title>
	<style type="text/css">
		html, #page { padding:0; margin:0;}
		body { margin:0; padding:0; width:100%; color:#959595; font:normal 12px/2.0em Sans-Serif;} 
		h1, h2, h3, h4, h5, h6 {color:darkblue;}
		#page { background:#eee;}
		#header, #footer, #top-nav, #content, #content #contentbar, #content #sidebar { margin:0; padding:0;}
					
		/* Logo */
		#logo { padding:10px; width:auto; float:left;}
		#logo h1 a, h1 a:hover { color:darkblue; text-decoration:none;}
		#logo h1 span { color:#d3d3f9;}

		/* Header */
		#header { background:#eee; }
		#header-inner { margin:0 auto; padding:10px; width:970px;background:#fff;}
		
		/* Feature */
		.feature { background:#eee;padding:0;}
		.feature-inner { margin:auto;padding:10px;width:970px;background:blue; }
		.feature-inner h1 {color:#d3d3f9;font-size:32px;}
		
		/* Menu */
		#top-nav { margin:0 auto; padding:0px 0 0; height:37px; float:right;}
		#top-nav ul { list-style:none; padding:0; height:37px; float:left;}
		#top-nav ul li { margin:0; padding:0 0 0 8px; float:left;}
		#top-nav ul li a { display:block; margin:0; padding:8px 20px; color:blue; text-decoration:none;}
		#top-nav ul li.active a, #top-nav ul li a:hover { color:#d3d3f9;}
		
		/* Content */
		#content-inner { margin:0 auto; padding:10px; width:970px;background:#fff;}
		#content #contentbar { margin:0; padding:0; float:right; width:760px;}
		#content #contentbar .article { margin:0 0 24px; padding:0 20px 0 15px; }
		#content #sidebar { padding:0; float:left; width:200px;}
		#content #sidebar .widget { margin:0 0 12px; padding:8px 8px 8px 13px;line-height:1.4em;}
		#content #sidebar .widget h3 a { text-decoration:none;}
		#content #sidebar .widget ul { margin:0; padding:0; list-style:none; color:#959595;}
		#content #sidebar .widget ul li { margin:0;}
		#content #sidebar .widget ul li { padding:4px 0; width:185px;}
		#content #sidebar .widget ul li a { color:blue; text-decoration:none; margin-left:-16px; padding:4px 8px 4px 16px;}
		#content #sidebar .widget ul li a:hover { color:#d3d3f9; font-weight:bold; text-decoration:none;}
		
		/* Footerblurb */
		#footerblurb { background:#eee;color:blue;}
		#footerblurb-inner { margin:0 auto; width:970px; padding:10px;background:#d3d3f9;border-bottom-right-radius:15px;border-bottom-left-radius:15px;}
		#footerblurb .column { margin:0; text-align:justify; float:left;width:250px;padding:0 24px;}
		
		/* Footer */
		#footer { background:#eee;}
		#footer-inner { margin:auto; text-align:center; padding:12px; width:970px;}
		#footer a {color:blue;text-decoration:none;}
		
		/* Clear both sides to assist with div alignment  */
		.clr { clear:both; padding:0; margin:0; width:100%; font-size:0px; line-height:0px;}
	</style>
	<script type="text/javascript">
		/* =============================
		This script generates sample text for the body content. 
		You can remove this script and any reference to it. 
		 ============================= */
		var bodyText=["The smaller your reality, the more convinced you are that you know everything.", "If the facts don't fit the theory, change the facts.", "The past has no power over the present moment.", "This, too, will pass.", "</p><p>You will not be punished for your anger, you will be punished by your anger.", "Peace comes from within. Do not seek it without.", "<h2>Heading</h2><p>The most important moment of your life is now. The most important person in your life is the one you are with now, and the most important activity in your life is the one you are involved with now."]
		function generateText(sentenceCount){
			for (var i=0; i<sentenceCount; i++)
			document.write(bodyText[Math.floor(Math.random()*7)]+" ")
		}
	</script>
</head>
<body>
	<div id="page">
		<header id="header">
			<div id="header-inner">	
				<div id="logo">
					<h1><a href="#">Cool<span>Logo</span></a></h1>
				</div>
				<div id="top-nav">
					<ul>
					<li><a href="#">About</a></li>
					<li><a href="#">Contact</a></li>
					<li><a href="#">FAQ</a></li>
					<li><a href="#">Help</a></li>
					</ul>
				</div>
				<div class="clr"></div>
			</div>
		</header>
		<div class="feature">
			<div class="feature-inner">
			<h1>Heading</h1>
			</div>
		</div>
	

		<div id="content">
			<div id="content-inner">
			
				<main id="contentbar">
					<div class="article">
						<p><script>generateText(12)</script></p>
					</div>
				</main>
				
				<nav id="sidebar">
					<div class="widget">
						<h3>Left heading</h3>
						<ul>
						<li><a href="#">Link 1</a></li>
						<li><a href="#">Link 2</a></li>
						<li><a href="#">Link 3</a></li>
						<li><a href="#">Link 4</a></li>
						<li><a href="#">Link 5</a></li>
						</ul>
					</div>
				</nav>
				
				<div class="clr"></div>
			</div>
		</div>
	
		<div id="footerblurb">
			<div id="footerblurb-inner">
			
				<div class="column">
					<h2><span>Heading</span></h2>
					<p><script>generateText(2)</script></p>
				</div>	
				<div class="column">
					<h2><span>Heading</span></h2>
					<p><script>generateText(2)</script></p>
				</div>
				<div class="column">
					<h2><span>Heading</span></h2>
					<p><script>generateText(2)</script></p>
				</div>	
				
				<div class="clr"></div>
			</div>
		</div>
		<footer id="footer">
			<div id="footer-inner">
				<p>&copy; Copyright <a href="#">Your Site</a> &#124; <a href="#">Terms of Use</a> &#124; <a href="#">Privacy Policy</a></p>
				<div class="clr"></div>
			</div>
		</footer>
	</div>
</body>
</html>

Do I need to change the collector URI as below?

localhost:8080/com.snowplowanalytics.snowplow/tp2 or
localhost:8080/com.snowplowanalytics.snowplow/i

Please help me out.


#4

Hello @sandesh,

I believe you have a serious misunderstanding of how data flows through your application: the error in your bad rows means that the collector received a request on the wrong endpoint, localhost:8080/snowplownew/sample.html. At the same time, your collector endpoint is localhost:8080 and the file with the JS snippet is called sample.html. This makes me think that the bad row event is just your browser trying to fetch your HTML file on the wrong (collector’s) port.

Unfortunately, all this guesswork, together with the inconsistent, irrelevant and redundant information from you, makes it very hard to help you, although we all sincerely want to continue and succeed.

Please @sandesh:

  • Try not to create new topics unless your issue is a really different one. Many topics with different descriptions of the problem do not help us understand it better; they just scatter efforts. Let’s stay within a single thread until your problem is solved.
  • Try to provide consistent information without redundant details such as the whole HTML/CSS markup of your page. Post your configuration or code only when you have changed something there, and preferably not the whole file.
  • Try to keep using the latest compatible versions of the software. You’re already doing that, well done.
  • Try to follow our documentation and terminology. This is the most crucial part. If you don’t understand something or think it’s incomplete, let us know. We’re very keen to keep the wiki up to date and user-friendly, but we don’t always have the resources for that.

If you follow these simple rules (especially following the documentation), you’ll be able to resolve your problems much faster.

Thanks.

P.S. Your collector endpoint looks correct, although your config.hocon would be helpful here to understand whether you’re using the correct port.
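
On the tracker side, the second argument to `newTracker` is only the collector host (and port). As far as I know, the JavaScript tracker appends the path itself: `/i` for GET and `/com.snowplowanalytics.snowplow/tp2` for POST. A rough sketch of that behaviour, with `buildCollectorUrl` as a hypothetical stand-in rather than the tracker’s real internals:

```javascript
// Hypothetical stand-in for what the tracker does internally;
// the function name and parameters are illustrative assumptions.
function buildCollectorUrl(endpoint, usePost, pageProtocol = "http:") {
  // Assumption: the scheme follows the protocol of the page hosting the tracker.
  const scheme = pageProtocol === "https:" ? "https" : "http";
  // The tracker, not the site owner, chooses the path.
  const path = usePost ? "/com.snowplowanalytics.snowplow/tp2" : "/i";
  return `${scheme}://${endpoint}${path}`;
}

console.log(buildCollectorUrl("localhost:8080", false)); // http://localhost:8080/i
console.log(buildCollectorUrl("localhost:8080", true));  // http://localhost:8080/com.snowplowanalytics.snowplow/tp2
```

So 'localhost:8080' in the snippet should stay as it is, without a path added by hand.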


#5

Thanks @anton - you were faster :wink: