This is a brief tutorial on how to monitor S3 bucket operations (files added, files removed) using Snowplow to capture and store these events.
The AWS Lambda source is an AWS Lambda function that is triggered whenever the selected buckets are mutated. The function then uses a Snowplow tracker to send these events into your Snowplow pipeline for real-time or batch processing (depending on which pipeline you’re using).
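To make the flow concrete, here is a minimal sketch of what such a handler might look like. This is illustrative only: `send_to_snowplow` is a hypothetical stand-in for the real Snowplow tracker call, and the payload fields are just the parts of the standard S3 notification record the source cares about.

```python
import json


def send_to_snowplow(payload):
    # Hypothetical stand-in for the Snowplow tracker; the real source
    # sends a self-describing event to your collector endpoint.
    print(json.dumps(payload))


def lambda_handler(event, context):
    # AWS invokes this with an S3 notification event containing one
    # record per bucket mutation (ObjectCreated:*, ObjectRemoved:*).
    payloads = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        payload = {
            "eventName": record["eventName"],   # e.g. "ObjectCreated:Put"
            "bucket": s3["bucket"]["name"],
            "key": s3["object"]["key"],
        }
        send_to_snowplow(payload)
        payloads.append(payload)
    return payloads
```

The record layout (`Records[].s3.bucket.name`, `Records[].s3.object.key`) is the standard shape AWS uses for S3 event notifications delivered to Lambda.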
Before you get started, you’ll need to ensure you have Python (2.7) and the AWS CLI installed and configured (not shown). Then run the following steps to install pyyaml (a dependency for reading the configuration file) and download and extract the deployment bundle:
sudo pip install pyyaml
wget https://bintray.com/artifact/download/snowplow/snowplow-generic/snowplow_aws_lambda_source_0.1.0_bundle.zip
unzip snowplow_aws_lambda_source_0.1.0_bundle.zip -d snowplow_aws_lambda_source_0.1.0_bundle
cd snowplow_aws_lambda_source_0.1.0_bundle
In the downloaded deployment bundle, there’s a set of files that will let you configure and deploy this functionality.
deploy.py is a script that we’ll use to deploy the AWS Lambda with the right configuration.
config.yaml is a configuration file that specifies how the AWS Lambda source will operate: which buckets are to be monitored and where events are sent.
To get started, you’ll first need to edit the configuration file config.yaml, like so:
snowplow:
  collector: http://collector.acme.com
  app_id: com.acme.rawenrichedmonitor
s3:
  buckets:
    - raw
    - enriched
This assumes your Snowplow collector endpoint is http://collector.acme.com and the buckets you wish to monitor are named raw and enriched. The app_id field is attached to each event sent when this specific AWS Lambda fires, allowing you to differentiate between multiple AWS Lambda sources.
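Since pyyaml was installed earlier precisely to read this file, deploy.py presumably parses config.yaml along these lines. This is a sketch under that assumption; the exact field handling inside deploy.py is not shown in this tutorial.

```python
import yaml

# The same configuration shown above, inlined here so the sketch is
# self-contained; in practice deploy.py would read config.yaml from disk.
config_text = """
snowplow:
  collector: http://collector.acme.com
  app_id: com.acme.rawenrichedmonitor
s3:
  buckets:
    - raw
    - enriched
"""

config = yaml.safe_load(config_text)
collector = config["snowplow"]["collector"]   # where events are sent
app_id = config["snowplow"]["app_id"]         # attached to each event
buckets = config["s3"]["buckets"]             # buckets to monitor
```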
Running the deploy.py script will deploy the AWS Lambda to your account.
Provided everything completed successfully, adding or removing items in the buckets you have specified will now send an S3 notification event to your selected collector!
If you’re using our batch pipeline with Amazon Redshift, you’ll also need to deploy the s3_notification_event_1.sql Redshift table definition to your cluster.