[Draft pending AWS Lambda source release] Using Snowplow's AWS Lambda source to monitor S3 buckets [tutorial]



This is a brief tutorial on how to monitor S3 bucket operations (files added, files removed) using Snowplow to capture and store these events.

The AWS Lambda source is an AWS Lambda function that is triggered whenever the selected buckets are mutated. The function then uses a Snowplow tracker to send these events into your Snowplow pipeline for real-time or batch processing (depending on which pipeline you’re using).

Before you get started, you’ll need to ensure you have Python (2.7) and the AWS CLI installed and configured (not shown). Then run the following steps to install pyyaml (a dependency for reading the configuration file) and download/extract the deployment bundle:

sudo pip install pyyaml
wget https://bintray.com/artifact/download/snowplow/snowplow-generic/snowplow_aws_lambda_source_0.1.0_bundle.zip
unzip snowplow_aws_lambda_source_0.1.0_bundle.zip -d snowplow_aws_lambda_source_0.1.0_bundle
cd snowplow_aws_lambda_source_0.1.0_bundle

The downloaded deployment bundle contains a set of files that let you configure and deploy this functionality. deploy.py is the script we’ll use to deploy the AWS Lambda function with the right configuration.
config.yaml is the configuration file that determines how the AWS Lambda source will operate: which buckets are to be monitored and where events are sent.

To get started you’ll first need to edit the configuration file config.yaml, like so:

snowplow:
    collector: http://collector.acme.com
    app_id: com.acme.rawenrichedmonitor
s3:
    buckets:
        - raw
        - enriched

assuming your Snowplow collector endpoint is http://collector.acme.com and the buckets you wish to monitor are raw and enriched. The app_id field is attached to each event this specific AWS Lambda source fires, allowing you to differentiate between multiple AWS Lambda sources.
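Since pyyaml was installed as a dependency above, the configuration is presumably read along these lines. This is only an illustrative sketch of parsing the example config.yaml, not the actual deploy.py code; the variable names are made up here:

```python
import yaml  # pyyaml, installed earlier with `sudo pip install pyyaml`

# The example config.yaml from above, inlined here for illustration.
config_text = """
snowplow:
    collector: http://collector.acme.com
    app_id: com.acme.rawenrichedmonitor
s3:
    buckets:
        - raw
        - enriched
"""

config = yaml.safe_load(config_text)
collector = config["snowplow"]["collector"]  # where events are sent
app_id = config["snowplow"]["app_id"]        # identifies this Lambda source
buckets = config["s3"]["buckets"]            # buckets to monitor

print(collector)  # http://collector.acme.com
print(buckets)    # ['raw', 'enriched']
```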

Running the following will deploy the AWS Lambda to your account:

python deploy.py
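Conceptually, deployment means creating the Lambda function and subscribing it to object-created and object-removed events on each configured bucket. The sketch below shows the shape of the S3 notification configuration such a subscription needs; the function name and ARN are placeholders, and the real deploy.py may construct this differently:

```python
# Hedged sketch of the S3 bucket notification configuration a deployment
# would attach to each monitored bucket. The ARN below is a placeholder.
def notification_configuration(lambda_arn):
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": lambda_arn,
                # Fire on both additions and removals, matching the
                # "files added, files removed" operations monitored here.
                "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"],
            }
        ]
    }

config = notification_configuration(
    "arn:aws:lambda:us-east-1:123456789012:function:snowplow-source"  # placeholder
)
# With boto3 this would be applied per bucket via
# s3.put_bucket_notification_configuration(Bucket=bucket,
#     NotificationConfiguration=config)
```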

Provided everything completed successfully, adding or removing items in the buckets you have specified will now send an S3 notification event to your selected collector!
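The event the Lambda receives from S3 follows AWS’s standard notification format, in which each record carries the operation name, bucket and object key. The handler below is only an illustrative sketch of pulling those fields out (the sample key is made up), not the actual source code of the Lambda:

```python
# Illustrative sketch: extract the operation, bucket and key from an
# S3 notification event, as the Lambda would before tracking the event.
def summarise_s3_event(event):
    summaries = []
    for record in event["Records"]:
        summaries.append({
            "operation": record["eventName"],         # e.g. ObjectCreated:Put
            "bucket": record["s3"]["bucket"]["name"],
            "key": record["s3"]["object"]["key"],
        })
    return summaries

# A minimal sample event in the S3 notification shape (key is illustrative).
sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {"bucket": {"name": "raw"}, "object": {"key": "part-0000"}},
        }
    ]
}

print(summarise_s3_event(sample_event))
```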

If you’re using our batch pipeline with Amazon Redshift, you’ll also need to deploy the Redshift table definition s3_notification_event_1.sql to your cluster.