Form tracking with Snowplow [tutorial]


#1

Quite often we are asked about Snowplow's ability to capture the values of the fields within a form submitted, say, on a registration page. An example of such a question follows:

Can someone let me know if or how I need to change the .js in order for form fields to be picked up? We’re going to need any email addresses entered to be stored so that we can join them to sales transactions when we build reports.

Form tracking is available out-of-the-box. In this tutorial, I will explain how to enable it, which is a simple task. However, you should also be aware of its limitations. As a result, you might wish to consider a custom event/context to capture the required fields.

Enabling form tracking

All you need to do is to add the following line to your script:

snowplow('enableFormTracking');

where snowplow is the name you gave to the global function used to access the tracker (see the tag you used to load sp.js); if you gave it a different name, amend the above line accordingly.

The above method enables the capture of two event types, namely change_form and submit_form.

You might feel submit_form is what you are after. However, the complication here is that the fields' values will be available in Redshift in JSON format. That is, to retrieve, say, the actual email address submitted with the form, you would need to deploy some means of extracting that value from the JSON array, which is awkward to do in Redshift itself.

An example of such a value follows:

[{"name":"Message","value":"This is how you do it.","nodeName":"TEXTAREA"},{"name":"Name","value":"Ihor Tomilenko","nodeName":"INPUT","type":"text"},{"name":"Email","value":"ihor@properweb.ca","nodeName":"INPUT","type":"text"},{"name":"Subject","value":"Form Tracking","nodeName":"INPUT","type":"text"}]

As can be seen, the data is actually a varchar representation of an array of JSON objects, one per (tracked) field of the form. Such a value would populate the (single) elements column of the com_snowplowanalytics_snowplow_submit_form_1 table.
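To illustrate the extraction problem, here is a minimal Node.js sketch (purely for demonstration; it assumes the elements value has already been exported out of Redshift into a downstream script) that parses such a value and pulls out the Email field:

```javascript
// Demonstration only: parse an exported "elements" value and pull out
// the Email field. The sample string mirrors the example above.
const elementsJson =
  '[{"name":"Message","value":"This is how you do it.","nodeName":"TEXTAREA"},' +
  '{"name":"Email","value":"ihor@properweb.ca","nodeName":"INPUT","type":"text"}]';

const elements = JSON.parse(elementsJson);
const emailField = elements.find(function (el) { return el.name === 'Email'; });

console.log(emailField.value); // ihor@properweb.ca
```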

The change_form event, on the other hand, records each field's value in a dedicated column, namely value of the com_snowplowanalytics_snowplow_change_form_1 table. Each time a field is changed (i.e. focus leaves the field), a change_form event is fired for that field.

# select element_id, value 
  from atomic.com_snowplowanalytics_snowplow_change_form_1
  limit 4;

 element_id |         value          
------------+------------------------
 Name       | Ihor Tomilenko
 Email      | ihor@properweb.ca
 Message    | This is how you do it.
 Subject    | Form Tracking

If you are interested in only one field from the form (or just a few), you can restrict what is captured by means of a whitelist or a blacklist.
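For example (a sketch only — the option names below follow the JavaScript tracker 2.x documentation, so do check them against the tracker version you are running):

```javascript
// Capture only the fields named "Email":
snowplow('enableFormTracking', {
  fields: { whitelist: ['Email'] }
});

// Or capture everything except sensitive fields:
// snowplow('enableFormTracking', {
//   fields: { blacklist: ['Password', 'CardNumber'] }
// });
```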

You can find out more about form tracking on the following wiki page: https://github.com/snowplow/snowplow/wiki/2-Specific-event-tracking-with-the-Javascript-tracker#form-tracking

In addition to adding the script line, you would also have to create the corresponding Redshift tables, as both of these events are implemented as unstructured (self-describing) events. The corresponding DDLs can be found at the following links:

You can link the records in those tables back to atomic.events by means of the relation:

event_id = root_id and collector_tstamp = root_tstamp

The problem, though, is that it's not obvious how a change_form event (which easily yields the values of the fields on the form) relates to any actual submit_form event.

Alternative approach (custom event/context)

Alternatively, you could fire a custom event, or attach a custom context to a Snowplow-authored track... event, to capture just the email address or whatever fields you might be interested in.

Custom contexts can be added as the last parameter to the corresponding tracking function. That would allow you to capture the email address in a dedicated column, for example, thus avoiding the need to parse the JSON produced by the submit_form event.
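As a sketch (the schema URI here is hypothetical — you would need to create and host the matching JSON schema yourself), attaching such a context to a page view could look like:

```javascript
// Hypothetical custom context carrying the email address;
// it is passed as the last parameter of the tracking call.
window.snowplow('trackPageView', null, [{
  schema: 'iglu:ca.properweb/email_context/jsonschema/1-0-0', // hypothetical
  data: { email: 'ihor@properweb.ca' }
}]);
```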

Yet another solution is to create your own version of the form submission event, a so-called self-describing event.

These last two approaches are more involved, however. They require you to create your own JSON schema, JSONPaths file, and Redshift table definition. This can, though, be automated with the help of Schema Guru and, more recently, the igluctl CLI.

You can read more about the events and contexts in this wiki section: https://github.com/snowplow/snowplow/wiki/Events-and-Contexts

Here, I’m going to show you how to capture the values of the form fields of interest by means of a custom self-describing event, with the aim of getting easy access to them in Redshift.

Custom form submission event

For the purposes of this tutorial, and to keep it simple, I aim to capture just an email address from a contact form (as per the original question at the top of the post).

I will start with defining the JSON schema:

{
	"$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
	"description": "Schema for form submission",
	"self": {
		"vendor": "ca.properweb",
		"name": "submit_form",
		"format": "jsonschema",
		"version": "1-0-0"
	},
	"type": "object",
	"properties": {
		"email": {
			"type": "string"
		},
		"form": {
			"type": "string"
		}
	},
	"minProperties":1,
	"required": ["email"],
	"additionalProperties": false
}

As can be seen, the only properties defined are email and form. Given that we also have the name of the event, submit_form, as the value of the name parameter in the self section of the schema, the following values will be accessible from the dedicated Redshift table ca_properweb_submit_form_1, as summarised below.

  • submit_form - event name
  • form - form id/name submitted
  • email - email address captured

Let’s try out the igluctl CLI to generate the corresponding DDL and JSONPaths file. But first, I want to make sure the JSON schema is valid. Provided the schema was saved as schemas/ca.properweb/submit_form/jsonschema/1-0-0, I will use the lint command for the task, as shown below (with its output):

$ ./igluctl lint schemas/
SUCCESS: Schema [/home/ec2-user/schemas/ca.properweb/submit_form/jsonschema/1-0-0] is successfully validated
TOTAL: 1 Schemas were successfully validated
TOTAL: 0 errors were encountered

That’s a good start. Let’s generate the Redshift table definition and the corresponding JSONPaths file:

$ ./igluctl static generate --with-json-paths schemas/
File [/home/ec2-user/./sql/ca.properweb/submit_form_1.sql] was written successfully!
File [/home/ec2-user/./jsonpaths/ca.properweb/submit_form_1.json] was written successfully!

Here, we have them:

$ cat sql/ca.properweb/submit_form_1.sql
-- AUTO-GENERATED BY igluctl DO NOT EDIT
-- Generator: igluctl 0.1.0
-- Generated: 2016-08-19 18:44

CREATE SCHEMA IF NOT EXISTS atomic;

CREATE TABLE IF NOT EXISTS atomic.ca_properweb_submit_form_1 (
    "schema_vendor"  VARCHAR(128)  ENCODE RUNLENGTH NOT NULL,
    "schema_name"    VARCHAR(128)  ENCODE RUNLENGTH NOT NULL,
    "schema_format"  VARCHAR(128)  ENCODE RUNLENGTH NOT NULL,
    "schema_version" VARCHAR(128)  ENCODE RUNLENGTH NOT NULL,
    "root_id"        CHAR(36)      ENCODE RAW       NOT NULL,
    "root_tstamp"    TIMESTAMP     ENCODE LZO       NOT NULL,
    "ref_root"       VARCHAR(255)  ENCODE RUNLENGTH NOT NULL,
    "ref_tree"       VARCHAR(1500) ENCODE RUNLENGTH NOT NULL,
    "ref_parent"     VARCHAR(255)  ENCODE RUNLENGTH NOT NULL,
    "email"          VARCHAR(4096) ENCODE LZO       NOT NULL,
    "form"           VARCHAR(4096) ENCODE LZO,
    FOREIGN KEY (root_id) REFERENCES atomic.events(event_id)
)
DISTSTYLE KEY
DISTKEY (root_id)
SORTKEY (root_tstamp);

COMMENT ON TABLE atomic.ca_properweb_submit_form_1 IS 'iglu:ca.properweb/submit_form/jsonschema/1-0-0';
$ cat jsonpaths/ca.properweb/submit_form_1.json
{
    "jsonpaths": [
        "$.schema.vendor",
        "$.schema.name",
        "$.schema.format",
        "$.schema.version",
        "$.hierarchy.rootId",
        "$.hierarchy.rootTstamp",
        "$.hierarchy.refRoot",
        "$.hierarchy.refTree",
        "$.hierarchy.refParent",
        "$.data.email",
        "$.data.form"
    ]
}

Once the files are uploaded to the dedicated buckets on S3 and the correct permissions/access are set, I can take care of the JavaScript event tracking function (using jQuery). Please bear in mind the code below is for demonstration purposes only; it would need to be made more robust before being deployed to production.

<script type="text/javascript">
    $('#contact-form').submit(function(){
        var form_id = $(this).attr('id');			//form id
        var email_address = $('input[name=Email]', this).val();	//email address value (scoped to this form)
        window.snowplow('trackUnstructEvent', {
            schema: 'iglu:ca.properweb/submit_form/jsonschema/1-0-0',
            data: {
                form: form_id,
                email: email_address
            }
        });
    });
</script>

After firing a few events and ensuring they were sent to the collector and were available as raw events in the raw bucket, I launched the pipeline job. Let’s check the results.

# select derived_tstamp, event, schema_name as event_name, form, email
  from atomic.events
  join atomic.ca_properweb_submit_form_1
  on event_id=root_id and collector_tstamp=root_tstamp
  limit 3;

     derived_tstamp      |  event   | event_name  |     form     |       email       
-------------------------+----------+-------------+--------------+-------------------
 2016-08-19 02:23:56.999 | unstruct | submit_form | contact-form | ihor@properweb.ca
 2016-08-19 02:21:24.999 | unstruct | submit_form | contact-form | ivan@pupkin.org
 2016-08-19 02:22:42.496 | unstruct | submit_form | contact-form | alex@gmail.com
(3 rows)

Here we have it: the email addresses captured on submission of the form are available in the dedicated table.

Summary

Snowplow is a flexible platform giving you different ways of capturing the data you are interested in. Form submission, in particular, can be handled in the following three ways, two of which are available out-of-the-box.

  • submit_form - the Snowplow-authored event capturing the form values, available in Redshift as an array of JSON objects. Despite its simplicity, the drawback is that you would have to deploy some means of extracting the required values from the JSON
  • change_form - this event does capture field values in a dedicated column. However, it's not directly related to the form submission; if the same field was amended several times prior to submission, it's hard to correlate the captured values with the values actually submitted with the form
  • self-describing event - gives you the most flexibility, as it's up to you how you define it. The drawback, however, is that it is more involved and requires a deeper understanding of the Snowplow pipeline and the techniques involved

The choice is yours!


#2

Thanks for the great tutorial. It pretty much confirmed we’re doing it right. One point though: in many countries email addresses are considered PII (personally identifiable information) and hence not allowed to be stored on our systems. For that reason I created an issue to support a custom function to be applied to the form field value before transmitting it to Snowplow, for example an MD5 or SHA1/SHA256 hashing algorithm. Does that make sense?


#3

Hi @christoph-buente,

Thank you for your comments and contribution. The feature you are referring to is expected to be added to JS tracker v2.7.0.

–Ihor


#4

I’m a little confused.

  • The links that point to the SQL for submit_form_1.sql are broken.
  • I read a post that says that everything has been moved to iglu central.

So, I’m unclear as to how I need to get the database configured properly or if I have to put some JSON schema(s) in the enrichments directory.

A little help here would be great.

Thanks,

Sonny


#5

OK. I found the sql in GitHub here at the SQL Location


#6

Sorry for the confusion @sonnypolaris - yes we moved all of the SQL files out of snowplow/snowplow to Iglu Central, where the JSON Schemas associated with the SQL files also live.


#7

Hi @ihor - Thank you for the great tutorial, which is so helpful. I only have one issue: after implementing the JS tracker on my website and using enableFormTracking as explained in the tutorial, I just want to track the fields after clicking the submit button.
Because in my case, when I fill in a textarea or input text field, an event fires before I complete the whole form and click submit. For example, if I have 3 text fields, it generates 3 struct events.

Any suggestions please?


#8

Those are change_form events. These fire as users move between form fields and are generated by the JS tracker. When the user submits the form, the JS tracker will also create a submit_form event. So in your example of a form with 3 fields, you will have 3 change_form events and 1 submit_form event, 4 events in total.

The change_form events allow analysts to monitor form engagement (abandonment, for example) and to see the values users put into certain fields.

Hope this clears that up.


#9

Thank you for the reply, things are more clear now.
In my case, I only want to track the submit_form event, i.e. only when a user completes the form and presses submit (if not, no events about the individual fields are needed).
I am confused about how to set this up, especially as I’m using GTM without a dataLayer (I don’t want to change the web page source code). I have tried using variables and triggers in GTM but things didn’t work!


#10

Thanks @ihor for the tutorial.

My pipeline uses postgresql, which means I can’t use storageloader to load custom contexts or unstructured events into my database (this ticket would fix this but has been open since 2014, and this thread discusses the issue).

I’d like to have form submission data accessible in postgresql without extracting all data from one column (as this article described). How can I collect form submission data and load it into postgresql, in dedicated columns?


Help tracking form data via unstructured event & kafka
#11

Hi @rob - until we add first-class shredding support for Postgres into Snowplow, these sorts of use-cases will be very challenging to achieve. We don’t have a timeline for full-fat Postgres support currently - so until then, switching to using Redshift is your best bet.