Snowplow Golang Tracker released


#1

We are very excited to announce the release of the Snowplow Golang Tracker!

This release brings a fully asynchronous, SQLite-backed tracker that you can use in your Go apps and servers. It will also be used as a building block for:

  • Building a daemon to be used with the PHP Tracker for robust async sending (issue #54)
  • Powering the Snowplow Tracking CLI, to let Snowplow users send events from the command-line on Linux, Windows and OS X
  • Building an equivalent to Logstash for tailing logfiles into Snowplow as well-structured Snowplow events (working title the Snowplow Logfile Source)
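To illustrate the asynchronous design (without claiming to match the tracker's actual API - the `Emitter` type and its methods below are hypothetical), a channel-based emitter in Go might look like this; the real tracker additionally persists unsent events in SQLite so they survive restarts:

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a simplified Snowplow-style event payload.
type Event struct {
	Schema string
	Data   map[string]string
}

// Emitter buffers events and processes them asynchronously on a
// background goroutine, mimicking the tracker's non-blocking design.
type Emitter struct {
	queue chan Event
	wg    sync.WaitGroup
	mu    sync.Mutex
	sent  []Event
}

func NewEmitter(bufferSize int) *Emitter {
	e := &Emitter{queue: make(chan Event, bufferSize)}
	e.wg.Add(1)
	go func() {
		defer e.wg.Done()
		for ev := range e.queue {
			// The real tracker would POST to the collector here, and
			// keep unsent events in SQLite for retry on failure.
			e.mu.Lock()
			e.sent = append(e.sent, ev)
			e.mu.Unlock()
		}
	}()
	return e
}

// Track enqueues an event without blocking the caller (while the
// buffer has capacity).
func (e *Emitter) Track(ev Event) { e.queue <- ev }

// Close drains the queue and waits for the background goroutine.
func (e *Emitter) Close() {
	close(e.queue)
	e.wg.Wait()
}

func main() {
	em := NewEmitter(100)
	em.Track(Event{Schema: "page_view", Data: map[string]string{"url": "/home"}})
	em.Track(Event{Schema: "page_view", Data: map[string]string{"url": "/docs"}})
	em.Close()
	fmt.Println(len(em.sent))
}
```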

If you have any questions about the release, please post to this topic!


#2

@josh - excited to see the initial plan for implementation of the Snowplow Logfile Source. We have implemented something similar, very reliably, with the dotnet tracker and MSMQ. We don’t have any experience with Golang, but it looks straightforward. Let us know if there is anything we can do to contribute.


#3

Hey @digitaltouch - thanks for sharing that! We believe that a few companies have implemented something like this internally for Snowplow in a few different languages (e.g. Python). It would be awesome to get something standardized out.

The open question we have for the Snowplow Logfile Source is really around how to transform textual loglines into well-structured JSON. How did you do it on your side? Did you have some specific awareness of different logfile syntaxes (e.g. Apache Common Log, Apache Combined Log, Apache Error Log, RFC3164 Syslog), or did you provide configuration options to convert e.g. CSVs to JSONs, or something else?

Edit: another question I have is around how you prevent a logfile from being processed twice - do you perhaps archive/rotate logfiles after processing, or manage some kind of state to track your “cursor position” within the files?
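As a concrete example of the first question, turning an Apache Common Log line into well-structured JSON is mostly a matter of a format-aware regex. A minimal Go sketch (the sample line is invented for illustration):

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// commonLog matches the Apache Common Log Format:
// host ident authuser [timestamp] "request" status bytes
var commonLog = regexp.MustCompile(
	`^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)$`)

// ParseCommonLog converts one logline into a JSON-ready map,
// or returns ok=false if the line does not match the format.
func ParseCommonLog(line string) (map[string]string, bool) {
	m := commonLog.FindStringSubmatch(line)
	if m == nil {
		return nil, false
	}
	return map[string]string{
		"host":      m[1],
		"ident":     m[2],
		"authuser":  m[3],
		"timestamp": m[4],
		"request":   m[5],
		"status":    m[6],
		"bytes":     m[7],
	}, true
}

func main() {
	line := `127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326`
	fields, ok := ParseCommonLog(line)
	if !ok {
		panic("unparseable line")
	}
	out, _ := json.Marshal(fields)
	fmt.Println(string(out))
}
```

Each supported syntax (Combined Log, Error Log, Syslog, ...) would get its own pattern; unmatched lines could be routed to a bad-rows stream rather than dropped.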


#4

@alex - I apologize in advance for the lengthy reply. I agree it would be great to have a standardized package out.

To clarify, I was using “log” loosely. We primarily use it for converting .csv files into well-structured JSONs that get sent to Snowplow through the standard trackers. However, I believe that we could build off of what logstash has done with the syntax awareness of various log files (Apache Common, Apache Combined, etc) with relative ease, or we could build some sort of DSL that has predefined JSON schemas for each common log file (something similar to the csv-mapper project).

The most important thing for us is to implement an extremely flexible file mapper that can be passed as an argument to the script. This is important to us, and possibly to the rest of the community, for a couple of reasons:

  1. Maintenance - This prevents multiple packages from having to run slight variations of the same process.
  2. Transformation - We find ourselves transforming the contents of the text file because of strange variations (think “–” vs. “” for NULL, or dates that are spread across multiple fields) that could be handled by prebuilt find-and-replace methods. This results in a separate codebase for each vendor.

This would result in a process similar to the following:

Huskimo -> S3 -> SNS -> Snowplow Logfile Source (including simple transformations) -> Snowplow Collector -> Snowplow Pipeline

I use Huskimo loosely because we found out the hard way that it is easiest to extract from third parties, save to text files, and send through the Snowplow pipeline via trackers: we can monitor bad rows without having to constantly re-deploy the extraction and loading scripts - or have a DBA monitor them. Going straight from third parties to Redshift turns out to cause the same problems as traditional ETL pipelines when managing more than 10 different datasources.

Currently we use our own version of Huskimo that we write on a vendor-by-vendor basis: it transforms any anomalies in the API format into a text file, reads the text file, sends the events to Snowplow, and archives the file to S3. It is working great, as we can tune the scripts on a vendor-by-vendor basis, and we can allocate certain machine types to certain vendors based on the level of compute needed. But it feels like we can do better.

As far as processing logfiles twice, we handle that with deduplication scripts alongside StorageLoader. We started with a system to handle cursor position (with Redis) and found that it was much easier to manage deduplication in SQL. We compress and archive the logfiles after processing. If the script fails midway through parsing, that log may get parsed twice.


#5

Hi @digitaltouch - thanks for the super-detailed thoughts! I think we are very much on the same page.

> I believe that we could build off of what logstash has done with the syntax awareness of various log files (Apache Common, Apache Combined, etc) with relative ease, or we could build some sort of DSL that has predefined JSON schemas for each common log file (something similar to the csv-mapper project).

Definitely agree - it feels like we can probably do both. The fastest route would probably be to ship the Logfile Source with some standard logfile formats built in, but also allow Lua scripting for simple transformations.

We can take some inspiration from the Amazon Kinesis Agent too.

(Much) further down the line, it would be interesting as well for the Snowplow pipeline to be able to support schema inference for unknown logfile formats (similar to the Sequence project).

> we found out the hard way that it is easiest to extract from third parties, save to text files, and send through the Snowplow pipeline through trackers because we can monitor bad rows without having to constantly re-deploy the extraction and loading scripts - or have DBA monitor. Going straight from third parties to Redshift turns out to cause the same problems as traditional ETL pipelines when managing more than 10 different datasources.

Completely makes sense. We made two architectural mistakes with Huskimo:

  1. Directly integrating with Redshift rather than emitting Snowplow events and letting the Snowplow pipeline and Iglu do the heavy lifting
  2. Adding all of the integrations into a single codebase. Each integration is completely independent of the others - they make much more sense as individual projects

Our Snowplow AWS Lambda Source project (pre-release) is a better example of our planned post-Huskimo approach to these kinds of integrations.

On your integration dataflow:

> Huskimo -> S3 -> SNS -> Snowplow Logfile Source (including simple transformations) -> Snowplow Collector -> Snowplow Pipeline

This is super-interesting. Given that most third-party APIs require pagination anyway, what’s the benefit of round-tripping the data through S3, SNS and the Logfile Source - why not just embed the Snowplow tracker in “Huskimo”? This is what we are planning post-Huskimo, and I’m confident it would work with Singular, Twilio and Desk.com at the very least. In other words:

Snowplow Foo.com Source embeds Snowplow Tracker -> Snowplow Collector

> As far as processing logfiles twice, we handle that with deduplication scripts with StorageLoader. We started with a system to handle cursor position (with Redis) and found that it was way easier to manage deduplication in SQL. We compress and archive the logfiles after processing. If the script fails mid way through parsing, that log may get parsed twice.

This is a bit surprising to me. Do you mean that you extract the whole data source on each run? We’ve found this impossible given API rate limits - and it feels unnecessary in any situation where resources are append-only or have a lastUpdated timestamp. What were the problems you encountered with cursor positions in Redis? We were thinking of using DynamoDB for this.

In any case, I agree that you need good deduplication (but not source-specific deduplication as that’s a maintenance nightmare) given that a source will update its cursor pessimistically, so records may come through twice.


Phew! Ditto apologies for the length of this reply, but it’s an interesting topic. A final question: are you open to open-sourcing/contributing any of your source integrations? It feels like it would be easier to discuss your experiences in all this with some of the code in front of us.


#6

Right now we are running with the Snowplow tracker embedded in a Huskimo-like program - not the integration workflow I suggested. It is really working great, but as it grows, so does the number of packages emitting events to Snowplow. A handful of thoughts:

  • We are seeing that we can generally run the API extraction process on much cheaper instances. We tend to batch our API and database extractions in daily increments.
  • We have found that to emit a large number of events reliably to Snowplow we need an instance with high networking performance (m4.xlarge and up). The total cost of ownership starts to add up with these instances. This really only becomes an issue on large files with a million or more records (not usually coming from APIs). Separating the sending step would allow us to run the Logfile Source on dedicated instances in an ASG.
  • We have a first functioning version of a web-based file upload that allows users to map columns in the file to a schema pulled in from Iglu. We then pass this file in as an argument, with a configuration file (JSON), to an executable that emits the events. It’s very buggy, and we got sidetracked on other work.
  • We find ourselves modifying API extraction code a lot (SDK updates, API enhancements, poor documentation, new features). By separating the two concerns, we can keep developers focused on maintaining code that extracts data, handles any data intricacies, and writes to S3 (an easy concept), and data engineers focused on ingesting those files and getting them into Snowplow / other systems.

We run daily incremental extractions. I can see where this will cause issues when needing to run more frequently than daily. We had no major problems with Redis; we just got frustrated with managing another database / integration. When we realized we could use a date as our cursor without causing major issues, we decided to do that. This definitely won’t work for users that want more frequent extracts.

I really don’t recommend letting the C# codebase we are using out into the wild :slight_smile: I will simplify the script to show the process as pseudocode, which will be a great starting point for discussion. Happy to contribute / open-source anything towards the Logfile Source. Would you prefer that here, another thread in Discourse, or a GitHub issue?


#7

Thanks @digitaltouch - that all makes a lot of sense.

> I will simplify the script to show the process as pseudo code we are using which will be a great starting point for discussion. Happy to contribute/open source anything towards the Logfile source. Would you prefer that here, another thread in Discourse, or a Github case?

Do you want to create a new ticket in the Logfile Source project with your pseudo-code when you’re ready?