Server-side trackers, proposed new features


#1

Dear Snowplowers,

I wanted to share with you some of my experiences developing proprietary event trackers to see if there’s an interest in adopting those ideas here, by this community.

Currently server side trackers do not seem to receive as much love as javascript tracker, they are treated a bit like second class citizens. In my previous engagements, this was not the case. Instead, a symbiotic, collaborative use of client-side and server-side trackers can be deployed to promote a clean and usable data collection strategy. But let’s first state what we are trying to solve for.

In my experience with large, popular destinations, between 40% and 60% of the web traffic served by web servers never renders in a browser.

  1. Bots, often unknown and uncategorized, knock on every door and peek through every keyhole. They trigger API calls, steal proprietary content and almost never render the retrieved content in a real browser, so they never execute JavaScript code on the browser side. Various types of DDoS attacks can’t be noticed through analysis of JavaScript-generated event logs. Owners are often baffled - the backend registers activity, CPU and IO spike, AWS (or another cloud provider) spins up resources - but there’s no corresponding activity on the monitoring dashboards. Malicious abusers find security holes in APIs and award themselves and their friends virtual goods not earned through predefined routes. Game leaderboards explode with unnaturally high scores and havoc ensues.
  2. Browser updates prevent content from being rendered or stop JavaScript execution mid-flight. Native browser extensions render as a black block on the screen or get disabled due to version incompatibility. All of this goes unnoticed until the volume is material, but then it is too late - you have already taken a hit to the bottom line and now fixing the problem is a firefight. Investigations involve multiple groups of scared, uncertain stakeholders breathing down the developers’ necks through sleepless nights and weekends.
  3. In both cases outlined above, events were registered by the backend and middleware but not by the front end. We were paying attention to the front end and were oblivious to the body of the iceberg. But sometimes things can get even more embarrassing. Front end developers were pretty damn sure they had enabled event logging across the board, say 99% of the possible ways of interacting with the application were covered. Who gives a damn about the last mile, right? Then one fine day, after the webdev resources have moved on to other projects, we find out it wasn’t 1% that was left behind… or overnight, due to bizdev or marketing activities, the 1% ballooned into 30% or 50%. How do you know you did everything right? Well, five hundred years ago, professional accountants figured out how - they count everything twice, all the time. And if at the end the two sides balance to zero, then they did it right. If they don’t, then they did it wrong. That simple.

Now let’s review what can be done to bridge these gaps. Currently Snowplow does not integrate into web application backends easily. But let’s take RoR or Spring Framework or PHP as an example.
What if an HTTP(S) request was intercepted by a filter onRequest (per request), an emittable Event object was created and made available to the application in later HTTP request/response lifecycle phases, and onResponseCommit the object was examined to see if it needs to be emitted? What if the unique event id generated on the server side was made available to the JavaScript tracker on the front end to associate the two parts of the same “happening” together? What if the user’s headers (e.g. cookies and authorization headers) were used to federate the user’s identity? Oh, then the server would be logging not only server start, server fault, and server down events, but “user david made an API call (Part-A/Server Side)” and some seconds later “user david rendered financial reports in the browser single page app (Part-B/Client Side)”. I’m struggling for words here, but I hope you can see the possibilities. The best part is that developers do not need to re-invent the wheel every time they need to enable event logging. If you standardize on parameter and cookie usage and naming conventions, then most of the commonly used concepts can be pre-populated into the event context automatically by the library. All developers need to do is add a few last touches:
a. No, this is not a generic page view event; change it to something more descriptive of the business logic.
b. Add event-specific context.
c. Change a default or two, because you know better.
Done and Done.
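
To make the request/response lifecycle idea above concrete, here is a minimal sketch as a Python WSGI middleware. It is only an illustration under assumptions: `RequestEvent`, `EventTrackingMiddleware`, the `tracking.event` environ key and the `X-Event-Id` header are all hypothetical names, and `emit` is any callable standing in for a real tracker. None of this is an existing Snowplow API.

```python
import uuid


class RequestEvent:
    """Hypothetical per-request event object; not part of any Snowplow tracker."""

    def __init__(self, environ):
        self.event_id = str(uuid.uuid4())
        self.name = "page_view"  # generic default the application may override
        self.should_emit = True
        # Pre-populate context from the request, as the post suggests.
        self.context = {
            "path": environ.get("PATH_INFO", ""),
            "user_agent": environ.get("HTTP_USER_AGENT", ""),
            "has_cookie": "HTTP_COOKIE" in environ,
        }


class EventTrackingMiddleware:
    """WSGI middleware: create the event on request, emit it after the response."""

    def __init__(self, app, emit):
        self.app = app
        self.emit = emit  # any callable taking a dict; stands in for a real tracker

    def __call__(self, environ, start_response):
        event = RequestEvent(environ)
        environ["tracking.event"] = event  # hypothetical key the app reads

        def tracked_start_response(status, headers, exc_info=None):
            event.context["status"] = status
            # Hypothetical header exposing the id to the front end.
            headers = list(headers) + [("X-Event-Id", event.event_id)]
            return start_response(status, headers, exc_info)

        result = self.app(environ, tracked_start_response)
        try:
            body = list(result)  # buffered for brevity; real code would stream
        finally:
            if hasattr(result, "close"):
                result.close()
            if event.should_emit:
                self.emit({"id": event.event_id, "name": event.name,
                           "context": event.context})
        return body


def demo_app(environ, start_response):
    event = environ["tracking.event"]
    event.name = "financial_report_requested"  # (a) more descriptive name
    event.context["report"] = "q3"             # (b) event-specific context
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]
```

Wrapping an application is then `EventTrackingMiddleware(demo_app, emit=print)`. The handler only touches the pre-built event, and the middleware decides at the end of the request whether to emit it.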

Now, by constantly monitoring what is captured on the front-end and back-end sides of the “event accounting”, we can easily isolate any attributes (browser versions, IP addresses, cookies, etc.) that cause lopsided event tracking and fix the problem before it becomes an embarrassment or a threat to the business’s bottom line.
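
As a rough illustration of that “event accounting” check, the sketch below counts events per attribute value on each side and flags values where the client side falls well short of the server side. The function name `find_lopsided`, the event dictionaries and the 0.8 threshold are assumptions made for this example, not anything Snowplow provides.

```python
from collections import Counter


def find_lopsided(server_events, client_events, key, min_ratio=0.8):
    """Return {attribute value: client/server ratio} for values where the
    client-side count is below min_ratio of the server-side count."""
    server = Counter(e[key] for e in server_events)
    client = Counter(e[key] for e in client_events)
    report = {}
    for value, n_server in server.items():
        ratio = client.get(value, 0) / n_server
        if ratio < min_ratio:
            report[value] = round(ratio, 2)
    return report


# Made-up sample data: one browser version loses most client-side events.
server_side = [{"browser": "chrome-120"}] * 10 + [{"browser": "firefox-121"}] * 10
client_side = [{"browser": "chrome-120"}] * 9 + [{"browser": "firefox-121"}] * 3

lopsided = find_lopsided(server_side, client_side, key="browser")
print(lopsided)
```

In practice the same comparison would run as a scheduled query over the enriched events rather than in application code; the point is only that both sides share a key to compare on.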

In our proprietary event logging frameworks we have often stitched multiple independently logged parts of the same logical event into a single record, similar to the de-duplication feature recently added to Snowplow. Example:
Part A: Server generated content to render (size, timings on API calls are logged)
Part B: Page lazy loaded on the browser (load times, modules loaded)
Part C: Google client side ads were requested (10 requested, 3 delivered, 3 displayed)
And with Snowplow’s ability to add contexts to existing event logs, stitching logic is probably going to be very simple!
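
A minimal sketch of such stitching, assuming every part carries a shared `event_id` and a part name. The field names and sample values here are invented for illustration and do not reflect Snowplow’s actual event model.

```python
from collections import defaultdict


def stitch(parts):
    """Merge independently logged parts that share an event_id into one
    record, keeping each part's data under its own name."""
    merged = defaultdict(dict)
    for part in parts:
        merged[part["event_id"]][part["part"]] = part["data"]
    return dict(merged)


def incomplete(stitched, expected=("server", "browser", "ads")):
    """List event ids that are missing one or more expected parts."""
    return sorted(eid for eid, rec in stitched.items()
                  if any(name not in rec for name in expected))


parts = [
    {"event_id": "e1", "part": "server", "data": {"size_bytes": 20480, "api_ms": 35}},
    {"event_id": "e1", "part": "browser", "data": {"load_ms": 900}},
    {"event_id": "e1", "part": "ads", "data": {"requested": 10, "delivered": 3, "displayed": 3}},
    {"event_id": "e2", "part": "server", "data": {"size_bytes": 1024, "api_ms": 12}},
]

stitched = stitch(parts)
```

Here `e1` ends up as one record with all three parts, while `incomplete(stitched)` reports `e2`, which only ever had its server-side part logged.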

Thoughts? Comments?


#2

I definitely think there’s merit in improving some of the server side clients - that said I’ve very rarely used them across a large number of projects (it’s primarily JS/iOS/Android).

Having Snowplow complement existing application performance monitoring (APM) tools certainly makes sense - often by using a transaction id similar to AWS X-Ray/Loggly/other tools. This would be simple enough to send in a custom context. However, a lot of consideration would need to go into building out more APM-style functionality if the intention was to replicate some of the features in current tools, such as:

  • How do you send a large number of these events in a non-blocking low latency manner from a server side application so that you don’t slow down a request?
  • Do trace events need to be generated in order?
  • If the application is composed of multiple microservices should each microservice generate an event or should a single service collect and aggregate performance information for that trace?
  • What is the expected behaviour when an object/page is served from cache? (You’ll end up with discrepancies between client side and server data)
  • If a request failure occurs upstream how do you signal that downstream events should not fire?
  • Should you record network requests that are part of the response but non HTTP/HTTPS?
  • If the frontend is sending n events per second and a backend service is sending 3*n events per second how do you ensure that if the backend scales that the collector load balancer scales without generating 5xxs?

Some of these issues are easier to solve than others, but I think having an APM to complement Snowplow data is a fair bit easier than adding that functionality into the Snowplow libraries themselves.


#3

Very interesting thread @dashirov!

Lots to digest - but on the core idea of “double entry book-keeping” for events - I believe you could prototype this today:

  1. Server-side: on each request:
  • Generate a request ID (UUID)
  • Emit a request event with a request ID
  • Render the page with JavaScript automatically binding a request custom context to all events
  2. Client-side: execute the JavaScript so that all events are sent in with their request context (including request ID)
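
The server-side half of the prototype above could be sketched roughly as follows. The context uses the self-describing JSON shape (`schema` plus `data`), but the schema URI `iglu:com.example/request/jsonschema/1-0-0`, the `requestId` field and the `window.requestContext` variable are placeholders assumed for this example. How the page's tracker setup attaches that variable to every event is deliberately left out.

```python
import json
import uuid

# Hypothetical schema URI; a real deployment would publish its own schema.
REQUEST_SCHEMA = "iglu:com.example/request/jsonschema/1-0-0"


def new_request_context():
    """Build a self-describing request context with a fresh request ID."""
    return {"schema": REQUEST_SCHEMA, "data": {"requestId": str(uuid.uuid4())}}


def render_page(body_html, context):
    """Embed the request context in the page so client-side code can bind it."""
    # Escape "</" so the JSON cannot close the script element early.
    payload = json.dumps(context).replace("</", "<\\/")
    return (
        "<html><head><script>"
        f"window.requestContext = {payload};"
        "</script></head>"
        f"<body>{body_html}</body></html>"
    )


context = new_request_context()
server_event = {"event": "request", "contexts": [context]}  # emitted server-side
page = render_page("<p>Hello</p>", context)
```

Both the server-side request event and every client-side event from that page then carry the same request ID, which is what makes the two sides comparable.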

I’d be interested to see the results of this!

Just a note on this:

Currently server side trackers do not seem to receive as much love as javascript tracker, they are treated a bit like second class citizens

Things are a little more nuanced than this - it is true that the surface area of the client-side trackers (JS / iOS / Android / .NET) is greater than the server-side trackers; there’s just more interesting stuff to capture (e.g. geolocation, client sessions), and more developer integration options that we need to support.

But at least some of the server-side trackers have gone through many releases (e.g. the Python Tracker), are used operationally at Snowplow (e.g. the Ruby Tracker) and have some unique features in them (e.g. the Scala Tracker’s EC2 instance context). However, yes, there is always more to do!


#4

@mike, I do not necessarily advocate competing with APM vendors, mainly because I’m scared of the complexities they have to deal with to get basic functionality out of the door. I’d like to frame this as “integrated tracking”. Below is my response point by point.

  • We did it by combining what you know as tracker and collector into a single logical unit. Thrift records were submitted to a local service (think local agent: FlumeNG, Facebook ScribeD, etc.), whose job was to guarantee delivery to a centralized processing facility. An alternative implementation is to run an event logging thread pool with a small memory buffer. In either case, the web server fires and forgets.
    In Snowplow’s case the tracker can be set to keep-alive, and you won’t believe how much more data can be pushed through a kept-alive connection! Add to it a managed connection pool and a load balancer on the collector’s end and you’re golden.

  • No order, but you will need a trace key and possibly a trace sequence number. It is almost never important to get things back in order, because you can maintain the state offline in near real time. I was able to reconstitute a marketing funnel comprised of client-side and server-side events against a state database seeded with 2 billion acquisition keys at a rate of 800K transactions per second on 5 inexpensive physicals and a VoltDB deployment. Trading desk developers will disagree, but most of them would object to HTTP trackers anyway.

  • Microservices - probably up to the software architects. Remember what we’re solving for: if each can be reached and triggered independently with the master unaware, then I would log on them too. If there’s one possible entry point and the master aggregator blocks until the transaction completes, then it makes sense to log at the orchestrator level only.

  • I’ve never witnessed cache-inflicted discrepancies; we made sure small JavaScript includes with tracer variables were always shipped from the origin with headers preventing them from being legitimately cached. Local browser cache and CDN cache were treated identically.

  • I think you just log the failure and reconcile (expecting no corresponding downstream events to be fired).

  • Not sure how to record non-HTTP requests without a proper SDK or tracker built for that environment. APM vendors (Nastel, AppDynamics, NewRelic etc.) do some crazy tag reconciliation to make sure they can track things through and through. Most of that machinery relies on the JVM, though with some notable deviations.

  • Scalability - if you have autoscale enabled, then it is not a problem. If you are an obsessive-compulsive running with a calculator, then you do a pre-deployment assessment in a dev/staging environment and, with some predictive analytics, scale the clusters ahead of the new volumes. Both cases are no different from introducing a new feature, running burst marketing campaigns, on-boarding a new client, etc. Life as usual. Obviously with more data logged, processed and retained the bills will increase, but it’s a choice we all have to make.
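
The fire-and-forget approach from the first bullet above (an event logging thread with a small memory buffer) can be sketched in a few lines of Python. This is an illustration under assumptions, not how any Snowplow tracker is implemented: `BufferedEmitter` and `send` are hypothetical names, a single worker thread stands in for a pool, and events are dropped when the buffer is full, whereas a local agent would spool them to guarantee delivery.

```python
import queue
import threading


class BufferedEmitter:
    """Hypothetical fire-and-forget emitter with a bounded in-memory buffer."""

    def __init__(self, send, max_buffer=1000):
        self._send = send  # callable doing the actual (slow) network I/O
        self._queue = queue.Queue(maxsize=max_buffer)
        self.dropped = 0
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def track(self, event):
        """Called on the request path; returns immediately, never blocks."""
        try:
            self._queue.put_nowait(event)
        except queue.Full:
            self.dropped += 1  # shed load rather than slow the request down

    def _drain(self):
        while True:
            event = self._queue.get()
            try:
                if event is None:  # sentinel: stop the worker
                    return
                self._send(event)
            except Exception:
                pass  # a real agent would retry or spool to disk here
            finally:
                self._queue.task_done()

    def close(self):
        """Flush what is buffered and stop the worker thread."""
        self._queue.put(None)
        self._worker.join()


sent = []
emitter = BufferedEmitter(sent.append, max_buffer=100)
for n in range(5):
    emitter.track({"n": n})
emitter.close()
```

The request handler only ever pays for a queue insert; delivery latency, retries and connection reuse are the worker’s problem.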

… One thing that no one commented on is the ability to generate a server-side pageview event unmolested by ad blockers, IP blockers and other unpleasant realities of event tracking on the web today. This is something the JavaScript tracker does with its eyes closed, but the server side trackers aren’t really built for. But a web application server is an extension of a web browser - it has deep knowledge of the HTTP request and HTTP response and houses a lot, if not all, of the application business logic. It enforces authentication, authorization and audit logging, and it has a lot to say about what passes through it without having to send that data to the browser just to get logged. Some properties, for example, want to prevent price comparison shoppers from scraping the price points, but standard e-com practices call for exposing that data in the GTM data layer… You really don’t have to, if you don’t want to. In a hybrid mode, those data points will be merged into a single record on the backend.