Server-side contexts and end-to-end latency

Hi community!

My company wants a user event logging system to power real-time control systems for marketplaces (content recommendations, personalization, promotions, etc.). We’re evaluating Snowplow, and we have some requirements I’d like to validate.

  1. Server-side contexts. For each event, I want to log server-side-generated contexts about the request and the items being interacted with. I want to keep this information on the server side for two reasons: it’s a lot of data, and it’s sensitive. A majority of the logged data will be of this type. What’s a good practice for supporting this in Snowplow? I found a generic Enrichment REST API hook, but that seems inefficient for this type of logging. Do people fork Snowplow and modify the pipeline directly?

  2. End-to-end latency. Longer-term, I want to reduce the end-to-end ingestion latency as much as possible so we can use the client signals as soon as possible. What’s the end-to-end latency goal for Snowplow? I see some docs saying this can get down to a few seconds (Kinesis + Flink streaming). Does this latency apply to enrichments and data models?

Thanks!

  • Dan

@anton - I noticed New Horizons Enrichment. Looks exciting! If I want to do more complex joins in the enrichment step, how well does that work with fs2?

Depending on your programming language of choice, you can use one of the server-side tracking SDKs. Most languages are supported (Go, Java, PHP, Python, etc.).
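To make that concrete, here’s a minimal sketch (plain Python, no SDK dependency) of the self-describing envelope a server-side tracker builds when you attach custom contexts to an event. The `com.example` schema URIs and field values are hypothetical placeholders; in practice the tracker SDK assembles this payload for you.

```python
import json

def build_event(event_schema, event_data, contexts):
    """Wrap event data and contexts in Snowplow's self-describing envelope."""
    return {
        # Self-describing (a.k.a. unstructured) event payload property.
        "ue_pr": json.dumps({
            "schema": "iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0",
            "data": {"schema": event_schema, "data": event_data},
        }),
        # Custom contexts payload property.
        "co": json.dumps({
            "schema": "iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-1",
            "data": contexts,
        }),
    }

# Hypothetical server-side context describing the request that produced an item.
request_context = {
    "schema": "iglu:com.example/request/jsonschema/1-0-0",
    "data": {"request_id": "req-123", "experiment": "ranker_v2", "latency_ms": 42},
}

event = build_event(
    "iglu:com.example/insertion/jsonschema/1-0-0",
    {"item_id": "sku-9", "position": 3},
    [request_context],
)
```

Each context is itself a self-describing JSON, so the enriched event carries your server-side data in schema-validated form.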

I don’t think there’s a fixed latency goal, as latency is roughly proportional to the resources provisioned to the pipeline, and also depends on the pipeline’s configuration and the underlying cloud technology.

In general, < 30 seconds from collection to BigQuery on GCP, or from collection to the enriched stream, is the most common latency. At the moment Flink streaming isn’t part of the pipeline; however, a Kinesis stream is provided that any technology can read from.

For the most part, yes, though this depends on which enrichments are enabled and how they are configured. For example, an API enrichment calling an API that takes 50 ms to respond will slow down processing far more than the IAB enrichment. Data models aren’t typically run in real time; they tend to run in batches or micro-batches, which allows for things like sessionisation and windowing over long time periods that are trickier to achieve in real time.
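As a back-of-envelope illustration of why a slow API enrichment dominates (the numbers below are illustrative, not benchmarks):

```python
def events_per_second(per_event_ms, workers):
    # Each worker processes one event at a time, spending per_event_ms on it,
    # so throughput scales with workers and inversely with per-event cost.
    return workers * 1000 / per_event_ms

api_call_ms = 50       # synchronous HTTP call to an external API per event
local_lookup_ms = 0.5  # in-memory lookup (e.g. an IAB-style local database)

slow = events_per_second(api_call_ms, workers=8)      # 160 events/s
fast = events_per_second(local_lookup_ms, workers=8)  # 16000 events/s
```

With a 50 ms synchronous call in the hot path, you need 100x the parallelism (or caching) to match a local lookup’s throughput.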

Thanks for the replies!

Depending on your programming language of choice, you can use one of the server-side tracking SDKs. Most languages are supported (Go, Java, PHP, Python, etc.).

Would I have to join the server-side-only context with the events before sending them to Snowplow? The same machine won’t have all of the context. I’m curious whether Snowplow has a way to track these extra contexts and perform the joins later.

At the moment Flink streaming isn’t part of the pipeline however a Kinesis stream is provided that any tech can be used to read off.

What’s the recommended setup on AWS if I want very low latency? I saw the New Horizons work.

Do you have a mocked-out example of what you’d like to add and where / when the data exists? In general you can add a base layer of information in the tracking call (via one of the SDKs), and you can augment this with additional information via the enrichments functionality - for example, when you don’t have the data at tracking time.
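For reference, the hook Dan mentioned is the API Request Enrichment, which is configured with a self-describing JSON. The sketch below is only indicative of its shape - the `com.example` schemas, service URI, and JSONPaths are hypothetical, so check the current config schema before relying on any field names:

```json
{
  "schema": "iglu:com.snowplowanalytics.snowplow.enrichments/api_request_enrichment_config/jsonschema/1-0-0",
  "data": {
    "name": "api_request_enrichment_config",
    "vendor": "com.snowplowanalytics.snowplow.enrichments",
    "enabled": true,
    "parameters": {
      "inputs": [
        {
          "key": "requestId",
          "json": {
            "field": "unstruct_event",
            "schemaCriterion": "iglu:com.example/request/jsonschema/1-*-*",
            "jsonPath": "$.request_id"
          }
        }
      ],
      "api": {
        "http": {
          "method": "GET",
          "uri": "http://internal-metadata-service/requests/{{requestId}}",
          "timeout": 2000,
          "authentication": {}
        }
      },
      "outputs": [
        {
          "schema": "iglu:com.example/request_metadata/jsonschema/1-0-0",
          "json": {"jsonPath": "$"}
        }
      ],
      "cache": {"size": 3000, "ttl": 60}
    }
  }
}
```

Note the cache settings - with a hot cache the per-event cost can drop well below the raw HTTP round-trip.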

I believe at this stage fs2 only runs on GCP - for AWS the recommended setup is to use stream-enrich.

mike
December 20

DanHill:

Would I have to join the server-side-only context with the events before sending them to Snowplow? The same machine won’t have all of the context. I’m curious whether Snowplow has a way to track these extra contexts and perform the joins later.

Do you have a mocked-out example of what you’d like to add and where / when the data exists? In general you can add a base layer of information in the tracking call (via one of the SDKs), and you can augment this with additional information via the enrichments functionality - for example, when you don’t have the data at tracking time.

Yeah, I can give an example.

  • A user visits a page. We log a Pageview.
  • The page requests a list of items to buy using parameters on the page. On the server side, we log a Request that contains a bunch of server info (parameters and experiment info, latency records, server execution details, etc.). We also log an Insertion (a candidate for an impression) for each item, containing item metadata, ranking info, info used for future optimizations, private pricing info, etc. Both of these are kept server-side.
  • When an item has been visible on screen for long enough, we log an Impression event.
  • If a user interacts with the items, we log events for those interactions.

Currently, I have a prototype Flink job that does a temporal join for all of these.
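For anyone following along, here’s a stripped-down sketch (plain Python, not Flink; the names and records are illustrative) of the kind of keyed join that job performs: buffer server-side Request/Insertion records by `request_id`, then attach them to client events that arrive later.

```python
from collections import defaultdict

# request_id -> {record kind -> server-side record}; stands in for Flink keyed state.
server_context = defaultdict(dict)

def on_server_record(request_id, kind, record):
    """Buffer a server-side record (Request, Insertion, ...) under its request_id."""
    server_context[request_id][kind] = record

def on_client_event(request_id, event):
    """Join a client event (impression/interaction) with buffered server context."""
    return {**event, "server": server_context.get(request_id, {})}

on_server_record("req-1", "request", {"experiment": "ranker_v2"})
on_server_record("req-1", "insertion", {"item_id": "sku-9", "rank": 3})

joined = on_client_event("req-1", {"type": "impression", "item_id": "sku-9"})
```

A real streaming job additionally needs watermarks and state TTL/eviction so buffered context doesn’t grow without bound, plus handling for client events that arrive before their server records.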

DanHill:

What’s the recommended setup on AWS if I want very low latency? I saw the New Horizons work.

I believe at this stage fs2 only runs on GCP - for AWS the recommended setup is to use stream-enrich.

Will fs2 eventually be used on AWS too?