Suggestions about how to filter out data from the enriched stream


#1

We need a subset of the events from the enriched stream. We thought about using Kinesis Analytics, which uses a familiar SQL language and would be easier for us to maintain. Does that sound reasonable, or would you suggest something different?

If we go for Kinesis Analytics, do you have a sample of how to query it? Our records contain two data formats, CSV and JSON (for the context), in the same record, and we need to query both.

Thanks in advance

Josi


#2

As you’ve flagged, the last time I checked Kinesis Analytics still only worked on either simple delimited records (CSV/TSV) or JSON, not on records that mix the two.

There are a few alternatives here depending on your use case, throughput and the technology you want to use. This isn’t an exhaustive list, just some of the more common options.

Apache Spark
Spark Streaming has first-class support for connecting to Kinesis, and this can be combined with one of the Snowplow Analytics SDKs to make things easier.

Kinesis Tee
Kinesis Tee can run arbitrary transformations and filters on a Kinesis stream and push data into another stream. It includes a transformation from the TSV+JSON format into nested JSON called SNOWPLOW_TO_NESTED_JSON. You might then be able to pipe this output stream into Kinesis Analytics to query.
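
To make the idea concrete, here is a minimal plain-Python sketch of what "TSV+JSON into nested JSON, then filter" means. It is not Kinesis Tee's implementation and does not use its API. The three-column layout (`app_id`, `event`, `contexts`), the `plan` key and the filter condition are all hypothetical and chosen only for illustration; a real enriched event has many more columns.

```python
import json

# Hypothetical, shortened column layout for illustration only; a real
# enriched event has many more tab-separated columns in a fixed order.
COLUMNS = ["app_id", "event", "contexts"]
JSON_COLUMNS = {"contexts"}


def to_nested(line):
    """Turn one tab-separated record into a dict, parsing the JSON columns."""
    values = line.rstrip("\n").split("\t")
    record = {}
    for name, value in zip(COLUMNS, values):
        if value == "":
            record[name] = None  # empty column
        elif name in JSON_COLUMNS:
            record[name] = json.loads(value)  # embedded JSON becomes nested data
        else:
            record[name] = value
    return record


def keep(record):
    """Example filter (an assumption): 'pro' plan page views from one app."""
    contexts = record["contexts"] or {}
    return (
        record["app_id"] == "shop"
        and record["event"] == "page_view"
        and contexts.get("plan") == "pro"
    )


def filter_stream(lines):
    """Yield nested-JSON strings for the records that pass the filter."""
    for line in lines:
        record = to_nested(line)
        if keep(record):
            yield json.dumps(record, sort_keys=True)


# Made-up sample records, one per line, as they might arrive from a stream.
sample = [
    'shop\tpage_view\t{"plan": "pro"}',
    'shop\tpage_view\t{"plan": "free"}',
    'blog\tpage_view\t',
]
results = list(filter_stream(sample))
```

Once every record is plain JSON like this, a JSON-only tool can query the delimited fields and the context fields together, which is the reason for putting a transformation step in front of Kinesis Analytics.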

Apache Beam
Apache Beam offers some more flexible stream processing options compared to Spark Streaming and Kinesis Analytics. Kinesis isn’t fully supported as a sink in all the SDKs, but there’s been a lot of progress on the Java SDK in particular.