Hi,
The headline is perhaps misleading, since there are servers involved, but I’m trying to keep the management of the Snowplow infrastructure as light-touch as possible. With that in mind, I’m writing the architecture in Terraform and using SaaS (and inherently scalable) components where I can.
I have the following pipeline in mind, and so far I’ve got as far as the S3 sink:
- Scala collect (CodeDeploy in an ASG)
- Kinesis stream
- Scala enrich (CodeDeploy in an ASG)
- Kinesis stream
- Kinesis S3 sink (likely a container started every minute by a CloudWatch event, with a large buffer specified so that work is only undertaken once the buffer has filled, not on every run).
- File written and S3 event sent to AWS Lambda.
- AWS Lambda to start a container running storage-loader.
- Load to Redshift.
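To make the S3 event and Lambda steps concrete, here is a minimal sketch of what such a Lambda handler might look like in Python. It is an illustration only, not a tested part of the pipeline. The cluster, task definition, and container names and the `SINK_BUCKET` / `SINK_KEY` environment variables are all hypothetical, it assumes the container runs on ECS with the EC2 launch type (Fargate would also need a network configuration), and it does not settle how storage-loader itself would be configured to pick up the file.

```python
import urllib.parse

# Hypothetical names, for illustration only.
CLUSTER = "snowplow-loader"
TASK_DEFINITION = "storage-loader"
CONTAINER_NAME = "storage-loader"


def build_run_task_params(event):
    """Turn an S3 event notification into keyword arguments for ECS RunTask."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
    return {
        "cluster": CLUSTER,
        "taskDefinition": TASK_DEFINITION,
        "count": 1,
        "overrides": {
            "containerOverrides": [
                {
                    "name": CONTAINER_NAME,
                    # Hypothetical variables telling the container what was written.
                    "environment": [
                        {"name": "SINK_BUCKET", "value": bucket},
                        {"name": "SINK_KEY", "value": key},
                    ],
                }
            ]
        },
    }


def handler(event, context):
    """Lambda entry point: start one loader container per S3 event."""
    import boto3  # imported here so the helper above works without boto3 installed

    ecs = boto3.client("ecs")
    response = ecs.run_task(**build_run_task_params(event))
    return [task["taskArn"] for task in response["tasks"]]
```

Keeping the event parsing in its own function means it can be checked locally with a sample event, without AWS credentials.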
It mixes the batch and real-time components, but I haven’t read anything that says this can’t be done at present.
- As a novice, I’d like to know whether it’s appropriate to use the Kinesis S3 sink at this point in the chain. Or is it meant to be used for un-enriched data only?
- Can I please have your comments on any issues you might see with this architecture?
Thanks in advance,
Graham