Setup Each Service

Hi all,

Does anybody have setup instructions for building an environment in AWS with each service, similar to Snowplow Mini?

Hi @nando_roz

Could you elaborate a little more on what setup instructions you’re looking for?

You should find all the documentation on our documentation site.

Hi Paul, thanks for your reply. I would like to implement the same approach as Snowplow Mini in my AWS account, but using a separate AWS service for each part, e.g.:

How do I install the collector and enrichment on an EC2 instance?
How do I send the events to Elasticsearch and also to S3?

Do you have any support for this approach?

Thanks in advance

You can use our new Terraform Modules for setting up the Collector and Enrich steps if you’re familiar with Terraform.
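In case it helps, here's a rough sketch of what pulling in those modules can look like. The module sources follow the snowplow-devops naming on the Terraform Registry, but the inputs are deliberately left as comments: the exact variable names (VPC, subnets, SSH key, stream names, SSL settings, etc.) should come from each module's README, not from this example.

```hcl
# Sources assume the snowplow-devops modules on the Terraform Registry.
# Required inputs are intentionally omitted -- copy them from each module's README.
module "collector" {
  source = "snowplow-devops/collector-kinesis-ec2/aws"
  # name, vpc_id, subnet_ids, good/bad Kinesis stream names, SSL settings, ...
}

module "enrich" {
  source = "snowplow-devops/enrich-kinesis-ec2/aws"
  # name, vpc_id, subnet_ids, raw/enriched/bad Kinesis stream names, ...
}
```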

We’ve also published a complete quick start guide, which helps you set up a full pipeline (including an Iglu Server for custom events) that loads into S3 and Postgres RDS.

For other components outside of this setup, we have our classic setup documentation as well as component-specific information. The Elasticsearch Loader, since you asked about it specifically, is documented here along with the repository.


Hi Paul, great. I finished running the quick start guide successfully. I have some questions:

  1. How can I find the correct EC2 instance size for all the applications, and for Postgres, instead of the t3.micro and db.t3.micro, for my data volume? Do you have a sizing table for different volumes?
  2. Is it possible to run all the applications on just one EC2 instance instead of the 8 EC2 instances created in the quick start guide?
  3. Should I use all the instances created by the quick start example Terraform, or can I configure things myself another way?
  4. I am testing by sending some events and it works fine, but when I send events every 1 second (as in the quick start guide example), the files don't appear in S3 immediately. Only the first one does; the second and third appear about 3 minutes after each other. I am only sending one page_view per second. Should I configure something else?
  5. I am getting a timeout error when accessing Postgres through pgAdmin, for example. The DB server is in public access mode and the security group allows access from my IP. I am checking whether it is some other problem, but any tips would be great :slight_smile: .
  6. To send events to Elasticsearch, could I configure a Kinesis delivery stream with the ES cluster as the output, or do you have another approach?

Sorry for my long text :smile:

Regards,

Hey @nando_roz - not Paul, but I'll try to answer some of these for you!

How can I find the correct EC2 instance size for all the applications, and for Postgres, instead of the t3.micro and db.t3.micro, for my data volume? Do you have a sizing table for different volumes?

The right size is the size at which your pipeline is processing data without any latency building up. So if the Kinesis lag metrics look healthy and the CPU on your nodes is within acceptable bounds, you do not really need to change anything!

There is no table as such for different volumes, but generally speaking, up to about 100 RPS you shouldn't need to change anything - after this point you will want to look at each component individually and scale it up, either horizontally or vertically, to ensure it keeps up with load.
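If you want to watch that lag automatically, a minimal sketch of a CloudWatch alarm on Kinesis iterator age looks like this (the stream name is a placeholder for whatever your pipeline actually created):

```hcl
# Alarm when consumers of the enriched stream fall more than 5 minutes behind.
# The stream name is a placeholder -- point it at your own raw/enriched streams.
resource "aws_cloudwatch_metric_alarm" "enriched_stream_lag" {
  alarm_name          = "snowplow-enriched-stream-lag"
  alarm_description   = "Enrich/loaders are not keeping up with incoming volume"
  namespace           = "AWS/Kinesis"
  metric_name         = "GetRecords.IteratorAgeMilliseconds"
  dimensions          = { StreamName = "snowplow-enriched-stream" }
  statistic           = "Maximum"
  period              = 300
  evaluation_periods  = 1
  comparison_operator = "GreaterThanThreshold"
  threshold           = 300000 # 5 minutes, in milliseconds
}
```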

Is it possible to run all the applications on just one EC2 instance instead of the 8 EC2 instances created in the quick start guide?

The quickstart guide only works by deploying individual apps to individual services. You could of course set up your own deployment to do this if you wanted, but you will lose out on the ability to upgrade and scale components in isolation.

Should I use all the instances created by the quick start example Terraform, or can I configure things myself another way?

If there are things you do not want (like saving “raw” data to S3) you can disable / remove that module from the quick-start.

I am testing by sending some events and it works fine, but when I send events every 1 second (as in the quick start guide example), the files don't appear in S3 immediately. Only the first one does; the second and third appear about 3 minutes after each other. I am only sending one page_view per second. Should I configure something else?

This is by design - the S3 Loader rotates files to S3 every 3 minutes by default. Doing this more often would result in thousands and thousands of objects landing in S3, which would cost vastly more and make it almost impossible to work with the data on S3. You can adjust the time_limit here: https://github.com/snowplow-devops/terraform-aws-s3-loader-kinesis-ec2/blob/main/variables.tf#L151-L155
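As a sketch, overriding it would look something like the below. The variable name used here is an assumption - confirm the exact name against the linked variables.tf - and the module's other required inputs are omitted for brevity:

```hcl
# Variable name is an assumption -- confirm it against the linked variables.tf.
# Other required inputs for the module are omitted for brevity.
module "s3_loader_enriched" {
  source = "snowplow-devops/s3-loader-kinesis-ec2/aws"
  # ... other inputs from the quick-start ...

  # Rotate files to S3 every 60 seconds instead of the 3-minute default.
  buffer_time_limit = 60
}
```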

I am getting a timeout error when accessing Postgres through pgAdmin, for example. The DB server is in public access mode and the security group allows access from my IP. I am checking whether it is some other problem, but any tips would be great.

I would perhaps try psql on your command line as a lower-level way to connect, but it could also be that you are not allowed to connect out over port 5432 from your local network. Some work networks block outbound port ranges, and RDS could be on that block list.

To send events to Elasticsearch, could I configure a Kinesis delivery stream with the ES cluster as the output, or do you have another approach?

We have an ES Loader Terraform module in internal testing at the moment - will update here once it's ready for you to grab!

Hi Josh,

Thanks a lot for your reply.

  • About RPS: in my case I sometimes have 5,000 to 6,000 users at the same time, so 100 RPS would not be enough. Which component would be overloaded first?
  • About the S3 loader time: if I understood correctly, the quick start guide builds a real-time pipeline, so setting the S3 loader to write every 3 minutes means S3 is not real time, and I would instead see the events in Postgres in under 3 minutes. Am I correct?
  • About ES, that will be great :slight_smile:

Regards,
Fernando

About RPS: in my case I sometimes have 5,000 to 6,000 users at the same time, so 100 RPS would not be enough. Which component would be overloaded first?

If that translates to 6000 RPS, you will hit limits in quite a few places! Your biggest bottleneck will be Kinesis shard counts (you will likely want at least 10 shards for the “raw” and “enriched” streams), but the size of your RDS cluster will also become a factor.

Beyond this, the CPU of each of these components will max out pretty quickly on a single t3.micro:

  1. Postgres Loader (most intensive on CPU)
  2. Stream Enrich (second highest on CPU)
  3. S3 Loader (will need higher memory as it caches the window in memory)
  4. Collector

You will need to work through each component at your peak volume and check that it has enough headroom, looking at Kinesis throughput, RDS throughput and CPU allocation for each microservice.
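As a rough sketch of the Kinesis side (this assumes you manage the stream resources yourself; the stream names and shard counts are illustrative, not a recommendation for your exact volume):

```hcl
# Sized for roughly 6000 RPS at the collector; adjust names and shard counts
# to match your own pipeline and measured peak volume.
resource "aws_kinesis_stream" "raw" {
  name        = "snowplow-raw-stream"
  shard_count = 10
}

resource "aws_kinesis_stream" "enriched" {
  name        = "snowplow-enriched-stream"
  shard_count = 10
}
```

Each of the microservice modules should also expose an instance type input so you can move the heavier components off t3.micro - check each module's variables.tf for the exact name.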

About the S3 loader time: if I understood correctly, the quick start guide builds a real-time pipeline, so setting the S3 loader to write every 3 minutes means S3 is not real time, and I would instead see the events in Postgres in under 3 minutes. Am I correct?

Into Postgres you should generally see events in about 1 second. S3 can be near real time, but it makes no sense to store each event as an individual object - if you want files written faster or slower, you just need to adjust the buffer thresholds as linked above.

Hi Josh,

About RPS - just to make sure I understand, could 6000 RPS in my case mean 6000 simultaneous page_views?

About Postgres - I can access it now, but after sending events I can't see them in the database. Should I configure something else in the Terraform module? I am not receiving any errors.

About RPS - just to make sure I understand, could 6000 RPS in my case mean 6000 simultaneous page_views?

It is the number of requests hitting your Snowplow Collector per second. It's likely your true RPS would be lower, as it's very unlikely for everyone on your site to be viewing a page at exactly the same moment - for example, 6000 concurrent users each sending a page view roughly every 10 seconds works out to around 600 RPS. It's something you will need to benchmark and measure, however.

About Postgres - I can access it now, but after sending events I can't see them in the database. Should I configure something else in the Terraform module? I am not receiving any errors.

Your data should be in the “snowplow” database in the “atomic.events” table. If nothing is showing up there it would be worth checking the logs for the Postgres Loader to see if there are any errors.


I am receiving this error:

Hey @nando_roz, looks like your Iglu Server URL is missing its protocol (http(s)) - can you post the variables for your custom Iglu Server here so we can see if that's the issue?

Hi,

Could be that.

image

Yep, you need to add http:// to the front of that and it should start to work, as it's documented here: https://github.com/snowplow/quickstart-examples/blob/main/terraform/aws/pipeline/default/terraform.tfvars#L29
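For reference, a minimal sketch of that line in terraform.tfvars (the variable name follows the linked example; the hostname is a placeholder for your own Iglu Server):

```hcl
# terraform.tfvars -- hostname below is a placeholder for your Iglu Server.
# Note the explicit http:// protocol prefix.
iglu_server_dns_name = "http://your-iglu-server.example.com"
```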

Perfect. It works.
Thanks a lot for your support


Glad to hear it @nando_roz !