Increasing execution time on batch mode instance


#1

Hi,

I’ve set up a Snowplow 1-hour batch instance and everything seems to work. I can get page view events with my Clojure collector, and I can shred and store them in a Redshift cluster.

However, over these last 2 days I’m noticing a constant increase in the run time of my EMR jobs. In a batch setup this obviously becomes a problem. I allocated a couple of m1.large instances, but I can’t believe I need more compute resources to process “only” 3M events (265K peak/hour).
The only enrichment I’ve set up is geo-IP localization.
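For context, the pipeline is kicked off hourly from cron via a small wrapper script roughly like this (paths are placeholders, and the exact flags depend on the EmrEtlRunner release):

    #!/bin/bash
    # run_snowplow.sh - invoked hourly from cron (illustrative paths;
    # exact flags depend on the EmrEtlRunner release)
    /opt/snowplow/snowplow-emr-etl-runner \
      --config /opt/snowplow/config/config.yml \
      --resolver /opt/snowplow/config/iglu_resolver.json \
      --enrichments /opt/snowplow/config/enrichments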

Am I missing something?

Thank you for your time,
Federico


#2

Hi @Federico1 - can you share:

  • The EMR configuration part of your config.yml
  • The EMR run times of your last 5 runs
  • The number of events processed in each of your last 5 runs

Thanks!


#3

Thanks for your answer; here is what you asked for:

config.yml

    emr:
      ami_version: 4.5.0      # Don't change this
      region: eu-west-1       # Always set this
      jobflow_role: EMR_EC2_DefaultRole # Created using aws emr create-default-roles
      service_role: EMR_DefaultRole     # Created using aws emr create-default-roles
      placement:              # Set this if not running in VPC. Leave blank otherwise
      ec2_subnet_id:          # Set this if running in VPC. Leave blank otherwise
      ec2_key_name: tag
      bootstrap: []           # Set this to specify custom bootstrap actions. Leave empty otherwise
      software:
        hbase:                # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
        lingual:              # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.
      # Adjust your Hadoop cluster below
      jobflow:
        master_instance_type: m1.large
        core_instance_count: 2
        core_instance_type: m1.large
        task_instance_count: 0 # Increase to use spot instances
        task_instance_type: m1.large
        task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
      bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
      additional_info:        # Optional JSON string for selecting additional features

executions

Number of events in each of the last 5 runs:

  • 733492
  • 806176
  • 957370
  • 988477
  • 1124883
The last 5 runs happen to handle an increasing number of events; however, 2 days ago I got a much more irregular series, such as:

  • 651477
  • 1503891
  • 1752706
  • 36852
  • 271039

This data consists only of page view events, enriched only with ip_lookup.json.
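For reference, an ip_lookups geo enrichment of this kind typically looks roughly like this (the schema version and MaxMind database shown here are illustrative and may differ between Snowplow releases):

    {
      "schema": "iglu:com.snowplowanalytics.snowplow/ip_lookups/jsonschema/1-0-0",
      "data": {
        "name": "ip_lookups",
        "vendor": "com.snowplowanalytics.snowplow",
        "enabled": true,
        "parameters": {
          "geo": {
            "database": "GeoLiteCity.dat",
            "uri": "http://snowplow-hosted-assets.s3.amazonaws.com/third-party/maxmind"
          }
        }
      }
    }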

Thanks in advance


#4

Thanks for sharing @Federico1. I would suggest updating to:

    jobflow:
      master_instance_type: m1.medium
      core_instance_count: 3
      core_instance_type: m3.xlarge
      task_instance_count: 0
      task_instance_type: m1.small
      task_instance_bid: 0.25

Give that a go and please share your new job times.


#5

Thanks for your answer. I applied the suggested changes and I’ll keep track of the new execution times.
May I ask which misconfiguration you spotted?

I’ll write you back in a while.

Thanks again


#6

Hey @Federico1:

  • Your master instance type was a little overprovisioned
  • Your core instance cluster was a little underprovisioned

Let us know how you get on!

Alex


#7

Hi,
after a few days I’m still encountering the problem.

I’ve noticed that when a run fails I can find already-processed data in the processing folder, and it seems to re-process every event since the instance was first set up. I don’t remember ever configuring it that way.

Have you ever encountered this (probable) misconfiguration?


#8

Hi @Federico1, I’m wondering if you are falling foul of this:

Important 2: do not put your raw:processing inside your raw:in bucket, or your enriched:good inside your raw:processing, or you will create circular references which EmrEtlRunner cannot resolve when moving files.

From this EmrEtlRunner configuration documentation.
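For example, a bucket layout that avoids those circular references keeps each stage in a distinct location, roughly along these lines (bucket names are placeholders):

    buckets:
      raw:
        in:
          - s3://acme-collector-logs                    # collector output only
        processing: s3://acme-snowplow-etl/processing   # not inside raw:in
        archive: s3://acme-snowplow-archive/raw
      enriched:
        good: s3://acme-snowplow-data/enriched/good     # not inside raw:processing
        bad: s3://acme-snowplow-data/enriched/bad
        archive: s3://acme-snowplow-archive/enriched/good
      shredded:
        good: s3://acme-snowplow-data/shredded/good
        bad: s3://acme-snowplow-data/shredded/bad
        archive: s3://acme-snowplow-archive/shredded/good

With a layout like this, no stage path is a prefix of another, so EmrEtlRunner can move files between stages without tripping over its own output.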

Let me know?


#9

Thanks for your suggestion, you were right.

Sorry for wasting your time.


#10

No worries @Federico1 - thanks for letting us know what the problem was!