ETL Shred step taking longer and longer

No, Hadoop Shred is the harder-working job (assuming all Hadoop Enrich enrichments are performing well), and increasingly so as it does more work around event de-duplication…

Ahh, makes sense. I totally forgot about the dedupe and was just thinking it was mapping them out for import into Redshift.

I’m trying to wrap my head around this math here… The workload size according to your post is 14GB or within an order of magnitude of it. How does that relate to consuming 2TB of Hadoop capacity? This makes zero sense to me, but I observe similar behavior on my workloads.

So the 14GB figure is from S3, which means it's compressed, and text compresses really well.

I am not sure whether the records are stored uncompressed on HDFS, and I also assume there would be multiple copies for raw, enriched, shredded, etc.

But I am not sure whether that would equate to 2TB, or whether my assumptions have any basis…
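Here's a rough back-of-the-envelope sketch of how those assumptions could stack up; the decompression ratio, number of intermediate copies, and replication factor below are illustrative guesses, not measured values:

```python
# Rough estimate of on-cluster footprint for 14GB of compressed input.
# All multipliers below are assumptions for illustration only.

compressed_input_gb = 14      # gzipped raw events pulled from S3
decompression_ratio = 10      # text often compresses roughly 10:1, so ~140 GB raw
copies = 3                    # raw, enriched and shredded copies kept during the run
hdfs_replication = 3          # default HDFS replication factor

footprint_gb = compressed_input_gb * decompression_ratio * copies * hdfs_replication
print(f"~{footprint_gb} GB of HDFS capacity")   # ~1260 GB, i.e. roughly 1.3 TB
```

That lands in the same ballpark as the 2TB figure, although the real numbers depend on the actual compression ratio and how many intermediate copies the job keeps around.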

Just a quick update: I am now running twice a day, processing 15-20 million events per batch and taking about 1h20m to do that.

So it looks like it was just struggling a bit with the higher volumes (72m and 58m rows) coupled with the large number of task servers.
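As a rough sanity check on that throughput (assuming the 1h20m covers the whole batch):

```python
# Implied throughput for the smaller twice-daily batches.
events_low, events_high = 15_000_000, 20_000_000
run_seconds = 80 * 60   # roughly 1h20m per batch

print(f"{events_low / run_seconds:,.0f} - {events_high / run_seconds:,.0f} events/sec")
# -> roughly 3,100 - 4,200 events/sec
```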

I am still running with just core nodes, but would like to test task nodes at some point, as it would be a significant cost saving.

Cheers,
Dean