Snowplow RDB Loader R31 released

We’re happy to announce release R31 of RDB Loader and Shredder, with a new bad rows format and data quality improvements.


Hello! We’ve noticed a bug in R31 that may be impacting some users. The result is a significant increase in shredded/bad data.

The bug is in the schema validator: it treats the elements of a property with type [array, null] as invalid when the property itself is null. We believe this case is quite rare, but if you’re on this release, please check whether it occurs in your data.
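To check whether your data hits this case, something like the following minimal sketch may help. It is not part of RDB Shredder: the function names and the example schema are hypothetical, and it only inspects top-level properties of a JSON Schema loaded as a Python dict.

```python
def nullable_array_properties(schema):
    """Return names of top-level properties whose type allows both array and null."""
    names = []
    for name, spec in schema.get("properties", {}).items():
        declared = spec.get("type")
        # Only a list-valued "type" can allow both "array" and "null".
        if isinstance(declared, list) and "array" in declared and "null" in declared:
            names.append(name)
    return names


def affected_properties(schema, data):
    """Return nullable-array properties that are present and explicitly null in data."""
    return [
        name
        for name in nullable_array_properties(schema)
        if name in data and data[name] is None
    ]


# Hypothetical schema, for illustration only.
example_schema = {
    "type": "object",
    "properties": {
        "tags": {"type": ["array", "null"], "items": {"type": "string"}},
        "name": {"type": "string"},
    },
}

print(affected_properties(example_schema, {"name": "a", "tags": None}))
```

Any non-empty result points at a property that could be affected by the R31 validator bug described above. Nested properties would need a recursive walk, which this sketch leaves out.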

We will be pushing R32 shortly, which will include a fix for this issue. Apologies for any inconvenience.

Hi there. Is it normal for the RDB Shredder run length to double after upgrading from 0.14.0 to 0.15.0?

Hi @Aurimas_Griciunas,

We did notice a 5-10% increase in run length for some pipelines (mostly those with cross-batch deduplication enabled) due to the fact that we reworked caching of the DAG for the “orphan events” fix. But a 100% increase definitely doesn’t look normal.

Do you have cross-batch deduplication enabled? Does your pipeline use shredded types heavily? What instances and volume are we talking about?

Hi, @anton,

Thanks for such a swift response!

Cross-batch deduplication is not enabled.
Our pipeline has around 45 distinct shredded types consisting of ~80 schema versions (some obsolete, so around 60 active ones).
The latest test was performed on 24 GB of raw .lzo files using 20 i3.2xlarge instances. ~10 million rows in table.
Shredding of enriched events took 74 minutes, versus 32 minutes in a benchmark performed on the same data without the version bump (i.e. {“rdb_loader”: 0.16.0, “rdb_shredder”: 0.15.0} vs {“rdb_loader”: 0.15.0, “rdb_shredder”: 0.14.0}).
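For reference, the slowdown implied by those two timings can be worked out with a quick sketch (the function name is just illustrative):

```python
def slowdown(baseline_minutes, upgraded_minutes):
    """Return (ratio, percent increase) of the upgraded run over the baseline."""
    ratio = upgraded_minutes / baseline_minutes
    return ratio, (ratio - 1) * 100


ratio, increase_pct = slowdown(baseline_minutes=32, upgraded_minutes=74)
print(f"{ratio:.2f}x the baseline, {increase_pct:.0f}% longer")
# prints "2.31x the baseline, 131% longer"
```

So the run is a bit more than twice as long.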

Thanks for the details, @Aurimas_Griciunas!

20 i3.2xlarge

I’m wondering if the number of instances has something to do with it. Most of our high-volume pipelines tend to use more “vertical scaling”, e.g. a single r4.16xlarge or several 8xlarge. I’ll try to analyze some of our pipelines with similar characteristics and will get back to you ASAP.

Meanwhile, I think the most important question for you is whether you care about the orphan events issue and shredded bad rows. If not, then it probably makes sense to roll back to R30.

Quick update. Changing the EMR cluster to 5 x i3.8xlarge CORE instances actually doubled the run time of the shred job for both the old and new rdb_shredder versions on the same data.

@Aurimas_Griciunas, different EC2 instance types and counts require different Spark tuning to utilize those instances effectively in an EMR cluster. There are plenty of posts on the subject in this forum.
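As a rough illustration of why the settings have to change with the instance type, here is a minimal sketch of a common community rule of thumb: about five cores per executor, with one core and some memory reserved per node. It is not an official Snowplow or EMR recommendation. The reserved core, the 8 GiB reserve and the 10% overhead are all assumptions, and the memory YARN actually offers on EMR is lower than the raw instance memory used here.

```python
from dataclasses import dataclass


@dataclass
class ExecutorPlan:
    executors_per_node: int
    executor_cores: int       # candidate for spark.executor.cores
    executor_memory_gib: int  # candidate for spark.executor.memory
    total_executors: int      # candidate for spark.executor.instances


def plan_executors(nodes, vcpus_per_node, memory_gib_per_node,
                   cores_per_executor=5, reserved_cores=1,
                   reserved_memory_gib=8, overhead_fraction=0.10):
    """Heuristic executor sizing; every default here is an assumption."""
    usable_cores = vcpus_per_node - reserved_cores
    executors_per_node = max(1, usable_cores // cores_per_executor)
    executor_cores = min(cores_per_executor, usable_cores)
    # Split the remaining memory evenly, then leave room for off-heap overhead.
    per_executor = (memory_gib_per_node - reserved_memory_gib) / executors_per_node
    heap_gib = int(per_executor * (1 - overhead_fraction))
    return ExecutorPlan(executors_per_node, executor_cores, heap_gib,
                        executors_per_node * nodes)


# i3.2xlarge: 8 vCPU, 61 GiB; i3.8xlarge: 32 vCPU, 244 GiB.
print(plan_executors(nodes=20, vcpus_per_node=8, memory_gib_per_node=61))
print(plan_executors(nodes=5, vcpus_per_node=32, memory_gib_per_node=244))
```

Under these assumptions the two clusters end up with quite different executor counts and sizes, so settings carried over unchanged from one layout are unlikely to suit the other.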