Snowflake loader with realtime pipeline

I’ve got a couple of questions about using the Snowflake loader to load the output of the realtime pipeline via S3 - apologies if these have been answered already elsewhere!

  1. If we’re partitioning the folders by YYYY-MM-DD-HH, is there a risk of a partially processed folder being marked as completed if the Snowflake loader run occurs whilst the hour’s folder is still being filled?

  2. I’ve had a couple of instances where the transformer step has failed (due to an AWS problem), and the job then gets stuck because the new columns already exist. Manually dropping the columns fixes the problem. Is this expected behaviour?

Thanks!

Iain

@ian,

If we’re partitioning the folders by YYYY-MM-DD-HH, is there a risk of a partially processed folder being marked as completed if the Snowflake loader run occurs whilst the hour’s folder is still being filled?

We typically archive the files produced by the S3 Loader to a separate “archive” bucket and start transforming the run folders from there.
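
For what it’s worth, here is a minimal sketch of one way to do that archiving step, assuming hypothetical bucket names and the run-YYYY-MM-DD-HH folder naming from the question. It only copies folders whose hour has already closed, so the transformer never picks up a folder that is still being written to. This uses boto3 directly and is not part of the Snowplow tooling:

```python
# Hypothetical sketch: copy only *closed* hour folders from the S3 Loader
# output bucket to the archive bucket. Bucket names are placeholders.
from datetime import datetime, timezone

import boto3

SOURCE_BUCKET = "my-s3-loader-output"   # placeholder
ARCHIVE_BUCKET = "my-snowplow-archive"  # placeholder

s3 = boto3.client("s3")

def archive_closed_folders() -> None:
    current_hour = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H")
    paginator = s3.get_paginator("list_objects_v2")
    # Folders are named run-YYYY-MM-DD-HH/, so a zero-padded string
    # comparison against the current hour tells us whether a folder
    # is still open.
    for page in paginator.paginate(Bucket=SOURCE_BUCKET, Delimiter="/"):
        for prefix in page.get("CommonPrefixes", []):
            folder = prefix["Prefix"]  # e.g. "run-2020-05-18-16/"
            folder_hour = folder.removeprefix("run-").rstrip("/")
            if folder_hour >= current_hour:
                continue  # hour still being filled; skip for now
            for obj_page in paginator.paginate(
                Bucket=SOURCE_BUCKET, Prefix=folder
            ):
                for obj in obj_page.get("Contents", []):
                    s3.copy_object(
                        Bucket=ARCHIVE_BUCKET,
                        Key=obj["Key"],
                        CopySource={"Bucket": SOURCE_BUCKET, "Key": obj["Key"]},
                    )
                    s3.delete_object(Bucket=SOURCE_BUCKET, Key=obj["Key"])
```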

I’ve had a couple of instances where the transformer step has failed (due to an AWS problem), and the job then gets stuck because the new columns already exist. Manually dropping the columns fixes the problem. Is this expected behaviour?

I believe you are actually referring to the Loader, not the Transformer. Yes, if that happens you have two options: delete the newly created column or amend the manifest table. See Snowflake loader error - column already exists for more details.
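
For reference, a minimal sketch of the first option (dropping the newly created column) using the snowflake-connector-python package. The connection parameters and the column name are placeholders: you would substitute the column reported in the “column already exists” error, and the default atomic.events target table is assumed here:

```python
# Hypothetical sketch of option 1: drop the column the failed run created,
# so the next loader run can add it again cleanly. All names and
# credentials below are placeholders, not the loader's configuration.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # placeholder
    user="my_user",            # placeholder
    password="my_password",    # placeholder
    warehouse="my_warehouse",  # placeholder
    database="my_database",    # placeholder
)

try:
    cur = conn.cursor()
    # Replace the column name with the one from the loader's error message.
    cur.execute(
        'ALTER TABLE atomic.events '
        'DROP COLUMN "CONTEXTS_COM_EXAMPLE_MY_CONTEXT_1"'
    )
finally:
    conn.close()
```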


I believe the above message is for @iain, so I’m going to mention him instead. 🙂


Thank you.

We’ve set it up with the S3 Loader loading events into the archive bucket as recommended, partitioned by YYYY-MM-DD-HH (i.e. hourly).

Are we still likely to get partially loaded run folders (e.g. for run-2020-05-18-16 if the transformer job runs at 16:30), or does the transformer have a way of detecting whether a folder has changed since the last run?

@ihor do you know if partially loaded run folders are likely to be a problem with the events partitioned by hour as above?