Snowflake loader with realtime pipeline

I’ve got a couple of questions about using the Snowflake loader to load the output of the realtime pipeline via S3 - apologies if these have been answered already elsewhere!

  1. If we’re partitioning the folders by YYYY-MM-DD-HH, is there a risk of a partially processed folder being marked as completed if the Snowflake loader run occurs while the hour’s folder is still being filled?

  2. I’ve had a couple of instances where the transformer step has failed (due to an AWS problem), and then the job gets stuck due to new columns already existing. Manually dropping the columns fixes the problem. Is this expected behaviour?

Thanks!

Iain

@ian,

If we’re partitioning the folders by YYYY-MM-DD-HH, is there a risk of a partially processed folder being marked as completed if the Snowflake loader run occurs while the hour’s folder is still being filled?

We typically archive the files produced by S3 Loader to a separate “archive” bucket and start transforming the run folders from there.
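For illustration, here’s a minimal sketch (in Python, with hypothetical bucket and prefix names — substitute your own) of how S3 Loader output keys might be mapped into hourly run folders in a separate archive bucket before the transformer picks them up:

```python
from datetime import datetime, timezone

# Hypothetical bucket/prefix names -- substitute your own.
SOURCE_PREFIX = "s3://raw-events/"
ARCHIVE_PREFIX = "s3://raw-events-archive/"

def archive_key(source_key: str, now: datetime) -> str:
    """Map a raw S3 Loader key to an hourly run folder in the archive bucket.

    The run folder is named run-YYYY-MM-DD-HH after the current hour,
    matching the partitioning scheme discussed above.
    """
    run_folder = now.strftime("run-%Y-%m-%d-%H")
    filename = source_key.rsplit("/", 1)[-1]
    return f"{ARCHIVE_PREFIX}{run_folder}/{filename}"

# Example: a file landing at 16:30 goes into the 16:00 run folder.
ts = datetime(2020, 5, 18, 16, 30, tzinfo=timezone.utc)
print(archive_key(f"{SOURCE_PREFIX}part-0001.gz", ts))
```

The transformer then reads only from the archive bucket, so the loader never races the files still landing in the source bucket.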

I’ve had a couple of instances where the transformer step has failed (due to an AWS problem), and then the job gets stuck due to new columns already existing. Manually dropping the columns fixes the problem. Is this expected behaviour?

I believe you are referring to the Loader rather than the Transformer. Yes, if that happens you have two options: delete the newly created column or amend the manifest table. See Snowflake loader error - column already exists for more details.
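As a rough sketch of the first option, the manual fix amounts to a statement like the one built below. The table and column names here are hypothetical examples, not the loader’s actual schema — use the ones from your own failed run:

```python
def drop_column_statement(table: str, column: str) -> str:
    """Build the DROP COLUMN statement for option one: removing the
    column that the failed load had already created, so the retried
    load can create it again cleanly."""
    return f"ALTER TABLE {table} DROP COLUMN {column};"

# Hypothetical table/column names -- substitute those from your failed run.
print(drop_column_statement("atomic.events", "contexts_com_acme_checkout_1"))
```

The second option (amending the manifest) depends on where your manifest lives in your setup, so no sketch is attempted here.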


I believe the above message is for @iain, so I’m going to mention him instead. :slight_smile:


Thank you.

We’ve set it up with the S3 Loader loading events into the archive bucket as recommended, partitioned by YYYY-MM-DD-HH (i.e. hourly).

Are we still likely to get partially loaded folders (e.g. run-2020-05-18-16 if the transformer job runs at 16:30), or does the transformer have a way of detecting whether a folder has changed since the last run?
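Not speaking for the transformer’s internals, but one common guard against this (sketched below with a hypothetical helper, not the transformer’s actual logic) is to only pick up run folders whose hour has fully elapsed, plus a small grace period for in-flight writes — so a run at 16:30 would skip run-2020-05-18-16 and process run-2020-05-18-15:

```python
from datetime import datetime, timedelta, timezone

def is_safe_to_process(
    run_folder: str,
    now: datetime,
    grace: timedelta = timedelta(minutes=5),
) -> bool:
    """Return True only when the folder's hour is fully over (plus a
    small grace period for in-flight writes), so a partially filled
    hourly folder is never picked up."""
    start = datetime.strptime(run_folder, "run-%Y-%m-%d-%H").replace(
        tzinfo=timezone.utc
    )
    end_of_hour = start + timedelta(hours=1)
    return now >= end_of_hour + grace

now = datetime(2020, 5, 18, 16, 30, tzinfo=timezone.utc)
print(is_safe_to_process("run-2020-05-18-16", now))  # current hour: not safe
print(is_safe_to_process("run-2020-05-18-15", now))  # previous hour: safe
```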