We have been implementing the following Snowplow pipeline to load data into Snowflake:
Collector → Enricher → S3 Loader → (EMR FROM HERE) s3DistCP → Snowflake Transformer → Snowflake Loader → s3DistCP for archive
Everything works fine up to and including the first s3DistCP step, but when the remaining jobs run on EMR, the transformer fails with the following error:
Caused by: java.io.IOException: Not a file: s3a://snowplow-events/enriched/archive/run=2021-05-26-13-21-12/2021/05
I'm guessing the error appears because that path is in fact not a file but a folder. After s3distcp, the folder structure is as follows:
Is there some configuration I need to change to make it run correctly? This is the configuration for the s3distcp step:
Thank you very much for all the support! Let me know if I need to provide more information.