I’ve had a problem with the Snowflake loader skipping data, and I’m trying to find a workaround for it.
If the folder names are set to YYYY-MM-DD-HH (i.e. hourly) and the Snowflake loader runs every 4 hours, each run marks the current folder as complete even though the streaming enrich is still populating it with data. As a result, every 4 hours about 50% of the data for that hour doesn’t get loaded into Snowflake.
I’ve found a workaround by setting the folder name to YYYY-MM-DD-HH-mm (i.e. one folder per minute), which reduces the skipped data to no more than 1 minute every 4 hours. However, this has considerably increased the time EMR takes to process the data.
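To make the timing issue concrete, here is a rough Python sketch of how I understand it. This is not the loader's code: `folder_for` and `is_still_open` are hypothetical helpers I made up for illustration, and the run time is an arbitrary example. It only shows which folder is still being written to when a run starts, for each of the two naming schemes.

```python
from datetime import datetime, timedelta

# strftime equivalents of the two folder naming schemes described above
HOURLY = "%Y-%m-%d-%H"       # YYYY-MM-DD-HH
MINUTELY = "%Y-%m-%d-%H-%M"  # YYYY-MM-DD-HH-mm


def folder_for(ts, fmt):
    """Name of the folder an event arriving at `ts` would be written to."""
    return ts.strftime(fmt)


def is_still_open(folder, fmt, now):
    """True if the folder's time window has not ended yet at `now`,
    i.e. the enricher may still be writing data into it."""
    start = datetime.strptime(folder, fmt)
    width = timedelta(hours=1) if fmt == HOURLY else timedelta(minutes=1)
    return now < start + width


# Hypothetical loader run at half past the hour
run_time = datetime(2020, 1, 1, 12, 30, 15)

for fmt in (HOURLY, MINUTELY):
    current = folder_for(run_time, fmt)
    print(current, "still open:", is_still_open(current, fmt, run_time))
```

With hourly folders the open window at run time is a full hour, so whatever arrives in the rest of that hour is what gets skipped; with per-minute folders the open window shrinks to at most one minute.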
Is there a better way around this problem?