Enrich/good folder contains empty 'run=[date]_$folder$' files


#1

I have just set up the Snowplow enrich process using R97 Knossos.

The process completes successfully, but the enrich/good folder contains no folders, only empty files with the name pattern:

run=[timestamp]_$folder$

In enrich/bad I can see matching folders, with two files (_SUCCESS and part-....-.txt).

In archive/enrich I can see a matching folder with the same file pattern as above, but also an empty file with the run name and _$folder$ appended.

I’ve noticed there has been a similar issue which should now be fixed here: https://github.com/snowplow/snowplow/issues/3139


#2

Hello @hanskohls,

These _$folder$ files are harmless. We had plans to remove them, but pushed back this ticket.

enrich/good should not contain data after the pipeline has finished: the folders are archived into archive/enrich by the S3DistCp step, which leaves these ghost _$folder$ files behind.

If the data is present in archive/enrich, then I don’t see any reason for you to worry.


#3

For more information, check out the documentation from AWS:

https://aws.amazon.com/premiumsupport/knowledge-center/emr-s3-empty-files/


#4

Is there a way to use the AWS CLI to delete all the _$folder$ files?


#5

What I have done before is:

export BASE_RM_PATH=example-bucket/example-path; for f in $(aws s3 ls --recursive s3://$BASE_RM_PATH/ | grep '_\$folder\$' | perl -nae 'print "$F[3]\n";'); do echo "aws s3 rm s3://$BASE_RM_PATH/$f"; done

Once you are happy with the result, simply remove the “echo” and the quotes to execute.


#6

Thanks @knservis. I tried using --include with rm but didn’t quite get it working…


#7

Did you run it as is (replacing BASE_RM_PATH=example-bucket/example-path with the correct path)? If yes, did you get the output you expected (a whole series of aws s3 rm statements with the expected files to be deleted)? @bhavin


#8

Ah… no, I meant I tried running s3 rm --recursive --include “.path.” to filter and remove only one file from all the folders :slight_smile:


#9

@bhavin Please let us know if you tried my suggestion. If it worked or if it didn’t or if you decided to do something else or nothing at all - let us know as this will help others that are reading this thread.


#10

Hey @knservis… I had to make a slight modification:
from perl -nae 'print "$F[3]\n";' to perl -nae 'print ( (split("/" , @F[3] ))[-1] , "\n") ;'. Since we are using the same $BASE_RM_PATH, we only need the file name run=date... If we don’t do that, the script will repeat the prefix twice.

# modified version

export BASE_RM_PATH=<s3bucket>/<prefix>;

for f in $(aws s3 ls --recursive s3://$BASE_RM_PATH/ | grep '_\$folder\$' | perl -nae 'print ( (split("/" , @F[3] ))[-1] , "\n") ;');
do
echo "aws s3 rm s3://$BASE_RM_PATH/$f";
done
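
For anyone following along: the prefix ends up repeated because aws s3 ls --recursive prints each key in full, relative to the bucket rather than to the listed prefix. A minimal sketch of an alternative, assuming the listing has the usual date, time, size and key columns, is to keep the whole key and prepend only the bucket name. The helper name marker_rm_commands and the bucket example-bucket are made up for illustration; the function only prints commands and deletes nothing.

```shell
# Sketch: turn `aws s3 ls --recursive` output into reviewable delete commands.
# marker_rm_commands is a hypothetical helper name, not part of the AWS CLI.
marker_rm_commands() {
  bucket="$1"   # bucket name only, without any key prefix

  # Listing lines are assumed to look like: DATE TIME SIZE KEY, where KEY is
  # the full key relative to the bucket. Strip the first three columns, keep
  # only keys ending in the literal suffix _$folder$, and print one command
  # per key. Single quotes in the output stop the shell expanding $folder.
  sed -E 's/^[^ ]+ +[^ ]+ +[0-9]+ //' \
    | grep '_\$folder\$$' \
    | while IFS= read -r key; do
        printf "aws s3 rm 's3://%s/%s'\n" "$bucket" "$key"
      done
}

# Usage (assumes configured AWS credentials; prints commands only):
# aws s3 ls --recursive "s3://example-bucket/example-path/" | marker_rm_commands example-bucket
```

Keeping the full key also preserves any intermediate folders between the prefix and the marker file, which taking only the last path segment would drop.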

I ended up using this one-liner:

echo "enter s3path:"; \
read s3path; \
aws s3 ls --recursive s3://$s3path/ \
| awk -F '/' '/_\$folder\$/  { print $3 }' \
| xargs -I {} echo aws s3 rm s3://$s3path/{}

But what I really wanted to do was use the --recursive, --include and --exclude flags for rm and let the AWS CLI do the work for me, which is faster and means I don’t have to worry about intermediate errors, clean-up, tracking, etc.
(Finally I got it this time…)

read s3path; \
aws s3 rm --dryrun s3://$s3path/ \
--recursive \
--exclude '*' \
--include "*_\$folder$" \
;
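
The same exclude/include filter can be wrapped in a small function that defaults to a dry run. This is only a sketch: remove_folder_markers and its delete argument are hypothetical names, and it assumes the AWS CLI is installed and configured. The pattern is single-quoted so the shell passes the literal _$folder$ suffix through to the CLI.

```shell
# Sketch: delete only the _$folder$ markers under a prefix.
# remove_folder_markers is a hypothetical wrapper, not an AWS CLI command.
remove_folder_markers() {
  s3path="$1"          # e.g. example-bucket/example-path (assumed layout)
  mode="${2:-dryrun}"  # anything other than "delete" only previews

  # --exclude '*' drops every key first; the later --include then re-adds
  # only keys ending in _$folder$ (later filters take precedence).
  if [ "$mode" = "delete" ]; then
    aws s3 rm "s3://$s3path/" --recursive --exclude '*' --include '*_$folder$'
  else
    aws s3 rm "s3://$s3path/" --recursive --exclude '*' --include '*_$folder$' --dryrun
  fi
}
```

Calling remove_folder_markers example-bucket/example-path previews the deletions; adding delete as a second argument runs them for real.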

Let me know what you think, and thanks for the pointer above…


#11

That exclude/include trick in the last example seems to work well, and it will be faster than listing the files and then running an rm for each one. That’s very helpful, @bhavin, thanks.


#12

@bhavin works awesome, thanks.
Once we upgrade to the latest version I won’t need this, but until we do, this is very helpful.