R35 Shredder - no shredding_complete.json file created

mgloel · February 26, 2021, 9:15am

Hi,

we managed to set up the Shredder. The EMR job has completed, but we cannot find a shredding_complete.json file in the top folder of the run. It seems that this file is required to trigger the RDB Loader via the SQS queue, right?

This is the content of the shredded bucket:
s3://our-shredded-bucket/good/run=2021-02-25-17-39-49/

├── _SUCCESS
├── vendor=com.myapp
│   ├── name=generic_tracking_event
│   │   └── format=json
│   │       └── model=1
│   │           ├── part-00000-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
│   │           └── part-00003-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
│   └── name=minimal_tracking_event
│       └── format=json
│           └── model=1
│               ├── part-00000-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
│               ├── part-00001-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
│               ├── part-00002-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
│               ├── part-00003-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
│               ├── part-00004-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
│               ├── part-00005-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
│               ├── part-00006-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
│               └── part-00007-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
└── vendor=com.snowplowanalytics.snowplow
    ├── name=atomic
    │   └── format=tsv
    │       └── model=1
    │           ├── part-00000-bbc17974-0f3c-418e-96f6-bd6a692ed254.c000.txt.gz
    │           ├── part-00001-bbc17974-0f3c-418e-96f6-bd6a692ed254.c000.txt.gz
    │           ├── part-00002-bbc17974-0f3c-418e-96f6-bd6a692ed254.c000.txt.gz
    │           ├── part-00003-bbc17974-0f3c-418e-96f6-bd6a692ed254.c000.txt.gz
    │           ├── part-00004-bbc17974-0f3c-418e-96f6-bd6a692ed254.c000.txt.gz
    │           ├── part-00005-bbc17974-0f3c-418e-96f6-bd6a692ed254.c000.txt.gz
    │           ├── part-00006-bbc17974-0f3c-418e-96f6-bd6a692ed254.c000.txt.gz
    │           └── part-00007-bbc17974-0f3c-418e-96f6-bd6a692ed254.c000.txt.gz
    └── name=duplicate
        └── format=json
            └── model=1
                ├── part-00000-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
                ├── part-00001-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
                ├── part-00002-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
                ├── part-00003-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
                ├── part-00004-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
                ├── part-00006-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
                └── part-00007-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz

The config.hocon looks like this:

{
  "name": "myapp",
  "id": "4113ba83-2797-4436-8c92-5ced0b8ac5b6",

  "region": "eu-west-1",
  "messageQueue": "SQS_QUEUE",

  "shredder": {
    "input": "SP_ENRICHED_URI",
    "output": "SP_SHREDDED_GOOD_URI",
    "outputBad": "SP_SHREDDED_BAD_URI",
    "compression": "GZIP"
  },

  "formats": {
    "default": "JSON",
    "json": [ ],
    "tsv": [ ],
    "skip": [ ]
  },

  "storage" = {
    "type": "redshift",
    "host": "redshift.amazon.com",
    "database": "OUR_DB",
    "port": 5439,
    "roleArn": "arn:aws:iam::AWS_ACCOUNT_NUMBER:role/RedshiftLoadRole",
    "schema": "atomic",
    "username": "DB_USER",
    "password": "DB_PASSWORD",
    "jdbc": {"ssl": true},
    "maxError": 10,
    "compRows": 100000
  },

  "steps": ["analyze"],

  "monitoring": {
    "snowplow": null,
    "sentry": null
  }
}

We could not find anything in the logs of the EMR job indicating that the Shredder job was aborted. Does it only create this shredding_complete.json if the output type is TSV?

Best,
M.

anton · February 26, 2021, 8:26pm

Hi @mgloel,

> Does it only create this shredding_complete.json if the output type is TSV?

No, that should happen regardless of the output format.

> It seems that this file is required to trigger the RDB Loader via the SQS queue, right?

Loading is triggered by an SQS message sent by Shredder. Did you check whether there’s anything in the SQS queue? When Shredder finishes, it does two things:

1. Sends an SQS message
2. Creates the shredding_complete.json

If it failed at the first step it won’t proceed to the second. Is there anything in your Loader’s logs?

Also make sure that your SQS queue is FIFO. Shredder fails if it isn’t.
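
The FIFO requirement can be checked without opening the console. Below is a minimal Python sketch, not part of Shredder or RDB Loader. It assumes boto3 is installed and that the credentials in use may call GetQueueUrl and GetQueueAttributes. The function names are hypothetical, and the queue name in the usage comment is a placeholder.

```python
def queue_problems(attributes):
    """Inspect the 'Attributes' map from SQS GetQueueAttributes.

    SQS returns attribute values as strings, and 'FifoQueue' is only
    present (as "true") when the queue is a FIFO queue.
    """
    problems = []
    if attributes.get("FifoQueue") != "true":
        problems.append("queue is not FIFO")
    return problems


def fetch_queue_attributes(queue_name, region):
    """Fetch all attributes of a queue. Needs AWS credentials and boto3."""
    import boto3  # imported lazily so queue_problems works without boto3

    sqs = boto3.client("sqs", region_name=region)
    url = sqs.get_queue_url(QueueName=queue_name)["QueueUrl"]
    response = sqs.get_queue_attributes(QueueUrl=url, AttributeNames=["All"])
    return response["Attributes"]


# Usage (placeholder queue name, requires AWS access):
# print(queue_problems(fetch_queue_attributes("my-queue.fifo", "eu-west-1")))
```

An empty list means the queue passed this one check; it says nothing about whether the EMR role is allowed to send to it.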

mgloel · February 26, 2021, 9:18pm

Hi Anton,

thanks for your reply.

Our SQS queue is FIFO.
We are only receiving some empty messages on the queue.
The RDB Loader log only indicates that it is listening to the SQS queue:

2021-02-26 22:16:33
INFO 2021-02-26 21:16:32.939: RDB Loader [myapp] has started. Listening sp-sqs-queue.fifo

anton · February 26, 2021, 10:14pm

The only thing I can think of is that your EMR cluster is incapable of writing to SQS.

I’d check the other EMR logs, for example the YARN container logs. They would contain an exception if Shredder failed while writing to SQS or S3.

mgloel · March 1, 2021, 12:04pm

Thanks for the tip. The EMR cluster could indeed not write to SQS.
We had to add KMS permissions to the EMR_EC2_role.
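
The post above does not say which KMS permissions were added. As a hedged sketch: for a queue encrypted with a KMS key, AWS documents that a producer needs kms:GenerateDataKey and kms:Decrypt on that key. The Python snippet below only builds and prints such a policy document. The key ARN is a placeholder and the function name is hypothetical; whether your queue is encrypted at all is an assumption to verify first.

```python
import json


def kms_statement_for_sqs_producer(key_arn):
    """IAM statement letting a role send messages to a KMS-encrypted queue."""
    return {
        "Effect": "Allow",
        "Action": ["kms:GenerateDataKey", "kms:Decrypt"],
        "Resource": key_arn,
    }


# Placeholder ARN: substitute the key that encrypts your queue.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        kms_statement_for_sqs_producer(
            "arn:aws:kms:eu-west-1:AWS_ACCOUNT_NUMBER:key/KEY_ID"
        )
    ],
}

print(json.dumps(policy, indent=2))
```

The printed JSON could then be attached to the EC2 instance role used by the EMR cluster.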

deepshah7 · May 22, 2021, 8:35am

Hello There,

I am facing the same issue. The RDB Shredder finishes successfully, but it neither creates the shredding_complete.json file nor sends any message to SQS.

I checked the permissions for the roles EMR_EC2_DefaultRole and EMR_DefaultRole, and they both have the SQS full access policy attached to them. What could be the problem? Any pointers?

Which logs should I check on S3?

Thanks for the help in advance.

deepshah7 · May 22, 2021, 10:32am

OK, I figured it out. For the benefit of others who might be facing a similar issue, I am going to document what exactly the problem was and how I got around it.

1. I synced all the logs generated by the EMR cluster from S3 to my local machine, using the AWS CLI.
2. I unzipped all the logs.
3. I grepped for “SQS” in them and was able to locate the exception that was preventing the message from being delivered to the queue:
   Caused by: com.amazonaws.services.sqs.model.AmazonSQSException: The queue should either have ContentBasedDeduplication enabled or MessageDeduplicationId provided explicitly
4. I went to the SQS console, enabled “Content-based deduplication”, and ran the Shredder again.

It worked! I am able to see the message in the SQS queue and also the shredding_complete.json in S3.
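
The log search described above can also be done without unzipping anything by hand. This is a minimal Python sketch under a few assumptions: the logs have already been copied locally (for example with `aws s3 sync` from your cluster's log location into a folder such as ./emr-logs), compressed logs end in .gz, and everything else is plain text. The function name and folder name are hypothetical.

```python
import gzip
from pathlib import Path


def find_in_logs(root, needle="SQS"):
    """Return (path, line number, line) for every line containing needle."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        # Read .gz files through gzip, everything else as plain text.
        opener = gzip.open if path.suffix == ".gz" else open
        with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if needle in line:
                    hits.append((str(path), number, line.rstrip("\n")))
    return hits


# Usage (assumes the logs were synced into ./emr-logs):
# for path, number, line in find_in_logs("emr-logs"):
#     print(f"{path}:{number}: {line}")
```

Searching for "SQS" also matches class names such as AmazonSQSException, which is how the exception above would show up.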

rmichaelvp · July 2, 2021, 8:38pm

Thanks @deepshah7, I wasn’t receiving any messages in my SQS queue until I enabled “Content-based deduplication”.
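
For anyone who prefers not to use the console, the same setting can be changed through the SQS API. A minimal Python sketch, assuming boto3 and credentials that allow SetQueueAttributes on the queue; the function name is hypothetical and the queue URL in the usage comment is a placeholder.

```python
def enable_content_based_deduplication(sqs_client, queue_url):
    """Turn on content-based deduplication for an existing FIFO queue.

    sqs_client is expected to be a boto3 SQS client; attribute values
    are passed to SQS as strings.
    """
    sqs_client.set_queue_attributes(
        QueueUrl=queue_url,
        Attributes={"ContentBasedDeduplication": "true"},
    )


# Usage (requires AWS access; QUEUE_URL is a placeholder):
# import boto3
# client = boto3.client("sqs", region_name="eu-west-1")
# enable_content_based_deduplication(client, "QUEUE_URL")
```

The attribute only applies to FIFO queues, so this does not replace the FIFO requirement mentioned earlier in the thread.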

pramod.niralakeri · January 3, 2022, 10:39am

I have a similar issue and have tried all the solutions above, but no luck. Do you see any other way?

The EMR shred job is not able to write to SQS.
