R35 Shredder - no shredding_complete.json file created

Hi,

We managed to set up the Shredder. The EMR job completed, but we cannot find a shredding_complete.json file in the top folder of the run. It seems that this file is required to trigger the RDB Loader via the SQS queue, right?

This is the content of the shredded bucket:
s3://our-shredded-bucket/good/run=2021-02-25-17-39-49/

├── _SUCCESS
├── vendor=com.myapp
│   ├── name=generic_tracking_event
│   │   └── format=json
│   │       └── model=1
│   │           ├── part-00000-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
│   │           └── part-00003-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
│   └── name=minimal_tracking_event
│       └── format=json
│           └── model=1
│               ├── part-00000-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
│               ├── part-00001-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
│               ├── part-00002-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
│               ├── part-00003-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
│               ├── part-00004-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
│               ├── part-00005-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
│               ├── part-00006-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
│               └── part-00007-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
└── vendor=com.snowplowanalytics.snowplow
    ├── name=atomic
    │   └── format=tsv
    │       └── model=1
    │           ├── part-00000-bbc17974-0f3c-418e-96f6-bd6a692ed254.c000.txt.gz
    │           ├── part-00001-bbc17974-0f3c-418e-96f6-bd6a692ed254.c000.txt.gz
    │           ├── part-00002-bbc17974-0f3c-418e-96f6-bd6a692ed254.c000.txt.gz
    │           ├── part-00003-bbc17974-0f3c-418e-96f6-bd6a692ed254.c000.txt.gz
    │           ├── part-00004-bbc17974-0f3c-418e-96f6-bd6a692ed254.c000.txt.gz
    │           ├── part-00005-bbc17974-0f3c-418e-96f6-bd6a692ed254.c000.txt.gz
    │           ├── part-00006-bbc17974-0f3c-418e-96f6-bd6a692ed254.c000.txt.gz
    │           └── part-00007-bbc17974-0f3c-418e-96f6-bd6a692ed254.c000.txt.gz
    └── name=duplicate
        └── format=json
            └── model=1
                ├── part-00000-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
                ├── part-00001-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
                ├── part-00002-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
                ├── part-00003-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
                ├── part-00004-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
                ├── part-00006-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
                └── part-00007-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz

The config.hocon looks like this:

{
  "name": "myapp",
  "id": "4113ba83-2797-4436-8c92-5ced0b8ac5b6",

  "region": "eu-west-1",
  "messageQueue": "SQS_QUEUE",

  "shredder": {
    "input": "SP_ENRICHED_URI",
    "output": "SP_SHREDDED_GOOD_URI",
    "outputBad": "SP_SHREDDED_BAD_URI",
    "compression": "GZIP"
  },

  "formats": {
    "default": "JSON",
    "json": [ ],
    "tsv": [ ],
    "skip": [ ]
  },

  "storage" = {
    "type": "redshift",
    "host": "redshift.amazon.com",
    "database": "OUR_DB",
    "port": 5439,
    "roleArn": "arn:aws:iam::AWS_ACCOUNT_NUMBER:role/RedshiftLoadRole",
    "schema": "atomic",
    "username": "DB_USER",
    "password": "DB_PASSWORD",
    "jdbc": {"ssl": true},
    "maxError": 10,
    "compRows": 100000
  },

  "steps": ["analyze"],

  "monitoring": {
    "snowplow": null,
    "sentry": null
  }
}

We could not find anything in the logs of the EMR job indicating that the Shredder job was aborted. Does it only create this shredding_complete.json if the output type is TSV?

Best,
M.

Hi @mgloel,

Does it only create this shredding_complete.json if the output type is TSV?

No, that should happen regardless of the output format.

It seems that this file is required to trigger the RDB Loader via the SQS queue, right?

Loading is triggered by an SQS message sent by Shredder. Did you check whether there is anything in the SQS queue? When Shredder finishes, it does two things:

  1. Sends an SQS message
  2. Creates the shredding_complete.json

If it fails at the first step, it won’t proceed to the second. Is there anything in your Loader’s logs?

Also make sure that your SQS queue is FIFO. Shredder fails if it isn’t.
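
For anyone who wants to check the queue side quickly, here is a minimal sketch that inspects the attribute map returned by the SQS GetQueueAttributes call. The function names `queue_findings` and `fetch_attributes` are made up for illustration, and the sketch assumes boto3 is installed and credentials are configured if you use the fetch helper. It relies on SQS behaviour rather than on anything Shredder-specific: a FIFO queue rejects messages sent without a MessageDeduplicationId unless content-based deduplication is enabled, and an encrypted queue requires the sender to hold KMS permissions.

```python
from typing import Dict, List


def queue_findings(attributes: Dict[str, str]) -> List[str]:
    """Return notes about SQS queue attributes that can block a sender.

    `attributes` is the "Attributes" map from SQS GetQueueAttributes.
    SQS reports booleans as the strings "true" and "false".
    """
    findings = []
    if attributes.get("FifoQueue") != "true":
        # Standard queues do not report the FifoQueue attribute at all.
        findings.append("queue is not FIFO")
    elif attributes.get("ContentBasedDeduplication") != "true":
        findings.append("content-based deduplication is disabled")
    if attributes.get("KmsMasterKeyId"):
        # Server-side encryption with KMS is enabled on the queue.
        findings.append("queue is KMS-encrypted: sender needs KMS permissions")
    return findings


def fetch_attributes(queue_url: str, region: str) -> Dict[str, str]:
    """Fetch all attributes of a queue (hypothetical helper, needs boto3)."""
    import boto3  # imported lazily so queue_findings works without boto3

    sqs = boto3.client("sqs", region_name=region)
    response = sqs.get_queue_attributes(QueueUrl=queue_url, AttributeNames=["All"])
    return response["Attributes"]
```

Calling `queue_findings(fetch_attributes(queue_url, "eu-west-1"))` with your own queue URL returns an empty list when none of these three conditions applies.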

Hi Anton,

thanks for your reply.

  • Our SQS queue is FIFO.
  • We are only receiving some empty messages on the queue:
    [Screenshot, 2021-02-26 at 21.47.21]
  • The RDB Loader log is only indicating that it is listening to the SQS queue:

2021-02-26 22:16:33
INFO 2021-02-26 21:16:32.939: RDB Loader [myapp] has started. Listening sp-sqs-queue.fifo

The only thing I can think of is that your EMR cluster is incapable of writing to SQS.

I’d check other EMR logs - YARN container logs for example, they’d contain an exception if Shredder has failed writing to SQS or S3.


Thanks for the tip. The EMR cluster could indeed not write to SQS.
We had to add KMS permissions to the EMR_EC2_role.
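
The post does not say which KMS permissions were added. As a hedged sketch only: AWS documents `kms:GenerateDataKey` and `kms:Decrypt` as the permissions a producer needs to send to an SQS queue encrypted with a KMS key, so a policy statement for the EC2 instance role might be built like this. The function name `sqs_kms_policy` and the key ARN are placeholders, not values from this thread.

```python
import json
from typing import Any, Dict


def sqs_kms_policy(key_arn: str) -> Dict[str, Any]:
    """Build an IAM policy document letting a role send to an encrypted queue.

    Assumption: the queue uses server-side encryption with the KMS key
    identified by `key_arn`. The two actions are the ones AWS documents
    for SQS producers; the thread itself does not list them.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["kms:GenerateDataKey", "kms:Decrypt"],
                "Resource": key_arn,
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder ARN: substitute your own account number and key ID.
    arn = "arn:aws:kms:eu-west-1:AWS_ACCOUNT_NUMBER:key/KEY_ID"
    print(json.dumps(sqs_kms_policy(arn), indent=2))
```

The printed JSON can be attached as an inline policy to the role the EMR nodes run under.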

Hello There,

I am facing the same issue. The RDB Shredder finishes successfully, but it neither creates the shredding_complete.json file nor sends any message to the SQS queue.

I checked the permissions for the roles EMR_EC2_DefaultRole and EMR_DefaultRole, and they both have the SQS full access policy attached to them. What could be the problem? Any pointers?

Which logs should I check on S3?

Thanks for the help in advance.

OK, I figured it out. For the benefit of others who might be facing a similar issue, I am going to document what exactly the problem was and how I got around it.

  • Had to sync all the logs generated from the EMR cluster on S3 to my local machine. I used the AWS CLI tool for this.
  • After this I unzipped all the logs.
  • I grepped for “SQS” in them and was able to locate the exception. This exception was preventing the message from being delivered to the queue.
  • The exception was: Caused by: com.amazonaws.services.sqs.model.AmazonSQSException: The queue should either have ContentBasedDeduplication enabled or MessageDeduplicationId provided explicitly
  • I went to the SQS console, enabled “Content-based deduplication”, and ran the Shredder again.
  • It worked!
  • I am able to see the message in the SQS queue and also the shredding_complete.json in S3.
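
The log search described in the steps above can be sketched in a few lines of Python, in case unzipping and grepping by hand is inconvenient. This assumes the EMR logs have already been copied to a local directory (for example with `aws s3 sync`). The function name `find_in_logs` and the directory name in the usage note are hypothetical. Gzipped files are read directly, so no separate unzip step is needed.

```python
import gzip
from pathlib import Path
from typing import Iterator, Tuple


def find_in_logs(root: str, needle: str = "SQS") -> Iterator[Tuple[str, int, str]]:
    """Yield (path, line number, line) for every log line containing `needle`.

    Walks `root` recursively and reads both plain and .gz files as text.
    """
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        # EMR stores most logs gzipped; pick the matching opener.
        opener = gzip.open if path.suffix == ".gz" else open
        with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if needle in line:
                    yield str(path), number, line.rstrip("\n")
```

For example, `for hit in find_in_logs("emr-logs"): print(hit)` lists every line mentioning SQS, which is where an exception such as the AmazonSQSException quoted above would show up.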

Thanks @deepshah7, I wasn’t receiving any messages in my SQS queue until I enabled “Content-based deduplication”.