R35 Shredder - no shredding_complete.json file created

Hi,

We managed to set up the shredder. The EMR job completed, but we cannot find a shredding_complete.json file in the top folder of the run. It seems that this file is required to trigger the RDB Loader via the SQS queue, right?

This is the content of the shredded bucket:
s3://our-shredded-bucket/good/run=2021-02-25-17-39-49/

├── _SUCCESS
├── vendor=com.myapp
│   ├── name=generic_tracking_event
│   │   └── format=json
│   │       └── model=1
│   │           ├── part-00000-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
│   │           └── part-00003-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
│   └── name=minimal_tracking_event
│       └── format=json
│           └── model=1
│               ├── part-00000-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
│               ├── part-00001-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
│               ├── part-00002-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
│               ├── part-00003-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
│               ├── part-00004-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
│               ├── part-00005-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
│               ├── part-00006-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
│               └── part-00007-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
└── vendor=com.snowplowanalytics.snowplow
    ├── name=atomic
    │   └── format=tsv
    │       └── model=1
    │           ├── part-00000-bbc17974-0f3c-418e-96f6-bd6a692ed254.c000.txt.gz
    │           ├── part-00001-bbc17974-0f3c-418e-96f6-bd6a692ed254.c000.txt.gz
    │           ├── part-00002-bbc17974-0f3c-418e-96f6-bd6a692ed254.c000.txt.gz
    │           ├── part-00003-bbc17974-0f3c-418e-96f6-bd6a692ed254.c000.txt.gz
    │           ├── part-00004-bbc17974-0f3c-418e-96f6-bd6a692ed254.c000.txt.gz
    │           ├── part-00005-bbc17974-0f3c-418e-96f6-bd6a692ed254.c000.txt.gz
    │           ├── part-00006-bbc17974-0f3c-418e-96f6-bd6a692ed254.c000.txt.gz
    │           └── part-00007-bbc17974-0f3c-418e-96f6-bd6a692ed254.c000.txt.gz
    └── name=duplicate
        └── format=json
            └── model=1
                ├── part-00000-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
                ├── part-00001-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
                ├── part-00002-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
                ├── part-00003-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
                ├── part-00004-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
                ├── part-00006-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz
                └── part-00007-2171e936-54d8-421a-a94d-e6e801df7734.c000.txt.gz

The config.hocon looks like this:

{
  "name": "myapp",
  "id": "4113ba83-2797-4436-8c92-5ced0b8ac5b6",

  "region": "eu-west-1",
  "messageQueue": "SQS_QUEUE",

  "shredder": {
    "input": "SP_ENRICHED_URI",
    "output": "SP_SHREDDED_GOOD_URI",
    "outputBad": "SP_SHREDDED_BAD_URI",
    "compression": "GZIP"
  },

  "formats": {
    "default": "JSON",
    "json": [ ],
    "tsv": [ ],
    "skip": [ ]
  },

  "storage" = {
    "type": "redshift",
    "host": "redshift.amazon.com",
    "database": "OUR_DB",
    "port": 5439,
    "roleArn": "arn:aws:iam::AWS_ACCOUNT_NUMBER:role/RedshiftLoadRole",
    "schema": "atomic",
    "username": "DB_USER",
    "password": "DB_PASSWORD",
    "jdbc": {"ssl": true},
    "maxError": 10,
    "compRows": 100000
  },

  "steps": ["analyze"],

  "monitoring": {
    "snowplow": null,
    "sentry": null
  }
}

We could not find anything in the EMR job logs indicating that the shredder job was aborted. Does it only create this shredding_complete.json file if the output type is TSV?

Best,
M.

Hi @mgloel,

"Does it only create this shredding_complete.json if the output type is TSV?"

No, that should happen regardless of the output format.

"It seems that this file is required to trigger the RDB Loader via the SQS queue, right?"

Loading is triggered by an SQS message sent by the Shredder. Did you check whether there is anything in the SQS queue? When the Shredder finishes, it does two things:

  1. Sends an SQS message
  2. Creates the shredding_complete.json

If it fails at the first step, it won't proceed to the second. Is there anything in your Loader's logs?
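
If you want to check what is sitting in the queue, a quick way is the AWS CLI. This is only a sketch: the queue URL below is a placeholder, substitute your own. Also note that the Loader consumes from the same queue, so a message you receive this way is hidden from the Loader for the duration of the visibility timeout.

# Queue URL is a placeholder; substitute your own.
aws sqs receive-message \
  --queue-url https://sqs.eu-west-1.amazonaws.com/AWS_ACCOUNT_NUMBER/sp-sqs-queue.fifo \
  --max-number-of-messages 10 \
  --wait-time-seconds 5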

Also make sure that your SQS queue is FIFO; the Shredder fails if it isn't.
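
You can double-check the FIFO attributes from the CLI as well (again just a sketch, the queue URL is a placeholder):

aws sqs get-queue-attributes \
  --queue-url https://sqs.eu-west-1.amazonaws.com/AWS_ACCOUNT_NUMBER/sp-sqs-queue.fifo \
  --attribute-names FifoQueue ContentBasedDeduplication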

Hi Anton,

Thanks for your reply.

  • Our SQS queue is FIFO.
  • We are only receiving some empty messages on the queue:
    (Screenshot 2021-02-26 at 21.47.21)
  • The RDB Loader log only indicates that it is listening to the SQS queue:

2021-02-26 22:16:33
INFO 2021-02-26 21:16:32.939: RDB Loader [myapp] has started. Listening sp-sqs-queue.fifo

The only thing I can think of is that your EMR cluster is unable to write to SQS.

I'd check the other EMR logs, for example the YARN container logs; they would contain an exception if the Shredder failed writing to SQS or S3.
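
For example, on the EMR master node (the application id below is a placeholder; you can list the real ones with "yarn application -list -appStates ALL"):

yarn logs -applicationId application_1614370000000_0001 | grep -i -B 2 -A 20 exception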

Thanks for the tip. The EMR cluster could indeed not write to SQS.
We had to add KMS permissions to the EMR_EC2_role.
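
In case anyone runs into the same issue, this is roughly what we attached to the instance profile role. It is only a sketch: the policy name, queue ARN, KMS key id and account number are placeholders, and your setup may need a different set of actions.

aws iam put-role-policy \
  --role-name EMR_EC2_role \
  --policy-name shredder-sqs-kms \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": ["sqs:SendMessage", "sqs:GetQueueUrl"],
        "Resource": "arn:aws:sqs:eu-west-1:AWS_ACCOUNT_NUMBER:sp-sqs-queue.fifo"
      },
      {
        "Effect": "Allow",
        "Action": ["kms:GenerateDataKey", "kms:Decrypt"],
        "Resource": "arn:aws:kms:eu-west-1:AWS_ACCOUNT_NUMBER:key/KEY_ID"
      }
    ]
  }'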