Missing events data in BQ when running manual test

Hi Snowplowers,
I ran a manual test on a React site to test each of the Snowplow events and validate the schemas.
A weird situation happened when I clicked, for example, the ad click event button quickly about 10 times: usually only 6-8 events would be saved in BQ, and the rest were missing (all 10 events fired successfully).

code:

doAdClick = () => {
  analytics.track('AdClick', {
    targetUrl: "http//www.test.com",
    clickId: "11233",
    costModel: "cpm",
    cost: 3,
    zoneId: "9",
    impressionId: '9/10/3:40',
    advertiserId: "201",
    campaignId: "123124"
  })
}

In our debugging process, we saw all 10 events in the raw good topic (via its subscription), but not all of them in the enriched-good topic.
In BQ we saw quite a large number of adapter_failures bad rows:

  {
    "app_id": "",
    "timestamp": "2021-10-13T22:49:50.175Z",
    "event_name": "",
    "error_type": "adapter_failures",
    "schema": "iglu:com.snowplowanalytics.snowplow.badrows/adapter_failures/jsonschema/1-0-0",
    "data": {
      "failure": "{\"timestamp\": \"2021-10-13T22:49:50.175Z\", \"vendor\": \"snowplow\", \"version\": \"health\", \"messages\": [{\"field\": \"vendor/version\", \"value\": \"snowplow/health\", \"expectation\": \"vendor/version combination is not supported\"}]}",
      "payload": "{\"vendor\": \"snowplow\", \"version\": \"health\", \"querystring\": [], \"contentType\": null, \"body\": null, \"collector\": \"ssc-2.3.2-rc1-googlepubsub\", \"encoding\": \"UTF-8\", \"hostname\": \"x.x.x.x\", \"timestamp\": \"2021-10-13T22:49:49.215Z\", \"ipAddress\": \"35.191.x.x\", \"useragent\": \"GoogleHC/1.0\", \"refererUri\": null, \"headers\": [\"Timeout-Access: <function1>\", \"Host\", \"User-Agent: GoogleHC/1.0\", \"Connection: Keep-alive\"], \"networkUserId\": \"xxx\"}",
      "processor": {
        "artifact": "beam-enrich",
        "version": "1.2.3"
      }
    }
  },

Could it be that Snowplow treated some of the repeated events as robot-generated events?
If so, how can we change the enrichment to avoid this? Thank you

No - this shouldn’t happen; even if an event is flagged as a bot / spider, it will persist through the pipeline.

If you have seen 10 in the raw topic you should get 10 in BQ, assuming no insertion failures / bad rows have been raised.

These look to all be from the GCP load balancer health check, so they are unrelated to your first issue. The snowplow/health endpoint does not exist, which is why you are seeing these adapter failures. You will want to use just /health as the load balancer health check path, which should return a 200 OK.
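If it helps, a quick way to verify the path is to hit /health directly and check for a 200. This is just a minimal sketch, assuming a recent Node runtime with global fetch and a hypothetical collector hostname:

async function checkCollectorHealth(baseUrl) {
  // The collector's health check lives at /health; vendor/version-style paths
  // like /snowplow/health get treated as tracking payloads and end up as
  // adapter failures in enrich, which is what you saw above
  const res = await fetch(`${baseUrl}/health`);
  return res.status === 200; // expect 200 OK when the collector is healthy
}

// Hypothetical hostname, for illustration only
checkCollectorHealth('https://collector.example.com').then(console.log);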


I ran into the same problem. I clicked a button five times. I first checked the good topic in Pub/Sub, which had nine records for this event in total, but only five of them were unique and the rest were duplicates (I am not sure why this happens, which is my first question). The number of unique records matched the click count, which showed that all of the events had arrived at the collector successfully. I then checked the bad topic and found nothing there, which was also reasonable.

Next, I checked the enriched-good topic and found only three records, so I assumed the missing two records would be in the enriched-bad topic. But when I checked the enriched-bad topic, nothing related was found, only a large number of adapter failures. Finally, I checked BigQuery and found three records, matching the number in the enriched-good topic. Since the two records are lost before insertion into BQ, I suppose it has nothing to do with insertion failures.

So for now, the question is: why would two records be lost in the enrichment step (from the good topic to the enriched-good topic)? Could the adapter failures be the cause, or is there some other unexpected reason?

@kuangmichael07 there is an unfortunately awkward issue with debugging failed events in BigQuery which I suspect might be what’s making things confusing/difficult in your case. It’s documented in the missing fields note in the documentation.

Because of the datatype of the error description field for enrichment failures and schema violations, those failure types are difficult to query via BigQuery. In cases like yours, the issue is quite likely to be down to schema violations or something similar.

This exists because we implemented a massive restructure of failed events (which on the whole improves things hugely over the previous iteration), and we only discovered after launch that the polymorphism of those fields affects the BigQuery experience. (In fact, before launch, querying via BQ wasn’t even possible.) We currently have engineers working on resolving that issue.

So the bad news is that just using BQ to debug this isn’t very easy until that’s resolved. The good news, however, is that we have other workflows which should offer both a faster feedback loop and a better experience overall when setting up or making changes to tracking. Snowplow Micro and Snowplow Mini are miniature versions of the pipeline, which are used for debugging tracking with an immediate feedback loop. Micro can be run locally and has an API endpoint to surface good/bad events, so that’s probably the best option - Mini runs on a cloud instance and outputs to Elasticsearch, and is good for collaboration.
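To give a flavour of the Micro workflow, here’s a minimal sketch of polling its endpoints from a small Node script. It assumes Micro is running locally on its default port (9090, e.g. via the Docker image) and a Node runtime with global fetch; adjust host/port to your setup:

const MICRO = 'http://localhost:9090';

async function inspectMicro() {
  // /micro/all returns summary counts of everything Micro has received so far
  const counts = await (await fetch(`${MICRO}/micro/all`)).json();
  console.log('counts:', counts);

  // /micro/bad returns the failed events along with their validation errors,
  // which is usually enough to spot a schema violation straight away
  const bad = await (await fetch(`${MICRO}/micro/bad`)).json();
  console.log(JSON.stringify(bad, null, 2));
}

inspectMicro().catch(console.error);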

Any issue that you hit in the pipeline will also be surfaced in Snowplow Mini or Micro, but with a faster and more accessible feedback loop.

To give you some context, the flow of data is:

collector → raw good topic → enrich/validation → enriched good or bad topic → Bigquery (good) / GCS (bad)

Since you don’t see the data in the enriched good topic, the likelihood is that the data is failing the validation process.

If you set up Micro on your machine and send the exact same tracking to that, it should surface what’s going on for you. Then you’ll also have an awesome tool to use for experimentation with the rest of your tracking. The other option is to search the raw GCS logs of failed events, or manually find them in the failed events topic, but that’ll be more painful.
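For the topic route, this is a minimal sketch of pulling messages with the Node Pub/Sub client; the project and subscription names are hypothetical placeholders, and it assumes you have created a subscription on the enriched bad / failed events topic:

const { PubSub } = require('@google-cloud/pubsub');

const pubsub = new PubSub({ projectId: 'my-gcp-project' });
// A subscription attached to the enriched bad / failed events topic
const subscription = pubsub.subscription('enriched-bad-debug-sub');

subscription.on('message', (message) => {
  // Each message is a self-describing failed-event JSON payload
  console.log(Buffer.from(message.data).toString('utf8'));
  message.ack();
});

subscription.on('error', (err) => console.error('subscription error:', err));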

I hope that’s helpful, shout if you need help with it. :slight_smile:

PS. I can confirm that Mike is right, those adapter failures are just pings on the collector, unrelated to your tracking.

@phxtorise while I’m not sure you necessarily face the same issue, I think the best next step for you is also to set up Micro or Mini (see my reply above).

I’ll answer your specific points here briefly, but I think if you need further help the best thing to do is to open a new thread - just because it’s hard to manage two things in one place, context can get lost and ultimately the two issues may be unrelated. Just a little housekeeping request to keep us sane. :slight_smile:

I clicked a button five times. I first checked the good topic in Pub/Sub, which had nine records for this event in total, but only five of them were unique and the rest were duplicates (I am not sure why this happens, which is my first question).

Duplicates are normal, although typically not every event would be duplicated. The most common reason you’d see duplicates is connection issues - if the collector doesn’t respond in time for the tracker, then the tracker doesn’t know that the event has been sent and will attempt to send it again. Additionally, if you tracked events while offline, those events will be cached and sent once you’re back online.

There’s a pretty thorough explanation of the possible causes of duplicates in a live environment in this old blog post. We have better strategies for managing them now, but the explanations of where they come from still apply.
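To make the retry point concrete: the tracker generates the event_id client-side, so a retried send carries the same event_id as the original, which is what makes duplicates easy to collapse downstream. A minimal sketch (the event shape here is a hypothetical simplification):

function dedupeByEventId(events) {
  // Keep only the first occurrence of each event_id; retried sends of the
  // same event reuse the id, so later copies are dropped
  const seen = new Set();
  return events.filter((e) => {
    if (seen.has(e.event_id)) return false;
    seen.add(e.event_id);
    return true;
  });
}

// e.g. dedupeByEventId(rows).length gives the number of unique events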

The number of unique records matched the click count, which showed that all of the events had arrived at the collector successfully. I then checked the bad topic and found nothing there, which was also reasonable. Next, I checked the enriched-good topic and found only three records.

My gut says that most likely either there’s something wrong with the deployment, or some error was made in digging for the events in the good/bad topics. In either case, I think the best thing to do is narrow down the possible problems by setting up Micro and using that to test before sending data to the main pipeline.

I hope that’s helpful - like I mentioned before, we’re happy to help further - if you need to, feel free to open a new topic. :slight_smile:

@kuangmichael07 and @phxtorise there is one more thing to note - whenever you have a completely new event or entity, the corresponding columns don’t exist in the table when you first send events. The mutator component of the loader will create the columns for you. However, that takes time to happen.

Typically, the first few events won’t show up in the table until the column exists and the repeater sends them back in. It is possible that column creation takes too long even for the repeater, and the data ends up in the loader’s bad topic.

If you haven’t set up the repeater, these events won’t get re-inserted. In any case, if you re-send the same events and they all show up, then that’s most likely the explanation.

Hi @mike
Thank you for the reply. For the snowplow/health endpoint for the LB, are you talking about something like this:

Or something else? Thank you

Yes - it’ll likely be whatever is configured in snowplow-healthcheck-dev.

@Colm Thank you for the reply. The events we are testing are all existing events and the payloads are hard-coded.
And @phxtorise, I think our cases are similar in some ways, even though my pipeline has more issues than yours. The enrichment somehow lost a few events, and if you open a new topic I will follow it and share any new findings with you.