Beam enrichment fails when Dataflow is back online


#1

Hi guys.

I have a setup comprising the Scala collector and Beam Enrich in GCP. The collector's stream data is collected fine by the “good” Pub/Sub topic, even when the Enrich/Dataflow job is offline. But when the job is back online, all enrichments fail with the error “snowplow enrich error payload with vendor phpmyadmin and version …” and sometimes “with vendor shaAdmin”. Everything works perfectly when the job is already running before the events arrive.
Thanks


#2

If you’re seeing vendors like phpMyAdmin and shaAdmin, these are legitimate bad rows caused by bots scanning web apps for vulnerabilities.


#3

Thanks @mike for the reply.
I am running my app locally.
I’ve found it happens only right after the enrich job comes back online to collect the accumulated messages. I’ve repeated it (manually switching enrich off/on) enough times to rule out coincidence.
Any suggestions?


#4

@crimsonb that’ll be because the Enrich component is what carries out validation and sends invalid events to bad rows.

If your collector endpoint is open to public traffic (which it is designed to be), then you’re likely to get a certain amount of this activity, as mike described. These requests can just be ignored. They will remain in the queue until the Enrich component recognises them as illegitimate and sends them to the bad rows topic.
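
To make that concrete, here is a rough Python sketch of the idea, not Snowplow's actual code. It assumes the collector stores any request whose path looks like /vendor/version, and that Enrich later checks that pair against the adapters it knows. The names `SUPPORTED`, `route` and `drain` are made up for illustration, and the supported set is only a small illustrative subset.

```python
from typing import Dict, List, NamedTuple, Optional

# Illustrative subset of vendor/version pairs (assumption, not the full list).
SUPPORTED = {
    ("com.snowplowanalytics.snowplow", "tp2"),
    ("com.snowplowanalytics.iglu", "v1"),
}


class Routed(NamedTuple):
    topic: str             # "good" or "bad"
    reason: Optional[str]  # set only for bad rows


def route(path: str) -> Routed:
    """Classify one collected request path the way a validating step might."""
    parts = [p for p in path.split("/") if p]
    if len(parts) < 2:
        return Routed("bad", f"path {path!r} has no vendor/version")
    vendor, version = parts[0], parts[1]
    if (vendor, version) in SUPPORTED:
        return Routed("good", None)
    return Routed(
        "bad",
        f"payload with vendor {vendor} and version {version} not supported",
    )


def drain(backlog: List[str]) -> Dict[str, int]:
    """Process everything that queued up while the validating job was offline."""
    counts = {"good": 0, "bad": 0}
    for path in backlog:
        counts[route(path).topic] += 1
    return counts


if __name__ == "__main__":
    backlog = [
        "/com.snowplowanalytics.snowplow/tp2",
        "/phpMyAdmin/scripts/setup.php",
        "/shaAdmin/index.php",
    ]
    print(drain(backlog))
```

The point of the sketch: nothing is rejected while the job is offline, so bot requests pile up next to real events and only show up as bad rows once the backlog is drained.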