Consuming data in S3 through different services - Is there a way to use Glue Data Catalog? (or other solution)


#1

Hi there!

We had, to this point, sinked tracking data using the Ruby Tracker SDK > Scala Stream Collector > Enrich Stream > S3 Loader (Kinesis S3) to a S3 bucket. The data is in GZIP format, and it seems to contain the data we expect from our tracker (thanks to the community support, which helped us a lot).

After that, my next mental step was “Ok, now I need to automate data catalog creation to make this data available to different stakeholders”, the problem is, I didn’t find almost anything about snowplow users using Glue Data Catalog, which for me is kind of unexpected. Is there a way to configure crawlers to correctly classify the schema of the data being sent to S3? If not, what is the usual solution?


#2

I am still wondering if I can create a Glue Data Catalog crawler that reads through all fields, including JSON fields, with some logic given, to get contexts and unstruct_event data, but this seems too ambitious. As of now, I manually created a glue table with those settings:


I inserted all 125 columns in the schema, with proper description to ease the life of my coworkers. For that, I used those references:
Using AWS Athena to query the ‘good’ bucket on S3, by @dilyan
Canonical event model doc in Snowplow’s GitHub wiki

As of now, we are able to query data through Athena and other services using this data catalog, and through Athena we can create Views that get the relevant data from JSON fields. This table is lacking partitions, which should be a nice addition, but at least we have something going.

We are discussing other approaches, for example, using Firehose with a transformation lambda function to transform data before load, which should make computing time easier, or give the EmrEtlRunner a try, but we would love to hear other possibilities from people that already faced those problems (or not, just people with ideas :stuck_out_tongue:).


#3

We used a Kinesis Firehose to get data from the Stream Enrich stream, mainly because it’s auto-managed and already partition data into (year, month, day and hour). The problem is that the data that comes from the stream into the firehose is a TSV where records are not separated by newline n (damn). We are using a lambda function to break the records every 130th \t, which let’s us keep this data flow.

After that we had to deal with the Glue Catalog. We managed to create a Data Catalog with partitions by running a crawler in the actual data structure, editting the catalog schema and configurations to properly read the files, which are basically the configurations that I provided in the image above. Then I editted the crawler to don’t change table configurations, aside from partitions, and voilà, we have a catalog with date partitions (year, month, day, hour) that can be updated via crawlers. I also tried to set up a crawler with a GROK expression for classifier, but failed miserably.

Although this solution is somewhat messy, I would say that the issue is solved, but I think that this subject could lead to some interesting discussions, which I’m open for.

Just to pinpoint: this flow is not necessary to load data in Redshift or Elasticsearch, for example, which is, to the best of my knowledge, the actual Snowplow downstreaming pipeline (I guess this article is great to understand the moving pieces), but this gives my team flexibility to query data directly from our data lake, which has data from other pipelines.