Are there still data cube sql files available


#1

On this page in the documentation there is a link to download SQL files for Redshift cubes and recipes.

The GitHub link is broken; it looks like the folder was renamed from 5-analytics to 6-analytics. I think I found the folder on GitHub that the documentation was referring to:

https://github.com/snowplow/snowplow/tree/master/6-analytics/recipes/redshift

The documentation seems to indicate that both recipe and cube SQL files are available for download, but I only see recipes at this GitHub link. Are the cube SQL files still available, and if so, where do I find them?


#2

Hi @ryanrozich - we’ll publish new ones later this week.


#3

Thanks @christophe !


#4

Awesome @christophe - looking forward to it!


#5

Hi @bernardosrulzon and @ryanrozich - the model I mentioned was published to this branch on GitHub: https://github.com/snowplow/snowplow/tree/feature/web-models/5-data-modeling/web-model

Christophe


#6

Thanks! So are these the data cube files that were referenced in the docs?


#7

@ryanrozich I’m not sure which docs you’re referencing, but those are indeed the files. Let me know if you have any questions or feedback (there’s no documentation ready yet - it will be added before we publish them).

Christophe


#8

@christophe the docs I was referring to are at the link I pasted in the original question for this thread.

Thanks for posting these. I’m opening them up now and taking a first look; it looks like they create materialized views as tables in these schemas. A couple of questions:

(1) Is the purpose of materializing the views to speed up query execution as opposed to just having them be dynamic views?

(2) Do you have any benchmarks on how much additional cluster resource (disk, etc.) it would take to run all of the data cube DDL files you posted? Specifically, how much should we allocate on top of what is loaded and stored in the atomic schema?

(3) What’s the recommended way of using these? Is it (a) to drop and recreate the schema every time you want to do a study on web data, or (b) to keep these somewhat in sync with the raw data in the atomic schema? If the latter, should they all be dropped and recreated in batch on a schedule (such as every day), or would there be some sort of incremental loading of just the new data?


#9

Those are all good questions @ryanrozich!

That documentation will need to be replaced - I created a ticket for us to do so: #2932.

That’s indeed the main reason to have materialized views. There are a few others, which are mentioned in this post: Should I use views in Redshift?

It shouldn’t have too much of an impact on disk usage, which should increase by at most a few percentage points.

I’d expect most people to run them on a regular schedule. One option is, for example, a nightly run. Another is to run them each time a new batch of events gets loaded into Redshift. If you’re a Looker user, you can use the Looker variant of the model. If you use our SQL Runner application to schedule SQL, you can use this version instead.
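For concreteness, a nightly schedule could be as simple as a cron entry that invokes SQL Runner against a playbook. This is only a sketch - the install paths, playbook name, and log location here are hypothetical:

```shell
# Hypothetical crontab entry: rebuild the web model every night at 03:00
0 3 * * * /opt/sql-runner/sql-runner -playbook /opt/web-model/playbook.yml >> /var/log/web-model.log 2>&1
```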

On the last point:

The current version doesn’t do that, but we plan to make that possible. We have written a tutorial on this topic: Making SQL data models incremental to improve performance [tutorial]
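As a rough illustration of what "incremental" means here (a sketch under assumed names - the tutorial's actual tables and columns may differ), the idea is to model only events newer than what is already in the derived tables, instead of rebuilding everything:

```sql
-- Sketch: select only events that arrived after the last modeled event.
-- Assumes a derived web.page_views table with a collector_tstamp column.
CREATE TEMP TABLE new_events AS
SELECT *
FROM atomic.events
WHERE collector_tstamp > (SELECT MAX(collector_tstamp) FROM web.page_views);

-- ...then run the usual modeling steps against new_events
-- and INSERT the results into the derived tables.
```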

Hope this helps!

Christophe


#10

Thanks @christophe for the answers! These tables are great by the way.

I believe I found a bug in one of your SQL files, 03-events-scroll-depth.sql:

  LEAST(GREATEST(MIN(NVL(ev.pp_xoffset_min, 0)), 0), MAX(ev.br_viewwidth)) AS hmin, -- should be zero
  LEAST(GREATEST(MAX(NVL(ev.pp_xoffset_max, 0)), 0), MAX(ev.br_viewwidth)) AS hmax,

  LEAST(GREATEST(MIN(NVL(ev.pp_yoffset_min, 0)), 0), MAX(ev.br_viewheight)) AS vmin, -- should be zero (edge case: not zero because the pv event is missing - but these are not in scratch.dev_pv_01 so not an issue)
  LEAST(GREATEST(MAX(NVL(ev.pp_yoffset_max, 0)), 0), MAX(ev.br_viewheight)) AS vmax,

I think you want to limit the scroll offsets to the max width/height of the document, not the max width/height of the browser viewport. For example, I believe the last line should read:

LEAST(GREATEST(MAX(NVL(ev.pp_yoffset_max, 0)), 0), MAX(ev.doc_height)) AS vmax,

I noticed that for many page views, vmax was exactly equal to MAX(ev.br_viewheight), the viewport height, even though the users were actually scrolling much further down the page.
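For reference, if the same reasoning applies to all four lines, the corrected block (using the standard doc_width and doc_height columns in atomic.events) would look like:

```sql
LEAST(GREATEST(MIN(NVL(ev.pp_xoffset_min, 0)), 0), MAX(ev.doc_width))  AS hmin,
LEAST(GREATEST(MAX(NVL(ev.pp_xoffset_max, 0)), 0), MAX(ev.doc_width))  AS hmax,
LEAST(GREATEST(MIN(NVL(ev.pp_yoffset_min, 0)), 0), MAX(ev.doc_height)) AS vmin,
LEAST(GREATEST(MAX(NVL(ev.pp_yoffset_max, 0)), 0), MAX(ev.doc_height)) AS vmax,
```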


#11

@ryanrozich You might be right - I’ll have a look asap.


#12

One more question @christophe - your data modeling queries run very fast, but when I try to add my own data modeling queries for additional context, they seem to take forever. I’m wondering if you have some general principles for constructing these queries so that they execute quickly.

Let me give you an example: we have a custom context that we attach to page view events in the JS tracker; right now all we put in it is the page’s canonical_link (if it exists on the page). I would like this canonical_link to end up in the web.page_views table when the data modeling is done. Following your examples, I am creating a temp table relating page_view_id to canonical_link, and I plan to join it into the final web_page_tmp table in the 06-page-views.sql file.

However, the query to create the temp table is taking forever. To start, here is the DDL for the custom context I am querying against:

-- AUTO-GENERATED BY igluctl DO NOT EDIT
-- Generator: igluctl 0.2.0
-- Generated: 2016-10-23 08:34

CREATE SCHEMA IF NOT EXISTS atomic;

CREATE TABLE IF NOT EXISTS atomic.com_onespot_pageview_context_2 (
    "schema_vendor"  VARCHAR(128)  ENCODE RUNLENGTH NOT NULL,
    "schema_name"    VARCHAR(128)  ENCODE RUNLENGTH NOT NULL,
    "schema_format"  VARCHAR(128)  ENCODE RUNLENGTH NOT NULL,
    "schema_version" VARCHAR(128)  ENCODE RUNLENGTH NOT NULL,
    "root_id"        CHAR(36)      ENCODE RAW       NOT NULL,
    "root_tstamp"    TIMESTAMP     ENCODE LZO       NOT NULL,
    "ref_root"       VARCHAR(255)  ENCODE RUNLENGTH NOT NULL,
    "ref_tree"       VARCHAR(1500) ENCODE RUNLENGTH NOT NULL,
    "ref_parent"     VARCHAR(255)  ENCODE RUNLENGTH NOT NULL,
    "canonical_link" VARCHAR(4096) ENCODE LZO,
    FOREIGN KEY (root_id) REFERENCES atomic.events(event_id)
)
DISTSTYLE KEY
DISTKEY (root_id)
SORTKEY (root_tstamp);

COMMENT ON TABLE atomic.com_onespot_pageview_context_2 IS 'iglu:com.onespot/pageview-context/jsonschema/2-0-0';

Here is the query I wrote for the data modeling pipeline, relating page_view_id to canonical_link in this table:

SELECT
  wp.page_view_id,
  ctx.canonical_link
FROM atomic.com_onespot_pageview_context_2 AS ctx
INNER JOIN scratch.web_page_context AS wp USING (root_id)

It ran for over 45 minutes and I decided to just cancel it. Any ideas for making this run faster?


#13

For this case I found that joining on scratch.web_events instead of scratch.web_page_context sped up this query quite a bit. However, if you have any general tips on writing efficient queries for data modeling, I’d appreciate it.
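Beyond switching the join target, another pattern that often helps in Redshift is to prune the context table on its sort key (root_tstamp here) before joining, so the scan can skip disk blocks. A hedged sketch - the 30-day window is an arbitrary example, not a recommendation:

```sql
SELECT
  wp.page_view_id,
  ctx.canonical_link
FROM atomic.com_onespot_pageview_context_2 AS ctx
INNER JOIN scratch.web_page_context AS wp
  ON wp.root_id = ctx.root_id
WHERE ctx.root_tstamp > DATEADD(day, -30, GETDATE());
```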


#14

Hi @ryanrozich - that has been fixed: https://github.com/snowplow/snowplow/commit/059bdbe12d5a433709ab46207d527480d5e58d34


#15

Hey, any idea when this will be fixed? I still can’t access the page in the original post.


#16

Hi @hishamsajid,

That’s because this is an older conversation; further changes have been made since then.

We migrated the web model to its own repo on GitHub: https://github.com/snowplow/web-data-model

The recipes are no longer available on GitHub, but you can still find them in this post by Yali: Basic SQL recipes for web data

Hope this helps,

Christophe