Here's a basic architecture diagram of our data platform:
As you can see, we use DBT as the workflow and data transformation layer, and BigQuery for data storage and querying.
The goal is to create tables that make analytics easy.
In this article, we will start by explaining the structure of the events and the main challenge this presents. Then we will review the solution we have implemented with DBT, and how it enables us to process incoming web events every hour and then append late-arriving events from the past 24 hours onto historical rows, without re-processing the entire table.
When a user loads a page, an event is sent which ends up in the data lakehouse.
When a user continues to browse this page, more events are sent (referred to in this article as additive events).
Say a user loads the page at 11:50am, reads it for the next 2 minutes, views a sub-component of the page (a support banner), then clicks on that sub-component. The events could be expressed as:
These events would end up in raw format in the lakehouse like so:
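The exact raw schema isn't the focus here, but as an illustrative sketch (these column names are assumptions rather than our production schema), the raw table could look something like this, with the additive values carried in nested, repeated fields:

```sql
-- Illustrative only: an assumed shape for the raw web events table.
CREATE TABLE IF NOT EXISTS raw.web_events (
  event_id          STRING,     -- unique id of the event the browser sent
  page_view_id      STRING,     -- ties additive events back to their page view
  event_type        STRING,     -- e.g. 'page_view' or 'additive'
  event_timestamp   TIMESTAMP,  -- when the event arrived
  attention_seconds INT64,      -- cumulative seconds spent reading the page
  component_events  ARRAY<STRUCT<
    component STRING,           -- e.g. 'support_banner'
    action    STRING            -- e.g. 'view' or 'click'
  >>                            -- nested sub-component interactions
);
```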
Now, if we wanted to parse the data every hour, the SQL might look something like:
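We can't reproduce the production SQL, but a minimal sketch of the idea, run against the assumed raw.web_events table above for a single hard-coded hour, might be:

```sql
-- Sketch only: flatten one hour of raw events into one row per page view.
WITH page_views AS (
  SELECT
    page_view_id,
    event_timestamp AS page_view_timestamp
  FROM raw.web_events
  WHERE event_type = 'page_view'
    AND event_timestamp >= TIMESTAMP '2024-01-01 11:00:00+00'
    AND event_timestamp <  TIMESTAMP '2024-01-01 12:00:00+00'
),

additive AS (
  SELECT
    page_view_id,
    MAX(attention_seconds)             AS attention_seconds,
    ARRAY_CONCAT_AGG(component_events) AS component_events
  FROM raw.web_events
  WHERE event_type = 'additive'
    AND event_timestamp >= TIMESTAMP '2024-01-01 11:00:00+00'
    AND event_timestamp <  TIMESTAMP '2024-01-01 12:00:00+00'
  GROUP BY page_view_id
)

SELECT
  pv.page_view_id,
  pv.page_view_timestamp,
  a.attention_seconds,
  a.component_events
FROM page_views AS pv
LEFT JOIN additive AS a
  ON a.page_view_id = pv.page_view_id;
```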
This translates the data into a structure that's easy to understand and analyse:
We process the current hour of web events and the previous hour (from now on referred to as the mop up hour) to catch any 'just missed' events.
For example, a user might view the page at 3:59pm, but we don't receive the pageview until 4:02pm. This would be missed if, at 4:00pm, we only processed 3:00pm>>4:00pm. So instead, at 4:00pm we process 2:00pm>>4:00pm, then at 5:00pm we process 3:00pm>>5:00pm.
This is a fast solution, as we don't process much data at a time and the JOINs are small.
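In dbt terms, this can be sketched as an incremental model whose time filter always covers the current hour plus the mop up hour. The config and source names below are assumptions for illustration:

```sql
-- Sketch of the two-hour processing window (current hour plus the mop up hour),
-- assuming an incremental model keyed on page_view_id so the overlapping hour
-- is merged rather than duplicated on each run.
{{ config(materialized='incremental', unique_key='page_view_id') }}

SELECT
  page_view_id,
  event_timestamp AS page_view_timestamp
FROM {{ source('raw', 'web_events') }}
WHERE event_type = 'page_view'

{% if is_incremental() %}
  -- At 4:00pm this resolves to 2:00pm onwards, i.e. the 2:00pm >> 4:00pm window.
  AND event_timestamp >= TIMESTAMP_SUB(
    TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), HOUR),
    INTERVAL 2 HOUR
  )
{% endif %}
```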
The problem:
However, the user could come back to the page much later:
In the above example we have to match the additive event at 4:05pm to the page view event at 11:50am. We could do this by running the past 6 hours every hour (11:00am >> 5:00pm), but that would be a lot of data to reprocess, and as most people don't have such long interactions, most of that re-processing would be wasted because most of the data isn't changing.
On the fact_page_view DBT model we have a macro (the Stitching Macro) that runs in a post-hook. Every hour, separate models (int_component and int_attention_second) run for the new additive events. They unnest the fields so there is one row per event, and take a hash of the row as a unique key, using the page_view_id and all the other values.
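As a rough sketch of what an int_component-style model might do (the source, column names and hashing choice here are illustrative assumptions, not our exact code):

```sql
-- Sketch: one row per component event, with a hash of the row as a unique key.
SELECT
  e.page_view_id,
  e.event_timestamp,
  c.component,
  c.action,
  TO_HEX(MD5(CONCAT(
    e.page_view_id, '|',
    CAST(e.event_timestamp AS STRING), '|',
    c.component, '|',
    c.action
  ))) AS component_event_key
FROM {{ source('raw', 'web_events') }} AS e
CROSS JOIN UNNEST(e.component_events) AS c
WHERE e.event_type = 'additive'
```

In practice a macro such as dbt_utils' generate_surrogate_key can produce the same kind of hashed key.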
The macro then finds any matching page views within the past 24 hours of page views for these new additive events, and appends either a unique component_event or a higher attention_second value.
The macro also checks that we don't append a duplicate component_event (if the browser sends the same event twice), and that if the macro is re-run for a given hour it doesn't re-append the entire hour as new events.
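Conceptually, the dedup check is an anti-join of the new additive events against the keys already stitched onto the last 24 hours of page views. A simplified sketch, assuming the hashed key is stored alongside each component event on fact_page_view (all names illustrative):

```sql
-- Sketch: keep only component events whose key is not already on the page view,
-- so browser duplicates and re-runs of the same hour become no-ops.
WITH existing AS (
  SELECT
    f.page_view_id,
    ce.component_event_key
  FROM analytics.fact_page_view AS f
  CROSS JOIN UNNEST(f.component_events) AS ce
  WHERE f.page_view_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
)

SELECT n.*
FROM analytics.int_component AS n
LEFT JOIN existing AS e
  ON  e.page_view_id        = n.page_view_id
  AND e.component_event_key = n.component_event_key
WHERE e.component_event_key IS NULL;
```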
Optimising slot time is hugely important here, as we are processing 24 hours of web event data every hour.
Wouldn't this be simpler as multiple UPDATE statements?
Yes, multiple UPDATEs would save the headache of UNION-ing NULL columns and then dealing with those NULLs.
However, we found that grouping all the additive events into one query and a single UPDATE statement reduced slot time.
The UPDATE statement does a final JOIN to fact_page_view to push the values on (noted by the circled arrows below), but to do this it ingests 24 hours of fact_page_view. With multiple UPDATE statements, BigQuery was doing that ingest multiple times; by doing only one UPDATE we remove this otherwise duplicated slot work.
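A heavily simplified sketch of the shape of that single UPDATE, with both additive sources UNION-ed, NULL-padded and collapsed to one row per page view before the one join back to fact_page_view (all names and the exact NULL handling are illustrative):

```sql
-- Sketch: one UPDATE pushes both kinds of additive values onto fact_page_view,
-- so the 24-hour slice of the fact table is only scanned once.
UPDATE analytics.fact_page_view AS f
SET
  component_events = CASE
    WHEN s.new_components IS NULL THEN f.component_events
    ELSE ARRAY_CONCAT(f.component_events, s.new_components)
  END,
  attention_seconds = GREATEST(
    IFNULL(f.attention_seconds, 0),
    IFNULL(s.attention_seconds, 0)
  )
FROM (
  -- UNION the two additive sources, padding the columns each one lacks with
  -- NULLs, then collapse to one row per page view so the UPDATE join is 1:1.
  SELECT
    page_view_id,
    ARRAY_CONCAT_AGG(new_components) AS new_components,
    MAX(attention_seconds)           AS attention_seconds
  FROM (
    SELECT
      page_view_id,
      ARRAY_AGG(STRUCT(component, action, component_event_key)) AS new_components,
      CAST(NULL AS INT64)                                        AS attention_seconds
    FROM analytics.int_component
    GROUP BY page_view_id

    UNION ALL

    SELECT
      page_view_id,
      CAST(NULL AS ARRAY<STRUCT<component STRING,
                                action STRING,
                                component_event_key STRING>>),
      MAX(attention_seconds)
    FROM analytics.int_attention_second
    GROUP BY page_view_id
  )
  GROUP BY page_view_id
) AS s
WHERE f.page_view_id = s.page_view_id
  AND f.page_view_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR);
```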
Conclusion
Because DBT models are atomic, only ever updating the entire set of columns in a table, we have had to create a bespoke solution for late-arriving web events.
We have produced a low-latency, high-accuracy and low(ish)-cost solution, with the price paid being the complexity of a non-dbt-native, slot-optimised method.