The Fundamentals of Analytical Data Models
Part 3 of the series titled "Deconstructing Data Models"
👋 Hey there
Before getting into today’s post, I’m sharing this awesome session I’m hosting next week.
This masterclass on Recommender Systems with DJ and Schaun has been designed specifically for product and growth practitioners like yourself to learn what really matters when implementing a recommender system.
Welcome to the third and final part of the series titled Deconstructing Data Models. If you haven’t already, I highly recommend consuming parts one and two first.
Permanent link to this post to bookmark and read later →
Analytical data modeling is the process of defining what data needs to be made available in what shape for analysis or activation purposes.
But what does this process look like in practice?
Before I answer that, let me specify in the simplest possible terms what an analytical data model is:
An analytical data model represents a set of rules, that once applied to one or more tables in a database, creates a modified view of an existing table or a new table altogether.
In other words, an analytical data model can be used repeatedly to transform raw data that is often messy into clean, usable data.
Moreover, the purpose of an analytical data model is to transform data without modifying the source data. When a model is executed, the resulting table – whether a view (a virtual table) or a materialized view (a new table) – is easier for humans to read and analyze, as well as for machines to interpret and process.
Building models
To create an analytical data model, one has to define a set of rules on top of some available data. This is typically done using SQL which is easy to read and write and is the most widely used language for data modeling.
I recommend learning the basics of SQL if you’re interested in working with data (you wouldn’t be reading this if you weren’t). Knowing some SQL helps understand what happens under the hood in a database, enabling better conversations with data folks.
An SQL query references tables in a database and when it is run – either manually or as per a schedule – it looks for rows of data in the referenced tables, executes the rules, and if there are no errors, outputs a table with the expected data. The output can be analyzed directly, visualized using a BI tool, or synced to activation tools.
Queries vs. Models
An analytical data model is a SQL query but a SQL query doesn’t have to be a data model.
Going back to the example in Part 1 of this series, the hypothetical SaaS company serves 10,000 users across 1000 workspaces out of which 200 are paid. To present data about customers, you need to write a simple SQL query that checks for workspaces that are on one of the paid plans at a given point in time. Assuming that the Workspace table contains a boolean column called is_paid, the query would look like this:
Select * FROM Workspace WHERE is_paid = TRUE;
In SQL, the asterisk represents “all” – the rest of the query is self-explanatory.
The query above can be saved as a data model and can either be executed each time against the original table (Workspace) or a new table called Customer can be created once, which only updates when the model is executed again. The key difference is that the new table (the materialized view) uses additional storage but is also faster to load since the model only needs to be executed if there’s an expected change in the data set. If you only wish to view all the rows in the Customer table, you need not execute the model – you just need to run the following query:
Select * FROM Customer;
It’s also worth highlighting that even though SQL is often associated with data transformation or data modeling, a query doesn’t always transform data – it can simply fetch an entire table (like the one above) or specific columns in a table. It doesn’t make sense to refer to such a query as a data model as it simply fetches data – it doesn’t model it for repeated use.
Now let’s say you want to take a look at data about customers who are on the Pro plan. Assuming that the Customer table contains a column called plan_name (whose data type is enum or string), the query would look like this:
Select * FROM Customer WHERE plan_name = ‘Pro’;
This query is the equivalent of applying a filter in a spreadsheet or a BI tool or creating a simple segment in an email marketing tool.
Whether it makes sense to save a query as a model or not is a matter of personal preference – there’s no right or wrong answer here. The whole idea behind creating data models is to save time by eliminating repetitive work. That said, saving every query as a data model can lead to model bloat which is a painful thing to manage. Remember, more models, more problems.
Managing models
That brings us to the importance of managing analytical data models.
Until a few years ago, a system to manage data models didn’t exist because there wasn’t a need for one. Data modeling was done before data was loaded into a data warehouse – a practice that led to the creation of fewer models and a lower volume of data being stored.
Why?
Because storing and processing large amounts of data for analytical purposes was expensive.
Instead of jumping straight into a history lesson here, I figured it might be more fun to first share a couple of verses from a song I wrote called The Modern Data Track – here you go:
The separation of Compute and Storage in the cloud,
Made warehousing cheaper, more accessible, much faster.
Queries take seconds to process in parallel,
Pay as you go, storage is infinitely scalable.
ETL became ELT
Because Extracting and Loading without Transforming is easy,
Once in the warehouse, whenever the need be,
Data can be transformed by writing a SQL query.
A brief history lesson
Historically, analytical data modeling was performed before data was loaded into an analytical database for reporting purposes (data activation was less common and was largely disconnected from analytical workloads).