Contextless Data Collection

Unreliable experiments, suboptimal decisions, degraded experiences, worse products, and ultimately, frustration.

May 01, 2024

I don’t know what got us here, but today it’s common practice for teams to collect all the data they can from every possible source and store it in a centralized database – even when there is no context or predefined purpose.

Contextless data collection has become something of an obsession, and the rationale behind it is that data becomes more accessible to teams when it is stored in one place (typically a cloud data warehouse, data lake, data lakehouse, or whatever comes next).

However, even though modern data tools have made it cheaper and faster for teams to collect, store, and access large amounts of data, we have to ask these questions:

What’s the point in making data accessible without a predefined purpose and a measurable outcome?
If there’s no experiment to run, decision to make, or outcome to measure, why should the data be collected in the first place?

After all, data collection, like any other activity, requires talented individuals to expend their energy and organizations to spend money.

We have to keep in mind that as soon as the collection process is set in motion and data begins to land in centralized cloud storage, storage costs begin to accrue. On top of that, every query against the data, whether to create a report or to move a dataset to a downstream system, incurs an additional cost.

Lastly and most importantly, the more data a query has to process, the longer it takes to run, which, in turn, costs more money.

Let’s look at an example:

X and Y offer competing products, and both maintain a report that fetches the latest data every morning (a query run on a schedule). The CEO of X believes in collecting all the data, irrespective of whether it’s needed for the daily report or not. In contrast, the CEO of Y thinks it’s better to collect only the data needed for the daily report.

X is obviously paying more for storage, but that’s not all. Every time their respective queries run, X also pays more than Y for compute, because X’s query has to process more data and therefore takes longer to run.
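To make the comparison concrete, here is a rough sketch of the arithmetic. Every number in it is made up for illustration: the storage price, the per-terabyte query price, and the data volumes for X and Y are all assumptions, and it assumes a warehouse that bills compute by the amount of data processed (time-based billing would show a similar gap, since processing more data takes longer).

```python
# All prices and data volumes are hypothetical, chosen only to illustrate the example.
STORAGE_PRICE_PER_TB_MONTH = 20.0  # assumed USD per TB stored per month
QUERY_PRICE_PER_TB = 5.0           # assumed USD per TB processed by a query


def monthly_cost(stored_tb: float, processed_tb_per_run: float, runs_per_month: int = 30) -> float:
    """Return storage cost plus compute cost for one report that runs daily."""
    storage = stored_tb * STORAGE_PRICE_PER_TB_MONTH
    compute = processed_tb_per_run * QUERY_PRICE_PER_TB * runs_per_month
    return storage + compute


# X collects everything; Y collects only what the daily report needs.
cost_x = monthly_cost(stored_tb=50.0, processed_tb_per_run=2.0)
cost_y = monthly_cost(stored_tb=5.0, processed_tb_per_run=0.2)

print(f"X: ${cost_x:,.2f} per month")
print(f"Y: ${cost_y:,.2f} per month")
```

With these assumed inputs, both companies get the same report, yet X pays for it twice over: once to store data the report never needed, and again each morning when the query has to process it.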

This is a simplified example, but you can imagine the impact when the scenario is multiplied by the number of different queries that large organizations execute every day. Irrespective of the outcomes, these organizations end up spending a lot on their cloud bills (and then spend more figuring out how to reduce those bills). Moreover, more data increases not only direct expenditure but also risk in the event of a privacy audit or a data breach.

But even if we put cost and risk aside for a minute, the practice of contextless collection keeps growth teams out of the data collection process. That, in turn, prevents them from understanding what data is needed to answer their questions – a prerequisite for folks who are hungry to drive growth using data.

As depicted in the figure above, contextless data collection is the antithesis of the GDG model¹ – it appears to be faster, but sooner or later the process becomes lengthy and circuitous, and adversely impacts everything that follows.

In contrast, when there’s context, questions come up, and those questions lead to collaboration between teams. As a result, teams are also able to come up with relevant metrics before collection is initiated, leading to a robust data foundation (Good Data).

Pause to ponder 🤔

Based on what you’ve read so far, take a few minutes to think about these popular narratives:

Intro to the GDG model

databeats newsletter 🥁
