
Data as building blocks

I’ve come across too many instances where data is treated as a byproduct of work, something that’s accessible but not truly usable. Data is generated as a side effect of operations, and people just make do. “The analysts manage with some queries and Excel spreadsheets, we’re fine” (and far too often, I’ve been that analyst).

Think of it like lumber in a sawmill – wood undergoes processing (cutting, drying, mixing with other materials) for specific uses. Before jumping into processing (the “how”), the essential question is the “why” – the demand:

Which parts require data for operation: automated processes, managed processes, analyses (product, forecasts, etc.), various controls? Each process demands a different frequency, a different breakdown, and different quantitative metrics.

On the supply side, the data must first be mapped according to the business’s requirements: correct identification of users at each stage, quantitative summation on the same scale (e.g., money always in dollar cents), and so on.
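
To make that mapping concrete, here is a minimal sketch in Python, with made-up identifiers and field names: stage-specific user IDs are resolved to one canonical key, and money is always stored as integer cents.

```python
from decimal import Decimal

# Hypothetical lookup tables: each stage of the funnel identifies the
# same person differently, so both are mapped to one canonical user key.
SIGNUP_EMAILS = {"ada@example.com": "user_001"}
CHECKOUT_IDS = {"cust-9f2": "user_001"}

def canonical_user(stage: str, identifier: str) -> str:
    """Resolve whatever identifier a stage emits to one canonical user key."""
    lookup = {"signup": SIGNUP_EMAILS, "checkout": CHECKOUT_IDS}[stage]
    return lookup[identifier]

def to_cents(amount_dollars: str) -> int:
    """Keep all money on the same scale: integer cents."""
    return int(Decimal(amount_dollars) * 100)

print(canonical_user("checkout", "cust-9f2"), to_cents("19.99"))  # user_001 1999
```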

And then, of course, the “how” – what’s needed to organize the data, and which tools will work best. As trivial as it may sound, normalization and orderly loading of the data are critical. Without them, any attempt to analyze, understand, and leverage the data will quickly hit a wall (especially when it comes to saving money).

As a basic starting point for normalization,

there are two types of data: categorical and quantitative. Categorical data is usually updated infrequently; global categories like dates or countries change little if at all (the last time the calendar itself was updated was in 1582, by Pope Gregory XIII, and its adoption in Eastern Europe, the Orthodox world, only came in the 20th century). Categorical data such as users, SKUs, and the like usually needs to be updated on demand, and no faster than the quantitative data. What matters is that each update is complete. With dates, for example, it is relatively easy to create a table that contains, for each date, the day of the week, the day of the year, the quarter it belongs to, holidays, and important events (the more meticulous will split these into a separate events table). If we take SKUs, it is also important to import barcodes, product names, manufacturers, catalog entry and exit dates, and so on (yes, proper data management will also include a manufacturers table, a catalogs table, and so on, down to an epsilon small enough – that is, until it stops being interesting).
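
As an illustration of such a complete categorical table, here is a minimal sketch of a date dimension using pandas; the holiday entry is a placeholder, and in practice it would come from a separate events table as noted above.

```python
import pandas as pd

def build_date_dimension(start: str, end: str) -> pd.DataFrame:
    """One row per calendar date, always loaded in full."""
    dates = pd.date_range(start, end, freq="D")
    dim = pd.DataFrame({
        "date": dates,
        "day_of_week": dates.day_name(),
        "day_of_year": dates.dayofyear,
        "quarter": dates.quarter,
        "year": dates.year,
        "is_weekend": dates.dayofweek >= 5,  # Saturday/Sunday
    })
    # Placeholder: real holidays and events belong in their own table.
    holidays = {"2024-01-01": "New Year"}
    dim["holiday"] = dim["date"].dt.strftime("%Y-%m-%d").map(holidays)
    return dim

date_dim = build_date_dimension("2024-01-01", "2024-12-31")
print(date_dim.head())
```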

When pulling external data,

there are usually global standards that are important to consider for any possible future connection: exchange rates, barcodes as mentioned, tickers for traded companies, and so on.

Quantitative data is where the gold is usually buried. By definition, this data carries the business performance, whether in money or any other metric. It usually comes from operational systems and is built to serve the operational process responsible for it. The overlap between it and analytical data is high, but not complete. The easy and convenient thing is to take it as is, which is often the problem: operational data is updated according to operational needs, not analysis needs: a time zone change because someone changed a setting in AWS, a slightly different tagging definition, a new set of tags… there’s no shortage. The recycle bin on every analyst’s desktop is full of reports that had to be recreated because of this.

The obvious solution (which somehow never happens 🤔) is to create a dedicated table, modeled for analysis and reporting needs, or for all data needs in general. Superficially, we’ve doubled the data, but in practice we’ve severed the dependency on any third party. Any change in tagging, time zone, or naming convention is handled in the loading (ELT, ETL, MPEL, or any other approach), but our table remains the same. Of course, with keys to the pre-defined categorical tables.
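
A minimal sketch of what that dedicated analytical table could look like at load time; the field names, the tag mapping, and the UTC convention are assumptions for the example. Each operational record is normalized once, and every downstream report sees only the stable shape, keyed to the categorical tables.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Operational systems rename tags over time; the mapping lives in the load
# step so the analytical table itself never has to change.
TAG_MAP = {"purchase": "sale", "buy": "sale", "refund-v2": "refund"}

@dataclass
class FactRow:
    """One row in the analytical table, keyed to the categorical tables."""
    event_utc: datetime   # always UTC, regardless of the source time zone
    date_key: str         # key into the date dimension
    user_key: str         # key into the users dimension
    sku_key: str          # key into the SKU dimension
    event_type: str       # normalized tag
    amount_cents: int     # money always in cents

def normalize(raw: dict, source_tz: str) -> FactRow:
    """Turn one operational record into a stable analytical row."""
    local = datetime.fromisoformat(raw["timestamp"]).replace(tzinfo=ZoneInfo(source_tz))
    utc = local.astimezone(timezone.utc)
    return FactRow(
        event_utc=utc,
        date_key=utc.strftime("%Y-%m-%d"),
        user_key=raw["user_id"],
        sku_key=raw["sku"],
        event_type=TAG_MAP.get(raw["tag"], raw["tag"]),
        amount_cents=round(raw["amount"] * 100),
    )

row = normalize(
    {"timestamp": "2024-03-01T10:30:00", "user_id": "u42",
     "sku": "SKU-7", "tag": "buy", "amount": 19.99},
    source_tz="Asia/Kolkata",
)
print(row)
```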

This way, we can load data from any relevant source,

and as long as we maintain the same structure, we can actually create a single source of truth. The deeper the normalization goes into the data, the more systems the single source of truth can serve, including operational ones where some genius decided that it’s best to spin up an AWS server in India, or play hide and seek with customer financial transactions.
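
A sketch of that single-source-of-truth idea, with invented source names and fields: each source gets a small adapter, but every adapter must emit the same normalized shape, so consumers never care where a record came from.

```python
from typing import Callable

# Every adapter maps a source-specific record into the same shared shape.
Adapter = Callable[[dict], dict]

def from_webshop(rec: dict) -> dict:
    return {"user_key": rec["customerId"], "sku_key": rec["item"],
            "amount_cents": rec["priceCents"]}

def from_pos_terminal(rec: dict) -> dict:
    return {"user_key": rec["loyalty_card"], "sku_key": rec["barcode"],
            "amount_cents": round(rec["total"] * 100)}

ADAPTERS: dict[str, Adapter] = {"webshop": from_webshop, "pos": from_pos_terminal}

def load(source: str, records: list[dict]) -> list[dict]:
    """Load any source into the single shared structure."""
    adapt = ADAPTERS[source]
    return [adapt(r) for r in records]

print(load("pos", [{"loyalty_card": "L-77", "barcode": "5901234123457", "total": 4.5}]))
```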

I would plan each stage separately, independently of the others, so that replacing any part doesn’t affect the rest. Data isn’t just a byproduct for reports; it’s an integral part of the product, even if it originates as a byproduct. For digital products, data is, by definition, the raw material.
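
One way to read “plan each stage separately” in code terms (a sketch, not a prescribed architecture): give each stage a narrow interface, so the extractor or the destination can be swapped without touching the transform.

```python
from typing import Iterable, Protocol

class Extractor(Protocol):
    def extract(self) -> Iterable[dict]: ...

class Loader(Protocol):
    def load(self, rows: Iterable[dict]) -> None: ...

def transform(rows: Iterable[dict]) -> Iterable[dict]:
    """The transform only knows the agreed row shape, not the endpoints."""
    for row in rows:
        yield {**row, "amount_cents": round(row["amount"] * 100)}

def run(extractor: Extractor, loader: Loader) -> None:
    # Each stage can be replaced independently of the others.
    loader.load(transform(extractor.extract()))

class ListExtractor:
    def __init__(self, rows: list[dict]) -> None:
        self.rows = rows
    def extract(self) -> Iterable[dict]:
        return self.rows

class PrintLoader:
    def load(self, rows: Iterable[dict]) -> None:
        for row in rows:
            print(row)

run(ListExtractor([{"amount": 2.5}]), PrintLoader())  # {'amount': 2.5, 'amount_cents': 250}
```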


