Model Simplicity Is A Virtue

During a recent analytics conference in London, one of the attending companies presented a staggering fact: they currently manage 85,000 models, which works out to more than 150 models for each active developer. While this might at first seem an impressive figure, it raises the question: are we becoming overwhelmed by data? Quantities this large point towards a data swamp rather than a purposeful, streamlined collection.

Why Do We Do Data Modelling?

At its core, data modelling serves one primary function: to transform raw data into actionable, usable data assets. This transformation often involves translating intricate, system-level data into coherent business-friendly concepts. Semantic modelling, as it’s now commonly termed, focuses on making data decipherable for business-centric applications.

Regardless of the data modelling methodology you subscribe to, the aim remains consistent. The nirvana of self-service data is based on achieving this clarity. To make data insights universally accessible, we need to simplify complex datasets and structures.

Historical Context is Crucial

Back in the late 1990s and 2000s, data modelling’s main challenge was performance. The question was: how do we structure data for optimal processing and access with minimal computational resources? This period gave birth to renowned methodologies by Kimball, Inmon, and Linstedt (data vault).

The 2010s ushered in cloud data warehouses, where performance issues took a backseat and data tools became widely accessible. No longer did you need elaborate structures to obtain swift query times. As the allure of big data grew, Data Scientist was christened the ‘sexiest’ job title. Organisations were grappling with the mammoth task of scaling data while keeping some semblance of order across numerous departments.

With the onset of ‘Data Wrangling’, a slew of tools such as Alteryx, Tableau Prep, and Trifacta emerged. They offered intuitive data transformation capabilities, democratising data further and blurring the lines between IT and business roles. And then, in 2018, dbt emerged from this tumultuous landscape, proposing that the rigorous control methods long adopted in software engineering be applied to data modelling. It also popularised a new role, the Analytics Engineer, renewing the focus on data modelling.

Figure: The humorous side of Data Wrangling.

Query-driven Modelling: A Sticky Situation

The current bottom-up approach, or query-driven modelling, is at the heart of many data team challenges today. As an unfortunate side effect of the proliferation and accessibility of dbt, many organisations find themselves in what feels like a data quagmire. Teams are inundated with hundreds, thousands, or even tens of thousands of models, often without proper documentation. The result is dwindling data warehouse performance, soaring processing costs that do not scale, and an overload of ‘data layers’, often with an unhealthy reliance on a single methodology or a total lack of one.

Business Domain Modelling: The Need of the Hour

It’s imperative for data modellers, whether Analytics Engineers or otherwise, to begin with a comprehensive view of the business. This is often termed the domain or semantic model and we like to use those terms at Tasman as well. Such a model offers a high-level business perspective and serves as the foundation for detailed modelling.

While forming this holistic view, modellers must strike a delicate balance between the business’s ideal state and its current realities. For startups or businesses scaling rapidly, domain modelling is essential: it forms the blueprint that directly informs lower-level data modelling decisions. Meanwhile, large enterprises can apply the same principles to individual business domains.
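
To make the blueprint idea concrete, here is a minimal Python sketch of a domain model used as a checklist for warehouse models. Everything in it is hypothetical and chosen for illustration: the entity names, their definitions, the model names, and the unmapped_models helper. The assumption is simply that each warehouse model should trace back to one agreed business entity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainEntity:
    """A business concept with one agreed definition (illustrative only)."""
    name: str
    definition: str


# Hypothetical domain model: the high-level business view.
DOMAIN_MODEL = {
    entity.name: entity
    for entity in [
        DomainEntity("customer", "A person or organisation that has bought a product."),
        DomainEntity("subscription", "A customer's entitlement to a product."),
    ]
}

# Hypothetical warehouse models, each mapped to the entity it describes.
# None means nobody can say which business concept the model serves.
WAREHOUSE_MODELS = {
    "dim_customer": "customer",
    "fct_subscription": "subscription",
    "tmp_marketing_export_v2": None,
}


def unmapped_models(models, domain):
    """Return the names of models that do not trace back to a domain entity."""
    return sorted(name for name, entity in models.items() if entity not in domain)


print(unmapped_models(WAREHOUSE_MODELS, DOMAIN_MODEL))
# ['tmp_marketing_export_v2']
```

A check like this will not design the model for you, but it makes orphaned models visible before they accumulate.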

More often than not, business domain modelling unearths key discrepancies in business logic:

  • The case of ‘What is a lead?’: We encountered a firm where the Sales, C-suite, and Marketing teams had disparate definitions of a “lead.” This variance caused massive inconsistencies in KPIs and ultimately triggered a complete CRM overhaul.
  • Defining a ‘subscription’: On the surface, it seems straightforward. Yet, reaching a consensus on its definition can be complex. Is it merely a transaction with a set end date, or does it define a customer’s entitlement to a product? Differing interpretations can dramatically skew metrics.
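
The subscription example can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a recommended schema: the Subscription fields, the grace-period idea behind access_until, and the sample records are all hypothetical. It shows how the two readings of ‘subscription’ give different active counts from the same data.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Subscription:
    """Hypothetical record; field names are assumptions for illustration."""
    customer_id: str
    paid_from: date                      # start of the billed period
    paid_until: date                     # end of the billed period
    access_until: Optional[date] = None  # entitlement end, e.g. a grace period


def active_by_transaction(subs, on):
    """Definition A: a subscription is a transaction with a set end date."""
    return {s.customer_id for s in subs if s.paid_from <= on <= s.paid_until}


def active_by_entitlement(subs, on):
    """Definition B: a subscription is the customer's entitlement to the product."""
    return {
        s.customer_id
        for s in subs
        if s.paid_from <= on <= (s.access_until or s.paid_until)
    }


SUBSCRIPTIONS = [
    Subscription("c1", date(2023, 1, 1), date(2023, 1, 31)),
    Subscription("c2", date(2023, 1, 1), date(2023, 1, 31), access_until=date(2023, 2, 14)),
    Subscription("c3", date(2023, 2, 1), date(2023, 2, 28)),
]

REPORT_DATE = date(2023, 2, 5)
print(len(active_by_transaction(SUBSCRIPTIONS, REPORT_DATE)))  # 1
print(len(active_by_entitlement(SUBSCRIPTIONS, REPORT_DATE)))  # 2
```

Neither function is wrong. The metric only becomes trustworthy once the business agrees which definition it means.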

Identifying these kinds of issues early saves countless hours of rebuilding organisational trust, and is critical to achieving the dream of a ‘single source of truth’.

In Conclusion

Modern data modelling is at a crossroads. As organisations race to accumulate data, it’s crucial to pause and reflect on the methodology being applied. It’s not about how much data you have, but how effectively and cohesively you can utilise it. Only then can we steer clear of the looming data swamp and harness the true potential of our data assets.