Mastering Modularity - Deep dive into Effective Data Modeling Techniques

For many years, maintaining an efficient data pipeline meant tedious work, repetitive problem-solving, and access limited to a few trusted individuals. When those skilled builders moved on to new opportunities, what they left behind was often a tangled, complex legacy of code.

The introduction of cloud data warehouses and tools like dbt has changed the game, offering a more enjoyable and efficient approach. The shift involves moving from extensive scripts to modular, accessible data transformations, with a streamlined workflow encompassing data procurement, staging, and presentation to end consumers.

The workflow is now neatly segmented:

  1. Acquire raw source data.
  2. Prepare the data in staging models as required.
  3. Present the processed data to end consumers through fact + dimension models and data marts.
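To make step 2 concrete, here is a minimal sketch of what a dbt-style staging model can look like. The jaffle_shop source name and the column names are illustrative assumptions for this sketch, not details from a real project:

```sql
-- models/staging/stg_orders.sql
-- Staging model: light cleanup and renaming of a raw source table
-- before it is exposed to fact/dimension models and marts.

with source as (

    -- source() points dbt at the raw table loaded by the ingestion tool;
    -- the 'jaffle_shop' source and 'orders' table are assumed for this sketch
    select * from {{ source('jaffle_shop', 'orders') }}

),

renamed as (

    select
        id         as order_id,
        user_id    as customer_id,
        order_date,
        status
    from source

)

select * from renamed
```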

The trustworthiness of data for analytics engineers hinges on how well it is prepared for analysis. If end users lack confidence in the data, they may resort to isolating their work, undermining the collaborative potential of analytics.

To keep data modeling efforts on track, consider the following questions:

  1. Are models consistently defined and named? Is their purpose immediately understandable?
  2. Are models easy to read?
  3. Are models easy to debug and optimize? Can anyone identify and rectify long-running models?

We'll delve into these questions shortly, but first, let's rewind to the era before modular data modeling techniques.

Traditional, monolithic approaches to data modeling

Before any transformation framework or tool came along, the most common (and most reliable) option analysts had for data modeling was plain SQL scripts.

Imagine, or perhaps you are still dealing with, SQL files of 10,000 lines of code. Or, if you went for something more technical (or fancy), you split those 10,000 lines into multiple smaller files or stored procedures, scheduled to run in a specific order and orchestrated by Python cron jobs.

In this scenario, since no one other than the author can confidently reuse pieces of the huge SQL file, every new transformation requirement means starting again from the raw source. Either that, or multiple copies of the same code start to accumulate.

We can draw a parallel to traditional monolithic architecture in software engineering: each consumer of data ends up rebuilding their own transformations from raw source data.

What is the modular way of data modeling?

Traditional, monolithic data modeling involves crafting a single, comprehensive structure, often leading to complexity, rigidity, and challenges in maintenance. Enter modular data modeling, a revolutionary approach that advocates for breaking down complex systems into smaller, self-contained modules.

With a modular approach, every user of the data models in the organisation can reuse modules created by others.

The Core Principles of Modularity

  • Scalability: Modular data modeling allows for scalable development. Each module can be developed and maintained independently, facilitating the growth of the system without compromising efficiency.
  • Flexibility and Adaptability: With modular design, adapting to changes becomes seamless. Updates or modifications can be made to specific modules without disrupting the entire system, fostering agility in response to evolving business requirements.
  • Reusability: Modules can be reused across different projects or within the same system, eliminating redundancy and promoting efficiency in development.

Upon adopting dbt as a data modeling framework, the perspective shifts from viewing data models as monolithic entities to treating them as distinct components:

Which transformations were common across various data models? Were there elements that could be abstracted into foundational models and referenced in multiple contexts?

In dbt, one data model can reference another using the ref function.
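As a sketch, a downstream fact model could build on the staging model above by referencing it with ref rather than re-reading the raw source. The model and column names follow the earlier illustrative example:

```sql
-- models/marts/fct_orders.sql
-- Fact model: reuses the staging model via ref() instead of duplicating
-- the raw-source cleanup logic.

select
    order_id,
    customer_id,
    order_date,
    status
from {{ ref('stg_orders') }}
where status != 'cancelled'
```

Because ref resolves dependencies at compile time, dbt also derives the execution order (the DAG) from these references, so any number of marts can reuse stg_orders without extra orchestration code.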

Further Reading