Domain and Data

Adam Lammiman
Nov 8, 2022
Image by Wynn Pointaux from Pixabay

I’ve been thinking a lot recently about data and how best to integrate software and data solutions. I have also been enjoying the Databricks Champions of Data and AI series, in which leaders in this area talk about their approaches.

This has led me to formulate a working theory. I will fully admit that it is a work in progress, more a thought experiment than a fully fleshed-out philosophy. That said, I have found it useful for conceptualising the differences between these spaces and how they might interact.

Broadly, I’ve started to place software into two categories: Domain and Data. Like any attempt at categorising and putting things in boxes, there are grey areas and things that don’t fit neatly, but as a model I think it gives a useful way of dividing the purposes of solutions and the different rules they might abide by.

Domain

A Domain application has a specific purpose: it has been developed for a reason, to achieve a focussed goal. It is the solution to a known problem (it may be the first of a number of answers to that problem, and both the problem and the solution may evolve over time, but both are bounded rationalisations).

If we follow the principles of Domain-Driven Design, the software should have clear bounded contexts, express the language of the domain, model that domain, and provide an answer to a specific set of problems within it.

Infrastructure, storage and code should be streamlined and focussed on that purpose. The storage layer should be mainly focussed on ‘posh state management’; computation and analysis are strictly limited to achieving the goals of the solution. Just enough, no more and no less.

Boundaries and validation should be strictly maintained, with invariants checked and managed through well-crafted aggregates.
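To make that concrete, here is a minimal sketch in Python of an aggregate root whose methods guard its invariants. The Order/OrderLine names and the maximum-lines rule are illustrative assumptions, not taken from any particular domain:

```python
# A minimal sketch of a Domain aggregate guarding its invariants.
from dataclasses import dataclass
from uuid import UUID, uuid4


@dataclass(frozen=True)
class OrderLine:
    sku: str
    quantity: int
    unit_price_pence: int


class Order:
    """Aggregate root: all changes go through methods that check invariants."""

    MAX_LINES = 50  # hypothetical business rule

    def __init__(self, order_id: UUID | None = None):
        self.order_id = order_id or uuid4()
        self._lines: list[OrderLine] = []

    def add_line(self, line: OrderLine) -> None:
        # Invariants are validated here, at the boundary of the aggregate,
        # so no code path can put the Order into an inconsistent state.
        if line.quantity <= 0:
            raise ValueError("quantity must be positive")
        if line.unit_price_pence < 0:
            raise ValueError("price cannot be negative")
        if len(self._lines) >= self.MAX_LINES:
            raise ValueError("order has too many lines")
        self._lines.append(line)

    @property
    def total_pence(self) -> int:
        return sum(l.quantity * l.unit_price_pence for l in self._lines)
```

The point is that every state change passes through a method that checks the rules, so the aggregate can never be observed in an invalid state.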

Data

A Data application, in contrast, is much more open in its scope. Its main concern is pooling as many data sources as possible and giving users the freedom to combine them, ask questions and discover new insights. It is about the ability to find answers to questions you hadn’t yet thought to ask. Where Domain applications are focussed on a goal, Data applications are focussed on collation and analysis to form insights. These insights can then inform further iterations of a Domain application, or point to where the next product might arise.

The main concerns of this type of space are scale, quality, ease of analysis and managed access. It’s important to be able to scale quickly and run analysis easily over large sets of data; it’s equally important that the data you are analysing is correct and its biases understood; and it’s even more important that this data is only accessible to those who have permission to see it.

The main boundaries for a Data application are quality gates controlling how data gets in, plus tagging and metadata to make sure that only those who should see it can see it. Strict validation at the boundary, together with data cleansing operations, lets you be confident that the data is correct and fit for purpose.
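As an illustration, here is a sketch of such a quality gate at the ingestion boundary; the field names, the source system and the ‘pii’ tag are assumptions for the example:

```python
# A sketch of a quality gate at the ingestion boundary: records are
# validated and tagged with metadata before they are admitted.
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class IngestResult:
    accepted: list[dict]
    rejected: list[tuple[dict, str]]  # (record, reason)


def quality_gate(records: list[dict]) -> IngestResult:
    accepted, rejected = [], []
    for record in records:
        if "customer_id" not in record:
            rejected.append((record, "missing customer_id"))
            continue
        if not isinstance(record.get("amount"), (int, float)):
            rejected.append((record, "amount is not numeric"))
            continue
        # Tag on the way in, so access control can be enforced downstream.
        record["_meta"] = {
            "ingested_at": datetime.now(timezone.utc).isoformat(),
            "source": "orders-service",   # assumed upstream system
            "tags": ["pii"],              # drives who may query this record
        }
        accepted.append(record)
    return IngestResult(accepted, rejected)
```

Records that fail validation are rejected with a reason rather than silently dropped, and accepted records carry the metadata that downstream access control relies on.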

Interaction

How these two spaces interact is really important. The concept of a Data Lake is now very familiar; if we extend that analogy, Domain applications are the tributaries that flow into the lake. Another framing I’ve found useful is to consider Domain applications as the sense organs of an organisation, feeding data to a central organisational intelligence. As the number of Domain applications grows, so do the flows of sensory information feeding that intelligence, increasing its conception of its environment.

This also means that the quality of this information flow is vitally important: if the information is hard to gather or of low quality, then the insights that can be generated from it are limited or, worse, potentially wrong.

This is why it’s important to have clear boundaries and data quality baked in; getting this right negates or reduces the need for complicated ETL procedures and data cleansing. Domain objects with clear validation rules for their invariants, coupled with domain-event messaging systems such as NServiceBus, RabbitMQ or cloud-native equivalents in Azure and AWS, allow near-real-time feeds into Data systems where the quality is baked in.
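As a sketch of what that feed might look like, here a Domain application publishes a validated event to RabbitMQ using the pika client. The queue name and event shape are assumptions, and the order is presumed to have already passed through its aggregate’s validation:

```python
# A sketch of a Domain application publishing a validated event to
# RabbitMQ, giving the Data space a clean, near-real-time feed.
import json
import pika


def publish_order_placed(order) -> None:
    event = {
        "type": "OrderPlaced",
        "order_id": str(order.order_id),
        "total_pence": order.total_pence,  # already validated by the aggregate
    }
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    try:
        channel = connection.channel()
        channel.queue_declare(queue="domain-events", durable=True)
        channel.basic_publish(
            exchange="",
            routing_key="domain-events",
            body=json.dumps(event).encode("utf-8"),
        )
    finally:
        connection.close()
```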

If messaging systems are not available or desirable, then specific database views or APIs that the data service communicates through give similarly clear boundaries, and allow Domain applications to feed into Data spaces without creating a dangerous web of invisible dependencies, though you may lose the near-real-time aspect of the messaging model.
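A minimal sketch of that alternative, assuming a FastAPI service and a hypothetical fetch_daily_order_totals helper backed by a dedicated reporting view:

```python
# A sketch of the non-messaging alternative: a small read-only API that
# exposes a curated view for the Data space to pull from, instead of
# letting it reach into the production database directly.
from fastapi import FastAPI

app = FastAPI()


def fetch_daily_order_totals(day: str) -> list[dict]:
    # In a real service this would query a dedicated reporting view,
    # not the live transactional tables.
    return [{"day": day, "orders": 0, "total_pence": 0}]


@app.get("/exports/daily-order-totals/{day}")
def daily_order_totals(day: str) -> list[dict]:
    return fetch_daily_order_totals(day)
```

Because the Data space only ever talks to this published contract, the Domain application remains free to change its internal schema without breaking anyone downstream.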

Data cleansing may still be required for external data sources, but with good quality control and event-driven systems you can be confident that your internal data feeds are validated.

Conclusion

I understand these are broad brushstrokes and there are levels of detail I’m not addressing (reporting solutions, data warehousing and specific ML applications, for example), but as a broad set of principles I’ve found them useful for articulating what I feel is different about these spaces and how they should fit together and interact.

As I have said before, how we define the constraints in a system is important, because it gives us clear guidelines as to what is permissible and what is not (ETLs run directly against application production databases, for instance). Hopefully others can find use in this as well.

