Data Demythed

Debunking the "conventional wisdom" of data modelling, normalization, etc.

Data Modelling

The Case Against Denormalization


Count me amongst those who have been in awe of Euler’s Identity since first seeing it, which in my case was many years ago in high school:

e^{iπ} = -1

In the late 1890s, the US state of Indiana considered passing a law that would “simplify” the value of π. If the lawmakers had prevailed, the beauty of the above equation would have been lost because it cannot work with the wrong value for π. For example, using a π approximation of 22/7 yields:

e^{i·22/7} ≈ -0.999999201 - 0.00126448893i

Ugly by comparison, not at all the same thing, and certainly not “simpler.”
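
For anyone who wants to check the arithmetic, here is a minimal sketch using Python's standard cmath module. The helper name euler is my own, introduced only for illustration. Note that floating-point arithmetic leaves a tiny imaginary residue (on the order of 1e-16) even with the best available value of π, so the first result is -1 only to within rounding.

```python
import cmath
import math


def euler(theta: float) -> complex:
    """Return e^(i*theta) as a complex number (illustrative helper)."""
    return cmath.exp(1j * theta)


# With the true value of pi (to double precision).
exact = euler(math.pi)

# With the "simplified" approximation 22/7.
approx = euler(22 / 7)

print("pi   :", exact)
print("22/7 :", approx)
```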

And that’s the problem with “denormalization”: in the name of “simplification” and/or “performance,” it deprives us of the correct answer to the problem, and probably a “beautiful” solution at that. And if it is not providing the correct answer, it must be providing the wrong answer. That is, it is a data model more appropriate to a different problem.

Which then raises the question: if the data model is not correct for the problem, how can the programs written upon it present the expected solution? Again the answer is clear: they can’t.

“Denormalization” only works because the problems are sufficiently complex that extra, compensating programming can usually make the delivered solution look sufficiently close to the desired one. Of course, the greater the “denormalization,” the more coding is required. Oh, and the result can only be an even more approximate solution. And yet we seldom, if ever, stop to consider that trade-off.
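
To make the compensating-code point concrete, here is a small sketch using Python's built-in sqlite3 module. Everything in it is hypothetical and invented for illustration: the tables customer, orders and orders_denorm, the column names, and the helper functions. It assumes the copied column is meant to hold the customer's current city, not a historical snapshot. The normalized model records the city once; the "denormalized" one copies it onto every order row, so a partial update leaves the data contradicting itself until extra code repairs it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Normalized: each customer's current city is stored exactly once.
    CREATE TABLE customer (id INTEGER PRIMARY KEY, city TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id)
    );

    -- "Denormalized": the city is copied onto every order row.
    CREATE TABLE orders_denorm (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        customer_city TEXT NOT NULL
    );

    INSERT INTO customer VALUES (1, 'Ottawa');
    INSERT INTO orders VALUES (10, 1), (11, 1);
    INSERT INTO orders_denorm VALUES (10, 1, 'Ottawa'), (11, 1, 'Ottawa');
""")


def cities_normalized(con, customer_id):
    """Distinct cities reported for a customer's orders, normalized model."""
    rows = con.execute(
        "SELECT DISTINCT c.city FROM orders o "
        "JOIN customer c ON c.id = o.customer_id WHERE c.id = ?",
        (customer_id,),
    )
    return sorted(row[0] for row in rows)


def cities_denormalized(con, customer_id):
    """Distinct cities reported for a customer's orders, denormalized model."""
    rows = con.execute(
        "SELECT DISTINCT customer_city FROM orders_denorm WHERE customer_id = ?",
        (customer_id,),
    )
    return sorted(row[0] for row in rows)


def repair_denormalized(con, customer_id, city):
    """Compensating code: rewrite every copy of the city for one customer."""
    con.execute(
        "UPDATE orders_denorm SET customer_city = ? WHERE customer_id = ?",
        (city, customer_id),
    )


# The customer moves. Normalized model: one fact, one update.
con.execute("UPDATE customer SET city = 'Toronto' WHERE id = 1")

# Denormalized model: an update that reaches only one copy
# leaves two different answers to the same question.
con.execute("UPDATE orders_denorm SET customer_city = 'Toronto' WHERE id = 10")

print(cities_normalized(con, 1))
print(cities_denormalized(con, 1))
```

Calling repair_denormalized(con, 1, 'Toronto') brings the copies back into agreement, but that function exists only because of the design choice, and every program that writes to orders_denorm has to remember to do the equivalent.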

It’s well known that fixing design issues earlier is cheaper in the long run. “Denormalization” represents design flaws at the very foundation of a system. We need to be more systematic in assessing the cost of such design compromises as the implications ripple up through the coding layers of a system. The obvious way to do that is to require a data modeller to document the differences between the proper, normalized data model and the data model that has been implemented. From that, a proper impact analysis can be undertaken by all involved in the implementation, to ensure the compromises are appropriate for the scale and scope of the system.
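
As a rough idea of what the starting point for such documentation might look like, the sketch below compares two models described as plain Python dictionaries mapping table names to sets of column names. The function model_differences, the dictionary layout, and the example tables are all assumptions of mine, not an established tool or notation; a real comparison would also need to cover keys, constraints and dependencies, which this deliberately ignores.

```python
def model_differences(normalized, implemented):
    """Return, per table, the columns added to or dropped from the normalized model."""
    report = {}
    for table in sorted(set(normalized) | set(implemented)):
        want = normalized.get(table, set())
        have = implemented.get(table, set())
        added = have - want      # present only in the implemented model
        dropped = want - have    # present only in the normalized model
        if added or dropped:
            report[table] = {"added": sorted(added), "dropped": sorted(dropped)}
    return report


# Hypothetical example: the implemented model folds customer data into orders.
normalized = {
    "customer": {"id", "city"},
    "orders": {"id", "customer_id"},
}
implemented = {
    "orders": {"id", "customer_id", "customer_city"},
}

differences = model_differences(normalized, implemented)
for table, diff in differences.items():
    print(table, diff)
```

Each entry in the resulting report is a compromise that someone should be able to justify, along with the compensating code it will require.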

The tricky part of this proposal is that I’ve encountered many data modellers in my career who can offer the “denormalized” option, but few … nay, none who could tell me what the properly normalized data model was. The myth of “denormalization” is used to justify whatever the data model is. It’s time to expect more of the data models on which our systems are built.