Data Demythed

Debunking the "conventional wisdom" of data modelling, normalization, etc.

Data Modelling

The Anti-Normalization Myths

4NF, Data Definition, Normalization

Everyone knows that normalization doesn’t work in practice.

Except me.

Everyone knows it especially doesn’t work for data warehouses.

Except me.

Everyone knows that normalization causes performance issues.

Except me.

Everyone knows that you stop at Third Normal Form. Or maybe Boyce-Codd Normal Form.

Except me.

My memory of venturing into the data side of the IT business back in the 1970s is that normalization was still evolving. As I recall, there was concensus that there would be five rules, and definition of the first three was accepted fairly quickly. Agreement on the fourth floundered for a period, with the Boyce-Codd form accepted as an interim step. Agreement on the fifth seemed to follow relatively soon after the fourth.

What was clear is that each normalization step yields more tuples, or tables if you wish, than the one before it. Even before fourth normal form (4NF) was formulated, it was seen that removing multi-valued dependencies would yield significantly more tuples. Indeed, it seems to me that the myth about stopping before 4NF gained traction before it was defined. Before, that is, we even understood what the impact and usefulness of Fourth Normal Form (4NF) might be.

And, of course, that assumption has prevailed even though, as I will argue in a future posting, 4NF is used all the time.

Early in my career I faced the conflict of finding normalization enticing despite the “conventional wisdom” of its impracticality. I assumed that I would at least find some guidelines on when and how normalization was useful, and where to compromise. Alas, I found nothing. In fact, I have never found any evidence of anyone - besides myself - making a serious effort to apply normalization rules as a data modelling discipline in real-world problems. Cases may exist, but I’ve never encountered anything beyond my own work in a career of 40-odd years.

Academic/textbook treatments don’t count. Their use of small, carefully-constructed examples does nothing to help real world practitioners who don’t have the luxury of small data sets with clear data issues. If we have data, it’s probably too much to analyse in sufficient depth in the time available. Even more problemmatic is when we don’t yet have the data. That leaves us dependent on what people tell us about it, which will always be less than we need to know as data modellers regardless of our skill in extracting requirements. In fact, defining data is the single, most difficult part of the modelling exercise and yet it is a topic on which academic/textbook treatments spend little to no attention. This issue will permeate my future writings because normalization is, ultimately, all about defining data properly. It is those data definitions that give us what we need to know to apply the normalization rules.

I have dealt with a lot of IT people who knew the myths and baulked when they encountered my refusal to embrace them. Some rejected it out of hand and tried to rebuff, even undermine, my efforts. I’m pleased to say that the majority worked with me and, I’ll dare suggest, we were all much happier with the results. I will write more about such experiences in the future.