Data Demythed

Debunking the "conventional wisdom" of data modelling, normalization, etc.

Data Modelling

4NF: We Do It All the Time


Consider the tuple {Emp#, EmpName, Salary}.

By any reasonable definition of the terms, this is clearly in 3NF with EmpName and Salary fully and directly dependent on Emp#. And yet anyone with even modest modelling experience will “intuitively” know to separate them into {Emp#, EmpName} and {Emp#, Salary}.

In fact, this is the resolution of a multi-valued dependency and that “intuitive” action is a move to 4NF.

But what does “multi-valued dependency” mean in practice? In the example, EmpName and Salary have different relationships to Emp#. To put it another way, the value in EmpName typically changes at different times (rarely) for different reasons (e.g., marriage) than Salary (annual review, perhaps?). In 4NF, all elements of a tuple are related in the same way, and are therefore subject to change together at the same time for the same functional reason.

I’m old enough to remember pre-mobile/cell phones, when {..., HomeAddress, HomePh#} was generally considered 4NF. (Well, almost: I had a friend change address by moving across the street and the phone company physically swapped the phone wires of the two houses in the street box, so that he was able to keep his landline number. But that was clearly unusual!) With technical developments that allow ph#s to be moved from one place to another, even between landlines and mobile/cell phones in some places, this is now clearly not 4NF.

If this was part of a payroll system, which presumably only needs to know the current basis for a pay cheque, then 3NF might suffice. But if it’s part of a personnel system, then surely the day after it went live someone would notice the inability to capture salary history.

Interestingly, 4NF seems to have anticipated that for us: adding AsOfDate to the {Emp#, Salary} tuple and its key is a relatively simple data base change. We can probably expect changes to existing code to be relatively straightforward as well, especially where the tuple is just being read. Obviously update routine changes may be a bit more involved.

If we went with the original 3NF tuple, remediating the change will be much more involved. Do we restructure the data base by splitting the table, and refactor all the code affected by that? Much more work will be involved, including in the regression testing required. If instead we simply add the column to the existing 3NF tuple, coding will forever be more complex as it deals with redundancy issues in the multiple entries containing a person’s name. If EmpName was declared as an Alternate Key, that data base-enforced integrity constraint will have to be abandoned with further serious consequences rippling through the system.

Rolling back the impact if a promotion fails is also more difficult in the latter case: it is easier to drop column(s) in the 4NF scenario than glue two tables back together in the 3NF case.

Of course, this is a simple example. Normalization violations are often more difficult to identify than this example. But the cost of remediating will be lower the more normalization has been done up front.

And because 4NF essentially deals with time-based data issues, and maintaining history is essential for so many reasons today (audit trails, marketing, …), it has become extremely important to understand and aim for 4NF data models. At least.