Debunking the "conventional wisdom" of data modelling, normalization, etc.

Data Modelling

2NF: The Missing Use Case

0NF, 1NF, 2NF, 4NF, Data Definition, Normalization

The previous post of “5NF: The Missing Use Case” was the first of four explaining why I gave up looking for a good data modelling book by the end of the 1980s. In this second post, I discuss what I saw as a gap in the Second Normal Form (2NF) discussion. Filling this gap helps us to appreciate how generally applicable, and consistent, Normalization is.

The Classical Explanation

A typical “textbook” example is something like:

{ Employee#, EmployeeName, Salary, SalaryDate }

Assuming that tuple is in First Normal Form (1NF) and the definition:

Salary
The amount an employee is paid annually starting from the indicated SalaryDate.

we can show dependencies as something like:

{ Employee#, EmployeeName, Salary, SalaryDate }
Classical Pre-2NF Dependencies

With EmployeeName having only the Employee# dependency while, per the definition, the Salary is also a function of SalaryDate, the following puts our data into Second Normal Form (2NF):

Employee{ Employee#, EmployeeName } EmployeeSalary{ Employee#, SalaryDate, Salary }

The Missing Discussion

Consider the tuple:

{ Employee#, EmployeeName, Salary, SalaryDate, Part# }

with:

Part#
Identifies a component used in the assembly of the products the company sells.

Part# clearly has no relationship to any of the other items (highlighted by the shaded box around it in the following with no association lines in or out of that shading):

{ Employee#, EmployeeName, Salary, SalaryDate, Part# }
Undiscussed Pre-2NF Dependency

And that lack of relationship also makes it a 2NF violation. Which yields the following 2NF solution:

Employee{ Employee#, EmployeeName } EmployeeSalary{ Employee#, SalaryDate, Salary } Part{ Part# }

The “textbook” examples, at least when I was looking through them, completely missed this point.

Why is This Important?

Because it highlights the generality of Normalization: we can start with all identified, and documented, data items (what I call Zero Normal Form, or 0NF) in one, big tuple. And Normalization, alone, will decompose the data items into their component groupings.

In reality, Part# is unlikely to be alone. PartDescription is an obvious additional data item to have identified:

{ Employee#, EmployeeName, Salary, SalaryDate, Part#, PartDescription }

Drawing the (assumed) dependencies clearly shows the lack of connection between the Employee- and (shaded) Part-related data items:

{ Employee#, EmployeeName, Salary, SalaryDate, Part#, PartDescription }
More Typical Pre-2NF Dependencies

and its 2NF form is evident:

Employee{ Employee#, EmployeeName } EmployeeSalary{ Employee#, SalaryDate, Salary } Part{ Part#, PartDescription }

In practice, of course, many - most? - data modellers would “intuitively” split unconnected-to-Employee data items, Part# and PartDescription, into their own tuple without even thinking about it being an application of 2NF.

But that, importantly, is what they are doing. And it would be better if it was a change made consciously from that perspective, so that the transformation correctness can be properly asserted.

A First Normal Form (1NF) Aside

What if Salary was better defined?:

Salary
The amount an Employee is paid annually starting from the indicated SalaryDate. The Salary may be adjusted by annual review, a change of assignment, a performance review, or for another reason as decided by management.

This highlights something we missed with the previous definition: the “… adjusted …” part makes it clearer that there can be multiple { Salary, SalaryDate } values for an Employee. Hence the original (0NF) tuple becomes:

{ Employee#, EmployeeName, { Salary, SalaryDate } }

And to get this to 1NF, we would need to transform it to … well, what gave us 2NF above!:

Employee{ Employee#, EmployeeName } EmployeeSalary{ Employee#, SalaryDate, Salary }

In effect, 2NF has “compensated” for our missing the multiple values condition that would’ve resolved it via 1NF, and gotten us to the same solution anyway.

A Fourth Normal Form (4NF) Aside

What if, instead, the definition of Salary and SalaryDate was more limited:

Salary
The current amount an Employee is paid on an annual basis.
SalaryDate
The date on which the current Salary started.

These definitions specifically preclude more than one value of each. In other words, knowing the Employee# is sufficient to determine each of EmployeeName, Salary, and SalaryDate. This means we have:

{ Employee#, EmployeeName, Salary, SalaryDate }
1NF, 2NF, and 3NF Dependencies for Current Salary

Which means, in this context, that our 0NF tuple is now also our 1NF, 2NF, and (since there are no indirect dependencies) 3NF solutions:

Employee{ Employee#, EmployeeName, Salary, SalaryDate }

The difference is that, because Salary and SalaryDate are defined as “… current …” values, there is still only one tuple per employee. Hence, SalaryDate does not become part of an identifying candidate key.

However, 4NF requires items in a tuple to have the same functional relationship. Clearly this is not the case. The functional relationships to the Employee# identifier of each of EmployeeName versus { Salary, SalaryDate } are very different:

isKnownBy( Employee# ) -> EmployeeName earns( Employee# ) -> { Salary, SalaryDate }

The functions can be thought of as labels on the arcs:

{ Employee#, EmployeeName, Salary, SalaryDate }isKnownByearns
4NF Functional Relationships

So we split them based on the functional realtionship. And this again almost gets us back to the 2NF form discussed above:

Employee{ Employee#, EmployeeName } EmployeeSalary{ Employee#, SalaryDate, Salary }

Interestingly, if the requirement for multiple { Salary, SalaryDate } values for an Employee is later identified (i.e, the history of that is required), it is simply a matter of changing that candidate key to, again, exactly what we had for the 1NF and 2NF solutions above:

Employee{ Employee#, EmployeeName } EmployeeSalary{ Employee#, SalaryDate, Salary }

There are no structural changes required, such as splitting a tuple containing that + EmployeeName in this example.

And since the model tuples map to database tables, that means no database restructure to split a table into two. Splitting tables can have major “ripple” effects through code. Changing the assumptions around an existing table - how a row within it is identified, in this case - should make it easier to assess the coding changes required. And, with well-structured code, those changes should be more isolated, leading to easier testing before promotion to production.

But with the “… current …” Salary definition in place, what if SalaryDate had not been identified as a requirement?

{ Employee#, EmployeeName, Salary }

This still satisfies the requirements of 1NF, 2NF, and 3NF.

But not 4NF, because the functional relationships to Employee# of each of EmployeeName versus Salary are still different, as shown above. Hence, with 4NF we still finish up with:

Employee{ Employee#, EmployeeName } EmployeeSalary{ Employee#, Salary }

Even now if(/?when) the need for Salary history becomes apparent, the addition of SalaryDate and its inclusion in the EmployeeSalary candidate key is still confined to a single tuple. And, by extension, still means changes to the single table that implements that tuple. With all the benefits of lower impact to the coding changes already mentioned.

It’s almost as if Normalization “sees” the gaps in our original understanding of the data, and works to compensate for that before we even know we need it.

In Summary

I started by explaining how a gap in the “textbook” coverage of 2NF hid an important aspect of Normalization’s generality.

I expanded on that discussion to try to show that application of Normalization rules was very much affected by how we understood the data items. Depending on the definitions and/or their interpretation, we “landed” in a place where each of the following Normalization rules became applicable:

There are three important things to note here:

In other words, it is arguably only necessary to learn 4NF and later (i.e., 5NF), since 4NF will also take care of 1NF, 2NF, and 3NF.

This, of course, stands in stark contrast to the “conventional wisdom” that stopping after 3NF - or the “3.5NF” portrayed as Boyce-Codd Normal Form (BCNF) - is sufficient for most purposes. Which it is simple to demonstrate is not the case.

Acknowledgement

My thanks to Anthony for reviewing this.