2NF: The Missing Use Case
The previous post of “5NF: The Missing Use Case” was the first of four explaining why I gave up looking for a good data modelling book by the end of the 1980s. In this second post, I discuss what I saw as a gap in the Second Normal Form (2NF) discussion. Filling this gap helps us to appreciate how generally applicable, and consistent, Normalization is.
The Classical Explanation
A typical “textbook” example is something like:
{ Employee#, EmployeeName, Salary, SalaryDate }
Assuming that tuple is in First Normal Form (1NF) and the definition:
Salary
- The amount an employee is paid annually starting from the indicated
SalaryDate
.
we can show dependencies as something like:
With EmployeeName
having only the Employee#
dependency while, per the definition, the Salary
is also a function of SalaryDate
, the following puts our data into Second Normal Form (2NF):
Employee{ Employee#, EmployeeName } EmployeeSalary{ Employee#, SalaryDate, Salary }
The Missing Discussion
Consider the tuple:
{ Employee#, EmployeeName, Salary, SalaryDate, Part# }
with:
Part#
- Identifies a component used in the assembly of the products the company sells.
Part#
clearly has no relationship to any of the other items (highlighted by the shaded box around it in the following with no association lines in or out of that shading):
And that lack of relationship also makes it a 2NF violation. Which yields the following 2NF solution:
Employee{ Employee#, EmployeeName } EmployeeSalary{ Employee#, SalaryDate, Salary } Part{ Part# }
The “textbook” examples, at least when I was looking through them, completely missed this point.
Why is This Important?
Because it highlights the generality of Normalization: we can start with all identified, and documented, data items (what I call Zero Normal Form, or 0NF) in one, big tuple. And Normalization, alone, will decompose the data items into their component groupings.
In reality, Part#
is unlikely to be alone.
PartDescription
is an obvious additional data item to have identified:
{ Employee#, EmployeeName, Salary, SalaryDate, Part#, PartDescription }
Drawing the (assumed) dependencies clearly shows the lack of connection between the Employee
- and (shaded) Part
-related data items:
and its 2NF form is evident:
Employee{ Employee#, EmployeeName } EmployeeSalary{ Employee#, SalaryDate, Salary } Part{ Part#, PartDescription }
In practice, of course, many - most? - data modellers would “intuitively” split unconnected-to-Employee
data items, Part#
and PartDescription
, into their own tuple without even thinking about it being an application of 2NF.
But that, importantly, is what they are doing. And it would be better if it was a change made consciously from that perspective, so that the transformation correctness can be properly asserted.
A First Normal Form (1NF) Aside
What if Salary
was better defined?:
Salary
- The amount an
Employee
is paid annually starting from the indicatedSalaryDate
. TheSalary
may be adjusted by annual review, a change of assignment, a performance review, or for another reason as decided by management.
This highlights something we missed with the previous definition: the “… adjusted …” part makes it clearer that there can be multiple { Salary, SalaryDate }
values for an Employee
.
Hence the original (0NF) tuple becomes:
{ Employee#, EmployeeName, { Salary, SalaryDate } }
And to get this to 1NF, we would need to transform it to … well, what gave us 2NF above!:
Employee{ Employee#, EmployeeName } EmployeeSalary{ Employee#, SalaryDate, Salary }
In effect, 2NF has “compensated” for our missing the multiple values condition that would’ve resolved it via 1NF, and gotten us to the same solution anyway.
A Fourth Normal Form (4NF) Aside
What if, instead, the definition of Salary
and SalaryDate
was more limited:
Salary
- The current amount an
Employee
is paid on an annual basis. SalaryDate
- The date on which the current
Salary
started.
These definitions specifically preclude more than one value of each.
In other words, knowing the Employee#
is sufficient to determine each of EmployeeName
, Salary
, and SalaryDate
.
This means we have:
Which means, in this context, that our 0NF tuple is now also our 1NF, 2NF, and (since there are no indirect dependencies) 3NF solutions:
Employee{ Employee#, EmployeeName, Salary, SalaryDate }
The difference is that, because Salary
and SalaryDate
are defined as “… current …” values, there is still only one tuple per employee.
Hence, SalaryDate
does not become part of an identifying candidate key.
However, 4NF requires items in a tuple to have the same functional relationship.
Clearly this is not the case.
The functional relationships to the Employee#
identifier of each of EmployeeName
versus { Salary, SalaryDate }
are very different:
isKnownBy( Employee# ) -> EmployeeName earns( Employee# ) -> { Salary, SalaryDate }
The functions can be thought of as labels on the arcs:
So we split them based on the functional realtionship. And this again almost gets us back to the 2NF form discussed above:
Employee{ Employee#, EmployeeName } EmployeeSalary{ Employee#, SalaryDate, Salary }
Interestingly, if the requirement for multiple { Salary, SalaryDate }
values for an Employee is later identified (i.e, the history of that is required), it is simply a matter of changing that candidate key to, again, exactly what we had for the 1NF and 2NF solutions above:
Employee{ Employee#, EmployeeName } EmployeeSalary{ Employee#, SalaryDate, Salary }
There are no structural changes required, such as splitting a tuple containing that + EmployeeName
in this example.
And since the model tuples map to database tables, that means no database restructure to split a table into two. Splitting tables can have major “ripple” effects through code. Changing the assumptions around an existing table - how a row within it is identified, in this case - should make it easier to assess the coding changes required. And, with well-structured code, those changes should be more isolated, leading to easier testing before promotion to production.
But with the “… current …” Salary
definition in place, what if SalaryDate
had not been identified as a requirement?
{ Employee#, EmployeeName, Salary }
This still satisfies the requirements of 1NF, 2NF, and 3NF.
But not 4NF, because the functional relationships to Employee#
of each of EmployeeName
versus Salary
are still different, as shown above.
Hence, with 4NF we still finish up with:
Employee{ Employee#, EmployeeName } EmployeeSalary{ Employee#, Salary }
Even now if(/?when) the need for Salary
history becomes apparent, the addition of SalaryDate
and its inclusion in the EmployeeSalary
candidate key is still confined to a single tuple.
And, by extension, still means changes to the single table that implements that tuple.
With all the benefits of lower impact to the coding changes already mentioned.
It’s almost as if Normalization “sees” the gaps in our original understanding of the data, and works to compensate for that before we even know we need it.
In Summary
I started by explaining how a gap in the “textbook” coverage of 2NF hid an important aspect of Normalization’s generality.
I expanded on that discussion to try to show that application of Normalization rules was very much affected by how we understood the data items. Depending on the definitions and/or their interpretation, we “landed” in a place where each of the following Normalization rules became applicable:
1NF:
EmployeeName
has a singular relationship toEmployee#
, while{ Salary, SalaryDate }
has a multi-valued relationship.2NF:
EmployeeName
is related toEmployee#
alone, whileSalary
is determined to derive from the combination ofEmployee#
andSalaryDate
.4NF: Even when the
Salary
definition - with or withoutSalaryDate
- defined a singular ("… current …") relationship toEmployee#
, just like that toEmployeeName
, the different functional relationship ofEmployee#
to each mandated the transformation to the same same set ofEmployee
andEmployeeSalary
tuples.
There are three important things to note here:
Each of 1NF, 2NF, and 3NF are removing a subset of functionally-different relationships within a tuple. But they do it by inferring those different functional relationships in some fashion.
4NF is the “general purpose” form because it does make us think about the functional relationships themselves. This also means that 4NF alone is comprehensive up to that point.
Regardless of how we interpreted things in each case, we still finished up with the same pair of tuples when we reached 4NF. This makes sense because, ultimately, the data items have the same relationships to each other regardless of when/how we figure out that detail.
In other words, it is arguably only necessary to learn 4NF and later (i.e., 5NF), since 4NF will also take care of 1NF, 2NF, and 3NF.
This, of course, stands in stark contrast to the “conventional wisdom” that stopping after 3NF - or the “3.5NF” portrayed as Boyce-Codd Normal Form (BCNF) - is sufficient for most purposes. Which it is simple to demonstrate is not the case.
Acknowledgement
My thanks to Anthony for reviewing this.