What’s in a Name? Assume: Nothing!
My previous post started with the 3NF tuple {Emp#, EmpName, Salary}
, and transformed it to the 4NF tuples {Emp#, EmpName}
and {Emp#, Salary}
to remove the multi-valued dependency. It then noted how an AsOfDate
is easily added to the latter to capture salary history even if that wasn’t part of the original requirements. Or perhaps if that requirement had been missed.
But what if the AsOfDate
had been part of the original tuple?:
{Emp#, EmpName, Salary, AsOfDate}
There are no repeating elements, so it is in 1NF. But there does appear to be a 2NF issue: EmpName
is (presumably) a function of {Emp#}
alone, while Salary
is a function of the {Emp#, AsOfDate}
tuple. This is resolved by decomposing to the tuples
{Emp#, EmpName}
{Emp#, AsOfDate, Salary}
The result may look the same as the 4NF result + AsOfDate
added retroactively, but it’s important to understand the profound differences in what is happening in each case. A reasonable definition of Salary
in that previous, 4NF example:
Salary
- The current base annual salary of the associated employee used as the starting point for calculating the employee’s compensation at each pay cycle.
is very different to that in our 2NF example in this posting:
Salary
- The base annual salary of the associated employee as of the indicated
AsOfDate
, and applicable until theAsOfDate
in date sequence of another entry for the same employee. If no such entry exists, then the record contains the current base annual salary for the employee.
Normalization is based on mathematical principles. Different definitions means different terms despite having the same name. A mathematician would not confuse i as the subscript on a ∑ summation with an i representing √-1 just because they look the same. Different definitions, then, lead to different results (i.e., 2NF versus 4NF violations).
On the other hand, if our definitions included:
AsOfDate
- The date upon which the associated
EmpName
andSalary
settings take effect for the associated employee (indicated byEmp#
). These values remain in effect until a later-dated entry for the same employee, based on date sequence. If no such entry exists, then the record contains the current settings for the employee. EmpName
- The preferred name of the associated employee (indicated by
Emp#
) as of the indicatedAsOfDate
. Amongst the reasons a person’s name may change are following a marriage, or by deed poll. Salary
- The base annual salary of the associated employee (indicated by
Emp#
) as of the indicatedAsOfDate
used as the starting point for calculating the employee’s compensation at each pay cycle. Amongst the reasons a person’s salary may change are following an annual performance review, or as part of a promotion.
then EmpName
and Salary
are both dependent on the {Emp#, AsOfDate}
tuple, and {Emp#, EmpName, Salary, AsOfDate}
is indeed in 2NF - and probably 3NF - after all. However, the descriptions of EmpName
and Salary
still make it clear that each changes for different reasons, giving us a multi-valued dependency and 4NF violation. The 4NF decomposition now duplicates AsOfDate
along with Emp#
:
{Emp#, AsOfDate, EmpName}
{Emp#, AsOfDate, Salary}
The literature on normalization is generally silent on the matter of defining data. It is important to note that each time we apply a normalization transform, we also need to adjust data definitions accordingly. Whereas the {Emp#, AsOfDate}
previously qualified {EmpName, Salary}
as a tuple, now different variants of {Emp#, AsOfDate}
identifies each of EmpName
and Salary
independently. For example, consider the revised definition of each Emp#
:
Emp#'EmpName
- Identifies the employee who is known by the associated
EmpName
as of the indicatedAsOfDate
… Emp#'Salary
- Identifies the employee who is paid on the basis of the indicated
Salary
as an annual base rate as of the indicatedAsOfDate
…
Typically both Emp#
s are implemented as foreign keys (FKs). One criticism sometimes leveled at normalization is that FKs represent data redundancy. As we can see, the definition of each FK is different, and the corresponding Primary Key (PK) definition would be different as well. Different definitions mean each, in turn, is a different piece of data. Therefore there is no redundancy. The various Emp#
s are just drawn from the same domain, which facilitates the use of a relational join to assemble the pieces into a result set.