Data Demythed

Debunking the "conventional wisdom" of data modelling, normalization, etc.

Data Modelling

What’s in a Name? Assume: Nothing!

2NF, 4NF, Data Definition, FK, PK

My previous post started with the 3NF tuple {Emp#, EmpName, Salary}, and transformed it to the 4NF tuples {Emp#, EmpName} and {Emp#, Salary} to remove the multi-valued dependency. It then noted how an AsOfDate is easily added to the latter to capture salary history even if that wasn’t part of the original requirements. Or perhaps if that requirement had been missed.

But what if the AsOfDate had been part of the original tuple?:

{Emp#, EmpName, Salary, AsOfDate}

There are no repeating elements, so it is in 1NF. But there does appear to be a 2NF issue: EmpName is (presumably) a function of {Emp#} alone, while Salary is a function of the {Emp#, AsOfDate} tuple. This is resolved by decomposing to the tuples

{Emp#, EmpName}
{Emp#, AsOfDate, Salary}

The result may look the same as the 4NF result + AsOfDate added retroactively, but it’s important to understand the profound differences in what is happening in each case. A reasonable definition of Salary in that previous, 4NF example:

Salary
The current base annual salary of the associated employee used as the starting point for calculating the employee’s compensation at each pay cycle.

is very different to that in our 2NF example in this posting:

Salary
The base annual salary of the associated employee as of the indicated AsOfDate, and applicable until the AsOfDate in date sequence of another entry for the same employee. If no such entry exists, then the record contains the current base annual salary for the employee.

Normalization is based on mathematical principles. Different definitions means different terms despite having the same name. A mathematician would not confuse i as the subscript on a ∑ summation with an i representing √-1 just because they look the same. Different definitions, then, lead to different results (i.e., 2NF versus 4NF violations).

On the other hand, if our definitions included:

AsOfDate
The date upon which the associated EmpName and Salary settings take effect for the associated employee (indicated by Emp#). These values remain in effect until a later-dated entry for the same employee, based on date sequence. If no such entry exists, then the record contains the current settings for the employee.
EmpName
The preferred name of the associated employee (indicated by Emp#) as of the indicated AsOfDate. Amongst the reasons a person’s name may change are following a marriage, or by deed poll.
Salary
The base annual salary of the associated employee (indicated by Emp#) as of the indicated AsOfDate used as the starting point for calculating the employee’s compensation at each pay cycle. Amongst the reasons a person’s salary may change are following an annual performance review, or as part of a promotion.

then EmpName and Salary are both dependent on the {Emp#, AsOfDate} tuple, and {Emp#, EmpName, Salary, AsOfDate} is indeed in 2NF - and probably 3NF - after all. However, the descriptions of EmpName and Salary still make it clear that each changes for different reasons, giving us a multi-valued dependency and 4NF violation. The 4NF decomposition now duplicates AsOfDate along with Emp#:

{Emp#, AsOfDate, EmpName}
{Emp#, AsOfDate, Salary}

The literature on normalization is generally silent on the matter of defining data. It is important to note that each time we apply a normalization transform, we also need to adjust data definitions accordingly. Whereas the {Emp#, AsOfDate} previously qualified {EmpName, Salary} as a tuple, now different variants of {Emp#, AsOfDate} identifies each of EmpName and Salary independently. For example, consider the revised definition of each Emp#:

Emp#'EmpName
Identifies the employee who is known by the associated EmpName as of the indicated AsOfDate
Emp#'Salary
Identifies the employee who is paid on the basis of the indicated Salary as an annual base rate as of the indicated AsOfDate

Typically both Emp#s are implemented as foreign keys (FKs). One criticism sometimes leveled at normalization is that FKs represent data redundancy. As we can see, the definition of each FK is different, and the corresponding Primary Key (PK) definition would be different as well. Different definitions mean each, in turn, is a different piece of data. Therefore there is no redundancy. The various Emp#s are just drawn from the same domain, which facilitates the use of a relational join to assemble the pieces into a result set.