5NF: The Missing Use Case
18 Aug 2023: Discussion around this post at Minimal Modeling may add to understanding of it.
When I found myself heading down the data analysis, modelling, and administration path early in my career (late 1970s), I would spend time in technical bookstores looking for texts to help me on my way. I had 4 things to check in deciding if a book was worth my while. By the end of the 1980s, I had given up on finding an useful book.
This article covers the second-highest item on my list. Subsequent posts will cover the 3rd and 4th items. The 4th, and final, post will address the most important item on my list, something which I’m astonished has not received a lot more attention. Or even any attention at all, from my experience to this day.
But I start with the poor treatment of Fifth Normal Form (5NF).
What Is 5NF?
Simplistically, the first four normalization forms consider the relationship of non-key attributes to a Candidate Key (i.e., an unique identifier for a tuple). There may, of course, be more than one Candidate Key for a tuple.
5NF considers the relationship between elements within a compound Candidate Key. In the example that we use below, we start with the following tuple, with each of its Candidate Keys underlined:
Performance{ Performance#, Concert#, Instrument#, Musician#, … }
The question, then, is whether the presence of the 3-part key of Concert#, Instrument#, Musician#
as a Candidate Key suggest a data redundancy issue for the tuple as a whole.
Traditional Treatment of 5NF
Early on, one author’s text seemed to become the “go to” standard. And, I admit, I bought a couple of editions over the years, and found information within useful to my early learning.
But one thing with which I wasn’t comfortable was the presentation of 5NF. Unfortunately, other texts seemed to simply adopt a similar treatment. They’d present something like the following Entity-Relation Diagram (ERD), to which I add a textual representation since that’s the detail we use for Normalization, as being 4NF and needing to be assessed for whether it was 5NF:
Concert{ Concert#, ConcertName, … } Instrument{ Instrument#, InstrumentName, … } Musician{ Musician#, MusicianName, … } Performance{ Performance#, Concert#, Instrument#, Musician#, … }
Something like this would simply be presented as the only possible transformation to take it to 5NF:
Concert{ Concert#, ConcertName, … } Instrument{ Instrument#, InstrumentName, … } Musician{ Musician#, MusicianName, … } Appearance{ Appearance#, Concert#, Musician#, … } Inclusion{ Inclusion#, Concert#, Instrument#, … } Skill{ Skill#, Instrument#, Musician#, … }
It would then be argued - correctly - that this is not a non-loss decomposition of the original 4NF.
In other words, it is not possible to formulate a query to derive the information that exists in the central [Performance]
entity in the previous diagram.
And how could it be, given that we have removed the context of [Performance]
as a place for that information to come together?
There is nothing about the independent, binary relationships of [Appearance]
, [Inclusion]
, and [Skill]
that can give context to the intersection of [Concert]
, [Instrument]
, and [Musician]
.
In the “traditional” treatment, that would be the end of it, as if there were no other transformational options. The original model would therefore be declared to also be 5NF.
But there are other transformational options to consider.
A big part of the problem here is that the “standard” presentation clearly assumes some definitions for the model components. But those assumed definitions are not provided. Let alone discussed as part of helping us understand the application of 5NF. And yet it is very much the definition of the data that determines how Normalization rules are applied in transforming from one form to the next.
My problem was that I couldn’t conceive what those definitions, supporting the conclusion, might be. In this article I will proceed with data definitions that I can imagine for the modelled components. With my definitions, the original model is not in 5NF. And we will use those definitions to find a solution not considered in the “standard” treatment of 5NF.
Venn Diagram-Based Review
Let’s take advantage of the Relational Model’s mathematical foundation to understand what is happening via Venn Diagrams.
The first, assumed 4NF diagram could be drawn as the following, with [Performance]
being the exact overlap of the [Concert]
, [Instrument]
, and [Musician]
relations:
The considered, and correctly rejected, possible transform as a Venn Diagram - with a “hole” in the middle representing the loss of the [Performance]
context - is:
Properly, and fully, overlapping the [Concert]
, [Instrument]
, and [Musician]
sets in a Venn Diagram, covering all combinations, is a basic technique.
And in doing that here, we show that it covers all aspects of our data model.
And points us towards the correct transformation of our original model:
With the Venn Diagram’s area boundaries pointing us to the appropriate relationships, translating it back to a data model is straightforward:
Concert{ Concert#, ConcertName, … } Instrument{ Instrument#, <InstrumentName, … } Musician{ Musician#, MusicianName, … } Appearance{ Appearance#, Concert#, Musician#, … } Inclusion{ Inclusion#, Concert#, Instrument#, … } Skill{ Skill#, Instrument#, Musician#, … } Performance{ Performance#, Appearance#, Inclusion#, Skill#, … }
Not Done Yet …
As mentioned above, this has pointed us toward the appropriate transformation. But we can’t yet assume it is correct. Our new conception of the model now needs to be re-tested against the user requirements.
In (imagined) discussions of our new perspective on the data with the users of it, we discover some new things:
For a Musician who is singing, their voice is considered their instrument.
For the Musician who is conducting, their instrument is the baton, real or imagined. (A conductor may simply use their arms and hands, or perhaps another technique, to lead a group of musicians.)
Only Musicians performing with an Instrument will be part of a Performance.
Similarly, an Instrument will not be in the Performance without a Musician to play it.
Ahhh.
This tells us that the areas of [Appearance]
and [Inclusion]
outside [Performance]
are each the empty set ("∅"):
Since we don’t need [Appearance]
and [Inclusion]
, our final model therefore becomes:
Concert{ Concert#, ConcertName, … } Instrument{ Instrument#, InstrumentName, … } Musician{ Musician#, MusicianName, … } Skill{ Skill#, Instrument#, Musician#, … } Performance{ Performance#, Concert#, Skill#, … }
This is obviously very different to what the “textbook treatment” concluded.
And it, perhaps, shouldn’t be surprising that the [Instrument]
-[Musician]
relationship is pulled out.
The mastery of an instrument by a musician will (typically) start long before a performance opportunity is conceived and invitations to musicians are made.
I like the way Normalization uses the definitions to seemingly anticipate aspects, and uses, of the data that we might not yet see.
2NF and Those Ellipses
As noted above, the first 4 normalization rules consider the relationship of non-Candidate Key attributes to the Candidate Key(s). Those non-Candidate Key attributes have simply been represented here by ellipses ("…").
But an interesting question to ask is: what non-Candidate Key attributes would exist to yield the example above that the “textbook treatment"s assert is also 5NF?
These attributes would, via Normalization Rules, justify the creation of [Concert]
, [Instrument]
, [Musician]
, and their intersection as [Performance]
.
But there would be no attributes triggering the creation of:
[Appearance]
(i.e., intersection of[Concert]
and[Musician]
),[Inclusion]
(i.e., intersection of[Concert]
and[Instrument]
), or[Skill]
( i.e., intersection of[Instrument]
and[Musician]
)
during application of 1NF, 2NF, 3NF, and 4NF.
In real world models, something like that is highly unlikely. In my experience, there have always been non-Candidate Key attributes that, via pre-5NF normalizations, refactor the model and leave no 5NF violation to be resolved.
Looking at our example a bit further, for instance, it’s easy to conceive of a SkillRating
in [Skill]
.
This could be used to help decide which [Musician]
instance(s), with corresponding [Instrument]
instance(s), will be invited to be part of a [Concert]
instance.
Let’s assume that is an attribute “hidden” in the ellipses, and extract it out to make it explicit.
Start with the un-normalized tuple (still including those pesky ellipses, and assuming no [Appearance]
- or [Inclusion]
-triggering attributes hidden therein), which I refer to as Zero Normal Form (0NF):
{ ConcertName, InstrumentName, MusicianName, SkillRating, … }
and 2NF, even based on just this, iterates to:
Concert{ Concert#, ConcertName, … } Instrument{ Instrument#, InstrumentName, … } Musician{ Musician#, MusicianName, … } Skill{ Skill#, Instrument#, Musician#, SkillRating, … } Performance{ Performance#, Concert#, Skill#, … }
This should look familiar: it’s our final model in the 5NF discussion above!
The explicit addition of that one non-Candidate Key attribute, SkillRating
, moved us directly to that solution!
And, consequentially, the original 4NF/5NF context is no longer possible.
Conversely, it’s also interesting to note that even without SkillRating
as an explicit attribute, our earlier 5NF result derived a home for it in the future via our clarification of the requirements and refinement of definitions.
What happens if we remove the ellipses? Starting with a 0NF of:
{ ConcertName, InstrumentName, MusicianName, SkillRating }
and applying 2NF gets us to:
Skill{ Skill#, InstrumentName, MusicianName, SkillRating } Performance{ Performance#, ConcertName, Skill# }
We now know there are no non-Candidate Key attributes to justify separate [Concert]
, [Instrument]
, and [Musician]
components, so we don’t … indeed, can’t create them as a Normalization transform.
Of course, looking at this data model:
we might “intuitively” sense an issue: there is no obvious grouping of [Performance]
instances into a [Concert]
instance.
But the formalism of Normalization can’t see that without a [Concert]
-specific attribute to justify its existence.
Until then, the grouping is simply a query against [Performance]
for all the Skill#
instances with a given ConcertName
value:
SELECT InstrumentName,
MusicianName
FROM Performance
JOIN Skill USING Skill#
WHERE ConcertName = '...' ;
Another Example with a (Likely) Different Outcome
Another example I have seen used in similar fashion is:
But just because this looks the same doesn’t mean it is. Why? Because the data definitions here are very different.
In the earlier example, based on our basic knowledge of “concert”, “instrument”, and “musician” things, “reasonable” definitions of those things leads to the “skill” connection between “instrument” and “musician” things, and the “concert” and “performance” things connecting into that.
But the relationships between [Subject]
, [Teacher]
, and [Textbook]
are less obvious.
Examples of just a few of the many questions I need to ask, to better understand the definitions of the entities in this case, are:
Are the Textbooks for a Subject prescribed as part of the Subject’s curriculum (exploring
[Subject]
-[Textbook]
) , selected by the Teacher (exploring[Teacher]
-[Textbook]
), or a combination?If Textbooks are prescribed by the curriculum, is it an exhaustive set (i.e., “All these Textbooks are used.” so
[Subject]
-[Textbook]
is exhaustive, and no[Class]
connection is required)? Or is it a list from which a given Teacher might select one or more?Does a Teacher need to be qualified in a Subject (exploring
[Subject]
-[Teacher]
) in some sense to be assigned to a Class for that Subject?
I can imagine the answers to these questions being very different in primary versus secondary education. Primary Teachers tend to be more generalist, teaching most of the Subjects for a Class. In contrast, secondary education tends to have Teachers being more specialised in the Subjects they teach. And that becomes even more so in tertiary education.
For example, discovering that:
Teachers require some level of qualification in a Subject to be assigned a Class for that Subject, and
a Class will use exactly one of the Textbooks listed for that Subject, and the selected Textbook may be different from one Class to the next.
potentially gets me to a very different solution than the earlier example:
Subject{ Subject#, SubjectName, … } Teacher{ Teacher#, TeacherName, … } Textbook{ Textbook#, TextbookName, … } Qualification{ Qualification#, Subject#, Teacher#, … } SubjectText{ SubjectText#, Subject#, TextBook#, … } Class{ Class#, Qualification#, SubjectText#, … }
If, instead, a Class will use all the Textbooks listed for a Subject, then we don’t need the [Class]
-[Subject Text]
link:
Subject{ Subject#, SubjectName, … } Teacher{ Teacher#, TeacherName, … } Textbook{ Textbook#, TextbookName, … } Qualification{ Qualification#, Subject#, Teacher#, … } SubjectText{ SubjectText#, Subject#, TextBook#, … } Class{ Class#, Qualification#, … }
I encourage the reader to explore other variations arising from different definitions.
But, as for the previous example, it’s impossible to actually know how, or even whether, this works as a 4NF/5NF issue without a lot more detail about the definitions of the terms.
In Conclusion
The traditional “textbook treatment” of 5NF has been to:
present a simple, 3-way relationship diagram with an assumption that it is in 4NF,
assume the only alternative is a diagram containing 3 binary relationships, and
correctly conclude that the latter does not allow derivation of the former.
However, that also leads to the incorrect assumption that the original model is also in 5NF.
Treating this as something that can be dealt with as diagrams is not good enough. Normalization relies on definitions of the pieces involved and the relationships between them that those definitions identify.
The “textbook treatment"s, relying simply on data model diagrams, do not communicate this information in the depth required. They don’t define the components to allow us to understand the author’s assertion of 4NF and 5NF in each case. This is particularly a problem because readers are unlikely to assume the same definitions as the author when trying to understand the 4NF and 5NF assertions.
I have delved into two examples that have received that same textbook treatment. By digging deeper into (what I feel are) reasonable definitions of the components, I have shown that each leads to an outcome that the “textbook treatment"s missed. And each of the examples yielded a different outcome reflecting the different definitions and, therefore, different relationships between components.
In the first example, I also looked at how non-Candidate Key data items can trigger pre-5NF transformations that greatly reduce the likelihood of needing to apply 5NF. I showed how adding one data item that seems reasonable for the problem resulted in a 2NF transformation that yielded the same solution that 5NF produced without that data item.
A data model diagram can offer visual clues of Normalization violations (about which I intend to write more in the future). But Normalization itself requires a “deeper dive” into the definitions of the components to determine what - if anything - needs to be done about that.
Acknowledgement
My thanks to Anthony for reviewing earlier drafts.