Debunking the "conventional wisdom" of data modelling, normalization, etc.

Data Modelling

5NF: The Missing Use Case

0NF, 2NF, 5NF, Candidate Key, Data Definition, ERD, Normalization

18 Aug 2023: Discussion around this post at Minimal Modeling may add to understanding of it.

When I found myself heading down the data analysis, modelling, and administration path early in my career (late 1970s), I would spend time in technical bookstores looking for texts to help me on my way. I had 4 things to check in deciding if a book was worth my while. By the end of the 1980s, I had given up on finding an useful book.

This article covers the second-highest item on my list. Subsequent posts will cover the 3rd and 4th items. The 4th, and final, post will address the most important item on my list, something which I’m astonished has not received a lot more attention. Or even any attention at all, from my experience to this day.

But I start with the poor treatment of Fifth Normal Form (5NF).

What Is 5NF?

Simplistically, the first four normalization forms consider the relationship of non-key attributes to a Candidate Key (i.e., an unique identifier for a tuple). There may, of course, be more than one Candidate Key for a tuple.

5NF considers the relationship between elements within a compound Candidate Key. In the example that we use below, we start with the following tuple, with each of its Candidate Keys underlined:

Performance{ Performance#, Concert#, Instrument#, Musician#, … }

The question, then, is whether the presence of the 3-part key of Concert#, Instrument#, Musician# as a Candidate Key suggest a data redundancy issue for the tuple as a whole.

Traditional Treatment of 5NF

Early on, one author’s text seemed to become the “go to” standard. And, I admit, I bought a couple of editions over the years, and found information within useful to my early learning.

But one thing with which I wasn’t comfortable was the presentation of 5NF. Unfortunately, other texts seemed to simply adopt a similar treatment. They’d present something like the following Entity-Relation Diagram (ERD), to which I add a textual representation since that’s the detail we use for Normalization, as being 4NF and needing to be assessed for whether it was 5NF:

Concert{ Concert#, ConcertName, … }   Instrument{ Instrument#, InstrumentName, … }   Musician{ Musician#, MusicianName, … }   Performance{ Performance#, Concert#, Instrument#, Musician#, … }

PerformanceConcertInstrumentMusician
Concert-Instrument-Musician: Assumed-4NF

Something like this would simply be presented as the only possible transformation to take it to 5NF:

Concert{ Concert#, ConcertName, … }   Instrument{ Instrument#, InstrumentName, … }   Musician{ Musician#, MusicianName, … }   Appearance{ Appearance#, Concert#, Musician#, … }   Inclusion{ Inclusion#, Concert#, Instrument#, … }   Skill{ Skill#, Instrument#, Musician#, … }

ConcertInstrumentMusicianSkillAppearanceInclusion
Concert-Instrument-Musician: 5NF Non-Resolution

It would then be argued - correctly - that this is not a non-loss decomposition of the original 4NF. In other words, it is not possible to formulate a query to derive the information that exists in the central [Performance] entity in the previous diagram.

And how could it be, given that we have removed the context of [Performance] as a place for that information to come together? There is nothing about the independent, binary relationships of [Appearance], [Inclusion], and [Skill] that can give context to the intersection of [Concert], [Instrument], and [Musician].

In the “traditional” treatment, that would be the end of it, as if there were no other transformational options. The original model would therefore be declared to also be 5NF.

But there are other transformational options to consider.

A big part of the problem here is that the “standard” presentation clearly assumes some definitions for the model components. But those assumed definitions are not provided. Let alone discussed as part of helping us understand the application of 5NF. And yet it is very much the definition of the data that determines how Normalization rules are applied in transforming from one form to the next.

My problem was that I couldn’t conceive what those definitions, supporting the conclusion, might be. In this article I will proceed with data definitions that I can imagine for the modelled components. With my definitions, the original model is not in 5NF. And we will use those definitions to find a solution not considered in the “standard” treatment of 5NF.

Venn Diagram-Based Review

Let’s take advantage of the Relational Model’s mathematical foundation to understand what is happening via Venn Diagrams.

The first, assumed 4NF diagram could be drawn as the following, with [Performance] being the exact overlap of the [Concert], [Instrument], and [Musician] relations:

ConcertMusicianInstrumentPerformance
Concert-Instrument-Musician: Assumed-4NF Venn Diagram

The considered, and correctly rejected, possible transform as a Venn Diagram - with a “hole” in the middle representing the loss of the [Performance] context - is:

InclusionConcertSkillMusicianInstrumentAppearance
Concert-Instrument-Musician: 5NF Non-Resolution Venn Diagram

Properly, and fully, overlapping the [Concert], [Instrument], and [Musician] sets in a Venn Diagram, covering all combinations, is a basic technique. And in doing that here, we show that it covers all aspects of our data model. And points us towards the correct transformation of our original model:

InclusionSkillAppearancePerformanceMusicianConcertInstrument
Concert-Instrument-Musician: Complete Venn Diagram

With the Venn Diagram’s area boundaries pointing us to the appropriate relationships, translating it back to a data model is straightforward:

Concert{ Concert#, ConcertName, … }   Instrument{ Instrument#, <InstrumentName, … }   Musician{ Musician#, MusicianName, … }   Appearance{ Appearance#, Concert#, Musician#, … }   Inclusion{ Inclusion#, Concert#, Instrument#, … }   Skill{ Skill#, Instrument#, Musician#, … }   Performance{ Performance#, Appearance#, Inclusion#, Skill#, … }

PerformanceConcertInstrumentMusicianSkillAppearanceInclusion
Concert-Instrument-Musician: Complete Data Model

Not Done Yet …

As mentioned above, this has pointed us toward the appropriate transformation. But we can’t yet assume it is correct. Our new conception of the model now needs to be re-tested against the user requirements.

In (imagined) discussions of our new perspective on the data with the users of it, we discover some new things:

Ahhh. This tells us that the areas of [Appearance] and [Inclusion] outside [Performance] are each the empty set ("∅"):

SkillPerformanceMusicianConcertInstrument
Concert-Instrument-Musician: Solution Venn Diagram

Since we don’t need [Appearance] and [Inclusion], our final model therefore becomes:

Concert{ Concert#, ConcertName, … }   Instrument{ Instrument#, InstrumentName, … }   Musician{ Musician#, MusicianName, … }   Skill{ Skill#, Instrument#, Musician#, … }   Performance{ Performance#, Concert#, Skill#, … }

PerformanceConcertInstrumentMusicianSkill
Concert-Instrument-Musician: 5NF Resolution

This is obviously very different to what the “textbook treatment” concluded.

And it, perhaps, shouldn’t be surprising that the [Instrument]-[Musician] relationship is pulled out. The mastery of an instrument by a musician will (typically) start long before a performance opportunity is conceived and invitations to musicians are made. I like the way Normalization uses the definitions to seemingly anticipate aspects, and uses, of the data that we might not yet see.

2NF and Those Ellipses

As noted above, the first 4 normalization rules consider the relationship of non-Candidate Key attributes to the Candidate Key(s). Those non-Candidate Key attributes have simply been represented here by ellipses ("…").

But an interesting question to ask is: what non-Candidate Key attributes would exist to yield the example above that the “textbook treatment"s assert is also 5NF?

These attributes would, via Normalization Rules, justify the creation of [Concert], [Instrument], [Musician], and their intersection as [Performance].

But there would be no attributes triggering the creation of:

during application of 1NF, 2NF, 3NF, and 4NF.

In real world models, something like that is highly unlikely. In my experience, there have always been non-Candidate Key attributes that, via pre-5NF normalizations, refactor the model and leave no 5NF violation to be resolved.

Looking at our example a bit further, for instance, it’s easy to conceive of a SkillRating in [Skill]. This could be used to help decide which [Musician] instance(s), with corresponding [Instrument] instance(s), will be invited to be part of a [Concert] instance. Let’s assume that is an attribute “hidden” in the ellipses, and extract it out to make it explicit. Start with the un-normalized tuple (still including those pesky ellipses, and assuming no [Appearance]- or [Inclusion]-triggering attributes hidden therein), which I refer to as Zero Normal Form (0NF):

{ ConcertName, InstrumentName, MusicianName, SkillRating, … }

and 2NF, even based on just this, iterates to:

Concert{ Concert#, ConcertName, … }   Instrument{ Instrument#, InstrumentName, … }   Musician{ Musician#, MusicianName, … }   Skill{ Skill#, Instrument#, Musician#, SkillRating, … }   Performance{ Performance#, Concert#, Skill#, … }

This should look familiar: it’s our final model in the 5NF discussion above! The explicit addition of that one non-Candidate Key attribute, SkillRating, moved us directly to that solution! And, consequentially, the original 4NF/5NF context is no longer possible.

Conversely, it’s also interesting to note that even without SkillRating as an explicit attribute, our earlier 5NF result derived a home for it in the future via our clarification of the requirements and refinement of definitions.

What happens if we remove the ellipses? Starting with a 0NF of:

{ ConcertName, InstrumentName, MusicianName, SkillRating }

and applying 2NF gets us to:

Skill{ Skill#, InstrumentName, MusicianName, SkillRating }   Performance{ Performance#, ConcertName, Skill# }

We now know there are no non-Candidate Key attributes to justify separate [Concert], [Instrument], and [Musician] components, so we don’t … indeed, can’t create them as a Normalization transform. Of course, looking at this data model:

PerformanceSkill
Concert-Instrument-Musician: Skill-Performance Reduction

we might “intuitively” sense an issue: there is no obvious grouping of [Performance] instances into a [Concert] instance. But the formalism of Normalization can’t see that without a [Concert]-specific attribute to justify its existence. Until then, the grouping is simply a query against [Performance] for all the Skill# instances with a given ConcertName value:

SELECT InstrumentName,
       MusicianName
  FROM Performance
  JOIN Skill USING Skill#
 WHERE ConcertName = '...' ;

Another Example with a (Likely) Different Outcome

Another example I have seen used in similar fashion is:

ClassSubjectTextbookTeacher
Subject-Teacher-Textbook: Assumed 4NF

But just because this looks the same doesn’t mean it is. Why? Because the data definitions here are very different.

In the earlier example, based on our basic knowledge of “concert”, “instrument”, and “musician” things, “reasonable” definitions of those things leads to the “skill” connection between “instrument” and “musician” things, and the “concert” and “performance” things connecting into that.

But the relationships between [Subject], [Teacher], and [Textbook] are less obvious. Examples of just a few of the many questions I need to ask, to better understand the definitions of the entities in this case, are:

I can imagine the answers to these questions being very different in primary versus secondary education. Primary Teachers tend to be more generalist, teaching most of the Subjects for a Class. In contrast, secondary education tends to have Teachers being more specialised in the Subjects they teach. And that becomes even more so in tertiary education.

For example, discovering that:

potentially gets me to a very different solution than the earlier example:

Subject{ Subject#, SubjectName, … }   Teacher{ Teacher#, TeacherName, … }   Textbook{ Textbook#, TextbookName, … }   Qualification{ Qualification#, Subject#, Teacher#, … }   SubjectText{ SubjectText#, Subject#, TextBook#, … }   Class{ Class#, Qualification#, SubjectText#, … }

ClassSubjectTextbookTeacherQualificationSubject Text
Subject-Teacher-Textbook: Potential Solution 1

If, instead, a Class will use all the Textbooks listed for a Subject, then we don’t need the [Class]-[Subject Text] link:

Subject{ Subject#, SubjectName, … }   Teacher{ Teacher#, TeacherName, … }   Textbook{ Textbook#, TextbookName, … }   Qualification{ Qualification#, Subject#, Teacher#, … }   SubjectText{ SubjectText#, Subject#, TextBook#, … }   Class{ Class#, Qualification#, … }

ClassSubjectTextbookTeacherQualificationSubject Text
Subject-Teacher-Textbook: Potential Solution 2

I encourage the reader to explore other variations arising from different definitions.

But, as for the previous example, it’s impossible to actually know how, or even whether, this works as a 4NF/5NF issue without a lot more detail about the definitions of the terms.

In Conclusion

The traditional “textbook treatment” of 5NF has been to:

However, that also leads to the incorrect assumption that the original model is also in 5NF.

Treating this as something that can be dealt with as diagrams is not good enough. Normalization relies on definitions of the pieces involved and the relationships between them that those definitions identify.

The “textbook treatment"s, relying simply on data model diagrams, do not communicate this information in the depth required. They don’t define the components to allow us to understand the author’s assertion of 4NF and 5NF in each case. This is particularly a problem because readers are unlikely to assume the same definitions as the author when trying to understand the 4NF and 5NF assertions.

I have delved into two examples that have received that same textbook treatment. By digging deeper into (what I feel are) reasonable definitions of the components, I have shown that each leads to an outcome that the “textbook treatment"s missed. And each of the examples yielded a different outcome reflecting the different definitions and, therefore, different relationships between components.

In the first example, I also looked at how non-Candidate Key data items can trigger pre-5NF transformations that greatly reduce the likelihood of needing to apply 5NF. I showed how adding one data item that seems reasonable for the problem resulted in a 2NF transformation that yielded the same solution that 5NF produced without that data item.

A data model diagram can offer visual clues of Normalization violations (about which I intend to write more in the future). But Normalization itself requires a “deeper dive” into the definitions of the components to determine what - if anything - needs to be done about that.

Acknowledgement

My thanks to Anthony for reviewing earlier drafts.