Data Demythed

Debunking the "conventional wisdom" of data modelling, normalization, etc.

Data Modelling

Normalize for Better Performance

3NF, 4NF, Normalization, Performance

A formative experience early in my career was addressing data issues in a system that was not meeting expectations. There were obvious, simple fixes, like tuning the underlying data management software.

But more important, to me, was normalizing the data model. We did this incrementally, working with the programming team, over a couple of years.

The result was a dramatic improvement in functionality (expected) and performance (a bit of a surprise). With one round of changes, one screen's response time dropped from a painful ~30 seconds to ~7 seconds.

I took some time to analyse why this happened, and concluded that normalization of the data was the reason.

Consider the following tuple:

{ Emp#, EmpName, EmpAddress, EmpSalary }

EmpName, EmpAddress, and EmpSalary are each defined as the current value for that employee. Each therefore has a direct dependency on the key, Emp#, which means the tuple is in 3NF.

But, by any reasonable definition, each has a different functional dependency on Emp#, so the following decomposition takes it to 4NF:

{ Emp#, EmpName }
{ Emp#, EmpAddress }
{ Emp#, EmpSalary }
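
The decomposition can be sketched in a few lines of Python (the employee data is purely illustrative, and dictionaries stand in for tables keyed on Emp#):

```python
# Wide 3NF rows: one record holds every fact about an employee.
emp_3nf = [
    {"emp_no": 1, "name": "Ada Lovelace", "address": "12 St James's Sq, London", "salary": 95000},
    {"emp_no": 2, "name": "Alan Turing", "address": "Bletchley Park, Milton Keynes", "salary": 87000},
]

# 4NF-style decomposition: one narrow table per dependency on Emp#.
emp_name = {r["emp_no"]: r["name"] for r in emp_3nf}
emp_address = {r["emp_no"]: r["address"] for r in emp_3nf}
emp_salary = {r["emp_no"]: r["salary"] for r in emp_3nf}

# Joining the narrow tables back on Emp# reconstructs the original
# rows losslessly, so no information is given up by the split.
rejoined = [
    {"emp_no": k, "name": emp_name[k], "address": emp_address[k], "salary": emp_salary[k]}
    for k in sorted(emp_salary)
]
assert rejoined == emp_3nf
```

The point of the sketch is that the split costs nothing semantically: a join on the key recovers the wide row whenever it is actually needed.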

Why do we get better performance with 4NF than 3NF? Let’s take a simple example: there is probably a lot of analysis and reporting around salaries independently of names and addresses. And yet names and addresses require much more storage space than salaries.

Lower Storage and Cache Utilisation

With 4NF, analysing salaries does not waste time reading records with space-consuming names and addresses to also get the smaller salary fields. Depending on how it’s stored, salaries under, say, $10M could use just 5 bytes.

A name could easily be 4 times that, and an address would probably be closer to 10 times that at a minimum.

So (as a “back of the envelope” calculation) with 3 bytes for Emp# (allowing values in the millions), we have a choice of reading records of 8 bytes (3+5) versus 78 bytes (3+20+50+5). In other words, under 4NF I can read 9-10 salary records in the time I can read only one in the 3NF representation.
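
The arithmetic is easy to check (all byte counts here are the back-of-envelope assumptions above, not measurements from any particular product):

```python
# Assumed field sizes, in bytes, from the back-of-envelope estimate.
EMP_NO, SALARY, NAME, ADDRESS = 3, 5, 20, 50

row_4nf = EMP_NO + SALARY                    # narrow salary record
row_3nf = EMP_NO + NAME + ADDRESS + SALARY   # wide employee record

print(row_4nf, row_3nf, row_3nf / row_4nf)  # 8 78 9.75
```

The ratio of 9.75 is where the "9-10 salary records per wide record" figure comes from.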

While the time savings reading from today's SSDs are smaller than with storage based on rotating disks and moving read/write heads, that is still a dramatic difference in resource usage on the I/O path.

Similarly, in-memory buffers can cache roughly 10x as many salary records, so the working set of a salary analysis places dramatically lower demands on the cache, freeing space for other queries.
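
As a rough illustration, assume an 8 KiB buffer page (a common default, but an assumption here) and the record sizes from the back-of-envelope estimate above:

```python
PAGE = 8 * 1024  # assumed 8 KiB buffer page

# Records that fit in one cached page under each layout
# (ignoring per-row and per-page overheads for simplicity).
salary_rows_per_page = PAGE // 8    # narrow 4NF salary records
wide_rows_per_page = PAGE // 78     # wide 3NF employee records

print(salary_rows_per_page, wide_rows_per_page)  # 1024 105
```

Every cached page of the narrow table carries roughly ten times as many salary values, so the same analysis touches far fewer pages.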

Support More Indexes

It is easy to see a reason to index each field: having more indexes available gives the query optimiser a better chance of avoiding a full-table scan by using a more selective, index-based access path instead.
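
As a toy illustration of why an index can beat a full scan (the data is hypothetical, and a real optimiser also weighs selectivity and maintenance cost):

```python
# (emp_no, salary) rows in a small, hypothetical table.
salaries = [(1, 95000), (2, 87000), (3, 95000)]

# Full-table scan: touch every row to find a salary.
scan_hits = [e for e, s in salaries if s == 95000]

# An index is, in essence, a lookup structure over one field:
# here, salary -> list of matching emp_nos.
salary_index = {}
for e, s in salaries:
    salary_index.setdefault(s, []).append(e)

# Index access: jump straight to the matching rows.
index_hits = salary_index.get(95000, [])

assert scan_hits == index_hits == [1, 3]
```

Both paths return the same rows; the index simply avoids reading the rows that cannot match.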

But indexes on a table generally impact update performance: the more indexes, the slower the updating.

The 4NF version supports all these use cases with only 2 indexes per table (including a PK-enforcing one over Emp#).

The 3NF version would load up one table with 4 indexes.

Further, we’d be increasingly reluctant to add additional indexes to the 3NF table, whereas we could still consider adding an extra index or two on each 4NF table if that would improve the performance of a query.

More Efficient Updating

We split the 3NF tuple because each of EmpName, EmpAddress, and EmpSalary has a different functional relationship to Emp#.

Another way of stating this is: an update to EmpSalary is unlikely to also update EmpName or EmpAddress.

Updating EmpSalary in the 3NF form is a very expensive operation. In our example above, we have to manage a record of 78 bytes in the rollback and/or redo logs (depending on the implementation specifics) when only 5 bytes have changed. Even an implementation that can handle it as an isolated “delta” change against the “before” or “after” state of the row has added complexity - that is, performance overhead - when serving results.
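
A rough, hypothetical comparison of log volume for a batch of salary updates makes the difference concrete (this assumes whole-row before and after images; actual logging formats are implementation-specific):

```python
UPDATES = 10_000                  # hypothetical batch of salary changes
ROW_3NF, ROW_4NF = 78, 8          # record sizes from the running example
IMAGES = 2                        # assumed before + after image per update

log_3nf = UPDATES * ROW_3NF * IMAGES   # bytes logged against the wide table
log_4nf = UPDATES * ROW_4NF * IMAGES   # bytes logged against the narrow table

print(log_3nf, log_4nf)  # 1560000 160000
```

Under these assumptions the wide layout logs nearly ten times the bytes for exactly the same 5-byte change per row.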

In the 4NF version, the rollback/redo logs are only dealing with 8-byte rows. That will place much lower demand on resources that can significantly affect throughput: