Data Demythed

Debunking the "conventional wisdom" of data modelling, normalization, etc.

System Architecture

3-Tier 3/3: Implications of Changing the Application-Data Relationship

4NF, Auditability, Burroughs DMSII, Data Confidentiality, Data Integrity, Data Security, Oracle, Normalization, Programming, PostgreSQL, Schema Updates, SQL Standards

The previous post proposed a solution to the security issue detailed in the post before it.

This post looks at a number of implications of this change, in no particular order:

Can It Really Be Done??

In a word: yes. Work with which I was involved used this approach for a good decade before I retired.

At the same time, there is the “devil in the details”, as they say. The main one is changing the way systems are coded, which I touch on further below.

I have given presentations about some aspects of this at conferences and to other organisations. These tended to be about implementing the strategy within a particular vendor’s database (Oracle) that I was using at the time. Perhaps the paper and presentation from one of those occasions provide further interesting insights. (As an aside, I discovered some years into it that my development team was calling this framework “Barry-cle” as a riff on the vendor’s name.)

It was also nice to attend a conference subsequent to my presentation linked above and have someone tell me they had tried it at their site. And liked the results!

Out of curiosity, I’ve looked at adapting it to PostgreSQL. PostgreSQL has features that probably make the implementation details simpler. I had implemented Oracle PL/SQL functionality for things that, I think, can become configuration settings in PostgreSQL.

ANSI SQL Security Model Flaw

One of the things this proposal highlights is a flaw in the way that privileges are assigned in Standard SQL: both the confidentiality (i.e., controlling SELECTs) and the integrity (i.e., controlling DELETEs, INSERTs, and UPDATEs) privileges are GRANTed to users/ROLEs.

We never want to trust people(/users) with individual DELETE, INSERT, and UPDATE privileges!
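
To make the contrast concrete, here is a minimal sketch in PostgreSQL-flavoured SQL (the orders table, order_clerk role, and place_order routine are all hypothetical names for illustration):

    -- Standard SQL model: the role itself holds the integrity privileges and
    -- can change the table in any way it likes.
    GRANT INSERT, UPDATE, DELETE ON orders TO order_clerk;

    -- Proposed model: the role holds no DML privileges at all; it may only
    -- execute a vetted routine that performs the change.
    GRANT EXECUTE ON FUNCTION place_order(integer, integer) TO order_clerk;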

Interestingly, Burroughs DMSII had a mechanism to limit selected integrity privileges by table (or “data set” in its parlance) to specific programs/routines as far back as the 1980s!

Database Organization

Management of database objects is key to making it work, so I offer some details about how I approached it.

The first thing to understand is that application code and its promotion process will impact the database. Any strategy that assumes developers stay out of the database needs to be rethought. The developers need to be given enough access and privilege to be able to develop and test the database-resident code while keeping them away from the data and its schema.

The simple, and obvious, way to do that is to split the data and the code into separate security domains (i.e., a different schema/security role for each). And then the privileges between them need to be well managed. A picture hopefully makes this clearer:

[Diagram: a Data Schema and a Code Schema sit inside the database security perimeter. Exposed across the perimeter are SELECT on simple VIEWs (Data Schema), plus SELECT on application VIEWs and EXECUTE on stored procedures (Code Schema). Between the two schemas, privileges are GRANTed as needed, by TABLE/VIEW: DELETE, INSERT (column-constrained), SELECT, and UPDATE (column-constrained).]
3-Tier: Separate Data and Code Schemas within the Database
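
As a rough illustration of that picture, a minimal PostgreSQL-flavoured sketch might look like this (the schema, role, table, and column names are hypothetical, and the exact statements will differ by vendor):

    -- Two security domains: one owns the data, the other owns the code.
    CREATE ROLE data_owner NOLOGIN;
    CREATE ROLE code_owner NOLOGIN;
    CREATE ROLE application_role NOLOGIN;
    CREATE SCHEMA data_schema AUTHORIZATION data_owner;
    CREATE SCHEMA code_schema AUTHORIZATION code_owner;

    -- The code schema is GRANTed only what its routines need, by TABLE/VIEW,
    -- with INSERT and UPDATE column-constrained.
    GRANT USAGE ON SCHEMA data_schema TO code_owner;
    GRANT SELECT, DELETE ON data_schema.orders TO code_owner;
    GRANT INSERT (customer_id, part_id, quantity) ON data_schema.orders TO code_owner;
    GRANT UPDATE (quantity) ON data_schema.orders TO code_owner;

    -- Applications outside the perimeter see only the code schema's views
    -- and routines, never the base tables.
    GRANT USAGE ON SCHEMA code_schema TO application_role;
    GRANT EXECUTE ON ALL FUNCTIONS IN SCHEMA code_schema TO application_role;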

And some comments to help further:

The details will vary according to the database chosen. Things on which the framework relies may have different vendor implementations because they aren’t part of the SQL Standard or, perhaps, were implemented by a vendor before the Standard specified something different. For example:

“Schema”, “user”, and “role” are pretty fundamental concepts when setting up a framework under this proposal. The differences in those concepts between Oracle and PostgreSQL, for example, will mean architectural differences in the way each might be used to implement this strategy. When I implemented this in Oracle, for instance, I relied on AUTHID DEFINER PL/SQL procedures to manage some privileged actions that facilitated programmers’ needs. (It could be compared to providing Unix/Linux “sudo”-like access to otherwise risky functionality.) From my understanding of PostgreSQL, some of what that Oracle code did could instead be managed via ROLEs and the privileges they are GRANTed, amongst other things. And doing things by configuration is generally much safer than via coding.
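
For instance, a minimal PostgreSQL sketch of such a “sudo”-like helper might be a SECURITY DEFINER routine, which (like Oracle’s AUTHID DEFINER) runs with its owner’s privileges rather than the caller’s; the function, schema, and role names here are hypothetical:

    -- Owned by a privileged role; developers may only EXECUTE it.
    CREATE FUNCTION code_schema.refresh_dev_objects()
    RETURNS void
    LANGUAGE plpgsql
    SECURITY DEFINER
    SET search_path = code_schema, data_schema
    AS $$
    BEGIN
        -- Privileged housekeeping that developers are allowed to trigger,
        -- but not to perform directly, would go here.
        RAISE NOTICE 'housekeeping complete';
    END;
    $$;

    GRANT EXECUTE ON FUNCTION code_schema.refresh_dev_objects() TO developer_role;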

Programming Frameworks

This is the biggest hurdle to change. Consider Jakarta EE (formerly Java EE/J2EE): it includes a framework for managing transactions in the middle tier, which becomes integral to the way code is written.

With our change, though, the transaction is controlled from (in fact, wholly resident in) the backend. So that needs to be worked out. My perspective is that the “programmers” evolved the “middle tier approach” to handling transactions without talking to “me,” the person responsible for the data. And it comes down to what’s more important: code, or data? If a company’s managers are given the choice of losing code versus losing data, they should opt to keep the data every time. The data is the recordkeeping, reflecting the state of the business at any point in time. It is only by knowing the current state (e.g., how many parts are in stock) that a business can transact (e.g., supply parts to fill an order). Code can always be rebuilt atop the data, and the company will have a chance of surviving. But rebuilding the data will probably be impossible, and the company will go out of business.

So if there’s a large “attack surface” over the data and that can be removed, I contend it’s in the interests of the business to adapt. Especially in today’s context of interconnected systems and threats from anywhere on the globe.

But there’s another way to think about it that should be very appealing to programmers: the update routine in the database can be thought of as a “method” attached to its “object” (conceptually, the related rows in one or more tables). And modularity is all about the design pattern of attaching methods to the corresponding data.
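
A minimal sketch of such a “method”, in PL/pgSQL with hypothetical schema, table, and column names: one call carries the whole change, and the business rule lives with the data it guards.

    -- The only way the application records a shipment: the routine applies all
    -- of the related changes, or (via the exception) none of them.
    CREATE FUNCTION code_schema.ship_part(p_order_id integer,
                                          p_part_id  integer,
                                          p_quantity integer)
    RETURNS void
    LANGUAGE plpgsql
    AS $$
    BEGIN
        UPDATE data_schema.stock
        SET    on_hand = on_hand - p_quantity
        WHERE  part_id = p_part_id
        AND    on_hand >= p_quantity;

        IF NOT FOUND THEN
            RAISE EXCEPTION 'insufficient stock for part %', p_part_id;
        END IF;

        INSERT INTO data_schema.order_lines (order_id, part_id, quantity)
        VALUES (p_order_id, p_part_id, p_quantity);
    END;
    $$;

The application then invokes it in a single statement (e.g., SELECT code_schema.ship_part(42, 7, 3);) rather than issuing a stream of individual DML statements.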

(One day I’ll write about how normalization is for programmers too, refactoring is typically an application of Fourth Normal Form (4NF), and there’s a good reason Dr. Codd missed the connection of his ideas to the transient data of programs.)

See also the Performance and Reusability sections below.

Data Restructuring

Changing the database becomes more challenging as the amount of code against it grows. Especially important is to properly identify and remediate the code that maintains the data structures that are to change.

This assessment becomes much easier when we know all the update code is available in one central location: the database. It is particularly simple when the database catalogue exposes dependency tracking between objects, such as Oracle’s [All|DBA|User]_Dependencies catalogue views. (PostgreSQL’s pg_depend only seems to track procedure parameter dependencies, and not database object references within the procedure body.)
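
For example, in Oracle an impact assessment for a table about to change might start with a query along these lines (the ORDERS table name is hypothetical):

    -- Which stored code references the ORDERS table?
    SELECT owner, name, type
    FROM   all_dependencies
    WHERE  referenced_name = 'ORDERS'
    AND    referenced_type = 'TABLE';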

Alas, such tracking will not identify references in dynamically-constructed SQLs. Separate “pattern matching” scans of the source code itself will be needed to find and evaluate those cases. Which is, perhaps, also an argument against allowing dynamic SQL!?

Auditability

Asserting correct configuration of the database components of the architecture is possible via the catalogue tables, using queries such as the one sketched below.
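
One example of such a check, in PostgreSQL-flavoured SQL and assuming the hypothetical data_schema/data_owner/code_owner names used earlier: confirm that no other role has acquired DML privileges directly on the data schema’s tables.

    -- Any row returned here is a policy violation: only the code schema's role
    -- should hold INSERT/UPDATE/DELETE on the data schema's tables.
    SELECT grantee, table_name, privilege_type
    FROM   information_schema.role_table_grants
    WHERE  table_schema = 'data_schema'
    AND    privilege_type IN ('INSERT', 'UPDATE', 'DELETE')
    AND    grantee NOT IN ('data_owner', 'code_owner');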

Other catalogue checks will become apparent as the architecture evolves.

In my experience, this transparency around the vastly improved data integrity framework finds favour with Information Security folk and Auditors. Once they understand it!

Performance

A typical update comprises a number of SQL commands, along with signalling the START- and END-TRANSACTION boundaries. Executing these from the middle tier incurs a network round trip for each database statement, which adds a lot of overhead to the transaction, regardless of the speed of the network connection.

An update should be faster with only one round-trip over the network to the database (i.e., calling the update function). Multiply that by the number of update transactions occurring and, even in a modestly busy system, it could represent significant savings of both computing resources and network traffic.

Any locks taken by the in-flight transaction, which may block access to resources by others, will also be held for a shorter period. Which increases concurrency. Especially if any of the locked resources are frequently touched by update, and perhaps even query, transactions.

And all of that, of course, means a faster response to the user. Not to mention supporting more users on the same resources.

Reusability

By moving an update procedure further “back” in the network, it becomes more reusable across applications running in different contexts: web services, phone apps, batch processes, …

And code quality is generally better when it is used more often.

Acknowledgements

Thank you to Anthony and Simon for feedback on earlier drafts of this 3-part post.

Thank you also to Johnetta and her team who took on my ideas early in my thinking, and worked with me to figure it out in practice.