It’s easy to think of the relational database as the standard model in data storage, but like everything else, it had to be invented. Prior to relational databases, data was stored in a tree-like structure that would remind a modern analyst of a JSON- or XML-based NoSQL datastore. Every piece of data was part of a key:value pair that represented the data’s place in a hierarchy. Nested indices would have been challenging to maintain and work with. Queries had to be explicit about how data was retrieved.
Edgar F. Codd, an engineer at IBM, changed everything in 1970. He proposed a data model that turned these real-world-based data structures into a logical abstract representation. By separating the data from the object of study, he created a data model so easy to work with it would become the de facto standard up until today. For this revelation he was rewarded in 1981 with the A.M Turing award, the most coveted prize in computer science.
How it works
The relational model places data into a group of tuples, which we can think of as rows. These are grouped into relations, which we would call tables. Each relation has a header tuple. Take a relation, ‘cities’.
('City', 'State', 'Area', 'Population') ('San Diego', 'California', '1000000', 2000000) ('Denver', 'Colorado', '100000', 4000000)
Multiple relations form a relational database. These relations are accessed via a declarative language, where work is done by defining what you want rather than how you want to get it.
Each of these rows represent a “predicate”, which is a mathematical way of saying If this combination of values exist, create a row for it. This creates a logical model of all possible combinations. In a prior model, if we wanted to add:
('San Diego', 'Texas', '10000', '130000')
to the relation above we would have used San Diego as a key to point to both ‘Texas’ and ‘California,’ and then had a painful structure to traverse. Now we simply say that this new combination of predicate variables (values) exist, thus the predicate is true and a new row will be created.
This sounds a lot like SQL as we know it, but because the relational model is based on formal logic and relational algebra, there are constraints on implementation. SQL violates many of the rules necessary to maintain a true relational structure.
So what’s different?
There are several key differences bewteen a SQL database and Codd’s vision, but here are some of the big ones.
No Duplicate Rows
A true relational database can’t have duplicate rows. If a tuple has a certain set of values, that predicate has already evaluated to true and cannot be added an additional time. This means, for instance, if you have a relation of every car owned in the United States with the columns ‘Make’ and ‘Model’, you better also have a unique ID column to stop duplication.
This formality was quickly discarded in SQL, which ignores the fundamental concept of a tuple in a true relation.
Column Order Significance
In a true relation, the order of the columns is mathematically irrelevant. Allowing column ordering is a big change from the model. However, in SQL column order can matter, most significantly with the union clause which joins two tables together vertically and will not function on columns of different data types.
In a true relational database, every value of every tuple in a relation must have a value. This is because in true predicate logic, every statement must evaluate to True or False. This means that something which is not True is de facto False. The null value makes such comparisons impossible. For instance, a comparison of these two tuples evaluates to Unknown, or null because null cannot be evaluated against itself.
('San Diego', 'Texas', '10000', Null) ('San Diego', 'Texas', '10000', Null)
Along with the difficulty of having a data point for every value in a table, there are some pretty big challenges Null helps us avoid. For instance, imagine a world where left join does not work because it introduces null values into the result set!
E. F. Codd eventually came around on null and suggested three- and four-prong logical operations, in a break from his often rigid adherence to first principles.
The future we live in
E. F. Codd would never become rich from his incredible innovation. In fact, IBM resisted the relational model because they feared it would cut into the profits of other successful database systems they had developed through the ’60s and early ’70s. They would get into the market in 1978. Early, but not before Larry Ellison was able to use Codd’s work to found Oracle, leveraging SQL to build an amazing and user-friendly database. Oracle would go on to become the most famous company in database technology, but the technology Ellison employed was not truly relational, nor would IBM’s be in 1978—a fact that would frustrate Codd for the rest of his career.
It’s clear why a “pseudo-relational” database won the war, with changes that turned out to be so necessary and so useful. It was a disagreement of pragmatism versus integrity that pragmatism won, as it so often does in real-world scenarios. Nonetheless, it’s interesting to see where SQL came from, and good to know that somewhere inside it is a piece of groundbreaking work that made analytics as a career possible.
If you’d like to see a true relational database in action, there aren’t many options, but you can look at the primarily academic RelDB. If you’d like to peer back 50 years into computer science, you can read E.F Codd’s seminal paper here.