Beyond Relational Databases: Is Your Data Relational?

One of the strangest things about technology is how it moves in circles.  The relational database isn’t new technology, and while many changes to the storage model and the performance of the system has changed, the underlying concept is the same.  The leading databases, except for Oracle, all bare SQL in the name, giving the impression that SQL was critical to the concept of the relational database, not merely a front end language for describing access to relational data.

Web sites fit nicely into a relational model.  They have categories, articles, products, etc., sets of data.  The idea of applying set theory to data is at the core of the relational database.  I can quickly and easily get all Articles in the Category of SEO, because those fields are tagged, and I simply pull the appropriate subset.  You can always get intersections (with JOINs), unions, set deletes (EXCEPT), and other set operations… if you are using sets of data.

Martin Kleppmann asks, on Carsonified, Should you go Beyond Relational Databases? That’s the wrong question to ask.  The question is, “Is your data relational?”  If you have groupings of like data, then you need a relational database.  If you are building an application with non-relational data, then storing it in the database to have a quick id look up is foolish, and you should be looking for persistent data storage that is optimized for that sort of data.

For temporary storage, a system like memcached is perfect, it gives you lightning fast references to data that may only exist temporarily.  For a long term storage, maybe a database is your answer, or maybe you need something more tied to your data structure.  We wouldn’t suggest Microsoft switch from it’s DOC format (and the Docx XML version) to relational databases, but I wouldn’t put relational data into something more object oriented.  You might use objects to represent it in memory for easier programming, but if the data is essentially relational, keep it in relations.

Data structures are at the core of computer science.  With all the free information out there, there is no excuse to be building a large scale system without knowing the basics.  The fact that Twitter built their operation without knowing what they were doing doesn’t mean that everyone can… Bill Gates dropped out of Harvard and made a fortune, not every Harvard drop-out is so successful.

PostgreSQL Cascading: Updates and Deletes

So something nice about a real database is cascading values, most commonly used to deletion, but you can use them for updates as well.  Let me give you a scenario: I track groups of data from my clients by a sub_id field.  As they add groups, their ids aren’t in a range.  If I wanted to consolidate them, I could update the sub_id (somewhere that won’t be stomped on by the sequence), but what about historical data.  Baring a cascade rule, PostgresSQL will stop the update.  A non-relational database like MySQL will just leave orphan foreign keys.  If you set Cascading, they come along for the ride.

View the following example:

CREATE TABLE A (a_id SERIAL PRIMARY KEY, letter char);
CREATE TABLE B (b_id SERIAL PRIMARY KEY,
a_id integer REFERENCES A(a_id) ON DELETE CASCADE ON UPDATE CASCADE,
letter char);

INSERT INTO A (letter) VALUES ('a'), ('b'), ('c'), ('d'), ('e');

INSERT INTO B(a_id, letter) VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e');

SELECT * FROM A;
a_id | letter
------+--------
1 | a
2 | b
3 | c
4 | d
5 | e

SELECT * FROM B;
b_id | a_id | letter
------+------+--------
1 | 1 | a
2 | 2 | b
3 | 3 | c
4 | 4 | d
5 | 5 | e

DELETE FROM A WHERE a_id > 3;
-- We delete a few fields from the bottom table, and PostgreSQL magically cascades it to B

SELECT * FROM A;
a_id | letter
------+--------
1 | a
2 | b
3 | c

SELECT * FROM B;
b_id | a_id | letter
------+------+--------
1 | 1 | a
2 | 2 | b
3 | 3 | c

UPDATE A set a_id = a_id + 3;
-- This is the example above, renumbering the code

SELECT * FROM A;
a_id | letter
------+--------
4 | a
5 | b
6 | c

SELECT * FROM B;
b_id | a_id | letter
------+------+--------
1 | 4 | a
2 | 5 | b
3 | 6 | c

So we were able to renumber A, pass the values to B, without any updates to B. Pretty cool, huh?

CMS and CPU Usage

I have normally been adverse to Content Management Systems (CMS), because they generally are coded poorly to work in a “plugin” format.  Each page routinely makes dozens of database calls, which can put a big strain on the CPU.  On the other hand, the let an individual programmer quickly add LOTS of functionality that would have required a team of programmers months to develop.  I used to find them particularly heinous because they destroyed SEO attempts, but all the modern systems let you have a reasonable URL structure.  As a result, if you are successful and decide to build the $100,000 website, you can always point the old URLs to the new location and not break links.

However, something that was a reminder today, a spike in traffic can destroy your server if you aren’t optimized.  If you are getting promoted on television, being interviewed, or otherwise getting mentioned on a popular program that might send a few thousand people to your site at one time, be careful.  Even if bandwidth isn’t a problem, CPU and Memory might be.  If you are expecting a spike, make your home page static.  Most of your visitors will come there, and if you make it a static page (mod_rewrite them to the dynamic script if they have a logged in cookie), you’ll drastically cut your database load.

In fact, I think most sites would do well to always make the home page static.  Something we did “back in the day,” was program the whole site dynamically, then “mirror” the home page to a file with wget or something.  One could have most of their site mirrored to static files, and serve up dynamic pages to logged in users.  Historically, that’s what Slashdot used to do.

Either way, you should remember to optimize your main queries, and create the appropriate indexes as part of bring a site life.  Bandwidth isn’t the only constraint, sites without dozens of servers need to worry about CPU and memory usage as well.  A popular television show can send WAY more simultaneous traffic than social media or search engines, at least at one time.