High Performance Writes with Triggers

So a high-performance system I was working on hit a technical problem: heavy, repeated load from a single data source caused the Apache threads to collide and time out, with the load balancer returning a 503 error.  After much frustration, it was finally tracked down to this process.  When a data record enters the system, it is recorded, counters are incremented, and as it moves to the monetization systems, each transmission is logged.  Each transmission also has counters updated via trigger, all of which makes a tremendous amount of data available as simple lookups or small aggregates.  Knowing how many records came in from a company on 5 tracking codes over 3 months means an aggregate query that sums 15 rows (the monthly total for each tracking code), but it also means a tremendous number of UPDATE ... SET count = count + 1 triggers.  When a second event arrives before the first has completed, a deadlock can occur as multiple transactions attempt to update the same row.

The solution?  A holding table without triggers.

Another transmissions table was created, one that lacks any of the counter code that causes the deadlock.  When a transmission is logged, it is immediately recorded in this simple table, which returns control to the system.  A background daemon calls a single stored procedure to migrate the data:

  1. Record the ids of all transmission records currently in the holding table (the database's transaction isolation should handle this so changes from other transactions don't interfere, but doing it explicitly makes the procedure more portable and has a negligible impact on the process).
  2. As a single insert, bring all these records from the holding table to the permanent table.
  3. Delete these records from the holding table.

As a result of this process, the deadlock condition is resolved, and the procedure takes a fraction of the time because all the updates can be processed in one transaction without conflict.
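The three steps can be sketched as a single pl/pgsql function.  This is a sketch only: the table and column names (transmission_holding, transmissions, transmission_id) are assumptions, not the production schema.

```sql
-- Hypothetical sketch of the migration procedure described above.
CREATE OR REPLACE FUNCTION migrate_transmissions() RETURNS void AS $$
BEGIN
    -- 1. Explicitly capture the ids currently in the holding table.
    CREATE TEMP TABLE pending ON COMMIT DROP AS
        SELECT transmission_id FROM transmission_holding;

    -- 2. A single INSERT moves the captured rows to the permanent,
    --    trigger-bearing table.
    INSERT INTO transmissions
        SELECT h.*
          FROM transmission_holding h
          JOIN pending p USING (transmission_id);

    -- 3. Delete only the rows we copied; anything logged while this
    --    function was running stays behind for the next pass.
    DELETE FROM transmission_holding
     WHERE transmission_id IN (SELECT transmission_id FROM pending);
END;
$$ LANGUAGE plpgsql;
```

Capturing the ids up front is what makes step 3 safe: rows inserted mid-run are never deleted, only rows that were actually copied.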

The only drawback: the system's counters and transmission log are only accurate as of the last time the procedure ran.  To keep the intrusion as minimal as possible, a daemon was written in bash, see this tutorial for a starting point, to simply run the stored procedure, sleep, and run again.  To protect yourself from potential race conditions, which result in deadlocking, do three things:

  1. Use the same sequence for the primary key field in the holding table as in the real one.  That way, if you attempt to copy a record a second time, the copy fails because the ids are already in use (this also lets me do bulk adds to the permanent table from a data pull without wasting time in the holding table).
  2. Run the function as a dedicated user.  Have the function run with definer permissions, and create a dedicated user to run the script, capped at a single login.
  3. Run a daemon, not a cron job.  If something delays the script and you cron it every minute, you could start a second copy before the first completes.  With a daemon, the second run doesn't start until the first is done.
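The bash daemon itself can be as small as a loop.  Everything here (connection details, function name, interval) is an assumption, not the original script; the optional iteration cap is only there to make the loop testable.

```shell
#!/bin/bash
# Hypothetical daemon: connection details, function name, and interval
# are assumptions, not the author's original script.
MIGRATE_CMD=${MIGRATE_CMD:-'psql -U migrate_user -d mydb -c "SELECT migrate_transmissions();"'}
INTERVAL=${INTERVAL:-60}

# Run the migration, sleep, repeat.  A positive argument caps the number
# of iterations (handy for testing); no argument means run forever.
run_daemon() {
    local max_runs=${1:-0} i=0
    while :; do
        eval "$MIGRATE_CMD"
        i=$((i + 1))
        if [ "$max_runs" -gt 0 ] && [ "$i" -ge "$max_runs" ]; then
            break
        fi
        sleep "$INTERVAL"
    done
}

# run_daemon    # uncomment to loop until killed
```

Because the loop only starts the next run after the previous one returns, two copies of the stored procedure can never overlap, which is exactly the property a one-minute cron job cannot guarantee.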

When using a holding table, make certain that all you are doing is deferring the triggers until after control has returned to the client.  For a less performance-critical task, a simple LISTEN/NOTIFY structure instead of triggers will get the job done with less complexity.

One caveat: referential integrity can be compromised here if you are not careful.  If other tables depend on this one, this approach will not work, as records will be recorded in the system but not yet present in the table the foreign keys reference.  Engineering around that limitation may create tremendous complexity.

An alternative approach to explore: partial replication to a second server.  In that scenario, all the control code (meta code) could replicate from the repository to the secondary server(s), and each of those could hold temporary insert tables.  As you add servers, make the sequences on each server count by multiples (with two servers, one uses even ids, the other odds).  Then you can replicate those tables back to the main database with the LISTEN/NOTIFY system, without worrying about multiple calls in short order.
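The non-colliding sequences for two servers would look like this (sequence name is hypothetical):

```sql
-- On server 1: odd ids only
CREATE SEQUENCE transmissions_id_seq START WITH 1 INCREMENT BY 2;

-- On server 2: even ids only
CREATE SEQUENCE transmissions_id_seq START WITH 2 INCREMENT BY 2;
```

With N servers, use INCREMENT BY N and start values 1 through N, and the id spaces can never overlap when the rows are merged back into the main database.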

Beyond Relational Databases: Is Your Data Relational?

One of the strangest things about technology is how it moves in circles.  The relational database isn't new technology, and while the storage model and the performance of these systems have changed a great deal, the underlying concept is the same.  The leading databases, except for Oracle, all bear SQL in the name, giving the impression that SQL is critical to the concept of the relational database, rather than merely a front-end language for describing access to relational data.

Web sites fit nicely into a relational model.  They have categories, articles, products, etc.: sets of data.  The idea of applying set theory to data is at the core of the relational database.  I can quickly and easily get all Articles in the Category of SEO, because those fields are tagged, and I simply pull the appropriate subset.  You can always get intersections (with JOINs), unions, set differences (EXCEPT), and other set operations… if you are using sets of data.
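As a sketch, assuming a hypothetical articles/categories schema with a mapping table, the SEO example and a set difference look like:

```sql
-- Assumed schema: articles(article_id, title),
--                 categories(category_id, name),
--                 article_category_map(article_id, category_id)

-- Intersection via JOIN: all articles in the SEO category
SELECT a.title
  FROM articles a
  JOIN article_category_map m USING (article_id)
  JOIN categories c USING (category_id)
 WHERE c.name = 'SEO';

-- Set difference via EXCEPT: articles assigned to no category at all
SELECT article_id FROM articles
EXCEPT
SELECT article_id FROM article_category_map;
```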

Martin Kleppmann asks, on Carsonified, Should you go Beyond Relational Databases? That's the wrong question to ask.  The question is, "Is your data relational?"  If you have groupings of like data, then you need a relational database.  If you are building an application with non-relational data, then storing it in a relational database just to get a quick id lookup is foolish, and you should be looking for persistent data storage that is optimized for that sort of data.

For temporary storage, a system like memcached is perfect: it gives you lightning-fast references to data that may only exist temporarily.  For long-term storage, maybe a database is your answer, or maybe you need something more tied to your data structure.  We wouldn't suggest Microsoft switch from its DOC format (and the DOCX XML version) to relational databases, but I wouldn't put relational data into something more object-oriented either.  You might use objects to represent it in memory for easier programming, but if the data is essentially relational, keep it in relations.

Data structures are at the core of computer science.  With all the free information out there, there is no excuse to be building a large scale system without knowing the basics.  The fact that Twitter built their operation without knowing what they were doing doesn’t mean that everyone can… Bill Gates dropped out of Harvard and made a fortune, not every Harvard drop-out is so successful.

PostgreSQL Cascading: Updates and Deletes

So something nice about a real database is cascading values, most commonly used for deletion, but you can use them for updates as well.  Let me give you a scenario: I track groups of data from my clients by a sub_id field.  As they add groups, their ids aren't in a contiguous range.  If I wanted to consolidate them, I could update the sub_id (to somewhere that won't be stomped on by the sequence), but what about historical data?  Barring a cascade rule, PostgreSQL will stop the update.  A non-relational database like MySQL will just leave orphaned foreign keys.  If you set cascading, they come along for the ride.

View the following example:

CREATE TABLE A (a_id SERIAL PRIMARY KEY, letter char);

CREATE TABLE B (b_id SERIAL PRIMARY KEY,
    a_id integer REFERENCES A (a_id) ON UPDATE CASCADE ON DELETE CASCADE,
    letter char);

INSERT INTO A (letter) VALUES ('a'), ('b'), ('c'), ('d'), ('e');

INSERT INTO B(a_id, letter) VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e');

a_id | letter
1 | a
2 | b
3 | c
4 | d
5 | e

b_id | a_id | letter
1 | 1 | a
2 | 2 | b
3 | 3 | c
4 | 4 | d
5 | 5 | e

DELETE FROM A WHERE a_id > 3;
-- We delete a few rows from A, and PostgreSQL magically cascades the deletes to B

a_id | letter
1 | a
2 | b
3 | c

b_id | a_id | letter
1 | 1 | a
2 | 2 | b
3 | 3 | c

UPDATE A SET a_id = a_id + 3;
-- This is the scenario from above, renumbering the ids

a_id | letter
4 | a
5 | b
6 | c

b_id | a_id | letter
1 | 4 | a
2 | 5 | b
3 | 6 | c

So we were able to renumber A, and the new values propagated to B, without any explicit updates to B. Pretty cool, huh?

Primary Keys in PostgreSQL shouldn’t be a Mystery

PostgreSQL has so much power, yet people think it's much harder to use than it is.  You can do just about anything with ALTER TABLE when it comes to adding columns.  The only tricky thing I've found is that you can't add a NOT NULL column without a DEFAULT if the table has data, since the initial value would be NULL.  So you can allow nulls, fill in your data, then tag the column NOT NULL, or give it a useful default and move on.  I saw another post of someone making PostgreSQL way more difficult than needed when creating a primary key later.  While there are some reasons to create a sequence manually (I once needed unique integers for insertion into a third-party system), there isn't a need for it for something as simple as a primary key.
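The allow-nulls-backfill-then-constrain workflow mentioned above looks like this (table and column names are hypothetical):

```sql
-- Three steps: add nullable, backfill, then enforce NOT NULL.
ALTER TABLE my_table ADD COLUMN status varchar;            -- nullable at first
UPDATE my_table SET status = 'active';                     -- fill in the data
ALTER TABLE my_table ALTER COLUMN status SET NOT NULL;     -- now constrain it

-- Or in one step, by supplying a useful default:
ALTER TABLE my_table ADD COLUMN flag varchar NOT NULL DEFAULT 'active';
```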

Let's set up our test table.  You can obviously do all the inserts in one statement; I was playing on an older server for testing.

test_database=# CREATE TABLE test_table (col_A varchar, col_B varchar);
test_database=# INSERT INTO test_table(col_a, col_B) VALUES ('a','A');
test_database=# INSERT INTO test_table(col_a, col_B) VALUES ('b', 'B');
test_database=# INSERT INTO test_table(col_a, col_B) VALUES ('c', 'C');

Now we want to add a Primary Key:

test_database=# ALTER TABLE test_table ADD COLUMN test_id SERIAL PRIMARY KEY;
NOTICE: ALTER TABLE will create implicit sequence "test_table_test_id_seq" for serial column "test_table.test_id"
NOTICE: ALTER TABLE / ADD PRIMARY KEY will create implicit index "test_table_pkey" for table "test_table"
test_database=# SELECT * from test_table;
col_a | col_b | test_id
a | A | 1
b | B | 2
c | C | 3
(3 rows)
test_database=# \d test_table
Table "public.test_table"
Column | Type | Modifiers
col_a | character varying |
col_b | character varying |
test_id | integer | not null default nextval('test_table_test_id_seq'::regclass)
Indexes:
    "test_table_pkey" PRIMARY KEY, btree (test_id)

Look: one line added a new column test_id, with the sequence, tagged as the primary key.  Not so hard, is it?  Why isn't that an obvious example under ALTER TABLE, given that it's something one might actually want to do?

PostgreSQL is Powerful, Confuses Newbies

Looking at the list of PostgreSQL blog entries, I saw this one,
Add auto increment column in PostgreSQL, and realized why my database of choice isn't so popular… it confuses people.  Back in the old days, adding an auto-numbering field to PostgreSQL as a primary key was hard: you had to create the sequence and set the default.  This is a common operation, since a numeric primary key is often easier than looking for a natural key, and even when we have a natural key, it's usually easier to index an integer than a string if we are going to use it as a foreign key somewhere.  The SERIAL "type" in PostgreSQL is the same as auto_increment, but it's implemented as TYPE integer with an attached SEQUENCE and a default value.
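For reference, the old manual spelling that SERIAL wraps up looks roughly like this, equivalent to the one-line version that follows:

```sql
-- What SERIAL expands to: an integer column, a sequence, and a default.
CREATE SEQUENCE names_table_name_id_seq;

CREATE TABLE names_table (
    name_id integer NOT NULL DEFAULT nextval('names_table_name_id_seq'),
    name    varchar
);

-- Tie the sequence's lifetime to the column.
ALTER SEQUENCE names_table_name_id_seq OWNED BY names_table.name_id;
```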

CREATE TABLE names_table (name_id SERIAL PRIMARY KEY, name varchar);

This gives you a PRIMARY KEY column name_id that will be auto-incremented, and a string column name.  I use these all the time, and the PRIMARY KEY shortcut is an easy way to make a single column the primary key.  You can always use a PRIMARY KEY directive for composite keys in mapping tables and similar constructs.
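The composite-key case is just the table-level form of the same directive (names here are hypothetical):

```sql
-- A mapping table keyed on the pair of ids, not a surrogate column.
CREATE TABLE article_tag_map (
    article_id integer NOT NULL,
    tag_id     integer NOT NULL,
    PRIMARY KEY (article_id, tag_id)   -- table-level, composite key
);
```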

I find it easy because I know what a SERIAL is, but given that EVERY newcomer to PostgreSQL is looking for this feature, wouldn't it make sense to highlight it in the documentation?  Perhaps in the examples given?  And since auto increment is so common, why not support the keyword?  SERIAL is syntactic sugar anyway; why not accept the MS SQL/MySQL terminology?  You could throw a notice so people learn the preferred form, but why push off testers?

MySQL Everywhere from a PostgreSQL Junky

Almost 10 years ago, we started working with open source databases and looked at MySQL and PostgreSQL.  Back then, the feature lists were night and day: MySQL was lightning fast on quick reads and extremely slow on complicated joins, while PostgreSQL was much slower on the simple reads, but its referential integrity features, like delete cascades, made writes/deletes/updates much faster.  Most of the commentary at the time focused on transactions, but PostgreSQL focused on correctness.

Let me give you an example.  Assume my website's products are put into categories for a custom product display system (this was a common problem for someone building SEO sites for products back in 2001-2004).  If I wanted my products to exist in multiple categories, I created an intermediate mapping table, and everyone who uses an ORM system is probably nodding their head right now.  But here was the neat thing: in PostgreSQL, I just set the mapping table's references to the ids as NOT NULL with ON DELETE CASCADE.  Now, if I delete a category, I don't have to delete everything from the mapping table; the entries are automatically deleted whenever the category is deleted (and the same for a product).
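The mapping table described above, sketched with assumed table and column names; the REFERENCES clauses carry the cascade:

```sql
CREATE TABLE product_category_map (
    product_id  integer NOT NULL
        REFERENCES products (product_id) ON DELETE CASCADE,
    category_id integer NOT NULL
        REFERENCES categories (category_id) ON DELETE CASCADE
);

-- Deleting a category now removes its mapping rows automatically,
-- with no application code involved:
DELETE FROM categories WHERE category_id = 7;
```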

Now, the MySQL guys are thinking: lazy programmer, just delete from the mapping table when you delete the category, and we keep our faster database (or use the slow relational systems if you are lazy).  That's all fair, but assume I'm updating this 2001 site to support a new product/category tagging feature.  I want it fast, so I create a table of tags (with some descriptions for my SEO purposes) and two (or more) mapping tables.  On my PostgreSQL site, no big deal: I just make these maps ON DELETE CASCADE, and when I delete a category, the tag mappings go with it.  On the MySQL site, I have to dig into the code where I deleted the category and update it to delete the tag mappings as well.  That's fine if I built it, but what if that chunk was built by my summer intern who was learning programming at the time?  Now it's a pile of spaghetti code; it works, but it's ugly, and I have to fix it.

Now, in 2009, MySQL has tables with referential integrity, and PostgreSQL is way faster.  I have a library of pl/pgsql code written over the years, some of which I updated for the new features in PostgreSQL 8.x, some of which still dates back to the limitations of PostgreSQL 7.3.  I have SEO-efficient code so I never have to worry about HTML-friendly naming schemes for URLs, among other things.  Meanwhile, MySQL has added cool features for the open source world of quick and dirty code, and PostgreSQL has aimed to please DBAs with awesome tuning parameters.  As a result, there is a wealth of awesome open source code developed in the past 8 years, but it's all on MySQL.  Even when it's ported to PostgreSQL, it uses PostgreSQL as a fake MySQL: instead of ON DELETE CASCADE, we still delete from the mapping table by hand.

And PostgreSQL fixed the speed problems: 8.3 is FAST, and the new 8.4 is probably faster still, but we don't take advantage of it because our code is all on MySQL.  Just another reminder that Worse Really Is Better!