So a high performance system I was working on encountered a small technical problem, under heavy and repeated load from a single data source resulted in the Apache threads colliding, timing out, and resulting in the load balancer returning a 503 Error. After much frustration, it was finally tracked down to this process. When a data record enters the system, it is recorded, counters are interated, and as it moves to the monetization systems, each transmission is logged. Each transmission also has counters updated via trigger, all of which presents a tremendous amount of data as simple lookups or small aggregate data. Knowing how many records come in from a company on 5 tracking codes over 3 months means an aggregate query that sums 15 lines (the monthly total for each tracking code), but this means a tremendous number of UPDATE set count = count+1 triggers. When a second event takes place before the first one is completed, a deadlock can occur as multiple transactions attempt to access the same row for updating.
The solution? An holding table without triggers.
Another table for transmissions was created, that lacks any of the counter code that causes the deadlock. When a transmission is logged, it is immediately recorded in this simple table, which returns control back to the system. A background daemon runs that calls a single stored procedure to migrate the data:
- Record the id of all transmission records in this table (obviously, the database should handle this so changes in other transactions don’t interfere, but doing this explicitly makes it more portable and has a negligible impact on the process.
- As a single insert, bring all these records from the holding table to the permanent table.
- Delete these records from the holding table.
As a result of this process, the deadlock condition is resolved, and this procedure takes a fraction of the time because all the updates can be processed in the transaction without conflict.
The only drawback, the system’s counters and transmission log is only accurate to the last time it ran. To make this as minimal as possible an intrusion, a daemon was written in bash, see this tutorial for a starting point, to simply run the stored procedure, sleep, and run again. To protect yourself from potential race conditions, which result in deadlocking, do three things to avoid trouble:
- Use the same sequence for the primary key field in the temp table as the real one. That way, if you attempt to copy it a second time, it will fail because the ids are in use (this also lets me to bulk adds to the table from a data pull without wasting time in the temp table).
- Run the function as a dedicated user. Have the function run with definer permissions, and create a dedicated user to run the script, capped at a single login.
- Run the daemon, not a cron job. If something delays the scripts, and you cron job it every minute, then you could start a second copy before the first is completed. If you write the daemon, then the second job doesn’t run until the first is done.
When using a holding table, make certain that you are simply trying to let the triggers run after control has returned to the client function. In a less high performance task, a simple LISTEN/NOTIFY structure instead of triggers will get the job done with less complexity.
Caveat, referential integrity can be compromised here if you are not careful. If things are dependent on this table, this approach will not work, as the records will be recorded in the system, but not passed to the table where foreign keys reference. Engineering around that limitation may create tremendous complexity.
Alternative Approach to explore: partial replication to a second server. In that scenario, all the control code (meta code) could replicate from the repository to the secondary server(s). Each of those could hold temporary insert tables. In that case, as you add servers, make the sequences on each server count by multiples (two servers, one uses evens for ids, one uses odds). Then you can replicate those tables back to the main database with the LISTEN/NOTIFY system, not worried about multiple calls in short order.