There are few large, hard-to-solve problems that keep your typical software engineer up at night – but a malfunctioning database is absolutely one of them. We’ll walk you through how we discovered the issue and what we did to fix it.
Back in December 2015, we noticed query performance on our main Postgres instance starting to slip. Over the course of a few weeks, queries that once took 10ms were taking upwards of 6 seconds. We racked our brains trying to analyze expensive background jobs that might be causing database-wide latency, but to no avail. Eventually, with the aid of NewRelic, Amazon RDS logs, and hundreds of EXPLAIN queries, we came to the conclusion that this was a database-level issue, not something caused by poorly written queries.
Addressing the Root Cause
We use Amazon’s RDS to host our database, so much of the operational complexity is hidden from us. However, the default RDS Postgres configuration leaves a lot to be desired.
After days of research, we zeroed in on the cause: with the default autovacuum settings, large tables can go weeks on end without a regular autovacuum. By default, an autovacuum process kicks off only once 20% of the rows in a table have been created, updated, or deleted, and it accomplishes the following goals:
- Recover disk space from updated/deleted rows
- Update statistics used by the Postgres query planner
- Update the visibility map, thus speeding up index-only scans
- Protect against loss of very old data due to transaction ID wraparound (much less common)
Some of our larger tables have as many as 300 million records, so the default autovacuuming wouldn’t occur until 60 million rows had been changed!
We decided to lower the autovacuum scale factor on these tables from the default 20% to 2%. That way, the autovacuum daemon would run after every 6 million rows changed instead of every 60 million, per the formula Postgres uses to decide when to vacuum:
vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuples
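Plugging our numbers into that formula makes the problem concrete. Here is a quick sketch in Python (the base threshold of 50 tuples is Postgres’s default autovacuum_vacuum_threshold; the 300-million-row count is from our own larger tables):

```python
def vacuum_threshold(n_tuples, scale_factor, base_threshold=50):
    """Rows that must change before autovacuum triggers, per the Postgres formula."""
    return base_threshold + scale_factor * n_tuples

rows = 300_000_000  # one of our larger tables

# Default 20% scale factor: ~60 million rows must change first
print(vacuum_threshold(rows, 0.20))  # 60000050.0

# Tuned 2% scale factor: ~6 million rows
print(vacuum_threshold(rows, 0.02))  # 6000050.0
```

The base threshold is negligible at this scale; the scale factor dominates, which is why the default is fine for small tables but disastrous for very large ones.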
After so much work to identify the problem, all it took to fix it was running the following statement and waiting for Postgres to do its work.
ALTER TABLE changesets SET (autovacuum_vacuum_scale_factor = 0.02);
As you can see from our long-term database profiling, the issue reached a boiling point in early January. Shortly after introducing these changes, we brought our average query execution time down by nearly 50% across the board, and queries against these larger tables often ran upwards of 90% faster.
-- See the last time autovacuum ran for all tables
SELECT relname, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_catalog.pg_stat_all_tables
WHERE n_dead_tup > 0
ORDER BY n_dead_tup DESC;
-- Get an approximate row count on a very large table
SELECT reltuples AS approximate_row_count
FROM pg_class
WHERE relname = 'changesets';