There are few large, hard to solve problems that keep your typical software engineer up at night — and a malfunctioning database is absolutely one of them. We’ll walk you through how our team at Contactually discovered the issue of query performance slippage on some of our largest tables (200 mil — 3 billion+ records) and what we did to address it.
The Problem: Systematic Degraded Query Performance
In early 2016, we noticed query performance on our largest tables were starting to slip in NewRelic.
Over the course of a few weeks, queries that were once taking 10ms were now taking upwards of 6 seconds or more. The team considered numerous causes from poorly written queries to expensive background jobs that may be causing systematic performance issues. Eventually, with the aid of NewRelic, Amazon RDS logs, and hundreds of EXPLAIN queries, we came to the conclusion that this was a database level issue, not poorly written queries in our application (a very common cause of poor database performance).
The Culprit: Default Autovacuum Settings
After days of research, we narrowed in on the cause: the default autovacuum settings for large tables.
The default settings can cause weeks to elapse without triggering an autovacuuming process to run on a large table. When autovacuuming doesn’t occur, the query planner is using outdated, incorrect data to decide how to most efficiently execute a query.
Imagine trying to tell a friend where the milk is in a grocery store using directions from what the store looked like 5 years ago — you’re friend is going to waste a lot of time using your outdated directions.
Critical Benefits of Autovacuuming
There are numerous benefits to regular autovacuuming. However, this type of operation consumes system resources. The purpose of the default settings for autovacuuming is to ensure it does not run too often, siphoning precious system resources away from answering queries. It’s a delicate balance to strike.
1. Helps to recover disk space from updated/deleted rows
2. Updates statistics used by the Postgres query planner
3. Updates the visibility map, speeding up index-only scans
4. Protects against loss of very old data due to transaction ID wraparound
The side effects of a poorly functioning query planner can be catastrophic. The CPU and Disk IO hit caused by using these bad directions can lock up vital resources needed by the rest of the database, dramatically decreasing query performance across the board, even for tables that aren’t affected by this specific issue.
Take a look at how a simple select statement, finding a row by its primary key, was impacted (before and after):
The Solution: Tuning of Table-Specific Autovacuum Settings
We use Amazon’s Relational Database Service (RDS) to host our Postgres databases, so much of the infrastructure complexity is hidden from us, e.g. network, filesystem, etc. However, there is a lot to be desired from the default RDS Postgres database configuration.
We found our solution in a rather infrequently mentioned “auto vacuum scale factor” setting that helps Postgres determine when it’s time to perform maintenance on a particular table. RDS defaults to running an autovacuum process when 20% of the tuples in a table have been created, updated or deleted. Some of our larger tables have as many as 3 billion records, so this meant that autovacuuming wouldn’t occur until 600 million rows had been changed!
For our larger tables we decided to update our autovacuum scale factor down to 2% instead of the default 20% (or 60 million records instead of 600).
vacuum threshold = vacuum base threshold + vacuum scale factor * numberof tuples
While the majority of the work to solve our issues was in research, better understanding the internals of Postgres, all it took to rectify the solution was run the following query and wait for Postgres to do its work.
ALTER TABLE table_name SET autovacuum_vacuum_scale_factor = 0.02;
We were able bring down our average query execution time by nearly 50% across the board, but often times upwards of 90% faster on these larger tables.
It goes without saying that this level of speed improvement shouldn’t be surprising when regular database maintenance that should be constantly occurring doesn’t happen for large periods of time.