Scaling Postgres on Amazon RDS: Query Performance on Large Tables

March 11, 2016 jgerhardt

There are few large, hard-to-solve problems that keep your typical software engineer up at night, but a malfunctioning database is absolutely one of them. We’ll walk you through how we discovered the issue and what we did to fix it.

The Problem

Back in December 2015, we noticed query performance on our main Postgres instance starting to slip. Over the course of a few weeks, queries that had once taken 10ms were taking upwards of 6 seconds or more. We racked our brains trying to identify expensive background jobs that might be causing database-wide latency, but to no avail. Eventually, with the aid of New Relic, Amazon RDS logs, and hundreds of EXPLAIN queries, we came to the conclusion that this was a database-level issue, not something caused by poorly written queries.

Addressing the Root Cause

We use Amazon’s RDS to host our database, so much of the operational complexity is hidden from us. However, the default RDS Postgres configuration leaves a lot to be desired.

After days of research, we narrowed in on the cause: the default autovacuum settings can leave large tables unvacuumed for weeks on end. By default, an autovacuum process kicks off once 20% of the rows in a table have been created, updated, or deleted (you can inspect these settings with the query shown after this list), and it accomplishes the following goals:

  • Recover disk space from updated/deleted rows
  • Update statistics used by the Postgres query planner
  • Update the visibility map, thus speeding up index-only scans
  • Protect against loss of very old data due to transaction ID wraparound (much less common)
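
The instance-wide autovacuum defaults live in pg_settings, so a quick way to see what your database is currently configured to do is a minimal query along these lines:

<code>
-- Inspect the instance-wide autovacuum defaults
SELECT name, setting
FROM pg_settings
WHERE name LIKE 'autovacuum%';
</code>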

Some of our larger tables have as many as 300 million records, so with the default settings, autovacuuming wouldn’t occur until 60 million rows had been changed!
We decided to lower the autovacuum scale factor on these tables from the default 20% to 2%, meaning the autovacuum daemon would run after every 6 million changed rows instead of every 60 million.

<code>

vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuples

</code>
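
To make those numbers concrete, here is the formula evaluated for a 300-million-row table, assuming the stock base threshold of 50 tuples (your instance’s autovacuum_vacuum_threshold may differ):

<code>
-- Rows that must change before autovacuum fires, before and after the tweak
SELECT 50 + 0.20 * 300000000 AS default_threshold,   -- 60,000,050
       50 + 0.02 * 300000000 AS adjusted_threshold;  --  6,000,050
</code>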

After so much work to identify the problem, all it took to fix it was running the following statement and waiting for Postgres to do its work.

ALTER TABLE changesets SET (autovacuum_vacuum_scale_factor = 0.02);
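
To confirm the per-table override took effect, you can check the table’s storage options in pg_class; the override shows up in its reloptions column:

<code>
-- Verify the per-table autovacuum setting
SELECT relname, reloptions
FROM pg_class
WHERE relname = 'changesets';
</code>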

The Results

[Image: long-term database performance graph]

As you can see from our long-term database profiling, the issue reached a boiling point in early January. Shortly after introducing these changes, we brought average query execution time down by nearly 50% across the board, and often by upwards of 90% on these larger tables.

Helpful Queries

-- See the last time autovacuum ran for each table
SELECT relname, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_catalog.pg_stat_all_tables
WHERE n_dead_tup > 0
ORDER BY n_dead_tup DESC;

-- Get an approximate row count on a very large table
SELECT reltuples AS approximate_row_count
FROM pg_class
WHERE relname = 'changesets';
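
And if you don’t want to wait for the daemon to come back around after changing a table’s settings, a manual vacuum performs the same cleanup on demand (it consumes I/O, so run it during a quiet period):

-- Manually vacuum and analyze a single table right away
VACUUM (VERBOSE, ANALYZE) changesets;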

