Notes On Amazon RDS Replication Lag

What is replication lag?

If your system runs on Amazon Relational Database Service (RDS) you may have opted to configure one or more replicas for your main MySQL database(s). This means you have a master RDS instance and at least one slave RDS instance which receives updates from the master. This process is called replication.

Replication ensures that changes made on the master database also happen on the slave after some period of time. For a variety of reasons this period of time can increase. For example, a long-running query or erroneous query can cause replication to slow down or stop entirely. This results in replication lag: changes made on your main database aren't showing up on the slave replica because the replica is lagging behind.

Being informed when replication lag occurs

Amazon provides monitoring functionality that can alert you when replication lag becomes too high. One of the properties of a slave instance is Seconds_Behind_Master which can be viewed by executing the query show slave status; on the slave.

This field contains the replication lag as measured in seconds and it is usually a small number, such as zero, but it can increase as the lag goes up. When the number becomes too high you can configure Amazon's CloudWatch alert system to send you an e-mail.

This works fine as long as replication occurs normally. There are issues which can cause replication to stop entirely. In that case Seconds_Behind_Master could become NULL and you won't receive an e-mail (as that e-mail is only sent when the delay behind master exceeds a preconfigured threshold).

Resolving replication issues

When there is a replication issue the output of show slave status; is quite useful in debugging and resolving it.

You need to review the values of:

  • Slave_SQL_Running
  • Last_Error
  • Last_SQL_Error

When a particular SQL query failed on the slave it could be that execution of queries in general has stopped. This is indicated by Slave_SQL_Running having the value No.

In that case you'll either need to:

  • Remedy the error by fixing the issue that caused the SQL query to fail.
  • Decide to resume replication by letting the slave ignore that error.

The former situation can be tricky as it requires you to figure out what data or query is problematic based on the values of Last_Error and Last_SQL_Error. These fields may provide enough information to determine any incorrect records but this is not always the case.

In the latter case you would execute the following command on the slave:

CALL mysql.rds_skip_repl_error;

You should only run this command when you've determined that skipping the SQL query won't lead to inconsistent data or incorrect data on the slave (or, at least, that this is allowed to occur by skipping that particular SQL query).

Contents © 2014 Simeon Visser