Thursday, April 7, 2022

High-Level Steps Involved: Disaster Recovery for Galera Cluster

 

Geo-Distributed MySQL Clusters for the Enterprise

Galera Cluster provides high availability and scalability for MySQL. A Galera cluster consists of 3 or more Galera instances in a local network using synchronous replication. Galera Cluster supports multi-master and now has a GUI available for cluster management. While this provides high availability within a local region or site, it makes no provision for disaster recovery (DR) or for multi-site deployment in general, so let’s explore how we could extend Galera Cluster to deploy at geo-scale.

Linking Two (2) or More Galera Clusters

To create a multi-site Galera Cluster, it’s tempting to simply add more cluster nodes in a remote region. However, due to synchronous replication and high latencies over the WAN, cluster and application performance would become unacceptable and thus this topology is not recommended. Instead, we turn to native MySQL replication, which is asynchronous and does not come with a significant performance penalty.

The steps to establish replication from Site 1 to Site 2 (DR) are:

  1. Create a cluster in Site 1.

  2. Create a cluster in Site 2 from Site 1, for example by restoring a backup of Site 1. At this point, the 2 clusters are identical.

  3. Configure necessary firewall rules between sites to allow replication from Site 1 to Site 2.

  4. Also, configure the firewall to allow replication in the reverse direction (see below).

  5. Add the following to a cluster node in Site 1 (and Site 2 for failback, see below):

    log-bin=galera-bin
    log-slave-updates
    server-id=1

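    You can adjust the values as needed. What matters is enabling binary logging AND logging slave updates, so that writes applied from other nodes are also reflected in the binary log of the current node. Note that native MySQL replication requires the replicating node in Site 2 to use a different server-id from the node it replicates from.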

  6. Restart your Galera cluster (Site 1) to make sure the settings take effect.

  7. Now set up a regular native MySQL replication stream using Site 1 as master and Site 2 as slave.

  8. You can use a proxy for MySQL on Site 2 to route writes to all Galera cluster nodes.

  9. At this point, be sure to set up monitoring on both clusters so you are notified of any issues!
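
The configuration in step 5 can be sketched as a small script that renders the same options for one node per site. This is only an illustration: the function names and the Site 2 server-id of 2 are assumptions, not part of the original setup. The one hard requirement it encodes is that the two replication endpoints do not share a server-id.

```python
from typing import Dict


def binlog_settings(server_id: int, log_basename: str = "galera-bin") -> str:
    """Render the [mysqld] options from step 5 for one cluster node."""
    if server_id < 1:
        raise ValueError("server-id must be a positive integer for replication")
    return "\n".join([
        "[mysqld]",
        f"log-bin={log_basename}",   # enable binary logging
        "log-slave-updates",         # also log writes applied from other nodes
        f"server-id={server_id}",
    ])


# Hypothetical IDs: the value 2 for Site 2 is an assumption for illustration.
SITE_IDS = {"site1": 1, "site2": 2}


def render_all(site_ids: Dict[str, int]) -> Dict[str, str]:
    """Render one config fragment per site, refusing duplicate server-ids."""
    if len(set(site_ids.values())) != len(site_ids):
        raise ValueError("each replication endpoint needs its own server-id")
    return {site: binlog_settings(sid) for site, sid in site_ids.items()}


if __name__ == "__main__":
    for site, fragment in render_all(SITE_IDS).items():
        print(f"# {site}\n{fragment}\n")
```

Each fragment would be merged into the my.cnf of the chosen node before the restart in step 6.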

With the template above, which mixes several technologies, you can build a DR site for your Galera Cluster. To perform a “failover” or otherwise activate the DR site, simply point your applications to the DR site. IMPORTANT: note the binary log position on Site 2 before sending any writes to it. Once Site 2 accepts writes, the primary site is out of sync with the DR site. If the primary site can be recovered and is viable, establish native MySQL replication in the reverse direction, starting from the binary log position you noted just before the failover. If that position is not available, or the primary site cannot be recovered, the primary site must be reprovisioned. Again, when reprovisioning, be sure to note the binary log position in the MySQL backup!

What Else to Consider with Geo-Distributed Galera Cluster?

Actually using native replication to join Galera Clusters requires us to plan a few additional items:

  1. How do we actually fail over?
    1. Stop application traffic to the primary site.
    2. Repoint all application traffic to Site 2.
    3. Note that your application will be offline for however long it takes to perform the above steps.
  2. How do we fail back?
    1. We will have to reprovision Site 1 since the data is stale. There are two options:
      1. Schedule downtime, take a backup of Site 2 and restore it onto Site 1, so that all nodes are in sync.
      2. Take a backup, restore it onto Site 1, then establish replication from Site 2 to Site 1.
    2. Stop application traffic to Site 2.
    3. If using replication, be sure Site 1 has caught up with Site 2.
    4. Reestablish replication from Site 1 to Site 2.
    5. Repoint application traffic to Site 1.
    6. Again, the above steps will incur downtime for your application.
  3. Monitoring and Management.
    1. Need to monitor:
      1. Galera Cluster in both sites.
      2. MySQL native replication.
      3. Optional: MySQL Router or other proxy, if in use.
      4. Various tools required for each item above.
    2. Each technology needs to be managed separately, most likely using Do It Yourself (DIY) scripts.
    3. Requires development to view and manage the entire topology from a high level.
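
A minimal sketch of the kind of DIY check the monitoring list implies is shown below. It assumes you have already fetched the Galera status variables (`SHOW GLOBAL STATUS LIKE 'wsrep_%'`) and one `SHOW SLAVE STATUS` row as dictionaries. The function names and the 60-second lag threshold are illustrative assumptions, and `Seconds_Behind_Master` is only an approximation of replication lag.

```python
from typing import Dict, List, Optional


def galera_problems(status: Dict[str, str], expected_size: int) -> List[str]:
    """Check one node's wsrep status variables against a healthy baseline."""
    problems = []
    if status.get("wsrep_cluster_status") != "Primary":
        problems.append("cluster is not in a Primary component")
    if status.get("wsrep_local_state_comment") != "Synced":
        problems.append("node is not Synced")
    if int(status.get("wsrep_cluster_size", 0)) < expected_size:
        problems.append("cluster has fewer nodes than expected")
    return problems


def replication_problems(slave_status: Dict[str, Optional[object]],
                         max_lag_seconds: int = 60) -> List[str]:
    """Check one SHOW SLAVE STATUS row on the replicating node."""
    problems = []
    if slave_status.get("Slave_IO_Running") != "Yes":
        problems.append("replication I/O thread is not running")
    if slave_status.get("Slave_SQL_Running") != "Yes":
        problems.append("replication SQL thread is not running")
    lag = slave_status.get("Seconds_Behind_Master")
    if lag is None:
        # MySQL reports NULL when the lag cannot be determined.
        problems.append("replication lag is unknown")
    elif int(lag) > max_lag_seconds:
        problems.append(f"replica is {int(lag)}s behind")
    return problems


if __name__ == "__main__":
    node = {"wsrep_cluster_status": "Primary",
            "wsrep_local_state_comment": "Synced",
            "wsrep_cluster_size": "3"}
    replica = {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes",
               "Seconds_Behind_Master": 0}
    print(galera_problems(node, expected_size=3) + replication_problems(replica))
```

Calling `replication_problems` with `max_lag_seconds=0` gives a rough "has replication caught up" check before repointing traffic during failback.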
