DR-Recovery Manual Procedure

Overview

This KB article describes the procedure for restoring an Elastifile cluster from a backup stored in Google Cloud Storage (GCS).

The procedure should be performed with the help of the Google Elastifile support team. To perform it, contact elastifile-support@google.com to obtain the automation script and the accompanying steps.

 

Introduction

Elastifile is a self-managed, scale-out file system. Elastifile supports copying read-only snapshots from the primary storage to GCS, either manually or with the built-in snapshot scheduler. In case of a disaster, the snapshots in GCS can be used to restore the data to a new cluster, including all export names and ACLs.

Restoring from a backup requires remounting and validating the file system.

This procedure addresses restoring a cluster after an unforeseen failure, when the original cluster cannot be recovered or must be kept aside for additional analysis.

If the original cluster is operational and the only need is to restore specific files or directories, the snapshot can be mounted read-only and the files and directories can be copied manually to their original location. For instructions, refer to this KB article.
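As an illustration, the manual file-level restore can look like the minimal Python sketch below. It assumes the snapshot is already exposed as a read-only NFS export; the export address, mount point, and directory names are placeholders, and the exact steps for exposing the snapshot are covered in the referenced KB article.

    import shutil
    import subprocess

    # Hypothetical values; replace with your snapshot export, mount point,
    # and the directory you need to restore.
    SNAPSHOT_EXPORT = "10.0.0.100:/dc01_snapshot"   # read-only snapshot export (assumed)
    MOUNT_POINT = "/mnt/snapshot_ro"
    TARGET_DIR = "/mnt/dc01/projects"               # original location on the live file system

    # Mount the snapshot read-only so its contents cannot be modified.
    subprocess.run(["mount", "-t", "nfs", "-o", "ro", SNAPSHOT_EXPORT, MOUNT_POINT], check=True)

    try:
        # Copy the required directory back to its original location.
        shutil.copytree(f"{MOUNT_POINT}/projects", TARGET_DIR, dirs_exist_ok=True)
    finally:
        # Unmount the read-only snapshot when done.
        subprocess.run(["umount", MOUNT_POINT], check=True)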

 

Prerequisites

Before executing the script, the following steps should be performed:

  • In the same project as the faulty cluster, deploy a new Elastifile cluster with the same configuration, using the same deployment method (Terraform or Marketplace deployment). The new cluster must have the same capacity as the original cluster or higher; the support team may advise you to create a larger cluster to improve restore performance.
  • Ensure that the new cluster is running the same software version as the original cluster, or a newer one.
  • Verify that the new cluster has the same license as the original cluster, one that enables ClearTier/Object Tier. If needed, contact the support team.
  • The original cluster's EMS must be available during the whole restore operation. The original cluster itself may be inactive, but the EMS must be active and accessible from the new cluster's EMS.
  • Ensure that there is connectivity on port 443 from the new EMS to the original EMS (a connectivity-check sketch follows this list).
  • Ensure that the RA configuration in the new EMS matches the RA configuration of the faulty cluster.
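The port 443 requirement can be verified before running the script. The following is a minimal Python sketch, run from the new EMS, that checks TCP reachability of the original EMS on port 443; the hostname below is a placeholder for the original EMS address.

    import socket

    # Hypothetical address; replace with the original EMS hostname or IP.
    ORIGINAL_EMS = "original-ems.example.internal"
    PORT = 443
    TIMEOUT_SECONDS = 5

    try:
        # Attempt a TCP connection from the new EMS to the original EMS on port 443.
        with socket.create_connection((ORIGINAL_EMS, PORT), timeout=TIMEOUT_SECONDS):
            print(f"Port {PORT} on {ORIGINAL_EMS} is reachable.")
    except OSError as exc:
        print(f"Cannot reach {ORIGINAL_EMS} on port {PORT}: {exc}")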

 

Restoration Flow

The following operations will take place as part of the script execution:

  1. The script will create a worker RA (strong instance)
  2. The new cluster will be configured to connect to the object tier in the same project
  3. A restore operation will run for each data container (DC) of the original cluster, one DC at a time:
    1. The script will find the latest intact snapshot of this DC that exists in the GCS bucket (snapshot information can be found in the logs); an illustrative snapshot-listing sketch follows this list
    2. Create a DC and an internal export on the new destination cluster
    3. Copy the data from the GCS bucket to the cluster
    4. Delete the internal export
    5. Copy the exports and client rules from the source EMS
  4. When all the data is copied, the script will finalize the settings:
    1. Create the same number of RAs as in the source cluster
    2. Delete the worker RA

 
