The Elastifile Replication Agent (a.k.a. Replication Service) handles multiple tasks within the Elastifile cluster, such as asynchronous replication, snapshot cooling, and snapshot deletion from object.
Each Replication Agent can run up to 3 tasks in parallel, subject to the following limits:
- up to 2 concurrent snapshot cooling tasks (no more than 1 per data container)
- up to 3 concurrent snapshot deletion from object and/or asynchronous replication tasks
For example, a mix of 1 asynchronous replication task and 2 cooling tasks is valid.
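The limits above can be sketched as a small admission check. This is only an illustration of the stated rules, not Elastifile code: the `Task` type, the task-kind strings, and the `can_accept` function are hypothetical names introduced for this example.

```python
from collections import Counter
from dataclasses import dataclass

# Limits taken from the text above.
MAX_TASKS = 3            # total parallel tasks per replication agent
MAX_COOLING = 2          # concurrent snapshot cooling tasks per agent
MAX_COOLING_PER_DC = 1   # snapshot cooling tasks per data container

# Hypothetical task-kind labels, used only in this sketch.
COOL = "cool_snapshot"
DELETE = "delete_snapshot_from_object"
REPLICATE = "async_replication"


@dataclass(frozen=True)
class Task:
    kind: str
    data_container: str


def can_accept(running, new):
    """Return True if an agent running `running` could also take `new`."""
    # Overall cap: no more than 3 tasks of any kind at once.
    if len(running) >= MAX_TASKS:
        return False
    if new.kind == COOL:
        # Count cooling tasks already running, grouped by data container.
        cooling = Counter(t.data_container for t in running if t.kind == COOL)
        if sum(cooling.values()) >= MAX_COOLING:
            return False
        if cooling[new.data_container] >= MAX_COOLING_PER_DC:
            return False
    # Deletion and replication tasks are bounded only by the overall cap.
    return True


if __name__ == "__main__":
    running = [Task(REPLICATE, "dc1"), Task(COOL, "dc2")]
    print(can_accept(running, Task(COOL, "dc3")))  # a second cooling task on another container
    print(can_accept(running, Task(COOL, "dc2")))  # a second cooling task on the same container
```

Under these assumptions, an agent running 1 replication task and 1 cooling task accepts a second cooling task only if it targets a different data container, and rejects any request once 3 tasks are running.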
If a replication agent receives a new request when it has already reached its limit of running tasks, the following error may appear as a system event:
NOTE: Elastifile has a retry mechanism for task completion, so if a task fails because a resource is temporarily unavailable, the next attempt may succeed.
Mitigation:
If the above error keeps recurring, increase the number of replication agents in the system, especially where many data containers use asynchronous replication or snapshot cooling.
Escalation:
1. Make sure all replication agents are active and running by executing the following command:
If they are not, check the following from the EMS:
2. Check which control_tasks are currently running in the system by executing the following command:
For further assistance, contact elastifile-support@google.com by email.
Attach the output of the above commands for faster troubleshooting.