#SnapMirror Concurrent Updates

zealous void
#

Team, the SnapMirror module fails when I trigger an ad-hoc update while another one (e.g. scheduled) is already in progress for the same relationship.

Is there any way to handle this condition more elegantly than simply retrying with Ansible's built-in capabilities?

TASK [Update SnapMirror relation] ***********************************************************************************************************************************************************
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: NoneType: None
fatal: [cluster2]: FAILED! => {"changed": false, "msg": "Error updating SnapMirror relationship: calling: snapmirror/relationships/a2a58964-bb95-11ee-9f88-005056b7b9a0/transfers: got {'message': 'Update operation failed. Another transfer is in progress.', 'code': '13303844'}.:"} 
snow patrol
#

That could be possible.

Thinking off the top of my head:

If we get a 13303844 (Another transfer is in progress), we could wait for a certain amount of time and try again.

We would need a default acceptable amount of time to wait (and probably a parameter for the user to modify this time).

We would also probably want to post a message that we are waiting on another process before starting.
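The wait-and-retry idea above can be sketched roughly as follows. This is only an illustration, not the module's actual code: `OntapError`, `update_with_wait`, `start_transfer`, `max_wait`, and `interval` are hypothetical names, and the only detail taken from the thread is the error code 13303844.

```python
import time

# Error code from the failure message quoted in this thread.
TRANSFER_IN_PROGRESS = "13303844"


class OntapError(Exception):
    """Hypothetical error type carrying the 'code' and 'message' returned by ONTAP."""

    def __init__(self, code, message):
        super().__init__(message)
        self.code = code


def update_with_wait(start_transfer, max_wait=60, interval=5,
                     sleep=time.sleep, notify=print):
    """Call start_transfer(), waiting while another transfer is in progress.

    start_transfer is a hypothetical callable that starts the update and
    raises OntapError on failure. Any other error code, or exceeding
    max_wait seconds of waiting, re-raises the last error.
    """
    waited = 0
    while True:
        try:
            return start_transfer()
        except OntapError as err:
            if err.code != TRANSFER_IN_PROGRESS or waited >= max_wait:
                raise
            # Tell the user why the task is pausing before the next attempt.
            notify("Another transfer is in progress; retrying in %d seconds" % interval)
            sleep(interval)
            waited += interval
```

Passing `sleep` and `notify` as parameters is just a convenience for testing the loop without real delays; in a module the notification would more likely be a warning attached to the task result.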

#

@zealous void do you have an idea how long transfers like that typically take?

dense hull
#

Or use Ansible's retries/delay/until method.

zealous void
#

@dense hull - I thought about retries and the Ansible built-in stuff. But having that handled in the module feels cleaner and more transparent (especially without full ONTAP background knowledge). Depending on how you design it, yet another scheduled or ad-hoc process could be started from somewhere else while you are waiting for the next retry. Also, why should a developer/automation engineer care about our limitations and not being able to transfer more than one stream/snapshot at a time...
@snow patrol - Not easy to say... Transfer times vary between minutes and hours depending on change rate and schedule. I would set it to a minute and come up with a meaningful warning if that time elapses without a successful transfer. Then an engineer can tune it with an appropriate parameter to env-specific requirements (like "if a local snap is taken, it has to be transferred to the destination in x minutes").
Makes sense?

tranquil spindle
#

I receive this error on a regular basis. My view as a customer is that the module should create and initialise a snapmirror policy, and if I'm re-running the play, it should fail if there's a problem with it. If an update task is running, it should report OK. If the state is healthy/idle, it should report OK. I can't really think of a scenario where a customer would want their task to fail if a policy is updating in a healthy way.

I actually don't want it to do anything if the policy has been created and initialised. I don't want it to run an update each time. In my eyes, that's what the schedule is for, not what the Ansible module is for. I can imagine it being a nice-to-have feature for some, but not what the core module is used for.

As @dense hull suggests, I think most customers are using retries/delay because of the module behaviour today. I'm personally using:

   retries: 2 
   delay: 5
   until: snapmirror_status is not failed

#

One thing that would be nice is a message stating that the lag time is higher than the committed schedule. If you rerun a play that is supposed to create a snapmirror policy and initialise a volume, and the status is healthy but the lag time is greater than the applied schedule, a warning message would be great.
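As a rough illustration of that check, the sketch below compares a lag time against a schedule interval and returns a warning string. It assumes the lag time arrives as an ISO 8601 duration string such as `PT8H35M42S` (an assumption about the data format, not something confirmed in this thread), and `parse_lag_time`, `lag_warning`, and `schedule_interval` are hypothetical names.

```python
import re
from datetime import timedelta

# Minimal ISO 8601 duration pattern covering days, hours, minutes, seconds.
# Assumption: lag time looks like "PT8H35M42S"; years/months/weeks are not handled.
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+(?:\.\d+)?)S)?)?$"
)


def parse_lag_time(value):
    """Convert a duration string into a timedelta, or raise ValueError."""
    match = _DURATION.match(value)
    if not match:
        raise ValueError("unsupported duration: %r" % value)
    parts = {name: float(num) for name, num in match.groupdict().items() if num is not None}
    return timedelta(**parts)


def lag_warning(lag_time, schedule_interval):
    """Return a warning string if the lag exceeds the schedule interval, else None."""
    lag = parse_lag_time(lag_time)
    if lag > schedule_interval:
        return "Lag time %s exceeds the schedule interval %s" % (lag, schedule_interval)
    return None
```

For example, `lag_warning("PT2H", timedelta(hours=1))` returns a warning string, while `lag_warning("PT5M", timedelta(hours=1))` returns `None`.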

snow patrol
#

@tranquil spindle - A lot of that would be nice, but we don't get the information required from ONTAP to do that.

When we get

"Error updating SnapMirror relationship: calling: snapmirror/relationships/a2a58964-bb95-11ee-9f88-005056b7b9a0/transfers: got {'message': 'Update operation failed. Another transfer is in progress.', 'code': '13303844'}.:"

there is nothing to indicate to us whether this transfer is the one from the playbook or another one. If another one is running and the playbook is adding a new one, we wouldn't want to return OK unless we actually added it.

I think what we can do is implement the same logic we have for Jobs: a certain number of retries, with a wait time, and a max timeout.

https://github.com/ansible-collections/netapp.ontap/blob/main/plugins/module_utils/netapp.py#L832


zealous void
#

@snow patrol - Sounds reasonable
@tranquil spindle - I understand the requirement for monitoring the lag time. But I'd be careful about implementing monitoring/alerting features like that in an Ansible routine. This sounds more like an ONTAP alert that should be raised whenever the lag time significantly exceeds the committed schedule (or an external routine monitoring that).

tranquil spindle
#

Yeah, you're right @zealous void. My use case is perhaps a little unique in that I have 100+ FSx filesystems that come with no AIQ support, no alerting, no AWS integration etc. for this sort of stuff. I believe some improvements are coming (or are perhaps already here) in this area, but I was "monitoring" via Ansible (and Harvest, more so) out of necessity. It makes sense not to have those calls in this specific module.