When joining nodes to the cluster, the Ansible task sometimes fails even though the node join itself completes successfully. Is this caused by a timeout setting in the node-join code? If so, can the timeout be increased to prevent the task from failing? If the problem is not timeout related, what other options are there to avoid or work around these failures? Note that the task does not fail with the same error every time; several different errors have been observed during node joins.
Errors:
failed: [localhost] (item={u'intra_cluster_ip': u'169.1.1.1', u'standard_node_name': u'test-nas-n01'}) => changed=false
  ansible_loop_var: item
  item:
    intra_cluster_ip: 169.1.1.1
    standard_node_name: test-nas-n01
  msg: 'Error adding node with ip 169.1.1.1: job reported error: Node <xxxxxxxxxxxxxxxxxxxxxxxxxxx>: Failed to add to the cluster with reason: Failed to start cluster join on node: 169.1.1.1: RPC: Couldn''t make connection [from mgwd on node "test-nas-n01" (VSID: -3) to mgwd at 169.1.1.1], received {u''job'': {u''_links'': {u''self'': {u''href'': u''/api/cluster/jobs/xxxxxxxxxxxxxxxxxxxxxxxxxxx''}}, u''uuid'': u''xxxxxxxxxxxxxxxxxxxxxxxxxxx''}}.'

failed: [localhost] (item={u'intra_cluster_ip': u'169.1.1.1', u'standard_node_name': u'test-nas-n01'}) => changed=false
  ansible_loop_var: item
  item:
    intra_cluster_ip: 169.1.1.1
    standard_node_name: test-nas-n01
  msg: 'Error adding node with ip 169.1.1.1: job reported error: Failed to enable storage failover service. Reason: Partner "test-nas-n01" is not a member of the cluster. , received {u''job'': {u''_links'': {u''self'': {u''href'': u''/api/cluster/jobs/xxxxxxxxxxxxxxxxxxxxxxxxxxx''}}, u''uuid'': u''xxxxxxxxxxxxxxxxxxxxxxxxxxx''}}.'
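
To make the two options in the question concrete, here is a minimal sketch. It assumes the failing task uses the netapp.ontap.na_ontap_cluster module, which is not confirmed by the output above. It combines a longer module `time_out` with Ansible's generic `retries`/`until` keywords. The connection variables (`cluster_mgmt_ip`, `ontap_user`, `ontap_password`) and the list name `nodes_to_join` are hypothetical placeholders; only `intra_cluster_ip` and `standard_node_name` come from the error output. Whether either setting actually avoids these particular errors is untested.

```yaml
# Sketch only, not a verified fix.
# Hypothetical names: cluster_mgmt_ip, ontap_user, ontap_password, nodes_to_join.
- name: Join nodes to the cluster, retrying on transient errors
  netapp.ontap.na_ontap_cluster:
    state: present
    cluster_ip_address: "{{ item.intra_cluster_ip }}"
    node_name: "{{ item.standard_node_name }}"
    # Seconds to wait for the join job; assumed to default to 180 in the module.
    time_out: 600
    hostname: "{{ cluster_mgmt_ip }}"
    username: "{{ ontap_user }}"
    password: "{{ ontap_password }}"
    https: true
    validate_certs: false   # lab setting; enable certificate checks where possible
  loop: "{{ nodes_to_join }}"
  register: join_result
  # Re-run the module for an item until it succeeds, up to 3 retries, 60 s apart.
  until: join_result is succeeded
  retries: 3
  delay: 60
```

The retry only helps if the module is idempotent for a node that has already joined, so that a second attempt reports success instead of failing again; that behavior should be checked against the installed collection version.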