ONTAP Select (OS) 9.15.1, ESXi 8.0.2 with two ESXi hosts.
We would like to set up an HA pair with one node on each ESXi host, with mirrored aggregates. We have created all the networks that OS requires, with jumbo frames etc., and all the tests complete OK.
When we try to deploy the nodes, it sometimes fails on one of the OVF uploads, sometimes on both. Sometimes it completes both OVF uploads but then fails with errors about not being able to write an .ISO file (the configuration, I guess) that it apparently needs. There is a routed and firewalled network between the deployment VM and the ESXi hosts; they are about 5,000 km apart. We have opened ICMP and ports 22, 443 and 3260, but a missing port should not explain the varying behaviour we see anyway. Our next step would be to move the deployment host closer to the ESXi hosts, but the best setup in the future would be one central deployment host and any number of OS instances running on smaller remote sites. Can someone verify the firewall ports needed between the deployment host and ESXi? (PS: The ESXi hosts are part of a larger vCenter setup, so we have added the vCenter in the deployment host and then added the two ESXi hosts we need to deploy on.)
#ONTAP Select is showing strange issues while trying to deploy...
What's your latency between Deploy VM and nodes?
"The maximum latency between the mediator and each ONTAP Select node cannot exceed 125ms"
https://docs.netapp.com/us-en/ontap-select/concept_usecase_mc_sds.html
I thought I heard issues with 8.0.2. Have you tried 8.0.3?
This is not a MetroCluster setup. If latency is an issue with the mediator iSCSI LUN presented to the nodes I will look into that, but I doubt it. It looks like 32-36ms pings, so about 70ms round trip, but we do not even get that far 🙂 The nodes are created but never powered up, just deleted again.
I have a support session with a NetApp support engineer tomorrow; hopefully he can tell me what's wrong.
Doesn't really matter if SDS or not. If you have a two-node cluster these requirements are still valid: A minimum bandwidth of 5Mbps and a maximum round-trip time (RTT) latency of 125ms are required to allow proper functioning of the cluster quorum.
https://docs.netapp.com/us-en/ontap-select/concept_ha_config.html#two-node-ha-versus-multi-node-ha
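For a quick sanity check of measured values against the two limits quoted above, they can be written down as a small predicate. This is only a sketch: the function name and the way the measurements are passed in are made up for illustration, and the thresholds are simply the 125ms RTT and 5Mbps figures from the documentation quote.

```python
# Limits taken from the documentation quote above.
MAX_RTT_MS = 125.0
MIN_BANDWIDTH_MBPS = 5.0


def link_meets_quorum_limits(rtt_samples_ms, bandwidth_mbps):
    """Return True if the worst measured RTT and the bandwidth are within limits.

    rtt_samples_ms: round-trip times in milliseconds (e.g. read off ping output).
    bandwidth_mbps: measured bandwidth in Mbit/s.
    """
    if not rtt_samples_ms:
        raise ValueError("need at least one RTT sample")
    worst_rtt = max(rtt_samples_ms)  # judge the link by its worst sample
    return worst_rtt <= MAX_RTT_MS and bandwidth_mbps >= MIN_BANDWIDTH_MBPS
```

Using the worst sample rather than the average is a deliberate, conservative choice here; a link that only occasionally exceeds the limit can still cause intermittent problems.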
Check here for port requirements: https://docs.netapp.com/us-en/ontap-select/reference_faq.html#vcenter
OK, just to be clear: the two ESXi hosts are in the same rack. It's only the mediator that is located somewhat far away from the two ESXi hosts. We are under the 125ms here, and also over the 5Mbps, but of course I will check to be sure.
OK, I just went through the documentation once again and verified that the ESXi setup is configured as per the documentation: 2 internal port groups and 2 external ones. We even opened a support case and had a remote session with a support engineer, who didn't really help a whole lot; he told us to contact VMware 🙄 So we ended up tearing everything down and rebuilding it, yet we ended up with the same strange deployment issues. Most of the time the two nodes start to deploy (as seen from VMware), and it then tries to upload the OVF to the VMs. This step fails with either none or just one of the nodes completing the upload; the other stays at 0%. When both complete the OVF upload, the process fails with a permission error while trying to mount an .ISO image to one of the VM nodes. This is getting very annoying, and the debugging options are close to none, or very well hidden 😉 I have now reopened the case with NetApp and plan to escalate it as much as is possible for a system not yet in production 😉
Now as I test with a single-node deployment I get the same issue. The process "Export OVF Template" starts at 0% and never completes; it is then "canceled by a user" after 60 sec. On the deployment server I get an error about a timeout while trying to deploy the OVF... just great... 🙂
So it's the same problem on a single-node deployment, which rules out the mediator iSCSI stuff.
Does anyone know if an ONTAP Select node panics if it cannot reach back to the deployment server at boot?
If deploying the OVF file already fails, this hints more towards a general network/connectivity issue. Do your datastores have enough capacity? Is the MTU of the frontend network (which is used for deployment) the same everywhere? Especially if there's a firewall or something in between, you might need to reduce the MTU to 1480 or so. Are the export policies on the datastores (if NFS) correct, i.e. can you manually create files/folders in the datastore where it tries to put the ISO?
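To test whether a given MTU really passes end to end, the usual trick is a ping with the don't-fragment bit set and a payload sized to exactly fill the MTU. A minimal sketch of the arithmetic, assuming IPv4 without header options (the function name is just for illustration):

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    overhead = IPV4_HEADER_BYTES + ICMP_HEADER_BYTES
    if mtu <= overhead:
        raise ValueError(f"MTU {mtu} is too small for an ICMP echo")
    return mtu - overhead


for mtu in (1480, 1500, 9000):
    print(f"MTU {mtu}: ping payload {max_ping_payload(mtu)} bytes")
# MTU 1480: ping payload 1452 bytes
# MTU 1500: ping payload 1472 bytes
# MTU 9000: ping payload 8972 bytes
```

The resulting size is what would go into something like `ping -M do -s 1472 <target>` on Linux or `vmkping -d -s 8972 <target>` on ESXi. If the largest size works from one side but not through the firewall, the path MTU is smaller than configured.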
The deployment server is behind a firewall. We have opened up 80, 443 and iSCSI, and we can verify that this works by deploying a Linux VM at the destination and running nmap etc. against the deployment server; it seems OK. Surely we do not need jumbo frames between the deployment server and the nodes? We have only just been able to create a single-node cluster successfully, but we actually need a pair. We just keep getting very strange issues as we deploy. I even ended up in a situation where the deployment server thought it was deploying while the VM had stopped and panicked during boot (hence the question). I had to reinstall the deployment server to get "past" this because it was just "stuck"; I left it for an hour, rebooted etc., and nothing seemed to work. But you are right, it all points towards some kind of networking issue. I just have to "prove" it to the customer 😉
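A check like the nmap run described above can also be scripted, so it can be repeated from both ends and the results attached to a support case. This is a rough sketch only: the port list is just the one mentioned in this thread (80, 443 and the default iSCSI port 3260), not an authoritative list of what ONTAP Select Deploy needs, and any host names passed in are placeholders. It only proves that a TCP handshake completes, not that larger transfers such as the OVF upload get through.

```python
import socket

# Ports mentioned in this thread; NOT an authoritative requirements list.
PORTS = {80: "HTTP", 443: "HTTPS", 3260: "iSCSI"}


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, name resolution failure
        return False


def check_hosts(hosts, ports=PORTS, timeout=3.0):
    """Return {(host, port): reachable} for every host/port combination."""
    return {(h, p): tcp_port_open(h, p, timeout) for h in hosts for p in ports}


# Example call (hypothetical host name):
# for (host, port), ok in check_hosts(["deploy.example.com"]).items():
#     print(f"{host}:{port} ({PORTS[port]}): {'open' if ok else 'BLOCKED'}")
```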
Just curious... is the OVF in a Content Library? If not, you may want to make one and put the OVF in the Content Library.
I was working with a customer that had a similar issue: weird, inconsistent errors when deploying Select instances at far-away sites. The Deploy server was in Europe, the failing Select instances in East Asia.
It turned out to be related to the RAID settings of the hardware at the respective remote sites: RAID was configured as write-through instead of write-back.
Changing that fixed all the issues: timeouts, coredumps, etc.
Something that might be useful is to create the Select instance from the Deploy GUI but launch the actual deploy from the CLI with the -inhibit-rollback flag added, which makes Select not clean up on failure so you can better analyze what happened. If the ONTAP instance was created but applying the config failed, you can access that instance from the VMware console and investigate further.
We want to use VLANs from inside ONTAP on the external interfaces. The port group used has no VLAN tagging, so how come we cannot specify any VLAN when creating the deployment?
Are the port groups configured for VGT? (i.e. VLAN ID 4095 tagged on the port group)
Apparently it's not supported to tag VLANs on ESXi. I tried via the CLI, where I was able to configure it, but it complained when I tried to deploy (not supported on ESXi). So we have the internal networks on their own VLANs, tagged on the port group, management on a tagged port group, and the data networks on untagged port groups. We will then tag the iSCSI and InterCluster traffic with their separate VLANs on the data port groups. As far as I can gather from the "very old" documentation, this should be supported (they show GUI screenshots from ESXi 5.x in the documentation for 9.16.1, FFS 😉).
According to this, VGT with ONTAP Select should be supported on ESX:
To support VGT, the ESXi/ESX host network adapters must be connected to trunk ports on the physical switch. The port groups connected to the virtual switch must have their VLAN ID set to 4095 to enable trunking on the port group.
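For reference, how a standard vSwitch port group treats tags depends only on the VLAN ID set on it. The mapping below is a small sketch of that rule as I understand it from the VMware documentation (EST, VST and VGT are VMware's terms; the function name is made up for illustration):

```python
def port_group_tagging_mode(vlan_id: int) -> str:
    """Map a port group VLAN ID to the ESXi VLAN tagging mode it implies."""
    if vlan_id == 0:
        return "EST"  # no tagging on the vSwitch; the physical switch handles VLANs
    if 1 <= vlan_id <= 4094:
        return "VST"  # the vSwitch tags/untags; the guest sees untagged frames
    if vlan_id == 4095:
        return "VGT"  # trunk: tags are passed through to the guest (here: ONTAP)
    raise ValueError(f"invalid VLAN ID: {vlan_id}")


for vid in (0, 1608, 4095):
    print(vid, port_group_tagging_mode(vid))
# 0 EST
# 1608 VST
# 4095 VGT
```

So for ONTAP to create its own VLAN ports on the data interfaces, the port group would have to be in the VGT case; with VLAN ID 0, tagged frames from the guest are generally not delivered.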
This is what I got as I tried: Error[400]: VlanIDNotSupported - Setting a VLAN ID on an ESX host is not supported. ESX host: plzbw1
This is when these values are set: node modify -name test-01 -cluster-name test -data-vlans 1608 10
I had to set this back to "-" "-" and then it worked..
Hmmm, tagging inside ONTAP doesn't seem to work for us: we cannot reach the tagged iSCSI port group with the vmk port for iSCSI.
My guess is that this is because we have set up the port group with overrides so that it only uses one NIC in the vSwitch, and somehow this breaks things.
The problem is that we need both iSCSI and intercluster. The intercluster traffic is routed and we do not want to mix the two. Can we do intercluster via the e0a port that does management?
Hmm must admit I have never heard of the vlan 4096 stuff... I will give it a test 🙂
*4095, not 4096
I agree it's not a very well-communicated feature (and it feels more like a hack than a deliberate feature), but it has been there since at least vSphere 4.x or even 3.x and it does its job pretty well