Hi there, we have just set up a new grid with some physical nodes as well as two virtual gateway nodes and a virtual admin node.
The virtual nodes now seem to complain about NTP time being out of sync... and we suspect that VMware might be the cause of this.
The VMs have the default VMware Tools setting where time is synced at startup... but we do not have the periodic update enabled.
We have set up Google's NTP servers and verified they are reachable... we will now totally disable the time sync from VMware Tools and reboot the nodes...
Mostly because this is what is recommended... are there other issues which can cause this? And is the time sync measured against the admin node?
#NTP Time issue
As a general rule, I stick to ‘time.nist.gov’ where possible, universally.
The option to periodically sync time from inside VMware should not be enabled. The ntpd daemon running in the SG node container is responsible for keeping time in sync.
In a grid, by default, all admin and gateway nodes and the first 3 storage nodes in each site will sync time externally - to the NTP server(s) specified.
All other nodes will only sync internally in the grid.
Time sync is crucial in SG, because you want your metadata to match your data. For anything that relies on time, I've always disabled time sync in VMware Tools, as that turns out to be 'less than perfect' to say the least.
From my experience, I'd rather use a fast & reliable time server 'close by' from pool.ntp.org, as the exact time is less important than having the same time on all nodes/devices. (if that makes sense). (and what Erik said 😉 )
Here's an example from the primary admin node in my lab (the command "ntpq -c peers" is a good troubleshooting step):
`admin@grid-admin01:~ $ ntpq -c peers
 remote             refid           st t when poll reach   delay  offset  jitter
*dns1.swelab.local  194.58.207.148   2 u   19   64   377  0.1886  0.0474  0.0214
-dns2.swelab.local  162.159.200.1    4 u   19   64   377  0.2162 -0.1094  0.0222
#grid-admin02       10.128.16.3      3 u   49   64   377  0.1933 -0.0052  0.0236
-grid-gateway01     10.128.16.3      3 u   40   64   377  0.1997  0.0292  0.0143
#grid-gateway02     10.128.16.3      3 u   28   64   377  0.1447  0.0346  0.0117
-grid-storage01     10.128.16.3      3 u   32   64   377  0.1170  0.0105  0.0162
-grid-storage02     10.128.16.3      3 u   51   64   377  0.1518 -0.0007  0.0343
-grid-storage03     10.128.16.3      3 u   33   64   377  0.2203 -0.0374  0.0443
#grid-gateway05     10.128.16.3      3 u   48   64   377  0.3356  0.0489  0.0206
+grid-gateway06     10.128.16.3      3 u   28   64   377  0.3275  0.0412  0.0143
...
-grid-storage07     10.128.16.3      3 u   54   64   377  0.3042  0.0330  0.0369
#grid-admin03       10.128.16.3      3 u   23   64   377  0.2648  0.0678  0.0201
`
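If you want to run the same check on several nodes, the peers table is easy to parse. Below is a minimal Python sketch, not something that ships with SG: the names parse_peers and unhealthy and the 100 ms default limit are made up for illustration, and the tally descriptions are paraphrased from the ntpq documentation. It assumes the tally character is still in column 0 of each row, as in the paste above.

```python
from dataclasses import dataclass

# Meaning of the first character of each peer row (paraphrased from ntpq docs).
TALLY = {
    "*": "system peer (the source currently selected)",
    "+": "candidate",
    "-": "outlier (discarded by the cluster algorithm)",
    "#": "backup (usable, but beyond the maxclock limit)",
    "x": "falseticker",
    ".": "excess",
    "o": "PPS peer",
    " ": "rejected",
}


@dataclass
class Peer:
    tally: str
    remote: str
    refid: str
    stratum: int
    reach: int        # 8-bit shift register; 0o377 = last 8 polls all answered
    delay_ms: float
    offset_ms: float
    jitter_ms: float


def parse_peers(text):
    """Parse the text printed by `ntpq -c peers` into Peer records."""
    peers = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # ntpq prints the tally code in column 0, directly before the name.
        if line[0] in TALLY:
            tally, rest = line[0], line[1:]
        else:
            tally, rest = " ", line
        f = rest.split()
        # Skip the header, the prompt line, "..." and anything else that
        # is not a 10-column peer row with a numeric stratum.
        if len(f) != 10 or not f[2].isdigit():
            continue
        peers.append(Peer(tally, f[0], f[1], int(f[2]), int(f[6], 8),
                          float(f[7]), float(f[8]), float(f[9])))
    return peers


def unhealthy(peers, max_offset_ms=100.0):
    """Return peers that missed a recent poll or exceed the offset limit.

    The 100 ms default is an arbitrary illustration, not an SG threshold.
    """
    return [p for p in peers
            if p.reach != 0o377 or abs(p.offset_ms) > max_offset_ms]


sample = """\
*dns1.swelab.local 194.58.207.148 2 u 19 64 377 0.1886 0.0474 0.0214
-dns2.swelab.local 162.159.200.1 4 u 19 64 377 0.2162 -0.1094 0.0222
"""
for p in parse_peers(sample):
    print(p.remote, "|", TALLY[p.tally], "|", f"{p.offset_ms:+.4f} ms")
```

A reach of 377 (octal) means the last eight polls were all answered, so anything else is worth a look before worrying about offsets.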
you can also run ntpq -c as to get the association IDs and then ntpq -c "rv <id>" to read the details for the corresponding association (the quotes matter, otherwise ntpq treats the ID as a hostname)... there you see some more details about why a peer got selected or not
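That two-step lookup can be scripted if you have many peers to inspect. A rough Python sketch, assuming ntpq is on the PATH of the node you run it on; association_ids and read_variables are hypothetical helper names, and the sample text is an invented illustration of the column layout that ntpq -c as prints, not output from a real grid.

```python
import subprocess


def association_ids(as_output):
    """Extract association IDs from the text printed by `ntpq -c as`."""
    ids = []
    for line in as_output.splitlines():
        cols = line.split()
        # Data rows start with an index and the association ID, both numeric;
        # the header and the "====" separator do not.
        if len(cols) >= 2 and cols[0].isdigit() and cols[1].isdigit():
            ids.append(int(cols[1]))
    return ids


def read_variables(assoc_id):
    """Run `ntpq -c "rv <id>"` and return its raw output (needs ntpq)."""
    result = subprocess.run(["ntpq", "-c", f"rv {assoc_id}"],
                            capture_output=True, text=True, check=True)
    return result.stdout


# Invented sample in the layout of `ntpq -c as`, for illustration only.
sample = """\
ind assid status  conf reach auth condition  last_event cnt
===========================================================
  1 11111  961a   yes   yes  none  sys.peer    sys_peer  1
  2 11112  931a   yes   yes  none   outlier    sys_peer  1
"""
print(association_ids(sample))
```

On a real node you would feed the output of ntpq -c as into association_ids and then call read_variables for each ID you care about.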
also, you can request servers from a specific country in the NTP pool by using <country>.pool.ntp.org as server (e.g. de.pool.ntp.org, fr.pool.ntp.org, etc.), or even explicitly different servers from disjoint pools by using 0.de.pool.ntp.org, 1.de.pool.ntp.org, etc. (the responses are guaranteed to be different so you can use that to supply e.g. 3 or 4 different upstream servers)
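For what it's worth, the naming scheme is regular enough to generate. A tiny Python sketch (pool_servers is a made-up helper name; it only builds the hostnames and does not resolve or query them):

```python
def pool_servers(zone="", count=4):
    """Return numbered NTP pool hostnames, e.g. 0.de.pool.ntp.org."""
    if not 1 <= count <= 4:
        raise ValueError("the pool only offers the prefixes 0 to 3")
    base = f"{zone}.pool.ntp.org" if zone else "pool.ntp.org"
    return [f"{i}.{base}" for i in range(count)]


print(pool_servers("de", 3))
```

If your setup only accepts IP addresses for the NTP servers, keep in mind that pool hostnames rotate through different servers over time, so a resolved address is only a snapshot.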
I think we "fixed" it... it has not complained since yesterday 🙂 The ESXi hosts where the VMs were running didn't have any NTP set up... and I think that even though VMware Tools is set to only sync the clock at reboot, it may do it at other times too, "just to p*ss you off" 😉 Or maybe during vMotion? I think the NTP servers have to be given as IP addresses, at least when you set up the basic SG... So basically we fixed the ESXi servers and rebooted the VMs... and it seems to behave now... and we have disabled the VMware Tools clock sync... 🙂
yeah it definitely re-syncs at vMotion, we have hit that in the past as well 🙂 glad you figured it out
Live vMotion is not supported for SG nodes anyway, so that is a non-issue 😆
I guess it makes sense with the gateway nodes at least...
It makes sense for all nodes, again, time is crucial in SG, hence it's not supported 😉