It seems like the data retention has changed after upgrading from version 3 to 4.
Maybe it's just some of the dashboards. Right now I am looking at the Volumes and SVM dashboards and they are blank prior to the update. This is also true for Power details, while Aggregates seems to hold much older data. Is this a bug? I think we need to revert if this upgrade has truncated data.
#Data retention after upgrade to 4.0
I have been looking at the /vm/ and it looks like "some" old data is there, but data prior to the upgrade has an instance called "nabox-harvest2" whereas the new data has an instance that is just "harvest"... I guess this makes a difference in the grafana graphs.. any way to "fix" this?
That’s on custom panels ?
Ah maybe not. I’m not sure there is a way to rename the labels
Nope it's on the "default" dashboards in Grafana
As shown here... but it's not all data points as you can also see in this picture...
...we are looking back 7 days in the example, and you can clearly see that something is missing in the 3 lower graphs... but the three above seem OK?
Ok, so one possible fix would be to change the container host name
It wouldn’t change the data collected under « harvest » but it would go back to « nabox-harvest2 » from then on.
In grafana? What I need are graphs that go back 90 days... especially capacity values...
You’ll get those graphs if you set a time period prior to the migration. Maybe that’s good enough ?
ok it seems like it.. but we cannot create a graph that spans from now and 90 days back... not great I'm afraid...
isn't it possible to reuse the same instance name as the old installation? I guess that's what "breaks" it?
It’s possible it’s actually the container host name.
Maybe @rotund matrix or @dense canopy can chime in. Is there a way to make the harvest instance name less sticky ?
perhaps you could use a Prometheus relabel_config to rewrite the instance label of the new metrics to match the old container name? https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config
https://valyala.medium.com/how-to-use-relabeling-in-prometheus-and-victoriametrics-8b90fc22c4b2
Sounds interesting. I could implement a blanket replacement of nabox-harvest2 to the new name
Let me do some tests.
If that's the "instance" causing the problem, I see the port number is also part of it
Something I didn't quite get. relabeling is done during scraping, not queries
yep, it changes what is stored in the time series db
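For reference, a rough sketch of what such a relabel rule would do to the instance label at scrape time. This is a plain-Python imitation of the Prometheus `replace` action, not NAbox or Harvest code. The metric name and port number are made up for illustration, and as said above it only affects samples ingested from then on, not what is already stored.

```python
import re

def relabel_replace(labels, source_label, regex, replacement, target_label):
    """Imitate a Prometheus relabel rule with action `replace`.

    Prometheus anchors the regex at both ends, hence fullmatch().
    If the regex does not match, the label set is returned unchanged.
    Only the `$1` style of group reference is handled in this sketch.
    """
    value = labels.get(source_label, "")
    match = re.fullmatch(regex, value)
    if match is None:
        return dict(labels)
    # Prometheus writes capture groups as $1; Python's expand() wants \1.
    template = re.sub(r"\$(\d+)", r"\\\1", replacement)
    updated = dict(labels)
    updated[target_label] = match.expand(template)
    return updated

# Hypothetical series as scraped after the upgrade (port is an assumption).
new_series = {"__name__": "volume_size_used", "instance": "harvest:12990"}
print(relabel_replace(new_series, "instance",
                      r"harvest:(\d+)", "nabox-harvest2:$1", "instance"))
# {'__name__': 'volume_size_used', 'instance': 'nabox-harvest2:12990'}
```

Keeping the port in a capture group matters here because the port number is part of the instance value.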
dang it
@reef shadow do you still have NAbox 3 running with full historical data ?
yep just restored it and restarted collecting, so just lost 1-2 days
Ok. I’m exploring some options with harvest team. One of them would be to relabel data at import time to normalize to the new convention in NAbox 4.
Another option would be to change instance name in 4.0.3 to revert to the same name as v3 but it would « break » data collected by new instances of NAbox 4. Compromises are necessary I’m afraid. Still debating the best option
interesting, let me know if I should test anything...
https://dl.nabox.org/nabox/NAbox-main-migrate.nbx , don't forget to re-download the migrate cli to NAbox 3 after upgrading v4
So just to be sure, I reinstall a v4 node, and apply this update? Then do the migrate (and remember to pull the latest migrate from the v4 node) ?
You can upgrade your current v4 node, and delete all data in /data/victoria-metrics-data/* or you can deploy a new one and upgrade it
Well.. I did as you suggested... and the migration died because of space issues... I guess that removing the data isn't freeing up space?
Then again... it seems to be the root that is full?
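A quick way to tell which filesystem is actually full, sketched with Python's standard library. The `/data` path comes from the message above; the helper name is made up, and the numbers can differ slightly from what `df` reports.

```python
import os
import shutil

def usage_percent(path: str) -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

# Compare the root filesystem with the data volume, if it exists.
for mount in ("/", "/data"):
    if os.path.exists(mount):
        print(f"{mount}: {usage_percent(mount):.1f}% used")
```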
...I installed a new 4.0.3, applied the main.migrate update you provided, and started a new migration... let's see how it goes... (takes 4-5 hours) 😉
I had that before for another user, filling up root. Happens if connection to Victoria metrics breaks. Which seems odd as it’s local
Maybe the "rm -rf" the data folders did it? 😉
PS: Just ran into the DNS issue again with the new installation... I have never had any issues with other linux hosts running DHCP.. not sure if it's the name servers or the search domain that is not set... after I set them manually and rebooted, it seems to work... and I got the old data as well 🙂
No I did not... 🙂
could be it then... I guess !
Your migration upgrade package has pushed the installation "back" to 4.0.2 right? Would it be OK to upgrade to 4.0.3 then?
Not sure I understand ? The "migration" binary is part of the NAbox 4 distribution, that's why you start by pulling it to the NAbox 3 instance
as I remember it, there are no pending changes that are not part of 4.0.3 so yes, you can upgrade to 4.0.3
now, you still need to release capacity in root correct ?
I think that he refers to this link (https://dl.nabox.org/nabox/NAbox-main-migrate.nbx ) you posted above on 27-6-2024.. is that still needed in version 4.0.3?
not needed
if you dc down it should free up capacity, then dc up -d
So what happens exactly ? With DHCP you're not getting the DNS server ? nor the domain names ?
To be honest I didn't even check.. it was installed from OVF with DHCP enabled.. then as I started to import from the old installation, I noticed that the nodes were not installed, I then tried to install them, but they complained because I was unable to add them using their dns name.. (even full dns).. but I could add them with their IP no problem... I then remembered that I had the issue before... changed to static IP/DNS etc.. and it worked OK after that...
If you want I can try to replicate it?
If you get a chance I'd be curious. I know that domain names are tricky to propagate through DHCP with some OSes, but DNS server is pretty basic stuff
And I never had that issue in any of the lab deployments
I can push a version that stores temporary migration data to /data if you need. It shouldn't be necessary but well...
but if I look at the "about" it reports "v4.0.2-2 (8e920d7)"
correct
I just installed a new 4.0.3 with DHCP, and I can ping IPs only... not even FQDN (not local ones) but strangely it can resolve netapp.com..
lol.
my guess is that it uses a global dns somewhere
resolvectl status ?
no status here.. I can only set the resolver...
naboxtest /home/admin # resolvconf status
Expected either -a or -d on the command line.
uh ??
resolv.conf shows:
nameserver 127.0.0.53
options edns0 trust-ad
search .
naboxtest /home/admin # resolvconf -h
resolvconf -a INTERFACE < FILE
resolvconf -d INTERFACE
Register DNS server and domain configuration with systemd-resolved.
-h --help Show this help
--version Show package version
-a Register per-interface DNS server and domain data
-d Unregister per-interface DNS server and domain data
-f Ignore if specified interface does not exist
-x Send DNS traffic preferably over this interface
This is a compatibility alias for the resolvectl(1) tool, providing native
command line compatibility with the resolvconf(8) tool of various Linux
distributions and BSD systems. Some options supported by other implementations
are not supported and are ignored: -m, -p, -u. Various options supported by other
implementations are not supported and will cause the invocation to fail:
-I, -i, -l, -R, -r, -v, -V, --enable-updates, --disable-updates,
--updates-are-enabled.
See the resolvectl(1) man page for details.
no search domain is ok I guess
as it might not be part of DHCP config
but resolvectl status not working is extremely weird
ls -al /bin/resolvectl
naboxtest /home/admin # ls -la /bin/resolvectl
-rwxr-xr-x. 1 root root 149776 Jul 1 23:17 /bin/resolvectl
Ok, nothing weird
I'm surprised by your prompt
I have "admin@localhost ~ $ "
maybe your terminal emulator is overriding PS1 ?
I did a "sudo bash" 🙂
Wait, did you type resolvconf ?
aha 🙂
that explains it
I use <TAB> because I'm lazy...
admin@naboxtest ~ $ resolvectl status
Global
Protocols: -LLMNR +mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
Link 2 (ens192)
Current Scopes: DNS
Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: 10.0.2.101
DNS Servers: 10.0.2.101 10.0.2.102
and 10.0.2.101 should definitely resolve your local host names ?
no 10.0.2.101 is one of my DNS servers...
admin@naboxtest ~ $ nslookup 10.0.2.101
101.2.0.10.in-addr.arpa name = dc01.bbaas.local.
that's what I'm saying, it should be able to resolve your local host names
admin@naboxtest ~ $ nslookup dc01.bbaas.local
Server: 127.0.0.53
Address: 127.0.0.53#53
** server can't find dc01.bbaas.local: SERVFAIL
admin@naboxtest ~ $ ping dc01.bbaas.local
ping: dc01.bbaas.local: Temporary failure in name resolution
SERVFAIL is interesting
reverse lookup seems to work...
sudo journalctl -u systemd-resolved
the DHCP server is running on a Mikrotik with RouterOS, and DNS is a Windoze box.... but as mentioned I have never seen an issue with any of our other Linux dists... mostly Ubuntu and Rocky...
Yes that's definitely a host side issue
admin@naboxtest ~ $ sudo journalctl -u systemd-resolved
Jul 08 15:14:52 localhost systemd[1]: Starting systemd-resolved.service - Network Name Resolution...
Jul 08 15:14:52 localhost systemd-resolved[303]: Positive Trust Anchors:
Jul 08 15:14:52 localhost systemd-resolved[303]: . IN DS 20326 8 2 e06d44b80b8f1d39a95c0b0d7c65d08458e880409bbc683457104237c7f8ec8d
Jul 08 15:14:52 localhost systemd-resolved[303]: Negative trust anchors: home.arpa 10.in-addr.arpa 16.172.in-addr.arpa 17.172.in-addr.arpa 18.1>
Jul 08 15:14:52 localhost systemd-resolved[303]: Defaulting to hostname 'linux'.
Jul 08 15:14:52 localhost systemd[1]: Started systemd-resolved.service - Network Name Resolution.
Jul 08 15:14:56 localhost systemd[1]: Stopping systemd-resolved.service - Network Name Resolution...
Jul 08 15:14:56 localhost systemd[1]: systemd-resolved.service: Deactivated successfully.
Jul 08 15:14:56 localhost systemd[1]: Stopped systemd-resolved.service - Network Name Resolution.
Jul 08 15:15:00 localhost systemd[1]: Starting systemd-resolved.service - Network Name Resolution...
Jul 08 15:15:00 localhost systemd-resolved[1347]: Positive Trust Anchors:
Jul 08 15:15:00 localhost systemd-resolved[1347]: . IN DS 20326 8 2 e06d44b80b8f1d39a95c0b0d7c65d08458e880409bbc683457104237c7f8ec8d
Jul 08 15:15:00 localhost systemd-resolved[1347]: Negative trust anchors: home.arpa 10.in-addr.arpa 16.172.in-addr.arpa 17.172.in-addr.arpa 18.>
Jul 08 15:15:00 localhost systemd-resolved[1347]: Defaulting to hostname 'linux'.
Jul 08 15:15:00 localhost systemd[1]: Started systemd-resolved.service - Network Name Resolution.
Jul 08 15:15:01 naboxtest systemd-resolved[1347]: System hostname changed to 'naboxtest'.
Jul 08 15:14:46 naboxtest systemd-resolved[1347]: Clock change detected. Flushing caches.
Jul 08 15:15:02 naboxtest systemd-resolved[1347]: Clock change detected. Flushing caches.
dig @127.0.0.53 dc01.bbaas.local
mmmm. Maybe .local is throwing it off, that's supposed to be reserved by mDNSResponder
do you have something other than .local to try and resolve ? Something that would be internal ?
shot in the dark. Can you edit /etc/systemd/resolved.conf and put MulticastDNS=no. Then sudo systemctl restart systemd-resolved
all our local domain are .local
The MulticastDNS=no doesn't fix it...
dang it
I think it's safe to assume that specifically .local names don't resolve. Public names are fine
just created a test domain and I am able to lookup "host.test.netapp" ok..
(and ping it)
ok that would confirm the theory
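To spell out the distinction being tested: names under the `.local` TLD are reserved for multicast DNS (RFC 6762), so a resolver may handle them differently from other internal names. A small sketch that classifies the names tried in this thread. The helper name is made up, and this is only a classification aid, not how systemd-resolved itself decides.

```python
def is_mdns_reserved(hostname: str) -> bool:
    """Return True if the name is `local` or ends in `.local` (RFC 6762)."""
    labels = hostname.rstrip(".").lower().split(".")
    return labels[-1] == "local"

for name in ("dc01.bbaas.local", "host.test.netapp", "netapp.com"):
    print(name, is_mdns_reserved(name))
# dc01.bbaas.local True
# host.test.netapp False
# netapp.com False
```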
so .local is an issue with the dist you are using I guess?
Now how do we tell resolved to treat .local as regular tld
It seems like it
But couldn't google-fu it yet
Our private DNS server has a lot of domains under xxxx.mycompany.local domains which correspond to our internal IP addresses.
We're moving some developers' machines to Ubuntu and found that resolv service is failing to resolve those addresses despite others private xxx.mycompany.com only available in our internal DNS server work, so is able to c...
Seems that nsswitch.conf needs to be changed
Maybe there is a combination that would work there for the hosts line
but it's handled by something.. cannot be altered directly
"hosts: mymachines resolve [!UNAVAIL=return] files usrfiles dns" this line needs to be changed so that "dns" comes before the UNAVAIL...
so it's a remount with "rw" or something
No that's an immutable part of the system. Well, actually you can remove the symlink and put it directly in etc
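To make the proposed nsswitch.conf change concrete, here is a sketch that moves `dns` ahead of the `[!UNAVAIL=return]` action in a `hosts:` line. The function name is made up, and placing `dns` in front of `resolve` (the service the action belongs to) is one interpretation of "before the UNAVAIL". Whether this ordering actually fixes .local lookups on NAbox is not confirmed in this thread.

```python
def move_dns_before_unavail(hosts_line: str) -> str:
    """Reorder an nsswitch.conf `hosts:` line so `dns` precedes the
    `[!UNAVAIL=return]` action. Lines that lack either token, or that
    already have `dns` first, are returned unchanged."""
    action = "[!UNAVAIL=return]"
    key, _, rest = hosts_line.partition(":")
    tokens = rest.split()
    if "dns" not in tokens or action not in tokens:
        return hosts_line
    if tokens.index("dns") < tokens.index(action):
        return hosts_line
    tokens.remove("dns")
    # The action applies to the service listed right before it (`resolve`),
    # so `dns` goes in front of that service/action pair (an assumption).
    position = max(tokens.index(action) - 1, 0)
    tokens.insert(position, "dns")
    return f"{key}: {' '.join(tokens)}"

print(move_dns_before_unavail(
    "hosts: mymachines resolve [!UNAVAIL=return] files usrfiles dns"))
# hosts: mymachines dns resolve [!UNAVAIL=return] files usrfiles
```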
ok, not sure how much time I would like to put into this, as a simple fix for me is just to set a static IP... I guess you are aware what the issue is and possibly how to fix it?
Yes you can leave it to me, I'll set up a .local zone and do some tests
ermh... if I reboot my new 4.0.3 host installed with DHCP, then set to static IP (via GUI) it doesn't seem to remember it, and returns to DHCP setup ?? ...
Managed to set this by using the vApp setting on the VM..
...somewhat confusing way to configure IP if you ask me 😉
Normally it should not reapply vmx config if it has not been changed
well I "fixed" it by setting the vApp values... (remembering to set ip like "10.10.10.10/24" syntax) 😉
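On that syntax: the value is an address plus prefix length, not an address plus netmask. A quick sketch with Python's standard `ipaddress` module shows what `10.10.10.10/24` expands to. The vApp property itself is not modelled here.

```python
import ipaddress

iface = ipaddress.ip_interface("10.10.10.10/24")
print(iface.ip)        # 10.10.10.10
print(iface.netmask)   # 255.255.255.0
print(iface.network)   # 10.10.10.0/24
```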
With my research so far, it seems the "fix" is to list ".local" in the DNS domains, which is what you're doing, but I don't think you have to set up a static IP for this.
I consider it a "good enough" workaround for now but I'm not closing the issue and will keep looking
And I'm fixing the bug with OVF configuration reloading
4.0.4 is available https://nabox.org/downloads/
Is it just my browser or have you not uploaded 4.0.4? 😉 I can only see 4.0.3
Mmmm. And now it’s there. Maybe GitHub caching !
I guessed the path: https://dl.nabox.org/nabox/NAbox-4.0.4.nbx .. but still not updated on the webpage...