# Migrating from NAbox 3 to NAbox 4
Hi @neon nymph, a question about the migration process: while the data is being transferred from 3 to 4, will the new NAbox already be collecting new data, so that there is almost no gap?
Yes. The systems are configured first and then the historical data is transferred, so there shouldn’t be a gap.
Some facts about the migration from 3.5.3 to 4.0.4:
the migration tool took 100 hours 😉
the source Prometheus database was 1.8 TB,
the target VictoriaMetrics database is now 1 TB.
time spent while importing: 99h56m57.940330846s
total samples: 1017949577947
samples/s: 2829068.44
total bytes: 20.0 TB
happy migrating!
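The samples/s figure is simply total samples divided by elapsed seconds (99h56m57.94s is about 359,818 s), and the on-disk change from 1.8 TB to 1 TB is roughly 44% smaller. A minimal sketch that recomputes both from the numbers above, assuming a POSIX shell with awk; the helper names are hypothetical and not part of the migrate tool:

```shell
# Hypothetical helpers (not part of the NAbox migrate tool) that recompute
# the figures reported above using awk's floating-point arithmetic.

# samples_per_second TOTAL_SAMPLES HOURS MINUTES SECONDS
samples_per_second() {
  awk -v n="$1" -v h="$2" -v m="$3" -v s="$4" \
    'BEGIN { printf "%.2f\n", n / (h * 3600 + m * 60 + s) }'
}

# size_reduction_pct OLD_SIZE NEW_SIZE (both in the same unit, e.g. TB)
size_reduction_pct() {
  awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (old - new) / old * 100 }'
}

samples_per_second 1017949577947 99 56 57.940330846   # 2829068.44
size_reduction_pct 1.8 1.0                            # 44.4
```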
Thanks for the information. 100 h 😳 that's quite a long time.
Great insights.
There is probably a way to optimize this by changing the number of cores in the migration command, but I don’t have many environments to test with.
Hi @neon nymph, probably a stupid question: has the way nslookup works changed in NAbox 4 compared to 3? When I run nslookup for a specific FQDN, I get this error:
** server can't find fqdn: SERVFAIL
fqdn
Server: 127.0.0.53
Address: 127.0.0.53#53
Could it be that your FQDN ends with .local?
yes correct 🙈
I actually contributed the Flatcar Container Linux doc on this topic 😂 https://www.flatcar.org/docs/latest/setup/customization/configuring-dns/
If you put the domain name in the “Domains” field of the network config, you should be fine.
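For context: systemd-resolved (the stub at 127.0.0.53 in the nslookup output above) does not, by default, send lookups for names ending in .local to the unicast DNS server; it uses multicast DNS unless that domain is configured explicitly as a search or routing domain for the link. The NAbox “Domains” field is the supported way to set this. Purely as an illustration, assuming a systemd-networkd based host, the equivalent networkd setting is a Domains= line in the [Network] section; the helper name and example domain below are hypothetical:

```shell
# Hypothetical helper: print a systemd-networkd [Network] fragment that lists
# the given domain(s) as search domains, so systemd-resolved routes lookups for
# *.example.local to the configured unicast DNS server instead of mDNS.
# Prefixing a domain with "~" (e.g. "~example.local") makes it routing-only.
domains_fragment() {
  printf '[Network]\nDomains=%s\n' "$*"
}

domains_fragment example.local
```

After a change like this, resolvectl status shows the per-link DNS domains, and resolvectl query host.example.local tests resolution through the stub resolver.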
Interesting: when adding two space-separated domains, an "invalid domain name" error appears.
Comma?
Let me check.
What version?
4.0.4
OK, I might know why the Domains field isn’t working; try using only one domain. What’s wrong when changing the IP address?
Thanks 😊 I will stay with one domain for now.
As soon as the IP is changed, the web server is no longer reachable, even after a reboot.
Can you check the network config in /etc/systemd/network?
I just saw your post in another thread about changing the IP in the CLI. When I change it through the vApp settings, it works 🙂
I also noticed that the root password for NAbox 3 is missing from the FAQ page. Could it be added back in a subsection?
Hi @neon nymph, I just noticed in the ONTAP systems section of the admin menu that the first letter of the cluster name is always capitalized, even if the DNS name is lowercase. Also, can you tell me what the percentage in the third column means?
It's just a name; if I understand you correctly, it's not what's used to connect.
Regarding the third column, I don't have an instance available right now; you should see overall used capacity and an IOPS graph.
Right, I would say it's just a cosmetic error in the presentation.
I mean in the ONTAP system overview.
IOPS and latency are clear, but how should I interpret the percentage in that view?
Hi @neon nymph, a short update from our migration front 🙂 We have now migrated 6 of 10 NAbox instances to version 4, and so far everything has gone well. Our biggest installation, with 17 clusters (42 nodes / 697 SVMs / 1846 volumes and 868 LIFs), took around 75 h. The size in vCenter was reduced from 1.87 TB to 1.34 TB.
Nice feedback, thank you! Efficiency is nice!
That's an average of 2.65 volumes per SVM 🤔 Looks like you like creating SVMs 😜
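The figures in this exchange are plain ratios of the reported counts and sizes. A minimal awk-based sketch, assuming a POSIX shell; the helper name is hypothetical:

```shell
# Hypothetical helper: divide two numbers and round to two decimals.
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

ratio 1846 697    # volumes per SVM: 2.65
ratio 1.34 1.87   # new vCenter size / old size: 0.72
```

So the vCenter footprint shrank to about 72% of its former size, i.e. roughly 28% smaller.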
@candid umbra That's the business of a service provider 😉 We have a lot of customers from different industries.
Is changing the IP address of the new NAbox 4 to the old NAbox 3's IP address supported?
Shouldn’t be a problem
I did it the other way around: I changed the IP of the NAbox 3 to a temporary one, used the old IP for the NAbox 4, and migrated the data.
Nice one!
Has anyone run into this error?
Running the migrate tool fails right after "Restarting Prometheus with admin APIs enabled... OK". The error that follows is: Taking snapshot Error Failed to take Prometheus snapshot error="Failed to parse snapshot response: invalid character 'S' looking for beginning of value"
Are you root?
yes
Do you mind copying this https://upload.nabox.org/vaso-puku-geno into NAbox 3 and trying with this migrate tool?
That version seems to be working. It's still running, but it made it into the transfer stage.
OK, I think you hit a random bug that I haven't been able to isolate yet.
Did you try a second time with the original one?
I did not.
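The wording "invalid character 'S' looking for beginning of value" matches the syntax error Go's JSON decoder reports when a response body is not JSON at all. When the admin API is enabled, Prometheus normally answers the snapshot call with JSON such as {"status":"success","data":{"name":"<snapshot-name>"}} and writes the snapshot under the snapshots/ directory of its data path. A minimal sketch, assuming a POSIX shell, that classifies a raw response body so you can see what the endpoint actually returned; the helper name is hypothetical and it uses simple string matching rather than a full JSON parser:

```shell
# Hypothetical helper: classify a raw body returned by
#   POST /prometheus/api/v1/admin/tsdb/snapshot
# using plain string matching (no jq required).
classify_snapshot_response() {
  body="$1"
  case "$body" in
    *'"status":"success"'*)
      # Extract the "name" value from {"status":"success","data":{"name":"..."}}
      name=${body#*\"name\":\"}
      name=${name%%\"*}
      printf '%s\n' "ok: snapshot $name"
      ;;
    *'admin APIs disabled'*)
      printf '%s\n' "admin API disabled: restart Prometheus with --web.enable-admin-api"
      ;;
    '{'*)
      printf '%s\n' "other JSON error: $body"
      ;;
    *)
      printf '%s\n' "not JSON: $body"
      ;;
  esac
}

# Example on the NAbox 3 host, using the same call as later in this thread:
# classify_snapshot_response "$(curl -s -XPOST -k https://localhost/prometheus/api/v1/admin/tsdb/snapshot)"
```

Note that a successful call really creates a snapshot on disk, so only run it when you intend to.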
Hi @all,
I'm stuck in the migration process.
3.5.2 -> 4.0.4
Last week I had the same error as someone above (VictoriaMetrics image was not available, Error 125). After reading here, I updated to 3.5.4 today and saw the right image, but now I have another error.
The migrate tool failed to create the snapshot directory (and the snapshot itself). If I create the directory myself, the migrate tool begins the migration process but stops immediately with "found no blocks to migrate" (of course, because there is no data in this directory).
If I follow the manual procedure, I cannot take the snapshot, because it always returns "error":"admin APIs disabled".
To enable the admin APIs, I followed the manual steps and added "- --web.enable-admin-api" to the command: property in docker-compose.yaml, but without success. We have not customized the NAbox, of course. So why does it get stuck enabling the admin API and taking the snapshot?
Can someone give me further advice?
Thank you in advance
Hello there.
In NAbox 4.0.5 the migrate tool has a bit more logging. Maybe you can upgrade, and make sure you run curl again so the new tool is copied over to v3.
Also, be careful to always run it as root.
Thank you.
I will give it a try. In fact, today I was successful in the following way:
I adapted docker-compose.yaml and hard-coded the enable-admin-api parameter to check the procedure. After restarting the container, the admin API was still not enabled. I inspected the container with docker and saw that different parameters were active than those described in the YAML. So I searched for these parameters and found them in a file called docker-compose.custom.yaml (sorry, the name may be wrong because I don't have the server available right now). Setting the parameter in this custom YAML file enabled the API after a restart. Then I switched back to the migrate tool, and it migrated successfully in 1:46 h. But because this is our test scenario, I will repeat it in the next few days before trying it in production.
Excellent! If you have time, I’d like to help with any issue you have with the migrate tool.
Hi Yann
What I did today: I deployed a new NAbox with 4.0.5, repeated the migration, and it was successful. But it was successful yesterday too, so that proved nothing. So I restored the old NAbox to the 3.5.2 level. New try, old failure: no VictoriaMetrics image available. The fix is to upgrade to 3.5.4.
I upgraded to 3.5.4 and repeated the migration with the migrate tool (from v4.0.5). The error is:
failed to create promeheus client: failed to open snapshot "/prometheus/data/snapshot": opening the db dir: stat /prometheus/data/snapshot/: no such file or directory
ERROR Failed to migrate metrics error="exit status 1"
Of course, I now know what to do, as this is the same state we had yesterday. But perhaps we can find the root cause with the additional logging in the new migration tool.
So I have two questions that are not clear to me:
- Why does the old tool throw an error when it looks for the VictoriaMetrics image (a step that comes after the snapshot), even though no snapshot directory and no snapshot were created?
- When I set the parameter "- --web.enable-admin-api" in docker-compose.override.yaml and restart the container, the parameter appears to be active. But I was never successful with manual step 2. What could be the reason?
Thank you for your help.
- I think NAbox 3 has some trouble self-upgrading its images: for some reason, probably logged somewhere in the nabox container, the Docker images failed to load into the NAbox Docker daemon. Sometimes re-applying the upgrade works, even with the same version.
- You should have this line in the Docker Compose file: "- ${WEB_ENABLE_ADMIN_API:---log.level=info}".
That means that when running a "normal" docker compose up, it'll use --log.level=info, and when using WEB_ENABLE_ADMIN_API=--web.enable-admin-api docker compose ..., it'll run Prometheus with the right flag. Is it possible that your Docker Compose file (not the override) is missing "- ${WEB_ENABLE_ADMIN_API:---log.level=info}"?
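Docker Compose interpolates ${VAR:-default} the same way a POSIX shell does: if VAR is unset or empty, the default is used. In the line above the operator is ":-" and the default string is "--log.level=info" (the three dashes are ":-" followed by "--log..."). A minimal shell demonstration of the cases; compose_arg is a hypothetical helper name:

```shell
# Print the Prometheus argument that ${WEB_ENABLE_ADMIN_API:---log.level=info}
# expands to; compose_arg is a hypothetical helper, not part of NAbox.
compose_arg() {
  printf '%s\n' "${WEB_ENABLE_ADMIN_API:---log.level=info}"
}

unset WEB_ENABLE_ADMIN_API
compose_arg                                              # --log.level=info
WEB_ENABLE_ADMIN_API=--web.enable-admin-api compose_arg  # --web.enable-admin-api
WEB_ENABLE_ADMIN_API= compose_arg                        # --log.level=info (empty counts as unset)
```

This is why a plain docker compose up runs Prometheus with a harmless log-level flag, while prefixing the variable swaps in the admin API flag without editing any file.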
"failed to create promeheus client" doesn't make much sense to me if there is no error before it: that would mean snapshot creation succeeded. Plus, "/prometheus/data/snapshot" is wrong, as it's missing the snapshot ID.
This is in the docker-compose.yaml. I saw it, but after the error I thought there might be an ordering problem when this parameter is set twice.
But the compose file you mentioned also contains the parameter - --storage.tsdb.retention.time=2y.
And with docker inspect I saw that --storage.tsdb.retention.time=1y is active in the prometheus container. That is why I had the idea to edit docker-compose.override.yaml: it is the only file where this 1y is set.
Oh yes, it shouldn't be set in the override file.
For environment variables, you don't have to repeat the variables from the main file; they will be merged.
But only by setting it in the override file was I able to enable the admin API (which I think is the real problem in our case).
Yes, that doesn't make sense indeed. Do you have a screenshot or a verbatim copy of the logs? It looks like you typed it out.
I could create a screenshot, but I don't understand what you want to see.
We can do that over a Teams call if you can.
Yes, we have Teams.
Quick test I did here.
/usr/local/nabox/docker-compose # WEB_ENABLE_ADMIN_API=--web.enable-admin-api docker compose --env-file .env --env-file .env.custom up -d prometheus
WARN[0000] The "SNAPSHOT_PATH" variable is not set. Defaulting to a blank string.
WARN[0000] The "NABOX4_IP" variable is not set. Defaulting to a blank string.
[+] Running 1/1
✔ Container prometheus Started 1.3s
It says "Started", meaning the container has been restarted.
Then:
/usr/local/nabox/docker-compose # WEB_ENABLE_ADMIN_API=--web.enable-admin-api docker compose --env-file .env --env-file .env.custom up -d prometheus
WARN[0000] The "SNAPSHOT_PATH" variable is not set. Defaulting to a blank string.
WARN[0000] The "NABOX4_IP" variable is not set. Defaulting to a blank string.
[+] Running 1/0
✔ Container prometheus Running 0.0s
It says "Running" and didn't do anything because the environment stayed the same.
If I change it back:
/usr/local/nabox/docker-compose # docker compose --env-file .env --env-file .env.custom up -d prometheus
WARN[0000] The "SNAPSHOT_PATH" variable is not set. Defaulting to a blank string.
WARN[0000] The "NABOX4_IP" variable is not set. Defaulting to a blank string.
[+] Running 1/1
✔ Container prometheus Started 0.9s
The variable is gone, so the container is started again.
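The on/off toggle demonstrated above can be wrapped in a small function. A minimal sketch, assuming it is run as root from /usr/local/nabox/docker-compose with the same compose invocation shown in the thread; the function name and the DRY_RUN switch are hypothetical. It only has an effect if the command: list Prometheus actually uses (main file or override) contains the ${WEB_ENABLE_ADMIN_API:---log.level=info} line:

```shell
# Hypothetical helper (not shipped with NAbox): restart only the prometheus
# service with the admin API turned on or off, reusing the compose invocation
# from the thread. With DRY_RUN=1 it prints the command instead of running it.
prometheus_admin_api() {
  case "$1" in
    on)  flag=--web.enable-admin-api ;;
    off) flag= ;;  # empty value: compose falls back to --log.level=info
    *)   echo "usage: prometheus_admin_api on|off" >&2; return 2 ;;
  esac
  if [ -n "${DRY_RUN:-}" ]; then
    echo "WEB_ENABLE_ADMIN_API=$flag docker compose --env-file .env --env-file .env.custom up -d prometheus"
  else
    WEB_ENABLE_ADMIN_API="$flag" docker compose --env-file .env --env-file .env.custom up -d prometheus
  fi
}
```

As the test above shows, Compose reports "Started" only when the resolved configuration changes; "Running" means nothing was recreated.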
I see. I rechecked it in my screen session: after adapting the parameter in docker-compose.yaml, it printed "starting". But when I only passed the variable in front of the docker command, it only printed "running", even though that would be a new state for the container. So I will check this again after the migration in a non-screen session.
OK, I think I'm getting somewhere. The override file is throwing it off.
OK, well, that's unfortunate. As it happens, command values in the override file are not merged; the override replaces the whole list.
For this to work, you would have to repeat the whole paragraph under command:
- --config.file=/etc/prometheus/prometheus.yml
- --web.route-prefix=/
- --web.external-url=http://example.com/prometheus
- --storage.tsdb.retention.time=2y
- ${WEB_ENABLE_ADMIN_API:---log.level=info}
I'll update the documentation
Ah well... it's already documented that way actually 😄
I'll check it as soon as the migration has finished. But the above is exactly what's in docker-compose.yaml, and it was not used; only the parameters from docker-compose.override.yaml were active, as far as I remember.
Exactly: the override needs to repeat the parameters (all lines under the command: section), otherwise everything else in the main compose file is ignored.
So if you put the paragraph above (verbatim from the main compose file, with just the retention changed) in the override, you'll be fine.
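Because Compose replaces, rather than merges, a service's command: list when it appears in an override file, the override has to carry every argument. A minimal sketch that writes such an override with only the retention changed. Assumptions: the service is named prometheus (as in the docker compose ... up -d prometheus calls above), and the argument values are copied from the paragraph posted in this thread (http://example.com/prometheus may not be your real value). Compare them with your own main docker-compose.yaml, merge any other settings your existing override has by hand, and write to a scratch path first. The helper name is hypothetical:

```shell
# Hypothetical helper: write an override whose prometheus command: list repeats
# every argument from the main compose file, changing only the retention.
# The ${WEB_ENABLE_ADMIN_API...} line is single-quoted so it is written
# literally and expanded later by Compose, not by this shell.
write_prometheus_override() {
  out="$1"
  retention="${2:-2y}"
  {
    printf 'services:\n'
    printf '  prometheus:\n'
    printf '    command:\n'
    printf '      - --config.file=/etc/prometheus/prometheus.yml\n'
    printf '      - --web.route-prefix=/\n'
    printf '      - --web.external-url=http://example.com/prometheus\n'
    printf '      - --storage.tsdb.retention.time=%s\n' "$retention"
    printf '      - %s\n' '${WEB_ENABLE_ADMIN_API:---log.level=info}'
  } > "$out"
}

# Example: preview the result before copying it into place.
# write_prometheus_override /tmp/override-preview.yaml 1y
```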
And there was the issue with the wrong container names anyway.
A small clarification: in your setup, the command: section contained only --storage.tsdb.retention.time=1y. That actually works, except that you may not be able to reach Prometheus at the /prometheus URL (which, well, is a problem for the migration tool). As for the config file, I believe it's the default value anyway, so that may not matter.
Good news is: migration terminated successfully.
Bad news: I am still confused about your "Command" section.
I started again:
- Disable the admin API by commenting out the parameter in docker-compose.override.yaml: #- --web.enable-admin-api
Then, of course, the admin API is not available:
curl -XPOST -k https://localhost/prometheus/api/v1/admin/tsdb/snapshot
{"status":"error","errorType":"unavailable","error":"admin APIs disabled"}
The migrate tool does not work without a snapshot (which is expected):
Restarting Prometheus with admin API enabled... OK
Waiting for Prometheus OK
Taking snapshot... OK
Restarting Prometheus with admin API disabled... OK
Waiting for Prometheus OK
Prometheus import mode
Prometheus snapshot stats:
blocks found: 0;
blocks skipped by time filter: 0;
min time: 0 (1970-01-01T00:00:00Z);
max time: 0 (1970-01-01T00:00:00Z);
samples: 0;
series: 0.
2024/08/20 15:06:54 found no blocks to import
2024/08/20 15:06:54 ERROR Failed to migrate metrics error="exit status 1"
Activating the admin API again:
vi docker-compose.override.yaml
Restart the container:
docker compose --env-file .env --env-file .env.custom up -d prometheus
WARN[0000] The "SNAPSHOT_PATH" variable is not set. Defaulting to a blank string.
WARN[0000] The "NABOX4_IP" variable is not set. Defaulting to a blank string.
[+] Running 1/1
✔ Container prometheus Started
The admin API is active again.
Now, where do you want me to put the command section from your message above?
Unless you have all the lines, it won't work as expected.
Basically, the command: section in the override completely masks the command: section from the main file.
And you need ${WEB_ENABLE_ADMIN_API:---log.level=info} so that the variable passed on the docker compose command line works, and you need --web.route-prefix=/ and --web.external-url=http://example.com/prometheus so that Prometheus is reachable through the API.
Ahh, I think I get you now... You want me to adapt docker-compose.override.yaml with "your command section", right?
Yes, just like https://3.nabox.org/faq/#change-default-retention
Yes!! Now it's working!!
If I understood it right, this ${WEB_ENABLE_ADMIN_API:---log.level=info} line was the key.
So tomorrow I will repeat the whole procedure. Everything should then work out of the box, as long as I adapt this parameter right after the upgrade. I will let you know. Thank you so much for your help; I really appreciate it.
Yes. The only thing I don't get is that the migration should have stopped with "Failed to take prometheus snapshot" or "Failed to parse snapshot response", which I think we never got.
Even the container name issue should have been fine; it was "just" the web path and the environment variable for the admin API.
Hi Yann!!
I have reset all systems, applied the lessons learned, and now I have a working path:
- Upgrade to 3.5.4 to fix the missing VictoriaMetrics image (this could also be fixed by redeploying 3.5.2, but I did not find the file).
- Insert ${WEB_ENABLE_ADMIN_API:---log.level=info} into docker-compose.override.yaml in /usr/local/nabox/docker-compose.
- Then follow the standard procedure for the migrate tool as described in the manual.
This works perfectly for me.
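The working path above can be sanity-checked before starting the migrate tool. A minimal sketch, assuming the override lives in /usr/local/nabox/docker-compose as described; the function name is hypothetical, and it only greps for the lines this thread identified as required when an override has its own command: section (the admin API variable plus the /prometheus web path arguments), so it is a crude check rather than a YAML validation:

```shell
# Hypothetical pre-flight check for the NAbox 3 -> 4 migration. If the
# override has its own command: section, that section replaces the main one,
# so it must carry the admin-API variable and the /prometheus web path args.
check_prometheus_override() {
  file="${1:-/usr/local/nabox/docker-compose/docker-compose.override.yaml}"
  [ -f "$file" ] || { echo "no override file: $file"; return 0; }
  grep -q 'command:' "$file" || { echo "override has no command: section"; return 0; }
  missing=0
  for arg in '${WEB_ENABLE_ADMIN_API:---log.level=info}' \
             '--web.route-prefix=/' \
             '--web.external-url='; do
    if ! grep -qF -e "$arg" "$file"; then
      echo "missing: $arg"
      missing=1
    fi
  done
  [ "$missing" -eq 0 ] && echo "override looks OK"
  return "$missing"
}
```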
Thank you for your kind support. Having a place to ask some (stupid) questions is such a great help. Thank you again.