After adding new nodes to 3 of our 6 sites (those 3 are in a different geographic location from the others), the rebalance doesn't seem to be working. All the new nodes were added to existing sites and sit on the same network as their counterparts in the same site. There are no firewalls between these nodes across sites either: traffic is trunked up to the core between the sites, with filtering enabled only on IP ranges and no blocking at the port level. I read in the StorageGRID documentation that if erasure coding is enabled, the rebalancing is supposed to be triggered automatically.
I have one ILM rule set to replicate for 4 days and then erasure code forever. There is another ILM rule that goes directly to erasure coding. I don't know exactly how much new data comes in every day, since the end users are outside the organization.
The new nodes show storage consumption of about 2-3%, while the older ones are at 78-80%. It has been about 2 weeks since the nodes were added to the grid. When I run the "rebalance status" command, it comes back with a Net::ReadTimeout with #<Socket:(closed)> error. As the error suggested, I contacted NetApp support. They think it is a firewall or port-blocking issue, but there is no port-level blocking across these networks whatsoever, so I am not sure what could be going wrong here. Has anyone seen and fixed something like this? I'd appreciate any advice.
#Rebalance fails after adding new nodes in multiple sites
Hi, how did you start the rebalance?
I tried this, and it fails with the error below.
`root@primaryadmin1:~ # rebalance-data start --site "<sitename>"
Requested rebalance operation failed. Internal server error. The server encountered an error and could not complete your request. Try again. If the problem persists, contact support. Net::ReadTimeout with #<Socket:(closed)>`
This KB matches the error, but maybe better to ask support... https://kb.netapp.com/hybrid/StorageGRID/Grid_Tenant_Manager/Unable_to_create_new_buckets_in_StorageGRID_with_net_read_timeout_error
Damn, the error looks exactly the same, but the issue is completely different. I wonder how to translate this to my case. I can ask support and see what they think about it.
@arctic loom Thanks for your time and effort. I started with the most basic checks: whether the new nodes had the correct routes and connectivity to DNS and NTP. I also ran the rebalance-data status and start commands while collecting a tcpdump on the primary admin node, then looked at the captures in Wireshark. There seemed to be a lot of cases of 100% packet drops. These were not firewall related; parts of a conversation just seemed to break here and there, especially when the communication was across countries. I did suspect something on the network, since the "socket" error makes it sound that way. I then ran the sg-leader-nodes.py command (which support had given me to check) on several grid appliance nodes to see whether all of them knew who their EC job manager, group manager, etc. were, and found that one of the nodes did not return any output. It seemed like some service on it was not running or was hung. The number of connections from the primary admin node to that specific node was also much lower than to the rest. Not knowing what else to look for, I just rebooted that node. It took a frighteningly long time to come back online, but when it did, the rebalance command just worked like magic. So a process on one node that was not communicating back to the primary admin node was enough to make the entire command fail. Seems weird, but that appears to have been the issue. 🙂
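The check described above (run the same command on several nodes and look for the one that returns nothing) can be sketched roughly as follows. This is only an illustration under assumptions: it presumes SSH key access from the machine running it, the node names in the usage comment are hypothetical, and `run_over_ssh` and `find_silent_nodes` are made-up helper names, not part of StorageGRID. It does not interpret the output of sg-leader-nodes.py; it only flags nodes where the command fails, times out, or prints nothing.

```python
import subprocess
from typing import Callable, Iterable, List


def run_over_ssh(node: str, command: str, timeout_s: float = 30.0) -> str:
    """Run `command` on `node` via ssh; return stdout, or '' on error/timeout."""
    try:
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", node, command],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung service or an unreachable node ends up here.
        return ""
    return result.stdout if result.returncode == 0 else ""


def find_silent_nodes(
    nodes: Iterable[str],
    command: str,
    runner: Callable[[str, str], str] = run_over_ssh,
) -> List[str]:
    """Return the nodes for which `command` produced no usable output."""
    return [node for node in nodes if not runner(node, command).strip()]


# Example (hypothetical node names; how the script is invoked on a node is an
# assumption, follow the KB article or support guidance for the real usage):
#   suspects = find_silent_nodes(["sn1", "sn2", "sn3"], "sg-leader-nodes.py")
#   print("Nodes returning no output:", suspects)
```

A node that shows up in the result is only a candidate for closer inspection (service status, connection counts), not proof of the root cause.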
Hi @dark canopy, happy you found the problem. Is sg-leader-nodes.py available and usable by everyone, or is it something meant to be used only under support guidance? 🙂
I think it is included in the StorageGRID software. I got it from this article that support had sent me: https://kb.netapp.com/hybrid/StorageGRID/Object_Mgmt/How_to_find_the_EC_leader_in_a_StorageGRID