#Rebalance fails after adding new nodes in multiple sites


dark canopy
#

After adding new nodes to 3 of our 6 sites (those 3 are in a different geographic location from the others), the rebalance doesn't seem to be working. All the new nodes were added to existing sites and sit on the same network as their counterparts in the same site. There are no firewalls between these nodes across sites either: traffic is trunked up to the core between the sites, with filtering enabled only on IP ranges and no blocking at the port level. I read in the StorageGRID documentation that if erasure coding is enabled, the rebalancing must be triggered automatically.
I have one ILM rule set to 4 days of replication followed by erasure coding forever. There is another ILM rule with direct erasure coding too. I don't know exactly how much new data comes in every day, since the end users are outside the organization.
I see that the new nodes have storage consumption of about 2-3% and the older ones are at 78-80%. It has been about 2 weeks since the nodes were added to the grid. When I run the "rebalance-data status" command, it comes back with a Net::ReadTimeout with #<Socket:(closed)> error. As the error suggested, I contacted NetApp support. They think it is a firewall or port-blocking issue, but there is no port-level blocking across these networks whatsoever, so I am not sure what could be going wrong here. Has anyone seen and fixed something like this? I'd appreciate any advice.

arctic loom
#

Hi, how did you start the rebalance?

dark canopy
# arctic loom Hi, how did you start the rebalance?

I tried this and it fails with the below error.

```
root@primaryadmin1:~ # rebalance-data start --site "<sitename>"
Requested rebalance operation failed.

Internal server error. The server encountered an error and could not complete your request. Try again. If the problem persists, contact support. Net::ReadTimeout with #<Socket:(closed)>
```

arctic loom
dark canopy
#

Damn. The error looks exactly the same, but the issue is completely different. I wonder how to translate this. I can ask support and see what they think about it.

dark canopy
#

@arctic loom Thanks for your time and effort. I started with the most basic checks: I confirmed that the new nodes had the correct routes and connectivity to DNS and NTP, then ran the rebalance-data status and start commands while collecting a tcpdump on the primary admin node, and looked at the captures in Wireshark. There seemed to be a lot of cases of 100% packet drops. These were not firewall related; parts of a conversation just seemed to break here and there, especially when the communication was across countries. I did suspect something on the network, since the "socket" error makes it sound that way. I then ran the sg-leader-nodes.py command (which support had given me as something to check) on several grid appliance nodes to see whether all of them knew who their EC job manager, group manager, and so on were, and found that one of the nodes returned no output at all. It seemed that some service on it was not running or was hung. The number of port connections from the primary admin node to that specific node was also far lower than to the rest. Not knowing what else to look for, I just rebooted that node. It took a frighteningly long time to come back online, but when it did, the rebalance command worked like magic. So a process on one node that was not communicating back to the primary admin node was enough to make the entire command fail. It seems weird, but that appears to have been the issue. 🙂
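
For anyone who wants to repeat the two checks described above across many nodes, here is a minimal Python sketch. It is not a NetApp tool. It assumes passwordless SSH from wherever you run it, that `sg-leader-nodes.py` is on the remote PATH and takes no arguments (I only know it as the command support provided), and that connection counts come from standard Linux `ss -tn` output. The function names and the hostnames in the usage note are made up for illustration.

```python
import subprocess
from collections import Counter


def run_on_node(node, command, timeout=30):
    """Run `command` on `node` over SSH; return stdout, or None on timeout or failure.

    Assumption: key-based SSH access to the node is already set up.
    """
    try:
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", node, command],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        return None
    return result.stdout if result.returncode == 0 else None


def find_silent_nodes(nodes, command="sg-leader-nodes.py", runner=run_on_node):
    """Return the nodes where the command hung, failed, or printed nothing."""
    silent = []
    for node in nodes:
        output = runner(node, command)
        if output is None or not output.strip():
            silent.append(node)
    return silent


def count_established_by_peer(ss_output):
    """Count ESTAB connections per peer IP from the text of `ss -tn`."""
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Columns: State Recv-Q Send-Q Local-Address:Port Peer-Address:Port
        if len(fields) >= 5 and fields[0] == "ESTAB":
            peer_ip = fields[4].rsplit(":", 1)[0]
            counts[peer_ip] += 1
    return counts
```

Usage would be something like `find_silent_nodes(["sn1.example", "sn2.example"])` with your own node names, and `count_established_by_peer(...)` fed with the output of `ss -tn` captured on the primary admin node. A node that shows up as silent and also has noticeably fewer connections than its peers is the kind of outlier I ended up rebooting.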

arctic loom
#

Hi @dark canopy, happy you found the problem. Is sg-leader-nodes.py available and usable by everyone, or is it something of a secret, to be used only under support guidance? 🙂

dark canopy