#FAS 2750 - VMWare Datastore


glass nest
#

Hi,

We have a FAS2750 with some hybrid RAID (I can provide more information if required). We have about 300 VMs running on this box and we suffer from latency between 20-100 ms at times. What is the recommended latency before it will start causing issues? We are on ONTAP 9.8P21.

austere wasp
#

What do you mean by "before it will start causing issues"? I mean, you are probably already experiencing issues on your VMs right now, right? The storage system is fine and won't be bothered by any high latencies. It always tries its best to deliver the data to your servers, so you won't see any real "issues" there (other than the high I/O latencies themselves).

red copper
#

This needs more details...

#

Start there.

half spruce
#

How many and what kind of disks are behind it, and are you running it with active-active controllers or active-standby?

glass nest
#

Sorry, basically we are experiencing latency issues, with our VMware environment reporting 100+ ms of delay (even into the 1000s sometimes).

Thanks for that Paul I will certainly start looking through this list...

We have 2 FAS2750 shelves and we are using both FSAS and SSD disks (attached the CSV). Whereabouts would the settings be for active-active or active-passive?

Sorry, please excuse me if I seem a bit new. I'm relatively new to the storage side of things, so if I am explaining things wrong or you need a bit more information, please do let me know.

half spruce
#

Ok and do you have one data store or two? Is the system used for anything else?

#

This is an active active setup, so some disks on one controller, some on another

#

You should try to keep latency below 20ms. 100 to 1000 is pretty bad.

#

My 1000ft review is that this is too much workload for this configuration. Assume a 10TB SATA drive can do about 75 IOPS and you’ve got like 22 of them tops for data per datastore. You presumably have a Flash Pool config, which will help a bit, and backend IOPS don’t equal front-end IOPS, but for VMs doing random I/O the gap between the two is smaller than it would need to be for this to work well.
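As a rough back-of-envelope sketch of that maths: the drive count and per-drive figure come from the post above, the VM count from the opening post, and the even split across two aggregates is purely an assumption for illustration. It ignores Flash Pool caching and RAID overhead entirely.

```python
# Back-of-envelope backend IOPS budget using the rough figures from this thread.
# These are estimates, not measurements from the system in question.

IOPS_PER_SATA_DRIVE = 75        # rough figure for a 10TB SATA drive
DATA_DRIVES_PER_DATASTORE = 22  # "tops", per the post above
TOTAL_VMS = 300                 # approximate count from the opening post
AGGREGATES = 2                  # assumption: VMs split evenly across two


def backend_iops(drives: int, iops_per_drive: int) -> int:
    """Raw random IOPS the spinning disks can deliver, ignoring any cache."""
    return drives * iops_per_drive


def iops_per_vm(total_iops: int, vm_count: float) -> float:
    """Average share of the backend budget available to each VM."""
    return total_iops / vm_count


if __name__ == "__main__":
    budget = backend_iops(DATA_DRIVES_PER_DATASTORE, IOPS_PER_SATA_DRIVE)
    per_vm = iops_per_vm(budget, TOTAL_VMS / AGGREGATES)
    print(f"Backend budget per aggregate: {budget} IOPS")
    print(f"Average per VM: {per_vm:.1f} IOPS")
```

With those figures it works out to 1650 backend IOPS per aggregate, or about 11 IOPS per VM on average, which is why the spinning disks look like the bottleneck.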

#

You might be able to tune workloads a bit by finding heavy workloads and splitting them into a QoS’ed volume, but the maths still aren’t awesome.

glass nest
#

Thank you Alex, there are 2 aggregates with 2 volumes on each, for a total of 4 volumes.
Yeah, so this is running approx 280ish VMs with varying IOPS requirements (some domain controllers, web servers, VDI, SQL DBs).

I've tried looking, but how does the Flash Pool work in terms of how it tiers the data, given we are connected to VMware via NFS 4.1? (I have expressed concern with this, as normally it's iSCSI for block level.)

Oddly enough, your advice seems to match what I've heard from other people outside of the business: that the hardware just doesn't match what we are trying to achieve. I'm going to attempt to convince the higher-ups that we need a change, as I believe we put this in in 2019, so we are almost at 5 years of usage at this point.

Honestly thank you so much for this insight

runic juniper
#

If you are using ONTAP 9.8 and NFSv4.1, that’s a problem waiting to happen. NFSv4.1 is known not to be stable on that release, especially during takeover/giveback. The major thing that happens is the ESXi nodes will basically lock up, as do all the VMs on the hosts. An ESXi power cycle is the only recovery.

austere wasp
#

NFS4.1 had problems even in newer releases so I would generally suggest not to use it. There's no real benefit from an ESX point of view anyway

glass nest
#

The odd thing is, we were on 9.6 beforehand and we have kind of 3rd-party NetApp support via another company. They recommended we go from 9.6 to 9.8, and oddly enough we had an issue with the takeover and like 60% of the estate dropped 😅

I'm starting to think it's better going for something all-flash and using iSCSI rather than hybrid 😅

red copper
#

As far as upgrading, C series is honestly a great value.

half spruce
#

NFSv3 is fine and gives you snapshot integration which is pretty cool.

torn hull
#

Where do you see the latency? In the VMware "Monitoring" page or in a storage monitoring tool? If you see latency on the VMware side, check the network.
I'd advise adding volumes (at least 6 volumes per aggregate) and redistributing the VMs. You may distribute VMs per QoS and set tiering for every volume.
Disable the "atime-update" feature if you don't need it.
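One way to think about the redistribution suggested above is a simple greedy placement: take the heaviest VM first and put it on whichever volume currently carries the least load. The sketch below is only an illustration. The function name `spread_vms`, the VM names and the IOPS figures are all hypothetical, and in practice the per-VM numbers would have to come from your own monitoring.

```python
import heapq


def spread_vms(vm_iops: dict[str, float], volume_count: int) -> list[list[str]]:
    """Greedy placement: heaviest VM first, always onto the least-loaded volume."""
    if volume_count < 1:
        raise ValueError("volume_count must be at least 1")

    # Min-heap of (current load, volume index) so the lightest volume pops first.
    heap = [(0.0, i) for i in range(volume_count)]
    heapq.heapify(heap)
    placement: list[list[str]] = [[] for _ in range(volume_count)]

    by_weight = sorted(vm_iops.items(), key=lambda kv: kv[1], reverse=True)
    for name, iops in by_weight:
        load, idx = heapq.heappop(heap)
        placement[idx].append(name)
        heapq.heappush(heap, (load + iops, idx))
    return placement


if __name__ == "__main__":
    # Hypothetical VMs and average IOPS, purely for illustration.
    sample = {"sql01": 900, "vdi01": 400, "web01": 300, "dc01": 50, "dc02": 50}
    for i, vms in enumerate(spread_vms(sample, 2)):
        print(f"volume {i}: {vms}")
```

This only balances average IOPS. It knows nothing about capacity, peak times or which controller owns the aggregate, so treat the result as a starting point rather than a plan.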

red copper
#

You can't tier with a FlashPool (Hybrid Aggregate)

glass nest
#

Good morning chaps,

Just to update (sorry, had some PTO): we have had discussions with our supplier, and we had the head of NetApp presales and support on the call. They have had a full look through the system and created a list of stuff that we need to do:

NFSv3 migration/balancing of workloads
Jumbo frames end to end
Extra utilisation of unused network ports (cabling needed, may also be part of PS)
Thin provisioning (some of this was done on the call, check the rest)
cluster e0M not sending home --> this could be a bug just looking at this
Upgrade ONTAP
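On the jumbo frames item: a common way to confirm the MTU really is consistent end to end is a don't-fragment ping with the largest payload that still fits (on ESXi that is usually done with something like `vmkping -d -s <size>`). A minimal sketch of the size arithmetic, assuming IPv4 without options and an MTU of 9000, neither of which has been confirmed in this thread:

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(f"MTU {mtu}: max ping payload {max_ping_payload(mtu)} bytes")
```

If a ping of that size with don't-fragment set fails while a smaller one works, something in the path (host vSwitch, physical switch or storage port) is probably still on a smaller MTU.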

We believe this is the right path forward. Thank you all for the assistance. We have decided to start getting professional services involved, as currently we are very limited team-wise.

All your hard work explaining every recommended step, for free over Discord because you are passionate about your jobs, is much appreciated, and I've learnt so much about our storage array thanks to you lot!

half spruce
#

Let us know how it goes - I’m still hesitant about that workload on that hardware, but I’d like to see the results all the same

runic juniper
#

e0M not sending home could also indicate an L2 issue or a port connection issue to the switch

#

You could try making a test LIF, moving it to each e0M port, and at least test pinging it

tall pollen
# glass nest Good morning chaps, Just to update (Sorry had some PTO) we have had discussions...

I could see how balancing of workloads could help
Jumbo frames end to end should be there on all vSphere iSCSI configs
Controller port utilisation, as well as host NIC port utilisation via the vSwitch, yes as well
But how does thin provisioning help? If anything, it should make things worse, especially in a dense VM configuration. Size could grow out of control pretty quickly
I would recommend limiting the number of VMs per vol/agg
Do keep us posted as to what was done

old abyss
#

And do not forget to disable update access time and look-ahead reader on the volumes

austere wasp
#

disabling atime update can help, but from our experience only on volumes with many hundreds to thousands of active connections, not in an ESX environment where the metadata overhead of atime updates is pretty minor

old abyss
#

Thanks @austere wasp
Would love an insight from the NetApp documentation owner.
I have been doing this for over 10 years, as it was also stated in the NetApp & VMware TR documentation and is now on the Docs site.
My belief was that the many VMs on a datastore produce not very predictable IOPS, so read-ahead was a waste of time. And of course in an all-flash environment it does not matter. But in this case (hybrid/spinning) every IO counts.

#

Just tested with ONTAP tools for VMware vSphere, when creating a new volume, it is automatically set to false

old abyss
#

My mistake. It always states to set -min-readahead to false.

glass nest
# tall pollen I could see how balancing of workloads could help Jumbo frame end to end shoul...

Yeah, I can see how it would help a bit, however I'm feeling the same as everyone in this thread and I don't think the hardware is up to scratch at all. Management aren't seeing it that way, though (despite it being 5 years old at this point).

Yeah, we have been using NFS 4.1 or NFS3 for this, no iSCSI at all (not sure why, however this is a historical thing).

The thin provisioning was mainly to allow us to carve up another volume so we can attempt to make these changes. Currently we have carved out a 2TB volume and mapped it via NFS3 with the IP of Controller B to test that the controller wasn't dead (it wasn't), however (no surprise to me) we are still seeing heightened latency on this. I'm going to try and get the nabox.org VM up and running with NetApp Harvest so I can prove to management that it's the physical disks that are causing us the issues.

glass nest
# old abyss Just tested with ONTAP tools for VMware vSphere, when creating a new volume, it ...

Yeah, after this comment I found out that ONTAP tools for vSphere exists, so we ended up creating another volume and mapping it via ONTAP tools rather than manually, so thanks for pointing that out! I'm sorry, I've ended up coming at this from the angle that it was all set up by management (ex Senior Engineer at the time), so I'm having to decipher it, and to be honest my storage knowledge is very lacking (it's something I really need to improve on). I believe all our existing vols were set up from the GUI rather than ONTAP tools or the CLI... I've confirmed with a member of the A-Team (from the support provider) and he has confirmed we have set the new volume up correctly.

glass nest
#

Sorry to reopen an old-ish conversation. I've finally got around to sorting out NABox in our environment (sorry, staffing really hasn't improved much) and, looking at it, it's certainly a disk issue. I've been semi-pressured from above to start moving things so that both NetApp controllers are in use, and to move us back to NFS v3, but this is a very long, drawn-out process involving lots of out-of-hours work which I don't have time for right now.

Looking at the screenshot, it looks like we are always at 100% disk busy regardless. Is there any way to break this down to the file level so I can see if there is a particular VM that is causing the issue? Oddly, 21:00 - 2:00am is our backup window and we look least busy during that period, which is a bit odd.

Rebalancing between the RAID groups is slowly being worked on as you can see the 1st Node is massively overloaded. The flashpool is hardly ever getting hit on the 1st node which makes sense as to why we are seeing massive latency spikes at random.