Hi all, would love some input on next troubleshooting steps if anyone has ideas! Basically I have an AFF-A220 that is acting as a datastore for our vCenter, pretty basic. On that NetApp I've begun getting alerts regarding CRC errors, and running "node run -node Node1 ifstat e0d" we can see about 50 CRC errors coming in per minute.

Networking-wise, this is all on a backend network that is simply NetApp > Cisco Nexus switch > Dell chassis w/ the ESXi hosts. On the switch interfaces facing the Node1 ports that our NFS data LIF sits on, we can see lots of outbound errors when running show interface.

During our troubleshooting I migrated the LIF over to Node2, onto another ifgroup we have going to that same switch (this ifgroup is unused, whole other story there), and rebooted Node1 to see if that would solve the CRC errors. While Node1 was rebooting, Node2 began getting the same CRC errors on the ports that our data LIF was sitting on. When Node1 finished rebooting, I migrated the LIF back to its home port and we continued getting errors on those Node1 ports, and ceased getting them on the Node2 ports. This leads me to believe the issue does not lie within the cables/ports, as a lot of digging online has suggested.

Unfortunately this NetApp is at a site across the country and it will take us a week or two to get out there to physically look at it. Currently waiting for NetApp support to reply on my ticket, but wanted to see what you pros thought! Phew, that was a mouthful!
#CRC Errors
if you are using cut-through switches (instead of store-and-forward), the CRC errors can come from anywhere in the path from source to destination, since the switches can only notice the errors after they have already sent the packet out so it's too late to drop it. So you really have to check the full network path for cable/transceiver issues
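To make the cut-through point concrete, here is a minimal Python sketch. It is a toy model, not how any real switch is implemented: `zlib.crc32` stands in for the Ethernet FCS (byte ordering on the wire is glossed over), and the function names and result fields are made up for illustration.

```python
import zlib

def build_frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 trailer, standing in for the Ethernet FCS."""
    return payload + zlib.crc32(payload).to_bytes(4, "little")

def fcs_ok(frame: bytes) -> bool:
    """Recompute the CRC over the payload and compare it with the trailer."""
    payload, fcs = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "little") == fcs

def flip_one_bit(frame: bytes, index: int = 10) -> bytes:
    """Simulate a bad cable or transceiver by flipping one payload bit."""
    damaged = bytearray(frame)
    damaged[index] ^= 0x01
    return bytes(damaged)

def store_and_forward(frame: bytes) -> dict:
    """Buffer the whole frame first, so a bad frame can be dropped at ingress."""
    ok = fcs_ok(frame)
    return {"forwarded": ok, "fcs_error_seen": not ok}

def cut_through(frame: bytes) -> dict:
    """Start forwarding after the header; the FCS arrives last, too late to drop."""
    return {"forwarded": True, "fcs_error_seen": not fcs_ok(frame)}

good = build_frame(b"NFS data " * 20)
bad = flip_one_bit(good)

print(store_and_forward(bad))  # bad frame is dropped at the first switch
print(cut_through(bad))        # bad frame is passed on to the next hop
```

In the cut-through case the damaged frame keeps travelling, so the device that finally counts the CRC error can be several hops away from the link that caused it.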
our VMware guy ran some CLI commands and confirmed that there weren't any CRC errors popping up on the ESXi host ports. The only places we saw consistent errors were on the NetApp side, and also on the Switch interfaces facing the NetApp, reporting "outbound errors". Our theory is that the NetApp isn't receiving the frames properly and is telling the switch that
Did you check the switch ports where the ESXi servers are connected? The usual problem with connections is that the sender can't detect if he is causing an issue. So your ESXi guy might not see any problems but is causing it anyway.
Yes we did, no input or output errors there.
You could try to evacuate the ESXi servers one by one. See if that stops the errors for one of the hosts.
yeah, that was one of our next steps. i would think that the problem would stem from one of the VMs, not from the host itself. We may need to submit a CRQ to turn the VMs off/on one at a time, but saving that for last as our CRQ process would take a while to get approved
So what kind of errors do you actually see? CRC is very low level, I think those can't be caused by a VM.
There's no way the NetApp can "tell" the switches that it received CRC errors. The only way the switch counter can increase is if the switch itself notices that the packets are broken. I would try and reverse the path of the errors back to the source (i.e. check which other port(s) on the switch also have increasing CRC errors, then go back from there)
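One way to do that "walk it back" step is to take two snapshots of the per-port error counters a minute or so apart and see which ports are still climbing. The sketch below assumes you have already copied the counters into dictionaries by hand; the port names other than the NetApp-facing one, the counter names, and all the numbers are made up for illustration.

```python
def growing_counters(before: dict, after: dict) -> dict:
    """Return {port: {counter: delta}} for every counter that increased."""
    growing = {}
    for port, counters in after.items():
        earlier = before.get(port, {})
        deltas = {name: value - earlier.get(name, 0)
                  for name, value in counters.items()}
        # Keep only counters that actually went up between the snapshots.
        deltas = {name: delta for name, delta in deltas.items() if delta > 0}
        if deltas:
            growing[port] = deltas
    return growing

# Hypothetical snapshots: Eth1/29 faces the NetApp, Eth1/5 and Eth1/6 face hosts.
before = {
    "Eth1/5":  {"input_errors": 100, "output_errors": 0},
    "Eth1/6":  {"input_errors": 0,   "output_errors": 0},
    "Eth1/29": {"input_errors": 0,   "output_errors": 90},
}
after = {
    "Eth1/5":  {"input_errors": 150, "output_errors": 0},
    "Eth1/6":  {"input_errors": 0,   "output_errors": 0},
    "Eth1/29": {"input_errors": 0,   "output_errors": 140},
}

result = growing_counters(before, after)
print(result)
```

A port whose input errors grow at roughly the same rate as the output errors towards the NetApp is the next place to look.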
via ifstat on our NetApp, CRC errors that keep increasing
good to know! yes, my colleagues actually looked into cut-through vs store-and-forward after I left for the day. The switch is in cut-through mode, which according to their digging means the switch won't actually see the CRC issue (in the Ethernet footer) until it is already sending the frame out. That lines up with us only seeing output errors on switch interfaces, and no input errors.
What are the physical connections? Optical or twinax?
Where are the crc errors being reported? Netapp? Switch ports? Incoming or outgoing?
Version on Netapp? Are you using a port channel? So many questions
Twinax, CRC errors on the NetApp ports that data is flowing through (e0d/e0c on Node1) as well as outgoing errors on the switch towards those NetApp ports, and we're on 9.9.1p18 and yes e0d/e0c share an ifgroup
You might want to upgrade. There were some bugs that related to this. Let me see if I can find something
these are generally quite tricky to troubleshoot. you could check the other switches that are connected to the one you've looked at, and see if any of those also report output CRC errors towards the downstream switch. note that the numbers could be smaller (like half or even less) than the numbers on the downstream server, depending on the path the packets take. If you are really curious, you can maybe (temporarily) convert the switch to store-and-forward and see which client is having its packets dropped more frequently... as long as the errors are not significantly higher this shouldn't affect any network traffic, as TCP will simply retry sending the lost packets
That'd be sweet if that's all that was required!
Ok. Are the crc errors actually on the port or the ifgrp?
Switch model? Are they doing vpc? Version of code?
What are the twinax cables you are using? Are they Cisco? NetApp? Or something else?
Cisco
Cisco what?
Verifying this now, I did an ifstat on the underlying e0d port for the ifgroup but didn't run one on the ifgroup itself, although they were both getting the same error
No, the CRC errors are only populating on the underlying e0d port of the ifgroup
Cisco SFP-H10GB-CU3M
what are we looking for in the output here?
Depending on the load maybe only e0d is used
You said it moves to the second node during failover. More likely it comes from the source
yeah, don't quote me on this but with an ifgroup i think only one port actually gets used at a time, so it's basically for redundancy?
Other errors, pause frames, xon/xoff
Depends on the mode
Hence the insistence, if you can send the output
Same source-target combination usually uses the same link.
total errors is same # as CRC errors
for e0d
don't see anything else
It depends on a number of factors. In FlexPod we set up everything in the switch and ONTAP to use src/dst/port, which distributes traffic better
OK since you can't or won't send output, what ports are in your ifgrp? What distribution is it using? What mode is it using?
Are you using vlans?
Maybe the switch is not correct. If I can see how the ifgrp is configured I can offer suggestions for the switch side
e0d/e0c for the ifgroup, distribution function is IP, create policy is multimode_lacp and yes there is a vlan sitting on top of the ifgroup we are using. The error populates for e0d, a0a, and a0a-30
Try shutting down e0d and verify the crc moves
Unfortunately I can't copy paste the output, I'm using Discord on a personal device as we can't use it on our network laptops
Ah ha
I did do this two days ago - we rebooted the whole node that the ifgroup sits on after manually migrating the LIF to the 2nd node. The CRC errors did populate on the second node while node1 was rebooting
Any chance you can get a look at the switch config for the Netapp?
and e0c from the same ifgroup on node1 also has CRC errors, just not as many as e0d
yep, i have access to that
Check the uplinks of your Dell server chassis.
Possibly the easiest to pull them one by one and see if it stops. 🙂
Don't overcomplicate it 😉
Try this
system node run -node * options cdpd.enable on
Wait about three minutes and do
network device-discovery show -protocol cdp
Hopefully you can identify which ports you are in on the switch (which model again?). Maybe an ASIC issue on the switch and all the ports are in the same ASIC?
@harsh kelp suggestion is really good also
Cisco Nexus9000 C93180YC-FX
yeah, i would have honestly already been yanking cables out but this NetApp is on the other side of the country. Have to go through the motions to get travel approved
TBH don't trust the one guy over there to pull the right cables 😅
For reference @harsh kelp , on a Cisco switch I use:
port-channel load-balance src-dst l4port
really helps with distribution
On the NetApp for the ifgrp, waaaay better to use port for distribution
(Network traffic is distributed based on the transport layer (TCP/UDP) ports)
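A toy illustration of why the distribution function matters. To be clear, this is not the real ONTAP or NX-OS hash: the hash functions here are deliberately simplistic, and the IP addresses and source ports are invented. Only the link names e0c/e0d and the NFS port 2049 come from the real world.

```python
from collections import Counter

LINKS = ["e0c", "e0d"]  # member ports of the ifgroup

def last_octet(ip: str) -> int:
    return int(ip.split(".")[-1])

def pick_link_ip(src_ip, dst_ip, src_port, dst_port) -> str:
    """Toy 'ip' distribution: only the two addresses feed the hash."""
    key = last_octet(src_ip) + last_octet(dst_ip)
    return LINKS[key % len(LINKS)]

def pick_link_port(src_ip, dst_ip, src_port, dst_port) -> str:
    """Toy 'port' distribution: addresses plus TCP/UDP ports feed the hash."""
    key = last_octet(src_ip) + last_octet(dst_ip) + src_port + dst_port
    return LINKS[key % len(LINKS)]

# Eight hypothetical NFS connections from one ESXi host to one data LIF.
flows = [("10.0.30.11", "10.0.30.50", 50000 + i, 2049) for i in range(8)]

ip_usage = Counter(pick_link_ip(*flow) for flow in flows)
port_usage = Counter(pick_link_port(*flow) for flow in flows)

print("ip distribution:  ", dict(ip_usage))
print("port distribution:", dict(port_usage))
```

With the address-only hash every connection between the same host and LIF lands on one member link, which would fit one port showing far more traffic (and errors) than the other. Once the ports are part of the hash, the same eight connections spread across both links.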
@kind condor , the switch config for each netapp should look something like this:
`interface Po121
  description netapp-01
  switchport mode trunk
  switchport trunk native vlan <native-vlan-id>
  switchport trunk allowed vlan 123,456
  spanning-tree port type edge trunk
  mtu 9216
interface Eth1/21
  description netapp-01:e0c
  channel-group 121 mode active
  no shutdown
interface Eth1/22
  description netapp-01:e0d
  channel-group 121 mode active
  no shutdown`
Please note that whatever your native VLAN is, should NOT be used as a tagged vlan. i.e. if Native is 123, then on the netapp If I need to use 123 I would use a0a (the base interface). If you need to use 123 as a tagged vlan, just make sure the native vlan is something unused.
(native default to 1 if not specified)
@kind condor is it a SINGLE 93180 or a pair using Virtual Port-Channels (vPC)?
fix it!
Why: overhead
The switches should be 9216 (maximum). Different switch vendors have different max sizes as they treat overhead differently
all clients (netapp, esx) should be 9000. The vlan tagging adds extra and the 9216 accounts for this
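The overhead being referred to can be worked out with a bit of arithmetic. This sketch uses the standard Ethernet field sizes (14-byte header, 4-byte 802.1Q tag, 4-byte FCS); whether a particular switch's "mtu" setting counts any of those bytes varies by platform, so treat it as a reason to leave headroom rather than a statement about how the Nexus counts.

```python
ETH_HEADER = 14  # destination MAC + source MAC + EtherType
VLAN_TAG = 4     # 802.1Q tag, present on tagged VLANs such as a0a-30
FCS = 4          # frame check sequence (the CRC trailer)

def frame_size_on_wire(ip_mtu: int, vlan_tagged: bool = True) -> int:
    """Total Ethernet frame size for a packet that fills the given MTU."""
    return ip_mtu + ETH_HEADER + (VLAN_TAG if vlan_tagged else 0) + FCS

full_frame = frame_size_on_wire(9000)
print(full_frame)          # 9022
print(full_frame <= 9216)  # True: fits under a 9216 switch setting
print(full_frame <= 9000)  # False: larger than 9000 once overhead is counted
```

So a host sending full 9000-byte packets on a tagged VLAN puts 9022-byte frames on the wire, which is why the switch side is normally set higher than the clients.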
oh, the ESXi hosts should be set to 9000 as well? that is not something i verified with my VMware guy but that makes sense
This will definitely cause at least fragmentation as the netapp is sending 9000 and the switch will need to break that into two frames to pass along
this NetApp was up for 2 years before this started so I'm sure it must be
This will not cause the CRC, you could potentially get a little faster with the correct switch mtu and prevent fragmentation
Any luck on identifying which ports on the switch?
Yes i know which ports go to the NetApp. Config for the 4 interfaces headed to our NetApp is:
switchport mode trunk, switchport trunk allowed vlan 30, spanning-tree port type edge trunk, spanning-tree bpduguard enable, mtu 9000, channel-group 1 mode active
and obv channel-group 2 mode active for the 2nd port channel
what are the actual port numbers on the switch
eth1/29-32
when you were asking about the ASIC, is that the backplane that the interfaces sit on? One of our theories was that the backplane was having issues, our switch guy was going to look into how to verify that but no dice yet
yeah, something like that
Looks like that switch has a single ASIC for all 48 10/25G ports
Thank you both for your expertise - we think we may have something. We did find input errors on the switch stemming from one of the ESXi hosts. So it looks like it is originating upstream. Also there is only one VM sitting on that host, so we will troubleshoot the host and VM and I'll report back!
Excellent. Please "fix" the MTU on your switch ports when you can
Apparently the NetApp implementation engineer that helped my predecessor install this box was adamant about the 9000 mtu setting so my switch guy would prefer not changing it.
I think most of us here will tell you that "NetApp implementation engineer" was wrong. You more than likely are getting network fragmentation.
I would say....look at >any< Cisco documentation about jumbo frames. Every single time they tell you to use 9216 on the switch side. All clients stay at 9000
I feel a little stupid reporting back that the issue seems to be resolved. I guess through lack of attention to detail on my part I missed the input errors on the switch coming from one of the ESXi hosts. Once we saw that, we attempted to move the sole VM it had living on it off the host but it wouldn't go. We turned off the VM and it migrated just fine. Our VMware guy said he was seeing errors populating on the ESXi host (will need to get a debrief from him later on exactly what he saw), so they rebooted it and moved the VM back and voilà, we are no longer getting any new errors on the host, the VM, the switch, or any NetApp ports.
we once had a problem with several VMs on an iSCSI datastore hosted on a NetApp. terrible performance, but vMotion solved it. after a few days the issue reoccurred when the VM admin was migrating. one link on the ESXi host had negotiated 100 Mb/s with the switch instead of 10,000 Mb/s
I was gonna say...you probably gotta check somewhere else. Usually if I get something like that I like to involve all vendors, esp. the network vendor.
Storage guys will say it's network, network guys will say it's storage. And let the next epic battle begin.