Thursday, July 7, 2011

NFS Slowness Weirdness

We recently deployed a new NFS filer running on top of Nexenta using ZFS and noticed that some of our systems were having performance issues. Writes to the NFS system on most hosts were snappy, but reads from NFS were capping out at around 3M per second on a Gig-E network interface on a handful of hosts.

Systems that were identically configured in every way - kernel version, nfs package version, hardware, mount options, etc. were behaving differently. One would read and write at around 100M yet another would cap out at 3M.

Our systems guys did some troubleshooting and diagnosis on the network and could not find any issue there. So we went ahead and tested an scp from the NFS server to a problematic host.

The scp would run closer to 80M per second, so it seems that the problem was NFS itself.

We checked and double checked our configs, settings, sysctl.conf, versions and could not find anything different between a host where throughput was fine and one where it was horribly slow.

In the end we decided to umount all the NFS mounts, remove the "nfs" kernel module (rmmod) and reload it with modprobe and then remount the NFS mounts.

Lo and behold the throughput was back up to around 100M per second. This approach to fix the problem worked on all the problematic hosts we have tried it on so far.

Still,.. we are left scratching our head as to the real cause of the issue here as we basically have "jiggled the cable" or given NFS the "three fingered salute", if you will.

So though we now have a (rather intrusive) fix, we still don't know how to prevent the issue, if or when it will happen again, etc.

Has anyone out there seen anything similar to this before? Any ideas on what could be the issue?

Hmmm...

Lachlan