Friday, April 09, 2010

Linux, Multipathing, EqualLogic and me

When is a multipathing problem not a multipathing problem?  When it is actually a multiple interface problem.


Multipathing with Linux (RHEL5) and EqualLogic.  How hard could that be?


OK, sure, the documentation is lacking in useful details but makes up for it with extraneous words and vague references.  Nevertheless, it is good enough to get things set up.  However, during the mkfs I start seeing connection errors:
connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4296210335, last ping 4296215335, now 4296220335
connection1:0: detected conn error (1011)
connection1:0: detected conn error (1019)
and I end up being able to replicate this problem by running bonnie++.
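
The three counters at the end of the ping timeout line look like kernel jiffies. As a rough sketch, assuming a tick rate of 1000 Hz (an assumption, though it is consistent with the 5 second timeout printed in the same message), they can be turned into seconds like this. The function name and the regular expression are made up for illustration and are not part of any iSCSI tool.

```python
import re

# Matches the "ping timeout" line exactly as it appears in the log above.
PING_TIMEOUT = re.compile(
    r"(?P<conn>connection\d+:\d+): ping timeout of (?P<timeout>\d+) secs expired, "
    r"recv timeout (?P<recv>\d+), last rx (?P<last_rx>\d+), "
    r"last ping (?P<last_ping>\d+), now (?P<now>\d+)"
)


def ping_timeout_gaps(line, hz=1000):
    """Return the gaps between the logged counters, in seconds.

    hz is the assumed kernel tick rate; it is not stated in the log line.
    Returns None if the line is not a ping timeout message.
    """
    m = PING_TIMEOUT.search(line)
    if m is None:
        return None
    last_rx = int(m.group("last_rx"))
    last_ping = int(m.group("last_ping"))
    now = int(m.group("now"))
    return {
        "conn": m.group("conn"),
        "rx_to_ping_secs": (last_ping - last_rx) / hz,
        "ping_to_now_secs": (now - last_ping) / hz,
    }


line = ("connection1:0: ping timeout of 5 secs expired, recv timeout 5, "
        "last rx 4296210335, last ping 4296215335, now 4296220335")
gaps = ping_timeout_gaps(line)
# Both gaps come out as 5.0 seconds under the 1000 Hz assumption.
print(gaps)
```

Read that way, the line says nothing was received for 5 seconds, a ping went out, and another 5 seconds passed with no answer before the connection was declared dead.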


A colleague (much smarter than I) found in the packet trace that eth1 was sending an ARP on behalf of eth2. This seemed to confuse the array, and the initiator ended up sending a RST to the array, which promptly closed the connection. The iSCSI layer wasn't aware that the rug had been pulled out from under it but eventually caught on and logged the messages above.


Naturally, it turns out to be a feature, not a bug, perhaps because both eth1 and eth2 are on the same subnet.  To decline the use of said feature, adding the following to /etc/sysctl.conf works wonders in getting multipathing to work:
net.ipv4.conf.all.arp_ignore=1
net.ipv4.conf.all.arp_announce=2
The arp_announce setting is probably the one that is really necessary. The arp_ignore setting is paired with it on all the other web pages, so when I was doing the "monkey-google, monkey-paste" I put it in too; I haven't spent the time to understand it fully or to test without it.
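
Going by the kernel's ip-sysctl documentation, arp_ignore=0 (the default) lets an interface answer an ARP request for any local address, whichever interface that address lives on, while arp_ignore=1 answers only when the target address is configured on the interface that received the request. The toy model below sketches just those two modes. The interface addresses are hypothetical, the real kernel has more modes, and arp_announce (which governs the source address used in outgoing ARP requests) is not modelled at all.

```python
def should_reply(arp_ignore, incoming_iface, target_ip, addresses):
    """Toy model of the ARP reply decision for arp_ignore modes 0 and 1.

    addresses maps an interface name to the set of IPs configured on it.
    """
    if arp_ignore == 0:
        # Default: reply if the target IP is configured on ANY interface.
        return any(target_ip in ips for ips in addresses.values())
    if arp_ignore == 1:
        # Reply only if the target IP is configured on the incoming interface.
        return target_ip in addresses.get(incoming_iface, set())
    raise ValueError("only arp_ignore modes 0 and 1 are modelled here")


# Hypothetical addressing: two iSCSI interfaces on the same subnet.
addresses = {"eth1": {"10.0.0.11"}, "eth2": {"10.0.0.12"}}

# A request for eth2's address arrives on eth1.
print(should_reply(0, "eth1", "10.0.0.12", addresses))  # True: eth1 answers for eth2
print(should_reply(1, "eth1", "10.0.0.12", addresses))  # False: eth1 stays quiet
print(should_reply(1, "eth2", "10.0.0.12", addresses))  # True: eth2 answers for itself
```

The first case is the behavior seen in the packet trace, with eth1 speaking for eth2.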


Everything is great: the physical machine is running fine.  To double check, I replicate the problem on the VM, apply the changes, and notice that everything is working fine.  Unfortunately, within a few minutes, the VM is throwing connection errors again while the physical machine is still fine.  It looks like on the VM, we also need to change the connection tracking behavior:
net.ipv4.netfilter.ip_conntrack_tcp_be_liberal=1
in order for the connection to be stable.  I was informed by the smarter colleague that we have added this to other systems where we've had ssh and NFS issues.
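
Since settings in /etc/sysctl.conf only take effect once they are loaded (for example with sysctl -p or at boot), it can be worth checking what the running kernel actually has. The following is a minimal sketch that reads the live values from /proc/sys. The helper names are made up for illustration, and the simple dot-to-slash mapping assumes no key component contains a dot itself (it would break for a VLAN interface name such as eth0.100).

```python
from pathlib import Path

# The three settings from this post and the values they should have.
WANTED = {
    "net.ipv4.conf.all.arp_ignore": "1",
    "net.ipv4.conf.all.arp_announce": "2",
    "net.ipv4.netfilter.ip_conntrack_tcp_be_liberal": "1",
}


def sysctl_path(key, root="/proc/sys"):
    """Map a sysctl key to its file under /proc/sys (dots become slashes)."""
    return Path(root) / key.replace(".", "/")


def check(wanted=WANTED, root="/proc/sys"):
    """Return {key: (expected, actual)}; actual is None if unreadable."""
    report = {}
    for key, expected in wanted.items():
        try:
            actual = sysctl_path(key, root).read_text().strip()
        except OSError:
            # Missing key (e.g. conntrack module not loaded) or no permission.
            actual = None
        report[key] = (expected, actual)
    return report


if __name__ == "__main__":
    for key, (expected, actual) in check().items():
        status = "ok" if actual == expected else "MISMATCH"
        print(f"{status:8} {key} expected={expected} actual={actual}")
```

An actual value of None means the key could not be read on that machine. On newer kernels the conntrack setting is named net.netfilter.nf_conntrack_tcp_be_liberal, so the key in WANTED would need adjusting there.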




Now if I can only figure out how to get the bnx2i iSCSI offload to work...

2 comments:

Marc Malotke - Dell said...

On EqualLogic.com there is a White Paper on setting up RHEL5 with EqualLogic.

http://www.equallogic.com/resourcecenter/assetview.aspx?id=8727

Marc
Twitter: MarcDELL
Blog: http://www.marcmalotke.net

zfortna said...

I ran into this exact same problem on my CentOS 5.4 installation. Keeping my fingers crossed it fixes the problem as it makes the Oracle db using the LUNs painfully slow.