From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0

Description of problem:
This is a long-standing issue that appears to have been originally reported in Bugzilla #89885. In that issue we also had stack abuse by the proprietary ESM driver. The driver's stack usage was decreased and we were unable to reproduce the problem. However, on newer, faster machines we have been able to reproduce the original issue and, after much effort, have found the 3-way deadlock condition in the network layer.

The issue has only been seen when some kind of NIC teaming is in use. We have reproduced it with Intel's iANS, Broadcom's BASP, and the native bonding driver, using the bcm5700, Intel e1000, and tg3 drivers. The rest of the software stack consists of Dell's OMSA stack and Samba. A Samba share is exported either from a hard disk or from RAM; the RAM disk produces the hang a little faster. We have not been able to reproduce the issue without OMSA installed, but after examining the stack traces and ITP dumps we do not suspect anything that would make its presence a requirement. Also, a search of the internet turned up one other individual who appears to hit the same lockup on non-Dell equipment. We have also been able to reproduce the issue on several kernels, including ones from AS 2.1.

The problem is, specifically, a deadlock between the BR_NETPROTO_LOCK and dev->xmit_queue locks. I will follow up with trace files of the lockups.

Version-Release number of selected component (if applicable):
kernel-2.4.21-27.0.2.ELsmp

How reproducible:
Sometimes

Steps to Reproduce:
1. Install RHEL3 U4.
2. Configure network teaming of some kind.
3. Stress a Samba share of a local RAM disk.
4. Wait 8 hours or longer (sometimes 10 days).

Actual Results: System lockup. SysRq does work.

Expected Results: No lockup.

Additional info:
Created attachment 111472 [details] Crash occurring with tg3 This is sysrq output during the hang using the tg3 driver.
Created attachment 111473 [details] SysRq of the system with the most state information This is sysrq output from the system we have the most information on. In fact, at this time it is still hooked to an ITP, so we can dump any memory location or register you might still need. We have collected a lot of that information and I will also post it if required. It is based on a modified 2.4.21-27.0.2.ELsmp kernel that added "tracing" for the BR_NETPROTO_LOCK. This lock was being held, but prior to that code we could not see who was holding it.
Created attachment 111475 [details] Objdump of the system with the most info This is a gzip-compressed objdump of the modified kernel. The problem also occurs on unmodified kernels, but we changed this one to trace the BR_NETPROTO_LOCK to root-cause the issue. It is the system we have the most information on, and it shows the problem most clearly.
Created attachment 111476 [details] Sorted /proc/ksyms of the system with the most info
Created attachment 111480 [details] Output from the hardware ITP for the system with the most info This is a dump of some of the registers and stacks of the code running on the system while it is hung. Using the ITP we were able to determine who held the locks. The ITP does not enumerate the CPUs the same way the sysrq output does, so please note the following mapping if you are trying to match the ITP dumps with the sysrq output:

P0 = CPU2 in Linux
P1 = CPU3 in Linux
P2 = CPU0 in Linux
P3 = CPU1 in Linux
Looking at the files from the system "with the most info", this is what we determined:

* CPU2 holds a read lock on BR_NETPROTO and wants the dev->xmit_queue lock.
* CPU1 owns the dev->xmit_queue lock (we get that from the lock-structure dump) and wants a read lock on BR_NETPROTO.
* CPU0 is attempting to acquire a **write** lock on BR_NETPROTO; it has locked CPUs 0 and 1 but is spinning on CPU2.
* CPU3 wants a socket lock that appears to be held by CPU1 via tcp_recvmsg().

The main problem I see is that CPU1 owns the dev lock and is trying to get the NETPROTO lock, while CPU2 did the exact opposite: it holds the NETPROTO lock and now wants the dev lock. These locks should be obtained in the same order. This alone wouldn't be a problem, since they are read locks, but then we hit the rare instance of another CPU (in this case CPU0) that is "halfway" through its write lock: it walks the per-CPU locks from 0 to 3, in this case obtained only the first two (CPU0 and CPU1), and is now waiting on the 3rd and 4th to complete the write lock.
Created attachment 111493 [details] Hack to R/W locks to track netproto This is just a patch against the base 2.4.21-27.EL kernel that was made to track the NETPROTO locks and see what was occurring. Probably not useful to you now, but attached in case you are curious about what changes were made.
It is a known deficiency in the atomic version of the brlock implementation, which is what is used on x86 and its cousins. The non-atomic variant of brlocks does not have this problem. Essentially, brlocks must provide exactly equivalent semantics to rwlocks. This means, in particular, that when writers try to enter, they must back off if readers are present so that readers can make forward progress. The atomic variant of brlocks does not do this. The fix is to eliminate the atomic variant of brlocks and always use the non-atomic variant. But we can never include this fix, since it is an incredible kABI breaker. I'll attach the patch; it is in 2.4.30-preX already. There are other ways to easily reproduce this bug, mostly involving adding or removing netfilter rules while input packet processing is running. It requires 3 or more CPUs.
Created attachment 111585 [details] Fix for brlock deadlocks. Fix for brlock deadlocks on x86/x86_64/ia64
Visual inspection of the patch certainly confirms that it will fix this particular hang. In which update will this patch be put in place, RHEL3 U5? The issue occurs in some of our stress-test environments. Since it is a hang without a panic, it might be difficult for phone tech support to diagnose. Our test cases have been running without failure since the patch was applied on the 2nd; one of the test cases would previously fail in about 8 hours consistently.
If you read my comment, it states that since this patch is a kABI breaker, it is unlikely we can place it into any RHEL kernel, ever. And I quote from comment #9: ----- This fix is to eliminate the atomic variant of brlocks and always use the non-atomic variant. But, we can never include this fix since it's an incredible kABI breaker. -----
Sorry, that is odd. I never noticed comment #9 before my post; I only saw my comment #7 and your comment #10. I assume comments #8 and #11 are internal, since I cannot view those either right now. This (comment #9) is unsettling news. Is there any way to reorder the locks so that both threads lock the device xmit queue either before or after the NETPROTO lock, so we don't have threads obtaining the locks in different orders?
It isn't a private comment; I bet they just get renumbered when the private ones are not displayed, sorry. I was referring to the comment I made right before I attached the patch. There is no way to reorder the locks: netfilter needs to grab the BR_NETPROTO_LOCK recursively as a reader, there is no way around that, and as long as that is the case the deadlock can be triggered.
To explain the kABI situation: since we are entirely changing the layout and behavior of the BR_NETPROTO_LOCK, any kernel module taking these locks will stop working and will need to be recompiled. The lock is defined in include/linux/brlock.h, and lock acquisition occurs in inline functions callable from any kernel module, so kernel modules know the layout of these locks. The suggested patch would break such modules on the x86, ia64, and x86_64 platforms. As one can see, the lock routines and the locks themselves are exported to modules in kernel/ksyms.c. Any implementation of nontrivial networking (such as a protocol stack) would need to grab these locks. It is therefore very likely that some third-party kernel module out there references them and would break.
4/11/05: Per Sue Denham, RHEL3 U5 Release Notes will have an entry for this issue.
I'm sorry to report that the release notes had closed down by the time I got this one to the doc team. We will, however, immediately create a KnowledgeBase entry that customers and our support folks can view. I'll have this info added to the U2 notes. Again, my apologies.
Update KBASE entry when bug is fixed:

FAQ Question: Why would my system deadlock under network pressure with NIC bonding?

Topic(s):
* Red Hat Enterprise Linux : AS/ES/WS v. 3 - http://kbase.redhat.com/faq/FAQ_79_5697.shtm
* Red Hat Enterprise Linux : Hardware - http://kbase.redhat.com/faq/FAQ_46_5697.shtm
With RHEL3 now in maintenance mode, where only critical customer issues can be fixed, this bugzilla has been closed as WONTFIX due to its severity level and a lack of recent activity.