问题简介
在ASP.NET程序中我们经常遇到错误消息为"Request Timed Out"的HttpException,浏览器中的出错信息如下图所示,
同时应用程序日志中也会出现一条警告消息如下
data:image/s3,"s3://crabby-images/86190/861904fe83e7bb6db911e40717db0456ec0105fe" alt=""
如何重现这种异常?
线程在处理逻辑的过程中是如何被打断而抛出超时异常?
如何调试这种异常?
这篇文章我们从CLR源代码级别来分析一下该问题发生的原因以及相应的调试方法。
重现问题
首先我们先尝试重现这个问题,在ASP.NET程序web.config中有一个executionTimeout的设置,默认值为110秒,也就是说一个请求的执行时间超过110秒就会timeout从而产生HttpException "Request Timed Out"。这个设置可以通过IIS Configuration Editor或者web.config进行更改。
<?xml version="1.0" encoding="UTF-8"?> <configuration> <appSettings /> <connectionStrings /> <system.web> <compilation> </compilation> <authentication mode="Windows" /> <httpRuntime executionTimeout="10" /> </system.web> </configuration>
为了快速重现这个问题我们将executionTimeout默认值改为10秒。然后建立一个让请求超时的asp.net页面如下。
<%@ Page Language="C#" AutoEventWireup="true" Inherits="System.Web.UI.Page" %> <script runat="server"> protected void Page_Load(object sender, EventArgs e) { System.Threading.Thread.Sleep(120000); } </script> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head runat="server"> <title></title> </head> <body> <form id="form1" runat="server"> <div> </div> </form> </body> </html>
将页面部署在IIS上,然后请求页面,等待10秒到25秒左右问题重现。
产生原理
这里有一个值得注意的地方,为什么我们页面逻辑在执行Sleep却抛出了HttpException,执行线程被谁打断了,是怎么打断的?
研究这个问题我们首先要用windbg来察看一下这个异常发生的调用栈信息。
通过windbg attach到w3wp进程,然后输入命令'sxe clr'捕获clrfirst chance exception
CLR exception - code e0434f4d (first chance)First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
KERNELBASE!RaiseException+0x68:
000007fb`bbb789cc 488b8c24c0000000 mov rcx,qword ptr [rsp+0C0h] ss:000000da`6bf0da20=38a1bff3b7640000
0:023> KL
Child-SP RetAddr Call Site
000000da`6bf0d960 000007fb`a0da98ef KERNELBASE!RaiseException+0x68
000000da`6bf0da40 000007fb`a0d03e07 mscorwks!RaiseTheExceptionInternalOnly+0x2ff
000000da`6bf0db30 000007fb`a0cc58cf mscorwks!Thread::HandleThreadAbort+0x19b
000000da`6bf0dbd0 000007fb`a0cbe815 mscorwks!Thread::HandleThreadInterrupt+0x7b
000000da`6bf0dc30 000007fb`a1280769 mscorwks!Thread::UserSleep+0x89
000000da`6bf0dc90 000007fb`415d09ee mscorwks!ThreadNative::Sleep+0xf9
000000da`6bf0de40 000007fb`41630f39 App_Web_kng1ridv!ASP.sleep120_aspx.Page_Load(System.Object, System.EventArgs)+0xe
…
000000da`6bf0fae0 000007fb`beb53501 KERNEL32!BaseThreadInitThunk+0x1a
000000da`6bf0fb10 00000000`00000000 ntdll!RtlUserThreadStart+0x21
栈上面的first chanceexception为ThreadAbortException。
0:023> !dsoOS Thread Id: 0x12e0 (23)
RSP/REG Object Name
000000da6bf0d968 000000da0aac3ea0 System.Threading.ThreadAbortException
000000da6bf0da30 000000da0aac3ea0 System.Threading.ThreadAbortException
000000da6bf0da48 000000da0aac3ea0 System.Threading.ThreadAbortException
000000da6bf0db30 000000da0aac3ea0 System.Threading.ThreadAbortException
000000da6bf0dc98 000000da0aa93c90 System.Web.HttpContext
000000da6bf0dca0 000000da0aac3498 System.ComponentModel.EventHandlerList
000000da6bf0dd70 000000da0aac3498 System.ComponentModel.EventHandlerList
000000da6bf0dd90 000000da0aa93c90 System.Web.HttpContext
从这个调用栈上我们可以看出来这个异常来自于Thread::UserSleep -> Thread::HandleThreadInterrupt ->Thread::HandleThreadAbort
由于这个调用栈是CLR非托管代码,我们需要一份.NETFrameworkCLR的源代码作为参考,可以到微软下载一份Share Source CLI。找到\sscli20_20060311\sscli20\clr\src\vm\threads.cpp这个源文件,这个类封装的一系列CLR线程相关的属性和方法。
因为我们调用了Sleep方法,首先看一下Sleep函数的实现。原来线程调用Sleep方法之后进入系统内核,内核将其置于等待状态,一旦CLRSleepEx返回WAIT_IO_COMPLETION并且通过m_State & TS_Interrupted两个标志位来判断是interrupt的话,系统将开始执行HandleThreadInterrupt方法,进而在HandleThreadInterrupt方法中抛出ThreadAbortException。
// Implementation of Thread.Sleep(). void Thread::UserSleep(INT32 time) { CONTRACTL { THROWS; GC_TRIGGERS; } CONTRACTL_END; INCONTRACT(_ASSERTE(!GetThread()->GCNoTrigger())); DWORD res; GCX_PREEMP(); // A word about ordering for Interrupt. If someone tries to interrupt a thread // that's in the interruptible state, we queue an APC. But if they try to interrupt // a thread that's not in the interruptible state, we just record that fact. So // we have to set TS_Interruptible before we test to see whether someone wants to // interrupt us or else we have a race condition that causes us to skip the APC. FastInterlockOr((ULONG *) &m_State, TS_Interruptible); // If someone has interrupted us, we should not enter the wait. if (IsUserInterrupted()) { HandleThreadInterrupt(FALSE); } ThreadStateHolder tsh(TRUE, TS_Interruptible | TS_Interrupted); FastInterlockAnd((ULONG *) &m_State, ~TS_Interrupted); retry: res = ClrSleepEx (time, TRUE); if (res == WAIT_IO_COMPLETION) { // We could be woken by some spurious APC or an EE APC queued to // interrupt us. In the latter case the TS_Interrupted bit will be set // in the thread state bits. Otherwise we just go back to sleep again. if ((m_State & TS_Interrupted)) { HandleThreadInterrupt(FALSE); } // Don't bother with accurate accounting here. Just ensure we make progress. // Note that this is not us doing the APC. if (time == 0) { res = WAIT_TIMEOUT; } else { if (time != (INT32) INFINITE) { time--; } goto retry; } } _ASSERTE(res == WAIT_TIMEOUT || res == WAIT_OBJECT_0); }
接下来一个问题是到底是什么使得ClrSleepEx方法返回了呢?根据注释我们可以看到APC很可能是打断等待线程的原因。接下来我们继续通过debug验证一下。根据msdn用户模式可以调用kernel32!QueueUserAPC方法发送一个APC到指定线程。我们在这个方法上设置一个断点然后重现这个问题。结果抓到了这个调用栈。
0:022> kLChild-SP RetAddr Call Site
000000da`6bbfe808 000007fb`a10ce75b KERNEL32!QueueUserAPCStub
000000da`6bbfe810 000007fb`a0e37e88 mscorwks!Thread::Alert+0xdb
000000da`6bbfe880 000007fb`a103d0d4 mscorwks!Thread::UserInterrupt+0x30
000000da`6bbfe8b0 000007fb`a11f5699 mscorwks!Thread::UserAbort+0x3984d0
000000da`6bbfe9d0 000007fb`997c4519 mscorwks!ThreadNative::Abort+0x169
000000da`6bbfebd0 000007fb`988f835f mscorlib_ni!System.Threading.Thread.Abort(System.Object)+0x39
000000da`6bbfec20 000007fb`97e94547 System_Web_ni!System.Web.RequestTimeoutManager+RequestTimeoutEntry.TimeoutIfNeeded(System.DateTime)+0xa63d4f
000000da`6bbfec60 000007fb`99772bdb System_Web_ni!System.Web.RequestTimeoutManager.CancelTimedOutRequests(System.DateTime)+0x167
…
000000da`6bbff8a0 00000000`00000000 ntdll!RtlUserThreadStart+0x21
看一下栈上的信息,究竟是哪个线程正在被abort。
0:022> !dso
OS Thread Id: 0x14e4 (22)
RSP/REG Object Name
000000da6bbfe9c0 000000da0aa84400 System.Threading.Thread
000000da6bbfea28 000000d9cab1efe8 System.Web.HttpApplication+CancelModuleException
000000da6bbfea38 000000da0aa84400 System.Threading.Thread
000000da6bbfea88 000000d9caa4c810 System.String
0:022> !do 000000da0aa84400
Name: System.Threading.Thread
MethodTable: 000007fb998b86c0
EEClass: 000007fb994bea18
Size: 104(0x68) bytes
(C:\windows\assembly\GAC_64\mscorlib\2.0.0.0__b77a5c561934e089\mscorlib.dll)
Fields:
MT Field Offset Type VT Attr Value Name
000007fb998f01b0 4000690 8 ....Contexts.Context 0 instance 000000d9caa285c0 m_Context
000007fb998b2bd0 4000691 10 ....ExecutionContext 0 instance 000000da0aace570 m_ExecutionContext
000007fb998b7c20 4000692 18 System.String 0 instance 0000000000000000 m_Name
000007fb998b8438 4000693 20 System.Delegate 0 instance 0000000000000000 m_Delegate
000007fb998aefe0 4000694 28 System.Object[][] 0 instance 0000000000000000 m_ThreadStaticsBuckets
000007fb998bed40 4000695 30 System.Int32[] 0 instance 0000000000000000 m_ThreadStaticsBits
000007fb998c05a0 4000696 38 ...ation.CultureInfo 0 instance 0000000000000000 m_CurrentCulture
000007fb998c05a0 4000697 40 ...ation.CultureInfo 0 instance 0000000000000000 m_CurrentUICulture
000007fb998b7510 4000698 48 System.Object 0 instance 0000000000000000 m_ThreadStartArg
000007fb998ba628 4000699 50 System.IntPtr 1 instance
da6b789d60 DONT_USE_InternalThread
000007fb998bee90 400069a 58 System.Int32 1 instance 2 m_Priority
000007fb998bee90 400069b 5c System.Int32 1 instance 7 m_ManagedThreadId
000007fb998ac8d8 400069c 2d8 ...LocalDataStoreMgr 0 shared static s_LocalDataStoreMgr
>> Domain:Value 000000d9c90512d0:0000000000000000 000000d9ca176610:0000000000000000 <<
000007fb998b7510 400069d 2e0 System.Object 0 shared static s_SyncObject
>> Domain:Value 000000d9c90512d0:000000d9caa23bf0 000000d9ca176610:000000da0aa2d410 <<
0:022> !threads
ThreadCount: 10
UnstartedThread: 0
BackgroundThread: 9
PendingThread: 0
DeadThread: 1
Hosted Runtime: no
PreEmptive Lock
ID OSID ThreadOBJ State GC GC Alloc Context Domain Count APT Exception
3 1 150c 000000d9ca125a90 8220 Enabled 0000000000000000:0000000000000000 000000d9c90512d0 0 Ukn
19 2 23c8 000000d9ca12f510 b220 Enabled 000000da0aa8a238:000000da0aa8c200 000000d9c90512d0 0 MTA (Finalizer)
20 3 1a98 000000d9ca174ab0 80a220 Enabled 0000000000000000:0000000000000000 000000d9c90512d0 0 MTA (Threadpool Completion Port)
21 4 23f4 000000d9ca175b60 1220 Enabled 0000000000000000:0000000000000000 000000d9c90512d0 0 Ukn
22 5 14e4 000000da6b779c50 180b220 Disabled 000000d9cab1f000:000000d9cab20b18 000000d9ca176610 1 MTA (Threadpool Worker)
XXXX 6 0 000000da6b77d9d0 10820 Enabled 0000000000000000:0000000000000000 000000d9c90512d0 0 Ukn
23 7 12e0 000000da6b789d60
3a0b221 Enabled 000000da0aafd860:000000da0aafe2a8 000000d9ca176610 1 MTA (Threadpool Worker)
15 8 2180 000000da6b78a330 880a220 Enabled 0000000000000000:0000000000000000 000000d9c90512d0 0 MTA (Threadpool Completion Port)
24 9 13a4 000000da6b7903a0 220 Enabled 000000d9caae2e48:000000d9caae4aa8 000000d9c90512d0 0 Ukn
25 a e5c 000000d9ca107180 220 Enabled 000000d9cab20c68:000000d9cab22b18 000000d9c90512d0 0 Ukn
结果我们看到是23号线程是被abort的目标。23号线程正在做什么?查看23号线程调用栈正好是我们的Sleep方法线程。
0:023> kL
Child-SP RetAddr Call Site
000000da`6bf0db08 000007fb`bbb611f2 ntdll!NtDelayExecution+0xa
000000da`6bf0db10 000007fb`a10c369d KERNELBASE!SleepEx+0xaa
000000da`6bf0dbb0 000007fb`a0cbe7fd mscorwks!EESleepEx+0x2d
000000da`6bf0dc30 000007fb`a1280769 mscorwks!Thread::UserSleep+0x71
000000da`6bf0dc90 000007fb`415d09ee mscorwks!ThreadNative::Sleep+0xf9
000000da`6bf0de40 000007fb`41630f39
App_Web_kng1ridv!ASP.sleep120_aspx.Page_Load(System.Object, System.EventArgs)+0xe
…
000000da`6bf0f750 000007fb`a0d1970a mscorwks!UnManagedPerAppDomainTPCount::DispatchWorkItem+0x157
000000da`6bf0f7f0 000007fb`a0dcf3a4 mscorwks!ThreadpoolMgr::WorkerThreadStart+0x1ba
000000da`6bf0f890 000007fb`bc53167e mscorwks!Thread::intermediateThreadProc+0x78
000000da`6bf0fae0 000007fb`beb53501 KERNEL32!BaseThreadInitThunk+0x1a
000000da`6bf0fb10 00000000`00000000 ntdll!RtlUserThreadStart+0x21
到此我们可以证明我们的想法了,原来ASP.NET启动了一个RequestTimeoutManager来负责查看超时线程,然后通过发送APC的方式打断超时线程使之抛出ThreadAbortException结束等待,最后ThreadAbortException被包装成HttpException抛出,记录在日志以及显示在客户端。
调试方法
现实过程中Timeout不一定是出在线程Sleep上,也不是所有的ThreadAbortException都代表Timeout的问题发生。我们还是要总结一下类似问题的调试方法。
第一,观察timeout的时间,是否是100秒左右,与配置中的数值是否相符(注意这里比一定是一秒不差,因为查看RequestTimeoutManager逻辑可以发现他是每隔15秒查看一下超时线程,所有存在15秒左右的误差。
第二,如果不是超时时间设置过小的原因,那就要通过抓dump的方式来查看具体的超时请求超时在什么操作上。因为我们最终得到的ThreadAbortException中没有调用栈信息。
如何抓到对的dump是个值得研究的问题,我们可以把断点设在以下调用栈上的某个点上,注意断点越接近栈顶,这个方法可能被调用的次数就越多,抓到没有用的dump的可能性越大。
0:022> kLChild-SP RetAddr Call Site
000000da`6bbfe808 000007fb`a10ce75b KERNEL32!QueueUserAPCStub
000000da`6bbfe810 000007fb`a0e37e88 mscorwks!Thread::Alert+0xdb
000000da`6bbfe880 000007fb`a103d0d4 mscorwks!Thread::UserInterrupt+0x30
000000da`6bbfe8b0 000007fb`a11f5699 mscorwks!Thread::UserAbort+0x3984d0
000000da`6bbfe9d0 000007fb`997c4519 mscorwks!ThreadNative::Abort+0x169
000000da`6bbfebd0 000007fb`988f835f mscorlib_ni!System.Threading.Thread.Abort(System.Object)+0x39
000000da`6bbfec20 000007fb`97e94547 System_Web_ni!System.Web.RequestTimeoutManager+RequestTimeoutEntry.TimeoutIfNeeded(System.DateTime)+0xa63d4f
000000da`6bbfec60 000007fb`99772bdb System_Web_ni!System.Web.RequestTimeoutManager.CancelTimedOutRequests(System.DateTime)+0x167
…
000000da`6bbff8a0 00000000`00000000 ntdll!RtlUserThreadStart+0x21
通过ILSpy查看CancelTimedOutRequests和TimeoutIfNeeded源代码,可以看到进入这个方法并不一定代表有线程会被abort,所以我们要从下面一层抓。最终结论是如果我们想要研究Timeout的ThreadAbortException,我们可以在mscorlib_ni!System.Threading.Thread.Abort上面设置断点抓dump,然后查看堆栈上面的线程,最终找到正在被abort的线程以及相应执行的调用栈信息。
// System.Web.RequestTimeoutManager private void CancelTimedOutRequests(DateTime now) { if (Interlocked.CompareExchange(ref this._inProgressLock, 1, 0) != 0) { return; } ArrayList arrayList = new ArrayList(this._requestCount); for (int i = 0; i < this._lists.Length; i++) { DoubleLinkList obj; Monitor.Enter(obj = this._lists[i]); try { DoubleLinkListEnumerator enumerator = this._lists[i].GetEnumerator(); while (enumerator.MoveNext()) { arrayList.Add(enumerator.GetDoubleLink()); } } finally { Monitor.Exit(obj); } } int count = arrayList.Count; for (int j = 0; j < count; j++) { ((RequestTimeoutManager.RequestTimeoutEntry)arrayList[j]).TimeoutIfNeeded(now); } Interlocked.Exchange(ref this._inProgressLock, 0); } // System.Web.RequestTimeoutManager.RequestTimeoutEntry internal void TimeoutIfNeeded(DateTime now) { Thread thread = this._context.MustTimeout(now); if (thread != null) { this.RemoveFromList(); thread.Abort(new HttpApplication.CancelModuleException(true)); } }