An exchange hardware fault caused my (always on) broadband connection to drop last night. BOINC 6.2.14 (protected install on XP and Vista) had stopped running on both systems when I checked them this morning.
Looking at stdoutdae.txt, it's clear that BOINC didn't detect the network failure. Everything was fine as long as it was only attempting scheduler requests, but as soon as an upload was started it crashed.
Here's a scheduler request from stdoutdae.txt after the connection had failed:
29-Jul-2008 00:50:08 [CPDN Beta] Sending scheduler request: To send trickle-up message. Requesting 0 seconds of work, reporting 0 completed tasks
29-Jul-2008 00:50:11 [---] Project communication failed: attempting access to reference site
29-Jul-2008 00:50:12 [---] Internet access OK - project servers may be temporarily down.
29-Jul-2008 00:50:13 [CPDN Beta] Scheduler request failed: Server returned nothing (no headers, no data)
Note that the reference site check is made, and reported as successful, before the scheduler request itself is logged as failed, even though the connection was down at the time.
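For anyone who wants to check the ordering in their own log, here is a minimal Python sketch that parses the four lines quoted above and compares the timestamps. The parse() and first_time() helpers are my own names, not anything from BOINC. The fixed-width slicing assumes the "DD-Mon-YYYY HH:MM:SS [project] message" layout seen in this excerpt, and the month parsing assumes an English locale.

```python
from datetime import datetime

# The four lines quoted above from stdoutdae.txt.
LOG = """\
29-Jul-2008 00:50:08 [CPDN Beta] Sending scheduler request: To send trickle-up message. Requesting 0 seconds of work, reporting 0 completed tasks
29-Jul-2008 00:50:11 [---] Project communication failed: attempting access to reference site
29-Jul-2008 00:50:12 [---] Internet access OK - project servers may be temporarily down.
29-Jul-2008 00:50:13 [CPDN Beta] Scheduler request failed: Server returned nothing (no headers, no data)
"""


def parse(line):
    """Split one log line into (timestamp, project, message)."""
    stamp, rest = line[:20], line[21:]           # timestamp is 20 characters wide
    project, _, message = rest.partition("] ")   # "[CPDN Beta" / "Sending ..."
    when = datetime.strptime(stamp, "%d-%b-%Y %H:%M:%S")
    return when, project.lstrip("["), message


events = [parse(line) for line in LOG.splitlines()]


def first_time(prefix):
    """Timestamp of the first message starting with the given text."""
    return next(when for when, _, msg in events if msg.startswith(prefix))


ref_ok = first_time("Internet access OK")
sched_failed = first_time("Scheduler request failed")

# True here means the reference site check was reported OK
# before the scheduler request was logged as failed.
print(ref_ok < sched_failed)
```

On the excerpt above the comparison is true: the "Internet access OK" line is stamped one second before the "Scheduler request failed" line.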
The trickle-up and reference site check were retried 9 times before the following sequence, at which point boinc.exe crashed ('normal' scheduler requests take priority over trickle-ups):
29-Jul-2008 02:51:10 [malariacontrol.net] Computation for task wu_133_524_149170_0_1217280246_0 finished
29-Jul-2008 02:51:10 [malariacontrol.net] Sending scheduler request: To fetch work. Requesting 818 seconds of work, reporting 1 completed tasks
29-Jul-2008 02:51:12 [---] Project communication failed: attempting access to reference site
29-Jul-2008 02:51:12 [malariacontrol.net] Started upload of wu_133_524_149170_0_1217280246_0_0
BOINC Windows Runtime Debugger didn't generate any stack traces on the XP system, but on the Vista system the trace in stderrdae.txt indicates that the crash was in the libcurl function curl_multi_remove_handle():
BOINC Windows Runtime Debugger Version 6.2.14
Dump Timestamp : 07/29/08 02:51:13
Debugger Engine : 4.0.5.0
*** Dump of thread ID 44492 (state: Waiting): ***
- Information -
Status: Wait Reason: UserRequest, , Kernel Time: 87828560.000000, User Time: 71604456.000000, Wait Time: 19143612.000000
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0016D9FC read attempt to address 0x27273D84
- Registers -
eax=01e40278 ebx=00d3fe00 ecx=00d3fe00 edx=00001caa esi=27273d74 edi=00000000
eip=0016d9fc esp=0129fda0 ebp=0129fe6c
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010206
- Callstack -
ChildEBP RetAddr Args to Child
0129fe6c 0040c86f 00468c18 00000000 3fc68730 7554eab9 libcurl!curl_multi_remove_handle+0x0
0129fef0 00431e51 00000000 3fc68730 76cae0c5 00000000 boinc!+0x0
0129ff68 0043b467 00000000 001d19a0 76cad1da 00000001 boinc!+0x0
0129ff88 75854911 001d19a0 0129ffd4 76fce4b6 001d19a0 boinc!+0x0
0129ff94 76fce4b6 001d19a0 7dc5be09 00000000 00000000 kernel32!BaseThreadInitThunk+0x0
0129ffd4 76fce489 76cad1b9 001d19a0 00000000 00000000 ntdll!RtlInitializeExceptionChain+0x0
0129ffec 00000000 76cad1b9 001d19a0 00000000 43534552 ntdll!RtlInitializeExceptionChain+0x0
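A couple of quick sanity checks on the numbers in that dump, as a small Python sketch. The values are copied from the exception record and register dump above. The interpretation in the comments is only my guess; I haven't looked at the libcurl source to confirm it.

```python
# Values copied from the dump above.
fault_addr = 0x27273D84      # "read attempt to address"
exception_addr = 0x0016D9FC  # "at address"
eip = 0x0016D9FC             # instruction pointer from the register dump
esi = 0x27273D74             # esi from the register dump

# The exception address matches eip, so the registers shown are the ones
# in effect at the faulting instruction.
print(exception_addr == eip)

# The faulting read is a small fixed offset from esi, which would be
# consistent with reading a field through a bad pointer held in esi.
# This is a guess, not something the trace proves.
offset = fault_addr - esi
print(hex(offset))

# Viewed as little-endian bytes, esi happens to be printable ASCII.
# That may be coincidence, or it may hint that the pointer was
# overwritten with text data.
as_bytes = esi.to_bytes(4, "little")
print(as_bytes)
```

The offset works out to 0x10, and the esi value decodes to the bytes `t=''`. Neither fact proves anything on its own, but both would fit a handle that had already been freed or overwritten by the time the upload was started.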
While I was waiting for the faulty line card to be replaced, I tried to get BOINC running again on both systems. Shortly after network operations started, all tasks stopped running with heartbeat failures:
No heartbeat from core client for 31 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
Tasks could only be kept running by suspending networking until the exchange problem was fixed.