No Heartbeat seems to be causing DLL error messages

PaperDragon
Joined: 13 Sep 05
Posts: 10
Message 15108 - Posted: 25 Jan 2008, 5:03:48 UTC

I have been getting those DLL initialization errors on 'no heart beat' conditions.

Take a look at this malariacontrol

Looking at the stderr section, you notice the heartbeat condition. This matches up with my log messages:
Lair2.paperdragon.ca malariacontrol.net beta 2008-01-24 01:20:34 Task wu_84_316_89784_0_1200991208_0 exited with a DLL initialization error.

Or this one from SETI beta
Log entry for it:
Lair2.paperdragon.ca SETI@home Beta Test 2008-01-24 01:20:34 Task 11oc06aa.25099.530432.3.10.87_0 exited with a DLL initialization error.

So for some reason the no-heartbeat errors are being reported as DLL errors.

ID: 15108
SekeRob
Joined: 25 Aug 06
Posts: 1596
Message 15110 - Posted: 25 Jan 2008, 8:38:26 UTC - in response to Message 15108.  
Last modified: 25 Jan 2008, 8:48:03 UTC

In the past this was reported as a loss of heartbeat, not as a DLL error. I had one yesterday on a 19-hour job at another project, and only found out after the job had been reported and declared invalid, when I looked in the log.

24/01/2008 22:45:18||Task ach1_16_10_1 exited with a DLL initialization error.
24/01/2008 22:45:18||If this happens repeatedly you may need to reboot your computer.
24/01/2008 22:45:19||Restarting task ach1_16_10_1 using acah version 514

It's the only case, though, where I found it caused the job to end up invalid. All the other, shorter jobs were somehow able to recover. I was able to reconstruct that it happened during a remote control session (not of BOINC) that went sluggish and eventually had to be killed.
Coelum Non Animum Mutant, Qui Trans Mare Currunt
ID: 15110
MikeMarsUK
Joined: 16 Apr 06
Posts: 386
United Kingdom
Message 15114 - Posted: 25 Jan 2008, 8:52:37 UTC


The mislabeling of 'status zero exit' messages as 'dll initialisation error' messages was fixed shortly after 5.10.30 was released, although these later versions are still in testing.

ID: 15114
Nicolas
Joined: 19 Jan 07
Posts: 1179
Argentina
Message 15119 - Posted: 25 Jan 2008, 16:21:35 UTC - in response to Message 15108.  

I have been getting those DLL initialization errors on 'no heart beat' conditions.

What BOINC version do you have?
ID: 15119
Pepo
Joined: 3 Apr 06
Posts: 547
Slovakia
Message 16942 - Posted: 28 Apr 2008, 23:15:45 UTC - in response to Message 15114.  

The mislabeling of 'status zero exit' messages as 'dll initialisation error' messages was fixed shortly after 5.10.30 was released, although these later versions are still in testing.

I was and am still seeing the "DLL initialization error" exit messages through 5.10.41, 5.10.42 and 5.10.45, on Win XP SP2, and definitely on 'no heart beat' conditions. It happens mostly on a busy system during wakeup from hibernation (it does not matter whether the applications are suspended or not, whereas the client in the meantime often happily communicates with various schedulers or uploads files). But occasionally it happens just while the client tries to communicate (possibly the 'single blocked thread' problem), like today:

28-Apr-2008 18:39:48 [QCN Alpha Test] [task_debug] result qcne_002178_0 checkpointed
28-Apr-2008 18:42:34 [SETI@home] [task_debug] result 24mr08ab.22129.22976.15.8.197_1 checkpointed
28-Apr-2008 18:50:42 [QCN Alpha Test] [task_debug] result qcne_002178_0 checkpointed
28-Apr-2008 18:53:24 [SETI@home] [task_debug] result 24mr08ab.22129.22976.15.8.197_1 checkpointed
28-Apr-2008 19:00:00 [---] Resuming network activity
28-Apr-2008 19:00:03 [The Lattice Project] [sched_op_debug] Fetching master file
28-Apr-2008 19:00:03 [The Lattice Project] Fetching scheduler list
28-Apr-2008 19:00:03 [Milkyway@home] Started upload of gs_560_1209170002_132848_0_0
28-Apr-2008 19:00:03 [Cels@Home] Started upload of N16-5_m50c001_160_0_S.gz_0_0
28-Apr-2008 19:00:37 [QCN Alpha Test] [task_debug] Process for qcne_002178_0 exited
28-Apr-2008 19:00:37 [QCN Alpha Test] Task qcne_002178_0 exited with a DLL initialization error.
28-Apr-2008 19:00:37 [QCN Alpha Test] If this happens repeatedly you may need to reboot your computer.
28-Apr-2008 19:00:37 [QCN Alpha Test] [task_debug] task_state=UNINITIALIZED for qcne_002178_0 from handle_exit_external
28-Apr-2008 19:00:37 [SETI@home] [task_debug] Process for 24mr08ab.22129.22976.15.8.197_1 exited
28-Apr-2008 19:00:37 [SETI@home] Task 24mr08ab.22129.22976.15.8.197_1 exited with a DLL initialization error.
28-Apr-2008 19:00:37 [SETI@home] If this happens repeatedly you may need to reboot your computer.
28-Apr-2008 19:00:37 [SETI@home] [task_debug] task_state=UNINITIALIZED for 24mr08ab.22129.22976.15.8.197_1 from handle_exit_external
28-Apr-2008 19:00:37 [Cels@Home] [task_debug] Process for N16-1_m35c001_180_2_S.gz_0 exited
28-Apr-2008 19:00:37 [Cels@Home] Task N16-1_m35c001_180_2_S.gz_0 exited with a DLL initialization error.
28-Apr-2008 19:00:37 [Cels@Home] If this happens repeatedly you may need to reboot your computer.
28-Apr-2008 19:00:37 [Cels@Home] [task_debug] task_state=UNINITIALIZED for N16-1_m35c001_180_2_S.gz_0 from handle_exit_external
28-Apr-2008 19:00:37 [QCN Alpha Test] [cpu_sched] Starting qcne_002178_0(resume)
28-Apr-2008 19:00:37 [QCN Alpha Test] [task_debug] task_state=EXECUTING for qcne_002178_0 from start
28-Apr-2008 19:00:37 [QCN Alpha Test] Restarting task qcne_002178_0 using qcnalpha version 246
28-Apr-2008 19:00:37 [SETI@home] [cpu_sched] Starting 24mr08ab.22129.22976.15.8.197_1(resume)
28-Apr-2008 19:00:37 [SETI@home] [task_debug] task_state=EXECUTING for 24mr08ab.22129.22976.15.8.197_1 from start
28-Apr-2008 19:00:37 [SETI@home] Restarting task 24mr08ab.22129.22976.15.8.197_1 using setiathome_enhanced version 527
28-Apr-2008 19:00:37 [Cels@Home] [cpu_sched] Starting N16-1_m35c001_180_2_S.gz_0(resume)
28-Apr-2008 19:00:37 [Cels@Home] [task_debug] task_state=EXECUTING for N16-1_m35c001_180_2_S.gz_0 from start
28-Apr-2008 19:00:37 [Cels@Home] Restarting task N16-1_m35c001_180_2_S.gz_0 using cels version 100
28-Apr-2008 19:00:39 [---] Project communication failed: attempting access to reference site
28-Apr-2008 19:00:50 [The Lattice Project] [sched_op_debug] Deferring communication for 1 min 0 sec
28-Apr-2008 19:00:50 [The Lattice Project] [sched_op_debug] Reason: Scheduler list fetch failed: http error
28-Apr-2008 19:00:51 [Milkyway@home] Temporarily failed upload of gs_560_1209170002_132848_0_0: http error
28-Apr-2008 19:00:51 [Milkyway@home] Backing off 1 min 0 sec on upload of gs_560_1209170002_132848_0_0
28-Apr-2008 19:00:56 [ralph@home] [sched_op_debug] Fetching master file
28-Apr-2008 19:00:56 [ralph@home] Fetching scheduler list
28-Apr-2008 19:01:08 [Cels@Home] Temporarily failed upload of N16-5_m50c001_160_0_S.gz_0_0: system connect
28-Apr-2008 19:01:08 [Cels@Home] Backing off 1 min 0 sec on upload of N16-5_m50c001_160_0_S.gz_0_0
28-Apr-2008 19:01:11 [---] Access to reference site failed - check network connection or proxy configuration.
28-Apr-2008 19:01:32 [ralph@home] [sched_op_debug] Deferring communication for 1 min 0 sec
28-Apr-2008 19:01:32 [ralph@home] [sched_op_debug] Reason: Scheduler list fetch failed: http error


Network communication was set to "auto" and was off by rule until 19:00. At that moment the client started its communication attempts, which could not succeed because the machine is behind a proxy, but everything else, including DNS, was fully functional. The machine was otherwise idle (just me browsing the internet), and SETI and Cels were still consuming some 80-90% of CPU until at least 19:00:25 (confirmed by Process Explorer logs). The client noticed their exit at 19:00:37 and declared them dead.
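To make the timing easier to reason about, here is a minimal sketch of a heartbeat timeout, assuming a task gives up once a fixed interval passes without a heartbeat from the client. The 30-second default, the function name and the plain-seconds time representation are all assumptions made for illustration; none of this is taken from the BOINC source.

```cpp
#include <cassert>
#include <vector>

// Assumed timeout, for illustration only.
constexpr double kHeartbeatTimeoutSec = 30.0;

// Returns the time at which a task would give up waiting for the client,
// or a negative value if every heartbeat arrived in time.
// `heartbeat_times` must be in ascending order; `start_time` is when the
// task began waiting and `end_time` is when the observation stops.
double first_giveup_time(const std::vector<double>& heartbeat_times,
                         double start_time, double end_time,
                         double timeout = kHeartbeatTimeoutSec) {
    assert(timeout > 0.0);
    double last = start_time;
    for (double t : heartbeat_times) {
        // A gap longer than the timeout means the task has already exited.
        if (t - last > timeout) return last + timeout;
        last = t;
    }
    // No further heartbeats: check the silence up to the end of observation.
    if (end_time - last > timeout) return last + timeout;
    return -1.0;
}
```

Under these assumptions, a client that stops sending heartbeats while it is blocked on something else would lose every running task at about the same moment, one timeout after the last heartbeat, even though the tasks themselves were healthy until then.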

I'd like to find out what exactly is behind these lost heartbeats during wakeup...

Peter
ID: 16942
MikeMarsUK
Joined: 16 Apr 06
Posts: 386
United Kingdom
Message 16945 - Posted: 29 Apr 2008, 7:39:21 UTC


I've heard more reports of 5.10.45 raising the wrong message recently, so I'd have to guess that they fixed the V6 branch but didn't backfit the same (simple) fix to the V5 branch.

ID: 16945
Pepo
Joined: 3 Apr 06
Posts: 547
Slovakia
Message 16946 - Posted: 29 Apr 2008, 9:47:58 UTC - in response to Message 16945.  

I've heard more reports of 5.10.45 raising the wrong message recently, so I'd have to guess that they fixed the V6 branch but didn't backfit the same (simple) fix to the V5 branch.

This is quite possible.

I've indeed noticed a few possibly related code changes in client/app_control.C between 5.10.14 and trunk, like changeset 14348 from 3 Dec 2007:

--- /trunk/boinc/client/app_control.C (revision 14310)
+++ /trunk/boinc/client/app_control.C (revision 14348)
@@ -178,7 +178,7 @@
 
 static void limbo_message(ACTIVE_TASK& at) {
 #ifdef _WIN32
-    if (at.result->exit_status = STATUS_DLL_INIT_FAILED) {
+    if (at.result->exit_status == STATUS_DLL_INIT_FAILED) {
         msg_printf(at.result->project, MSG_INFO,
             "Task %s exited with a DLL initialization error.",
             at.result->name
         );

(before this fix the condition was an assignment and therefore always true; with the fix, the client would just print "Task %s exited with zero status but no 'finished' file" instead),
or changeset 14552:

--- trunk/boinc/client/app_control.C (revision 14549)
+++ trunk/boinc/client/app_control.C (revision 14552)
@@ -269,9 +269,10 @@
         case 0x40010004:        // vista shutdown?? can someone explain this?
         case STATUS_DLL_INIT_FAILED:
             // This can happen because:
-            // - The OS is shutting down, so attempting to start
-            // any new application fails automatically.
+            // - The OS is shutting down, and attempting to start
+            //   any new application fails automatically.
             // - The OS has run out of desktop heap
+            // - (reportedly) The computer has just come out of hibernation
             //
             handle_premature_exit(will_restart);
             break;


(just a comment confirming the problem).
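The effect of the one-character change in the first diff can be reproduced in isolation. The following is a minimal sketch, not the client code: `Result`, `limbo_message_buggy`, `limbo_message_fixed` and `demo` are hypothetical stand-ins, and the constant is the Windows NTSTATUS value that `STATUS_DLL_INIT_FAILED` names.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Windows NTSTATUS value for STATUS_DLL_INIT_FAILED.
constexpr std::uint32_t kStatusDllInitFailed = 0xC0000142u;

// Hypothetical stand-in for the client's per-task result record.
struct Result {
    std::uint32_t exit_status;
};

// Pre-fix behaviour: '=' assigns the constant, and because the assigned
// value is non-zero the condition is true for every task. The extra
// parentheses only silence the compiler warning that would flag this.
std::string limbo_message_buggy(Result& r) {
    if ((r.exit_status = kStatusDllInitFailed)) {
        return "exited with a DLL initialization error";
    }
    return "exited with zero status but no 'finished' file";
}

// Post-fix behaviour: '==' compares and leaves exit_status untouched.
std::string limbo_message_fixed(const Result& r) {
    if (r.exit_status == kStatusDllInitFailed) {
        return "exited with a DLL initialization error";
    }
    return "exited with zero status but no 'finished' file";
}

// Usage: a task that exited with status zero.
void demo() {
    Result a{0};
    assert(limbo_message_fixed(a) ==
           "exited with zero status but no 'finished' file");
    Result b{0};
    assert(limbo_message_buggy(b) == "exited with a DLL initialization error");
    assert(b.exit_status == kStatusDllInitFailed);  // status was overwritten
}
```

Note that in this sketch the assignment form also overwrites the stored exit status, so the real status is lost as well as mislabeled.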

So it's time to give 6.1.17 a try, maybe already the current 6.1.16? (A few comments on email lists still keep me waiting.)

Peter
ID: 16946
Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15004
Netherlands
Message 16947 - Posted: 29 Apr 2008, 12:01:04 UTC - in response to Message 16946.  

So it's time to give 6.1.17 a try, maybe already the current 6.1.16? (A few comments on email lists still keep me waiting.)

Good luck... Do back up your data, and do not have an internet connection once you have installed BOINC. That will make sure you won't burst into needless tears. ;-)
ID: 16947

Copyright © 2022 University of California. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.