Ticket #336 (new Defect)

Opened 10 months ago

Last modified 3 days ago

replace heartbeat mechanism

Reported by: davea
Assigned to: davea
Priority: Critical
Milestone: 6.0
Component: BOINC - API
Version:
Keywords:
Cc: Pepo

Description

Problem with the heartbeat mechanism: if the client does something that blocks for more than 30 seconds (e.g. a synchronous DNS lookup, a disk-space scan, or a debugger break), all apps quit, producing confusing messages and possibly wasting CPU time.

Proposed solution: remove the heartbeat mechanism. Include the client process ID in the app_init_data file. The API periodically checks whether that process is still alive, and the app exits if it is not.
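
A minimal sketch of the proposed liveness check, assuming a POSIX system where signal 0 probes a process without delivering anything to it. The function name `client_is_alive` is hypothetical and not part of the BOINC API, and a Windows build would need a different mechanism.

```python
import os


def client_is_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    if pid <= 0:
        # PIDs <= 0 address process groups in kill(2); reject them here.
        raise ValueError("pid must be positive")
    try:
        os.kill(pid, 0)  # signal 0: existence/permission check, nothing is sent
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but belongs to another user
    return True
```

An app would call this every few seconds with the PID read from the init data file and exit once it returns False. Note that a PID can be reused by an unrelated process after the client exits, which this check cannot detect.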

Change History

09/25/07 12:37:54 changed by Nicolas

Disagreed. What if the app has a strange deadlock and CPU time is being wasted, but the process is still alive?

The actual "bug" is that the client should NEVER block for > 30 seconds.

Lack of heartbeats causing apps to quit is only one of the many problems that show up (it's just a symptom). For example, the manager also blocks waiting for RPCs. That means if the client blocks doing something, the manager will hang too (= unresponsive GUI, bad for the user; this is another symptom caused by two or three different problems).

The client shouldn't stop responding to RPCs because it's doing something else. The manager shouldn't stop responding to user input because it's waiting for an RPC.

11/02/07 23:27:34 changed by Didactylos

How about improving the heartbeat mechanism? I haven't studied it in depth, but two thoughts come to mind.

  1. Blocking functions in the client or app should be asynchronous.
  2. Would it be possible perhaps to temporarily suspend the heartbeat for an app when it is about to block? Either for a set period or until it stops blocking? There are dangers with this, but the messages would be a lot more informative.
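
To illustrate the first point, here is a rough sketch of moving a blocking call onto a worker thread so that the loop sending heartbeats keeps running. It is written in Python for brevity and is not how the BOINC client is structured; `run_with_heartbeats`, `blocking_call`, and `send_heartbeat` are hypothetical names.

```python
import threading


def run_with_heartbeats(blocking_call, send_heartbeat, interval=1.0):
    """Run blocking_call in a worker thread; keep sending heartbeats meanwhile."""
    result = {}

    def worker():
        try:
            result["value"] = blocking_call()
        except Exception as exc:  # hand the failure back to the caller
            result["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    while thread.is_alive():
        send_heartbeat()
        thread.join(timeout=interval)  # wait up to one interval, then loop
    if "error" in result:
        raise result["error"]
    return result["value"]
```

With this shape, a slow DNS lookup or disk scan delays only its own result, not the heartbeats.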

11/11/07 11:06:54 changed by Ananas (in reply to the description)

Replying to davea:

... Proposed solution: remove heartbeat mechanism. ...

If the heartbeat were redefined to be expected within 30 CPU seconds instead of 30 wall-clock seconds, heartbeats would be expected less often when the host itself is unresponsive (running 7-zip on a huge file with maximum compression has that effect). Since high load affects both the core client and the project application, using CPU time would probably be more appropriate: the project application would expect fewer heartbeats when it gets less CPU time itself.

The process CPU time should (hopefully) not be influenced by adjusting the PC clock.

There should not even be any compatibility issues, as elapsed CPU time and elapsed wall-clock time are not too different most of the time.

So (unmodified) core clients would still try to send a heartbeat at least once every 30 seconds, but the API would be more patient on overloaded systems.

That might also fix the heartbeat problem that occurs when the PC clock is adjusted.
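
A minimal sketch of this idea, assuming the app-side check is a small watchdog whose clock can be swapped. The class `HeartbeatWatchdog` and its methods are hypothetical, not BOINC API names. By default it measures the app's own CPU time via Python's `time.process_time`, which is not affected by changes to the system clock.

```python
import time


class HeartbeatWatchdog:
    """Track heartbeats against an injectable clock (CPU time by default)."""

    def __init__(self, timeout=30.0, clock=time.process_time):
        self.timeout = timeout
        self.clock = clock
        self.last_beat = clock()

    def beat(self):
        """Record that a heartbeat arrived from the client."""
        self.last_beat = self.clock()

    def expired(self):
        """True if more than `timeout` clock seconds passed since the last beat."""
        return self.clock() - self.last_beat > self.timeout
```

Passing `clock=time.monotonic` instead gives a wall-clock timeout that is still immune to clock adjustments. One consequence of the CPU-time variant is that an app that is sleeping or suspended accumulates no CPU time, so its timeout does not advance while it is idle.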

01/12/08 22:37:26 changed by duanra

This ticket well deserves its critical priority, because it is a cross-project problem. It happens with some WCG apps and boincsimap, for example. Each time my computer comes out of hibernation, it says:

Task 8011101.024330_1 exited with a DLL initialization error. If this happens repeatedly you may need to reboot your computer.

(Rebooting has no effect, of course.)

But in fact, the stderr.txt file says: No heartbeat from core client for 31 sec - exiting

05/08/08 06:47:49 changed by Pepo

  • cc set to Pepo.
