Ticket #336 (new Defect)

Opened 2 years ago

Last modified 3 months ago

replace heartbeat mechanism

Reported by: davea
Assigned to: davea
Priority: Critical
Milestone: Undetermined
Component: BOINC - API
Version:
Keywords:
Cc: Pepo

Description

Problem with the heartbeat mechanism: if the client does something that blocks for > 30 secs (e.g. a synchronous DNS lookup, a disk-space scan, a debugger break) then all apps quit, producing confusing messages and possibly wasting CPU time.

Proposed solution: remove the heartbeat mechanism. Include the client process ID in the app_init_data file. The API periodically checks whether that process is still alive, and exits if not.
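To make the proposal concrete, here is a minimal sketch of the liveness check, written in Python rather than the API's own language. It assumes a POSIX host; the function name client_is_alive is hypothetical and not part of the BOINC API. Signal 0 performs only the existence and permission check, so nothing is delivered to the client. On Windows, os.kill behaves differently and must not be used this way.

```python
import os


def client_is_alive(pid: int) -> bool:
    """Return True if a process with this PID appears to exist (POSIX only).

    Hypothetical helper for illustration; not the BOINC API.
    """
    if pid <= 0:
        # 0 and negative values address process groups, not a single process.
        return False
    try:
        os.kill(pid, 0)  # signal 0: error checking only, no signal is sent
    except ProcessLookupError:
        return False     # no such process
    except PermissionError:
        return True      # process exists but belongs to another user
    return True
```

One caveat with this approach: PIDs can be reused, so a long-running app could in principle mistake an unrelated process for the client. Recording the client's start time next to its PID would be one way to guard against that.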

Change History

(follow-up: ↓ 8 ) 09/25/07 12:37:54 changed by Nicolas

Disagreed. What if the app has a strange deadlock and CPU time is being wasted, but the process is still alive?

The actual "bug" is that the client should NEVER block for > 30 seconds.

Lack of heartbeats causing apps to quit is only one of the many problems that show up (it's just a symptom). For example, the manager also blocks waiting for RPCs. That means if the client blocks doing something, the manager will hang too (= unresponsive GUI, bad for the user; this is another symptom caused by two or three different problems).

The client shouldn't stop responding to RPCs because it's doing something else. The manager shouldn't stop responding to user input because it's waiting for an RPC.

11/02/07 23:27:34 changed by Didactylos

How about improving the heartbeat mechanism? I haven't studied it in depth, but two thoughts come to mind.

  1. Blocking functions in the client or app should be asynchronous.
  2. Would it be possible perhaps to temporarily suspend the heartbeat for an app when it is about to block? Either for a set period or until it stops blocking? There are dangers with this, but the messages would be a lot more informative.
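A rough sketch of the second idea, in Python for brevity. All names here (PausableHeartbeat, expect_block) are hypothetical, and how the client would announce the pause to the app is left open. The pause is bounded by a maximum duration, which addresses the main danger: a client that announces a block and then dies cannot keep the app alive forever.

```python
import time
from typing import Callable, Optional


class PausableHeartbeat:
    """App-side heartbeat check with a bounded grace period (illustrative only)."""

    def __init__(self, timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout = timeout
        self._clock = clock
        self._last_beat = clock()
        self._grace_until: Optional[float] = None

    def beat(self) -> None:
        """Record a heartbeat; a real heartbeat ends any grace period."""
        self._last_beat = self._clock()
        self._grace_until = None

    def expect_block(self, max_seconds: float) -> None:
        """The client announces that it may block for up to max_seconds."""
        self._grace_until = self._clock() + max_seconds

    def timed_out(self) -> bool:
        now = self._clock()
        base = self._last_beat
        if self._grace_until is not None:
            if now <= self._grace_until:
                return False  # still inside the announced blocking window
            # After the window ends, the normal timeout starts counting again.
            base = max(base, self._grace_until)
        return now - base > self._timeout
```

With this shape, the app could also log "client announced a block of up to N seconds and never came back", which is more informative than a bare timeout message.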

(in reply to: ↑ description ; follow-up: ↓ 6 ) 11/11/07 11:06:54 changed by Ananas

Replying to davea:

... Proposed solution: remove heartbeat mechanism. ...

If the heartbeat were redefined to be expected within 30 CPU seconds instead of 30 wall-clock seconds, heartbeats would be expected less often when the host itself is unresponsive (compressing a huge file with 7-Zip at maximum compression has that effect). Because high load affects both the core client and the project application, using CPU time would probably be more appropriate: the project application would expect fewer heartbeats when it gets less CPU time itself.

The process CPU time should (hopefully) not be influenced by adjusting the PC clock.

There should not even be any compatibility issues as elapsed CPU time and elapsed wallclock time are not too different most of the time.

So the (unmodified) core clients would still try to send a heartbeat at least every 30 seconds, but the API would be more patient on overloaded systems.

That might fix the heartbeat problem when adjusting the PC time too.
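A minimal sketch of that idea, in Python for brevity and not taken from the BOINC API. The class name HeartbeatMonitor is hypothetical. The timeout is measured against a pluggable clock. With time.process_time (the CPU time consumed by the current process) the deadline only advances while the app itself is actually running, and it is not affected by changes to the system clock.

```python
import time
from typing import Callable


class HeartbeatMonitor:
    """Heartbeat timeout measured against an injectable clock (illustrative only)."""

    def __init__(self, timeout: float = 30.0,
                 clock: Callable[[], float] = time.process_time) -> None:
        self._timeout = timeout
        self._clock = clock          # default: CPU seconds used by this process
        self._last_beat = clock()

    def beat(self) -> None:
        """Record that a heartbeat arrived from the client."""
        self._last_beat = self._clock()

    def timed_out(self) -> bool:
        """True if more than `timeout` clock-seconds passed since the last beat."""
        return self._clock() - self._last_beat > self._timeout
```

Passing clock=time.monotonic instead gives a wall-clock style check that is still immune to clock adjustments, which makes it easy to compare the two behaviours.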

01/12/08 22:37:26 changed by duanra

This ticket well deserves its critical priority, because it is a cross-project problem. It happens with some WCG apps and boincsimap, for example. Each time my computer comes out of hibernation mode, it says:

Task 8011101.024330_1 exited with a DLL initialization error. If this happens repeatedly you may need to reboot your computer.

(No effect on rebooting, of course)

But in fact, the stderr.txt file says: No heartbeat from core client for 31 sec - exiting

05/08/08 06:47:49 changed by Pepo

  • cc set to Pepo.

(in reply to: ↑ 3 ; follow-up: ↓ 7 ) 05/15/08 16:43:44 changed by Nicolas

Replying to Ananas:

If the heartbeat were redefined to be expected within 30 CPU seconds instead of 30 wall-clock seconds, heartbeats would be expected less often when the host itself is unresponsive (compressing a huge file with 7-Zip at maximum compression has that effect). Because high load affects both the core client and the project application, using CPU time would probably be more appropriate: the project application would expect fewer heartbeats when it gets less CPU time itself.

Heartbeats in the science app are handled by a separate thread that is not low-priority, so other computer load shouldn't affect it.

(in reply to: ↑ 6 ) 05/15/08 18:21:48 changed by Pepo

Replying to Nicolas:

Replying to Ananas:

Because high load affects both the core client and the project application, using CPU time would probably be more appropriate: the project application would expect fewer heartbeats when it gets less CPU time itself.

Heartbeats in the science app are handled by a separate thread that is not low-priority, so other computer load shouldn't affect it.

In my experience, upon resuming from hibernation the apps often seem to be crunching heavily before deciding to disappear, so the switch from wall-clock time to CPU time would not change the behavior.

Maybe the client will finally also get a separate heartbeat thread that is not low-priority...

(in reply to: ↑ 1 ) 05/28/08 02:35:12 changed by jbk

Replying to Nicolas:

Disagreed. What if the app has a strange deadlock and CPU time is being wasted, but the process is still alive?

The proposal is to include the process ID of the core client so that the science app can check it every now and then. The situation where the science app enters a deadlock is irrelevant to both the current and the proposed system. It's true that the core client could enter a busy-waiting state and keep science apps artificially running. But in that case we've pretty much failed anyway, so it probably doesn't matter.

To keep track of runaway science apps you could add an additional system that checks on the reports coming from science apps:

  • You could limit how long an app is allowed to run without producing any progress reports
  • You could limit how long an app is allowed to run without producing any forward progress
  • Both of the above limits could be made on both CPU-time and wall-time (with tests for time-skips caused by hibernation or summertime/wintertime kicking in).
  • The limits could be part of the workunit XML to allow projects to configure them for their apps.
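As a sketch of the first two limits (Python, hypothetical names throughout; nothing here exists in the client today): the watchdog records when the last report arrived and when the reported fraction last increased, and the caller supplies the timestamps. That way the same class can be fed CPU time or wall time, and any time-skip correction can be applied before the value is passed in.

```python
class ProgressWatchdog:
    """Client-side check for runaway apps (illustrative only).

    max_silence: seconds allowed without any progress report
    max_stall:   seconds allowed without the reported fraction increasing
    """

    def __init__(self, max_silence: float, max_stall: float, now: float) -> None:
        self.max_silence = max_silence
        self.max_stall = max_stall
        self.last_report = now
        self.last_advance = now
        self.best_fraction = 0.0

    def report(self, fraction: float, now: float) -> None:
        """Record a progress report from the app."""
        self.last_report = now
        if fraction > self.best_fraction:
            self.best_fraction = fraction
            self.last_advance = now

    def verdict(self, now: float) -> str:
        """Return 'ok', 'no_reports', or 'no_progress'."""
        if now - self.last_report > self.max_silence:
            return "no_reports"
        if now - self.last_advance > self.max_stall:
            return "no_progress"
        return "ok"
```

In this sketch the two limits are plain constructor arguments. Under the last point above they would instead be read from the workunit, so each project could tune them per app.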

02/25/09 14:41:57 changed by romw

  • milestone changed from 6.6 to Undetermined.
