Thread 'Heartbeat replacement'

Message boards : BOINC client : Heartbeat replacement

Ananas
Joined: 27 Jun 06
Posts: 305
Germany
Message 17465 - Posted: 25 May 2008, 8:15:19 UTC
Last modified: 25 May 2008, 8:27:41 UTC

Picking up an old bug ...

The "no heartbeat" problem is still annoying. I think there would be a different method that would serve the same purpose with less risk of trashing results now and then:

Either send the PID of the core client to the project application (shmem, will cause compatibility issues) or make the project application save its PPID on startup (better backwards compatibility).

Then, instead of this heartbeat query, just check if that task is still up and running.
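A minimal sketch of that check, assuming a POSIX system (Windows is not covered here). The helper names process_exists and remember_client_pid are hypothetical and are not part of the BOINC API. It relies on kill() with signal 0, which performs the existence and permission checks without delivering a signal.

```cpp
#include <cerrno>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: returns true if a process with this PID exists.
// kill() with signal 0 only performs the existence and permission checks.
bool process_exists(pid_t pid) {
    if (pid <= 0) return false;          // 0 and negative values address process groups
    if (kill(pid, 0) == 0) return true;  // process exists and we may signal it
    return errno == EPERM;               // process exists but belongs to another user
}

// Hypothetical helper: the application records its parent (assumed to be
// the core client) once at startup and keeps the value for later checks.
pid_t remember_client_pid() {
    return getppid();
}
```

The application would call remember_client_pid() once at startup and then call process_exists() on the stored value at whatever interval the heartbeat is checked today.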


It isn't even necessary to check whether the process is a BOINC client or not. The round-robin cycle of the process IDs is fairly long, so the chance that a different program grabs the same PID between a crash of the CC and the check whether it is still running is practically zero.
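On POSIX systems the PID-reuse question can be sidestepped entirely, because a child whose parent exits is re-parented and getppid() then returns a different value. The following is only a sketch under that assumption; the class name ParentWatch is hypothetical and nothing like it is claimed to exist in the BOINC API.

```cpp
#include <sys/types.h>
#include <unistd.h>

// Hypothetical watcher: records the parent PID once at construction and
// later reports whether the parent has changed. On POSIX systems the
// parent changes when the original parent exits and the child is
// re-parented (for example to init).
class ParentWatch {
public:
    ParentWatch() : start_ppid_(getppid()) {}

    // True once the process no longer has its original parent.
    bool parent_gone() const { return getppid() != start_ppid_; }

    pid_t start_ppid() const { return start_ppid_; }

private:
    pid_t start_ppid_;
};
```

This only works when the core client is the direct parent of the application process; with a wrapper in between, the wrapper would be watched instead.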


p.s.: This would even be backwards compatible with current core clients; the only difference is that the API would ignore the heartbeat and check the PID instead.

(This is an API and a CC issue, so I wasn't sure where to put it)
ID: 17465
Pepo
Joined: 3 Apr 06
Posts: 547
Slovakia
Message 17496 - Posted: 28 May 2008, 16:00:04 UTC - in response to Message 17465.  

> The "no heartbeat" problem is still annoying

indeed...

> I think there would be a different method that would serve the same purpose with less risk of trashing results now and then :
> ...send the PID of the core client to the project application...
> Then, instead of this heartbeat query, just check if that task is still up and running.

The problem (which Dr. A. seems to belittle) is that the applications (and, vice versa, the client) should check whether the particular process with that PID is really aware of what it is doing, not just whether it is around...

Otherwise I could see no point in applications checking the client's PID validity. Whether they ignore it or the client is stuck, they would just crunch until finished and finally the machine would fall asleep anyway. (OK, then we would need some way to kill orphaned applications.)

Peter
ID: 17496
Ananas
Joined: 27 Jun 06
Posts: 305
Germany
Message 17548 - Posted: 29 May 2008, 23:25:53 UTC
Last modified: 29 May 2008, 23:33:28 UTC

Have a look at those results :

Einstein on 4.19 / Linux

No heartbeat, not even any shared memory communication at all, and it still works like a charm :-)
(I think they changed a keyword so the current API doesn't recognize the shared memory segment of the old CC anymore)


Checking the core client PID from the client applications is necessary.

In the days of CC 4.13, a lot of CPDN models crashed because CPDN was still running after CC 4.13 had died. A BOINC restart usually killed the CPDN model if you didn't take care to kill the processes first: it tried to start a second CPDN application on the same model, and of course the applications didn't want to share.

Don't forget - the heartbeat makes the project application check the core client, not vice versa! As long as the CC runs, it actually plays no role whether it works or got stuck in a dead loop; the project application is allowed to run, and that's the important part. So it is not about orphaned project applications.

So checking the plain existence of the core client task is absolutely sufficient; the heartbeat stuff is just a trap for trashing workunits, not useful in any way.
ID: 17548
Ananas
Joined: 27 Jun 06
Posts: 305
Germany
Message 17763 - Posted: 10 Jun 2008, 17:29:30 UTC
Last modified: 10 Jun 2008, 17:34:30 UTC

Another possible bugfix would probably be :

void boinc_sleep(double seconds) {
    seconds = min(seconds, 25.0); // never sleep longer than the heartbeat interval
#ifdef _WIN32
    // sleep at least 1 ms, because Sleep(0) only yields to threads of equal priority
    ::Sleep(max((DWORD)(1000*seconds), (DWORD)1));
#else
...

Note :

The Sleep function suspends the execution of the current thread for a specified interval. ... A value of zero causes the thread to relinquish the remainder of its time slice to any other thread of equal priority that is ready to run.
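The clamping in that fix can be isolated into a small, platform-independent helper, which makes the two limits easy to check. This is only a sketch: the name clamped_sleep_ms is hypothetical, and the 25-second default is taken from the comment in the snippet above rather than from the BOINC sources.

```cpp
#include <algorithm>

// Hypothetical helper: converts a requested sleep time in seconds to
// milliseconds, limited to the range [1 ms, max_seconds].
// The upper limit keeps the sleep shorter than the heartbeat interval;
// the lower limit avoids a zero-length sleep.
unsigned long clamped_sleep_ms(double seconds, double max_seconds = 25.0) {
    double s = std::min(std::max(seconds, 0.0), max_seconds);
    unsigned long ms = static_cast<unsigned long>(1000.0 * s);
    return std::max(ms, 1UL);
}
```

On Windows the result would be passed to ::Sleep; elsewhere it could be passed to whatever sleep call the non-Windows branch already uses.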
ID: 17763
Ananas
Joined: 27 Jun 06
Posts: 305
Germany
Message 18827 - Posted: 25 Jul 2008, 5:37:13 UTC
Last modified: 25 Jul 2008, 5:41:16 UTC

I now know when it happens, or at least one of the possible situations:

When the name server cannot be accessed, the core client seems to be completely stuck, so it has no time to write the heartbeat timestamp into the shared memory.

My ISP has some "forced disconnect" after 24 hours, after which an immediate reconnect isn't possible for about 10 minutes. It does not happen at a certain time of the day, just now and then.

Core clients that try to connect just at that time cause "no heartbeat" messages from the project applications on that computer.


I have configured several name servers, not just the router. I'm not sure whether that plays a role.
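To illustrate why a blocked core client trashes a running task, here is a simplified model of a timestamp-based heartbeat check. It is a sketch only: the struct HeartbeatMonitor and its fields are hypothetical, the timeout is a parameter rather than the real BOINC value, and the actual API code is not reproduced here.

```cpp
// Hypothetical model of the application-side check: the client writes a
// timestamp, and the application gives up once the newest timestamp is
// older than the timeout.
struct HeartbeatMonitor {
    double timeout_seconds;  // assumed value, supplied by the caller
    double last_heartbeat;   // time of the most recent heartbeat, in seconds

    // Called whenever a fresh timestamp from the client is seen.
    void on_heartbeat(double now) { last_heartbeat = now; }

    // True when no heartbeat arrived within the timeout.
    bool expired(double now) const {
        return now - last_heartbeat > timeout_seconds;
    }
};
```

With any timeout well below ten minutes, a client that stays blocked for the whole forced-disconnect window makes expired() return true, even though the client process itself never went away.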
ID: 18827
Ananas
Joined: 27 Jun 06
Posts: 305
Germany
Message 25573 - Posted: 21 Jun 2009, 0:48:56 UTC
Last modified: 21 Jun 2009, 0:56:20 UTC

I think I know our enemy now, it's one (or both) of those two :

svchost.exe -k NetworkService

svchost.exe -k netsvcs

The same occurs on my PC at work. BOINC isn't running there, but other networking applications become unresponsive when one of these two takes 99% CPU load.

Having multiple CPUs does not seem to help; these processes seem to slow down some core services that affect all TCP applications.


p.s.: As the API communicates through memory, not local TCP, it will not help to use CPU seconds instead of realtime seconds for the heartbeat timeout :-(
ID: 25573

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.