Message boards : BOINC client : Heartbeat replacement
Author | Message |
---|---|
Joined: 27 Jun 06 Posts: 305 |
Picking up an old bug ... The "no heartbeat" problem is still annoying. I think a different method would serve the same purpose with less risk of trashing results now and then: either send the PID of the core client to the project application (via shared memory, which will cause compatibility issues), or make the project application save its PPID on startup (better backwards compatibility). Then, instead of querying the heartbeat, just check whether that process is still up and running. It isn't even necessary to check whether the process is a BOINC client or not: the round-robin cycle of process IDs is fairly long, so the chance that a different program grabs the same PID between a crash of the CC and the check is practically zero. p.s.: This would even be backwards compatible with current core clients; the only difference is that the API would ignore the heartbeat and check the PID instead. (This is both an API and a CC issue, so I wasn't sure where to put it.) |
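To make the PPID idea concrete, here is a minimal POSIX-only sketch. All function names are hypothetical and not part of the BOINC API, and it assumes the core client is the direct parent of the project application. It uses `kill(pid, 0)`, which delivers no signal and only reports whether the process exists.

```cpp
#include <cerrno>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper, not part of the BOINC API.
// Returns true if a process with the given PID currently exists (POSIX only).
bool process_exists(pid_t pid) {
    if (pid <= 0) return false;          // 0 and negative values address groups, not one process
    if (kill(pid, 0) == 0) return true;  // signal 0: existence check only, nothing is delivered
    return errno == EPERM;               // the process exists, but we may not signal it
}

// Hypothetical: called once at startup; assumes the parent is the core client.
pid_t remember_client_pid() {
    return getppid();
}

// Hypothetical replacement for the heartbeat query.
bool client_still_running(pid_t saved_client_pid) {
    // If the parent dies, the application is reparented, so getppid() changes too.
    return getppid() == saved_client_pid && process_exists(saved_client_pid);
}
```

An application would call `remember_client_pid()` once at startup and then poll `client_still_running()` at whatever interval it used for the heartbeat. On Windows the same idea would need a process handle instead of `kill`; that part is not sketched here.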
Joined: 3 Apr 06 Posts: 547 |
The "no heartbeat" problem is still annoying indeed... "I think there would be a different method that would serve the same purpose with less risk of trashing results now and then" The problem (which Dr. A. seems to belittle) is that the applications (and, vice versa, the client too) should check whether the particular process with that PID is really aware of what it is doing, not just being around... Otherwise I can see no point in applications checking the validity of the client's PID. Whether they ignore it or the client is stuck, they would just crunch until finished, and finally the machine would fall asleep anyway. (OK, then we would need some way to kill orphaned applications.) Peter |
Joined: 27 Jun 06 Posts: 305 |
Have a look at those results: Einstein on 4.19 / Linux. No heartbeat, not even any shared memory communication at all, and it still works like a charm :-) (I think they changed a keyword, so the current API doesn't recognize the shared memory segment of the old CC anymore.) Checking the core client PID from the project applications is necessary. In the days of CC 4.13, a lot of CPDN models crashed because CPDN was still running after CC 4.13 had died. A BOINC restart usually killed the CPDN model if you didn't take care to kill the leftover processes first: it tried to start a second CPDN application on the same model, and of course the applications didn't want to share. Don't forget: the heartbeat makes the project application check the core client, not vice versa! As long as the CC runs, it actually plays no role whether it works or is stuck in an endless loop; the project application is allowed to run, and that's the important part. So it is not about orphaned project applications. Checking the plain existence of the core client process is therefore absolutely sufficient; the heartbeat stuff is just a trap for trashing workunits, not useful in any way. |
Joined: 27 Jun 06 Posts: 305 |
Another possible bugfix would probably be: void boinc_sleep(double seconds) { seconds = min(seconds, 25.0); /* never sleep longer than the heartbeat interval */ #ifdef _WIN32 ::Sleep(max((DWORD)(1000*seconds), (DWORD)1)); #else ... Note: "The Sleep function suspends the execution of the current thread for a specified interval. ... A value of zero causes the thread to relinquish the remainder of its time slice to any other thread of equal priority that is ready to run." |
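A portable sketch of the same clamping idea, for illustration only. The helper names are hypothetical, the 25-second cap is taken from the snippet above, and the 1 ms floor assumes the intent is never to pass zero to the sleep call.

```cpp
#include <algorithm>
#include <chrono>
#include <thread>

// Cap taken from the 25.0 in the snippet above; treat it as an assumption.
constexpr double kMaxSleepSeconds = 25.0;

// Hypothetical helper: convert a requested sleep to whole milliseconds,
// clamped to the range [1, 25000].
long long clamped_sleep_ms(double seconds) {
    double capped = std::min(std::max(seconds, 0.0), kMaxSleepSeconds);
    long long ms = static_cast<long long>(capped * 1000.0);
    return std::max(ms, 1LL);  // never ask for a zero-length sleep
}

// Hypothetical portable stand-in for the sleep call itself.
void capped_sleep(double seconds) {
    std::this_thread::sleep_for(std::chrono::milliseconds(clamped_sleep_ms(seconds)));
}
```

Keeping the clamp in a separate function makes the limits easy to check without actually sleeping.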
Joined: 27 Jun 06 Posts: 305 |
I know now when it happens, or at least one of the possible situations: when the name server cannot be reached, the core client seems to be completely stuck, so it has no time to write the heartbeat timestamp into shared memory. My ISP has a "forced disconnect" after 24 hours, after which an immediate reconnect isn't possible for about 10 minutes. It does not happen at a certain time of day, just now and then. Core clients that try to connect at just that moment trigger "no heartbeat" messages from the project applications on that computer. I have configured several name servers, not just the router. I'm not sure whether that plays a role. |
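The failure mode described here can be modelled in a few lines. This is a hypothetical model of the application-side check, not BOINC's actual code; the struct, its members and the 30-second timeout are all assumptions made for illustration.

```cpp
// Hypothetical model of the application-side heartbeat check.
// Names and the timeout value are assumptions, not BOINC's real code.
struct HeartbeatMonitor {
    double last_heartbeat;  // wall-clock time (s) of the last timestamp the client wrote
    double timeout;         // give up when no update has arrived for this long (s)

    // Called whenever a fresh timestamp from the client is seen.
    void on_heartbeat(double now) { last_heartbeat = now; }

    // True once the client has been silent for longer than the timeout.
    bool expired(double now) const { return now - last_heartbeat > timeout; }
};
```

With `HeartbeatMonitor m{0.0, 30.0}`, a client that is blocked for 10 minutes gives `m.expired(600.0) == true`, although the client process still exists the whole time. A plain process-existence check would not trip in that situation.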
Joined: 27 Jun 06 Posts: 305 |
I think I know our enemy now; it's one (or both) of these two: "svchost.exe -k NetworkService" and "svchost.exe -k netsvcs". The same occurs on my PC at work; not with BOINC there, but other networking applications become unresponsive when one of these two takes 99% CPU load. Having multiple CPUs does not seem to help; they seem to slow down some core services that affect all TCP applications. p.s.: As the API communicates through shared memory, not local TCP, it will not help to use CPU seconds instead of real-time seconds for the heartbeat timeout :-( |
Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.