Message boards : BOINC client : boinc over nfs: too short timeout
Author | Message |
---|---|
Send message Joined: 8 Aug 09 Posts: 4 |
Hello! When disk I/O is slow (either because of NFS or because of heavy disk load), clients often restart. I suspect some timeout is too short. I have two Linux PCs: A with internet connectivity, B without. Both run setiathome. B is NFS-mounted on A over a relatively slow, long-distance connection. A is a Q6600 box running Slackware Linux. Both run boinc 6.4.5, 32-bit Linux. To communicate with the project, boinc is stopped on A and B. Then B's boinc is restarted on A, using B's boinc directory. setiathome clients are very often restarted, especially when boinc receives some command (update, quit, etc.), up to the point where no upload/download is done due to continuous restarting. On another PC, quite loaded, with SATA disks in RAID, clients "die" often when there is a lot of disk I/O. BR Iztok Example of clients dying: 08-Aug-2009 08:46:36 [SETI@home] Task 21dc08ad.21914.7843.16.10.24_0 exited with zero status but no 'finished' file 08-Aug-2009 08:46:36 [SETI@home] If this happens repeatedly you may need to reset the project. 08-Aug-2009 08:46:38 [SETI@home] Restarting task 21dc08ad.21914.7843.16.10.24_0 using setiathome_enhanced version 528 |
Send message Joined: 19 Jul 07 Posts: 17 |
What are the values of the following computing preferences set to? Switch between applications every ? Write to disk at most every ? You should increase these values to 4x what they are by default. Also, what does the mount line in your /etc/fstab have? Specifically I'm interested in all mount options for NFS. |
Send message Joined: 8 Aug 09 Posts: 4 |
What are the values of the following computing preferences set to? Hi! Switch: no switching, running only setiathome. Set to 180 minutes. Writes every 180 seconds (three times is close to 4). mount -t nfs far_away_pc:/home/boinc /home/boinc/far_away_box -o rsize=8192,wsize=8192,soft ping: 14 packets transmitted, 14 received, 0% packet loss, time 13104ms rtt min/avg/max/mdev = 49.049/49.258/50.403/0.382 ms Anyhow, the second case (a PC with quite some disk I/O on SATA RAID-5 disks) has nothing to do with an NFS mount 50 ms away. Thanks, Iztok |
Send message Joined: 19 Jul 07 Posts: 17 |
Try adding the following mount options, replacing what you have already: soft,proto=tcp,locallocks If that doesn't help, add these: rsize=1024,wsize=1024,retrans=10 Is computer B using an ethernet or a wireless connection? If it is wireless, packet loss may be hurting performance/causing problems, and the second set of options should help. But reliability will be at the cost of throughput. |
Send message Joined: 8 Aug 09 Posts: 4 |
Try adding the following mount options, replacing what you have already: Hi, it is Ethernet plus leased lines. How about the other case, where seti clients restart during disk I/O by other tasks? What is the timeout for tasks to restart? Can it be changed? BR Iztok |
Send message Joined: 8 Jan 06 Posts: 36 |
The problem is that sluggish disk I/O (and other I/O for that matter) blocks the CC from sending heartbeats to all the running applications. When this happens the applications are designed on purpose to exit on their own to prevent them from running as 'orphaned' processes. The current timeout for missing heartbeats is 30 seconds. The humorous part about it is that the applications are doing just what they are supposed to do, and the CC doesn't realize that it was the one which triggered the anomaly by not sending the heartbeats on time (as evidenced by the log message you posted). You could reset the project until the cows come home and it won't help one iota. ;-) Alinator |
Send message Joined: 8 Aug 09 Posts: 4 |
The problem is that sluggish disk I/O (and other I/O for that matter) blocks the CC from sending heartbeats to all the running applications. When this happens the applications are designed on purpose to exit on their own to prevent them from running as 'orphaned' processes. The current timeout for missing heartbeats is 30 seconds. Hi, thanks! Now... might this 30 seconds be nice as a parameter? I forgot to mention: it happens often when requesting an update: Sending scheduler request: Requested by user. Requesting 0 seconds of work, reporting 5 completed tasks It might also be correlated with the number of WUs cached locally? Maybe boinc is sending it every 29.5 seconds, so there is little margin for error? Maybe sending it every 20 seconds would help? BR Iztok |
Send message Joined: 8 Jan 06 Posts: 36 |
Actually, in 'normal' circumstances I think the CC sends a heartbeat to the apps once a second. The thirty second timeout is built into the app, and if it doesn't see a heartbeat in that amount of time it will exit. You are correct in surmising the problem is related to the amount of work a host is carrying. The reason is this increases the amount of overhead the CC has to handle locally (disk I/O), as well as the amount of data which gets transferred back to the projects on every scheduler contact session (network I/O). Those functions 'lock' the shared memory (the method used by the apps and CC to communicate), and if they take longer than thirty seconds to complete will result in the 'forced' exit by the app. As a side note, if you use a remote monitoring tool like BOINCview and set the update interval too short you can effectively bring task processing to a halt with the network I/O overhead! ;-) Alinator |
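
Editor's note: the heartbeat behaviour described in the last posts can be illustrated with a small simulation. This is only a sketch under assumptions, not BOINC's actual code: the names `HeartbeatWatchdog` and `simulate` are hypothetical, and the only figures taken from the thread are the roughly one-second heartbeat interval and the 30-second timeout built into the app. It shows why a client stall longer than the timeout makes the app exit even though nothing is wrong with the app itself.

```python
HEARTBEAT_TIMEOUT = 30.0  # seconds; value quoted in the thread, assumed fixed


class HeartbeatWatchdog:
    """App-side view: remembers when the last heartbeat arrived (hypothetical)."""

    def __init__(self, timeout=HEARTBEAT_TIMEOUT, now=0.0):
        self.timeout = timeout
        self.last_heartbeat = now

    def heartbeat(self, now):
        # Called whenever the client manages to deliver a heartbeat.
        self.last_heartbeat = now

    def should_exit(self, now):
        # The app gives up once the silence exceeds the timeout.
        return now - self.last_heartbeat > self.timeout


def simulate(stall_seconds, interval=1.0, timeout=HEARTBEAT_TIMEOUT):
    """Return True if the app would exit while the client is stalled.

    The client first sends a few heartbeats every `interval` seconds, then
    blocks (for example on slow disk or network I/O) for `stall_seconds`.
    The app checks its watchdog once per simulated second during the stall.
    """
    watchdog = HeartbeatWatchdog(timeout=timeout, now=0.0)
    t = 0.0
    for _ in range(5):
        t += interval
        watchdog.heartbeat(t)

    elapsed = 0.0
    while elapsed < stall_seconds:
        elapsed += 1.0
        if watchdog.should_exit(t + elapsed):
            return True
    return False


if __name__ == "__main__":
    for stall in (10, 30, 45):
        print(f"client stalled {stall:>2} s -> app exits: {simulate(stall)}")
```

In this model, sending heartbeats more often (every 20 seconds instead of every second, say) would not help: what matters is how long the client stays blocked, not how frequently it sends heartbeats when it is not blocked.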
Copyright © 2025 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.