boinc over nfs: too short timeout

Message boards : BOINC client : boinc over nfs: too short timeout

s52d

Joined: 8 Aug 09
Posts: 4
Slovenia
Message 26533 - Posted: 8 Aug 2009, 8:14:15 UTC

Hello!

When disk I/O is slow (either because of NFS, or
because of heavy disk load), clients often restart.
I suspect some timeout is too short.



I have two linux PCs: A with internet connectivity, B without.
Both run setiathome.

B is NFS-mounted on A over a relatively slow, long-distance connection. A is a Q6600 box running Slackware Linux.
Both run boinc 6.4.5, 32-bit Linux.

To communicate with the project, boinc is stopped on both A
and B. Then B's boinc is restarted on A,
using B's boinc directory.

The setiathome clients are restarted very often, especially when boinc receives a command (update, quit, etc.), up to the point where no upload or download
gets done because of the continuous restarting.

On another PC, quite loaded, with SATA disks in raid,
clients "die" often when there is a lot of disk I/O.

BR

Iztok




example of clients dying:


08-Aug-2009 08:46:36 [SETI@home] Task 21dc08ad.21914.7843.16.10.24_0 exited with zero status but no 'finished' file
08-Aug-2009 08:46:36 [SETI@home] If this happens repeatedly you may need to reset the project.
08-Aug-2009 08:46:38 [SETI@home] Restarting task 21dc08ad.21914.7843.16.10.24_0 using setiathome_enhanced version 528
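
To put a number on how often this happens, the messages can be counted per task. This is only a minimal sketch, assuming the client messages are available as text lines in exactly the format shown above; the function name `count_premature_exits` is made up for illustration and is not part of BOINC.

```python
import re
from collections import Counter

# Matches the "no 'finished' file" message format from the log excerpt above.
NO_FINISHED_RE = re.compile(
    r"^(?P<stamp>\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<project>[^\]]+)\] "
    r"Task (?P<task>\S+) exited with zero status but no 'finished' file"
)


def count_premature_exits(lines):
    """Return a Counter mapping task name -> number of premature exits."""
    counts = Counter()
    for line in lines:
        match = NO_FINISHED_RE.match(line.strip())
        if match:
            counts[match.group("task")] += 1
    return counts


# The three lines from the excerpt above, used as sample input.
sample = [
    "08-Aug-2009 08:46:36 [SETI@home] Task 21dc08ad.21914.7843.16.10.24_0 "
    "exited with zero status but no 'finished' file",
    "08-Aug-2009 08:46:36 [SETI@home] If this happens repeatedly you may "
    "need to reset the project.",
    "08-Aug-2009 08:46:38 [SETI@home] Restarting task "
    "21dc08ad.21914.7843.16.10.24_0 using setiathome_enhanced version 528",
]

for task, count in count_premature_exits(sample).items():
    print(task, count)
```

A task that shows up many times in a short period is one that keeps getting restarted.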










ID: 26533
DJStarfox

Joined: 19 Jul 07
Posts: 17
Message 26594 - Posted: 13 Aug 2009, 18:44:39 UTC - in response to Message 26533.  
Last modified: 13 Aug 2009, 18:45:55 UTC

What are the values of the following computing preferences set to?

Switch between applications every ?
Write to disk at most every ?

You should increase these values to 4x what they are by default.

Also, what does the mount line in your /etc/fstab have? Specifically I'm interested in all mount options for NFS.
ID: 26594
s52d

Joined: 8 Aug 09
Posts: 4
Slovenia
Message 26612 - Posted: 15 Aug 2009, 14:59:43 UTC - in response to Message 26594.  

Hi!

Switch between applications: no switching, since only setiathome is running. It is set to 180 minutes.
Write to disk: every 180 seconds. (Three times the default is close to 4x.)

mount -t nfs far_away_pc:/home/boinc /home/boinc/far_away_box -o rsize=8192,wsize=8192,soft

ping:
14 packets transmitted, 14 received, 0% packet loss, time 13104ms
rtt min/avg/max/mdev = 49.049/49.258/50.403/0.382 ms


Anyhow, the second case (a PC with quite a lot of disk I/O on SATA RAID-5 disks)
has nothing to do with an NFS mount that is 50 ms away.


Thanks,

Iztok


ID: 26612
DJStarfox

Joined: 19 Jul 07
Posts: 17
Message 26641 - Posted: 17 Aug 2009, 15:24:27 UTC - in response to Message 26612.  

Try adding the following mount options, replacing what you have already:
soft,proto=tcp,locallocks

If that doesn't help, add these:
rsize=1024,wsize=1024,retrans=10

Is computer B using an ethernet or a wireless connection? If it is wireless, packet loss may be hurting performance/causing problems, and the second set of options should help. But reliability will be at the cost of throughput.
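
A rough way to see the throughput cost is the calculation below. It is a back-of-the-envelope sketch, not a measurement: it assumes each read or write block waits for one full round trip before the next one is sent (real NFS clients can keep several requests in flight, so actual numbers may be better), and it uses the average round-trip time from the ping output posted earlier in this thread. The function name is made up for illustration.

```python
def sync_throughput_bytes_per_s(block_size_bytes, rtt_seconds):
    """Upper bound when every block waits one full round trip (assumption)."""
    return block_size_bytes / rtt_seconds


# Average round-trip time from the ping output in this thread, in seconds.
RTT = 0.049258

for size in (8192, 1024):
    rate = sync_throughput_bytes_per_s(size, RTT)
    print(f"rsize/wsize={size}: about {rate / 1024:.0f} KiB/s")
```

Under that assumption, dropping from 8192 to 1024 bytes per request cuts the ceiling by a factor of eight, from roughly 160 KiB/s to roughly 20 KiB/s.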
ID: 26641
s52d

Joined: 8 Aug 09
Posts: 4
Slovenia
Message 26642 - Posted: 17 Aug 2009, 15:48:55 UTC - in response to Message 26641.  

Hi, it is ethernet + leased lines.

How about the other case, where seti clients restart during disk I/O by other tasks?

What is the timeout for tasks to restart? Can it be changed?

BR
Iztok
ID: 26642
Alinator

Joined: 8 Jan 06
Posts: 36
United States
Message 26645 - Posted: 17 Aug 2009, 16:10:05 UTC - in response to Message 26642.  
Last modified: 17 Aug 2009, 16:12:26 UTC

The problem is that sluggish disk I/O (and other I/O for that matter) blocks the CC from sending heartbeats to all the running applications. When this happens the applications are designed on purpose to exit on their own to prevent them from running as 'orphaned' processes. The current timeout for missing heartbeats is 30 seconds.

The humorous part about it is that the applications are doing just what they are supposed to do, and the CC doesn't realize that it was the one which triggered the anomaly by not sending the heartbeats on time (as evidenced by the log message you posted). You could reset the project until the cows come home and it won't help one iota. ;-)
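
To illustrate the idea on the application side, here is a minimal sketch of a heartbeat watchdog. It is not the actual BOINC code; the class name `HeartbeatWatchdog` and its methods are hypothetical, and only the 30 second timeout comes from the description above.

```python
import time


class HeartbeatWatchdog:
    """App-side check: give up if the client has been silent too long."""

    def __init__(self, timeout_s=30.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_beat = clock()

    def beat(self):
        """Record that a heartbeat from the client was seen."""
        self.last_beat = self.clock()

    def expired(self):
        """True once no heartbeat has been seen for longer than the timeout."""
        return self.clock() - self.last_beat > self.timeout_s


# Demo with a fake clock so nothing actually sleeps.
now = [0.0]
dog = HeartbeatWatchdog(timeout_s=30.0, clock=lambda: now[0])

now[0] = 10.0
dog.beat()            # heartbeat arrives at t = 10 s
now[0] = 39.0
print(dog.expired())  # False: only 29 s of silence
now[0] = 41.0
print(dog.expired())  # True: 31 s of silence, the app would exit here
```

The app cannot tell a dead client from one that is merely stuck in slow I/O; all it sees is the silence.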

Alinator
ID: 26645
s52d

Joined: 8 Aug 09
Posts: 4
Slovenia
Message 26655 - Posted: 18 Aug 2009, 6:27:25 UTC - in response to Message 26645.  

Hi, thanks!

Now... this 30 seconds might be nice as a parameter?


I forgot to mention: it often happens when requesting an update:
Sending scheduler request: Requested by user. Requesting 0 seconds of work, reporting 5 completed tasks

It might also be correlated with the number of WUs cached locally?

Maybe boinc is sending it every 29.5 seconds, so there is little margin for error?
Maybe sending it every 20 seconds would help?

BR
Iztok
ID: 26655
Alinator

Joined: 8 Jan 06
Posts: 36
United States
Message 26665 - Posted: 18 Aug 2009, 14:33:03 UTC - in response to Message 26655.  


Actually, in 'normal' circumstances I think the CC sends a heartbeat to the apps once a second. The thirty second timeout is built into the app, and if it doesn't see a heartbeat in that amount of time it will exit.

You are correct in surmising the problem is related to the amount of work a host is carrying. The reason is this increases the amount of overhead the CC has to handle locally (disk I/O), as well as the amount of data which gets transferred back to the projects on every scheduler contact session (network I/O).

Those functions 'lock' the shared memory (the method used by the apps and CC to communicate), and if they take longer than thirty seconds to complete, the result is a 'forced' exit by the app.
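
As a toy model of that, the sketch below treats the CC as doing one blocking operation between heartbeats and reports the first operation long enough to starve the apps. The function name and all the durations are invented for illustration; this is not how the CC is actually structured, only the thirty second limit is taken from above.

```python
def first_timeout(op_durations_s, timeout_s=30.0):
    """Return the index of the first operation that starves the apps, or None.

    Model (an assumption): the client sends a heartbeat, runs one blocking
    operation, then sends the next heartbeat. The app exits if the gap
    between two heartbeats is longer than timeout_s.
    """
    for index, duration in enumerate(op_durations_s):
        if duration > timeout_s:
            return index
    return None


# Made-up durations in seconds for a few client operations.
local_disk = [0.2, 0.5, 1.0, 0.3]
slow_nfs = [0.2, 12.0, 41.5, 0.3]

print(first_timeout(local_disk))  # None: every gap stays under 30 s
print(first_timeout(slow_nfs))    # 2: the 41.5 s operation trips the timeout
```

In this model a larger cache of work or a slower disk only matters once a single operation crosses the limit, which fits with the restarts showing up mainly around scheduler requests.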

As a side note, if you use a remote monitoring tool like BOINCview and set the update interval too short you can effectively bring task processing to a halt with the network I/O overhead! ;-)

Alinator
ID: 26665


Copyright © 2022 University of California. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.