BOINC starving while waiting for memory

Message boards : Questions and problems : BOINC starving while waiting for memory
Augustine
Joined: 10 Mar 06
Posts: 73
Message 49022 - Posted: 7 May 2013, 18:46:00 UTC
Last modified: 7 May 2013, 18:46:33 UTC



This is a curious situation. As you can see, the high-priority RNA WU is waiting for memory to run, while the WUs of all the other projects are suspended, probably to make room for the RNA WU. But since they are suspended in memory, this is probably a deadlock.

Perhaps when the NCI WUProp WU ends, the conditions to run the RNA WU will be met. But if they aren't, what then?

Still, it's kind of bizarre to suspend almost all projects for a possibly dead-locked situation.

Please advise.

TIA
ID: 49022 · Report as offensive
Augustine
Joined: 10 Mar 06
Posts: 73
Message 49023 - Posted: 7 May 2013, 18:48:40 UTC - in response to Message 49022.  
Last modified: 7 May 2013, 18:57:19 UTC

Suspending the RNA WU lets the other WUs run. Resuming it afterwards lets the other WUs continue to run.
ID: 49023 · Report as offensive
Profile Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 49027 - Posted: 7 May 2013, 19:45:49 UTC

What kind of system is this on?
How much memory?
Number of CPUs BOINC can use?
Any GPUs?
Which Linux?
Resource shares?
Any multi-threading applications?
Anything I forgot to ask? ;-)
ID: 49027 · Report as offensive
Augustine
Joined: 10 Mar 06
Posts: 73
Message 49029 - Posted: 7 May 2013, 20:05:54 UTC - in response to Message 49027.  

This is the system in question: http://bit.ly/18U8oTd.

HTH
ID: 49029 · Report as offensive
Augustine
Joined: 10 Mar 06
Posts: 73
Message 49030 - Posted: 7 May 2013, 20:19:32 UTC - in response to Message 49023.  

When BOINC got back to the situation in the previous image, suspending the NCI WUProp WU let the RNA WU resume. Then, resuming the WUProp WU left the RNA WU running, and the NFS WU resumed as well.
ID: 49030 · Report as offensive
Augustine
Joined: 10 Mar 06
Posts: 73
Message 49031 - Posted: 7 May 2013, 20:20:29 UTC - in response to Message 49027.  

Anything I forgot to ask? ;-)

I'm Pisces. :-)

ID: 49031 · Report as offensive
Profile Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 49032 - Posted: 7 May 2013, 20:45:25 UTC - in response to Message 49030.  
Last modified: 7 May 2013, 20:47:31 UTC

Those RNA tasks, how much memory do they want?
With only 1 GB in the system, you may have trouble.

Does that task stay in memory while waiting for more memory?
How much memory do any of the tasks want when they run?
ID: 49032 · Report as offensive
Profile Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 49033 - Posted: 7 May 2013, 20:45:43 UTC - in response to Message 49031.  

Anything I forgot to ask? ;-)

I'm Pisces. :-)

LOL, thanks, already got one of those. ;-)
ID: 49033 · Report as offensive
Profile Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 49034 - Posted: 7 May 2013, 20:56:58 UTC
Last modified: 7 May 2013, 20:58:56 UTC

I had a quick chat with one of the developers, and he thinks it's a bug. So I'm sending mail to all developers.

In the meantime, can you post a log with only <cpu_sched_debug> activated?
Then one with only <rr_simulation>.
And one with only <sched_op_debug>.
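
These flags live in the <log_flags> section of the client's cc_config.xml. A minimal sketch in Python that generates such a file with only the chosen flag(s) switched on; the helper name build_cc_config is made up for illustration, and where the file belongs depends on your BOINC data directory.

```python
import xml.etree.ElementTree as ET


def build_cc_config(flags):
    """Return cc_config.xml text with each named log flag set to 1.

    Hypothetical helper for illustration; it only builds the XML text.
    """
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        # Each flag is an element of its own, e.g. <cpu_sched_debug>1</cpu_sched_debug>
        ET.SubElement(log_flags, name).text = "1"
    if hasattr(ET, "indent"):  # pretty-printing is only available on Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # One flag per run, as requested above.
    print(build_cc_config(["cpu_sched_debug"]))
```

Save the printed text as cc_config.xml in the BOINC data directory, then restart the client or have it re-read its config files. Swap in rr_simulation and sched_op_debug for the other two logs.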

Thanks.
ID: 49034 · Report as offensive
Augustine
Joined: 10 Mar 06
Posts: 73
Message 49065 - Posted: 9 May 2013, 16:28:33 UTC - in response to Message 49034.  
Last modified: 9 May 2013, 16:28:48 UTC

In the meantime, can you post a log with only <cpu_sched_debug> activated?

Here's the log, with <mem_usage_debug> enabled too, from when I performed the actions in http://bit.ly/YsPstm: http://pastebin.com/zUvwM5xv.

HTH
ID: 49065 · Report as offensive
Profile Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 49066 - Posted: 9 May 2013, 22:48:24 UTC

http://boinc.berkeley.edu/trac/changeset/4323afee1fcde44055dc35d03aefbdfae84fd220/boinc-v2:

client: task schedule tweak to avoid starvation case

In enforce_run_list(), don't count the RAM usage of NCI tasks. NCI tasks run sporadically, so it doesn't make sense to count it; doing so can starve regular jobs in some cases.
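
To picture what that change does, here is a toy model in Python. It is not BOINC's actual code: the function toy_enforce_run_list, the Task class, and all memory figures are invented for illustration. It only shows how charging an NCI task's RAM against the limit can keep a large regular task waiting for memory.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    ram_mb: int
    nci: bool = False  # True for a non-CPU-intensive task


def toy_enforce_run_list(run_list, ram_limit_mb, ncpus, count_nci_ram):
    """Toy scheduler pass: return the names of the tasks that get to run.

    run_list is assumed to be in priority order. NCI tasks always run and
    take no CPU; count_nci_ram decides whether their RAM is charged.
    """
    ram_left = ram_limit_mb
    cpus_left = ncpus
    running = []
    for task in run_list:
        if task.nci:
            running.append(task.name)
            if count_nci_ram:
                ram_left -= task.ram_mb
    for task in run_list:
        if task.nci:
            continue
        if cpus_left == 0:
            break
        if task.ram_mb > ram_left:
            continue  # this task is left "waiting for memory"
        running.append(task.name)
        ram_left -= task.ram_mb
        cpus_left -= 1
    return running


if __name__ == "__main__":
    # Invented numbers: a 900 MB limit, one big task, one small one, one NCI task.
    tasks = [Task("RNA", 850), Task("NFS", 400), Task("WUProp", 100, nci=True)]
    print(toy_enforce_run_list(tasks, 900, 3, count_nci_ram=True))
    print(toy_enforce_run_list(tasks, 900, 3, count_nci_ram=False))
```

With the NCI task's RAM counted, the big task never fits; with it ignored, the big task runs. The real client's behaviour is more involved than this, so treat it as a picture of the idea only.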

Do you still build your own BOINC versions? You may want to git (yes, pun) the latest version of BOINC, build it and see if that fixes your situation.

Of course, we thank you for bringing it to the front. :-)
ID: 49066 · Report as offensive
Augustine
Joined: 10 Mar 06
Posts: 73
Message 49138 - Posted: 15 May 2013, 23:46:51 UTC - in response to Message 49066.  
Last modified: 15 May 2013, 23:58:16 UTC

This patch seems to minimize the case when there's limited memory. However, what I see now is that the RNA WU is legitimately suspended while the WUProp WU is running alongside two other WUs of other projects:

Also, pausing the WUProp WU does nothing; the RNA WU remains suspended and the other WUs continue running, as expected.

But shouldn't this situation, when the RNA WU is suspended due to low memory, lead to a WU from some project being fetched since there are 3 processors for 2 CI WUs and 1 NCI WU?

How can I help?

TIA
ID: 49138 · Report as offensive
Profile Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 49139 - Posted: 16 May 2013, 0:02:53 UTC - in response to Message 49138.  

Try the simulator: http://boinc.berkeley.edu/dev/sim_web.php

By the way, the NCI doesn't use a CPU core. It'll run always, even if there's enough work to fill all cores. So e.g. on a 4 core CPU, you can have 4 CPU intensive tasks and 1 non-CPU intensive task running at the same time.
ID: 49139 · Report as offensive
Augustine
Joined: 10 Mar 06
Posts: 73
Message 49140 - Posted: 16 May 2013, 0:18:35 UTC - in response to Message 49139.  

Try the simulator: http://boinc.berkeley.edu/dev/sim_web.php

Created one here.
ID: 49140 · Report as offensive
Augustine
Joined: 10 Mar 06
Posts: 73
Message 49141 - Posted: 16 May 2013, 0:21:33 UTC - in response to Message 49139.  

By the way, the NCI doesn't use a CPU core. It'll run always, even if there's enough work to fill all cores. So e.g. on a 4 core CPU, you can have 4 CPU intensive tasks and 1 non-CPU intensive task running at the same time.

Precisely, only 2 of the 3 available processors are being used by CI WUs. Shouldn't BOINC try to put that free processor to good use, or is the RNA WU impeding it?

TIA
ID: 49141 · Report as offensive
David Ball

Joined: 2 Dec 06
Posts: 69
United States
Message 49152 - Posted: 16 May 2013, 22:53:53 UTC

When I looked at your RNA WU, it says that
estimated runtime on reference system 11w 2d 22h 12m 14s (6905534.7950536 s)

That's 11 weeks of runtime and RNA doesn't checkpoint!!!
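
For anyone who wants to check that figure, a small Python sketch that breaks the seconds value quoted above into weeks, days, hours, minutes and seconds. The function name split_runtime is just an illustrative choice.

```python
def split_runtime(seconds):
    """Break a duration in seconds into (weeks, days, hours, minutes, seconds)."""
    total = int(seconds)  # drop the fractional part
    weeks, rest = divmod(total, 7 * 24 * 3600)
    days, rest = divmod(rest, 24 * 3600)
    hours, rest = divmod(rest, 3600)
    minutes, secs = divmod(rest, 60)
    return weeks, days, hours, minutes, secs


if __name__ == "__main__":
    # The estimate quoted from the RNA WU page.
    print(split_runtime(6905534.7950536))  # → (11, 2, 22, 12, 14)
```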

Read the following thread in the RNA message boards:

checkpoints for long WU

The recommended method of running these RNA WUs seems to be to run a VirtualBox virtual machine with a 20 GB virtual HDD and 1 GB of memory plus 2 GB of memory per CPU core on the virtual machine, so for a dual-core VM you're talking 5 GB of memory allocated. The virtual machine will need to run on a 64-bit OS.
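
The memory rule of thumb above can be written out as a quick Python sketch. The function name and parameter names are invented here; the 1 GB base and 2 GB per core are the figures from the recommendation.

```python
def vm_memory_gb(cores, base_gb=1, per_core_gb=2):
    """Memory to allocate to the VM: a fixed base plus a per-core amount."""
    return base_gb + per_core_gb * cores


if __name__ == "__main__":
    for cores in (1, 2, 4):
        print(cores, "core(s):", vm_memory_gb(cores), "GB")
```

So a dual-core VM comes out at 5 GB, well beyond what a host with 1 GB of memory can offer.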

Then you snapshot the virtual machine regularly and restart it from the latest virtual machine snapshot each time you restart the physical machine.

David
David Ball
ID: 49152 · Report as offensive
Profile Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 49179 - Posted: 17 May 2013, 21:45:25 UTC - in response to Message 49141.  

Shouldn't BOINC try to put that free processor to good use, or is the RNA WU impeding it?

According to the developers, it should be used for another task.

Flagging your thread for David again.
ID: 49179 · Report as offensive
Profile Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 49220 - Posted: 20 May 2013, 12:04:56 UTC

David Anderson wrote:
I'll fix this problem the next time I revise the job scheduling logic (should be in 2-3 months)

Not the answer you wanted, but it's being worked on nonetheless.

ID: 49220 · Report as offensive
Augustine
Joined: 10 Mar 06
Posts: 73
Message 49223 - Posted: 20 May 2013, 15:47:05 UTC - in response to Message 49220.  

It is the answer I wanted, just not soon enough. ;-)
ID: 49223 · Report as offensive

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.