Task status changes from 'Running' to 'Waiting to run' automatically

Message boards : Questions and problems : Task status changes from 'Running' to 'Waiting to run' automatically

Art Brown

Joined: 17 Jan 12
Posts: 6
United States
Message 42125 - Posted: 17 Jan 2012, 21:28:43 UTC

Greetings.

The status of a running BOINC task is being automatically changed from "Running" to "Waiting to run", and I don't know why.

I am running BOINC 6.12.34 for Windows x86-64 on an Intel Core i7 920 running 4 single-threaded cores.

The OS is Windows 7 Home Premium x64 SP 1 with all updates applied through 15 Jan 2012.

I use BoincTasks 1.30 to view my BOINC projects.

I run two World Community Grid (WCG) CPU tasks, one Test4Theory (T4T) version 7.05 2-core CPU task, and two GPUGRID tasks on two GPUs.

Occasionally one of the WCG tasks has its status changed from "Running" to "Waiting to run", yet there is an open core available for it to use (the core it was just Running on); another task is not started in its place. The number of Running tasks just drops from 5 (2 GPUGRID, 2 WCG, and one T4T) to 4 (2 GPUGRID, 1 WCG, and 1 T4T) along with a corresponding drop in CPU utilization.

The automatic status changes are not coincident with any BOINC task completions or transfers.

When one WCG task is automatically changed to "Waiting to run", if I Suspend the other running WCG task, the "Waiting to run" task runs. Then when I Resume the WCG task I just suspended, it runs, too.

The only entries in the Messages log are the manual status changes I initiate in response to an automatic status change.

I can induce this behavior when one WCG task is "Running", and the other WCG task is "Running High Priority". I Suspend the "Running" task, then Resume the same task. Its status changes to "Waiting to run" for an indefinite period (many minutes, perhaps until the other WCG task completes). Then I Suspend the "Running High Priority" task, and then Resume it. Now both WCG tasks will run until the next automatic status change. Suspending the "Running High Priority" task and Resuming it works as expected: the task resumes its previous "Running High Priority" status.

I noticed that if the WCG tasks have a status of "Running" when no T4T task is running, the WCG statuses change to "Running High Priority" as soon as a T4T task runs. This is independent of, and possibly unrelated to, the automatic status changes to "Waiting to run", but I don't know why this behavior is occurring.

I could have a configuration issue with BOINC or my PC that I don't know how to change to prevent these automatic status changes, or there might be another explanation.

Any comments or suggestions would be appreciated.

Thanks for your help, and let me know if you need more info.

Regards, Art Brown
ID: 42125
Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15480
Netherlands
Message 42126 - Posted: 17 Jan 2012, 21:38:53 UTC - in response to Message 42125.  

Occasionally one of the WCG tasks has its status changed from "Running" to "Waiting to run", yet there is an open core available for it to use (the core it was just Running on); another task is not started in its place.

Well, no, there isn't a core free. This is done by the T4T wrapper code to ensure that the system isn't overloaded when it is running and a GPU task is running. The T4T task runs on two cores automatically on systems with 2 or more cores or CPUs, provided the CPU is able to run virtualization code and has that enabled.

When you then add up the threads running, it would be 2 (WCG) + 2 (T4T) + X (GPUGrid). You'd end up with running more threads than you have CPU cores available. I don't know how much CPU the GPUGrid tasks take, but even if it were 1% only (0.01 CPU + 1 NVIDIA GPU), that would make the above calculation 2 + 2 + 0.02 = 4.02 threads running on a system that's only capable of running 4 threads.

The above is an example, of course. Usually the amount of (reserved) CPU is more. So you do the math. :-)
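
The sum above can be written out as a quick check. This is only an illustrative sketch: the 0.01 CPU per GPU task is the example figure from this post, and the variable names are invented here, not taken from BOINC.

```python
# Illustrative only: the figures are the example values from this post.
NCPUS = 4                  # cores available on the machine

wcg_tasks = 2 * 1.0        # two single-core WCG tasks
t4t_task = 2.0             # one T4T VM task that reserves two cores
gpu_tasks = 2 * 0.01       # two GPU tasks, assuming 0.01 CPU each

total = wcg_tasks + t4t_task + gpu_tasks
overcommitted = total > NCPUS

print(f"CPUs requested: {total:.2f} of {NCPUS}")
print("overcommitted" if overcommitted else "fits")
```

With a larger CPU reservation per GPU task, the total only moves further above 4.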
ID: 42126
Art Brown

Joined: 17 Jan 12
Posts: 6
United States
Message 42133 - Posted: 18 Jan 2012, 16:25:42 UTC - in response to Message 42128.  

Thanks for the suggestion.

I use ProcessLasso to set the GPUGRID CPU priorities to Above Normal, and affinities to specific cores if I see any CPU usage conflicts between GPU and CPU tasks. A GPUGRID task only consumes up to about 4% of a CPU, so this way I can use all CPUs for BOINC CPU tasks and can run GPUGRID tasks at the same time.

I think that BOINC has a task management issue that causes the task statuses to change unexpectedly.

Regards, Art Brown
ID: 42133
Art Brown

Joined: 17 Jan 12
Posts: 6
United States
Message 42134 - Posted: 18 Jan 2012, 17:02:47 UTC - in response to Message 42126.  

Thanks for the background info.

1) Since I can suspend and resume World Community Grid tasks following an automatic status change such that both WCG tasks run, it is curious that the thread count would be different after both WCG tasks resume running than before the status change, but perhaps that is what happens.

On a dual core machine I have, I run one GPUGRID task, one T4T single core task, and one WCG CPU task, and I have not seen the automatic status change for the WCG task as yet. All three tasks run at the same time continuously and without any apparent conflicts.

2) The WCG status change is not coincident with a T4T task starting. The status change happens at some later, apparently random time that I have not as yet been able to correlate anything to. Additionally on the 4-core machine, once the two WCG tasks are running, along with the two GPUGRID tasks and the 2-core T4T task, all five BOINC tasks continue to run for hours until an automatic status change happens.

3) I had not considered the thread count issue, so I checked the thread count for the BOINC tasks I run. Each GPUGRID task shows 6 threads, each WCG task shows 2 or 3 threads, and virtualbox.exe for the T4T task shows 24 threads. I don't know if these threads are the same kind of threads you referred to, however.

Thanks again. Art Brown
ID: 42134
Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15480
Netherlands
Message 42136 - Posted: 18 Jan 2012, 17:38:31 UTC - in response to Message 42134.  
Last modified: 18 Jan 2012, 17:54:00 UTC

On a dual core machine I have, I run one GPUGRID task, one T4T single core task, and one WCG CPU task, and I have not seen the automatic status change for the WCG task as yet.

No, it's because this only happens when you run a VM + tasks on a CPU, plus tasks on a GPU. Not when you only run VM + tasks on a CPU.

2) The WCG status change is not coincident with a T4T task starting. The status change happens at some later, apparently random time that I have not as yet been able to correlate anything to.

Then run with <cpu_sched_debug/> in a cc_config.xml file, to see the scheduling decisions and when the change happens.

I don't know if these threads are the same kind of threads you referred to, however.

What I meant to explain was that this new thing (which is built into the wrapper code by the way, not into BOINC, and if it needs to be disabled the project should do so) will not allow more tasks to make use of your processors than you have processors.

Yes, before this change it used to be that you could easily run 2 CPU tasks + the VM on 2 cores + 2 GPU tasks together using 1 core (5 cores) on a 4 core machine, but this would also slow that machine down and/or make it unusable, and there were complaints in that regard.

Now, there is a project preference at T4T with which you can set how many CPU cycles you want the VM to use (0 - 100), but even at 100% it means it's using 1 core at full bore, which will slow a system down. That's why it was chosen to give the VM two cores, to make the system overall more stable. If you then go and allow more resources to be used on that system than it has, you are back at the slow-downs and back at the complaints/walkouts (by some).

For system stability it's better, when you use the VM plus the rest of the CPUs plus one or more GPUs, to leave one CPU (core) unused and in a way give that one exclusively to the GPUs.

Edit: I see that the T4T project has resorted to an option to choose which VM to use: http://lhcathome2.cern.ch/test4theory/forum_thread.php?id=737.
ID: 42136
Art Brown

Joined: 17 Jan 12
Posts: 6
United States
Message 42143 - Posted: 18 Jan 2012, 23:52:10 UTC - in response to Message 42136.  

Thanks for the detailed explanations. They are very useful and appreciated.

1) I added <cpu_sched_debug>1</cpu_sched_debug> to cc_config.xml, so we'll see what shows up the next time the WCG status is changed.

Currently, all 5 tasks are running happily on the 4-core CPU: 2 GPUGRID, 2 WCG and one 2-core T4T.

2) I changed the new T4T option to assign 2-core tasks so I can continue to monitor the automatic status change issue.

3) I noticed the 2-core T4T task uses about 35% of the total CPU capacity, or only about 70% of the 2 cores it uses. There is a lot of idle time available in those two cores. The WCG tasks use virtually all of each core they are assigned to.

Do you have any insight on why the 2-core T4T task doesn't use more of the CPU resource it has available to it?

Regarding the usability issue on a heavily loaded machine, I set the WCG task priorities to Below Normal (using Process Lasso), so they only run when nothing more important is taking place. I leave the BOINC tasks in memory so they can be run without going to disk. I have assigned the GPU tasks to specific cores in the past, but I am not seeing degradation in CPU or GPU performance if I don't, so right now the CPU affinity for the GPUGRID tasks is not restricted by me. So far, my PC is responsive to human input, yet can run as many BOINC tasks as possible.

Before the 2-core T4T tasks were available, I ran all 4 cores at virtually 100% load, and the PC was still responsive, using the Below Normal WCG priority strategy above.

Thanks again. Art Brown
ID: 42143
Art Brown

Joined: 17 Jan 12
Posts: 6
United States
Message 42149 - Posted: 19 Jan 2012, 23:12:34 UTC - in response to Message 42136.  

Hi Jord,

The scheduler debug Message log captured the events that you mentioned regarding limiting CPU usage (lines 112 & 113):

112 World Community Grid 2012-01-19 06:12 [cpu_sched] avoiding overcommit with multithread job, skipping c4cw_target05_058607454_0
113 2012-01-19 06:12 [cpu_sched] using 3.75 out of 4 CPUs
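
For anyone who wants to pull these figures out of a saved log rather than read them by eye, here is a minimal sketch. It assumes the message text looks exactly like the two lines above; the regular expressions and the function name are made up for illustration and are not part of BOINC or BoincTasks.

```python
import re

# Matches the "[cpu_sched] using X out of Y CPUs" message quoted above.
USING_RE = re.compile(r"\[cpu_sched\] using ([\d.]+) out of (\d+) CPUs")
# Matches the "avoiding overcommit ... skipping <task name>" message quoted above.
SKIP_RE = re.compile(
    r"\[cpu_sched\] avoiding overcommit with multithread job, skipping (\S+)"
)

def parse_cpu_sched(lines):
    """Return (skipped task names, list of (cpus_used, cpus_total))."""
    skipped, usage = [], []
    for line in lines:
        m = SKIP_RE.search(line)
        if m:
            skipped.append(m.group(1))
            continue
        m = USING_RE.search(line)
        if m:
            usage.append((float(m.group(1)), int(m.group(2))))
    return skipped, usage

log = [
    "112 World Community Grid 2012-01-19 06:12 [cpu_sched] avoiding overcommit "
    "with multithread job, skipping c4cw_target05_058607454_0",
    "113 2012-01-19 06:12 [cpu_sched] using 3.75 out of 4 CPUs",
]
skipped, usage = parse_cpu_sched(log)
print("skipped:", skipped)
print("usage:", usage)
```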

Thanks for picking up on this issue for me.

The result of this behavior is to change the status of a task from "Running" to "Waiting to run", effectively shutting the task down until another task completes, which could be several hours.

I think you said that this behavior comes from the 2-core Test4Theory (T4T) application.

If that is right, I'd like to see T4T (or whoever controls it) let the user choose whether they want this automatic CPU utilization control, or if they would prefer no T4T CPU utilization control.

In my case, the T4T automatic status control:
1) shuts down a non-T4T task, which results in one core idling with no BOINC task to run. This has two negative side effects from my perspective:
   a) T4T takes CPU time away from a non-T4T project, which seems unfair.
   b) it also reduces the overall BOINC CPU efficiency of my PC; my 4-core CPU becomes a 3-core CPU.
2) the T4T 2-core application only consumes 70% of the CPU time of the two cores it runs on. This low T4T efficiency effectively reduces the "3-core CPU" to a "1 + 0.7 + 0.7-core CPU" or "2.4-core CPU".

Overall, that is a 1.6 core (40%) reduction in BOINC CPU time on a 4-core system as the direct result of running the 2-core T4T task and its CPU scheduling process. Losing 1.6 cores is a very steep price to pay to increase the T4T CPU time from 100% of 1 core to 70% of 2 cores, a gain of just 0.4 cores.
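
The arithmetic behind those numbers, spelled out as a rough sketch. All inputs are the figures quoted in this post (the small CPU load of the GPU tasks is left out, as in the text), and the variable names are only illustrative.

```python
# Inputs are the figures quoted in this post; the script measures nothing itself.
total_cores = 4
t4t_core_load = 0.7     # observed fraction of each of the two T4T cores in use
wcg_running = 1         # one WCG task keeps running, the other waits

# Effective busy cores: 1 WCG core + 0.7 + 0.7 for the 2-core T4T task.
busy = wcg_running * 1.0 + 2 * t4t_core_load
lost = total_cores - busy
lost_pct = 100 * lost / total_cores

# T4T's own gain from going from 100% of 1 core to 70% of 2 cores.
t4t_gain = 2 * t4t_core_load - 1.0

print(f"busy cores: {busy:.1f}")
print(f"lost cores: {lost:.1f} ({lost_pct:.0f}%)")
print(f"T4T gain:   {t4t_gain:.1f} cores")
```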

I think a contributing factor to the scheduling problem is the CPU "Use" estimate that the GPUGRID tasks have. In my case, each GPUGRID task was rated at 0.37 CPU + 1 Nvidia GPU. However the long-term CPU load for the GPU tasks is under 4%, not the 37% the rating indicates. It almost looks like the CPU rating has the decimal point in the wrong place. At these low measured CPU loadings, the GPU tasks could be ignored when calculating CPU loads. That would have made line 113 above calculate to needing 3.01 cores (not 3.75) for 4 tasks (2 GPUGRID, 1 WCG and one 2-core T4T), which should allow 4 cores to run 5 tasks. No tasks would have to be suspended, and the PC would still be responsive, at least the way I run my BOINC tasks.

I maintain good PC responsiveness running 2 GPUGRID tasks and 4 BOINC CPU tasks by setting World Community Grid tasks to Below Normal, so in my case I don't need the T4T control, and would not use it if I had the choice.

I think I'll go back to the 1-core T4T tasks until a solution for this problem is available. I paid for a 4-core CPU, and this is currently the only way to get all of the benefits from it. I hope this is temporary, however.

In any case, thanks again for solving the mystery of automatic task status changes.

If there is anything that I can do to help resolve this issue, please let me know.

Art
ID: 42149
Art Brown

Joined: 17 Jan 12
Posts: 6
United States
Message 42188 - Posted: 23 Jan 2012, 20:35:31 UTC

Problem solved!

The BOINC client software that runs continuously in the background periodically checks to see if more or fewer tasks can be run (among many other things).

In my case of a 4-core single-threaded CPU, BOINC was adding up the CPU Usage values for all of the tasks that could be run. When that total CPU Usage exceeded the number of cores (4), BOINC suspended a task until the task could run and not exceed the core count.

You can see the calculation in the Messages tab of Boinc Tasks if you add a line in the cc_config.xml file (see below) in the <log_flags> section:

<cpu_sched_debug>1</cpu_sched_debug>

Change the argument from 1 to 0 to turn off the logging function. A BOINC restart is required for <log_flags> changes to take effect.

The solution was to simply add one line in the cc_config.xml file that changes the number of CPU cores that BOINC uses to schedule tasks. In my case I set the core count to 5:

<ncpus>5</ncpus>

This entry goes in the <options> section of the cc_config.xml file (below).

Once entered, you need to execute the "Read configuration file" command in your Boinc Manager or Boinc Tasks program for changes to the <options> section to take effect.

From then on, Boinc will use the new core count value to determine how many tasks it will schedule. The higher value prevented the problem I was seeing, which was a task would automatically change from "Running" to "Waiting to run".

There is a Message thread on this issue on the Test4Theory (T4T) web site: http://lhcathome2.cern.ch/test4theory/forum_thread.php?id=742

With <ncpus> set at 5, I noticed that the total CPU load was about 97%, so I changed the <ncpus> argument from 5 to 6, and now the CPU load stays at 100%. The extra task that is run is simply "time shared" among the available cores by Windows' normal scheduling algorithms. Sometimes there is an extra World Community Grid (WCG) task in the "Ready to start" or "Waiting to run" state, but that doesn't cause any problems. That task just waits its turn to run.

I run the WCG tasks at Low CPU priority, VirtualBox.exe for the Test4Theory task at Below Normal (between Low and Normal priorities), and the GPUGRID tasks at Normal priority. These settings keep the PC fully responsive to human input, yet use all of the available CPU capacity. I use Process Lasso, a free utility from Bitsum Technologies, to set the CPU priorities in my Windows 7 system.

Now the 4-core PC runs 3 WCG 1-core tasks, one 2-core T4T task, and 2 GPUGRID 0.37-core tasks concurrently, which fully utilize the CPU and the two GPUs.
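
To see why raising <ncpus> changes which tasks run, here is a simplified sketch of the "add up CPU Usage and stop at the core count" idea described above. It is an assumption-laden toy model, not BOINC's actual scheduler: the task names are made up, tasks are admitted in list order, and the CPU usage values are the ones shown in this thread (the log reported 3.75 where this sum gives 3.74, so the client's internal figure is presumably slightly different).

```python
def schedule(tasks, ncpus):
    """Greedy sketch: admit tasks in order while summed CPU usage fits in ncpus.

    tasks is a list of (name, cpu_usage) pairs.
    Returns (running names, waiting names, total CPU usage of running tasks).
    A simplification for illustration only, not BOINC's real scheduler.
    """
    running, waiting, used = [], [], 0.0
    for name, cpu in tasks:
        if used + cpu <= ncpus + 1e-9:   # small tolerance for float rounding
            running.append(name)
            used += cpu
        else:
            waiting.append(name)
    return running, waiting, used

# Hypothetical task names; CPU usage values as quoted in this thread.
tasks = [
    ("GPUGRID-1", 0.37), ("GPUGRID-2", 0.37),
    ("T4T-2core", 2.0),
    ("WCG-1", 1.0), ("WCG-2", 1.0), ("WCG-3", 1.0),
]

for ncpus in (4, 5, 6):
    running, waiting, used = schedule(tasks, ncpus)
    print(f"ncpus={ncpus}: using {used:.2f}, waiting: {waiting}")
```

In this toy model the second WCG task only fits once <ncpus> is 5, and the third once it is 6, which is consistent with what I see on my machine.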

Here is the cc_config.xml file I am currently using:

<cc_config>
    <log_flags>
        <cpu_sched_debug>0</cpu_sched_debug>
    </log_flags>
    <options>
        <client_version_check_url>http://www.worldcommunitygrid.org/download.php?xml=1</client_version_check_url>
        <client_download_url>http://www.worldcommunitygrid.org/download.php</client_download_url>
        <network_test_url>http://www.ibm.com/</network_test_url>
        <use_all_gpus>1</use_all_gpus>
        <ncpus>6</ncpus>
        <report_results_immediately>1</report_results_immediately>
        <start_delay>60</start_delay>
    </options>
</cc_config>

Art Brown
ID: 42188
SekeRob2

Joined: 6 Jul 10
Posts: 585
Italy
Message 42201 - Posted: 24 Jan 2012, 9:46:36 UTC - in response to Message 42188.  

If the computer still remains responsive to user input, that's a great solution, but it is not what BOINC was designed to do. You've artificially told a quad [through an option designed for testing] to bite off more than it can chew. Now you can use Process Lasso to automate the affinities of the sciences, so that, for instance, 2 lighter sciences, or 1 light and 1 heavy, run on 1 core. With this trick you could get BOINC to think you've got an octo... and run 2 tasks on each core. The CPU will be fully loaded for sure, but just as surely the increased disk and memory swapping could lead to a substantial drop in efficiency... that is, a wider gap between the reported CPU time and the wall-clock time that the tasks were running. To put it this way, 97% load may give a greater efficiency than 100% with lots of swapping and increased competition between tasks. My estimation would be: don't try this while running, for instance, WCG's Clean Energy Project. Beyond half the cores, you'll see rapid efficiency degradation on most devices.
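
To make "efficiency" concrete: it can be taken as reported CPU time divided by wall-clock run time. A minimal sketch, with a made-up function name and hypothetical numbers chosen purely to show the calculation:

```python
def efficiency(cpu_seconds, elapsed_seconds):
    """Fraction of wall-clock time a task actually spent on the CPU."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return cpu_seconds / elapsed_seconds

# Hypothetical task: 3.5 hours of CPU time over 4 hours on the wall clock.
print(f"{efficiency(3.5 * 3600, 4.0 * 3600):.1%}")
```

Comparing this ratio for the same kind of task before and after raising <ncpus> is one way to check whether the extra load is actually paying off.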

--//--
ID: 42201
Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15480
Netherlands
Message 42203 - Posted: 24 Jan 2012, 10:02:20 UTC - in response to Message 42188.  

Problem solved!

There was never a problem, things were working as they were supposed to!

There is a Message thread on this issue on the Test4Theory (T4T) web site

Where the moderator in question is demanding that the wrong people have this 'removed'. He should yell at the T4T developers, since it's the wrapper that does this setting of multiple cores; BOINC just follows it.

One stern warning for anyone following advice like this: The <ncpus/> option is in the client to temporarily simulate more processors than you have, so that you run more tasks than you have processors. It's for debugging science applications only. It does not change how BOINC schedules things; if anything, it will put a much higher load on the system, with added slowness, added heat build-up, added wear and tear, and a higher possibility of returning erroneous work.

Therefore it should only be used by people with advanced knowledge of BOINC, people who know what they are doing.

Read Client Configuration.
<ncpus>N</ncpus>
Act as if there were N CPUs; i.e. to simulate 2 CPUs on a machine that has only 1. To use the number of available CPUs, set the value to -1 (was 0 which in newer clients really means zero to e.g. only allow GPU computing).

ID: 42203

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.