Thread 'Getting too many WCG tasks on systems that had been working ok'

Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 106429 - Posted: 15 Dec 2021, 18:21:00 UTC

Going to switch to latest version as I cannot account for why too many tasks are being downloaded when share is set to 0.

I have several systems on 7.16.3 and the Linux ones do not show a problem. Three Win10 systems:
- 70 days, 834 tasks
- 322 days, 1658 tasks, and I had to abort 700+ tasks a few days ago.
- 2 days, 16 tasks

The above was not on new builds, where the share is set to 100 for the first few minutes.

I went over to the WCG forum but did not see any similar problems. They do not have a "Questions and Problems" forum so I had to poke around.
It does not look like a problem at their end caused by the move from IBM. If it happens with 7.16.20 then I could try to debug it, if I knew what to look for.

[edit] I just started BOINC back up on a Windows system that rebooted due to a Windows feature update. It has 7.16.3 and I just watched it download additional WCG tasks when there was no need. Share was 0 and there was already a week's worth of tasks. Maybe when rebooting the 0% share is not noticed???
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 5124
United Kingdom
Message 106430 - Posted: 15 Dec 2021, 18:47:30 UTC - in response to Message 106429.

What does the event log say about fetching?
Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 106431 - Posted: 15 Dec 2021, 20:01:17 UTC - in response to Message 106430.
Last modified: 15 Dec 2021, 20:11:00 UTC

What does the event log say about fetching?


[EDIT] I fixed the version numbers I had garbled up. Note that ALL systems had share set to 0 and had been that way for a long time.

OK, I turned on <work_fetch_debug> on three systems. One I had to stop and restart, as the chatter went off the event screen and the "top" was missing.
LOOKS LIKE I DUPLICATED THE PROBLEM FROM 7.16.3 ON 7.16.20!!

Two of them I had just upgraded to 7.16.20, and the third I had just recently restarted. There was a difference ON ALL THREE.

This one running 7.16.20 downloaded one task. I had just aborted 1600+ and was afraid I would not get any because of the daily limit, but I did get one. So actually, this is normal.

bjysdualx2

84			12/15/2021 1:43:40 PM	[work_fetch] target work buffer: 86400.00 + 0.00 sec	
85			12/15/2021 1:43:40 PM	[work_fetch] --- project states ---	
91	World Community Grid	12/15/2021 1:43:40 PM	[work_fetch] REC 26763.703 prio -0.000 can request work	
92			12/15/2021 1:43:40 PM	[work_fetch] --- state for CPU ---	
93			12/15/2021 1:43:40 PM	[work_fetch] shortfall 1869495.34 nidle 0.00 saturated 230.49 busy 0.00	
99	World Community Grid	12/15/2021 1:43:40 PM	[work_fetch] share 0.000 zero resource share 	
100			12/15/2021 1:43:40 PM	[work_fetch] --- state for AMD/ATI GPU ---	
101			12/15/2021 1:43:40 PM	[work_fetch] shortfall 344087.02 nidle 0.00 saturated 230.49 busy 0.00	
107	World Community Grid	12/15/2021 1:43:40 PM	[work_fetch] share 0.000 zero resource share 	
108			12/15/2021 1:43:40 PM	[work_fetch] ------- end work fetch state -------	
120	World Community Grid	12/15/2021 1:43:40 PM	choose_project: scanning	
121	World Community Grid	12/15/2021 1:43:40 PM	can't fetch CPU: zero resource share	
122	World Community Grid	12/15/2021 1:43:40 PM	can't fetch AMD/ATI GPU: zero resource share	
123			12/15/2021 1:43:40 PM	[work_fetch] No project chosen for work fetch	
124			12/15/2021 1:44:41 PM	choose_project(): 1639597481.509739	


The above does not show any download because I had to restart to get the "top" of the log.

The next is for another 7.16.20 that unfortunately downloaded more stuff. I had just restarted after putting in 7.16.20 and then I aborted 50 days' worth, and that must have triggered more downloads. I did not have work_fetch_debug in cc_config so I missed what happened when it got the extra stuff. I then changed the CPU % to allow more tasks and got more downloads THAT SHOULD NOT HAVE HAPPENED (note: the change from 12 to 14 CPUs caused more tasks).

JYSArea51

1779			12/15/2021 1:57:59 PM	   max CPUs used: 14	
1780			12/15/2021 1:57:59 PM	   (to change preferences, visit a project web site or select Preferences in the Manager)	
1781			12/15/2021 1:57:59 PM	[work_fetch] Request work fetch: Prefs update	
1782			12/15/2021 1:57:59 PM	[work_fetch] Request work fetch: Preferences override	
1783			12/15/2021 1:58:00 PM	choose_project(): 1639598280.665096	
1784			12/15/2021 1:58:00 PM	[work_fetch] ------- start work fetch state -------	
1785			12/15/2021 1:58:00 PM	[work_fetch] target work buffer: 8640.00 + 43200.00 sec	
1786			12/15/2021 1:58:00 PM	[work_fetch] --- project states ---	
1810	World Community Grid	12/15/2021 1:58:00 PM	[work_fetch] REC 6981.661 prio -1000.053 can't request work: scheduler RPC backoff (13.04 sec)	
1812			12/15/2021 1:58:00 PM	[work_fetch] --- state for CPU ---	
1813			12/15/2021 1:58:00 PM	[work_fetch] shortfall 700695.25 nidle 7.00 saturated 0.00 busy 0.00	
1837	World Community Grid	12/15/2021 1:58:00 PM	[work_fetch] share 0.000  	
1839			12/15/2021 1:58:00 PM	[work_fetch] --- state for NVIDIA GPU ---	
1840			12/15/2021 1:58:00 PM	[work_fetch] shortfall 51647.39 nidle 0.00 saturated 192.61 busy 0.00	
1864	World Community Grid	12/15/2021 1:58:00 PM	[work_fetch] share 0.000 zero resource share 	
1866			12/15/2021 1:58:00 PM	[work_fetch] ------- end work fetch state -------	
1914	World Community Grid	12/15/2021 1:58:00 PM	choose_project: scanning	
1915	World Community Grid	12/15/2021 1:58:00 PM	skip: scheduler RPC backoff	
1919			12/15/2021 1:58:00 PM	[work_fetch] No project chosen for work fetch	
1920			12/15/2021 1:58:13 PM	[work_fetch] Request work fetch: Backoff ended for World Community Grid	
1921			12/15/2021 1:58:15 PM	choose_project(): 1639598295.784178	
1922			12/15/2021 1:58:15 PM	[work_fetch] ------- start work fetch state -------	
1923			12/15/2021 1:58:15 PM	[work_fetch] target work buffer: 8640.00 + 43200.00 sec	
1924			12/15/2021 1:58:15 PM	[work_fetch] --- project states ---	
1948	World Community Grid	12/15/2021 1:58:15 PM	[work_fetch] REC 6981.661 prio -1000.052 can request work	
1950			12/15/2021 1:58:15 PM	[work_fetch] --- state for CPU ---	
1951			12/15/2021 1:58:15 PM	[work_fetch] shortfall 700709.42 nidle 7.00 saturated 0.00 busy 0.00	
1975	World Community Grid	12/15/2021 1:58:15 PM	[work_fetch] share 1.000  	
1977			12/15/2021 1:58:15 PM	[work_fetch] --- state for NVIDIA GPU ---	
1978			12/15/2021 1:58:15 PM	[work_fetch] shortfall 51661.53 nidle 0.00 saturated 178.47 busy 0.00	
2002	World Community Grid	12/15/2021 1:58:15 PM	[work_fetch] share 1.000  	
2004			12/15/2021 1:58:15 PM	[work_fetch] ------- end work fetch state -------	
2052	World Community Grid	12/15/2021 1:58:15 PM	choose_project: scanning	
2053	World Community Grid	12/15/2021 1:58:15 PM	can fetch CPU	
2054	World Community Grid	12/15/2021 1:58:15 PM	CPU needs work - buffer low	





The system still running 7.16.3 downloaded another week's worth. This is the chatter:

lenovos20

43			12/15/2021 1:24:01 PM	choose_project(): 1639596241.273872	
44			12/15/2021 1:24:01 PM	[work_fetch] ------- start work fetch state -------	
45			12/15/2021 1:24:01 PM	[work_fetch] target work buffer: 86400.00 + 0.00 sec	
46			12/15/2021 1:24:01 PM	[work_fetch] --- project states ---	
48	World Community Grid	12/15/2021 1:24:01 PM	[work_fetch] REC 4124.711 prio -0.112 can request work	
49			12/15/2021 1:24:01 PM	[work_fetch] --- state for CPU ---	
50			12/15/2021 1:24:01 PM	[work_fetch] shortfall 695894.09 nidle 1.00 saturated 0.00 busy 0.00	
52	World Community Grid	12/15/2021 1:24:01 PM	[work_fetch] share 1.000  	
53			12/15/2021 1:24:01 PM	[work_fetch] --- state for NVIDIA GPU ---	
54			12/15/2021 1:24:01 PM	[work_fetch] shortfall 18361.75 nidle 0.00 saturated 68038.25 busy 0.00	
56	World Community Grid	12/15/2021 1:24:01 PM	[work_fetch] share 0.500  	
57			12/15/2021 1:24:01 PM	[work_fetch] ------- end work fetch state -------	
58	World Community Grid	12/15/2021 1:24:01 PM	choose_project: scanning	
59	World Community Grid	12/15/2021 1:24:01 PM	can fetch CPU	
60	World Community Grid	12/15/2021 1:24:01 PM	CPU needs work - buffer low	
61	World Community Grid	12/15/2021 1:24:01 PM	checking CPU	
62	World Community Grid	12/15/2021 1:24:01 PM	[work_fetch] using MC shortfall 591132.340164 instead of shortfall 695894.087949	
63	World Community Grid	12/15/2021 1:24:01 PM	[work_fetch] set_request() for CPU: ninst 10 nused_total 227.00 nidle_now 1.00 fetch share 1.00 req_inst 0.00 req_secs 591132.34	
64	World Community Grid	12/15/2021 1:24:01 PM	CPU set_request: 591132.340164	
65	World Community Grid	12/15/2021 1:24:01 PM	checking NVIDIA GPU	
66	World Community Grid	12/15/2021 1:24:01 PM	[work_fetch] using MC shortfall 18361.747788 instead of shortfall 18361.747788	
67	World Community Grid	12/15/2021 1:24:01 PM	[work_fetch] set_request() for NVIDIA GPU: ninst 1 nused_total 0.00 nidle_now 0.00 fetch share 0.50 req_inst 0.00 req_secs 18361.75	
68	World Community Grid	12/15/2021 1:24:01 PM	NVIDIA GPU set_request: 18361.747788	
Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 106432 - Posted: 15 Dec 2021, 20:23:36 UTC
Last modified: 15 Dec 2021, 20:31:11 UTC

[edit] I had to delete most of what I wrote as I had been looking at the wrong system.
The system that had downloaded just one task has now gone and downloaded a few more, for a total of 4. That is probably OK. During that time another 7.16.20 downloaded another week's worth.
Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 106434 - Posted: 15 Dec 2021, 21:29:41 UTC
Last modified: 15 Dec 2021, 21:52:50 UTC

Found something strange in the code

Looking for "[work_fetch] share 0.000 "

I found that the above is printed by this function:
void RSC_WORK_FETCH::print_state(const char* name) {
...
....
        msg_printf(p, MSG_INFO,
            "[work_fetch] share %.3f %s %s",
            rpwf.fetchable_share,
            rsc_reason_string(rpwf.rsc_project_reason),
            buf
...


where that variable, which has the value 0.000 (or 1.0, or 0.5), is defined here:
double fetchable_share;
        // this project's share relative to projects from which
        // we could probably get work for this resource;
        // determines how many instances this project deserves


and it can be set to "1" here, based on "project reason":
           if (!p->rsc_pwf[j].rsc_project_reason) {
                p->rsc_pwf[j].fetchable_share = rsc_work_fetch[j].total_fetchable_share?p->resource_share/rsc_work_fetch[j].total_fetchable_share:1;
 

so if "project reason" is true (just noticed the negation) then share is set to 1.0

I do not know where the 0.5 came from. However, someone has hard-coded a 1.0 for the project share, which is suspicious. If I knew more about "project reason" maybe there is a "reason".

[edit] Just realized that "rsc_reason_string(rpwf.rsc_project_reason)" returned an empty string, since nothing was printed after the 1.000, so the reason is "false"?? and a 1.0 seems to have been assigned to the project share?

HTH
Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 106436 - Posted: 16 Dec 2021, 1:14:31 UTC
Last modified: 16 Dec 2021, 1:19:47 UTC

Some thoughts on the following code, worth about 2c (my thoughts, not the code)

        if (!p->rsc_pwf[j].rsc_project_reason) {
                p->rsc_pwf[j].fetchable_share = rsc_work_fetch[j].total_fetchable_share?p->resource_share/rsc_work_fetch[j].total_fetchable_share:1;
...
...
        msg_printf(p, MSG_INFO,
            "[work_fetch] share %.3f %s %s",
            rpwf.fetchable_share,
            rsc_reason_string(rpwf.rsc_project_reason),
            buf


The following indicates that a "1" was not put into the share.
That means the IF condition was "false" and consequently rsc_project_reason was "true" (non-zero):
[work_fetch] share 0.000 zero resource share



The following indicates that not only was the IF condition "true" (rsc_project_reason was "false"),
but in addition the "rsc_reason_string" is empty, as nothing was printed after the number:
[work_fetch] share 1.000 
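
If I am reading those two fragments right, the whole thing boils down to something like this standalone sketch (my paraphrase with simplified names, not the actual client source; compiles with any C++ compiler):

#include <cstdio>

// stand-in for rsc_reason_string(): empty string when there is no reason
const char* reason_string(int reason) {
    return reason ? "zero resource share" : "";
}

// stand-in for the quoted assignment to fetchable_share
double compute_fetchable_share(int rsc_project_reason,
                               double resource_share,
                               double total_fetchable_share) {
    double fetchable_share = 0.0;          // left at 0 when a reason is set
    if (!rsc_project_reason) {
        // same shape as the quoted line: falls back to the hard-coded 1
        // when the denominator is zero (e.g. this is the only fetchable
        // project and its resource share is 0)
        fetchable_share = total_fetchable_share
            ? resource_share / total_fetchable_share : 1;
    }
    return fetchable_share;
}

int main() {
    // Case 1: a reason is set -> prints "share 0.000 zero resource share"
    int reason = 1;
    printf("[work_fetch] share %.3f %s\n",
           compute_fetchable_share(reason, 0.0, 0.0), reason_string(reason));

    // Case 2: no reason, denominator 0 -> prints "share 1.000 " with an
    // empty reason string, matching the second log line
    reason = 0;
    printf("[work_fetch] share %.3f %s\n",
           compute_fetchable_share(reason, 0.0, 0.0), reason_string(reason));
    return 0;
}

So the hard-coded "1" only shows up when no reason is set and total_fetchable_share is zero; the "zero resource share" case never reaches that line at all.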


Anyway, I edited that code, changed the "1" to "0", and put a copy labeled "7.16.19" on two of my worst WCG offender systems.
After rebooting, the system with 20 cores downloaded 4 new tasks and the system with only 8 cores downloaded only 2. This was after I aborted about 75 days of work, most of which could not have been completed by the deadline.

Will know tomorrow for sure if my "fix" worked.
Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 106438 - Posted: 16 Dec 2021, 3:57:26 UTC

The code change had no real effect. While the 1.000 no longer showed up in the log file, the system with only 8 cores went and got 10 days' worth of work. The system with 20 cores got just one day. Neither system should have downloaded more than 1 WCG task at a time with share set to 0. None of my other projects have this behavior. There is a problem somewhere.
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 5124
United Kingdom
Message 106440 - Posted: 16 Dec 2021, 8:42:44 UTC

Oh dear. The resource share you read in a <work_fetch_debug> segment of the event log has NOTHING TO DO with the resource share you set on a project web site.

The <work_fetch_debug> usage is an instantaneous snapshot - literally, "what can we do now, this second, in this single instance of work-fetch decision making?". The first question is: "are we allowed, now, this instant, to fetch work from this project?" If no, the value will be zero: if yes, the value will be positive. The routine loops over all attached projects, and counts the positives. If N projects are in the 'can fetch' state, each will show a share of 1/N.

The project resource share is a long-term value. That is designed to balance out the project work allocation over days, weeks, months - not second by second.
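
In concrete terms, here's a tiny sketch of that instantaneous split (an illustration only, not client code; the project names other than WCG are just examples):

#include <cstdio>
#include <vector>
#include <string>

struct Proj {
    std::string name;
    bool can_fetch_now;   // no backoff, no "zero resource share" reason, etc., right now
};

int main() {
    std::vector<Proj> projects = {
        {"World Community Grid", true},
        {"Einstein@Home",        true},    // example of a second fetchable project
        {"Milkyway@home",        false},   // example: momentarily in scheduler backoff
    };

    // Count projects that can fetch at this instant; each gets 1/N,
    // everything else shows 0.000 regardless of the web-site resource share.
    int n = 0;
    for (const auto& p : projects) if (p.can_fetch_now) n++;

    for (const auto& p : projects) {
        double share = (p.can_fetch_now && n) ? 1.0 / n : 0.0;
        printf("[work_fetch] %-22s share %.3f\n", p.name.c_str(), share);
    }
    return 0;
}

With two fetchable projects that prints 0.500 for each of them (which may be where a 0.500 in one of your snapshots comes from) and 0.000 for the one in backoff.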

When I asked "what does the event log say about fetching?", I was hoping for a broad-brush overview, at least in the first instance. When you said you "had to abort" a number of tasks, how did they arrive? Did they arrive in one huge dump? Or did they arrive in a trickle, a few at a time, again and again and again?

To a first approximation, a dump indicates a server problem: a trickle indicates a client problem. We need to know where to start looking.
Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 106442 - Posted: 16 Dec 2021, 12:50:20 UTC - in response to Message 106440.
Last modified: 16 Dec 2021, 13:40:16 UTC

Oh dear. The resource share you read in a <work_fetch_debug> segment of the event log has NOTHING TO DO with the resource share you set on a project web site.
To a first approximation, a dump indicates a server problem: a trickle indicates a client problem. We need to know where to start looking.


Well, at least I was correct about the 2c.

Hmm - I do not remember problems like this in other projects, nor in WCG before they implemented GPU support for that COVID app.
There seems to be no rhyme or reason to this problem, as some systems are not affected:

- 16-core Linux with 7.16.3 and 2 AMD boards never has a problem. When one WU is uploaded another is downloaded.
- A pair of Windows 10 systems with NVidia likewise have no problem. They have been running perfectly for a long time: a single download for every upload.

All the new systems I recently built have problems except one.

The one with no problem runs Win10 with BOINC as a service and WCG at 100% share. 3 of its 4 cores are allocated, and checking just now I saw that 20 tasks are waiting, which is OK.

There are 3 systems with problems: two are newly minted Win10 and one is my main desktop that I just upgraded to Win11. All have a single NVidia and are set to "No New Tasks" on WCG until the problem gets fixed.

I suspect there is a dump of WCG tasks. I did not see the 1600 download all at once as a dump as the event log was too big and got truncated.

On one system (my desktop, share = 0) I watched about 10 days' worth download while 20 days' worth were waiting to run.
After it stabilized at 30 days' worth I increased the core count and watched another 10 days' worth download.
I had a limit of only 6 concurrent WCG tasks. There should have been no need to download anything on account of share=0 and the limit of 6.
Not sure if this counts as a trickle.

Should I be running that version you posted about two weeks ago? I tested it out on one of the new systems, but the problem there was the initial startup after installing BOINC, which is not the same as what I am seeing here.

[edit] I deleted my boinc.exe and copied over your version, the "max fix" one, to try it.
Question: when building the x64 release I get an executable that is 2x as big as the 7.16.20 that Berkeley has. There must be some setting in my VS2019 that is different from Berkeley's. Usually the debug version is the size hog.

[edit-2] All three systems used app_config to limit the number of concurrent WCG tasks. I replaced boinc.exe with that version you posted. Maybe this will fix the problem?
Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 106443 - Posted: 16 Dec 2021, 14:37:50 UTC

After installing that "max fix" version on 3 systems, I got one system that responded after "allow new work".

On the LenovoS20, which had 8 apps running (max concurrent is 8) and no apps waiting, there were two back-to-back downloads that totaled 14 days / 84 work units.
That actually can be done: at about 4 hours per task on 8 cores, 84 tasks x 4 hours / 8 cores is roughly 42 hours, within the deadlines of 12/22 through 12/23.
The problem is that NONE should have downloaded with a share of 0.

The other two systems I put the "max fix" on had a day of WCG already waiting, so I assume that affected the "allow new work" differently and they were not tempted into downloading more stuff.

The net effect is that (1) I am confident my 7.16.3 "special", which contains a coding "mod" for the Milkyway idle problem, did not cause the WCG problem. I do plan to update that client eventually.
(2) There is a problem with WCG and/or the client config, as some of my systems work perfectly with share=0 on WCG and others do not. I suspect most users do not use share=0, so no complaints.
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 5124
United Kingdom
Message 106444 - Posted: 16 Dec 2021, 16:23:08 UTC - in response to Message 106442.

I suspect there is a dump of WCG tasks. I did not see the 1600 download all at once as a dump as the event log was too big and got truncated.
Yes, WCG work (especially GPU work on Covid, task name prefix OPNG) tends to get released in batches - and the batches are getting bigger: I got 35 in one go at lunchtime.

16/12/2021 12:17:49 | World Community Grid | Scheduler request completed: got 35 new tasks
But I keep my requests reasonable, and never get more than I request.

16/12/2021 12:17:48 | World Community Grid | [sched_op] NVIDIA GPU work request: 42684.92 seconds; 2.00 devices
16/12/2021 12:17:49 | World Community Grid | [sched_op] estimated total NVIDIA GPU task duration: 21396 seconds
The Event Log flag <sched_op_debug> is active on all my machines, and can be useful in tracking down issues like this.
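For reference, both of those flags go in the <log_flags> section of cc_config.xml in the BOINC data folder; a minimal sketch (the file-size value is just an example, and I believe <max_stdout_file_size> is the option that controls how big stdoutdae.txt can grow before it rolls over):

<cc_config>
  <log_flags>
    <work_fetch_debug>1</work_fetch_debug>
    <sched_op_debug>1</sched_op_debug>
  </log_flags>
  <options>
    <max_stdout_file_size>10000000</max_stdout_file_size>
  </options>
</cc_config>

After editing it, use Options / Read config files in the Manager (or restart the client) to pick up the change.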

I find it unlikely that WCG would issue 1600 tasks in response to a single request: most projects set a lower limit in their server's feeder configuration (100 or 200).

Even without the current log, you can still track the history.
* Under Windows, in files stdoutdae.txt and stdoutdae.old in the data folder. You can configure those to retain any size you like.
* In the task list (either in the BOINC Manager, or on the project website), by inspecting the deadlines of the allocated tasks. WCG requests a delay of two minutes between fetches: you would be able to see a discontinuity of 2+ minutes between batches if multiple fetches were involved.

Question: when building the x64 release I get an executable that is 2x as big as the 7.16.20 that Berkeley has. There must be some setting in my VS2019 that is different from Berkeley's. Usually the debug version is the size hog.
Yes, VS2019 files are bigger. Earlier versions relied on library routines delivered as separate, external, .DLL files: with the VS2019 build, the libraries are embedded in the main executable.

[edit-2] All three systems used app_config to limit the number of concurrent WCG tasks. I replaced boinc.exe with that version you posted. Maybe this will fix the problem?
Yes, that's exactly what the #4592 patch was designed to fix. That's why it's called "client: fix work-fetch logic when max concurrent limits are used". The problem makes itself apparent by causing multiple, repeated, limitless work fetch requests. Which is why I keep asking if multiple, repeated, limitless work fetch requests are visible in your logs.
Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 106445 - Posted: 16 Dec 2021, 16:47:24 UTC
Last modified: 16 Dec 2021, 16:48:05 UTC

Just checked my Lenovo again. More tasks have downloaded but the deadline has not changed. There are 135 tasks waiting. At 4 hours per task and 8 cores that is about 68 hours of work, still within the deadline of 12/23. However, there should have been no downloads with the project priority (resource share) of 0.

Perhaps that feature (the "0") is not a client specification anymore, if it ever was. I have been using it for fallback projects: if Milkyway runs dry (like just happened recently) then Einstein gets to run, but as soon as one Einstein finishes, Milkyway can take over since it is 100% and Einstein is 0%. AFAICT WCG is the only project where the "0" has a problem.

I do not want to babysit WCG. If it wants to download 1700 tasks, I do not want to crunch tasks that will never be used. I have spotted 100's of their tasks marked "aborted by project" on my systems and have been trying to figure out how to prevent it. Project priority of "0" does not work on some systems and on others it does. As I have been writing this post the WCG task count went from 135 to 145. If I shut the system down for a long weekend about half will have expired before they even start.
Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 106446 - Posted: 16 Dec 2021, 17:03:11 UTC - in response to Message 106444.
Last modified: 16 Dec 2021, 17:05:16 UTC



I find it unlikely that WCG would issue 1600 tasks in response to a single request: most projects set a lower limit in their server's feeder configuration (100 or 200).

Even without the current log, you can still track the history.
* Under Windows, in files stdoutdae.txt and stdoutdae.old in the data folder. You can configure those to retain any size you like.
* In the task list (either in the BOINC Manager, or on the project website), by inspecting the deadlines of the allocated tasks. WCG requests a delay of two minutes between fetches: you would be able to see a discontinuity of 2+ minutes between batches if multiple fetches were involved.


[edit-2] All three systems used app_config to limit the number of concurrent WCG tasks. I replaced boinc.exe with that version you posted. Maybe this will fix the problem?
Yes, that's exactly what the #4592 patch was designed to fix. That's why it's called "client: fix work-fetch logic when max concurrent limits are used". The problem makes itself apparent by causing multiple, repeated, limitless work fetch requests. Which is why I keep asking if multiple, repeated, limitless work fetch requests are visible in your logs.


I think the only exception to the large download limit is the "lost task" download, but my tasks were not lost; they were aborted because there was no possibility of finishing them.

An observation that might be a clue:
One of my Win10 + NVidia systems has app_config with max concurrent of 9, but does not and never had a problem with too many WCG downloads. However, I set the max number of cores to 9 on that system and it also runs one Einstein. I am guessing the max_concurrent is not used, as the # of cores limit takes precedence in the fetch algorithm???
My Linux system uses max cores to limit WCG, not app_config, and it does not and never had a problem.
Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 106541 - Posted: 26 Dec 2021, 19:13:10 UTC

Follow-up on this problem. All my WCG systems have stabilized after a week of 24/7, and I got a "solution" for using Priority 0 together with that max_concurrent app option.

Recap: Setting Priority to 0 normally means the queue never exceeds 1 work unit even if each core is working on a WCG task.

During initial configuration of BOINC it is possible a lot of unwanted work units will download, but eventually the system gets to where a new download occurs only when there are no other tasks of the same type (CPU) in the queue. That is a different problem.

When using "max_concurrent" in WCG's app_config file, I was able to demonstrate that Priority of "0" is ignored if the number of cores allocated to the system is greater than the value of that max_concurrent parameter.


On my test system, I left the # of cores at 11 with max_concurrent at 8, and the number of WCG tasks increased to several hundred. However, at no time did the number of waiting work units exceed the deadline; as long as I left the system running 24/7 they would all finish within the deadline. When I set the number of cores down to 8, the same value as max_concurrent, there were no more downloads of work units, and eventually the queue got down to 0, at which time a single download occurred. This is the expected behavior for a Priority of 0.

Probably not many users have priority set to 0. Should this problem be reported as an issue over at GitHub? Can someone else verify this behavior?

Thanks for looking!
