Message boards : Questions and problems : Getting too many WCG tasks on systems that had been working ok
Joined: 27 Jun 08 Posts: 641

Going to switch to the latest version, as I cannot account for why too many tasks are being downloaded when the resource share is set to 0. I have several 7.16.3 installs, and the Linux ones do not show the problem. Three Win10 systems:

- 70 days of work, 834 tasks
- 322 days, 1658 tasks (and I had to abort 700+ tasks a few days ago)
- 2 days, 16 tasks

The above was not on new builds, where the share is set to 100 for a few minutes. I went over to the WCG forum but did not see any similar problems. They do not have a "Questions and problems" forum, so I had to poke around. It does not look like a problem at their end caused by the move from IBM. If it happens with 7.16.20 then I can try to debug it, if I knew what to look for.

[edit] I just started BOINC back up on a Windows system that rebooted due to a Windows feature update. It has 7.16.3, and I just watched it download additional WCG tasks when there was no need. The share was 0 and there was already a week's worth of tasks. Maybe when rebooting the 0% share is not noticed?
Joined: 5 Oct 06 Posts: 5124

What did the event log say about fetching?
Joined: 27 Jun 08 Posts: 641
> What did the event log say about fetching?

[EDIT] I fixed the version numbers I had garbled up. Note that ALL systems had the share set to 0 and had been that way for a long time.

OK, I turned on <work_fetch_debug> on three systems. One I had to stop and restart, as the chatter scrolled off the event screen and the top of the output was missing. Looks like I duplicated the problem from 7.16.3 on 7.16.20: the two I just upgraded to 7.16.20 and the one I just recently restarted. There was a difference on all three.

This one, running 7.16.20, downloaded one task. I had just aborted 1600+ and was afraid I would not get any because of the daily limit, but I did get one. So actually, this is normal.

bjysdualx2:

    84 12/15/2021 1:43:40 PM [work_fetch] target work buffer: 86400.00 + 0.00 sec
    85 12/15/2021 1:43:40 PM [work_fetch] --- project states ---
    91 World Community Grid 12/15/2021 1:43:40 PM [work_fetch] REC 26763.703 prio -0.000 can request work
    92 12/15/2021 1:43:40 PM [work_fetch] --- state for CPU ---
    93 12/15/2021 1:43:40 PM [work_fetch] shortfall 1869495.34 nidle 0.00 saturated 230.49 busy 0.00
    99 World Community Grid 12/15/2021 1:43:40 PM [work_fetch] share 0.000 zero resource share
    100 12/15/2021 1:43:40 PM [work_fetch] --- state for AMD/ATI GPU ---
    101 12/15/2021 1:43:40 PM [work_fetch] shortfall 344087.02 nidle 0.00 saturated 230.49 busy 0.00
    107 World Community Grid 12/15/2021 1:43:40 PM [work_fetch] share 0.000 zero resource share
    108 12/15/2021 1:43:40 PM [work_fetch] ------- end work fetch state -------
    120 World Community Grid 12/15/2021 1:43:40 PM choose_project: scanning
    121 World Community Grid 12/15/2021 1:43:40 PM can't fetch CPU: zero resource share
    122 World Community Grid 12/15/2021 1:43:40 PM can't fetch AMD/ATI GPU: zero resource share
    123 12/15/2021 1:43:40 PM [work_fetch] No project chosen for work fetch
    124 12/15/2021 1:44:41 PM choose_project(): 1639597481.509739

The above does not show any download, because I had to restart to get the top of the log.

The next is for another 7.16.20 machine that unfortunately downloaded more stuff. I had just restarted after putting in 7.16.20 and then aborted 50 days' worth, and that must have triggered more downloads. I did not have work_fetch_debug in cc_config.xml, so I missed what happened when it got the extra stuff.
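For reference, turning the flag on just takes a log_flags entry in cc_config.xml in the BOINC data directory - roughly this (a minimal sketch; if I remember right, Options / Read config files in the Manager re-reads it without a restart):

    <cc_config>
       <log_flags>
          <!-- log each work-fetch decision to the event log (verbose) -->
          <work_fetch_debug>1</work_fetch_debug>
       </log_flags>
    </cc_config>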
I then changed the CPU percentage to allow more tasks and got more downloads. That should not have happened (note that "max CPUs used: 14", a change from 12, caused more tasks).

JYSArea51:

    1779 12/15/2021 1:57:59 PM max CPUs used: 14
    1780 12/15/2021 1:57:59 PM (to change preferences, visit a project web site or select Preferences in the Manager)
    1781 12/15/2021 1:57:59 PM [work_fetch] Request work fetch: Prefs update
    1782 12/15/2021 1:57:59 PM [work_fetch] Request work fetch: Preferences override
    1783 12/15/2021 1:58:00 PM choose_project(): 1639598280.665096
    1784 12/15/2021 1:58:00 PM [work_fetch] ------- start work fetch state -------
    1785 12/15/2021 1:58:00 PM [work_fetch] target work buffer: 8640.00 + 43200.00 sec
    1786 12/15/2021 1:58:00 PM [work_fetch] --- project states ---
    1810 World Community Grid 12/15/2021 1:58:00 PM [work_fetch] REC 6981.661 prio -1000.053 can't request work: scheduler RPC backoff (13.04 sec)
    1812 12/15/2021 1:58:00 PM [work_fetch] --- state for CPU ---
    1813 12/15/2021 1:58:00 PM [work_fetch] shortfall 700695.25 nidle 7.00 saturated 0.00 busy 0.00
    1837 World Community Grid 12/15/2021 1:58:00 PM [work_fetch] share 0.000
    1839 12/15/2021 1:58:00 PM [work_fetch] --- state for NVIDIA GPU ---
    1840 12/15/2021 1:58:00 PM [work_fetch] shortfall 51647.39 nidle 0.00 saturated 192.61 busy 0.00
    1864 World Community Grid 12/15/2021 1:58:00 PM [work_fetch] share 0.000 zero resource share
    1866 12/15/2021 1:58:00 PM [work_fetch] ------- end work fetch state -------
    1914 World Community Grid 12/15/2021 1:58:00 PM choose_project: scanning
    1915 World Community Grid 12/15/2021 1:58:00 PM skip: scheduler RPC backoff
    1919 12/15/2021 1:58:00 PM [work_fetch] No project chosen for work fetch
    1920 12/15/2021 1:58:13 PM [work_fetch] Request work fetch: Backoff ended for World Community Grid
    1921 12/15/2021 1:58:15 PM choose_project(): 1639598295.784178
    1922 12/15/2021 1:58:15 PM [work_fetch] ------- start work fetch state -------
    1923 12/15/2021 1:58:15 PM [work_fetch] target work buffer: 8640.00 + 43200.00 sec
    1924 12/15/2021 1:58:15 PM [work_fetch] --- project states ---
    1948 World Community Grid 12/15/2021 1:58:15 PM [work_fetch] REC 6981.661 prio -1000.052 can request work
    1950 12/15/2021 1:58:15 PM [work_fetch] --- state for CPU ---
    1951 12/15/2021 1:58:15 PM [work_fetch] shortfall 700709.42 nidle 7.00 saturated 0.00 busy 0.00
    1975 World Community Grid 12/15/2021 1:58:15 PM [work_fetch] share 1.000
    1977 12/15/2021 1:58:15 PM [work_fetch] --- state for NVIDIA GPU ---
    1978 12/15/2021 1:58:15 PM [work_fetch] shortfall 51661.53 nidle 0.00 saturated 178.47 busy 0.00
    2002 World Community Grid 12/15/2021 1:58:15 PM [work_fetch] share 1.000
    2004 12/15/2021 1:58:15 PM [work_fetch] ------- end work fetch state -------
    2052 World Community Grid 12/15/2021 1:58:15 PM choose_project: scanning
    2053 World Community Grid 12/15/2021 1:58:15 PM can fetch CPU
    2054 World Community Grid 12/15/2021 1:58:15 PM CPU needs work - buffer low

The system still running 7.16.3 downloaded another week's worth.
This is the chatter from lenovos20:

    43 12/15/2021 1:24:01 PM choose_project(): 1639596241.273872
    44 12/15/2021 1:24:01 PM [work_fetch] ------- start work fetch state -------
    45 12/15/2021 1:24:01 PM [work_fetch] target work buffer: 86400.00 + 0.00 sec
    46 12/15/2021 1:24:01 PM [work_fetch] --- project states ---
    48 World Community Grid 12/15/2021 1:24:01 PM [work_fetch] REC 4124.711 prio -0.112 can request work
    49 12/15/2021 1:24:01 PM [work_fetch] --- state for CPU ---
    50 12/15/2021 1:24:01 PM [work_fetch] shortfall 695894.09 nidle 1.00 saturated 0.00 busy 0.00
    52 World Community Grid 12/15/2021 1:24:01 PM [work_fetch] share 1.000
    53 12/15/2021 1:24:01 PM [work_fetch] --- state for NVIDIA GPU ---
    54 12/15/2021 1:24:01 PM [work_fetch] shortfall 18361.75 nidle 0.00 saturated 68038.25 busy 0.00
    56 World Community Grid 12/15/2021 1:24:01 PM [work_fetch] share 0.500
    57 12/15/2021 1:24:01 PM [work_fetch] ------- end work fetch state -------
    58 World Community Grid 12/15/2021 1:24:01 PM choose_project: scanning
    59 World Community Grid 12/15/2021 1:24:01 PM can fetch CPU
    60 World Community Grid 12/15/2021 1:24:01 PM CPU needs work - buffer low
    61 World Community Grid 12/15/2021 1:24:01 PM checking CPU
    62 World Community Grid 12/15/2021 1:24:01 PM [work_fetch] using MC shortfall 591132.340164 instead of shortfall 695894.087949
    63 World Community Grid 12/15/2021 1:24:01 PM [work_fetch] set_request() for CPU: ninst 10 nused_total 227.00 nidle_now 1.00 fetch share 1.00 req_inst 0.00 req_secs 591132.34
    64 World Community Grid 12/15/2021 1:24:01 PM CPU set_request: 591132.340164
    65 World Community Grid 12/15/2021 1:24:01 PM checking NVIDIA GPU
    66 World Community Grid 12/15/2021 1:24:01 PM [work_fetch] using MC shortfall 18361.747788 instead of shortfall 18361.747788
    67 World Community Grid 12/15/2021 1:24:01 PM [work_fetch] set_request() for NVIDIA GPU: ninst 1 nused_total 0.00 nidle_now 0.00 fetch share 0.50 req_inst 0.00 req_secs 18361.75
    68 World Community Grid 12/15/2021 1:24:01 PM NVIDIA GPU set_request: 18361.747788
Joined: 27 Jun 08 Posts: 641

[edit] I had to delete most of what I wrote, as I had been looking at the wrong system. The system that had downloaded just one task has now gone and downloaded a few more, for a total of 4. That is probably OK. During that time, another 7.16.20 system downloaded another week's worth.
Joined: 27 Jun 08 Posts: 641

Found something strange in the code. Looking for "[work_fetch] share 0.000", I found it is printed by this function:

    void RSC_WORK_FETCH::print_state(const char* name) {
        ...
        msg_printf(p, MSG_INFO,
            "[work_fetch] share %.3f %s %s",
            rpwf.fetchable_share,
            rsc_reason_string(rpwf.rsc_project_reason),
            buf
        ...

The variable that has the value of 0.000 (or 1.000, or 0.500) is defined here:

    double fetchable_share;
        // this project's share relative to projects from which
        // we could probably get work for this resource;
        // determines how many instances this project deserves

and it can be set to "1" here, based on "project reason":

    if (!p->rsc_pwf[j].rsc_project_reason) {
        p->rsc_pwf[j].fetchable_share =
            rsc_work_fetch[j].total_fetchable_share ?
                p->resource_share / rsc_work_fetch[j].total_fetchable_share : 1;

So if the IF condition is true (i.e. "project reason" is false - just noticed the negation), the share gets assigned, with 1.0 as the fallback. I do not know where the 0.5 came from. However, someone has hard-coded a 1.0 for the project share, which is suspicious. If I knew more about "project reason", maybe there is a "reason".

[edit] Just realized that "rsc_reason_string(rpwf.rsc_project_reason)" must return an empty string, since nothing was printed after the 1.000, so the reason is "false"?? and a 1.0 seems to have been assigned to the project share? HTH
Joined: 27 Jun 08 Posts: 641

Some thoughts on the following code, worth about 2c (my thoughts, not the code):

    if (!p->rsc_pwf[j].rsc_project_reason) {
        p->rsc_pwf[j].fetchable_share =
            rsc_work_fetch[j].total_fetchable_share ?
                p->resource_share / rsc_work_fetch[j].total_fetchable_share : 1;
    ...
    msg_printf(p, MSG_INFO,
        "[work_fetch] share %.3f %s %s",
        rpwf.fetchable_share,
        rsc_reason_string(rpwf.rsc_project_reason),
        buf

The following line indicates that a "1" was not put into the share. That means the IF condition was false, and consequently rsc_project_reason was true (non-zero):

    [work_fetch] share 0.000 zero resource share

The following line indicates that not only was the IF condition true (rsc_project_reason was false), but also that rsc_reason_string() returned an empty string, as nothing was printed after the number:

    [work_fetch] share 1.000

Anyway, I edited that code, changed the "1" to "0", and put a copy labelled "7.16.19" on two of my worst WCG offender systems. After rebooting, the system with 20 cores downloaded 4 new tasks and the system with only 8 cores downloaded only 2. This was after I aborted about 75 days of work, most of which could not have been completed by the deadline. Will know tomorrow for sure if my "fix" worked.
Joined: 27 Jun 08 Posts: 641

The code change had no real effect. While the 1.000 no longer showed up in the log file, the system with only 8 cores went and got 10 days' worth of work. The system with 20 cores got just one day. Neither system should have downloaded more than one WCG task at a time with the share set to 0. None of my other projects show this behavior. There is a problem somewhere.
Joined: 5 Oct 06 Posts: 5124

Oh dear. The resource share you read in a <work_fetch_debug> segment of the event log has NOTHING TO DO with the resource share you set on a project web site.

The <work_fetch_debug> usage is an instantaneous snapshot - literally, "what can we do now, this second, in this single instance of work-fetch decision making?". The first question is: "are we allowed, now, this instant, to fetch work from this project?" If no, the value will be zero; if yes, the value will be positive. The routine loops over all attached projects and counts the positives. If N projects are in the 'can fetch' state, each will show a share of 1/N.

The project resource share is a long-term value. That is designed to balance out the project work allocation over days, weeks, months - not second by second.

When I asked "what does the event log say about fetching?", I was hoping for a broad-brush overview, at least in the first instance. When you said you "had to abort" a number of tasks, how did they arrive? Did they arrive in one huge dump? Or did they arrive in a trickle, a few at a time, again and again and again? To a first approximation, a dump indicates a server problem; a trickle indicates a client problem. We need to know where to start looking.
Joined: 27 Jun 08 Posts: 641

> Oh dear. The resource share you read in a <work_fetch_debug> segment of the event log has NOTHING TO DO with the resource share you set on a project web site.

Well, at least I was correct about the 2c.

Hmm - I do not remember problems like this in other projects, nor in WCG before they implemented GPU work for that COVID app. There seems to be no rhyme nor reason to this problem, as some systems are not affected:

- A 16-core Linux box with 7.16.3 and 2 AMD boards never has a problem. When one WU is uploaded, another is downloaded.
- A pair of Windows 10 machines with NVIDIA likewise have no problem. They have been running perfectly for a long time: a single download for every upload.

All the new systems I recently built have problems except one. The one with no problem runs Win10 with BOINC as a service, and WCG is at 100% share. 3 of its 4 cores are allocated, and checking just now, I saw that 20 tasks are waiting, which is OK.

There are 3 systems with problems: two newly minted Win10 machines and my main desktop, which I just upgraded to Win11. All have a single NVIDIA card and are set to "No New Tasks" on WCG until the problem gets fixed.

I suspect there is a dump of WCG tasks. I did not see the 1600 arrive all at once as a dump, because the event log was too big and got truncated. On one system (my desktop, share = 0) I watched about 10 days' worth download while 20 days' worth were waiting to run. After it stabilized at 30 days' worth, I increased the core count and watched another 10 days' worth download. I had a limit of only 6 concurrent WCG tasks. There should have been no need to download anything on account of share=0 and the limit of 6. Not sure if this counts as a trickle.

Should I be running that version you posted about two weeks ago? I tested it out on one of the new systems, but that problem was the initial startup after installing BOINC, which is not the same as what I am seeing here.

[edit] I deleted my boinc.exe and copied over your version, the "max fix" one, to try it. Question: when building the x64 release I get an executable that is 2x as big as the 7.16.20 that Berkeley has. There must be some setting in my VS2019 that is different from Berkeley's. Usually the debug version is the size hog.

[edit-2] All three systems used app_config to limit the number of concurrent WCG tasks. I replaced boinc.exe with the version you posted. Maybe this will fix the problem?
Joined: 27 Jun 08 Posts: 641

After installing that "max fix" version on 3 systems, I got one system that responded after "allow new work". On the LenovoS20, which had 8 tasks running (max concurrent is 8) and no tasks waiting, there were two back-to-back downloads that totaled 14 days / 84 work units. That actually can be done: at 4 hours per core and 8 cores, with deadlines of 12/22 through 12/23, all 84 tasks should finish in about 42 hours. The problem is that NONE should have downloaded with a share of 0.

The other two systems I put the "max fix" on already had a day of WCG waiting, so I assume that affected the "allow new work" differently and they were not tempted into downloading more stuff.

The net effect is that (1) I am confident my 7.16.3 "special", which contains a coding "mod" for the Milkyway idle problem, did not cause the WCG problem (I do plan to update that build eventually), and (2) there is a problem with WCG and/or the client config, as some of my systems work perfectly with share=0 on WCG and others do not. I suspect most users do not use share=0, hence no complaints.
Joined: 5 Oct 06 Posts: 5124

> I suspect there is a dump of WCG tasks. I did not see the 1600 download all at once as a dump as the event log was too big and got truncated.

Yes, WCG work (especially GPU work on Covid, task name prefix OPNG) tends to get released in batches - and the batches are getting bigger. I got 35 in one go at lunchtime:

    16/12/2021 12:17:49 | World Community Grid | Scheduler request completed: got 35 new tasks

But I keep my requests reasonable, and never get more than I request:

    16/12/2021 12:17:48 | World Community Grid | [sched_op] NVIDIA GPU work request: 42684.92 seconds; 2.00 devices
    16/12/2021 12:17:49 | World Community Grid | [sched_op] estimated total NVIDIA GPU task duration: 21396 seconds

The Event Log flag <sched_op_debug> is active on all my machines, and can be useful in tracking down issues like this.

I find it unlikely that WCG would issue 1600 tasks in response to a single request: most projects set a limit in their server's feeder configuration (100 or 200 tasks per request). Even without the current log, you can still track the history:

* Under Windows, in the files stdoutdae.txt and stdoutdae.old in the data folder. You can configure those to retain any size you like.
* In the task list (either in the BOINC Manager or on the project website), by inspecting the deadlines of the allocated tasks. WCG requests a delay of two minutes between fetches, so you would be able to see a discontinuity of 2+ minutes between batches if multiple fetches were involved.

> Question: when building the x64 release I get an executable that is 2x as big as the 7.16.20 that Berkeley has. There must be some setting in my VS2019 that is different from Berkeley's. Usually the debug version is the size hog.

Yes, VS2019 files are bigger. Earlier versions relied on library routines delivered as separate, external .DLL files; with the VS2019 build, the libraries are embedded in the main executable.

> [edit-2] All three systems used app_config to limit the number of concurrent WCG tasks. I replaced boinc.exe with that version you posted. Maybe this will fix the problem?

Yes, that's exactly what the #4592 patch was designed to fix. That's why it's called "client: fix work-fetch logic when max concurrent limits are used". The problem makes itself apparent by causing multiple, repeated, limitless work fetch requests. Which is why I keep asking whether multiple, repeated, limitless work fetch requests are visible in your logs.
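For reference, the relevant pieces of cc_config.xml look roughly like this (a minimal sketch, not my literal file; the size value is just an example):

    <cc_config>
       <log_flags>
          <!-- log scheduler requests and replies: what was asked for, what came back -->
          <sched_op_debug>1</sched_op_debug>
       </log_flags>
       <options>
          <!-- maximum size, in bytes, kept in stdoutdae.txt;
               raise it to retain more event-log history (example value) -->
          <max_stdout_file_size>10000000</max_stdout_file_size>
       </options>
    </cc_config>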
Joined: 27 Jun 08 Posts: 641

Just checked my Lenovo again. More tasks have downloaded, but the deadline has not changed. There are 135 tasks waiting. At 4 hours per task and 8 cores, that is about 68 hours of work, which is still within the deadline of 12/23. However, there should have been no downloads with a project resource share of 0. Perhaps that feature (the "0") is not honoured by the client anymore, if it ever was.

I have been using it as a fallback mechanism: if Milkyway runs dry (as just happened recently) then Einstein gets to run, but as soon as one Einstein task finishes, Milkyway can take over, since it is at 100% and Einstein is at 0%. AFAICT WCG is the only project where the "0" has a problem.

I do not want to babysit WCG. If it wants to download 1700 tasks, I do not want to crunch tasks that will never be used. I have spotted hundreds of their tasks marked "aborted by project" on my systems and have been trying to figure out how to prevent it. A resource share of "0" does not work on some systems, and on others it does. As I have been writing this post, the WCG task count went from 135 to 145. If I shut the system down for a long weekend, about half will have expired before they even start.
Joined: 27 Jun 08 Posts: 641

I think the only exception to the per-request download limit is the "lost task" resend, but my tasks were not lost; they were aborted because there was no possibility of finishing them.

An observation that might be a clue: one of my Win10 + NVIDIA systems has an app_config with a max_concurrent of 9, but it does not, and never has had, a problem with too many WCG downloads. However, I set the max number of cores to 9 on that system, and it also runs one Einstein task. I am guessing the max_concurrent never comes into play because the core-count limit takes precedence in the fetch algorithm??? My Linux system uses max cores to limit WCG rather than app_config, and it does not, and never has had, a problem.
Joined: 27 Jun 08 Posts: 641

Follow-up on this problem. All my WCG systems have stabilized after a week of running 24/7, and I have a "solution" for combining a resource share of 0 with the max_concurrent app option.

Recap: setting the resource share to 0 normally means the queue never exceeds 1 work unit, even if each core is working on a WCG task. During initial configuration of BOINC it is possible that a lot of unwanted work units will download, but eventually the system gets to the point where a new download occurs only when there are no other tasks of the same type (CPU) in the queue. This is a different problem.

Using "max_concurrent" in WCG's app_config file, I was able to demonstrate that a resource share of "0" is ignored if the number of cores allocated to the system is greater than the value of that max_concurrent parameter. On my test system, I left the core count at 11 with max_concurrent at 8, and the number of WCG tasks increased to several hundred. However, at no time did the number of waiting work units exceed the deadline; as long as I left the system running 24/7, they would all finish within the deadline. When I set the number of cores down to 8, the same value as max_concurrent, there were no more downloads, and eventually the queue got down to 0, at which time a single download occurred. That is the expected behavior for a resource share of 0 (a sketch of the kind of app_config.xml involved is below).

Probably not many users have the resource share set to 0. Should this problem be reported as an issue over at GitHub? Can someone else verify this behavior? Thanks for looking!
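A minimal sketch of the kind of app_config.xml referred to above (it goes in the WCG project folder under projects/ in the BOINC data directory; the 8 matches the 8-core test and is an example, not a recommendation - whether you use project_max_concurrent or a per-app <max_concurrent> inside an <app> block depends on your setup):

    <app_config>
       <!-- cap on how many WCG tasks run at once, across all WCG applications;
            the finding above is to keep this no larger than the number of
            cores BOINC is allowed to use when the resource share is 0 -->
       <project_max_concurrent>8</project_max_concurrent>
    </app_config>

The client should pick up changes after "Read config files" in the Manager, or on restart.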
Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.