Posts by Joseph Stateson

41) Message boards : Questions and problems : Getting too may WCG tasks on systems that had been working ok (Message 106442)
Posted 16 Dec 2021 by Profile Joseph Stateson
Post:
Oh dear. The resource share you read in a <work_fetch_debug> segment of the event log has NOTHING TO DO with the resource share you set on a project web site.
To a first approximation, a dump indicates a server problem: a trickle indicates a client problem. We need to know where to start looking.


Well, at least I was correct about the 2c.

Hmm - I do not remember problems like this in other projects NOR in WCG before they implemented GPU for that COVID app.
There seems to be no rhyme nor reason to this problem as some systems are not affected:

- 16 core linux with 7.16.3 and 2 AMD boards never has a problem. When one WU is uploaded another is downloaded.
- Pair of windows 10 with NVidia likewise no problem. Been running perfectly for a long time. A single download for every upload.

All the new systems I recently built have problems except one.

The one with no problem runs win10 and BOINC as a service and WCG is 100% share. 3 cores of 4 are allocated and checking I just saw that 20 tasks are waiting which is OK.

There are 3 systems with problems: two are newly minted win10 and one is my main desktop that I just upgraded to win11. All have a single NVidia and are set for "No New Tasks" on WCG until the problem gets fixed.

I suspect there is a dump of WCG tasks. I did not see the 1600 download all at once as a dump as the event log was too big and got truncated.

On one system (my desktop, share = 0) I watched about 10 days worth download while 20 days worth were waiting to run.
After it stabilized at 30 days worth I increased the core count and watched another 10 days worth download.
I had a limit of only 6 concurrent WCG tasks. There should have been no need to download anything on account of share=0 and the limit of 6.
Not sure if this counts as a trickle.

Should I be running that version you posted about two weeks ago? I tested it out on one of the new systems but the problem was the initial startup after installing BOINC which is not the same as I am seeing here.

[edit] I deleted my boinc.exe and copied over your version, the "max fix" one to try it.
Question: when building the x64 release I get an executable that is 2x as big as the 7.16.20 that Berkeley has. There must be some setting in my VS2019 that is different from Berkeley's. Usually the debug version is the size hog.

[edit-2] All three systems used app_config to limit the number of concurrent WCG tasks. I replaced boinc.exe with that version you posted. Maybe this will fix the problem?
42) Message boards : Questions and problems : Getting too may WCG tasks on systems that had been working ok (Message 106438)
Posted 16 Dec 2021 by Profile Joseph Stateson
Post:
The code change had no real effect. While the 1.000 no longer showed up in the log file, the system with only 8 cores went and got 10 days worth of work. The system with 20 cores got just one day. Neither system should have downloaded more than 1 WCG task at a time with share set to 0. None of my other projects have this behavior. There is a problem somewhere.
43) Message boards : Questions and problems : Getting too may WCG tasks on systems that had been working ok (Message 106436)
Posted 16 Dec 2021 by Profile Joseph Stateson
Post:
Some thoughts on the following code, worth about 2c (my thoughts, not the code)

        if (!p->rsc_pwf[j].rsc_project_reason) {
                p->rsc_pwf[j].fetchable_share = rsc_work_fetch[j].total_fetchable_share?p->resource_share/rsc_work_fetch[j].total_fetchable_share:1;
...
...
        msg_printf(p, MSG_INFO,
            "[work_fetch] share %.3f %s %s",
            rpwf.fetchable_share,
            rsc_reason_string(rpwf.rsc_project_reason),
            buf


The following indicates that a "1" was not put into the fetchable share.
That means the IF condition was "false" and consequently the project_reason was "true" (non-zero):
[work_fetch] share 0.000 zero resource share



The following indicates that not only was the IF true (project_reason was false)
but in addition the "rsc_reason_string" is empty as nothing was printed.
[work_fetch] share 1.000 


Anyway, I edited that code and changed "1" to "0" and put a copy "7.16.19" on two of my worst WCG offender systems.
After rebooting, the system with 20 cores downloaded 4 new apps and the system with only 8 cores downloaded only 2 apps. This was after I aborted about 75 days of work, most of which could not have been completed by the deadline.

Will know tomorrow for sure if my "fix" worked.
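
A minimal sketch of why the two log lines above look different. It reuses the format string from the msg_printf call; the helper name format_share_line is hypothetical, and passing an empty string for the reason is an assumption inferred from the log (nothing printed after the 1.000), not something checked against rsc_reason_string in the source.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: formats one line with the same format string as the
// msg_printf call quoted above. "reason" stands in for the result of
// rsc_reason_string(...) and "buf" for the trailing buffer.
std::string format_share_line(double share, const char* reason, const char* buf) {
    char line[256];
    std::snprintf(line, sizeof(line),
                  "[work_fetch] share %.3f %s %s", share, reason, buf);
    return std::string(line);
}
```

With an empty reason and an empty buf the line ends in two spaces after 1.000, which matches what the event log shows; with "zero resource share" as the reason the text appears after 0.000.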
44) Message boards : Questions and problems : Getting too may WCG tasks on systems that had been working ok (Message 106434)
Posted 15 Dec 2021 by Profile Joseph Stateson
Post:
Found something strange in the code

Looking for "[work_fetch] share 0.000 "

I found the above was printed by the function
void RSC_WORK_FETCH::print_state(const char* name) {
...
....
        msg_printf(p, MSG_INFO,
            "[work_fetch] share %.3f %s %s",
            rpwf.fetchable_share,
            rsc_reason_string(rpwf.rsc_project_reason),
            buf
...


where the variable that has the value of 0.000 (or 1.0 or 0.5) is defined here
double fetchable_share;
        // this project's share relative to projects from which
        // we could probably get work for this resource;
        // determines how many instances this project deserves


and it can be set to "1" here based on "project reason"
           if (!p->rsc_pwf[j].rsc_project_reason) {
                p->rsc_pwf[j].fetchable_share = rsc_work_fetch[j].total_fetchable_share?p->resource_share/rsc_work_fetch[j].total_fetchable_share:1;
 

so if "project reason" is false (just noticed the negation) and the total fetchable share is zero, then share is set to 1.0

I do not know where the 0.5 came from. However, someone has hard coded a 1.0 for the project share which is suspicious. If I knew more about "project reason" maybe there is a "reason"

[edit] just realized that "rsc_reason_string(rpwf.rsc_project_reason)" is null since nothing was printed after the 1.0000 so it is "false" ?? and a 1.0 seems to have been assigned to project share ?
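
A minimal sketch of that ternary pulled out into a stand-alone function, so the cases can be checked by hand. Only the expression itself comes from the snippet above; the function name and the sample numbers are hypothetical.

```cpp
#include <cstdio>

// Hypothetical stand-in for the ternary in the client code quoted above.
// resource_share: this project's share.
// total_fetchable_share: the sum over projects that could supply work
// for this resource.
double fetchable_share(double resource_share, double total_fetchable_share) {
    // Same shape as the original expression: divide when the total is
    // non-zero, otherwise fall back to the hard-coded 1.
    return total_fetchable_share ? resource_share / total_fetchable_share : 1;
}

// Print the share for a few illustrative inputs (values are made up).
void print_examples() {
    std::printf("%.3f\n", fetchable_share(0, 0));     // total is zero: fallback to 1
    std::printf("%.3f\n", fetchable_share(100, 200)); // two equal shares
    std::printf("%.3f\n", fetchable_share(0, 100));   // zero share, non-zero total
}
```

Under this expression a 0.5 would come out if two fetchable projects had equal shares, but that is only a guess; the log does not confirm where the 0.5 came from.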

HTH
45) Message boards : Questions and problems : Getting too may WCG tasks on systems that had been working ok (Message 106432)
Posted 15 Dec 2021 by Profile Joseph Stateson
Post:
[edit] I had to delete most of what I wrote as I had been looking at the wrong system.
The system that had downloaded just one task has now gone and downloaded a few more for a total of 4. That is probably ok. During that time another 7.16.20 downloaded another week's worth.
46) Message boards : Questions and problems : Getting too may WCG tasks on systems that had been working ok (Message 106431)
Posted 15 Dec 2021 by Profile Joseph Stateson
Post:
WHAT DID THE EVENT LOG SAY ABOUT FETCHING?


[EDIT] I fixed the version numbers I had garbled up. Note that ALL systems had share set to 0 and had been that way for a long time.

OK, I turned on <work_fetch_debug> on three systems. One I had to stop and restart as the chatter went off the event screen and the "top" was missing
LOOKS LIKE I DUPLICATED THE PROBLEM FROM 7.16.3 ON 7.16.20!!

The two I just upgraded to 7.16.20 and the one I just recently restarted. There was a difference ON ALL THREE

This one running 7.16.20 downloaded one task. I had just aborted 1600+ and was afraid I would not get any because of daily limit, but I did get one. So actually, this is normal

bjysdualx2

84			12/15/2021 1:43:40 PM	[work_fetch] target work buffer: 86400.00 + 0.00 sec	
85			12/15/2021 1:43:40 PM	[work_fetch] --- project states ---	
91	World Community Grid	12/15/2021 1:43:40 PM	[work_fetch] REC 26763.703 prio -0.000 can request work	
92			12/15/2021 1:43:40 PM	[work_fetch] --- state for CPU ---	
93			12/15/2021 1:43:40 PM	[work_fetch] shortfall 1869495.34 nidle 0.00 saturated 230.49 busy 0.00	
99	World Community Grid	12/15/2021 1:43:40 PM	[work_fetch] share 0.000 zero resource share 	
100			12/15/2021 1:43:40 PM	[work_fetch] --- state for AMD/ATI GPU ---	
101			12/15/2021 1:43:40 PM	[work_fetch] shortfall 344087.02 nidle 0.00 saturated 230.49 busy 0.00	
107	World Community Grid	12/15/2021 1:43:40 PM	[work_fetch] share 0.000 zero resource share 	
108			12/15/2021 1:43:40 PM	[work_fetch] ------- end work fetch state -------	
120	World Community Grid	12/15/2021 1:43:40 PM	choose_project: scanning	
121	World Community Grid	12/15/2021 1:43:40 PM	can't fetch CPU: zero resource share	
122	World Community Grid	12/15/2021 1:43:40 PM	can't fetch AMD/ATI GPU: zero resource share	
123			12/15/2021 1:43:40 PM	[work_fetch] No project chosen for work fetch	
124			12/15/2021 1:44:41 PM	choose_project(): 1639597481.509739	


The above does not show any download because I had to restart to get the "TOP"

The next is for another 7.16.20 that unfortunately downloaded more stuff. I had just restarted after putting in 7.16.20 and then I aborted 50 days worth and that must have triggered more downloads. I did not have work_fetch_debug in the cc so I missed what happened when it got extra stuff. I then changed %cpu to allow more tasks and got more downloads THAT SHOULD NOT HAVE HAPPENED (note the 14 cpu; a change from 12 caused more tasks)

JYSArea51

1779			12/15/2021 1:57:59 PM	   max CPUs used: 14	
1780			12/15/2021 1:57:59 PM	   (to change preferences, visit a project web site or select Preferences in the Manager)	
1781			12/15/2021 1:57:59 PM	[work_fetch] Request work fetch: Prefs update	
1782			12/15/2021 1:57:59 PM	[work_fetch] Request work fetch: Preferences override	
1783			12/15/2021 1:58:00 PM	choose_project(): 1639598280.665096	
1784			12/15/2021 1:58:00 PM	[work_fetch] ------- start work fetch state -------	
1785			12/15/2021 1:58:00 PM	[work_fetch] target work buffer: 8640.00 + 43200.00 sec	
1786			12/15/2021 1:58:00 PM	[work_fetch] --- project states ---	
1810	World Community Grid	12/15/2021 1:58:00 PM	[work_fetch] REC 6981.661 prio -1000.053 can't request work: scheduler RPC backoff (13.04 sec)	
1812			12/15/2021 1:58:00 PM	[work_fetch] --- state for CPU ---	
1813			12/15/2021 1:58:00 PM	[work_fetch] shortfall 700695.25 nidle 7.00 saturated 0.00 busy 0.00	
1837	World Community Grid	12/15/2021 1:58:00 PM	[work_fetch] share 0.000  	
1839			12/15/2021 1:58:00 PM	[work_fetch] --- state for NVIDIA GPU ---	
1840			12/15/2021 1:58:00 PM	[work_fetch] shortfall 51647.39 nidle 0.00 saturated 192.61 busy 0.00	
1864	World Community Grid	12/15/2021 1:58:00 PM	[work_fetch] share 0.000 zero resource share 	
1866			12/15/2021 1:58:00 PM	[work_fetch] ------- end work fetch state -------	
1914	World Community Grid	12/15/2021 1:58:00 PM	choose_project: scanning	
1915	World Community Grid	12/15/2021 1:58:00 PM	skip: scheduler RPC backoff	
1919			12/15/2021 1:58:00 PM	[work_fetch] No project chosen for work fetch	
1920			12/15/2021 1:58:13 PM	[work_fetch] Request work fetch: Backoff ended for World Community Grid	
1921			12/15/2021 1:58:15 PM	choose_project(): 1639598295.784178	
1922			12/15/2021 1:58:15 PM	[work_fetch] ------- start work fetch state -------	
1923			12/15/2021 1:58:15 PM	[work_fetch] target work buffer: 8640.00 + 43200.00 sec	
1924			12/15/2021 1:58:15 PM	[work_fetch] --- project states ---	
1948	World Community Grid	12/15/2021 1:58:15 PM	[work_fetch] REC 6981.661 prio -1000.052 can request work	
1950			12/15/2021 1:58:15 PM	[work_fetch] --- state for CPU ---	
1951			12/15/2021 1:58:15 PM	[work_fetch] shortfall 700709.42 nidle 7.00 saturated 0.00 busy 0.00	
1975	World Community Grid	12/15/2021 1:58:15 PM	[work_fetch] share 1.000  	
1977			12/15/2021 1:58:15 PM	[work_fetch] --- state for NVIDIA GPU ---	
1978			12/15/2021 1:58:15 PM	[work_fetch] shortfall 51661.53 nidle 0.00 saturated 178.47 busy 0.00	
2002	World Community Grid	12/15/2021 1:58:15 PM	[work_fetch] share 1.000  	
2004			12/15/2021 1:58:15 PM	[work_fetch] ------- end work fetch state -------	
2052	World Community Grid	12/15/2021 1:58:15 PM	choose_project: scanning	
2053	World Community Grid	12/15/2021 1:58:15 PM	can fetch CPU	
2054	World Community Grid	12/15/2021 1:58:15 PM	CPU needs work - buffer low	





The system still running 7.16.3 downloaded another week's worth. This is the chatter:

lenovos20

43			12/15/2021 1:24:01 PM	choose_project(): 1639596241.273872	
44			12/15/2021 1:24:01 PM	[work_fetch] ------- start work fetch state -------	
45			12/15/2021 1:24:01 PM	[work_fetch] target work buffer: 86400.00 + 0.00 sec	
46			12/15/2021 1:24:01 PM	[work_fetch] --- project states ---	
48	World Community Grid	12/15/2021 1:24:01 PM	[work_fetch] REC 4124.711 prio -0.112 can request work	
49			12/15/2021 1:24:01 PM	[work_fetch] --- state for CPU ---	
50			12/15/2021 1:24:01 PM	[work_fetch] shortfall 695894.09 nidle 1.00 saturated 0.00 busy 0.00	
52	World Community Grid	12/15/2021 1:24:01 PM	[work_fetch] share 1.000  	
53			12/15/2021 1:24:01 PM	[work_fetch] --- state for NVIDIA GPU ---	
54			12/15/2021 1:24:01 PM	[work_fetch] shortfall 18361.75 nidle 0.00 saturated 68038.25 busy 0.00	
56	World Community Grid	12/15/2021 1:24:01 PM	[work_fetch] share 0.500  	
57			12/15/2021 1:24:01 PM	[work_fetch] ------- end work fetch state -------	
58	World Community Grid	12/15/2021 1:24:01 PM	choose_project: scanning	
59	World Community Grid	12/15/2021 1:24:01 PM	can fetch CPU	
60	World Community Grid	12/15/2021 1:24:01 PM	CPU needs work - buffer low	
61	World Community Grid	12/15/2021 1:24:01 PM	checking CPU	
62	World Community Grid	12/15/2021 1:24:01 PM	[work_fetch] using MC shortfall 591132.340164 instead of shortfall 695894.087949	
63	World Community Grid	12/15/2021 1:24:01 PM	[work_fetch] set_request() for CPU: ninst 10 nused_total 227.00 nidle_now 1.00 fetch share 1.00 req_inst 0.00 req_secs 591132.34	
64	World Community Grid	12/15/2021 1:24:01 PM	CPU set_request: 591132.340164	
65	World Community Grid	12/15/2021 1:24:01 PM	checking NVIDIA GPU	
66	World Community Grid	12/15/2021 1:24:01 PM	[work_fetch] using MC shortfall 18361.747788 instead of shortfall 18361.747788	
67	World Community Grid	12/15/2021 1:24:01 PM	[work_fetch] set_request() for NVIDIA GPU: ninst 1 nused_total 0.00 nidle_now 0.00 fetch share 0.50 req_inst 0.00 req_secs 18361.75	
68	World Community Grid	12/15/2021 1:24:01 PM	NVIDIA GPU set_request: 18361.747788	
47) Message boards : Questions and problems : Getting too may WCG tasks on systems that had been working ok (Message 106429)
Posted 15 Dec 2021 by Profile Joseph Stateson
Post:
Going to switch to latest version as I cannot account for why too many tasks are being downloaded when share is set to 0.

I have several 7.16.3 and the linux ones do not show a problem. Three win10 systems:
- 70 days, 834 tasks
- 322 days, 1658 tasks and I had aborted 700+ tasks a few days ago.
- 2 days, 16 tasks

The above was not on new builds where share is set to 100 for a few minutes.

I went over to the WCG forum but did not see any similar problems. They do not have a "question and problems" forum so I had to poke around
It does not look like a problem at their end caused by the move from IBM. If it happens with 7.16.20 then I can try to debug it if I knew what to look for.

[edit] I just started boinc back up on a windows system that rebooted due to a windows feature update. It has 7.16.3 and I just watched it download additional WCG tasks when there was no need. Share was 0 and there was already a week's worth of tasks. Maybe when rebooting the 0% is not noticed ???
48) Message boards : Questions and problems : All Milkyway@Home GPU WU"s get Computation error (Message 106400)
Posted 13 Dec 2021 by Profile Joseph Stateson
Post:
I'm trying to get some GPU work going on my Mac Pro. It is running 10.13.6 and has a GeForce GT120.


that board does not support double precision float. All GPU tasks will fail.
49) Message boards : Questions and problems : WCG: new systems download 100s of CPU work units, not possible to work all (Message 106386)
Posted 11 Dec 2021 by Profile Joseph Stateson
Post:
[edit] I didn't wait long enough. Got additional tasks. Maybe this fixes the 91 second minimum delay problem!!! Will let it run for a while

Wow!! Could it be as simple as that? What I would like to see is a reported task and requested work during the same scheduler connection being filled.


Sorry, just got around to reading this.

No, that option did not cause new work units to be downloaded after a "finished" upload.
The work count starts at 300 for a single board and slowly drops to 0 and then there is that 91 second + up to 5 minute wait and occasionally even longer idle.

I think what happened was I requested an update and it just so happened that 91 seconds had elapsed since the last request so I actually got serviced.

On my "racks" with multiple GPUs an MW work unit finishes on the average of every 15 seconds so the 91 second requirement never happens. This test system had 1 board and all 4 tasks finish at almost exactly the same time and 2.5 minutes apart so there is a good chance the 91 seconds have elapsed. The net effect is I still have to use my boinc client "mod" to avoid the long idle time.
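
A toy model of the timing described above, as a sketch only. It assumes that every scheduler contact restarts the timer and that work is sent only when at least the minimum delay has passed since the previous contact; both are assumptions drawn from the "last request too recent" message, not from the server code. The function name count_granted is hypothetical.

```cpp
#include <vector>

// Count how many contacts would be granted work under the assumed rule.
// contact_times: times of scheduler contacts in seconds, in ascending order.
// min_delay: required gap since the previous contact (91 in the log).
int count_granted(const std::vector<double>& contact_times, double min_delay) {
    int granted = 0;
    bool have_previous = false;
    double previous = 0;
    for (double t : contact_times) {
        // The first contact has nothing before it; later ones need the gap.
        if (!have_previous || t - previous >= min_delay) ++granted;
        previous = t;  // every contact restarts the timer
        have_previous = true;
    }
    return granted;
}
```

With contacts every 15 seconds only the very first one qualifies, while contacts 150 seconds apart all qualify, which is the difference between the racks and the single-board test system.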
50) Message boards : Questions and problems : WCG: new systems download 100s of CPU work units, not possible to work all (Message 106383)
Posted 10 Dec 2021 by Profile Joseph Stateson
Post:
The option
<fetch_on_update>0</fetch_on_update>

is not working like I expected. I added it to cc_config.xml "options"
I think it works the way the developers intended:

<fetch_on_update>0|1</fetch_on_update>
When updating a project, request work even if not highest priority project.
Setting it to 1 adds extra fetching, but 0 doesn't block normal fetches. That quote comes from the User Manual.


IMHO the "Extra Fetch" was clearly added, as shown by the quote "Sending scheduler request: Requested by user"

I set the option to >1< and restarted the client and did an update after a few minutes and got essentially the same thing
hp3400

57	Milkyway@Home	12/10/2021 1:48:11 PM	update requested by user	
58	Milkyway@Home	12/10/2021 1:48:15 PM	Sending scheduler request: Requested by user.	
59	Milkyway@Home	12/10/2021 1:48:15 PM	Requesting new tasks for AMD/ATI GPU	
60	Milkyway@Home	12/10/2021 1:48:33 PM	Scheduler request completed: got 0 new tasks	
61	Milkyway@Home	12/10/2021 1:48:33 PM	Not sending work - last request too recent: 35 sec	
62	Milkyway@Home	12/10/2021 1:48:33 PM	Project requested delay of 91 seconds	


Unless I am missing something, there is no difference on either update I requested other than I did get additional tasks with the >0<

so with or w/o work is always requested.

[edit] I didn't wait long enough. Got additional tasks. Maybe this fixes the 91 second minimum delay problem!!! Will let it run for a while

hp3400

57	Milkyway@Home	12/10/2021 1:48:11 PM	update requested by user	
58	Milkyway@Home	12/10/2021 1:48:15 PM	Sending scheduler request: Requested by user.	
59	Milkyway@Home	12/10/2021 1:48:15 PM	Requesting new tasks for AMD/ATI GPU	
60	Milkyway@Home	12/10/2021 1:48:33 PM	Scheduler request completed: got 0 new tasks	
61	Milkyway@Home	12/10/2021 1:48:33 PM	Not sending work - last request too recent: 35 sec	
62	Milkyway@Home	12/10/2021 1:48:33 PM	Project requested delay of 91 seconds	
63	Milkyway@Home	12/10/2021 1:50:04 PM	Sending scheduler request: To fetch work.	
64	Milkyway@Home	12/10/2021 1:50:04 PM	Requesting new tasks for AMD/ATI GPU	
65	Milkyway@Home	12/10/2021 1:50:07 PM	Scheduler request completed: got 36 new tasks	
66	Milkyway@Home	12/10/2021 1:50:07 PM	Project requested delay of 91 seconds	
51) Message boards : Questions and problems : WCG: new systems download 100s of CPU work units, not possible to work all (Message 106381)
Posted 10 Dec 2021 by Profile Joseph Stateson
Post:
The option
<fetch_on_update>0</fetch_on_update>


is not working like I expected. I added it to cc_config.xml "options"

<cc_config>
    <options>
        <use_all_gpus>1</use_all_gpus>
        <allow_remote_gui_rpc>1</allow_remote_gui_rpc>
        <fetch_on_update>0</fetch_on_update>
    </options>
</cc_config>


and restarted the client, waited a while, then requested an update and got over 100 tasks

hp3400

68	Milkyway@Home	12/10/2021 1:20:31 PM	update requested by user	
69	Milkyway@Home	12/10/2021 1:20:34 PM	Sending scheduler request: Requested by user.	
70	Milkyway@Home	12/10/2021 1:20:34 PM	Requesting new tasks for AMD/ATI GPU	
71	Milkyway@Home	12/10/2021 1:20:36 PM	Scheduler request completed: got 119 new tasks	


However, the project Milkyway has a known problem: It does not download new work units until 91 seconds after all existing work units have finished so getting 100+ tasks was doubly unexpected!
52) Message boards : Questions and problems : WCG: new systems download 100s of CPU work units, not possible to work all (Message 106361)
Posted 9 Dec 2021 by Profile Joseph Stateson
Post:
IMHO that problem with gpugrid is going to be hard to debug. I would not expect a gpu task to be swapped out for another from the same project.

Thinking about that reminds me of a problem that showed up over at Milkyway earlier that I tried to help with.
An n-body (cpu needs 4 threads) was totally idle while four cpu tasks were running (system had only 4 cores).
My guess was the nbody was swapped out but would never get a time slice again because of all the smaller cpu tasks that finish at different times. All tasks were MW.
I suggest running either one or the other but not both from the same project.

In other news I was able to verify that a new install of BOINC needed "WUprop" so that adding Einstein or WCG would not cause 100s of downloads.

Einstein is my fallback project with share = 0 and Milkyway is my 100% as I can run 4 concurrent tasks.

I tried running two Einstein concurrently. Saw a tiny improvement but not enough to justify having to use a bigger fan to cool my rack of GPUs.

I recently joined that supersecret GPU club and have some ideas to work on. One is to try to arrange my "boinc mod" so that if gpugrid gets suspended the GPUs get assigned to the same slot they were using.
When running my rack of three gpugrid tasks: p102-100, gtx1070 and gtx1660ti all three can die when resumed from suspension as the CUDA compiler does not know the meta data is different and tries to pick up where it left off which causes a failure. The alternative is to run 3 instances of BOINC but that is a PITA.
53) Message boards : Questions and problems : WCG: new systems download 100s of CPU work units, not possible to work all (Message 106350)
Posted 8 Dec 2021 by Profile Joseph Stateson
Post:

Why is share being set to 100%? It is shown as 0 in the manager but 100 is listed in the log (Boinctasks log)

10 12/7/2021 2:24:43 PM All projects have zero resource share; setting to 100

Wild guess:
"If All projects are set to zero, then there's no point in trying to do anything. So obviously this person doesn't know what he's doing. I'll be helpful and set them to 100% for him."


What I find strange is that of all the settings the user can control, the parameter that determines a project "share" is controlled at the project account and not at the boinc manager.

My first thought was that setting all to 100% allowed bundled Charity Engine to start crunching on unsuspecting users who would never have a project account nor know the definition of "share". However, after reading what Richard wrote about "fix my last checkin" I decided that Hanlon's razor is applicable here.

I think there is a fix that does not involve adding an option to cc_config nor deleting that code. I run WUProp@home on systems that do not crunch CPU tasks so that I can observe the CPU temperature that boinctasks displays. I just need to install WUProp on all new builds. It always runs at 100% and only one app ever runs. That will fix the "set all projects to 100%" problem. It just needs to be the first project added on new builds.
54) Message boards : Questions and problems : several cores flash 99 degree temps at 20 percent utilization (Message 106346)
Posted 7 Dec 2021 by Profile Joseph Stateson
Post:
I was running 60 and 70 percent of cores and cpu times and temps of system was in 80's. It started to get hot with fans roaring and found that there were various cores that flashed hot....even running at 20 per cent cores and 20 percent cpu....cpu temp bounces from 50 to 98 degrees.....any ideas


Look and see if there is a gap between the CPU heat spreader and the bottom of the heat sink. That actually happened to me and a copper shim fixed the problem until I found the correct cooler. I had the exact symptoms you mention.
55) Message boards : Questions and problems : WCG: new systems download 100s of CPU work units, not possible to work all (Message 106344)
Posted 7 Dec 2021 by Profile Joseph Stateson
Post:
I've just been having the same conversation with another user by email. So this is conveniently on my clipboard:

https://drive.google.com/drive/folders/14C1sfF9wDbG1U0fPSwkXx3jq_M1HrxwB?usp=sharing

You'll need both a .ZIP handler and a 7-zip handler to unpack boinc.exe - so good they compressed it twice.


?????

This must be your test hander that showed the problem
I had to suspend Einstein as it was downloading days worth of data with share set to "0" which is not right. I have 151 einstein tasks waiting to run. I can actually do that as the 2 GPU are good and the deadline is not tomorrow.

Why is share being set to 100%? It is shown as 0 in the manager but 100 is listed in the log (Boinctasks log)


xps-435t

1			12/7/2021 2:24:42 PM	Starting BOINC client version 7.19.0 for windows_x86_64	
2			12/7/2021 2:24:42 PM	This a development version of BOINC and may not function properly	
3			12/7/2021 2:24:42 PM	Libraries: libcurl/7.80.0-DEV Schannel zlib/1.2.11	
4			12/7/2021 2:24:42 PM	Data directory: C:\ProgramData\BOINC	
5			12/7/2021 2:24:42 PM	Running under account josep	
6			12/7/2021 2:24:43 PM	CUDA: NVIDIA GPU 0: GeForce GTX 1060 3GB (driver version 456.71, CUDA version 11.1, compute capability 6.1, 3072MB, 2488MB available, 3936 GFLOPS peak)	
7			12/7/2021 2:24:43 PM	CUDA: NVIDIA GPU 1: GeForce GTX 1060 3GB (driver version 456.71, CUDA version 11.1, compute capability 6.1, 3072MB, 2488MB available, 3936 GFLOPS peak)	
8			12/7/2021 2:24:43 PM	OpenCL: NVIDIA GPU 0: GeForce GTX 1060 3GB (driver version 456.71, device version OpenCL 1.2 CUDA, 3072MB, 2488MB available, 3936 GFLOPS peak)	
9			12/7/2021 2:24:43 PM	OpenCL: NVIDIA GPU 1: GeForce GTX 1060 3GB (driver version 456.71, device version OpenCL 1.2 CUDA, 3072MB, 2488MB available, 3936 GFLOPS peak)	
10			12/7/2021 2:24:43 PM	All projects have zero resource share; setting to 100	
11			12/7/2021 2:24:43 PM	Version change (7.16.20 -> 7.19.0)	


why the following code in cs_statefile.cpp?

// if total resource share is zero, set all shares to 1
    //
    if (projects.size()) {
        unsigned int i;
        double x=0;
        for (i=0; i<projects.size(); i++) {
            x += projects[i]->resource_share;
        }
        if (!x) {
            msg_printf(NULL, MSG_INFO,
                "All projects have zero resource share; setting to 100"
            );
            for (i=0; i<projects.size(); i++) {
                projects[i]->resource_share = 100;
            }
        }
    }


Is this something that can be turned in as an issue?
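
A minimal sketch of what that loop does, reduced to a stand-alone function so the effect on a fresh install can be seen. The Project struct and the function name are hypothetical simplifications; only the "sum is zero, so set every share to 100" logic is taken from the quoted code.

```cpp
#include <vector>

// Simplified stand-in for a project record; the real class has many more fields.
struct Project {
    double resource_share;
};

// Mirrors the quoted loop: if all shares sum to zero, set each share to 100.
// Returns true when the shares were overwritten.
bool reset_zero_shares(std::vector<Project>& projects) {
    if (projects.empty()) return false;
    double total = 0;
    for (const Project& p : projects) total += p.resource_share;
    if (total != 0) return false;  // at least one non-zero share: leave alone
    for (Project& p : projects) p.resource_share = 100;
    return true;
}
```

A single attached project with share 0 gets bumped to 100, while the same project alongside another with a non-zero share keeps its 0.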
56) Message boards : Questions and problems : WCG: new systems download 100s of CPU work units, not possible to work all (Message 106341)
Posted 7 Dec 2021 by Profile Joseph Stateson
Post:
I deliberately put one machine into the state where it was fetching the same quantum of new work every 30 seconds, and getting it, every time - so it was disregarding the new work when calculating what to fetch next time. Is that how your excess tasks arrive?

I downloaded and installed the CI test build of #4592: that cured it.


Doing something wrong: got the code that did not have the changes.

Clicked on that 4592 issue
Clicked on "dpa_max_concurrent"
observed the 6 day old change at client so I think I am looking at the mod you tested
selected "CODE" (the green box) and clicked on "Open with GitHub desktop"
Put the download in my project folder using my GitHub desktop
built using VS2019 release x64 no errors under win11
Looked at work_fetch.cpp and none of the changes were there

went back and re-looked at the green box and it is downloading from github.com/BOINC/boinc.git which I suspect is not what I wanted. I am not up to speed on using github for anything more than sharing my code.

Wanted to test that new boinc fix on my system as I want to enable WCG and do not want another 500+ downloads.

I built 3 systems in the last two weeks, one for a nephew and 2 for one of my kids. I forgot about the problem on the first system and was too slow getting around to stopping the WCG downloads on the next two.

57) Message boards : Questions and problems : WCG: new systems download 100s of CPU work units, not possible to work all (Message 106334)
Posted 7 Dec 2021 by Profile Joseph Stateson
Post:
Still having problems and I tried 7.16.20. I tried to make sure the share = 0 was recognized and configured only for Einstein instead of WCG

Rebuild of old system XPS-435t with three gtx-1060

Installed win10x64 21h2
Installed all Visual C Runtime (all versions)
Installed 7.16.20 and set advanced view
Added Einstein (my project default is GPU and share = 0)
Saw 100% appear under share and set "no new tasks" as soon as that option was enabled.
After a minute or two I saw a single task executing and that share had gone to 0.
I looked at the event log and the two GPUs that had only 3gb of memory were being ignored. I edited cc_config so that all 3 GPUs work and rebooted

Next time I looked there were 3 tasks executing but there were 12 GPU tasks waiting to execute. Should have been none waiting to execute.
The CPU has 12 threads. I checked but the 12 waiting tasks were all GPU tasks, none were CPU.
Just checked again and only 11 are left. Eventually will get down to 0 and then will be getting 1 for each one I turn in which is correct for share=0

Two days ago I aborted over 700 WCG tasks (total of 1200 in last 2 weeks) but it was my old 7.16.3 and so I decided to try 7.16.20 on a rebuild of an old system.
58) Message boards : Questions and problems : WCG: new systems download 100s of CPU work units, not possible to work all (Message 106245)
Posted 30 Nov 2021 by Profile Joseph Stateson
Post:
I recently assembled a pair of windows systems with WCG pre-configured as "0" share. Normally only 1 wu per cpu gets downloaded.

Both systems have the older 7.16.3 boinc. Would the newer 7.20 handle this initialization correctly? I am guessing the project sees 12 threads and downloads a boatload of tasks and never notices that the share is supposed to be 0 till after the download.

I ended up aborting 400+ files: about 58 days of work where the deadline was only about 3 days in the first place.
59) Message boards : Questions and problems : The BOINC client has exited unexpectedly 3 times within the last 3 minutes (Message 106243)
Posted 30 Nov 2021 by Profile Joseph Stateson
Post:
One of the problems I have with Linux is looking for informative error messages. About all I can do is grep for "boinc" in the var/log folder and then grep for "error". Usually the files are so large I need to delete them and reboot to be able to spot a problem the next time it happens.

Not anywhere as clean as looking in the windows event viewer and spotting "memory resource" warnings and finding I had too many apps running to leave suspended apps in ram.

AFAICT there are no linux apps that break logs into "apps" and "system" and then organize the results into critical, error, warning, and "info" like windows does.

I run about 10 systems and monitor the boinc event log using boinctasks. There are so many meaningless messages from all systems that I did a mod of boinc to filter out the worst ones. I have no need to be told 500 times that I can get CPU tasks but that I chose not to. Those push real error messages off the bottom of the log file before I can read them.
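
A minimal sketch of that kind of filter, assuming simple substring matching. This is not the actual mod; the function name and the idea of a pattern list are hypothetical, and the patterns themselves would have to be chosen from whatever messages are flooding the log.

```cpp
#include <string>
#include <vector>

// Hypothetical filter: drop every line that contains any of the noise
// patterns and keep the rest in their original order.
std::vector<std::string> drop_noise(const std::vector<std::string>& lines,
                                    const std::vector<std::string>& noise_patterns) {
    std::vector<std::string> kept;
    for (const std::string& line : lines) {
        bool is_noise = false;
        for (const std::string& pattern : noise_patterns) {
            if (line.find(pattern) != std::string::npos) {
                is_noise = true;
                break;
            }
        }
        if (!is_noise) kept.push_back(line);
    }
    return kept;
}
```

Matching is case-sensitive and an empty pattern list keeps everything, so the filter only removes what it has been told about.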
60) Message boards : Questions and problems : Can not get Rosetta Python (Vbox) tasks (Message 106109)
Posted 15 Nov 2021 by Profile Joseph Stateson
Post:

There is a big difference between a project disabling a feature and a project not enabling a feature


Just like the difference between a glass half empty and same one half full. You don't have all the resources you could have.



Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.