BOINC unable to honor project shares (at all, not only in short run)

Raistmer

Joined: 9 Apr 06
Posts: 302
Message 103633 - Posted: 21 Mar 2021, 20:18:29 UTC
Last modified: 21 Mar 2021, 20:34:29 UTC

Example:

Project shares:

Einstein@home: 1
MilkyWay: 10000

What could be expected (I started this discussion in a PM exchange with Richard, who noted that in BOINC's current paradigm project share should be considered not a processing-time share but rather a credit share; but NEITHER metric can be honored!):
MW runs on the GPU almost always, and E@h receives tasks very rarely, only when MW has prolonged "no work" periods.

What I see in reality:
I set up the "initial state" for this experiment to be as easy to follow as possible:
No E@h GPU tasks in queue; queue filled with MW GPU tasks.
[work_fetch] target work buffer: 129600.00 + 8640.00 sec

After a week:
15 E@h GPU tasks in queue.
RAC (note that the E@h host RAC includes CPU work, so the situation is even worse than pictured):

[Image: RAC of the two projects on this host; not preserved]

Absolutely not 1:10,000. And it will not be in the near future (because those 15 E@h tasks in the cache are worth >~150 MW tasks, and they have to be processed once downloaded!).
And I claim that the requested share never will be, and cannot be, honored in the current work fetch design.

When work is needed, BOINC asks MW for work first (as it should). But MW is not as reliable at providing work as E@h is.
So from time to time no work is given. Then BOINC immediately (and that's NOT OK!) asks E@h for work and gets it.
The situation repeats when the cache runs low again.

Hence the only possible share is the fraction of the time MW has work ready to send. It has absolutely no connection with either the processing-time share or the credit/RAC share.
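
To illustrate the point with a deliberately crude model (a sketch, not BOINC's scheduler): assume every fetch attempt brings the same amount of work, the client asks projects in order of resource share, and E@h always has work. The function name and the availability pattern are made up for the example.

```python
def fetched_fractions(shares, availability):
    """Fraction of fetch attempts served by each project.

    shares: project name -> resource share (only used to order the requests).
    availability: project name -> list of booleans, one per fetch attempt,
        telling whether that project had work ready at that moment.
    """
    # Ask the project with the largest share first (simplifying assumption).
    order = sorted(shares, key=shares.get, reverse=True)
    attempts = len(next(iter(availability.values())))
    served = {name: 0 for name in shares}
    for i in range(attempts):
        for name in order:
            if availability[name][i]:
                served[name] += 1
                break  # the first project with work fills the request
    total = sum(served.values())
    return {name: (served[name] / total if total else 0.0) for name in shares}


# Hypothetical pattern: MW has work on 3 of 4 attempts, E@h always has work.
availability = {
    "Milkyway@Home": [True, True, False, True],
    "Einstein@Home": [True, True, True, True],
}
print(fetched_fractions({"Milkyway@Home": 10000, "Einstein@Home": 1}, availability))
print(fetched_fractions({"Milkyway@Home": 2, "Einstein@Home": 1}, availability))
```

Both calls give the same 75/25 split in this model: the size of the share changes nothing, only MW's availability does.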

What should be changed to fix this situation: the work buffer has 2 parts, "main" and "additional".
Currently BOINC asks for work only when "main" is in shortage, and then asks for the shortage + "additional". That's the wrong behavior (!).
It should ask when some % of "additional" is in shortage, and ask ONLY the high-priority project until there is no shortage in "main". Then it should ask all projects (in priority order, of course).

Such a sequence would give the high-priority project a chance to provide work according to its share. In the current state, all that "sophisticated" work fetch simulation just can't provide adequate results.
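
One possible reading of this proposal, sketched in Python. The function name, the trigger_fraction parameter and the exact reading of the rule are assumptions: here the top-priority project alone is asked while only "additional" is short, and all projects are asked once "main" itself runs short. This is not BOINC code.

```python
def projects_to_ask(buffered_secs, main_secs, extra_secs, projects_by_priority,
                    trigger_fraction=0.5):
    """Return the projects to ask for work, best priority first ([] = no request)."""
    if buffered_secs < main_secs:
        # "main" itself is short: fall back to all projects, in priority order.
        return list(projects_by_priority)
    extra_shortfall = (main_secs + extra_secs) - buffered_secs
    if extra_shortfall >= trigger_fraction * extra_secs:
        # Only "additional" is short: the top-priority project gets the first chance.
        return list(projects_by_priority[:1])
    return []


projects = ["Milkyway@Home", "Einstein@Home"]  # MW has the higher share
main, extra = 129600.0, 8640.0  # buffer sizes from the log line in this post
print(projects_to_ask(138240.0, main, extra, projects))  # buffer full
print(projects_to_ask(132000.0, main, extra, projects))  # part of "additional" used up
print(projects_to_ask(100000.0, main, extra, projects))  # "main" is short
```

In this sketch MW gets several exclusive chances to refill the cache while it drains through the "additional" part, and E@h is contacted only when "main" is really at risk.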
ID: 103633 · Report as offensive
Dave
Help desk expert

Joined: 28 Jun 10
Posts: 2515
United Kingdom
Message 103644 - Posted: 22 Mar 2021, 7:19:47 UTC - in response to Message 103633.  

I would certainly agree that there is no way the current rules will allow the work share you are asking for when the highest priority project often has no work available. My main project never has work for GPU so I don't worry about that too much. Because CPDN can have long periods with no work available as opposed to sporadic short periods, I set cache to minimum. If I get a period with no work, once CPDN has work available again it then more than fills the cache because of the length of the tasks. I don't see an easy way to cope with a project that may have tasks available every day for example but if they all come out in a brief period, you might never pick them up.

The only way I can see to deal with this would be for BOINC to add a feature that lets you set a cache size for individual projects. You can always put in a request for that on GitHub, but I wouldn't hold my breath waiting!
ID: 103644 · Report as offensive
Ian&Steve C.

Joined: 24 Dec 19
Posts: 228
United States
Message 103648 - Posted: 22 Mar 2021, 12:30:45 UTC - in response to Message 103633.  

why not set Einstein to 0 instead of 1. then it will act as a true backup and only request work when MW is out of work.
ID: 103648 · Report as offensive
Dave
Help desk expert

Joined: 28 Jun 10
Posts: 2515
United Kingdom
Message 103649 - Posted: 22 Mar 2021, 12:39:51 UTC - in response to Message 103648.  

why not set Einstein to 0 instead of 1. then it will act as a true backup and only request work when MW is out of work.


Doh! Forgot about that one.
ID: 103649 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 103650 - Posted: 22 Mar 2021, 13:06:25 UTC - in response to Message 103648.  

He said in the previous thread that he wanted Einstein to run on one of the {CPU|NV} devices - I forget which - but not the other. That rules out RS zero.

The previous thread also wrote:
1 GPU MW task that doesn't need full CPU core!
What makes you think that MW has cracked the busy-wait OpenCL problem on NVidia?
ID: 103650 · Report as offensive
Raistmer

Joined: 9 Apr 06
Posts: 302
Message 103652 - Posted: 22 Mar 2021, 16:15:35 UTC - in response to Message 103648.  

why not set Einstein to 0 instead of 1. then it will act as a true backup and only request work when MW is out of work.

That way I would have no CPU cache at all, because E@h is the only project doing CPU work on that host. MW is configured as GPU-only (the old NV GPU barely copes with E@h tasks, which cause a noticeable host slowdown; MW tasks have no such effect).
ID: 103652 · Report as offensive
Raistmer

Joined: 9 Apr 06
Posts: 302
Message 103653 - Posted: 22 Mar 2021, 16:30:27 UTC - in response to Message 103650.  
Last modified: 22 Mar 2021, 16:44:43 UTC


The previous thread also wrote:
1 GPU MW task that doesn't need full CPU core!
What makes you think that MW has cracked the busy-wait OpenCL problem on NVidia?

Hm, my SETI NV OCL build didn't need a full CPU either, as far as I can recall. It's a question of kernel size and the parameters of the call.
Moreover, the E@h GW NV build doesn't require a busy loop either [according to its 0.9CPU config]. It is hardly a CUDA one; E@h strongly prefers OpenCL...
Only the FGRP search currently requires a full CPU core.
ID: 103653 · Report as offensive
Raistmer

Joined: 9 Apr 06
Posts: 302
Message 103655 - Posted: 22 Mar 2021, 16:41:51 UTC
Last modified: 22 Mar 2021, 16:47:25 UTC

BTW, there is a peculiarity in the MW server configuration that adds complexity. As soon as I suspended all E@h NV tasks, the cache was immediately filled with MW tasks. Coincidence?
At least it means that MW has work quite often, if only BOINC would ask for it....

3/22/2021 19:38:54 PM | Milkyway@Home | checking NVIDIA GPU
3/22/2021 19:38:54 PM | Milkyway@Home | [work_fetch] set_request() for NVIDIA GPU: ninst 1 nused_total 0.00 nidle_now 1.00 fetch share 1.00 req_inst 1.00 req_secs 138240.00
3/22/2021 19:38:54 PM | Milkyway@Home | NVIDIA GPU set_request: 138240.000000
3/22/2021 19:38:54 PM | Milkyway@Home | [work_fetch] request: CPU (0.00 sec, 0.00 inst) NVIDIA GPU (138240.00 sec, 1.00 inst)
3/22/2021 19:38:54 PM | Milkyway@Home | Sending scheduler request: To fetch work.
3/22/2021 19:38:54 PM | Milkyway@Home | Requesting new tasks for NVIDIA GPU
3/22/2021 19:38:55 PM | | [work_fetch] Request work fetch: application exited
3/22/2021 19:38:57 PM | Milkyway@Home | Scheduler request completed: got 206 new tasks
3/22/2021 19:38:57 PM | Milkyway@Home | Project requested delay of 91 seconds
3/22/2021 19:38:57 PM | | [work_fetch] Request work fetch: RPC complete
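
For reference, the request size in this log matches the buffer settings quoted in the first post. A quick check in Python, assuming the two numbers in "target work buffer: 129600.00 + 8640.00 sec" are the "main" and "additional" parts of the buffer:

```python
SECONDS_PER_DAY = 86400

main_secs = 129600.00        # first number in "target work buffer"
additional_secs = 8640.00    # second number in "target work buffer"

print(main_secs / SECONDS_PER_DAY)        # 1.5 days
print(additional_secs / SECONDS_PER_DAY)  # 0.1 days
print(main_secs + additional_secs)        # 138240.0, the req_secs value in the log
```

So with the GPU reported idle (nidle_now 1.00), the client apparently asked MW for the whole buffer in a single request.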
ID: 103655 · Report as offensive
Ian&Steve C.

Joined: 24 Dec 19
Posts: 228
United States
Message 103658 - Posted: 22 Mar 2021, 17:36:38 UTC - in response to Message 103653.  


The previous thread also wrote:
1 GPU MW task that doesn't need full CPU core!
What makes you think that MW has cracked the busy-wait OpenCL problem on NVidia?

Hm, my SETI NV OCL build didn't need a full CPU either, as far as I can recall. It's a question of kernel size and the parameters of the call.
Moreover, the E@h GW NV build doesn't require a busy loop either [according to its 0.9CPU config]. It is hardly a CUDA one; E@h strongly prefers OpenCL...
Only the FGRP search currently requires a full CPU core.


nvidia GPU app for GW does indeed use a full CPU core. in most cases it actually requires MORE than 1 full core.

pull up top or htop while it's running and you'll see more than 100% of 1 core being used, i've seen up to 150%. since the GPU job is offloading some work to the CPU.
ID: 103658 · Report as offensive
Ian&Steve C.

Joined: 24 Dec 19
Posts: 228
United States
Message 103659 - Posted: 22 Mar 2021, 17:38:36 UTC - in response to Message 103650.  

He said in the previous thread that he wanted Einstein to run on one of the {CPU|NV} devices - I forget which - but not the other. That rules out RS zero.

The previous thread also wrote:
1 GPU MW task that doesn't need full CPU core!
What makes you think that MW has cracked the busy-wait OpenCL problem on NVidia?


I guess I don't understand the intended behavior or what's trying to be accomplished. if he doesn't want one device to run a certain project, you can just use the <exclude_gpu> flag in cc_config, and either exclude by the project as a whole, or exclude by the application.
ID: 103659 · Report as offensive
Raistmer

Joined: 9 Apr 06
Posts: 302
Message 103660 - Posted: 22 Mar 2021, 17:44:46 UTC - in response to Message 103658.  


The previous thread also wrote:
1 GPU MW task that doesn't need full CPU core!
What makes you think that MW has cracked the busy-wait OpenCL problem on NVidia?

Hm, my SETI NV OCL build didn't need a full CPU either, as far as I can recall. It's a question of kernel size and the parameters of the call.
Moreover, the E@h GW NV build doesn't require a busy loop either [according to its 0.9CPU config]. It is hardly a CUDA one; E@h strongly prefers OpenCL...
Only the FGRP search currently requires a full CPU core.


nvidia GPU app for GW does indeed use a full CPU core. in most cases it actually requires MORE than 1 full core.

pull up top or htop while it's running and you'll see more than 100% of 1 core being used, i've seen up to 150%. since the GPU job is offloading some work to the CPU.


Well, I haven't studied how the GW NV app works myself so far (it runs on another host; my "on desk" one can't process it). So I relied exclusively on its configuration, 0.9CPU+1GPU.
Maybe a wrong config from the server deceived me; I will not argue here.

Nevertheless, the main idea remains: it's possible to work w/o a busy-wait loop. Whether it brings a benefit or not strongly depends on the size of the kernels used inside the app.
ID: 103660 · Report as offensive
Raistmer

Joined: 9 Apr 06
Posts: 302
Message 103661 - Posted: 22 Mar 2021, 17:58:29 UTC - in response to Message 103659.  
Last modified: 22 Mar 2021, 18:05:49 UTC

He said in the previous thread that he wanted Einstein to run on one of the {CPU|NV} devices - I forget which - but not the other. That rules out RS zero.

The previous thread also wrote:
1 GPU MW task that doesn't need full CPU core!
What makes you think that MW has cracked the busy-wait OpenCL problem on NVidia?


I guess I don't understand the intended behavior or what's trying to be accomplished. if he doesn't want one device to run a certain project, you can just use the <exclude_gpu> flag in cc_config, and either exclude by the project as a whole, or exclude by the application.


I'll try to describe "what & why":
0) Both CPU and GPU: because my router hangs from time to time, I prefer to have some local cache of work for both CPU and GPU.
1) CPU part.
I prefer (from personal scientific inclination) to participate mostly in the GW search (I also consider it more resource-demanding, so not all hosts can do it and it needs help from those who can), but the host in question can't deal with 4 CPU GW tasks at once (low memory issue). Hence I attempt to limit the number of GW tasks in flight simultaneously without disabling them completely.
So I attempted to use the max_concurrent tag in app_config for the GW app. Result: failure. BOINC implements this mode of operation inadequately [that's a topic for another thread, actually].

2) GPU part.
GW NV is unsupported on my GPU, and FGRP NV has strong effects on sound quality and host responsiveness. So I crunch an "easier" app on the GPU: MW NV.
Because MW has no work from time to time, I attempt to configure E@h as a "backup" project for the GPU part. But setting it as backup in the BOINC sense (0 share) would contradict 0).
Result: failure. The cache slowly (not too slowly, actually) fills with E@h NV tasks, so eventually only they are crunched on the GPU, disturbing my own work more often than I'm ready to tolerate.
ID: 103661 · Report as offensive
Dave
Help desk expert

Joined: 28 Jun 10
Posts: 2515
United Kingdom
Message 103662 - Posted: 22 Mar 2021, 18:05:31 UTC

Thanks, I understand a bit more now. (Enough to know that I can contribute little if anything of value to the discussion!) With that in mind, I shall bow out.
ID: 103662 · Report as offensive
Ian&Steve C.

Joined: 24 Dec 19
Posts: 228
United States
Message 103663 - Posted: 22 Mar 2021, 18:06:25 UTC - in response to Message 103660.  

"0.9CPU"

as far as I am aware, this value is only used for BOINC bookkeeping on used resources. this value cannot influence/limit/dictate what is ACTUALLY being used. you could change this value to 0.01, and the actual use for this task will remain unchanged, however BOINC will think you have more free CPU resources than you actually do, leading to the possibility to over-commit CPU resources depending on other CPU projects and CPU use settings you have.
ID: 103663 · Report as offensive
Raistmer

Joined: 9 Apr 06
Posts: 302
Message 103664 - Posted: 22 Mar 2021, 18:18:46 UTC - in response to Message 103663.  
Last modified: 22 Mar 2021, 18:21:38 UTC

"0.9CPU"

as far as I am aware, this value is only used for BOINC bookkeeping on used resources. this value cannot influence/limit/dictate what is ACTUALLY being used. you could change this value to 0.01, and the actual use for this task will remain unchanged, however BOINC will think you have more free CPU resources than you actually do, leading to the possibility to over-commit CPU resources depending on other CPU projects and CPU use settings you have.


Fully agree. But if an app can work w/o a reserved CPU core (by "can work" I mean not the technical ability, which of course always exists as long as we are not talking about real-time processing, but rather "work w/o a big noticeable slowdown"), the project usually configures it as demanding less than 1 CPU. That allows BOINC to load all CPU cores fully with CPU apps.
Compare the GW app (0.9CPU (maybe incorrectly) + 1GPU), FGRP (1CPU+1GPU) and the MW app (0.83CPU+1GPU, correctly: no noticeable slowdown while all cores are busy, ~99% GPU load throughout task processing).
Hence, looking at the 0.9CPU setting, I came to a (perhaps wrong) conclusion.
ID: 103664 · Report as offensive
Ian&Steve C.

Joined: 24 Dec 19
Posts: 228
United States
Message 103665 - Posted: 22 Mar 2021, 18:19:02 UTC - in response to Message 103661.  



I'll try to describe "what & why":
0) Both CPU and GPU: because my router hangs from time to time, I prefer to have some local cache of work for both CPU and GPU.
1) CPU part.
I prefer (from personal scientific inclination) to participate mostly in the GW search (I also consider it more resource-demanding, so not all hosts can do it and it needs help from those who can), but the host in question can't deal with 4 CPU GW tasks at once (low memory issue). Hence I attempt to limit the number of GW tasks in flight simultaneously without disabling them completely.
So I attempted to use the max_concurrent tag in app_config for the GW app. Result: failure. BOINC implements this mode of operation inadequately [that's a topic for another thread, actually].

2) GPU part.
GW NV is unsupported on my GPU, and FGRP NV has strong effects on sound quality and host responsiveness. So I crunch an "easier" app on the GPU: MW NV.
Because MW has no work from time to time, I attempt to configure E@h as a "backup" project for the GPU part. But setting it as backup in the BOINC sense (0 share) would contradict 0).
Result: failure. The cache slowly (not too slowly, actually) fills with E@h NV tasks, so eventually only they are crunched on the GPU, disturbing my own work more often than I'm ready to tolerate.


in this situation i suggest increasing your cache size of MW to be large enough to hold you over on the downtimes, and stopping processing of Einstein on the GPU completely, you can do this by simply unchecking the GPU work in your project preferences, or with an exclude line in cc_config. that would keep GPU work to MW only, and CPU work to Einstein only. would that solve the problem?
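
A minimal sketch of the exclude line mentioned above, built with Python's standard library so the nesting is explicit. The project URL below is a placeholder (an assumption); use the exact URL the client shows for the project. The type element is optional and is included here only to limit the exclusion to NVIDIA GPUs.

```python
import xml.etree.ElementTree as ET


def build_exclude_gpu_config(project_url, gpu_type=None, app_name=None):
    """Return cc_config.xml text with one <exclude_gpu> entry."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    exclude = ET.SubElement(options, "exclude_gpu")
    ET.SubElement(exclude, "url").text = project_url
    if gpu_type is not None:
        ET.SubElement(exclude, "type").text = gpu_type  # e.g. "NVIDIA"
    if app_name is not None:
        ET.SubElement(exclude, "app").text = app_name   # omit to exclude the whole project
    ET.indent(root)  # pretty-print; needs Python 3.9+
    return ET.tostring(root, encoding="unicode")


# Placeholder URL, not necessarily the real project URL.
print(build_exclude_gpu_config("https://project.example/", gpu_type="NVIDIA"))
```

The printed text would be saved as cc_config.xml in the BOINC data directory.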
ID: 103665 · Report as offensive
Raistmer

Joined: 9 Apr 06
Posts: 302
Message 103666 - Posted: 22 Mar 2021, 18:25:02 UTC - in response to Message 103665.  

would that solve the problem?

If MW can provide as many tasks as needed (a task takes 10 minutes; if MW has a hardwired limit on tasks in flight per host, this will not work), then yes.

I'll try such a config and see whether it solves the GPU part of the problem.
ID: 103666 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 103668 - Posted: 22 Mar 2021, 19:21:22 UTC - in response to Message 103663.  

"0.9CPU"

as far as I am aware, this value is only used for BOINC bookkeeping on used resources. this value cannot influence/limit/dictate what is ACTUALLY being used. you could change this value to 0.01, and the actual use for this task will remain unchanged, however BOINC will think you have more free CPU resources than you actually do, leading to the possibility to over-commit CPU resources depending on other CPU projects and CPU use settings you have.
Exactly. As I tried to express in https://github.com/BOINC/boinc/issues/2949

As I said in that report, the code in sched_customize.cpp (line 505 ff.) "still relies on assumptions about the relative speeds of CPU and GPU devices, and the proportion of the work to be done on each device. As speeds have diverged, these assumptions have become less and less realistic."
ID: 103668 · Report as offensive
Ian&Steve C.

Joined: 24 Dec 19
Posts: 228
United States
Message 103669 - Posted: 22 Mar 2021, 19:55:44 UTC - in response to Message 103668.  
Last modified: 22 Mar 2021, 19:56:08 UTC

"0.9CPU"

as far as I am aware, this value is only used for BOINC bookkeeping on used resources. this value cannot influence/limit/dictate what is ACTUALLY being used. you could change this value to 0.01, and the actual use for this task will remain unchanged, however BOINC will think you have more free CPU resources than you actually do, leading to the possibility to over-commit CPU resources depending on other CPU projects and CPU use settings you have.
Exactly. As I tried to express in https://github.com/BOINC/boinc/issues/2949

As I said in that report, the code in sched_customize.cpp (line 505 ff.) "still relies on assumptions about the relative speeds of CPU and GPU devices, and the proportion of the work to be done on each device. As speeds have diverged, these assumptions have become less and less realistic."


and from what some have indicated elsewhere (maybe you), 0.9 realistically = 0 due to BOINC logic.

0.9 = 0 (not 0.9)
0.9+0.9 = 1 (not 1.8)
0.9+0.9+0.9 = 2 (not 2.7)

and so on. this is why I just force 1 CPU + 1 GPU with the app_config so everything is counted properly, and free CPU resources are properly accounted.
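
A sketch of such an app_config.xml, generated with Python's standard library. The app name is a hypothetical placeholder (the real one has to be looked up for the project in question); the element names follow the documented app_config format.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, cpu_usage=1.0, gpu_usage=1.0):
    """Return app_config.xml text reserving cpu_usage CPUs and gpu_usage GPUs per task."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu_versions = ET.SubElement(app, "gpu_versions")
    ET.SubElement(gpu_versions, "gpu_usage").text = str(gpu_usage)
    ET.SubElement(gpu_versions, "cpu_usage").text = str(cpu_usage)
    ET.indent(root)  # pretty-print; needs Python 3.9+
    return ET.tostring(root, encoding="unicode")


# "example_gpu_app" is a made-up name used only for illustration.
print(build_app_config("example_gpu_app"))
```

The file belongs in that project's own directory under the BOINC data directory, not next to cc_config.xml.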
ID: 103669 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 103670 - Posted: 22 Mar 2021, 20:01:33 UTC - in response to Message 103669.  

0.9 = 0 (not 0.9)
0.9+0.9 = 1 (not 1.8)
0.9+0.9+0.9 = 2 (not 2.7)
That's exactly correct. BOINC 'overcommits' the CPU by 'not more than one complete core'.
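
The three lines above are consistent with simply dropping the fractional part of the summed CPU reservations. A small Python check of that arithmetic (an illustration of the numbers in this thread, not a copy of BOINC's scheduling code; the function name is made up):

```python
import math


def whole_cores_counted(cpu_fraction, running_tasks):
    """Whole CPU cores accounted for when the summed fractions are truncated."""
    return math.floor(cpu_fraction * running_tasks)


for n in (1, 2, 3):
    print(n, "task(s) at 0.9 CPU ->", whole_cores_counted(0.9, n), "core(s)")
```

With cpu_fraction set to 1.0 the same function returns 1, 2 and 3, which is the effect of forcing 1 CPU + 1 GPU per task.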
ID: 103670 · Report as offensive

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.