Problem with "max concurrent" in app config.

Peter Hucker
Joined: 6 Oct 06
Posts: 1144
United Kingdom
Message 103425 - Posted: 6 Mar 2021, 13:37:13 UTC

On my 24 core Ryzen, I've put this in app config:

<app_config>
   <app>
      <name>kryptos-plato</name>
      <max_concurrent>4</max_concurrent>
   </app>
</app_config>

This is because they're virtualbox programs that make the computer sluggish if I do too many at once, but that reason is irrelevant to this discussion.

I have set a buffer of 0+3 hours, and everything other than the above app sticks to that. It waits until it's about to run out, then downloads 3 hours. But the above is accumulating a huge number of tasks, because Boinc doesn't seem to be accounting for the app config setting when downloading them. I think it's assuming I'm going to run 24 at once. I've currently got a queue of 7 days 1 hour, which would take 7 hours on all 24 cores, so near enough. But on only 4 cores that's 42 hours, nowhere near the 3 I asked for.
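For reference, the arithmetic above can be checked with a short Python sketch. It assumes the 7 days 1 hour figure is total CPU time summed over the queued tasks (the post implies this but doesn't say so outright), and the variable names are illustrative only.

```python
from datetime import timedelta

# Figures taken from the post; the names are illustrative.
queued_cpu_time = timedelta(days=7, hours=1)  # assumed: CPU time summed over all queued tasks
total_cores = 24
max_concurrent = 4                            # the <max_concurrent> value in app_config
buffer_target = timedelta(hours=3)            # the "0+3 hours" buffer setting


def wall_clock_time(queue, concurrent_tasks):
    """Time needed to drain the queue when this many tasks run at once."""
    return queue / concurrent_tasks


on_all_cores = wall_clock_time(queued_cpu_time, total_cores)
on_four_cores = wall_clock_time(queued_cpu_time, max_concurrent)

print("all 24 cores:", on_all_cores)
print("4 at a time: ", on_four_cores)
print("over the 3 hour buffer:", on_four_cores > buffer_target)
```

Run as-is, this reproduces the roughly 7 hour and 42 hour figures quoted in the post.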

robsmith
Volunteer tester
Help desk expert
Joined: 25 May 09
Posts: 976
United Kingdom
Message 103427 - Posted: 6 Mar 2021, 15:23:35 UTC

I assume you are saying all other projects are complying with the restrictions you have set, that is x-concurrent tasks and a buffer of no more than 3 hours of work.
That being the case, this would appear to be a project that has its end of the process wrongly configured and is just sending out large amounts of data when only a small amount is requested. A quick look at the project's own forum indicates that other people are suffering similar (though not obviously identical) problems to the one you are seeing. Indeed, I see you have been quite active in one of the threads, so it may well be better to report it over there.
If this is affecting more projects than just kryptos@home, then it could be a BOINC problem.

Peter Hucker
Joined: 6 Oct 06
Posts: 1144
United Kingdom
Message 103428 - Posted: 6 Mar 2021, 16:57:03 UTC - in response to Message 103427.  
Last modified: 6 Mar 2021, 17:00:42 UTC

I assume you are saying all other projects are complying with the restrictions you have set, that is x-concurrent tasks and a buffer of no more than 3 hours of work.
That being the case, this would appear to be a project that has its end of the process wrongly configured and is just sending out large amounts of data when only a small amount is requested. A quick look at the project's own forum indicates that other people are suffering similar (though not obviously identical) problems to the one you are seeing. Indeed, I see you have been quite active in one of the threads, so it may well be better to report it over there.
If this is affecting more projects than just kryptos@home, then it could be a BOINC problem.
I've only set restrictions on that project, since it's the only one slowing my computer down with multiple virtualbox programs.

In which thread over there did you see a similar problem? All I've seen is complaints about the VB program crashing or being slow.

I'm not getting huge amounts of work at once, it's in bits. My Boinc client is asking for more work from Kryptos when there's loads left, as though it's getting confused because it needs more work (because the other 20 cores have one idle), so it gets Kryptos work, then realises its mistake and gets work from another project. I see it happening every time the buffer is running out for other non-VB projects.

It has to be a problem on the client end, because a work request is sent to Kryptos when there's days of Kryptos in the buffer, immediately followed by a work request to another project when it fails to start another Kryptos task. It looks very like it's not checking its own app config file before deciding where to request work from, but then trying another project when that download didn't fill the cores.

To see who's right, I'll set a max concurrent limit for another project to see what happens.

robsmith
Volunteer tester
Help desk expert
Joined: 25 May 09
Posts: 976
United Kingdom
Message 103429 - Posted: 6 Mar 2021, 17:08:26 UTC - in response to Message 103428.  

Rather than trying to describe what is going wrong, post your logs. Richard may be along soon to ask you to set some of the log debug flags so we can see what is being asked for.
One thing (or is it two things?): how long does a kryptos task take to run (really run), and what is the initial estimate of run time for tasks? This may help people understand where and what is going wrong. (Registration for new users is currently down, so it is hard to get any idea of these things.) In the past (and even today) there have been projects for which the runtime guess was a gross underestimate, so the servers would just keep pushing out excess work based on invalid runtime estimates.

Peter Hucker
Joined: 6 Oct 06
Posts: 1144
United Kingdom
Message 103430 - Posted: 6 Mar 2021, 17:12:32 UTC - in response to Message 103429.  
Last modified: 6 Mar 2021, 17:13:07 UTC

Rather than trying to describe what is going wrong, post your logs. Richard may be along soon to ask you to set some of the log debug flags so we can see what is being asked for.
One thing (or is it two things?): how long does a kryptos task take to run (really run), and what is the initial estimate of run time for tasks? This may help people understand where and what is going wrong. (Registration for new users is currently down, so it is hard to get any idea of these things.) In the past (and even today) there have been projects for which the runtime guess was a gross underestimate, so the servers would just keep pushing out excess work based on invalid runtime estimates.
On this machine they take an average of 25 minutes, and that's what it correctly has in the estimate. They are variable though and can finish in anywhere from 5 minutes to 2 hours.

What makes me think it's a client and not a server problem is my client requests work when it already has a plentiful amount (in fact over the upper limit of 3 hours I set, even if it were allowed to use all the cores).

Peter Hucker
Joined: 6 Oct 06
Posts: 1144
United Kingdom
Message 103431 - Posted: 6 Mar 2021, 17:28:44 UTC
Last modified: 6 Mar 2021, 17:29:25 UTC

It doesn't appear to be doing it on another project. I put TN-Grid to a much higher weighting so it wants to get some (Kryptos was high weighting), then restricted it to 4 at a time. It correctly got loads of Rosetta and Universe tasks instead. I'll stick it back where it was with the dodgy Kryptos requests. If you or Richard can tell me what to log I'll look into it, unless one of you can get an account there. I wasn't aware they had stopped new accounts, although they're in the middle of making new versions of the program so perhaps they don't want too much load. Either that or something has crashed - is it actually disabled or does it appear broken?

Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 4536
United Kingdom
Message 103432 - Posted: 6 Mar 2021, 17:40:33 UTC

Sorry, I was out for a walk, fetching the newspaper.

The key 'starter for ten' is <sched_op_debug> - that's quiet enough to leave running all the time. The key thing to post here is the number of seconds of work requested, and the number of (estimated) seconds returned.

Depending on the answer, we might start asking questions about DCF, and even for a single (PLEASE - only one) cycle of <work_fetch_debug>. DON'T leave that one running!

I don't know the project, so I can only go on what you report.
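For anyone following along: these flags normally live in the <log_flags> section of cc_config.xml in the BOINC data directory. Below is a minimal Python sketch that builds such a file's contents. The helper name build_cc_config is made up for illustration, and writing the text to disk (or merging it into an existing cc_config.xml by hand) and getting the client to re-read its config files are left out.

```python
import xml.etree.ElementTree as ET


def build_cc_config(log_flags):
    """Return cc_config.xml text with the given log flags (hypothetical helper)."""
    root = ET.Element("cc_config")
    flags_node = ET.SubElement(root, "log_flags")
    for name, enabled in log_flags.items():
        # Each flag is an element holding 1 (on) or 0 (off).
        ET.SubElement(flags_node, name).text = "1" if enabled else "0"
    return ET.tostring(root, encoding="unicode")


# sched_op_debug can stay on; work_fetch_debug is only wanted for a single cycle.
config_text = build_cc_config({"sched_op_debug": True, "work_fetch_debug": False})
print(config_text)
```

Note that this produces a file containing only the log flags, so any other options already set in an existing cc_config.xml would need to be carried over manually.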

Peter Hucker
Joined: 6 Oct 06
Posts: 1144
United Kingdom
Message 103433 - Posted: 6 Mar 2021, 17:48:35 UTC - in response to Message 103432.  
Last modified: 6 Mar 2021, 17:51:59 UTC

Sorry, I was out for a walk, fetching the newspaper.

The key 'starter for ten' is <sched_op_debug> - that's quiet enough to leave running all the time. The key thing to post here is the number of seconds of work requested, and the number of (estimated) seconds returned.

Depending on the answer, we might start asking questions about DCF, and even for a single (PLEASE - only one) cycle of <work_fetch_debug>. DON'T leave that one running!

I don't know the project, so I can only go on what you report.
You can get newspapers on t'internet nowadays lad!

I'll switch <sched_op_debug> on and wait for it to do it again. But as I said earlier, I don't think it's getting too much on the request, it only gets a handful at once. The problem is it's asking for more when it already has loads. A simplified example:

Kryptos at weighting 100, Universe at weighting 10, Rosetta at weighting 10. PC has 24 cores, Kryptos is limited to using 4. Boinc is running 4 Kryptos tasks, 10 Rosetta tasks, and 10 Universe tasks, and has 100 Kryptos tasks in the buffer. A Universe finishes, there is nothing but Kryptos in the queue and it's not allowed to run a 5th one. It asks Kryptos for more work. It can't run any of those. It asks Universe for more work, then starts one. I now have 110 Kryptos tasks in the buffer. Then 120 next time, and so on.
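The simplified example can be written out as a toy simulation. This is only a sketch of the behaviour described in the post, not BOINC's actual work-fetch logic; the function name, the respect_limit switch and the batch size of 10 tasks are all assumptions chosen to match the 100, 110, 120 progression.

```python
KRYPTOS_MAX_CONCURRENT = 4   # limit from app_config
TASKS_PER_FETCH = 10         # hypothetical batch size, matching 100 -> 110 -> 120


def simulate(idle_events, respect_limit, queued=100, running=4):
    """Track the Kryptos queue each time a core on another project goes idle.

    Toy model only: Kryptos has the highest weighting, so it is asked first.
    With respect_limit=True the client skips the request whenever no further
    Kryptos task would be allowed to start.
    """
    history = [queued]
    for _ in range(idle_events):
        can_start_another = running < KRYPTOS_MAX_CONCURRENT
        if can_start_another or not respect_limit:
            queued += TASKS_PER_FETCH   # fetched, but cannot run yet
        # Either way the idle core ends up being filled by another project.
        history.append(queued)
    return history


print("as observed:    ", simulate(2, respect_limit=False))
print("limit respected:", simulate(2, respect_limit=True))
```

In this toy model the queue only grows when the concurrency limit is ignored at request time, which is the behaviour being complained about.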

robsmith
Volunteer tester
Help desk expert
Joined: 25 May 09
Posts: 976
United Kingdom
Message 103434 - Posted: 6 Mar 2021, 17:57:42 UTC - in response to Message 103433.  

You can get newspapers on t'internet nowadays lad!

There are some things one might find rather uncomfortable or unpleasant to do with an i-pad or the like......Wrapping up one's fish and chips being but one of them ;-)

Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 4536
United Kingdom
Message 103435 - Posted: 6 Mar 2021, 18:01:14 UTC - in response to Message 103434.  

You can get newspapers on t'internet nowadays lad!
There are some things one might find rather uncomfortable or unpleasant to do with an i-pad or the like......Wrapping up one's fish and chips being but one of them ;-)
And going for a walk on the internet is a bit wobbly.

Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 4536
United Kingdom
Message 103436 - Posted: 6 Mar 2021, 18:05:11 UTC - in response to Message 103433.  

Well, on those figures, Kryptos is always going to be the highest priority for work fetch - it'll always be under-represented compared to its resource share (and that won't take any notice of app_config.xml, either).

Tweak your resource shares to more closely match what you actually want to run.
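To put rough numbers on that, here is a small Python sketch using the weightings from the simplified example (100, 10, 10) on 24 cores. It treats resource share as a straight proportional split of cores, which is a simplification: the real client tracks work done over time rather than doing a one-off split. All names are illustrative.

```python
resource_share = {"Kryptos": 100, "Universe": 10, "Rosetta": 10}  # weightings from the example
total_cores = 24
max_concurrent = {"Kryptos": 4}  # per-app limit from app_config

total_share = sum(resource_share.values())

summary = {}
for project, share in resource_share.items():
    # Cores the project would get under a simple proportional split (assumption).
    entitled = share * total_cores / total_share
    # Cores it can actually use once the concurrency limit is applied.
    usable = min(entitled, max_concurrent.get(project, total_cores))
    summary[project] = (entitled, usable)
    print(f"{project}: entitled to {entitled:.0f} cores, can use {usable:.0f}")
```

On these figures Kryptos is entitled to far more cores than it is allowed to use, so it always looks short of work relative to its share.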

Peter Hucker
Joined: 6 Oct 06
Posts: 1144
United Kingdom
Message 103437 - Posted: 6 Mar 2021, 18:08:05 UTC - in response to Message 103435.  

You can get newspapers on t'internet nowadays lad!
There are some things one might find rather uncomfortable or unpleasant to do with an i-pad or the like......Wrapping up one's fish and chips being but one of them ;-)
And going for a walk on the internet is a bit wobbly.
Trying it on fibre is just showing off.

Peter Hucker
Joined: 6 Oct 06
Posts: 1144
United Kingdom
Message 103438 - Posted: 6 Mar 2021, 18:09:50 UTC - in response to Message 103436.  

Well, on those figures, Kryptos is always going to be the highest priority for work fetch - it'll always be under-represented compared to its resource share (and that won't take any notice of app_config.xml, either).

Tweak your resource shares to more closely match what you actually want to run.
Ah, so it's a known bug. App_config not checked when deciding on which project to download from. The funny thing is it always gets it right the second time. Something told it "oh, those ones I just downloaded are no good". If that check was performed one stage earlier....

I find it a bit daft that I have to adjust two things to tell it one thing.

Peter Hucker
Joined: 6 Oct 06
Posts: 1144
United Kingdom
Message 103439 - Posted: 6 Mar 2021, 19:47:28 UTC - in response to Message 103436.  

Well, on those figures, Kryptos is always going to be the highest priority for work fetch - it'll always be under-represented compared to its resource share (and that won't take any notice of app_config.xml, either).

Tweak your resource shares to more closely match what you actually want to run.

I set the resource shares globally among 7 computers and 2 phones. Each PC might have different requirements - can't run certain apps, can't run too many at once, etc. I'll just leave it, it seems to top out at a certain number in the queue (which is still well within the deadline), so I can only assume the server is saying "er.... no, you have loads to do, you ain't getting any more".

Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 4536
United Kingdom
Message 103440 - Posted: 6 Mar 2021, 20:05:24 UTC - in response to Message 103439.  

I can only assume the server is saying "er.... no, you have loads to do, you ain't getting any more".
The actual wording in the log would be "This computer has reached a limit of tasks in progress", after an attempted work fetch.

If the project has set such a thing.

Peter Hucker
Joined: 6 Oct 06
Posts: 1144
United Kingdom
Message 103443 - Posted: 7 Mar 2021, 17:11:42 UTC - in response to Message 103440.  

I can only assume the server is saying "er.... no, you have loads to do, you ain't getting any more".
The actual wording in the log would be "This computer has reached a limit of tasks in progress", after an attempted work fetch.

If the project has set such a thing.
The biggest unanswered question is, why is Boinc asking for work without checking the requirements I've set in app config? It would be like:

You own a business doing up houses. You employ 4 electricians and 6 carpenters. You already have a backlog of electrical work to get done, but one of the carpenters is sat idle. You don't find more electrical work!

robsmith
Volunteer tester
Help desk expert
Joined: 25 May 09
Posts: 976
United Kingdom
Message 103444 - Posted: 7 Mar 2021, 17:36:42 UTC - in response to Message 103428.  

Sorry, I forgot to say - this one looks to be a similar problem to yours (from the title)
https://www.kryptosathome.com/forum_thread.php?id=18

Peter Hucker
Joined: 6 Oct 06
Posts: 1144
United Kingdom
Message 103445 - Posted: 7 Mar 2021, 17:49:59 UTC - in response to Message 103444.  
Last modified: 7 Mar 2021, 17:50:56 UTC

Sorry, I forgot to say - this one looks to be a similar problem to yours (from the title)
https://www.kryptosathome.com/forum_thread.php?id=18
While they are discussing my (solved) problem of slowing down or crashing the tasks, and they achieved this by using either max concurrent or limiting Boinc's CPU usage globally, nobody has mentioned the problem I then get of downloading too many.

I don't have any crashing any more. That can be stopped by leaving paused tasks in RAM: they don't checkpoint properly, but Boinc assumes everything does. Another project, Private GFN (a branch of Primegrid), also cannot be paused, but it's GPU work, and Boinc refuses to leave those in memory. I have in the past set Boinc to have a very large time between swapping tasks, but that causes things to miss their deadline.

Whatever you tweak in Boinc, it breaks something else [head collides with desk]

robsmith
Volunteer tester
Help desk expert
Joined: 25 May 09
Posts: 976
United Kingdom
Message 103446 - Posted: 7 Mar 2021, 18:11:41 UTC - in response to Message 103445.  

As I said - I just scanned thread titles - win some, lose some.
I did have a look in a couple of the other threads, and someone a few months back said something about this project not having done a good job with the actual application development. I think the incorrect check-pointing you mention is a symptom of that :-(

I can't help thinking that the excessive delivery of tasks you are seeing may be down to the project having tried to do "something clever" with the server side of BOINC and not having got everything right. I do know that other projects are not exhibiting the same over-supply issue when using the max-concurrent tag, but seeing the exact messages around a work-call and a work-not-call would certainly help Richard understand what is going on, and thus either report a bug correctly or point you in the direction of a solution.

Peter Hucker
Joined: 6 Oct 06
Posts: 1144
United Kingdom
Message 103447 - Posted: 7 Mar 2021, 18:34:04 UTC - in response to Message 103446.  

As I said - I just scanned thread titles - win some, lose some.
I did have a look in a couple of the other threads, and someone a few months back said something about this project not having done a good job with the actual application development. I think the incorrect check-pointing you mention is a symptom of that :-(
I did find it amusing when an LHC admin joined in. The argument was because Kryptos requires VB5, and although LHC will work with 5, they prefer you to use 6. All I did was link to a valid point made by LHC that it was a bit off making a brand new program requiring users to install outdated software.

I can't help thinking that the excessive delivery of tasks you are seeing may be down to the project having tried to do "something clever" with the server side of BOINC and not having got everything right. I do know that other projects are not exhibiting the same over-supply issue when using the max-concurrent tag, but seeing the exact messages around a work-call and a work-not-call would certainly help Richard understand what is going on, and thus either report a bug correctly or point you in the direction of a solution.
I'm still convinced it's client side, because why is my client asking for more work when it already has well over the buffer? It can only be caused by it not checking to see if those idle cores are actually allowed to do Kryptos.

Is this what is needed?

20748	Kryptos@Home	07-03-2021 02:54 PM	[sched_op] Starting scheduler request	
20749	Kryptos@Home	07-03-2021 02:54 PM	Sending scheduler request: To fetch work.	
20750	Kryptos@Home	07-03-2021 02:54 PM	Reporting 5 completed tasks	
20751	Kryptos@Home	07-03-2021 02:54 PM	Requesting new tasks for CPU	
20752	Kryptos@Home	07-03-2021 02:54 PM	[sched_op] CPU work request: 38618.23 seconds; 0.00 devices	
20753	Kryptos@Home	07-03-2021 02:54 PM	[sched_op] AMD/ATI GPU work request: 0.00 seconds; 0.00 devices	
20754	Kryptos@Home	07-03-2021 02:54 PM	Scheduler request completed: got 20 new tasks	
20755	Kryptos@Home	07-03-2021 02:54 PM	[sched_op] Server version 715	
20756	Kryptos@Home	07-03-2021 02:54 PM	You are attached to this project twice.  Please remove projects named Kryptos@Home, then add http://www.kryptosathome.com/	
20757	Kryptos@Home	07-03-2021 02:54 PM	Project requested delay of 7 seconds	
20758	Kryptos@Home	07-03-2021 02:54 PM	[sched_op] estimated total CPU task duration: 39316 seconds	
20759	Kryptos@Home	07-03-2021 02:54 PM	[sched_op] estimated total AMD/ATI GPU task duration: 0 seconds	
20760	Kryptos@Home	07-03-2021 02:54 PM	[sched_op] handle_scheduler_reply(): got ack for task kryptos-plato_4090535_1614514611.768375_1	
20761	Kryptos@Home	07-03-2021 02:54 PM	[sched_op] handle_scheduler_reply(): got ack for task kryptos-plato_389219_1615074334.236245_0	
20762	Kryptos@Home	07-03-2021 02:54 PM	[sched_op] handle_scheduler_reply(): got ack for task kryptos-plato_389243_1615074334.554018_0	
20763	Kryptos@Home	07-03-2021 02:54 PM	[sched_op] handle_scheduler_reply(): got ack for task kryptos-plato_385387_1615072753.452400_0	
20764	Kryptos@Home	07-03-2021 02:54 PM	[sched_op] handle_scheduler_reply(): got ack for task kryptos-plato_391299_1615074871.761326_0	
20765	Kryptos@Home	07-03-2021 02:54 PM	[sched_op] Deferring communication for 00:00:07	
20766	Kryptos@Home	07-03-2021 02:54 PM	[sched_op] Reason: requested by project	

RYZEN

21062	Kryptos@Home	07-03-2021 03:55 PM	[sched_op] Starting scheduler request	
21063	Kryptos@Home	07-03-2021 03:55 PM	Sending scheduler request: To report completed tasks.	
21064	Kryptos@Home	07-03-2021 03:55 PM	Reporting 8 completed tasks	
21065	Kryptos@Home	07-03-2021 03:55 PM	Not requesting tasks: don't need (CPU: not highest priority project; AMD/ATI GPU: )	
21066	Kryptos@Home	07-03-2021 03:55 PM	[sched_op] CPU work request: 0.00 seconds; 0.00 devices	
21067	Kryptos@Home	07-03-2021 03:55 PM	[sched_op] AMD/ATI GPU work request: 0.00 seconds; 0.00 devices	
21068	Kryptos@Home	07-03-2021 03:55 PM	Scheduler request completed	
21069	Kryptos@Home	07-03-2021 03:55 PM	[sched_op] Server version 715	
21070	Kryptos@Home	07-03-2021 03:55 PM	You are attached to this project twice.  Please remove projects named Kryptos@Home, then add http://www.kryptosathome.com/	
21071	Kryptos@Home	07-03-2021 03:55 PM	Project requested delay of 7 seconds	
21072	Kryptos@Home	07-03-2021 03:55 PM	[sched_op] handle_scheduler_reply(): got ack for task kryptos-plato_392739_1615075284.827515_0	
21073	Kryptos@Home	07-03-2021 03:55 PM	[sched_op] handle_scheduler_reply(): got ack for task kryptos-plato_392022_1615075140.740626_0	
21074	Kryptos@Home	07-03-2021 03:55 PM	[sched_op] handle_scheduler_reply(): got ack for task kryptos-plato_389235_1615074334.494139_0	
21075	Kryptos@Home	07-03-2021 03:55 PM	[sched_op] handle_scheduler_reply(): got ack for task kryptos-plato_389251_1615074334.656094_0	
21076	Kryptos@Home	07-03-2021 03:55 PM	[sched_op] handle_scheduler_reply(): got ack for task kryptos-plato_391999_1615075135.625286_0	
21077	Kryptos@Home	07-03-2021 03:55 PM	[sched_op] handle_scheduler_reply(): got ack for task kryptos-plato_392763_1615075285.144052_0	
21078	Kryptos@Home	07-03-2021 03:55 PM	[sched_op] handle_scheduler_reply(): got ack for task kryptos-plato_391200_1615074831.501145_0	
21079	Kryptos@Home	07-03-2021 03:55 PM	[sched_op] handle_scheduler_reply(): got ack for task kryptos-plato_389462_1615074338.257452_0	
21080	Kryptos@Home	07-03-2021 03:55 PM	[sched_op] Deferring communication for 00:00:07	
21081	Kryptos@Home	07-03-2021 03:55 PM	[sched_op] Reason: requested by project	
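
The two numbers Richard asked for (seconds requested and estimated seconds returned) can be pulled out of the first log excerpt with a short script. This is a rough sketch: the regular expressions assume only the message wording shown above, and the function name summarize is made up.

```python
import re

# Message text copied from the first log excerpt (event numbers and timestamps dropped).
LOG = """\
[sched_op] CPU work request: 38618.23 seconds; 0.00 devices
[sched_op] AMD/ATI GPU work request: 0.00 seconds; 0.00 devices
Scheduler request completed: got 20 new tasks
[sched_op] estimated total CPU task duration: 39316 seconds
[sched_op] estimated total AMD/ATI GPU task duration: 0 seconds
"""

REQUESTED = re.compile(r"CPU work request: ([\d.]+) seconds")
RETURNED = re.compile(r"estimated total CPU task duration: ([\d.]+) seconds")


def summarize(log_text):
    """Return (seconds requested, estimated seconds returned, returned/requested)."""
    requested = float(REQUESTED.search(log_text).group(1))
    returned = float(RETURNED.search(log_text).group(1))
    return requested, returned, returned / requested


requested, returned, ratio = summarize(LOG)
print(f"requested {requested / 3600:.1f} h, returned {returned / 3600:.1f} h, ratio {ratio:.2f}")
```

On these figures the server sent back about as much estimated work as the client asked for, so the size of the request itself looks like the thing to explain.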

Copyright © 2021 University of California. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.