7.14.2 and 7.12.1 both fail to get work units on very fast systems

Message boards : BOINC client : 7.14.2 and 7.12.1 both fail to get work units on very fast systems

Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 91230 - Posted: 28 Apr 2019, 1:02:40 UTC
Last modified: 28 Apr 2019, 1:03:57 UTC

There was a discussion about this at Milkyway and also at SETI. Basically, my 4 GPUs finish a work unit in 10 seconds on average. The queue when full is typically 600 - 800 tasks, but after it empties (Milkyway project) no work units are provided for anywhere from 5 - 15 minutes. The suggestion was to downgrade to 7.12.1, but that did not fix the problem. This is inconvenient because BOINC schedules other projects in, whereas I have set a priority so that they do not run unless the primary project is down, offline, etc. I can issue a manual "update" to fix the problem, so the project has data but won't send it.

Tried 7.12.1: got a 10.5 minute delay, as shown at 7:43. The delays were longer with 7.14.2, going back over 24 hours of history.

1			4/27/2019 5:02:43 PM	Starting BOINC client version 7.12.1 for windows_x86_64	
2			4/27/2019 5:02:43 PM	log flags: file_xfer, sched_ops, task	
3			4/27/2019 5:02:43 PM	Libraries: libcurl/7.47.1 OpenSSL/1.0.2g zlib/1.2.8	
4			4/27/2019 5:02:43 PM	Data directory: C:\ProgramData\BOINC	
5			4/27/2019 5:02:43 PM	Running under account josephy@stateson.net	
6			4/27/2019 5:02:45 PM	OpenCL: AMD/ATI GPU 0: AMD FirePro S9100 (driver version 2671.3, device version OpenCL 1.2 AMD-APP (2671.3), 6144MB, 6144MB available, 3226 GFLOPS peak)	
7			4/27/2019 5:02:45 PM	OpenCL: AMD/ATI GPU 1: AMD FirePro S9100 (driver version 2671.3, device version OpenCL 1.2 AMD-APP (2671.3), 6144MB, 6144MB available, 3226 GFLOPS peak)	
8			4/27/2019 5:02:45 PM	OpenCL: AMD/ATI GPU 2: AMD FirePro S9100 (driver version 2671.3, device version OpenCL 2.0 AMD-APP (2671.3), 12288MB, 12288MB available, 4608 GFLOPS peak)	
9			4/27/2019 5:02:45 PM	OpenCL: AMD/ATI GPU 3: AMD FirePro S9100 (driver version 2671.3, device version OpenCL 1.2 AMD-APP (2671.3), 6144MB, 6144MB available, 3226 GFLOPS peak)	
-
-
-
-
2213	Milkyway@Home	4/27/2019 7:31:21 PM	Computation for task de_modfit_85_bundle4_4s_south4s_0_1555431910_4124594_0 finished	
2214	Milkyway@Home	4/27/2019 7:32:32 PM	Sending scheduler request: To fetch work.	
2215	Milkyway@Home	4/27/2019 7:32:32 PM	Reporting 7 completed tasks	
2216	Milkyway@Home	4/27/2019 7:32:32 PM	Requesting new tasks for AMD/ATI GPU	
2217	Milkyway@Home	4/27/2019 7:32:34 PM	Scheduler request completed: got 0 new tasks	
2218	Milkyway@Home	4/27/2019 7:43:14 PM	Sending scheduler request: To fetch work.	
2219	Milkyway@Home	4/27/2019 7:43:14 PM	Requesting new tasks for AMD/ATI GPU	
2220	Milkyway@Home	4/27/2019 7:43:20 PM	Scheduler request completed: got 598 new tasks	
2221	Milkyway@Home	4/27/2019 7:43:23 PM	Starting task de_modfit_80_bundle5_4s_south4s_0_1554998626_1474893_2	
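For anyone who wants to put a number on these gaps, here is a minimal Python sketch, assuming the event log has been saved as text in the format above. It measures the time from the first "got 0 new tasks" reply to the next reply that delivers work. The name dry_spells is made up for this example, and the two sample lines are copied from the log above.

```python
import re
from datetime import datetime

# Timestamp and reply patterns as they appear in the event log above.
STAMP = re.compile(r"(\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M)")
GOT = re.compile(r"Scheduler request completed: got (\d+) new tasks")

def dry_spells(lines):
    """Return (start, end, seconds) for each run of empty replies that ends with work."""
    spells, start = [], None
    for line in lines:
        stamp, got = STAMP.search(line), GOT.search(line)
        if not (stamp and got):
            continue
        when = datetime.strptime(stamp.group(1), "%m/%d/%Y %I:%M:%S %p")
        if int(got.group(1)) == 0:
            if start is None:
                start = when  # first empty reply of this spell
        elif start is not None:
            spells.append((start, when, (when - start).total_seconds()))
            start = None
    return spells

log = [
    "2217\tMilkyway@Home\t4/27/2019 7:32:34 PM\tScheduler request completed: got 0 new tasks",
    "2220\tMilkyway@Home\t4/27/2019 7:43:20 PM\tScheduler request completed: got 598 new tasks",
]
for start, end, seconds in dry_spells(log):
    print(f"{start:%H:%M:%S} -> {end:%H:%M:%S}: {seconds / 60:.1f} min without work")
```

Feeding it the whole saved log instead of two lines would list every dry spell, which makes it easier to compare client versions.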
ID: 91230

Keith Myers
Volunteer tester
Help desk expert
Joined: 17 Nov 16
Posts: 863
United States
Message 91231 - Posted: 28 Apr 2019, 6:33:33 UTC

Maybe you can ask Richard Haselgrove where to find the #3076 appveyor artifact link for the Windows client so you can download it and try it.

I thought using the older 7.12.1 client would have worked for you.
ID: 91231

Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 91232 - Posted: 28 Apr 2019, 7:06:47 UTC - in response to Message 91231.  
Last modified: 28 Apr 2019, 8:35:05 UTC

I think we need to understand a little bit better where this delay is coming from.

First, every time you contact a project, the project itself asks you to pause a little while before asking again - SETI for 303 seconds, Milkyway for 91 seconds. That shows in the Event Log, and it's best to have regard to it, but it obviously isn't the problem here.

Most of the time, it's your machine which does the asking for work, when it thinks it needs it - and that depends on your cache settings. It's usually best to ask 'little and often', but that does depend on whether the project regularly has work to send you.

I'd always suggest setting the <sched_op_debug> Event Log flag (which you can do from the Manager, Options menu) - it gives you the very basic stuff, like

28/04/2019 07:09:43 | SETI@home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
28/04/2019 07:09:43 | SETI@home | [sched_op] NVIDIA GPU work request: 4332.36 seconds; 0.00 devices
28/04/2019 07:09:43 | SETI@home | [sched_op] Intel GPU work request: 0.00 seconds; 0.00 devices
- I usually like to request about an hour of work, once an hour, but we can tweak that later. Problems like the one you're describing more usually arise when you wait and ask for a lot of work at once, and happen to hit a moment when the project hasn't got any - that tends to be when BOINC decides not to bother asking for a while, and fetches from a backup project instead. But we need to understand the overall picture before we can be sure of that.
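
The [sched_op] lines quoted above can be pulled out of a saved event log with a few lines of Python. This is only a sketch: work_requests is a hypothetical helper, and it assumes the wording of the request lines is exactly as in the three sample lines, which are copied from above.

```python
import re

# "<resource> work request: <seconds> seconds; <devices> devices"
REQUEST = re.compile(
    r"\[sched_op\] (.+?) work request: ([\d.]+) seconds; ([\d.]+) devices"
)

def work_requests(lines):
    """Map each resource name to (seconds requested, idle devices requested)."""
    requests = {}
    for line in lines:
        match = REQUEST.search(line)
        if match:
            requests[match.group(1)] = (float(match.group(2)), float(match.group(3)))
    return requests

sample = [
    "28/04/2019 07:09:43 | SETI@home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices",
    "28/04/2019 07:09:43 | SETI@home | [sched_op] NVIDIA GPU work request: 4332.36 seconds; 0.00 devices",
    "28/04/2019 07:09:43 | SETI@home | [sched_op] Intel GPU work request: 0.00 seconds; 0.00 devices",
]
for resource, (seconds, devices) in work_requests(sample).items():
    print(f"{resource}: {seconds / 3600:.2f} hours requested, {devices:.0f} idle devices")
```

A request of several hours at once, rather than about one hour, would be the "wait and ask for a lot" pattern described above.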

Got to go out now, but the artifacts Keith mentioned are at https://ci.appveyor.com/project/BOINC/boinc/builds/23992763/artifacts. I'll dig out some notes on how to use them later.

Late edit - here are the notes I wrote last time for using those artifacts.

You'll need the middle one of the three downloadable links, labelled 'win-client'.

You may also need two common free utilities:
1) An archiver program capable of handling the 7z archive format, like https://www.7-zip.org/
2) The Microsoft Visual C++ Redistributable Packages for Visual Studio 2013, from https://www.microsoft.com/en-US/download/details.aspx?id=40784 (or your local Microsoft site)

The win-client archive package contains five separate executables. You only need to extract one of them, boinc.exe

All you need to do is to stop BOINC and replace that single file in the BOINC program directory: by default this is

C:\Program Files\BOINC

Try starting BOINC normally: it should just start up. If you see the Manager opening, but nothing else happens (it stops at 'Connecting to Client'), go back to C:\Program Files\BOINC and double-click on the new copy of boinc.exe. If you get a warning about a missing DLL (either msvcp120.dll or msvcr120.dll, I forget which pops up first), simply install the Microsoft C++ package.

And that's all. BOINC should start up normally, and report in the event log

Starting BOINC client version 7.15.0 for windows_x86_64
This a development version of BOINC and may not function properly
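
The file swap described above could be scripted roughly as follows. This is a sketch under assumptions: swap_client and the boinc.exe.bak backup name are hypothetical, the extracted boinc.exe is assumed to be somewhere on disk already, BOINC must be stopped first, and the script needs write access to the program directory.

```python
import shutil
from pathlib import Path

def swap_client(new_exe: Path, program_dir: Path) -> Path:
    """Back up program_dir/boinc.exe once, then copy new_exe over it.

    Returns the path of the backup so the original client can be restored.
    """
    target = program_dir / "boinc.exe"
    backup = program_dir / "boinc.exe.bak"  # hypothetical backup name
    if target.exists() and not backup.exists():
        shutil.copy2(target, backup)  # keep the first (release) client only
    shutil.copy2(new_exe, target)
    return backup

# Example call, using the default directory named above (paths are illustrative):
# swap_client(Path(r"C:\Temp\win-client\boinc.exe"), Path(r"C:\Program Files\BOINC"))
```

Copying the backup over boinc.exe again would undo the change.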
ID: 91232

Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 91233 - Posted: 28 Apr 2019, 14:46:44 UTC - in response to Message 91232.  
Last modified: 28 Apr 2019, 15:03:18 UTC

I will set those log flags and try to get a better picture of what is happening. I did look at that appveyor but I don't think it applies as I do not use max concurrent in cc_config. I do have an app_config for milkyway that I discovered long ago using google. I am not sure what all it does but it does list more info about the GPU and supposedly allows tasks to run faster. I assume it is not causing the problem.

<app_config>
<app_version>
<app_name>milkyway</app_name>
<plan_class>opencl_ati_101</plan_class>
<avg_ncpus>0.20</avg_ncpus>
<ngpus>0.19</ngpus>
<cmdline>--non-responsive --verbose --gpu-target-frequency 1 --gpu-polling-mode -1 --gpu-wait-factor 0 --process-priority 4 --gpu-disable-checkpointing</cmdline>
</app_version>
</app_config>
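
A rough reading of the two fractional values in that app_config, as a Python sketch. It assumes BOINC keeps starting tasks on a GPU for as long as their ngpus fractions fit within one device, and budgets avg_ncpus of a CPU core per task. tasks_per_gpu is a made-up helper name, and the cmdline element is left out because it does not affect the arithmetic.

```python
import math
import xml.etree.ElementTree as ET

APP_CONFIG = """
<app_config>
<app_version>
<app_name>milkyway</app_name>
<plan_class>opencl_ati_101</plan_class>
<avg_ncpus>0.20</avg_ncpus>
<ngpus>0.19</ngpus>
</app_version>
</app_config>
"""

def tasks_per_gpu(xml_text):
    """Return (tasks sharing one GPU, CPU cores budgeted for those tasks)."""
    version = ET.fromstring(xml_text).find("app_version")
    ngpus = float(version.findtext("ngpus"))
    avg_ncpus = float(version.findtext("avg_ncpus"))
    per_gpu = math.floor(1.0 / ngpus)  # assumption: fractions must sum to <= 1 GPU
    return per_gpu, per_gpu * avg_ncpus

per_gpu, cores = tasks_per_gpu(APP_CONFIG)
print(f"{per_gpu} tasks per GPU, {cores:.1f} CPU cores budgeted per GPU")
```

Under those assumptions, 0.19 allows five tasks per GPU, so a 4-GPU host would run about twenty Milkyway tasks at once. That would go some way to explaining how fast the cache drains.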


[EDIT] going to use the following
<cc_config>
<log_flags>
	<work_fetch_debug>1</work_fetch_debug>
	<sched_op_debug>1</sched_op_debug>
</log_flags>
</cc_config>


[EDIT AGAIN]
Getting message "no project chosen for work fetch". I looked at the wiki for cc_config and did not see how to restrict work fetch to just Milkyway, so I get a lot of messages from projects that are not active.
ID: 91233

Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 91234 - Posted: 28 Apr 2019, 15:13:22 UTC - in response to Message 91233.  

Yes, <work_fetch_debug> is a bit of a blunderbuss. Best to set it once, wait until it's done just one cycle, and then unset it again while you pick over the pieces. [That's why I got them to put an 'apply' button on the dialog :-)]

But it is powerful - if you could fillet out that one complete cycle from

[work_fetch] ------- start work fetch state -------

to

[work_fetch] ------- end work fetch state -------

and post it here, we could take a look. Might contain some clues.
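
Cutting one cycle out of a long log by hand is tedious, so here is a small Python sketch that does it. The two marker strings are the ones quoted above; first_cycle is a hypothetical helper, and the lines in the example other than the markers are placeholders, not real client output.

```python
START = "[work_fetch] ------- start work fetch state -------"
END = "[work_fetch] ------- end work fetch state -------"

def first_cycle(lines):
    """Return the first complete block from the start marker to the end marker."""
    block, inside = [], False
    for line in lines:
        if START in line:
            block, inside = [line], True  # (re)start at the most recent start marker
        elif inside:
            block.append(line)
            if END in line:
                return block
    return []  # no complete cycle found

example = [
    "placeholder line before the cycle",
    "Milkyway@Home " + START,
    "placeholder line inside the cycle",
    "Milkyway@Home " + END,
    "placeholder line after the cycle",
]
print("\n".join(first_cycle(example)))
```

With a saved log, something like first_cycle(open("eventlog.txt").read().splitlines()) would give the block to post, where eventlog.txt is whatever name the log was saved under.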
ID: 91234

Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 91235 - Posted: 28 Apr 2019, 15:17:15 UTC - in response to Message 91234.  
Last modified: 28 Apr 2019, 15:18:30 UTC

I managed to find
    if (found) {
        p->sched_rpc_pending = RPC_REASON_NEED_WORK;
    } else {
        if (log_flags.work_fetch_debug) {
            msg_printf(0, MSG_INFO, "[work_fetch] No project chosen for work fetch");


at THIS location but there was no selection for projects. I really need to exclude projects that are not active. I will delete all unused projects from this system to clean up the message log. I gave up trying to build BOINC under VS2017 some time ago.
ID: 91235

Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 91236 - Posted: 28 Apr 2019, 15:41:55 UTC
Last modified: 28 Apr 2019, 16:21:00 UTC

OK, found a way to remove clutter, using BoincTasks "select project" to see only milkyway

====at line 15819=====
At 10:14:04 was last report of completed tasks. 9 reported. THE QUEUE IS EMPTY AT THIS TIME

At 10:20:58 got 621 new tasks. Delay of 6 minutes. Not too bad compared to the 15 minutes I have seen in the past.

Printout from line 15395 to 16650 is here
stateson.net\images\15395.txt

I have the whole 9 yards available if needed.
HTH !!!

[EDIT] Looks like project requested a 6 minute delay! Could this be the problem? Was it the client that wants a delay? I don't know how to read this info. Is it explained somewhere? If so I don't mind doing an analysis.

15835 Milkyway@Home 4/28/2019 10:14:06 AM [work_fetch] backing off AMD/ATI GPU 381 sec
ID: 91236

Keith Myers
Volunteer tester
Help desk expert
Joined: 17 Nov 16
Posts: 863
United States
Message 91239 - Posted: 28 Apr 2019, 17:10:40 UTC - in response to Message 91236.  

Richard is the expert in decrypting the work fetch debug output. Everything appears normal for intervals. The work requests look normal. What I don't understand is why you are getting backoffs for 10 and 5 minutes directly after the scheduler acknowledges receipt of reported work. That is coming from the scheduler and not from your host or client. Normally the scheduler backs off if there are issues in contacting the servers or the client has issues downloading work and the client can't acknowledge correct reception of the sent tasks. Have you looked at the Transfers tab in the Manager after you have requested work and see if you have task downloads in backoff?
ID: 91239

Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 91240 - Posted: 28 Apr 2019, 17:32:18 UTC - in response to Message 91239.  
Last modified: 28 Apr 2019, 17:38:56 UTC

I will check that possibility. It would be nice if that info was in the event log. AFAICT there is no transfer "history" to review, so I have to be ready to catch it. I have good bandwidth here at home, but very rarely downloads hang up if there are too many concurrent ones. Conceivably, if a number of GPUGRID tasks complete all at once, then the upload can be bottlenecked.

I had asked Fred at BoincTasks about implementing a rule for a project being out of data as I could then use the rule to run a batch file and send a text message to my phone. He has a lot on his plate so not sure about when or if that gets implemented. AFAIK that is the only way to find out in real time if project is out of data (other than editing boinc code and building a test program). If the client could put transfer info such as number of pending and estimated time into the event log that would be a real help.

[EDIT] Actually, I can babysit the last few work units and, when they hit 0 tasks, bring up the transfer tab to see WTF is going on.
ID: 91240

Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 91241 - Posted: 28 Apr 2019, 18:12:07 UTC - in response to Message 91239.  

No need. Look at that event log again, without the clutter:

15402 Milkyway@Home 4/28/2019 10:12:30 AM Scheduler request completed: got 0 new tasks
15404 Milkyway@Home 4/28/2019 10:12:30 AM Project requested delay of 91 seconds
15416 Milkyway@Home 4/28/2019 10:12:30 AM [work_fetch] backing off AMD/ATI GPU 723 sec
The project requested 91 seconds. The backoff was done by the client, as a normal reaction to the lack of available work. And if no work was assigned by the server, there'll be no files to download, and nothing will show in the transfers tab.
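
To keep the two kinds of delay apart when reading a long log, a Python sketch like the following may help. It only matches the two message wordings visible in the three lines above; waits is a hypothetical helper, and treating the longer of the two values as the effective wait is an assumption about how the client combines them.

```python
import re

PROJECT_DELAY = re.compile(r"Project requested delay of (\d+) seconds")
CLIENT_BACKOFF = re.compile(r"\[work_fetch\] backing off (.+) (\d+) sec")

def waits(lines):
    """Collect the server-requested delay and the client's per-resource backoffs."""
    found = {"project_delay": None, "client_backoff": {}}
    for line in lines:
        match = PROJECT_DELAY.search(line)
        if match:
            found["project_delay"] = int(match.group(1))
        match = CLIENT_BACKOFF.search(line)
        if match:
            found["client_backoff"][match.group(1)] = int(match.group(2))
    return found

log = [
    "15402 Milkyway@Home 4/28/2019 10:12:30 AM Scheduler request completed: got 0 new tasks",
    "15404 Milkyway@Home 4/28/2019 10:12:30 AM Project requested delay of 91 seconds",
    "15416 Milkyway@Home 4/28/2019 10:12:30 AM [work_fetch] backing off AMD/ATI GPU 723 sec",
]
result = waits(log)
longest = max([result["project_delay"], *result["client_backoff"].values()])
print(result, "-> effective wait of roughly", longest, "seconds (assumed)")
```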

Sorry, I've had a busy weekend showing visitors round Yorkshire. They've moved on to London now, but I found myself surprisingly tired (and I've got a watercooler appointment with the TV later tonight). I should be back to normal tomorrow, and I'll try to look through the rest of the log before your morning starts.
ID: 91241

Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 91242 - Posted: 28 Apr 2019, 18:25:55 UTC

OK, I can manage the next bit now. The next request was

15823 Milkyway@Home 4/28/2019 10:14:06 AM Scheduler request completed: got 0 new tasks
That's effectively 91 seconds after the one in my last post. In this mode, the client will set a backoff every time it gets a got 0 new tasks response, and will clear it every time one of your previously cached tasks finishes computation. Those are both known design features, like 'em or not.

At the moment, you're completing cached work much more often than once every 91 seconds, so you'll keep asking at every possible occasion. But as soon as you run dry, there's nothing to clear the backoffs (no more work finishing), and they'll build up.
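
A toy model of that behaviour, as a Python sketch. The two rules (set a backoff on an empty reply, clear it when a task finishes) are the ones described above. The 600 second starting value, the doubling and the one-day cap are illustrative assumptions, not figures taken from the client source, and the randomisation of real backoffs is left out.

```python
def simulate(events, base=600.0, cap=86400.0):
    """Replay (time_s, kind) events and return the backoff interval after each one.

    kind is "no_work" (scheduler sent 0 tasks) or "task_done" (a cached task finished).
    base, cap and the doubling rule are assumptions made for this illustration.
    """
    interval, history = 0.0, []
    for when, kind in events:
        if kind == "no_work":
            interval = base if interval == 0 else min(interval * 2, cap)
        elif kind == "task_done":
            interval = 0.0  # a finished task clears the backoff
        history.append((when, kind, interval))
    return history

events = [
    (0, "no_work"), (10, "task_done"),   # cache still has work: backoff is cleared
    (91, "no_work"), (101, "task_done"),
    (182, "no_work"),                    # cache is now dry: nothing clears it
    (273, "no_work"), (364, "no_work"),
]
for when, kind, interval in simulate(events):
    print(f"t={when:>4}s  {kind:<9}  backoff interval = {interval:.0f}s")
```

While tasks keep finishing the interval drops back to zero each time. Once the cache is dry, nothing resets it, so each further empty reply only lengthens it.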

The only way to cure this one is to persuade the project to provide more tasks.

A possible mitigation would be to set a cron job to trigger boinccmd to issue a project update every five minutes, but you might get very unpopular with the server administrators. They obviously aren't set up to satisfy your voracious appetite.
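
For reference, the boinccmd call such a job would make can be sketched in Python as below. The project URL is an assumption and has to match the URL the client itself lists for the project (boinccmd --get_project_status shows it). update_command and request_update are hypothetical helper names. The caveat above about annoying the server administrators applies to any scheduled use.

```python
import subprocess

# Assumption: the master URL under which Milkyway@Home is attached on this host.
PROJECT_URL = "http://milkyway.cs.rpi.edu/milkyway/"

def update_command(project_url, boinccmd="boinccmd"):
    """Build the argument list for a one-off project update."""
    return [boinccmd, "--project", project_url, "update"]

def request_update(project_url):
    """Run boinccmd once and return its exit code (needs a running local client)."""
    return subprocess.run(update_command(project_url), check=False).returncode

print(" ".join(update_command(PROJECT_URL)))
```

The scheduling itself (cron, Task Scheduler) is left out on purpose. request_update is only meant to be called by hand or from a job whose interval the project is happy with.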
ID: 91242

Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 91243 - Posted: 28 Apr 2019, 18:34:23 UTC - in response to Message 91241.  
Last modified: 28 Apr 2019, 18:38:26 UTC

Another question might be: why were 0 tasks sent when the project had about 11,000** tasks ready to send? If the project does not want to send tasks (for whatever reason) then the problem is the project and not the client.

If I wait out the seconds (723 or whatever) then I eventually get some new work. I have had other systems with nVidia cards running milkyway. They run much slower and I don't see them run out of data unless the project is off-line.

** Not sure how often the server status is updated, but I checked it when my last Milkyway task finished and the delay started. I did not get any new work for a few minutes, so I issued a project update and got work immediately. It looks like the project is not sending work that it has, and the client is backing off thinking there is no work, which would be the correct procedure IF and only IF the project actually had no work. My guess is the problem is on the server side. Going to put 7.14.2 back on that system.
ID: 91243

Keith Myers
Volunteer tester
Help desk expert
Joined: 17 Nov 16
Posts: 863
United States
Message 91245 - Posted: 28 Apr 2019, 20:30:46 UTC - in response to Message 91241.  
Last modified: 28 Apr 2019, 20:37:59 UTC

Yes, I missed that in all the clutter. I agree with the observation that the project is rarely ever out of work: only during rare maintenance or when something has broken. Now that the tasks per GPU have been increased from the historical 80 to 300, I would need many hours to work through my 0.5 day cache with only the MW project running on my hosts. But I have Nvidia and not ATI/AMD cards.

So why does the client get assigned no work on the request when in fact the server DOES have work? Could this happen if the RTS buffer size is set too low at MW, and too many people hit the buffer just before Beemer Biker's request, exhausting the available work?

Requesting new tasks for AMD/ATI GPU
15821 Milkyway@Home 4/28/2019 10:14:04 AM [sched_op] CPU work request: 0.00 seconds; 0.00 devices
15822 Milkyway@Home 4/28/2019 10:14:04 AM [sched_op] AMD/ATI GPU work request: 120960.00 seconds; 4.00 devices
15823 Milkyway@Home 4/28/2019 10:14:06 AM Scheduler request completed: got 0 new tasks

[Edit] I was incorrect about the number of allowed tasks. This is from Jake Weiss' post in project News:

Hey guys,

So the current set up allows for users to have up to 200 workunits per GPU on their computer and another 40 workunits per CPU with a maximum of 600 possible workunits.

On the server, we try to store a cache of 10,000 workunits. Sometimes when a lot of people request work all at the same time, this cache will run low.

So all of the numbers I have listed are tunable. What would you guys recommend for changes to these numbers?

Jake
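
A quick back-of-envelope check of what those limits mean for the host in this thread, as a Python sketch. The per-GPU, per-CPU and overall limits are Jake's figures above. The rate of one task every 10 seconds for the whole host is one reading of the first post, and cache_cap and minutes_of_work are made-up names. CPU tasks are ignored because the 4 GPUs alone already exceed the overall limit.

```python
def cache_cap(gpus, cpus, per_gpu=200, per_cpu=40, ceiling=600):
    """Largest number of workunits a host may hold under the quoted limits."""
    return min(gpus * per_gpu + cpus * per_cpu, ceiling)

def minutes_of_work(tasks, seconds_per_task):
    """How long a cache lasts if the host finishes one task every seconds_per_task."""
    return tasks * seconds_per_task / 60

cap = cache_cap(gpus=4, cpus=0)
print(f"cap = {cap} tasks, roughly {minutes_of_work(cap, 10):.0f} minutes of work")
```

On those assumptions a full cache lasts well under two hours, so this host drains its whole allowance many times a day. The replies of 598 and 621 tasks seen earlier in the thread suggest the limit in force may differ slightly from the quoted 600.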
ID: 91245

Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 91247 - Posted: 28 Apr 2019, 22:22:12 UTC

The TV cliffhanger is nicely set up for the feature-length series closer next week. Meanwhile, back at BOINC...

Quickly, before I fall into bed. The BOINC structure is uniform across projects, with minor local tweaks.

There are two numbers to consider - and please pass these on to Jake for consideration.

The first is the total number of workunits created by what are generically known as workunit generators, and are familiar to SETIzens as 'splitters'. At SETI, this number hovers around the 600,000 level, and is subject to some hysteresis - it takes time to turn off the generators when the RTS buffer hits peak, and they aren't called back into service until it falls to trough. The SETI generator has a typical wavelength of around an hour: I don't have a ready knowledge of the Milkyway figures, or even whether there's an equivalent of the Haveland graphs. The RTS buffer is stored in databases and disk files.

The second number relates to tasks held in fast cache memory by a process known as the 'feeder'. That number's probably measured in hundreds, and has a cycle time measured in seconds.

When you request work, it's the tasks in the feeder cache which are scanned for suitability: you need that fast cache response time. "No tasks allocated" equates either to "feeder empty" or "no suitable tasks in feeder". It's the old trade counter vs warehouse problem:

Are these in stock? Yes
Can I pick one up? No
Why not? Because I'd have to send Joe over to the warehouse first, and he's on lunch

My suspicion, reading this thread, is that Jake is talking about the workunit generators and RTS: I don't think he's reached the page in the manual about the feeder yet. Perhaps one of you could point him to https://boinc.berkeley.edu/trac/wiki/BackendPrograms#feeder, but tell him not to blindly wind up

<feeder_query_size>N</feeder_query_size>
The size of the feeder's enumeration query. Default is 200.
to something obscene: both timing (the scheduler has to search the list) and size (it has to fit into memory without paging) are critical.
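
The trade counter vs warehouse picture can be made concrete with a toy model in Python. This is not how the BOINC server is implemented: ToyProject, its method names and the 200-slot feeder are illustrative assumptions, and only the 10,000 figure comes from the numbers quoted earlier in the thread.

```python
class ToyProject:
    """Warehouse (ready-to-send tasks) plus a small front desk (feeder slots)."""

    def __init__(self, ready_to_send, feeder_slots):
        self.ready_to_send = ready_to_send
        self.feeder_slots = feeder_slots
        self.feeder = 0  # tasks currently on the front desk

    def refill(self):
        """Feeder cycle: top the front desk up from the warehouse."""
        moved = min(self.feeder_slots - self.feeder, self.ready_to_send)
        self.feeder += moved
        self.ready_to_send -= moved

    def request(self, wanted):
        """Scheduler request: only tasks on the front desk can be handed out."""
        sent = min(wanted, self.feeder)
        self.feeder -= sent
        return sent

project = ToyProject(ready_to_send=10_000, feeder_slots=200)
project.refill()
print("first host gets", project.request(200))   # empties the front desk
print("second host gets", project.request(50),   # arrives before the next refill
      "with", project.ready_to_send, "still in the warehouse")
project.refill()
print("after a refill it gets", project.request(50))
```

The second request comes back empty even though thousands of tasks are "in stock", which is the situation described above.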

And so to bed.
ID: 91247

Mad_Max
Joined: 29 Apr 19
Posts: 19
Russia
Message 91249 - Posted: 29 Apr 2019, 9:54:01 UTC - in response to Message 91242.  
Last modified: 29 Apr 2019, 10:09:00 UTC

It is not a problem with the supply of work from the project. I also have this problem of getting no work from MW, and so do many other people (there is a discussion thread about it at the MW forum - https://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=4424 - and a few other threads).
There is plenty of work available to send (usually the server maintains about 10 000 tasks ready to send; sometimes it goes down by 1k-2k tasks but almost never close to zero), but some users cannot get any of it until their local BOINC work cache is empty, getting "got 0 new tasks" all the time until that point.
But if you press "update", the client receives a lot of work (a few dozen tasks per request) immediately. Only automatic (scheduled) work fetch is failing.

Here is an example of a BOINC log: https://pastebin.com/8LCxm5RN

The problem also exists with old BOINC clients - I used 7.6.22, for example.
So maybe it is not a client issue but something in the server part of the BOINC code.
ID: 91249

Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 91250 - Posted: 29 Apr 2019, 10:04:25 UTC - in response to Message 91249.  
Last modified: 29 Apr 2019, 10:06:46 UTC

See my following post. It's the difference between warehouse storage and front desk pickup.

You see the numbers in the warehouse, and sure - there's plenty of work back there. But you don't see what's on the front desk - the project doesn't show that to you, because it changes several times a second.

All that clicking 'update' does is to send an immediate request (if you need work) - it doesn't change the state of the front desk pickup supplies. You stand exactly the same chance of success as if BOINC had asked automatically.
ID: 91250

Mad_Max
Joined: 29 Apr 19
Posts: 19
Russia
Message 91251 - Posted: 29 Apr 2019, 10:32:15 UTC
Last modified: 29 Apr 2019, 10:37:43 UTC

Read it. But I am still sure that the problem is not with the supply of work.

Because, as you wrote yourself, scheduled requests and manual updates should have the same chance of getting work. So if there was any shortfall of work supply on the servers, the failure rate (not getting new work) of scheduled requests and manual updates should be at about the same level.

But that is not the case: almost all scheduled requests fail while almost all manual updates succeed.
Also, almost all scheduled requests succeed if the work cache is empty.
I added an example log from one of my machines in my previous post, but you probably missed it as it was in an edit (I did not expect a response so fast). Here it is again: https://pastebin.com/8LCxm5RN

As you can see, there are about 100 failed scheduled requests before the cache is empty. This is where the last WU from the cache finished, and the very first request after the cache was empty got a lot of new work.
29/04/2019 05:08:16 | Milkyway@Home | Computation for task de_modfit_84_bundle4_4s_south4s_0_1555431910_4347759_1 finished
29/04/2019 05:08:30 | Milkyway@Home | Sending scheduler request: To fetch work.
29/04/2019 05:08:30 | Milkyway@Home | Reporting 2 completed tasks
29/04/2019 05:08:30 | Milkyway@Home | Requesting new tasks for AMD/ATI GPU
29/04/2019 05:08:32 | Milkyway@Home | Scheduler request completed: got 0 new tasks
29/04/2019 05:19:12 | Milkyway@Home | Sending scheduler request: To fetch work.
29/04/2019 05:19:12 | Milkyway@Home | Requesting new tasks for AMD/ATI GPU
29/04/2019 05:19:14 | Milkyway@Home | Scheduler request completed: got 64 new tasks

Then the next ~50 scheduled requests failed until the cache was empty again, and the next request after the cache emptied got a lot of work.
And so on - this cycle has repeated for many days on many machines of different users.
ID: 91251

Mad_Max
Joined: 29 Apr 19
Posts: 19
Russia
Message 91252 - Posted: 29 Apr 2019, 10:47:35 UTC
Last modified: 29 Apr 2019, 10:57:39 UTC

P.S.
One thought after digging through a few such logs (also with additional debug info turned on): there may be a problem with the combined request, i.e. reporting completed work + requesting new work.
All the successful work fetches I saw in the logs were pure work requests (without reporting completed tasks):
- when all work was finished and reported (so there was nothing to report and the client only requested new work)
- at manual updates when there was nothing to report because there were no finished tasks yet (so the client also only requested new work)

Is there any option in the current BOINC client to disable / pause automatic work reporting to test this hypothesis?
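
One way to check this against a whole log rather than by eye is to tally every scheduler request by whether it also reported tasks and whether it got work. A minimal Python sketch, assuming the message wordings seen in the log excerpts in this thread; tally is a hypothetical helper and the sample is the excerpt from the previous post.

```python
import re
from collections import Counter

GOT = re.compile(r"Scheduler request completed: got (\d+) new tasks")
REPORTING = re.compile(r"Reporting \d+ completed tasks")

def tally(lines):
    """Count requests by (kind, outcome); kind says whether results were reported too."""
    counts, open_request, reporting = Counter(), False, False
    for line in lines:
        if "Sending scheduler request" in line:
            open_request, reporting = True, False
        elif open_request and REPORTING.search(line):
            reporting = True
        elif open_request:
            got = GOT.search(line)
            if got:
                kind = "report+fetch" if reporting else "fetch only"
                outcome = "work" if int(got.group(1)) > 0 else "no work"
                counts[(kind, outcome)] += 1
                open_request = False
    return counts

sample = [
    "29/04/2019 05:08:30 | Milkyway@Home | Sending scheduler request: To fetch work.",
    "29/04/2019 05:08:30 | Milkyway@Home | Reporting 2 completed tasks",
    "29/04/2019 05:08:30 | Milkyway@Home | Requesting new tasks for AMD/ATI GPU",
    "29/04/2019 05:08:32 | Milkyway@Home | Scheduler request completed: got 0 new tasks",
    "29/04/2019 05:19:12 | Milkyway@Home | Sending scheduler request: To fetch work.",
    "29/04/2019 05:19:12 | Milkyway@Home | Requesting new tasks for AMD/ATI GPU",
    "29/04/2019 05:19:14 | Milkyway@Home | Scheduler request completed: got 64 new tasks",
]
for (kind, outcome), count in sorted(tally(sample).items()):
    print(f"{kind:<12} {outcome:<7} {count}")
```

Two requests prove nothing by themselves. Over a full day of log, though, a table with "report+fetch / work" near zero and "fetch only / work" near 100% would support the hypothesis, and a mixed table would count against it.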
ID: 91252

Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 91253 - Posted: 29 Apr 2019, 10:58:53 UTC - in response to Message 91251.  
Last modified: 29 Apr 2019, 11:01:18 UTC

When I look at
https://milkyway.cs.rpi.edu/milkyway/server_status.php it says at the bottom: Task data as of 29 Apr 2019, 10:41:21 UTC. That's about 20 minutes ago. A LOT can change in 20 minutes, so when it shows 11683 tasks RTS, that was the number at 10:41:21 UTC, not now. Aside from that, even if that number is a constant 11K+, you still don't know how many tasks were in the feeder at the moment your client asked for work. If there were none, or way fewer than you're asking for, it won't give you work.

Perhaps we need an average feeder number on the SSP?
ID: 91253

Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 91254 - Posted: 29 Apr 2019, 11:11:26 UTC - in response to Message 91252.  

Pure coincidence on the reply - I'd been downstairs doing something else, and just happened to come back to the computer and look for messages as you posted. Since then, I've been out to the shops and back. Timing matters.

Which may also have a bearing on this problem. With the Milkyway tasks finishing so quickly, anyone who gets the 'no work' reply and goes into backoff will clear the backoff via a completion and will ask again - the delay being solely governed by the 91 second server delay. It's possible that you're seeing a whole group of hosts in lockstep - all asking (and asking again) at the same time. The client backoffs were deliberately designed with a randomisation factor to avoid that problem, but there's nothing random about the server delay.
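
The lockstep idea can be illustrated with a small simulation in Python. Everything in it is an assumption made for the sake of the picture: 50 hosts that all received 'no work' at the same instant, each re-asking exactly one delay later, and an optional random extra wait. busiest_second is a made-up name, and only the 91 second figure comes from the thread.

```python
import random
from collections import Counter

def busiest_second(hosts, rounds, delay=91.0, jitter=0.0, seed=1):
    """Return the largest number of requests landing in the same second.

    Every host starts at t=0 and re-asks after `delay` plus up to `jitter`
    seconds of random extra wait (jitter=0 means a purely fixed delay).
    """
    rng = random.Random(seed)
    hits = Counter()
    for _ in range(hosts):
        t = 0.0
        for _ in range(rounds):
            t += delay + rng.uniform(0.0, jitter)
            hits[int(t)] += 1
    return max(hits.values())

print("fixed 91 s delay:   ", busiest_second(hosts=50, rounds=10))
print("with up to 30 s jitter:", busiest_second(hosts=50, rounds=10, jitter=30.0))
```

With a fixed delay all 50 hosts keep hitting the server in the same second, round after round. Any randomness in the timing spreads them out, which is the effect a manual 'update' click might have.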

The advantage of the 'update' click could be as simple as that slight extra randomness in the timing.

Oh, wait. The 'zero sent' continues even after the cache is empty and there's no completed work either to clear the backoff, or needing to be reported? That destroys both our hypotheses!
ID: 91254

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.