Message boards : BOINC client : 7.14.2 and 7.12.1 both fail to get work units on very fast systems
Author | Message |
---|---|
Send message Joined: 27 Jun 08 Posts: 641 |
There was a discussion about this at Milkyway and also at SETI. Basically, my 4 GPUs finish a work unit in 10 seconds on average. The queue when full is typically 600 - 800, but after it empties (Milkyway project) no work units are provided for anywhere from 5 to 15 minutes. The suggestion was to downgrade to 7.12.1, but that did not fix the problem. This is inconvenient, as BOINC schedules other projects in, whereas I have set a priority so that they don't run unless the primary project is down, offline, etc. I can issue a manual "update" to fix the problem, so the project has data but won't send it.

Tried 7.12.1: got a 10.5 minute delay, as shown at 7:43. Longer delays with 7.14.2, as shown here going back 24 hours of history.

1 4/27/2019 5:02:43 PM Starting BOINC client version 7.12.1 for windows_x86_64
2 4/27/2019 5:02:43 PM log flags: file_xfer, sched_ops, task
3 4/27/2019 5:02:43 PM Libraries: libcurl/7.47.1 OpenSSL/1.0.2g zlib/1.2.8
4 4/27/2019 5:02:43 PM Data directory: C:\ProgramData\BOINC
5 4/27/2019 5:02:43 PM Running under account josephy@stateson.net
6 4/27/2019 5:02:45 PM OpenCL: AMD/ATI GPU 0: AMD FirePro S9100 (driver version 2671.3, device version OpenCL 1.2 AMD-APP (2671.3), 6144MB, 6144MB available, 3226 GFLOPS peak)
7 4/27/2019 5:02:45 PM OpenCL: AMD/ATI GPU 1: AMD FirePro S9100 (driver version 2671.3, device version OpenCL 1.2 AMD-APP (2671.3), 6144MB, 6144MB available, 3226 GFLOPS peak)
8 4/27/2019 5:02:45 PM OpenCL: AMD/ATI GPU 2: AMD FirePro S9100 (driver version 2671.3, device version OpenCL 2.0 AMD-APP (2671.3), 12288MB, 12288MB available, 4608 GFLOPS peak)
9 4/27/2019 5:02:45 PM OpenCL: AMD/ATI GPU 3: AMD FirePro S9100 (driver version 2671.3, device version OpenCL 1.2 AMD-APP (2671.3), 6144MB, 6144MB available, 3226 GFLOPS peak)
- - - -
2213 Milkyway@Home 4/27/2019 7:31:21 PM Computation for task de_modfit_85_bundle4_4s_south4s_0_1555431910_4124594_0 finished
2214 Milkyway@Home 4/27/2019 7:32:32 PM Sending scheduler request: To fetch work.
2215 Milkyway@Home 4/27/2019 7:32:32 PM Reporting 7 completed tasks
2216 Milkyway@Home 4/27/2019 7:32:32 PM Requesting new tasks for AMD/ATI GPU
2217 Milkyway@Home 4/27/2019 7:32:34 PM Scheduler request completed: got 0 new tasks
2218 Milkyway@Home 4/27/2019 7:43:14 PM Sending scheduler request: To fetch work.
2219 Milkyway@Home 4/27/2019 7:43:14 PM Requesting new tasks for AMD/ATI GPU
2220 Milkyway@Home 4/27/2019 7:43:20 PM Scheduler request completed: got 598 new tasks
2221 Milkyway@Home 4/27/2019 7:43:23 PM Starting task de_modfit_80_bundle5_4s_south4s_0_1554998626_1474893_2 |
Send message Joined: 17 Nov 16 Posts: 891 |
Maybe you can ask Richard Haselgrove where to find the #3076 appveyor artifact link for the Windows client so you can download it and try it. I thought using the older 7.12.1 client would have worked for you. |
Send message Joined: 5 Oct 06 Posts: 5130 |
I think we need to understand a little bit better where this delay is coming from.

First, every time you contact a project, the project itself asks you to pause a little while before asking again - SETI for 303 seconds, Milkyway for 91 seconds. That shows in the Event Log, and it's best to have regard to it, but it obviously isn't the problem here.

Most of the time, it's your machine which does the asking for work, when it thinks it needs it - and that depends on your cache settings. It's usually best to ask 'little and often', but that does depend on whether the project regularly has work to send you. I'd always suggest setting the <sched_op_debug> Event Log flag (which you can do from the Manager, Options menu) - it gives you the very basic stuff, like

28/04/2019 07:09:43 | SETI@home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
28/04/2019 07:09:43 | SETI@home | [sched_op] NVIDIA GPU work request: 4332.36 seconds; 0.00 devices
28/04/2019 07:09:43 | SETI@home | [sched_op] Intel GPU work request: 0.00 seconds; 0.00 devices

I usually like to request about an hour of work, once an hour, but we can tweak that later. Problems like the one you're describing more usually arise when you wait and ask for a lot of work at once, and happen to hit a moment when the project hasn't got any - that tends to be when BOINC decides not to bother asking for a while, and fetches from a backup project instead. But we need to understand the overall picture before we can be sure of that.

Got to go out now, but the artifacts Keith mentioned are at https://ci.appveyor.com/project/BOINC/boinc/builds/23992763/artifacts. I'll dig out some notes on how to use them later.

Late edit - here are the notes I wrote last time for using those artifacts. You'll need the middle one of the three downloadable links, labelled 'win-client'. |
Send message Joined: 27 Jun 08 Posts: 641 |
I think we need to understand a little bit better where this delay is coming from.

I will set those log flags and try to get a better picture of what is happening. I did look at that appveyor but I don't think it applies, as I do not use max concurrent in cc_config. I do have an app_config for Milkyway that I discovered long ago using Google. I am not sure what all it does, but it does list more info about the GPU and supposedly allows tasks to run faster. I assume it is not causing the problem.

<app_config>
  <app_version>
    <app_name>milkyway</app_name>
    <plan_class>opencl_ati_101</plan_class>
    <avg_ncpus>0.20</avg_ncpus>
    <ngpus>0.19</ngpus>
    <cmdline>--non-responsive --verbose --gpu-target-frequency 1 --gpu-polling-mode -1 --gpu-wait-factor 0 --process-priority 4 --gpu-disable-checkpointing</cmdline>
  </app_version>
</app_config>

[EDIT] Going to use the following:

<cc_config>
  <log_flags>
    <work_fetch_debug>1</work_fetch_debug>
    <sched_op_debug>1</sched_op_debug>
  </log_flags>
</cc_config>

[EDIT AGAIN] Getting the message "no project chosen for work fetch". I looked at the wiki for cc_config and did not see how to restrict work fetch to just Milkyway; otherwise I get a lot of messages from projects that are not active. |
Send message Joined: 5 Oct 06 Posts: 5130 |
Yes, <work_fetch_debug> is a bit of a blunderbuss. Best to set it once, wait until it's done just one cycle, and then unset it again while you pick over the pieces. [That's why I got them to put an 'apply' button on the dialog :-)] But it is powerful - if you could fillet out that one complete cycle from [work_fetch] ------- start work fetch state ------- to [work_fetch] ------- end work fetch state ------- and post it here, we could take a look. Might contain some clues. |
Send message Joined: 27 Jun 08 Posts: 641 |
Yes, <work_fetch_debug> is a bit of a blunderbuss. Best to set it once, wait until it's done just one cycle, and then unset it again while you pick over the pieces. [That's why I got them to put an 'apply' button on the dialog :-)]

I managed to find

if (found) {
    p->sched_rpc_pending = RPC_REASON_NEED_WORK;
} else {
    if (log_flags.work_fetch_debug) {
        msg_printf(0, MSG_INFO, "[work_fetch] No project chosen for work fetch");

at THIS location, but there was no selection for projects. I really need to exclude projects that are not active. I will delete all unused projects from this system to clean up the message log. I gave up trying to build BOINC under VS2017 some time ago. |
Send message Joined: 27 Jun 08 Posts: 641 |
OK, found a way to remove clutter, using BoincTasks "select project" to see only Milkyway.

====at line 15819=====
At 10:14:04 was the last report of completed tasks. 9 reported. THE QUEUE IS EMPTY AT THIS TIME.
At 10:20:58 got 621 new tasks. Delay of 6 minutes. Not too bad compared to the 15 minutes I have seen in the past.

Printout from line 15395 to 16650 is here: stateson.net\images\15395.txt
I have the whole 9 yards available if needed. HTH !!!

[EDIT] Looks like the project requested a 6 minute delay! Could this be the problem? Was it the client that wants a delay? I don't know how to read this info. Is it explained somewhere? If so, I don't mind doing an analysis.

15835 Milkyway@Home 4/28/2019 10:14:06 AM [work_fetch] backing off AMD/ATI GPU 381 sec |
Send message Joined: 17 Nov 16 Posts: 891 |
Richard is the expert in decrypting the work fetch debug output. Everything appears normal for intervals. The work requests look normal. What I don't understand is why you are getting backoffs for 10 and 5 minutes directly after the scheduler acknowledges receipt of reported work. That is coming from the scheduler and not from your host or client. Normally the scheduler backs off if there are issues in contacting the servers or the client has issues downloading work and the client can't acknowledge correct reception of the sent tasks. Have you looked at the Transfers tab in the Manager after you have requested work and see if you have task downloads in backoff? |
Send message Joined: 27 Jun 08 Posts: 641 |
Richard is the expert in decrypting the work fetch debug output. Everything appears normal for intervals. The work requests look normal. What I don't understand is why you are getting backoffs for 10 and 5 minutes directly after the scheduler acknowledges receipt of reported work. That is coming from the scheduler and not from your host or client. Normally the scheduler backs off if there are issues in contacting the servers or the client has issues downloading work and the client can't acknowledge correct reception of the sent tasks. Have you looked at the Transfers tab in the Manager after you have requested work and see if you have task downloads in backoff?

I will check that possibility. It would be nice if that info was in the event log. AFAICT there is no transfer "history" to review, so I have to be ready to catch it. I have good bandwidth here at home, but very rarely downloads hang up if there are too many concurrent ones. Conceivably, if a number of GPUGRID tasks complete all at once, then the upload can be bottlenecked.

I had asked Fred at BoincTasks about implementing a rule for a project being out of data, as I could then use the rule to run a batch file and send a text message to my phone. He has a lot on his plate, so I am not sure when or if that gets implemented. AFAIK that is the only way to find out in real time if a project is out of data (other than editing BOINC code and building a test program). If the client could put transfer info, such as the number pending and the estimated time, into the event log, that would be a real help.

[EDIT] Actually, I can babysit the last few work units and, when they hit 0 tasks, bring up the Transfers tab to see WTF is going on. |
Send message Joined: 5 Oct 06 Posts: 5130 |
Richard is the expert in decrypting the work fetch debug output. Everything appears normal for intervals. The work requests look normal. What I don't understand is why you are getting backoffs for 10 and 5 minutes directly after the scheduler acknowledges receipt of reported work. That is coming from the scheduler and not from your host or client. Normally the scheduler backs off if there are issues in contacting the servers or the client has issues downloading work and the client can't acknowledge correct reception of the sent tasks. Have you looked at the Transfers tab in the Manager after you have requested work and see if you have task downloads in backoff?

No need. Look at that event log again, without the clutter:

15402 Milkyway@Home 4/28/2019 10:12:30 AM Scheduler request completed: got 0 new tasks

The project requested 91 seconds. The backoff was done by the client, as a normal reaction to the lack of available work. And if no work was assigned by the server, there'll be no files to download, and nothing will show in the Transfers tab.

Sorry, I've had a busy weekend showing visitors round Yorkshire. They've moved on to London now, but I found myself surprisingly tired (and I've got a watercooler appointment with the TV later tonight). I should be back to normal tomorrow, and I'll try to look through the rest of the log before your morning starts. |
Send message Joined: 5 Oct 06 Posts: 5130 |
OK, I can manage the next bit now. The next request was

15823 Milkyway@Home 4/28/2019 10:14:06 AM Scheduler request completed: got 0 new tasks

That's effectively 91 seconds after the one in my last post. In this mode, the client will set a backoff every time it gets a 'got 0 new tasks' response, and will clear it every time one of your previously cached tasks finishes computation. Those are both known design features, like 'em or not.

At the moment, you're completing cached work much more often than once every 91 seconds, so you'll keep asking at every possible occasion. But as soon as you run dry, there's nothing to clear the backoffs (no more work finishing), and they'll build up.

The only way to cure this one is to persuade the project to provide more tasks. A possible mitigation would be to set a cron job to trigger boinccmd to issue a project update every five minutes, but you might get very unpopular with the server administrators. They obviously aren't set up to satisfy your voracious appetite. |
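To make the set-and-clear behaviour described above a little more concrete, here is a minimal Python sketch of it. It is only a toy model: the doubling rule, the 60-second starting interval, the one-day cap and the 0.5 to 1.0 jitter range are all assumptions chosen for illustration, not values taken from the BOINC client source. The names `FetchBackoff` and `simulate` are likewise hypothetical.

```python
import random


class FetchBackoff:
    """Toy model of a per-resource work-fetch backoff (not BOINC's real code)."""

    def __init__(self, min_interval=60.0, max_interval=86400.0, rng=None):
        # min_interval and max_interval are illustrative assumptions.
        self.min_interval = min_interval
        self.max_interval = max_interval
        self.interval = 0.0  # 0 means "no backoff in force"
        self.rng = rng or random.Random()

    def on_no_work(self):
        """Server replied 'got 0 new tasks': start or lengthen the backoff."""
        if self.interval == 0.0:
            self.interval = self.min_interval
        else:
            self.interval = min(self.interval * 2, self.max_interval)
        # Randomise the actual wait (assumed range) so hosts do not retry together.
        return self.interval * self.rng.uniform(0.5, 1.0)

    def on_task_finished(self):
        """A cached task finished computing: clear the backoff."""
        self.interval = 0.0


def simulate(replies_while_busy, replies_when_dry, seed=0):
    """Return the waits seen while tasks keep finishing, then after running dry."""
    backoff = FetchBackoff(rng=random.Random(seed))
    busy_waits, dry_waits = [], []
    for _ in range(replies_while_busy):
        busy_waits.append(backoff.on_no_work())
        backoff.on_task_finished()  # a completion arrives before the next request
    for _ in range(replies_when_dry):
        dry_waits.append(backoff.on_no_work())  # nothing left to clear the backoff
    return busy_waits, dry_waits


if __name__ == "__main__":
    busy, dry = simulate(5, 4)
    print("waits while tasks keep finishing:", [round(w) for w in busy])
    print("waits after the cache runs dry:  ", [round(w) for w in dry])
```

In this model the waits stay short for as long as completions keep resetting the interval, and only start to grow once the cache is empty, which is the pattern reported in the logs above.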
Send message Joined: 27 Jun 08 Posts: 641 |
Richard is the expert in decrypting the work fetch debug output. Everything appears normal for intervals. The work requests look normal. What I don't understand is why you are getting backoffs for 10 and 5 minutes directly after the scheduler acknowledges receipt of reported work. That is coming from the scheduler and not from your host or client. Normally the scheduler backs off if there are issues in contacting the servers or the client has issues downloading work and the client can't acknowledge correct reception of the sent tasks. Have you looked at the Transfers tab in the Manager after you have requested work and see if you have task downloads in backoff?

No need. Look at that event log again, without the clutter:

Another question might be: why were 0 tasks sent when the project had about 11,000** tasks ready to send? If the project does not want to send tasks (for whatever reason), then the problem is the project and not the client. If I wait out the seconds (723 or whatever), then I eventually get some new work. I have had other systems with nVidia cards running Milkyway. They run much slower and I don't see them run out of data unless the project is offline.

** Not sure how often the server status is updated, but I checked it when my last Milkyway task finished and the delay started. I did not get any new work for a few minutes, so I issued a project update and got work immediately.

It is looking like the project is not sending stuff that it has, and the client is backing off thinking there is no work, which would be the correct procedure IF and only IF the project actually had no work. My guess is the problem is on the server side. Going to put 7.14.2 back on that system. |
Send message Joined: 17 Nov 16 Posts: 891 |
No need. Look at that event log again, without the clutter:

Yes, I missed that in all the clutter. I agree with the observation that the project is rarely ever out of work. Only when it is doing rare maintenance or has broken. Now that the tasks per GPU have been increased from the historical 80 to 300, I would need many hours to work through my 0.5 day cache with only the MW project running on my hosts. But I have Nvidia and not ATI/AMD cards.

So why does the client get assigned no work on the request when in fact the server DOES have work? Could this be the case if the RTS buffer size is set too low at MW, and too many people hit the buffer just before Beemer Biker hit it with his request, which exhausted the available work to 0?

Requesting new tasks for AMD/ATI GPU
15821 Milkyway@Home 4/28/2019 10:14:04 AM [sched_op] CPU work request: 0.00 seconds; 0.00 devices
15822 Milkyway@Home 4/28/2019 10:14:04 AM [sched_op] AMD/ATI GPU work request: 120960.00 seconds; 4.00 devices
15823 Milkyway@Home 4/28/2019 10:14:06 AM Scheduler request completed: got 0 new tasks

[Edit] I was incorrect about my number of allowed tasks. This is from Jake Weiss' post in the project News:

Hey guys, So the current set up allows for users to have up to 200 workunits per GPU on their computer and another 40 workunits per CPU with a maximum of 600 possible workunits. |
Send message Joined: 5 Oct 06 Posts: 5130 |
The TV cliffhanger is nicely set up for the feature-length series closer next week. Meanwhile, back at BOINC...

Quickly, before I fall into bed. The BOINC structure is uniform across projects, with minor local tweaks. There are two numbers to consider - and please pass these on to Jake for consideration.

The first is the total number of workunits created by what are generically known as workunit generators, and are familiar to SETIzens as 'splitters'. At SETI, this number hovers around the 600,000 level, and is subject to some hysteresis - it takes time to turn off the generators when the RTS buffer hits peak, and they aren't called back into service until it falls to trough. The SETI generator has a typical wavelength of around an hour: I don't have a ready knowledge of the Milkyway figures, or even whether there's an equivalent of the Haveland graphs. The RTS buffer is stored in databases and disk files.

The second number relates to tasks held in fast cache memory by a process known as the 'feeder'. That number's probably measured in hundreds, and has a cycle time measured in seconds. When you request work, it's the tasks in the feeder cache which are scanned for suitability: you need that fast cache response time. "No tasks allocated" equates either to "feeder empty" or "no suitable tasks in feeder".

It's the old trade counter vs warehouse problem:

Are these in stock?
Yes
Can I pick one up?
No
Why not?
Because I'd have to send Joe over to the warehouse first, and he's on lunch

My suspicion, reading this thread, is that Jake is talking about the workunit generators and RTS: I don't think he's reached the page in the manual about the feeder yet. Perhaps one of you could point him to https://boinc.berkeley.edu/trac/wiki/BackendPrograms#feeder, but tell him not to blindly wind up

<feeder_query_size>N</feeder_query_size>

to something obscene: both timing (the scheduler has to search the list) and size (it has to fit into memory without paging) are critical.
And so to bed. |
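The trade counter vs warehouse picture above can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: the function names, the 100-slot feeder and the 11,000-task pool are made up for the example, and the real feeder and scheduler are separate server programs that are far more involved than this.

```python
from collections import deque


def refill(feeder, warehouse, slots):
    """Feeder pass: top the fast cache back up from the ready-to-send pool."""
    while len(feeder) < slots and warehouse:
        feeder.append(warehouse.popleft())


def handle_request(feeder, wanted, is_suitable):
    """Scheduler side: scan only the feeder cache, never the warehouse."""
    sent = []
    for _ in range(len(feeder)):
        task = feeder.popleft()
        if len(sent) < wanted and is_suitable(task):
            sent.append(task)
        else:
            feeder.append(task)  # leave it on the counter for another host
    return sent


def demo(slots=100, ready_to_send=11000, wanted=600):
    """Two requests arrive back to back, then one more after a feeder pass."""
    warehouse = deque(range(ready_to_send))  # hypothetical ready-to-send pool
    feeder = deque()
    refill(feeder, warehouse, slots)

    def anything(task):
        return True

    first = handle_request(feeder, wanted, anything)
    second = handle_request(feeder, wanted, anything)  # arrives before a refill
    still_in_stock = len(warehouse)
    refill(feeder, warehouse, slots)
    third = handle_request(feeder, wanted, anything)
    return len(first), len(second), still_in_stock, len(third)


if __name__ == "__main__":
    print(demo())
```

In this toy run the second request gets nothing even though thousands of tasks are still "in stock", simply because it arrives between feeder passes.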
Send message Joined: 29 Apr 19 Posts: 19 |
OK, I can manage the next bit now. The next request was

It is not a problem with the supply of work from the project. I also have this problem with getting no work from MW, and so do many other people (there is a discussion thread about it at the MW forum - https://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=4424 - and it comes up in a few other threads).

There is plenty of work available to send (usually the server maintains about 10 000 tasks ready to send; sometimes it goes down by 1k-2k tasks, but almost never close to zero), but some users cannot get any of it until their local BOINC work cache is empty, getting "got 0 new tasks" all the time until that point. But if you press "update", the client receives a lot of work (a few dozen tasks per request) immediately. Only automatic (scheduled) work fetch is failing.

Here is an example of a BOINC log: https://pastebin.com/8LCxm5RN

The problem also exists with old BOINC clients - I used 7.6.22, for example. So maybe it is not a client issue but something within the server part of the BOINC code. |
Send message Joined: 5 Oct 06 Posts: 5130 |
See my following post. It's the difference between warehouse storage and front desk pickup. You see the numbers in the warehouse, and sure - there's plenty of work back there. But you don't see what's on the front desk - the project doesn't show that to you, because it changes several times a second. All that clicking 'update' does is to send an immediate request (if you need work) - it doesn't change the state of the front desk pickup supplies. You stand exactly the same chance of success as if BOINC had asked automatically. |
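For anyone who wants to trigger the same 'update' without clicking, the boinccmd route mentioned earlier in the thread could be wrapped roughly as below. This is a sketch under assumptions: it assumes `boinccmd` is on the PATH and can reach the local client, the helper names are hypothetical, and the project URL shown is only a placeholder that has to match the URL your client lists for the attached project. Scheduling it (cron, Task Scheduler) is left out, and the warning about annoying the server administrators still applies.

```python
import subprocess

# Placeholder: replace with the exact project URL shown by your own client.
PROJECT_URL = "https://milkyway.cs.rpi.edu/milkyway/"


def update_command(project_url, boinccmd="boinccmd"):
    """Build the boinccmd invocation that asks the client to contact a project."""
    return [boinccmd, "--project", project_url, "update"]


def request_update(project_url, boinccmd="boinccmd", run=subprocess.run):
    """Run the update command and return boinccmd's exit code."""
    result = run(update_command(project_url, boinccmd), check=False)
    return result.returncode


if __name__ == "__main__":
    print("boinccmd exit code:", request_update(PROJECT_URL))
```

As the post above points out, this only changes when the request is sent; it does nothing to change what is sitting on the front desk at that moment.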
Send message Joined: 29 Apr 19 Posts: 19 |
Read it. But I am still sure that the problem is not with the supply of work. Because, as you wrote yourself, scheduled requests and manual updates should have the same chance of getting work. So if there was any shortfall of work supply on the servers, the failure rate (not getting new work) of scheduled requests and manual updates should be at about the same level. But that is not the case: almost all scheduled requests fail while almost all manual updates succeed. Also, almost all scheduled requests succeed if the work cache is empty.

I added an example log from one of my machines in my previous post, but you probably missed it as it was in an edit (I did not expect a response so fast). Here it is again: https://pastebin.com/8LCxm5RN

As you can see, there are about 100 failed scheduled requests before the cache is empty. Here the last WU from the cache finishes, and the very first request after that gets a lot of new work:

29/04/2019 05:08:16 | Milkyway@Home | Computation for task de_modfit_84_bundle4_4s_south4s_0_1555431910_4347759_1 finished
29/04/2019 05:08:30 | Milkyway@Home | Sending scheduler request: To fetch work.
29/04/2019 05:08:30 | Milkyway@Home | Reporting 2 completed tasks
29/04/2019 05:08:30 | Milkyway@Home | Requesting new tasks for AMD/ATI GPU
29/04/2019 05:08:32 | Milkyway@Home | Scheduler request completed: got 0 new tasks
29/04/2019 05:19:12 | Milkyway@Home | Sending scheduler request: To fetch work.
29/04/2019 05:19:12 | Milkyway@Home | Requesting new tasks for AMD/ATI GPU
29/04/2019 05:19:14 | Milkyway@Home | Scheduler request completed: got 64 new tasks

Then the next ~50 scheduled requests failed until the cache was empty again, and I got a lot of work in the next request after the cache emptied. And so on - this cycle repeats for many days on many machines of different users. |
Send message Joined: 29 Apr 19 Posts: 19 |
P.S. One of my thoughts after digging through a few such logs (also with additional debug info turned on): there may be a problem with combined requests, i.e. reporting completed work + requesting new work. All the successful work fetches I saw in the logs were pure work requests (without reporting completed tasks):
- when all work was finished and reported (so there was nothing to report and the client only requested new work)
- at manual updates when there was nothing to report because there were no finished tasks yet (so the client also only requested new work)
Is there any option in the current BOINC client to disable / pause automatic work reporting to test this hypothesis? |
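One way to check the combined-request idea against a saved event log is to tally requests by whether they reported work and whether they got any. The Python sketch below assumes the pipe-separated log format quoted earlier in this thread and the exact message wording shown there; `classify_requests` is a hypothetical helper, not part of BOINC, and it will also count requests that did not ask for work at all, so the result is only a rough first look.

```python
import re
from collections import Counter

GOT = re.compile(r"Scheduler request completed: got (\d+) new tasks")


def classify_requests(lines):
    """Count scheduler requests keyed by (reported_work, got_work)."""
    tally = Counter()
    in_request = False
    reported = False
    for line in lines:
        message = line.split("|")[-1].strip()
        if message.startswith("Sending scheduler request"):
            in_request, reported = True, False
        elif in_request and message.startswith("Reporting ") and "completed task" in message:
            reported = True
        elif in_request:
            match = GOT.search(message)
            if match:
                tally[(reported, int(match.group(1)) > 0)] += 1
                in_request = False
    return tally


# Lines copied from the log excerpt posted above.
SAMPLE = """\
29/04/2019 05:08:30 | Milkyway@Home | Sending scheduler request: To fetch work.
29/04/2019 05:08:30 | Milkyway@Home | Reporting 2 completed tasks
29/04/2019 05:08:30 | Milkyway@Home | Requesting new tasks for AMD/ATI GPU
29/04/2019 05:08:32 | Milkyway@Home | Scheduler request completed: got 0 new tasks
29/04/2019 05:19:12 | Milkyway@Home | Sending scheduler request: To fetch work.
29/04/2019 05:19:12 | Milkyway@Home | Requesting new tasks for AMD/ATI GPU
29/04/2019 05:19:14 | Milkyway@Home | Scheduler request completed: got 64 new tasks
""".splitlines()

if __name__ == "__main__":
    for (reported, got_work), count in sorted(classify_requests(SAMPLE).items()):
        print(f"reported={reported} got_work={got_work}: {count}")
```

Run over a full day's log, a table like this would show at a glance whether successes really are confined to requests with nothing to report.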
Send message Joined: 29 Aug 05 Posts: 15571 |
Read it. But am still sure that problem is not with the supply of the work.

When I look at https://milkyway.cs.rpi.edu/milkyway/server_status.php it says at the bottom: Task data as of 29 Apr 2019, 10:41:21 UTC. That's about 20 minutes ago. A LOT can change in 20 minutes, so the 11683 tasks RTS it shows now means that was the number at 10:41:21 UTC, not now.

Aside from that, even if that number is a constant 11K+, you still don't know how many tasks were in the feeder at the given moment your client asked for work. If there were none, or way fewer than you're asking for, it won't give you work. Perhaps we need an average feeder number on the SSP? |
Send message Joined: 5 Oct 06 Posts: 5130 |
Pure coincidence on the reply - I'd been downstairs doing something else, and just happened to come back to the computer and look for messages as you posted. Since then, I've been out to the shops and back. Timing matters.

Which may also have a bearing on this problem. With the Milkyway tasks finishing so quickly, anyone who gets the 'no work' reply and goes into backoff will clear the backoff via a completion and will ask again - the delay being solely governed by the 91 second server delay. It's possible that you're seeing a whole group of hosts in lockstep - all asking (and asking again) at the same time. The client backoffs were deliberately designed with a randomisation factor to avoid that problem, but there's nothing random about the server delay. The advantage of the 'update' click could be as simple as that slight extra randomness in the timing.

Oh, wait. The 'zero sent' continues even after the cache is empty and there's no completed work either to clear the backoff, or needing to be reported? That destroys both our hypotheses! |
Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.