GPU tasks skipped after scheduler overcommits CPU cores

Message boards : Questions and problems : GPU tasks skipped after scheduler overcommits CPU cores

Previous · 1 · 2 · 3 · 4 · Next

Bryn Mawr
Help desk expert
Joined: 31 Dec 18
Posts: 284
United Kingdom
Message 103414 - Posted: 4 Mar 2021, 19:46:52 UTC - in response to Message 103412.  

And the same issue shows itself here too:

2 CPUs just sit idle because the host can't fit more tasks of the GWnew type in memory, and the queue is full of GWnew tasks.
While FGRP-type work is present on the server, the host won't ask for it (it can't), so it just can't fill the idle CPUs with different work from a single project, because it can't distinguish the types of work in a work request!

EDIT: BTW, what does nidle mean then??
3/4/2021 21:22:20 PM | | [work_fetch] --- state for CPU ---
3/4/2021 21:22:20 PM | | [work_fetch] shortfall 47990.74 nidle 0.00 saturated 224169.26 busy 0.00
3/4/2021 21:22:20 PM | Einstein@Home | [work_fetch] share 1.000
3/4/2021 21:22:20 PM | Milkyway@Home | [work_fetch] share 0.000 blocked by project preferences
3/4/2021 21:22:20 PM | SETI@home Beta Test | [work_fetch] share 0.000

2 CPUs are idle and BOINC doesn't see this??


Nidle is, I believe, the number of idle cores
ID: 103414
Raistmer
Joined: 9 Apr 06
Posts: 302
Message 103418 - Posted: 4 Mar 2021, 20:01:14 UTC - in response to Message 103414.  

And the same issue shows itself here too:

2 CPUs just sit idle because the host can't fit more tasks of the GWnew type in memory, and the queue is full of GWnew tasks.
While FGRP-type work is present on the server, the host won't ask for it (it can't), so it just can't fill the idle CPUs with different work from a single project, because it can't distinguish the types of work in a work request!

EDIT: BTW, what does nidle mean then??
3/4/2021 21:22:20 PM | | [work_fetch] --- state for CPU ---
3/4/2021 21:22:20 PM | | [work_fetch] shortfall 47990.74 nidle 0.00 saturated 224169.26 busy 0.00
3/4/2021 21:22:20 PM | Einstein@Home | [work_fetch] share 1.000
3/4/2021 21:22:20 PM | Milkyway@Home | [work_fetch] share 0.000 blocked by project preferences
3/4/2021 21:22:20 PM | SETI@home Beta Test | [work_fetch] share 0.000

2 CPUs are idle and BOINC doesn't see this??


Nidle is, I believe, the number of idle cores


I believed that too, but apparently not.
There are 4 cores and only 3 running tasks (even counting the GPU task as taking a full core, that's 3, not 4). And still that field is zero.
ID: 103418
Bryn Mawr
Help desk expert
Joined: 31 Dec 18
Posts: 284
United Kingdom
Message 103423 - Posted: 4 Mar 2021, 22:26:06 UTC - in response to Message 103418.  

And the same issue shows itself here too:

2 CPUs just sit idle because the host can't fit more tasks of the GWnew type in memory, and the queue is full of GWnew tasks.
While FGRP-type work is present on the server, the host won't ask for it (it can't), so it just can't fill the idle CPUs with different work from a single project, because it can't distinguish the types of work in a work request!

EDIT: BTW, what does nidle mean then??
3/4/2021 21:22:20 PM | | [work_fetch] --- state for CPU ---
3/4/2021 21:22:20 PM | | [work_fetch] shortfall 47990.74 nidle 0.00 saturated 224169.26 busy 0.00
3/4/2021 21:22:20 PM | Einstein@Home | [work_fetch] share 1.000
3/4/2021 21:22:20 PM | Milkyway@Home | [work_fetch] share 0.000 blocked by project preferences
3/4/2021 21:22:20 PM | SETI@home Beta Test | [work_fetch] share 0.000

2 CPUs are idle and BOINC doesn't see this??


Nidle is, I believe, the number of idle cores


I believed that too, but apparently not.
There are 4 cores and only 3 running tasks (even counting the GPU task as taking a full core, that's 3, not 4). And still that field is zero.


Just to be picky: the fact that the machine has 4 cores does not mean that BOINC believes it has 4 cores available to it :-p
ID: 103423
Raistmer
Joined: 9 Apr 06
Posts: 302
Message 103424 - Posted: 5 Mar 2021, 23:34:47 UTC - in response to Message 103423.  


Just to be picky: the fact that the machine has 4 cores does not mean that BOINC believes it has 4 cores available to it :-p

Indeed.

But anyway, I've learnt how BOINC deals with "waiting for memory", and I've completely given up on that mode of operation.
BOINC's behaviour is too dumb: it doesn't try to keep the computing devices filled, it just suspends one of the tasks and leaves the device idle.
From time to time it even suspended the GPU task (!), leaving the GPU idle with a few memory-hungry CPU GW tasks running.

So, back to app_config. For now I despair of finding an adequate operation mode without micro-managing.
ID: 103424
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 103470 - Posted: 9 Mar 2021, 12:42:35 UTC

OK, plan of action, after yet another report of the same problem (thread 14182, from someone who wants to gripe about the problem, but not to be part of the solution). I'm going to try and work up my hack into something presentable, and let a few trusted beta testers take it for a spin. Windows only, at this stage.

  • Starting point: client release branch at 31 January 2021 (current). Identifies itself as v7.16.16
  • Coding: it's a simple hack, should take about 10 minutes. Because of this b*****d of a programming language and coding style, it'll take hours, if not days.
  • Testing: I'll recreate the original problem here, and then test the fix here too. Try to ensure that nothing else bad has crept in.

With any luck, and provided real life doesn't intrude, I may have something for you by the weekend.

ID: 103470
Raistmer
Joined: 9 Apr 06
Posts: 302
Message 103471 - Posted: 9 Mar 2021, 13:16:46 UTC - in response to Message 103470.  
Last modified: 9 Mar 2021, 13:17:52 UTC

OK, plan of action, after yet another report of the same problem (thread 14182, from someone who wants to gripe about the problem, but not to be part of the solution). I'm going to try and work up my hack into something presentable, and let a few trusted beta testers take it for a spin. Windows only, at this stage.

  • Starting point: client release branch at 31 January 2021 (current). Identifies itself as v7.16.16
  • Coding: it's a simple hack, should take about 10 minutes. Because of this b*****d of a programming language and coding style, it'll take hours, if not days.
  • Testing: I'll recreate the original problem here, and then test the fix here too. Try to ensure that nothing else bad has crept in.

With any luck, and provided real life doesn't intrude, I may have something for you by the weekend.


Richard, as I understand it, your biggest issue is with the char string and passing it down into the inner functions.

Above I proposed how to replace it with a boolean variable.
So, just add ", bool work_fetch=false)" in place of the closing ")" in each inner function declaration.

That way the parts outside work fetch don't need to be touched at all (the default value is false; if the parameter isn't listed in a call, it is assumed to be false).

And the initial initialization (using strcmp: comparing a const char* with == compares pointer values, not string contents):

void rr_simulation(const char* why) {
    static double last_time = 0;
    bool work_fetch = (strcmp(why, "work fetch") == 0);

I have no handy build environment at the moment, so the building is on you...
ID: 103471
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 103474 - Posted: 9 Mar 2021, 19:15:13 UTC - in response to Message 103471.  

I chose to use an integer - allows for future expansion of the rr_sim space ;-)

It builds without warnings and, after some effort, without errors. I've re-created the problem case, but I'm going to leave it running on the old app overnight, so I can test properly with fresh eyes in the morning.
ID: 103474
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 103484 - Posted: 11 Mar 2021, 12:09:38 UTC

I finished tidying up the re-worked hack yesterday, and it's been running for about 18 hours without problems. Here's a slightly artificial log to show the effect - I manually boosted the number of CPU tasks cached, to increase the numbers.

11/03/2021 11:46:03 |  | [rr_sim] doing sim: CPU sched
11/03/2021 11:46:03 |  | [rr_sim] start: work_buf min 864 additional 864 total 1728 on_frac 0.997 active_frac 1.000
11/03/2021 11:46:03 | Einstein@Home | [rr_sim] 9.23: h1_0676.05_O2C02Cl4In0__O2MDFS3a_Spotlight_676.85Hz_929_2 finishes (1.00 CPU + 1.00 NVIDIA GPU) (2585.35G/280.08G)
11/03/2021 11:46:03 | Einstein@Home | [rr_sim] 378.46: h1_0676.05_O2C02Cl4In0__O2MDFS3a_Spotlight_676.85Hz_925_2 finishes (1.00 CPU + 1.00 NVIDIA GPU) (105999.42G/280.08G)
11/03/2021 11:46:03 | Einstein@Home | [rr_sim] 932.30: h1_0676.05_O2C02Cl4In0__O2MDFS3a_Spotlight_676.85Hz_926_2 finishes (1.00 CPU + 1.00 NVIDIA GPU) (258535.16G/280.08G)
11/03/2021 11:46:03 | Einstein@Home | [rr_sim] 1301.53: h1_0648.65_O2C02Cl4In0__O2MDFS3a_Spotlight_649.20Hz_511_2 finishes (1.00 CPU + 1.00 NVIDIA GPU) (258535.16G/280.08G)
11/03/2021 11:46:03 | NumberFields@home | [rr_sim] 1758.54: wu_sf3_DS-16x271-2_Grp197929of1000000_1 finishes (1.00 CPU) (11087.93G/6.31G)
11/03/2021 11:46:03 | Einstein@Home | [rr_sim] 1855.37: h1_0648.65_O2C02Cl4In0__O2MDFS3a_Spotlight_649.20Hz_510_2 finishes (1.00 CPU + 1.00 NVIDIA GPU) (258535.16G/280.08G)
11/03/2021 11:46:03 | NumberFields@home | [rr_sim] at app max concurrent for GetDecics
11/03/2021 11:46:03 | NumberFields@home | [rr_sim] at app max concurrent for GetDecics
11/03/2021 11:46:03 | NumberFields@home | [rr_sim] at app max concurrent for GetDecics
11/03/2021 11:46:03 | NumberFields@home | [rr_sim] at app max concurrent for GetDecics
11/03/2021 11:46:03 | NumberFields@home | [rr_sim] at app max concurrent for GetDecics
11/03/2021 11:46:03 | NumberFields@home | [rr_sim] at app max concurrent for GetDecics
11/03/2021 11:46:03 | NumberFields@home | [rr_sim] at app max concurrent for GetDecics
11/03/2021 11:46:03 | NumberFields@home | [rr_sim] at app max concurrent for GetDecics
11/03/2021 11:46:03 | NumberFields@home | [rr_sim] at app max concurrent for GetDecics
11/03/2021 11:46:03 | NumberFields@home | [rr_sim] at app max concurrent for GetDecics
11/03/2021 11:46:03 | NumberFields@home | [rr_sim] at app max concurrent for GetDecics
11/03/2021 11:46:03 | NumberFields@home | [rr_sim] at app max concurrent for GetDecics
11/03/2021 11:46:03 | NumberFields@home | [rr_sim] at app max concurrent for GetDecics
11/03/2021 11:46:03 | NumberFields@home | [rr_sim] at app max concurrent for GetDecics
11/03/2021 11:46:03 | NumberFields@home | [rr_sim] at app max concurrent for GetDecics
11/03/2021 11:46:03 | NumberFields@home | [rr_sim] at app max concurrent for GetDecics
11/03/2021 11:46:03 | Einstein@Home | [rr_sim] 1922.87: p2030.20170627.G31.90+02.51.S.b0s0g0.00000_2911_3 finishes (0.50 CPU + 1.00 Intel GPU) (31419.20G/16.34G)
11/03/2021 11:46:03 | NumberFields@home | [rr_sim] at app max concurrent for GetDecics
11/03/2021 11:46:03 | Einstein@Home | [rr_sim] 2224.60: h1_0648.05_O2C02Cl4In0__O2MDFS3a_Spotlight_648.60Hz_532_2 finishes (1.00 CPU + 1.00 NVIDIA GPU) (258535.16G/280.08G)
11/03/2021 11:46:03 | NumberFields@home | [rr_sim] at app max concurrent for GetDecics
11/03/2021 11:46:03 | NumberFields@home | [rr_sim] 2424.21: wu_sf3_DS-16x271-2_Grp618165of1000000_0 finishes (1.00 CPU) (15285.09G/6.31G)
11/03/2021 11:46:03 | NumberFields@home | [rr_sim] 6516.53: wu_sf3_DS-16x271-2_Grp617510of1000000_0 finishes (1.00 CPU) (30000.00G/6.31G)

11/03/2021 11:46:05 |  | choose_project(): 1615463165.647471
11/03/2021 11:46:05 |  | [rr_sim] doing sim: work fetch
11/03/2021 11:46:05 |  | [rr_sim] start: work_buf min 864 additional 864 total 1728 on_frac 0.997 active_frac 1.000
11/03/2021 11:46:05 | Einstein@Home | [rr_sim] 9.23: h1_0676.05_O2C02Cl4In0__O2MDFS3a_Spotlight_676.85Hz_929_2 finishes (1.00 CPU + 1.00 NVIDIA GPU) (2585.35G/280.08G)
11/03/2021 11:46:05 | Einstein@Home | [rr_sim] 377.46: h1_0676.05_O2C02Cl4In0__O2MDFS3a_Spotlight_676.85Hz_925_2 finishes (1.00 CPU + 1.00 NVIDIA GPU) (105718.49G/280.08G)
11/03/2021 11:46:05 | Einstein@Home | [rr_sim] 932.30: h1_0676.05_O2C02Cl4In0__O2MDFS3a_Spotlight_676.85Hz_926_2 finishes (1.00 CPU + 1.00 NVIDIA GPU) (258535.16G/280.08G)
11/03/2021 11:46:05 | Einstein@Home | [rr_sim] 1300.53: h1_0648.65_O2C02Cl4In0__O2MDFS3a_Spotlight_649.20Hz_511_2 finishes (1.00 CPU + 1.00 NVIDIA GPU) (258535.16G/280.08G)
11/03/2021 11:46:05 | NumberFields@home | [rr_sim] 1758.54: wu_sf3_DS-16x271-2_Grp197929of1000000_1 finishes (1.00 CPU) (11087.90G/6.31G)
11/03/2021 11:46:05 | Einstein@Home | [rr_sim] 1855.37: h1_0648.65_O2C02Cl4In0__O2MDFS3a_Spotlight_649.20Hz_510_2 finishes (1.00 CPU + 1.00 NVIDIA GPU) (258535.16G/280.08G)
11/03/2021 11:46:05 | Einstein@Home | [rr_sim] 1921.76: p2030.20170627.G31.90+02.51.S.b0s0g0.00000_2911_3 finishes (0.50 CPU + 1.00 Intel GPU) (31400.97G/16.34G)
11/03/2021 11:46:05 | Einstein@Home | [rr_sim] 2223.60: h1_0648.05_O2C02Cl4In0__O2MDFS3a_Spotlight_648.60Hz_532_2 finishes (1.00 CPU + 1.00 NVIDIA GPU) (258535.16G/280.08G)
11/03/2021 11:46:05 | NumberFields@home | [rr_sim] 2423.21: wu_sf3_DS-16x271-2_Grp618165of1000000_0 finishes (1.00 CPU) (15278.75G/6.31G)
11/03/2021 11:46:05 | NumberFields@home | [rr_sim] 6516.53: wu_sf3_DS-16x271-2_Grp617510of1000000_0 finishes (1.00 CPU) (30000.00G/6.31G)
11/03/2021 11:46:05 | NumberFields@home | [rr_sim] 6613.36: wu_sf3_DS-16x271-2_Grp618075of1000000_0 finishes (1.00 CPU) (30000.00G/6.31G)
11/03/2021 11:46:05 | NumberFields@home | [rr_sim] 6981.59: wu_sf3_DS-16x271-2_Grp616684of1000000_0 finishes (1.00 CPU) (30000.00G/6.31G)
11/03/2021 11:46:05 | NumberFields@home | [rr_sim] 7181.20: wu_sf3_DS-16x271-2_Grp619492of1000000_0 finishes (1.00 CPU) (30000.00G/6.31G)
11/03/2021 11:46:05 | NumberFields@home | [rr_sim] 11274.52: wu_sf3_DS-16x271-2_Grp616806of1000000_0 finishes (1.00 CPU) (30000.00G/6.31G)
11/03/2021 11:46:05 | NumberFields@home | [rr_sim] 11371.35: wu_sf3_DS-16x271-2_Grp618515of1000000_0 finishes (1.00 CPU) (30000.00G/6.31G)
11/03/2021 11:46:05 | NumberFields@home | [rr_sim] 11739.58: wu_sf3_DS-16x271-2_Grp617586of1000000_0 finishes (1.00 CPU) (30000.00G/6.31G)
11/03/2021 11:46:05 | NumberFields@home | [rr_sim] 11939.19: wu_sf3_DS-16x271-2_Grp619493of1000000_0 finishes (1.00 CPU) (30000.00G/6.31G)
11/03/2021 11:46:05 | NumberFields@home | [rr_sim] 16032.51: wu_sf3_DS-16x271-2_Grp619476of1000000_0 finishes (1.00 CPU) (30000.00G/6.31G)
11/03/2021 11:46:05 | NumberFields@home | [rr_sim] 16129.34: wu_sf3_DS-16x271-2_Grp619477of1000000_0 finishes (1.00 CPU) (30000.00G/6.31G)
11/03/2021 11:46:05 | NumberFields@home | [rr_sim] 16497.57: wu_sf3_DS-16x271-2_Grp618817of1000000_0 finishes (1.00 CPU) (30000.00G/6.31G)
11/03/2021 11:46:05 | NumberFields@home | [rr_sim] 16697.18: wu_sf3_DS-16x271-2_Grp618644of1000000_0 finishes (1.00 CPU) (30000.00G/6.31G)
11/03/2021 11:46:05 | NumberFields@home | [rr_sim] 20790.50: wu_sf3_DS-16x271-2_Grp619016of1000000_0 finishes (1.00 CPU) (30000.00G/6.31G)
11/03/2021 11:46:05 | NumberFields@home | [rr_sim] 20887.33: wu_sf3_DS-16x271-2_Grp619479of1000000_0 finishes (1.00 CPU) (30000.00G/6.31G)
11/03/2021 11:46:05 | NumberFields@home | [rr_sim] 21255.56: wu_sf3_DS-16x271-2_Grp619478of1000000_0 finishes (1.00 CPU) (30000.00G/6.31G)
11/03/2021 11:46:05 | NumberFields@home | [rr_sim] 21455.17: wu_sf3_DS-16x271-2_Grp619017of1000000_0 finishes (1.00 CPU) (30000.00G/6.31G)

11/03/2021 11:46:05 |  | [work_fetch] ------- start work fetch state -------
11/03/2021 11:46:05 |  | [work_fetch] target work buffer: 864.00 + 864.00 sec
11/03/2021 11:46:05 |  | [work_fetch] --- project states ---
11/03/2021 11:46:05 | Einstein@Home | [work_fetch] REC 824078.357 prio -2.159 can request work
11/03/2021 11:46:05 | NumberFields@home | [work_fetch] REC 1860.297 prio -0.075 can request work
11/03/2021 11:46:05 |  | [work_fetch] --- state for CPU ---
11/03/2021 11:46:05 |  | [work_fetch] shortfall 0.00 nidle 0.00 saturated 20790.50 busy 0.00
11/03/2021 11:46:05 | Einstein@Home | [work_fetch] share 0.000 blocked by project preferences
11/03/2021 11:46:05 | NumberFields@home | [work_fetch] share 1.000
11/03/2021 11:46:05 |  | [work_fetch] --- state for NVIDIA GPU ---
11/03/2021 11:46:05 |  | [work_fetch] shortfall 0.00 nidle 0.00 saturated 1855.37 busy 0.00
11/03/2021 11:46:05 | Einstein@Home | [work_fetch] share 1.000
11/03/2021 11:46:05 | NumberFields@home | [work_fetch] share 0.000 blocked by project preferences
11/03/2021 11:46:05 |  | [work_fetch] --- state for Intel GPU ---
11/03/2021 11:46:05 |  | [work_fetch] shortfall 0.00 nidle 0.00 saturated 1921.76 busy 0.00
11/03/2021 11:46:05 | Einstein@Home | [work_fetch] share 1.000
11/03/2021 11:46:05 | NumberFields@home | [work_fetch] share 0.000 no applications
11/03/2021 11:46:05 |  | [work_fetch] ------- end work fetch state -------
David only ran a single type of [rr_sim], the first one shown here (which I've left untouched). You can see that it concludes with the last NumberFields task - my CPU app - finishing after 6516 seconds.

My hack runs a separate version of [rr_sim] for work fetch, giving a more realistic buffer size of 21,455 seconds, which is reflected in the 'saturated' work fetch figure for the CPU. I haven't done anything special for GPUs - they're normally controlled by the number of GPUs in the system, rather than by app_config.xml.

If anyone wants to test it (and has experience of the CPU overfetch we've been discussing), let me know.
ID: 103484
Raistmer
Joined: 9 Apr 06
Posts: 302
Message 103487 - Posted: 11 Mar 2021, 20:41:54 UTC - in response to Message 103484.  
Last modified: 11 Mar 2021, 20:45:10 UTC

Well, you run at least two projects, so the cache stays busy.
With only a single project, I expect idle CPU cores after some time of running with max_concurrent.

And sure, I want to test.
ID: 103487
Sandman192
Joined: 28 Aug 19
Posts: 49
United States
Message 103502 - Posted: 12 Mar 2021, 16:21:22 UTC

BOINC gets work but can't finish it on time? What?
Plus, it won't get **ANY** work units from a project unless I stop another project. For example: if TN-Grid is scheduled to get work, then only TN-Grid will get work. If WCG is scheduled to get work, then only WCG will get work, and it takes over and crowds out TN-Grid.

This is happening on 2 of my computers: one Linux and one Windows 10.


**"Days overdue; you may not get credit for it. Consider aborting it". ???**
**"Tasks won't finish in time: BOINC runs 99.6% of the time; computation is enabled 100.0% of that". ???** Unless I stop it from getting other projects.
3/9/2021 3:57:05 AM | TN-Grid Platform | Task 181047_Hs_T154140-OXCT1_wu-277_1614546821865_1 is 3.11 days overdue; you may not get credit for it. Consider aborting it.
3/9/2021 3:57:05 AM | TN-Grid Platform | Task 181053_Hs_T002498-OPTN_wu-183_1614555068070_1 is 3.03 days overdue; you may not get credit for it. Consider aborting it.
3/9/2021 3:57:05 AM | TN-Grid Platform | Task 181055_Hs_T002502-OPTN_wu-232_1614558714182_0 is 2.99 days overdue; you may not get credit for it. Consider aborting it.
3/9/2021 3:57:05 AM | TN-Grid Platform | Task 181074_Hs_T194898-OGN_wu-176_1614588563451_1 is 2.66 days overdue; you may not get credit for it. Consider aborting it.
3/9/2021 3:57:05 AM | TN-Grid Platform | Task 181084_Hs_T116698-OBFC2A_wu-83_1614600576731_1 is 2.50 days overdue; you may not get credit for it. Consider aborting it.
3/9/2021 3:57:05 AM | TN-Grid Platform | Task 181162_Hs_T191705-NELF_wu-18_1614716635947_0 is 1.18 days overdue; you may not get credit for it. Consider aborting it.
3/9/2021 3:57:05 AM | TN-Grid Platform | Task 181166_Hs_T191709-NELF_wu-65_1614721341325_1 is 1.11 days overdue; you may not get credit for it. Consider aborting it.
3/9/2021 3:57:05 AM | TN-Grid Platform | Task 181235_Hs_T175748-MYO1G_wu-194_1614812880971_1 is 0.25 days overdue; you may not get credit for it. Consider aborting it.
3/9/2021 3:57:05 AM | TN-Grid Platform | Task 181242_Hs_T045070-MYO1E_wu-217_1614821392673_1 is 0.18 days overdue; you may not get credit for it. Consider aborting it.
3/9/2021 3:57:45 AM | SiDock@home | Tasks won't finish in time: BOINC runs 99.6% of the time; computation is enabled 100.0% of that
3/9/2021 3:57:45 AM | SiDock@home | Project requested delay of 7 seconds
3/9/2021 3:57:52 AM | Rosetta@home | Sending scheduler request: To fetch work.
3/9/2021 3:57:52 AM | Rosetta@home | Requesting new tasks for CPU
3/9/2021 3:57:53 AM | Rosetta@home | Scheduler request completed: got 0 new tasks
3/9/2021 3:57:53 AM | Rosetta@home | No tasks sent
3/9/2021 3:57:53 AM | Rosetta@home | Tasks won't finish in time: BOINC runs 99.6% of the time; computation is enabled 100.0% of that
3/9/2021 3:57:53 AM | Rosetta@home | Project requested delay of 31 seconds
3/9/2021 3:57:59 AM | GPUGRID | Sending scheduler request: To fetch work.
3/9/2021 4:44:11 AM | TN-Grid Platform | Aborting task 181258_Hs_T100042-MYH7B_wu-211_1614840590806_1; not started and deadline has passed
3/9/2021 4:44:11 AM | TN-Grid Platform | Aborting task 181259_Hs_T100043-MYH7B_wu-90_1614841440934_1; not started and deadline has passed
3/9/2021 9:20:21 AM | SiDock@home | Tasks won't finish in time: BOINC runs 99.6% of the time; computation is enabled 100.0% of that
3/9/2021 9:20:21 AM | SiDock@home | Project requested delay of 7 seconds
3/9/2021 10:43:19 AM | Rosetta@home | Tasks won't finish in time: BOINC runs 99.6% of the time; computation is enabled 100.0% of that
3/9/2021 10:43:19 AM | Rosetta@home | Project requested delay of 31 seconds
3/9/2021 2:23:04 PM | SiDock@home | Tasks won't finish in time: BOINC runs 99.6% of the time; computation is enabled 100.0% of that
3/9/2021 2:23:04 PM | SiDock@home | Project requested delay of 7 seconds
ID: 103502
Dave
Help desk expert
Joined: 28 Jun 10
Posts: 2518
United Kingdom
Message 103504 - Posted: 12 Mar 2021, 16:35:38 UTC

BOINC gets work but can't finish it on time? What?
Plus, it won't get **ANY** work units from a project unless I stop another project. For example: if TN-Grid is scheduled to get work, then only TN-Grid will get work. If WCG is scheduled to get work, then only WCG will get work, and it takes over and crowds out TN-Grid.

This is happening on 2 of my computers: one Linux and one Windows 10.


BOINC should prioritise work that is in danger of missing its deadline. Are you overcommitting your computer by setting too big a cache? If projects have short deadlines and you go for the maximum of 10 days of work plus 10 additional days, some tasks are going to run out of time before the 20 days are up.

I haven't run the projects you mention, apart from WCG, so I don't know whether their deadlines are short, but I would suggest running with the minimum cache size. I have mine at 0.1 + 0.1 days, and it is very rare that anything goes over the time limit.
ID: 103504
Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 103505 - Posted: 12 Mar 2021, 17:19:52 UTC - in response to Message 103504.  

Are you...?
Sandman isn't looking for help (from us). He's been specifically asked by Richard to come over here and post his logs, in order to hunt for the bug(let) in this thread. Expect more posts and logs in the future, but no need to try to help him. He's in good hands.
ID: 103505
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 103506 - Posted: 12 Mar 2021, 17:43:46 UTC

Sorry, I was away from the screen for a while. I've got a couple of PMs in my inbox too. I'll get on to them.
ID: 103506
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 103507 - Posted: 12 Mar 2021, 18:23:05 UTC

@Sandman192 - you have replies to your PMs.
ID: 103507
Sandman192
Joined: 28 Aug 19
Posts: 49
United States
Message 103593 - Posted: 18 Mar 2021, 20:38:52 UTC

I can have 10 projects scheduled, but WCG or TN-Grid is all I get, unless I stop WCG or TN-Grid from getting any work at all; then things seem to be normal.

And again, I never had this problem before I updated my BOINC version.
It's also happening on my second computer, running Linux.
ID: 103593
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 103594 - Posted: 18 Mar 2021, 21:05:29 UTC - in response to Message 103593.  

You probably pushed up the priority of the other projects by being locked into working on TN-Grid for so long by deadline pressure. It will return to normal gradually, but over a period of several days.

See the Configuration Options page of the User Manual. Try setting the option

<rec_half_life_days>X</rec_half_life_days>

("A project's scheduling priority is determined by its estimated credit in the last X days. Default is 10; set it larger if you run long high-priority jobs.")

to something much smaller: one day, instead of the default 10, would sort things out quicker.
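For example, in cc_config.xml (assuming the standard <options> section; restart the client or use "Read config files" afterwards):

```xml
<cc_config>
  <options>
    <rec_half_life_days>1</rec_half_life_days>
  </options>
</cc_config>
```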
ID: 103594
Raistmer
Joined: 9 Apr 06
Posts: 302
Message 103629 - Posted: 21 Mar 2021, 19:41:25 UTC
Last modified: 21 Mar 2021, 19:53:07 UTC

I've been running Richard's bug-fixed work-fetch build through the last week.
The initial bug is definitely fixed: there is no overfetch.
Unfortunately, the signs of the other issue I mentioned earlier are getting stronger:
currently the host has only GW tasks in the cache, plus 1 running FGRP task. Since only 2 GW tasks are allowed at once and the host has 4 cores, there will soon be idle ones...

And the GPU part suffers from an inability to honour project shares. But that issue is worth a separate thread.


EDIT: Unfortunately, it's not just "signs" - it has already happened...



2 GW tasks, 1 FGRP task... and 1 GPU MW task that doesn't need a full CPU core! So 1 CPU core is already idle. When the FGRP task finishes, there will very probably be 2 idle cores....

So this bugfix alone isn't enough to use max_concurrent as expected.

EDIT2: And BOINC doesn't react to the idle device (CPU):

3/21/2021 22:48:01 PM | | [work_fetch] ------- start work fetch state -------
3/21/2021 22:48:01 PM | | [work_fetch] target work buffer: 129600.00 + 8640.00 sec
3/21/2021 22:48:01 PM | | [work_fetch] --- project states ---
3/21/2021 22:48:01 PM | Einstein@Home | [work_fetch] REC 19077.814 prio -6075.662 can request work
3/21/2021 22:48:01 PM | Milkyway@Home | [work_fetch] REC 12328.567 prio -0.513 can request work
3/21/2021 22:48:01 PM | SETI@home Beta Test | [work_fetch] REC 0.000 prio 0.000 can't request work: suspended via Manager
3/21/2021 22:48:01 PM | | [work_fetch] --- state for CPU ---
3/21/2021 22:48:01 PM | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 153096.94 busy 0.00
3/21/2021 22:48:01 PM | Einstein@Home | [work_fetch] share 1.000
3/21/2021 22:48:01 PM | Milkyway@Home | [work_fetch] share 0.000 blocked by project preferences
3/21/2021 22:48:01 PM | SETI@home Beta Test | [work_fetch] share 0.000
3/21/2021 22:48:01 PM | | [work_fetch] --- state for NVIDIA GPU ---
3/21/2021 22:48:01 PM | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 139709.31 busy 0.00
3/21/2021 22:48:01 PM | Einstein@Home | [work_fetch] share 0.000
3/21/2021 22:48:01 PM | Milkyway@Home | [work_fetch] share 1.000
3/21/2021 22:48:01 PM | SETI@home Beta Test | [work_fetch] share 0.000
3/21/2021 22:48:01 PM | | [work_fetch] ------- end work fetch state -------

A manual update didn't help (as expected):

3/21/2021 22:50:17 PM | Einstein@Home | piggyback: resource CPU
3/21/2021 22:50:17 PM | Einstein@Home | piggyback: don't need CPU
3/21/2021 22:50:17 PM | Einstein@Home | piggyback: resource NVIDIA GPU
3/21/2021 22:50:17 PM | Einstein@Home | piggyback: don't need NVIDIA GPU
3/21/2021 22:50:17 PM | Einstein@Home | [work_fetch] request: CPU (0.00 sec, 0.00 inst) NVIDIA GPU (0.00 sec, 0.00 inst)
3/21/2021 22:50:17 PM | Einstein@Home | Sending scheduler request: Requested by user.
3/21/2021 22:50:17 PM | Einstein@Home | Not requesting tasks: don't need (CPU: job cache full; NVIDIA GPU: job cache full)
3/21/2021 22:50:18 PM | Einstein@Home | Scheduler request completed
3/21/2021 22:50:18 PM | Einstein@Home | Project requested delay of 60 seconds
3/21/2021 22:50:18 PM | | [work_fetch] Request work fetch: RPC complete
3/21/2021 22:50:23 PM | | choose_project(): 1616356223.263348
3/21/2021 22:50:23 PM | | [work_fetch] ------- start work fetch state -------
3/21/2021 22:50:23 PM | | [work_fetch] target work buffer: 129600.00 + 8640.00 sec
3/21/2021 22:50:23 PM | | [work_fetch] --- project states ---
3/21/2021 22:50:23 PM | Einstein@Home | [work_fetch] REC 19076.165 prio -15117.019 can't request work: scheduler RPC backoff (54.94 sec)
3/21/2021 22:50:23 PM | Milkyway@Home | [work_fetch] REC 12330.310 prio -1.120 can request work
3/21/2021 22:50:23 PM | SETI@home Beta Test | [work_fetch] REC 0.000 prio 0.000 can't request work: suspended via Manager
3/21/2021 22:50:23 PM | | [work_fetch] --- state for CPU ---
3/21/2021 22:50:23 PM | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 153031.73 busy 0.00
3/21/2021 22:50:23 PM | Einstein@Home | [work_fetch] share 0.000
3/21/2021 22:50:23 PM | Milkyway@Home | [work_fetch] share 0.000 blocked by project preferences
3/21/2021 22:50:23 PM | SETI@home Beta Test | [work_fetch] share 0.000
3/21/2021 22:50:23 PM | | [work_fetch] --- state for NVIDIA GPU ---
3/21/2021 22:50:23 PM | | [work_fetch] shortfall 0.00 nidle 0.00 saturated 139559.98 busy 0.00
3/21/2021 22:50:23 PM | Einstein@Home | [work_fetch] share 0.000
3/21/2021 22:50:23 PM | Milkyway@Home | [work_fetch] share 1.000
3/21/2021 22:50:23 PM | SETI@home Beta Test | [work_fetch] share 0.000
3/21/2021 22:50:23 PM | | [work_fetch] ------- end work fetch state -------
ID: 103629
Raistmer
Joined: 9 Apr 06
Posts: 302
Message 103654 - Posted: 22 Mar 2021, 16:37:19 UTC - in response to Message 103629.  

Currently (as expected): 2 GW CPU tasks + 1 NV FGRP task. 1 core sits completely idle.
And no MW tasks on the host....
It seems I have no choice but to set E@h as a backup (zero-share) project again and crunch without a cache on my second most powerful host...
Given the instability of my current router, this almost certainly means idle host time :/

Richard, if no more info from the modded build is needed, I'd prefer to return to the stock one, since it will ask for work when a CPU is idle.
ID: 103654
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 103667 - Posted: 22 Mar 2021, 19:09:20 UTC - in response to Message 103654.  

Fair enough. I think we've established what I set out to achieve: that there is a bug (several bugs!), and that a simple hack eliminates the massive overfetch described in the thread title.

To go further, and eliminate the extra bugs that pertain to your setup, would require re-writing the whole of rr_sim to keep track of max_concurrent at every step of the way. I don't think I'm skilled enough to do that. We'll have to stop at the proof-of-concept.
ID: 103667
Raistmer
Joined: 9 Apr 06
Posts: 302
Message 103672 - Posted: 22 Mar 2021, 22:26:27 UTC - in response to Message 103667.  
Last modified: 22 Mar 2021, 22:35:29 UTC

Yep, the choice between over-fetch and idle cores - a sad choice...

BOINC has suffered for many years from a wrong definition of its "atomic entity", I would say.
I first said this 15(?) years ago, at the first approach of GPU computing - on the mailing list in those days, not only on the forums...
There have been many ad-hoc additions since then, but no real rewrite with a redesign. And one is just as needed as before.
The "atomic entity" here is the app_version/plan class, not the project. And we still suffer from the project-centric initial approach.
It's everywhere: from per-project server requests to per-project shares.
Initially there was only one app per project (and just a single project - SETI), so "atomic project" was == "atomic app" (it only appeared that way because no other apps existed).
But then AstroPulse was added... and the whole credit system went mad. Then GPGPU emerged - wow, 2 different devices in a single host, incomparable in computing ability... And so on...
BOINC doesn't manage projects; it manages tasks for particular apps as its atomic entities.
And those apps then group into different sets, by resource usage and by owning project....

What I mean is: everything that can be done between 2 apps belonging to 2 different projects should also be possible between 2 apps belonging to a single project.
And that's definitely still not the case...
ID: 103672

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.