1)
Message boards :
Questions and problems :
BOINC 6.10.43/6.10.44 no longer released for public
(Message 32285)
Posted 19 Apr 2010 by avidday Post:
There is a fix for this already, but it won't come until the next BOINC version. It consists of checking available memory on the GPUs before they get work, rather than the current behaviour of giving work and then checking memory. That will fix the symptom, so that jobs won't get wrongly put into an infinite "check every 5 minutes for enough free memory" loop, but not the root cause of the problem, which is actually the act of checking the free memory itself. I will email you the log and some other information that the developers should probably look at. Thanks for your help.
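To make the described change concrete, here is a minimal sketch of the "check memory first, then give work" ordering. It is not the BOINC client's code: the `Gpu` and `Job` types, their fields, and the `dispatch` function are all hypothetical names invented for illustration, and the free-memory figures are simply passed in rather than queried from a driver.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Gpu:
    index: int
    free_mb: float  # hypothetical: free memory as reported for this device


@dataclass
class Job:
    name: str
    needed_mb: float  # hypothetical: memory the job is expected to need


def pick_gpu(job: Job, gpus: List[Gpu]) -> Optional[Gpu]:
    """Return the first GPU with enough free memory for the job, or None."""
    for gpu in gpus:
        if gpu.free_mb >= job.needed_mb:
            return gpu
    return None


def dispatch(jobs: List[Job], gpus: List[Gpu]) -> Tuple[Dict[str, int], List[str]]:
    """Check memory BEFORE assigning work.

    Jobs that do not fit anywhere are left waiting instead of being
    assigned to a device and then retried on a timer.
    """
    assigned: Dict[str, int] = {}
    waiting: List[str] = []
    for job in jobs:
        gpu = pick_gpu(job, gpus)
        if gpu is None:
            waiting.append(job.name)
        else:
            gpu.free_mb -= job.needed_mb  # reserve the memory for this job
            assigned[job.name] = gpu.index
    return assigned, waiting
```

As the post points out, this ordering only addresses the symptom: if the free-memory query itself fails, the numbers fed into a check like this are not trustworthy in the first place.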
2)
BOINC 6.10.43/6.10.44 no longer released for public
(Message 32282)
Posted 19 Apr 2010 by avidday Post:
Indeed it is very verbose (your xml was a bit broken btw, but the schema is pretty straightforward). I have a little snippet which explains both problems I see. A task finishes and the machine is idle. The scheduler runs:

19-Apr-2010 14:40:36 [---] [rr_sim] rr_sim start: work_buf_total 30240.00 on_frac 0.961 active_frac 0.993
19-Apr-2010 14:41:26 [---] [cpu_sched_debug] enforce_schedule(): start
19-Apr-2010 14:41:26 [---] [cpu_sched_debug] preliminary job list:
19-Apr-2010 14:41:26 [---] [cpu_sched_debug] final job list:
19-Apr-2010 14:41:26 [---] [cpu_sched_debug] using 0.00 out of 1 CPUs
19-Apr-2010 14:41:26 [---] [cpu_sched_debug] enforce_schedule: end
19-Apr-2010 14:41:26 [---] [rr_sim] rr_sim start: work_buf_total 30240.00 on_frac 0.961 active_frac 0.993
19-Apr-2010 14:41:26 [Milkyway@home] [rr_sim] 0.00: starting de_new_test2_81566_1271673925_0 (0.05 CPU + 1.00 NV)
19-Apr-2010 14:41:26 [Milkyway@home] [rr_sim] 0.00: starting de_new_test2_44636_1271666330_2 (0.05 CPU + 1.00 NV)
19-Apr-2010 14:41:26 [Milkyway@home] [rr_sim] 0.00: de_new_test2_81566_1271673925_0 finishes after 720.80 (97353.46G/135.06G)
19-Apr-2010 14:41:26 [Milkyway@home] [rr_sim] 720.80: starting de_new_test2_76466_1271672823_1 (0.05 CPU + 1.00 NV)
19-Apr-2010 14:41:26 [Milkyway@home] [rr_sim] 720.80: de_new_test2_44636_1271666330_2 finishes after 0.00 (0.00G/135.06G)
19-Apr-2010 14:41:26 [Milkyway@home] [rr_sim] 720.80: starting de_new_test2_80758_1271673719_1 (0.05 CPU + 1.00 NV)
19-Apr-2010 14:41:26 [Milkyway@home] [rr_sim] 720.80: de_new_test2_76466_1271672823_1 finishes after 720.80 (97353.46G/135.06G)
19-Apr-2010 14:41:26 [Milkyway@home] [rr_sim] 1441.61: starting de_new_test2_59278_1271669297_1 (0.05 CPU + 1.00 NV)
19-Apr-2010 14:41:26 [Milkyway@home] [rr_sim] 1441.61: de_new_test2_80758_1271673719_1 finishes after 0.00 (0.00G/135.06G)
19-Apr-2010 14:41:26 [Milkyway@home] [rr_sim] 1441.61: de_new_test2_59278_1271669297_1 finishes after 720.80 (97353.46G/135.06G)

and nothing happens.

It reruns the same way at 30 second intervals for 6 minutes, with the machine idle, and then:

19-Apr-2010 14:47:30 [---] [cpu_sched_debug] Request CPU reschedule: Idle state change
19-Apr-2010 14:47:30 [---] [cpu_sched_debug] schedule_cpus(): start
19-Apr-2010 14:47:30 [---] [rr_sim] rr_sim start: work_buf_total 30240.00 on_frac 0.961 active_frac 0.993
19-Apr-2010 14:47:30 [Milkyway@home] [rr_sim] 0.00: starting de_new_test2_81566_1271673925_0 (0.05 CPU + 1.00 NV)
19-Apr-2010 14:47:30 [Milkyway@home] [rr_sim] 0.00: starting de_new_test2_44636_1271666330_2 (0.05 CPU + 1.00 NV)
19-Apr-2010 14:47:30 [Milkyway@home] [rr_sim] 0.00: de_new_test2_81566_1271673925_0 finishes after 720.79 (97353.46G/135.06G)
19-Apr-2010 14:47:30 [Milkyway@home] [rr_sim] 720.79: starting de_new_test2_76466_1271672823_1 (0.05 CPU + 1.00 NV)
19-Apr-2010 14:47:30 [Milkyway@home] [rr_sim] 720.79: de_new_test2_44636_1271666330_2 finishes after 0.00 (0.00G/135.06G)
19-Apr-2010 14:47:30 [Milkyway@home] [rr_sim] 720.79: starting de_new_test2_80758_1271673719_1 (0.05 CPU + 1.00 NV)
19-Apr-2010 14:47:30 [Milkyway@home] [rr_sim] 720.79: de_new_test2_76466_1271672823_1 finishes after 720.79 (97353.46G/135.06G)
19-Apr-2010 14:47:30 [Milkyway@home] [rr_sim] 1441.58: starting de_new_test2_59278_1271669297_1 (0.05 CPU + 1.00 NV)
19-Apr-2010 14:47:30 [Milkyway@home] [rr_sim] 1441.58: de_new_test2_80758_1271673719_1 finishes after 0.00 (0.00G/135.06G)
19-Apr-2010 14:47:30 [Milkyway@home] [rr_sim] 1441.58: starting de_new_test2_73287_1271672226_2 (0.05 CPU + 1.00 NV)
19-Apr-2010 14:47:30 [Milkyway@home] [rr_sim] 1441.58: de_new_test2_59278_1271669297_1 finishes after 720.79 (97353.46G/135.06G)
19-Apr-2010 14:47:30 [Milkyway@home] [rr_sim] 2162.37: de_new_test2_73287_1271672226_2 finishes after 0.00 (0.00G/135.06G)
19-Apr-2010 14:47:30 [Milkyway@home] [cpu_sched_debug] scheduling de_new_test2_81566_1271673925_0 (coprocessor job, FIFO)
19-Apr-2010 14:47:30 [---] [cpu_sched_debug] reserving 1.000000 of coproc CUDA
19-Apr-2010 14:47:30 [Milkyway@home] [cpu_sched_debug] scheduling de_new_test2_44636_1271666330_2 (coprocessor job, FIFO)
19-Apr-2010 14:47:30 [---] [cpu_sched_debug] reserving 1.000000 of coproc CUDA
19-Apr-2010 14:47:30 [---] [cpu_sched_debug] Request enforce CPU schedule: schedule_cpus
19-Apr-2010 14:47:30 [---] [cpu_sched_debug] enforce_schedule(): start
19-Apr-2010 14:47:30 [---] [cpu_sched_debug] preliminary job list:
19-Apr-2010 14:47:30 [Milkyway@home] [cpu_sched_debug] 0: de_new_test2_81566_1271673925_0 (MD: no; UTS: no)
19-Apr-2010 14:47:30 [Milkyway@home] [cpu_sched_debug] 1: de_new_test2_44636_1271666330_2 (MD: no; UTS: no)
19-Apr-2010 14:47:30 [---] [cpu_sched_debug] final job list:
19-Apr-2010 14:47:30 [Milkyway@home] [cpu_sched_debug] 0: de_new_test2_81566_1271673925_0 (MD: no; UTS: no)
19-Apr-2010 14:47:30 [Milkyway@home] [cpu_sched_debug] 1: de_new_test2_44636_1271666330_2 (MD: no; UTS: no)
19-Apr-2010 14:47:30 [Milkyway@home] [coproc_debug] Assigning CUDA instance 0 to de_new_test2_81566_1271673925_0
19-Apr-2010 14:47:30 [Milkyway@home] [coproc_debug] Assigning CUDA instance 1 to de_new_test2_44636_1271666330_2
19-Apr-2010 14:47:30 [Milkyway@home] Can't get available GPU RAM: 999
19-Apr-2010 14:47:30 [---] [cpu_sched_debug] Request CPU reschedule: insufficient GPU RAM

So it is something in the machine idle logic which is stopping the jobs from being launched, and then, as I thought, it is the compute mode settings which are problematic after that (I do a lot of CUDA development, and I can help you fix that if you want). The first GPU is marked as compute prohibited by the driver, but the BOINC scheduler is trying to use it anyway. The job it tries to schedule on the compute prohibited device then gets stuck in the job queue, even though it was never started. We can continue this by email/PM if you like.
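The compute-mode problem described above can be illustrated with a small sketch of the behaviour one would expect instead: devices the driver marks as compute prohibited are filtered out before any job is assigned to an instance. This is an assumption about a reasonable policy, not BOINC's actual logic; `ComputeMode`, `usable_instances`, and `assign` are hypothetical names, and the modes are supplied as plain values rather than read from the driver.

```python
from enum import Enum
from typing import Dict, List, Tuple


class ComputeMode(Enum):
    # Hypothetical stand-ins for the driver's per-device compute modes.
    DEFAULT = 0
    EXCLUSIVE = 1
    PROHIBITED = 2


def usable_instances(modes: List[ComputeMode]) -> List[int]:
    """Indices of devices that are allowed to run compute work."""
    return [i for i, mode in enumerate(modes) if mode is not ComputeMode.PROHIBITED]


def assign(jobs: List[str], modes: List[ComputeMode]) -> Tuple[Dict[str, int], List[str]]:
    """Give each usable device at most one job; leave the rest unscheduled.

    A job is never bound to a prohibited device, so it cannot end up
    stuck against an instance it could never have started on.
    """
    usable = usable_instances(modes)
    assignment = dict(zip(jobs, usable))  # zip stops at the shorter list
    leftover = jobs[len(usable):]
    return assignment, leftover
```

With the configuration from the log (instance 0 prohibited, instance 1 exclusive), the first job would go to instance 1 and the second would simply wait its turn.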
3)
BOINC 6.10.43/6.10.44 no longer released for public
(Message 32274)
Posted 19 Apr 2010 by avidday Post:
I understand that, but my question is why, when the client has work, it doesn't run it? The task start/stop/report logic is in the client, not the project server, isn't it? I am working on the assumption that as long as the client's own internal settings permit it, it will just start and run tasks until the task queue is empty. I am seeing long pauses between the client starting tasks which I am assuming should not occur.

But if you want to look at the BOINC source code, that's possible. Check http://boinc.berkeley.edu/trac/browser/branches/boinc_core_release_6_10 for the 6.10 code.

Thank you for the link.
4)
BOINC 6.10.43/6.10.44 no longer released for public
(Message 32271)
Posted 19 Apr 2010 by avidday Post:
I am trying to understand the scheduling behaviour of the Linux 6.10.44 release with GPUs, because I am seeing some strange things I can't explain. My client, from time to time, sits with a full task queue, not running any task (this is using the Milkyway@home CUDA application) for anything up to 45 minutes at a time. Under other circumstances, a task will sit for days in the task queue and never start - even to the point that when the project went down for maintenance for 24 hours and every other task was completed, a five day old task was still stuck in the queue and never started. My machine looks like this (Ubuntu 9.04 with the 195.36.15 release drivers):

Fri 16 Apr 2010 09:01:44 PM EEST Starting BOINC client version 6.10.44 for x86_64-pc-linux-gnu
Fri 16 Apr 2010 09:01:44 PM EEST log flags: file_xfer, sched_ops, task
Fri 16 Apr 2010 09:01:44 PM EEST Libraries: libcurl/7.18.0 OpenSSL/0.9.8g zlib/1.2.3.3 c-ares/1.5.1
Fri 16 Apr 2010 09:01:44 PM EEST Data directory: /home/david/BOINC
Fri 16 Apr 2010 09:01:44 PM EEST Processor: 4 AuthenticAMD AMD Phenom(tm) II X4 945 Processor [Family 16 Model 4 Stepping 2]
Fri 16 Apr 2010 09:01:44 PM EEST Processor: 512.00 KB cache
Fri 16 Apr 2010 09:01:44 PM EEST Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good pni monitor cx16 lahf_lm cmp_legacy svm extapic cr8_
Fri 16 Apr 2010 09:01:44 PM EEST OS: Linux: 2.6.28-18-generic
Fri 16 Apr 2010 09:01:44 PM EEST Memory: 7.70 GB physical, 7.45 GB virtual
Fri 16 Apr 2010 09:01:44 PM EEST Disk: 891.22 GB total, 757.31 GB free
Fri 16 Apr 2010 09:01:44 PM EEST Local time is UTC +3 hours
Fri 16 Apr 2010 09:01:44 PM EEST NVIDIA GPU 0: GeForce GTX 275 (driver version unknown, CUDA version 3000, compute capability 1.3, 895MB, 701 GFLOPS peak)
Fri 16 Apr 2010 09:01:44 PM EEST NVIDIA GPU 1: GeForce GTX 275 (driver version unknown, CUDA version 3000, compute capability 1.3, 896MB, 701 GFLOPS peak)
Fri 16 Apr 2010 09:01:44 PM EEST Milkyway@home URL http://milkyway.cs.rpi.edu/milkyway/; Computer ID 167065; resource share 100
Fri 16 Apr 2010 09:01:44 PM EEST Milkyway@home General prefs: from Milkyway@home (last modified 08-Apr-2010 17:50:24)
Fri 16 Apr 2010 09:01:44 PM EEST Milkyway@home Host location: none
Fri 16 Apr 2010 09:01:44 PM EEST Milkyway@home General prefs: using your defaults
Fri 16 Apr 2010 09:01:44 PM EEST Reading preferences override file
Fri 16 Apr 2010 09:01:44 PM EEST Preferences:
Fri 16 Apr 2010 09:01:44 PM EEST max memory usage when active: 3943.92MB
Fri 16 Apr 2010 09:01:44 PM EEST max memory usage when idle: 7099.06MB
Fri 16 Apr 2010 09:01:44 PM EEST max disk usage: 10.00GB
Fri 16 Apr 2010 09:01:44 PM EEST max CPUs used: 1
Fri 16 Apr 2010 09:01:44 PM EEST (to change, visit the web site of an attached project,
Fri 16 Apr 2010 09:01:44 PM EEST or click on Preferences)
Fri 16 Apr 2010 09:01:44 PM EEST Not using a proxy

One GPU is marked compute prohibited and the other is marked compute exclusive. I have "Compute while computer is in use" and "Use GPU while computer is in use" selected in the manager, and most of the time, it works fine. A typical example of the problem looks something like this:

Mon 19 Apr 2010 11:45:37 AM EEST Milkyway@home Computation for task de_new_test2_29033_1271663028_0 finished
Mon 19 Apr 2010 11:45:39 AM EEST Milkyway@home Started upload of de_new_test2_29033_1271663028_0_0
Mon 19 Apr 2010 11:45:42 AM EEST Milkyway@home Finished upload of de_new_test2_29033_1271663028_0_0
Mon 19 Apr 2010 11:46:17 AM EEST Milkyway@home Sending scheduler request: To fetch work.
Mon 19 Apr 2010 11:46:17 AM EEST Milkyway@home Reporting 1 completed tasks, requesting new tasks for GPU
Mon 19 Apr 2010 11:46:22 AM EEST Milkyway@home Scheduler request completed: got 1 new tasks
Mon 19 Apr 2010 11:46:24 AM EEST Milkyway@home Started download of de_new_test2_46499_1271666646_search_parameters
Mon 19 Apr 2010 11:46:27 AM EEST Milkyway@home Finished download of de_new_test2_46499_1271666646_search_parameters
Mon 19 Apr 2010 11:47:28 AM EEST Milkyway@home Sending scheduler request: To fetch work.
Mon 19 Apr 2010 11:47:28 AM EEST Milkyway@home Requesting new tasks for GPU
Mon 19 Apr 2010 11:47:33 AM EEST Milkyway@home Scheduler request completed: got 0 new tasks
Mon 19 Apr 2010 11:47:33 AM EEST Milkyway@home Message from server: No work sent
Mon 19 Apr 2010 11:47:33 AM EEST Milkyway@home Message from server: (reached limit of 6 tasks in progress)
Mon 19 Apr 2010 11:48:38 AM EEST Milkyway@home Sending scheduler request: To fetch work.
Mon 19 Apr 2010 11:49:48 AM EEST Milkyway@home Sending scheduler request: To fetch work.
Mon 19 Apr 2010 11:49:48 AM EEST Milkyway@home Requesting new tasks for GPU
Mon 19 Apr 2010 11:49:53 AM EEST Milkyway@home Scheduler request completed: got 0 new tasks
Mon 19 Apr 2010 11:49:53 AM EEST Milkyway@home Message from server: No work sent
Mon 19 Apr 2010 11:49:53 AM EEST Milkyway@home Message from server: (reached limit of 6 tasks in progress)
Mon 19 Apr 2010 11:50:58 AM EEST Milkyway@home Sending scheduler request: To fetch work.
Mon 19 Apr 2010 11:50:58 AM EEST Milkyway@home Requesting new tasks for GPU
Mon 19 Apr 2010 11:51:03 AM EEST Milkyway@home Scheduler request completed: got 0 new tasks
Mon 19 Apr 2010 11:51:03 AM EEST Milkyway@home Message from server: No work sent
Mon 19 Apr 2010 11:51:03 AM EEST Milkyway@home Message from server: (reached limit of 6 tasks in progress)
Mon 19 Apr 2010 11:52:08 AM EEST Milkyway@home Sending scheduler request: To fetch work.
Mon 19 Apr 2010 11:52:08 AM EEST Milkyway@home Requesting new tasks for GPU
Mon 19 Apr 2010 11:52:13 AM EEST Milkyway@home Scheduler request completed: got 0 new tasks
Mon 19 Apr 2010 11:52:13 AM EEST Milkyway@home Message from server: No work sent
Mon 19 Apr 2010 11:52:13 AM EEST Milkyway@home Message from server: (reached limit of 6 tasks in progress)
Mon 19 Apr 2010 11:52:22 AM EEST Milkyway@home Starting de_new_test2_18284_1271660211_2
Mon 19 Apr 2010 11:52:22 AM EEST Milkyway@home Starting task de_new_test2_18284_1271660211_2 using milkyway version 24

Here a task finishes and reports, with 5 other tasks in the queue. A new task is requested and downloaded from the scheduler, so that the task queue is full at 6 tasks, then nothing happens. The machine sits idle for several minutes, periodically polling for new work (and getting nothing because it has a full task queue), but no task starts. Then, finally, one does. This is a small example, but I have observed these "fallow" periods persist for 45 minutes in a couple of cases. The question is why?

Does your project have a source repository somewhere I could browse? I have a suspicion about what might be happening (the client may be mishandling or misinterpreting the Linux driver compute settings), but looking at your CUDA interface code would certainly be helpful. Thanks in advance.
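For anyone wanting to quantify these "fallow" periods from their own client log, the following is a rough sketch that measures the time between a "Computation for task ... finished" line and the next "Starting task" line. The function names are made up for this example, and it assumes the timestamp layout shown in the excerpt above (the time zone abbreviation is ignored, so all lines are assumed to share one zone). It also assumes nothing else was running in between, which the log alone does not prove on a multi-GPU machine.

```python
import re
from datetime import datetime
from typing import List, Optional

# Matches e.g. "19 Apr 2010 11:45:37 AM"; weekday and time zone are skipped.
STAMP = re.compile(r"(\d{1,2} \w{3} \d{4} \d{2}:\d{2}:\d{2} [AP]M)")


def parse_time(line: str) -> Optional[datetime]:
    """Extract the timestamp from a log line, or None if there is none."""
    match = STAMP.search(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%d %b %Y %I:%M:%S %p")


def idle_gaps(lines: List[str]) -> List[float]:
    """Seconds between each task finishing and the next task starting."""
    gaps: List[float] = []
    finished_at: Optional[datetime] = None
    for line in lines:
        stamp = parse_time(line)
        if stamp is None:
            continue
        if "Computation for task" in line and line.rstrip().endswith("finished"):
            finished_at = stamp
        elif "Starting task" in line and finished_at is not None:
            gaps.append((stamp - finished_at).total_seconds())
            finished_at = None  # wait for the next finish before measuring again
    return gaps
```

Fed the excerpt above, this measures the interval between 11:45:37 and 11:52:22, i.e. 405 seconds.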
Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.