Posts by avidday

1) Message boards : Questions and problems : BOINC 6.10.43/6.10.44 no longer released for public (Message 32285)
Posted 19 Apr 2010 by avidday
Post:
There is a fix for this already, but it won't come until the next BOINC version. It consists of checking available memory on the GPUs before they get work, rather than, as it is now, giving work and then checking memory.


That will fix the symptom, so that jobs won't get wrongly put into an infinite "check every 5 minutes for enough free memory" loop, but not the root cause of the problem, which is actually the act of checking the free memory itself.
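To make the distinction concrete, here is a minimal Python sketch of the difference between "the memory check failed" and "the memory check succeeded but reported too little". It is not BOINC's code: `classify_gpu`, `query_free_mb` and `fake_query` are hypothetical names, and the readings are made up for illustration.

```python
from typing import Callable, Optional


def classify_gpu(query_free_mb: Callable[[int], Optional[int]],
                 device: int, needed_mb: int) -> str:
    """Return 'run', 'retry_later', or 'skip_device' for one device.

    query_free_mb is a hypothetical stand-in for whatever the client uses
    to read free GPU memory; it returns None when the query itself fails.
    """
    free_mb = query_free_mb(device)
    if free_mb is None:
        # The check itself failed, so there is no reading to compare against.
        # Retrying every few minutes would never succeed.
        return "skip_device"
    if free_mb < needed_mb:
        # A real reading that is too low: waiting may help.
        return "retry_later"
    return "run"


def fake_query(device: int) -> Optional[int]:
    # Made-up readings: device 0 refuses the query, device 1 reports 800 MB.
    return {0: None, 1: 800}.get(device)


for dev in (0, 1):
    print(dev, classify_gpu(fake_query, dev, needed_mb=100))
```

The point of the sketch is only that a failed query and a low reading call for different reactions; moving the check earlier does not change that.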

I will email you the log and some other information that the developers should probably look at.

Thanks for your help.
2) Message boards : Questions and problems : BOINC 6.10.43/6.10.44 no longer released for public (Message 32282)
Posted 19 Apr 2010 by avidday
Post:

Forgot something... :-)
Since that log will be quite extensive, please do not post it in the forums. Or at least not in this thread... please email it to me, I'll send you my email address in a private message.


Indeed it is very verbose (your XML was a bit broken, by the way, but the schema is pretty straightforward). I have a little snippet which explains both problems I see.

A task finishes and the machine is idle. The scheduler runs:

19-Apr-2010 14:40:36 [---] [rr_sim] rr_sim start: work_buf_total 30240.00 on_frac 0.961 active_frac 0.993
19-Apr-2010 14:41:26 [---] [cpu_sched_debug] enforce_schedule(): start
19-Apr-2010 14:41:26 [---] [cpu_sched_debug] preliminary job list:
19-Apr-2010 14:41:26 [---] [cpu_sched_debug] final job list:
19-Apr-2010 14:41:26 [---] [cpu_sched_debug] using 0.00 out of 1 CPUs
19-Apr-2010 14:41:26 [---] [cpu_sched_debug] enforce_schedule: end
19-Apr-2010 14:41:26 [---] [rr_sim] rr_sim start: work_buf_total 30240.00 on_frac 0.961 active_frac 0.993
19-Apr-2010 14:41:26 [Milkyway@home] [rr_sim] 0.00: starting de_new_test2_81566_1271673925_0 (0.05 CPU + 1.00 NV)
19-Apr-2010 14:41:26 [Milkyway@home] [rr_sim] 0.00: starting de_new_test2_44636_1271666330_2 (0.05 CPU + 1.00 NV)
19-Apr-2010 14:41:26 [Milkyway@home] [rr_sim] 0.00: de_new_test2_81566_1271673925_0 finishes after 720.80 (97353.46G/135.06G)
19-Apr-2010 14:41:26 [Milkyway@home] [rr_sim] 720.80: starting de_new_test2_76466_1271672823_1 (0.05 CPU + 1.00 NV)
19-Apr-2010 14:41:26 [Milkyway@home] [rr_sim] 720.80: de_new_test2_44636_1271666330_2 finishes after 0.00 (0.00G/135.06G)
19-Apr-2010 14:41:26 [Milkyway@home] [rr_sim] 720.80: starting de_new_test2_80758_1271673719_1 (0.05 CPU + 1.00 NV)
19-Apr-2010 14:41:26 [Milkyway@home] [rr_sim] 720.80: de_new_test2_76466_1271672823_1 finishes after 720.80 (97353.46G/135.06G)
19-Apr-2010 14:41:26 [Milkyway@home] [rr_sim] 1441.61: starting de_new_test2_59278_1271669297_1 (0.05 CPU + 1.00 NV)
19-Apr-2010 14:41:26 [Milkyway@home] [rr_sim] 1441.61: de_new_test2_80758_1271673719_1 finishes after 0.00 (0.00G/135.06G)
19-Apr-2010 14:41:26 [Milkyway@home] [rr_sim] 1441.61: de_new_test2_59278_1271669297_1 finishes after 720.80 (97353.46G/135.06G)


and nothing happens. It reruns the same way at 30-second intervals for 6 minutes, with the machine idle, and then:

19-Apr-2010 14:47:30 [---] [cpu_sched_debug] Request CPU reschedule: Idle state change
19-Apr-2010 14:47:30 [---] [cpu_sched_debug] schedule_cpus(): start
19-Apr-2010 14:47:30 [---] [rr_sim] rr_sim start: work_buf_total 30240.00 on_frac 0.961 active_frac 0.993
19-Apr-2010 14:47:30 [Milkyway@home] [rr_sim] 0.00: starting de_new_test2_81566_1271673925_0 (0.05 CPU + 1.00 NV)
19-Apr-2010 14:47:30 [Milkyway@home] [rr_sim] 0.00: starting de_new_test2_44636_1271666330_2 (0.05 CPU + 1.00 NV)
19-Apr-2010 14:47:30 [Milkyway@home] [rr_sim] 0.00: de_new_test2_81566_1271673925_0 finishes after 720.79 (97353.46G/135.06G)
19-Apr-2010 14:47:30 [Milkyway@home] [rr_sim] 720.79: starting de_new_test2_76466_1271672823_1 (0.05 CPU + 1.00 NV)
19-Apr-2010 14:47:30 [Milkyway@home] [rr_sim] 720.79: de_new_test2_44636_1271666330_2 finishes after 0.00 (0.00G/135.06G)
19-Apr-2010 14:47:30 [Milkyway@home] [rr_sim] 720.79: starting de_new_test2_80758_1271673719_1 (0.05 CPU + 1.00 NV)
19-Apr-2010 14:47:30 [Milkyway@home] [rr_sim] 720.79: de_new_test2_76466_1271672823_1 finishes after 720.79 (97353.46G/135.06G)
19-Apr-2010 14:47:30 [Milkyway@home] [rr_sim] 1441.58: starting de_new_test2_59278_1271669297_1 (0.05 CPU + 1.00 NV)
19-Apr-2010 14:47:30 [Milkyway@home] [rr_sim] 1441.58: de_new_test2_80758_1271673719_1 finishes after 0.00 (0.00G/135.06G)
19-Apr-2010 14:47:30 [Milkyway@home] [rr_sim] 1441.58: starting de_new_test2_73287_1271672226_2 (0.05 CPU + 1.00 NV)
19-Apr-2010 14:47:30 [Milkyway@home] [rr_sim] 1441.58: de_new_test2_59278_1271669297_1 finishes after 720.79 (97353.46G/135.06G)
19-Apr-2010 14:47:30 [Milkyway@home] [rr_sim] 2162.37: de_new_test2_73287_1271672226_2 finishes after 0.00 (0.00G/135.06G)
19-Apr-2010 14:47:30 [Milkyway@home] [cpu_sched_debug] scheduling de_new_test2_81566_1271673925_0 (coprocessor job, FIFO)
19-Apr-2010 14:47:30 [---] [cpu_sched_debug] reserving 1.000000 of coproc CUDA
19-Apr-2010 14:47:30 [Milkyway@home] [cpu_sched_debug] scheduling de_new_test2_44636_1271666330_2 (coprocessor job, FIFO)
19-Apr-2010 14:47:30 [---] [cpu_sched_debug] reserving 1.000000 of coproc CUDA
19-Apr-2010 14:47:30 [---] [cpu_sched_debug] Request enforce CPU schedule: schedule_cpus
19-Apr-2010 14:47:30 [---] [cpu_sched_debug] enforce_schedule(): start
19-Apr-2010 14:47:30 [---] [cpu_sched_debug] preliminary job list:
19-Apr-2010 14:47:30 [Milkyway@home] [cpu_sched_debug] 0: de_new_test2_81566_1271673925_0 (MD: no; UTS: no)
19-Apr-2010 14:47:30 [Milkyway@home] [cpu_sched_debug] 1: de_new_test2_44636_1271666330_2 (MD: no; UTS: no)
19-Apr-2010 14:47:30 [---] [cpu_sched_debug] final job list:
19-Apr-2010 14:47:30 [Milkyway@home] [cpu_sched_debug] 0: de_new_test2_81566_1271673925_0 (MD: no; UTS: no)
19-Apr-2010 14:47:30 [Milkyway@home] [cpu_sched_debug] 1: de_new_test2_44636_1271666330_2 (MD: no; UTS: no)
19-Apr-2010 14:47:30 [Milkyway@home] [coproc_debug] Assigning CUDA instance 0 to de_new_test2_81566_1271673925_0
19-Apr-2010 14:47:30 [Milkyway@home] [coproc_debug] Assigning CUDA instance 1 to de_new_test2_44636_1271666330_2
19-Apr-2010 14:47:30 [Milkyway@home] Can't get available GPU RAM: 999
19-Apr-2010 14:47:30 [---] [cpu_sched_debug] Request CPU reschedule: insufficient GPU RAM


So something in the machine-idle logic is stopping the jobs from being launched, and then, as I suspected, the compute mode settings cause the problem after that (I do a lot of CUDA development, and I can help you fix that if you want). The first GPU is marked compute prohibited by the driver, but the BOINC scheduler tries to use it anyway. The job it tries to schedule on the compute-prohibited device then gets stuck in the job queue, even though it was never started.
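A rough Python sketch of the behaviour I would expect instead: drop compute-prohibited devices before any job is assigned. The enum values follow the CUDA runtime's `cudaComputeMode` enum; the function names and job names are hypothetical, and this is not how the BOINC client is actually written.

```python
from enum import IntEnum


class ComputeMode(IntEnum):
    # Values mirror the CUDA runtime's cudaComputeMode enum.
    DEFAULT = 0
    EXCLUSIVE = 1
    PROHIBITED = 2


def usable_devices(modes):
    """Return indices of devices that accept compute contexts."""
    return [i for i, mode in enumerate(modes) if mode != ComputeMode.PROHIBITED]


def assign_jobs(jobs, modes):
    """Pair jobs with usable devices in FIFO order; leftover jobs stay queued."""
    devices = usable_devices(modes)
    assigned = dict(zip(jobs, devices))
    queued = jobs[len(devices):]
    return assigned, queued


# My setup: GPU 0 is compute prohibited, GPU 1 is compute exclusive.
modes = [ComputeMode.PROHIBITED, ComputeMode.EXCLUSIVE]
assigned, queued = assign_jobs(["job_a", "job_b"], modes)
print(assigned, queued)
```

With this filtering, only one job is ever handed a device, and the second job stays in the queue as an ordinary waiting job rather than being bound to a device that will refuse it.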

We can continue this by email/PM if you like.
3) Message boards : Questions and problems : BOINC 6.10.43/6.10.44 no longer released for public (Message 32274)
Posted 19 Apr 2010 by avidday
Post:


BOINC isn't a project; why the Milkyway scheduler may or may not give you work is something you have to take up with them. It's their server that says that no work is sent, with the reason given (their maximum of 6 tasks per queue).


I understand that, but my question is why, when the client has work, it doesn't run it? The task start/stop/report logic is in the client, not the project server, isn't it?

I am working on the assumption that as long as the client's own internal settings permit it, it will just start and run tasks until the task queue is empty. I am seeing long pauses between task starts, which I assume should not occur.

But if you want to look at the BOINC source code, that's possible. Check http://boinc.berkeley.edu/trac/browser/branches/boinc_core_release_6_10 for the 6.10 code.


Thank you for the link.
4) Message boards : Questions and problems : BOINC 6.10.43/6.10.44 no longer released for public (Message 32271)
Posted 19 Apr 2010 by avidday
Post:
I am trying to understand the scheduling behaviour of the Linux 6.10.44 release with GPUs, because I am seeing some strange things I can't explain. From time to time my client sits with a full task queue, not running any task, for up to 45 minutes at a time (this is with the Milkyway@home CUDA application). In other cases a task sits in the task queue for days and never starts. In one instance the project went down for maintenance for 24 hours, every other task completed, and a five-day-old task was still stuck in the queue, never having started.

My machine looks like this (Ubuntu 9.04 with the 195.36.15 release drivers):

Fri 16 Apr 2010 09:01:44 PM EEST		Starting BOINC client version 6.10.44 for x86_64-pc-linux-gnu
Fri 16 Apr 2010 09:01:44 PM EEST		log flags: file_xfer, sched_ops, task
Fri 16 Apr 2010 09:01:44 PM EEST		Libraries: libcurl/7.18.0 OpenSSL/0.9.8g zlib/1.2.3.3 c-ares/1.5.1
Fri 16 Apr 2010 09:01:44 PM EEST		Data directory: /home/david/BOINC
Fri 16 Apr 2010 09:01:44 PM EEST		Processor: 4 AuthenticAMD AMD Phenom(tm) II X4 945 Processor [Family 16 Model 4 Stepping 2]
Fri 16 Apr 2010 09:01:44 PM EEST		Processor: 512.00 KB cache
Fri 16 Apr 2010 09:01:44 PM EEST		Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good pni monitor cx16 lahf_lm cmp_legacy svm extapic cr8_
Fri 16 Apr 2010 09:01:44 PM EEST		OS: Linux: 2.6.28-18-generic
Fri 16 Apr 2010 09:01:44 PM EEST		Memory: 7.70 GB physical, 7.45 GB virtual
Fri 16 Apr 2010 09:01:44 PM EEST		Disk: 891.22 GB total, 757.31 GB free
Fri 16 Apr 2010 09:01:44 PM EEST		Local time is UTC +3 hours
Fri 16 Apr 2010 09:01:44 PM EEST		NVIDIA GPU 0: GeForce GTX 275 (driver version unknown, CUDA version 3000, compute capability 1.3, 895MB, 701 GFLOPS peak)
Fri 16 Apr 2010 09:01:44 PM EEST		NVIDIA GPU 1: GeForce GTX 275 (driver version unknown, CUDA version 3000, compute capability 1.3, 896MB, 701 GFLOPS peak)
Fri 16 Apr 2010 09:01:44 PM EEST	Milkyway@home	URL http://milkyway.cs.rpi.edu/milkyway/; Computer ID 167065; resource share 100
Fri 16 Apr 2010 09:01:44 PM EEST	Milkyway@home	General prefs: from Milkyway@home (last modified 08-Apr-2010 17:50:24)
Fri 16 Apr 2010 09:01:44 PM EEST	Milkyway@home	Host location: none
Fri 16 Apr 2010 09:01:44 PM EEST	Milkyway@home	General prefs: using your defaults
Fri 16 Apr 2010 09:01:44 PM EEST		Reading preferences override file
Fri 16 Apr 2010 09:01:44 PM EEST		Preferences:
Fri 16 Apr 2010 09:01:44 PM EEST		   max memory usage when active: 3943.92MB
Fri 16 Apr 2010 09:01:44 PM EEST		   max memory usage when idle: 7099.06MB
Fri 16 Apr 2010 09:01:44 PM EEST		   max disk usage: 10.00GB
Fri 16 Apr 2010 09:01:44 PM EEST		   max CPUs used: 1
Fri 16 Apr 2010 09:01:44 PM EEST		   (to change, visit the web site of an attached project,
Fri 16 Apr 2010 09:01:44 PM EEST		   or click on Preferences)
Fri 16 Apr 2010 09:01:44 PM EEST		Not using a proxy


with one GPU marked compute prohibited and the other marked compute exclusive. I have "Compute while computer is in use" and "Use GPU while computer is in use" selected in the manager, and most of the time, it works fine. A typical example of the problem looks something like this:

Mon 19 Apr 2010 11:45:37 AM EEST	Milkyway@home	Computation for task de_new_test2_29033_1271663028_0 finished
Mon 19 Apr 2010 11:45:39 AM EEST	Milkyway@home	Started upload of de_new_test2_29033_1271663028_0_0
Mon 19 Apr 2010 11:45:42 AM EEST	Milkyway@home	Finished upload of de_new_test2_29033_1271663028_0_0
Mon 19 Apr 2010 11:46:17 AM EEST	Milkyway@home	Sending scheduler request: To fetch work.
Mon 19 Apr 2010 11:46:17 AM EEST	Milkyway@home	Reporting 1 completed tasks, requesting new tasks for GPU
Mon 19 Apr 2010 11:46:22 AM EEST	Milkyway@home	Scheduler request completed: got 1 new tasks
Mon 19 Apr 2010 11:46:24 AM EEST	Milkyway@home	Started download of de_new_test2_46499_1271666646_search_parameters
Mon 19 Apr 2010 11:46:27 AM EEST	Milkyway@home	Finished download of de_new_test2_46499_1271666646_search_parameters
Mon 19 Apr 2010 11:47:28 AM EEST	Milkyway@home	Sending scheduler request: To fetch work.
Mon 19 Apr 2010 11:47:28 AM EEST	Milkyway@home	Requesting new tasks for GPU
Mon 19 Apr 2010 11:47:33 AM EEST	Milkyway@home	Scheduler request completed: got 0 new tasks
Mon 19 Apr 2010 11:47:33 AM EEST	Milkyway@home	Message from server: No work sent
Mon 19 Apr 2010 11:47:33 AM EEST	Milkyway@home	Message from server: (reached limit of 6 tasks in progress)
Mon 19 Apr 2010 11:48:38 AM EEST	Milkyway@home	Sending scheduler request: To fetch work.
Mon 19 Apr 2010 11:49:48 AM EEST	Milkyway@home	Sending scheduler request: To fetch work.
Mon 19 Apr 2010 11:49:48 AM EEST	Milkyway@home	Requesting new tasks for GPU
Mon 19 Apr 2010 11:49:53 AM EEST	Milkyway@home	Scheduler request completed: got 0 new tasks
Mon 19 Apr 2010 11:49:53 AM EEST	Milkyway@home	Message from server: No work sent
Mon 19 Apr 2010 11:49:53 AM EEST	Milkyway@home	Message from server: (reached limit of 6 tasks in progress)
Mon 19 Apr 2010 11:50:58 AM EEST	Milkyway@home	Sending scheduler request: To fetch work.
Mon 19 Apr 2010 11:50:58 AM EEST	Milkyway@home	Requesting new tasks for GPU
Mon 19 Apr 2010 11:51:03 AM EEST	Milkyway@home	Scheduler request completed: got 0 new tasks
Mon 19 Apr 2010 11:51:03 AM EEST	Milkyway@home	Message from server: No work sent
Mon 19 Apr 2010 11:51:03 AM EEST	Milkyway@home	Message from server: (reached limit of 6 tasks in progress)
Mon 19 Apr 2010 11:52:08 AM EEST	Milkyway@home	Sending scheduler request: To fetch work.
Mon 19 Apr 2010 11:52:08 AM EEST	Milkyway@home	Requesting new tasks for GPU
Mon 19 Apr 2010 11:52:13 AM EEST	Milkyway@home	Scheduler request completed: got 0 new tasks
Mon 19 Apr 2010 11:52:13 AM EEST	Milkyway@home	Message from server: No work sent
Mon 19 Apr 2010 11:52:13 AM EEST	Milkyway@home	Message from server: (reached limit of 6 tasks in progress)
Mon 19 Apr 2010 11:52:22 AM EEST	Milkyway@home	Starting de_new_test2_18284_1271660211_2
Mon 19 Apr 2010 11:52:22 AM EEST	Milkyway@home	Starting task de_new_test2_18284_1271660211_2 using milkyway version 24


Here a task finishes and reports, with 5 other tasks in the queue. A new task is requested and downloaded from the scheduler, so that the task queue is full at 6 tasks, then nothing happens. The machine sits idle for several minutes, periodically polling for new work (and getting nothing because it has a full task queue), but no task ever starts. Then finally something happens.

This is a small example, but I have observed these "fallow" periods persist for 45 minutes in a couple of cases. The question is why?
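For anyone who wants to measure these gaps in their own logs, here is a small Python sketch. It assumes the log fields are tab-separated (timestamp, project, message) as in the paste above, and that only one task runs at a time, which is the case on my machine; the function names are my own.

```python
from datetime import datetime

# Two lines taken from the log excerpt above (tab-separated fields).
LOG = """\
Mon 19 Apr 2010 11:45:37 AM EEST\tMilkyway@home\tComputation for task de_new_test2_29033_1271663028_0 finished
Mon 19 Apr 2010 11:52:22 AM EEST\tMilkyway@home\tStarting task de_new_test2_18284_1271660211_2 using milkyway version 24
"""


def parse_line(line):
    """Split one log line into (timestamp, message)."""
    stamp_text, _project, message = line.split("\t", 2)
    # Drop the weekday and the timezone name; keep "19 Apr 2010 11:45:37 AM".
    core = " ".join(stamp_text.split()[1:6])
    stamp = datetime.strptime(core, "%d %b %Y %I:%M:%S %p")
    return stamp, message


def idle_gaps(log_text):
    """Seconds between each 'finished' message and the next 'Starting task'."""
    gaps = []
    finished_at = None
    for line in log_text.splitlines():
        if not line.strip():
            continue
        stamp, message = parse_line(line)
        if message.startswith("Computation for task") and message.endswith("finished"):
            finished_at = stamp
        elif message.startswith("Starting task") and finished_at is not None:
            gaps.append((stamp - finished_at).total_seconds())
            finished_at = None
    return gaps


print(idle_gaps(LOG))  # [405.0], i.e. 6 minutes 45 seconds idle
```

On a machine that runs several tasks at once, a finished task does not imply an idle machine, so the gap would need to be computed per device instead.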

Does your project have a source repository somewhere I could browse? I have a suspicion about what might be happening [it might be the client is mishandling or misinterpreting the Linux driver compute settings], but looking at your CUDA interface code would certainly be helpful.

Thanks in advance.



