Posts by GeneAZ

21) Message boards : Questions and problems : Scheduling priority, is it affected by resource share? (Message 62461)
Posted 7 Jun 2015 by GeneAZ
Post:
Linux x64; boinc 7.4.23; AMD FX-4300 (4 core); Nvidia GTX-650; running projects Seti, Asteroids, & Einstein.

Looking at BoincManager -> Tasks -> (project) -> Properties, there is a value given for "Scheduling priority". For the past couple of months the values shown for the running projects have been fairly stable, with narrow-range variations: Seti -1.10 / Einstein -1.30 / Asteroids -0.40.

Resource shares are: Seti 85 / Einstein 10 / Asteroids 5.

Somewhere I've read about "long term debt" needing to be worked off and causing a project to consume more resource than otherwise might be expected. Asteroids runs one core 24/7 (max_concurrent=1) in spite of resource share 5; and Seti doesn't run any CPU (max_concurrent=3) in spite of resource share 85 and idle CPU cores.

Is priority -0.40 higher, i.e. more urgent, than -1.10? And, if so, why has Asteroids continued for months at that priority? I have buffering set to 1 day plus 0.5 day. Presently Asteroids has 66 tasks buffered, with a total estimated time "remaining" of 130 hours. It is configured for CPU only.

Meanwhile, Seti could use 2 more cores but CPU work fetch is blocked since the CPU work queue is full (of Asteroids work).
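For what it's worth, my understanding of the priority number (a rough sketch, not the actual client source) is that it is roughly minus the project's share of recent estimated credit (REC) divided by its resource-share fraction, so values nearer zero are more urgent. Using the REC figures from the [work_fetch] log in my Feb. 23 post below (shares 80/20/5 there), the absolute numbers don't match the client's exactly (the client applies further adjustments), but the ordering comes out the same:

```python
# Rough sketch, NOT the actual BOINC source: scheduling priority is
# approximately -(REC share) / (resource-share fraction); values
# nearer zero are more urgent for work fetch, so -0.40 beats -1.10.

def sched_priority(rec, resource_share, rec_total, share_total):
    """Approximate per-project scheduling priority."""
    rec_frac = rec / rec_total            # share of recent estimated credit
    share_frac = resource_share / share_total
    return -rec_frac / share_frac

# REC figures from the [work_fetch] log in the Feb. 23 post below,
# with the resource shares 80/20/5 quoted there:
recs = {"Asteroids": 495.453, "SETI": 24465.446, "Einstein": 6830.041}
shares = {"Asteroids": 5, "SETI": 80, "Einstein": 20}

for name in recs:
    p = sched_priority(recs[name], shares[name],
                       sum(recs.values()), sum(shares.values()))
    print(name, round(p, 3))
```

Under this reading, Asteroids' small resource share in the denominator keeps its priority closest to zero as long as its REC stays tiny, which would be why it keeps winning.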

Momentarily suspending Asteroids brings a flood of Seti CPU work but after resuming Asteroids, Seti works through its buffered tasks and fetches no more.
(Seti GPU work continues to flow normally and even Einstein GPU work takes a proportional slice of the GPU resource.)

So, what's the secret to getting Asteroids off its priority limit? And how do I "suggest" to the boinc manager that 130 hours of Asteroids work exceeds the 1+0.5 day buffer parameters?

I can, of course, set NNT for Asteroids periodically but doesn't that then upset the long-term averaging boinc manager is trying to do?

Boinc 7.4.23 has the new "Event Log Diagnostic Flags" menu so just ask for a flag setting and it shall be done.
22) Message boards : Questions and problems : Elapsed Time not updating (Message 62460)
Posted 7 Jun 2015 by GeneAZ
Post:
I see similar symptoms in the 7.4.23 Debian Linux boinc manager. The Tasks tab list doesn't update fully: sometimes one task, in a list of running tasks, doesn't update its status line; I haven't identified a pattern. Sometimes it's the last running task, sometimes not. Bringing mouse focus back to the window fully updates all tasks ONCE, and then the 1-second updates resume for most, but not all, running tasks. If the "progress" of a task causes it to be repositioned in the list, then again everything updates ONCE and the partial updates continue. So, obviously, the "progress" is being properly computed behind the scenes; it is apparently just the window display update that is incomplete.

(Hoping to see an x64 Linux 7.4.42 soon...)
23) Message boards : Questions and problems : S@H CPU downloads blocked when A@H project running (Message 60504)
Posted 24 Feb 2015 by GeneAZ
Post:
flakinho -- and anybody else...
(RH your posts to the other thread duly noted)

Yes, I did see the "...CPU threshold" thread you had started earlier. I was not sure whether my configuration (not using VMs) might give additional insight into the issue. It appeared to be boinc related and not project dependent. However, in the 7+ hours since my original post E@H has downloaded 4 CPU tasks (one of which is now running) and 3 Nvidia tasks.

Now I'm more confused! Why does E@H fetch CPU work and S@H does not? (And, yes, I have noticed that the S@H servers have been short of work during this time.) Does it make a difference that A@H and S@H projects have app_info.xml configurations and E@H does not?

There is a work-around, of course. Just suspend other tasks momentarily every two or three days to let the S@H work refill.

GeneAZ
24) Message boards : Questions and problems : S@H CPU downloads blocked when A@H project running (Message 60499)
Posted 23 Feb 2015 by GeneAZ
Post:
Here's the system: 64-bit Linux; BOINC 7.2.42; AMD FX-4300 (4 core); Nvidia GTX-650; 8 GB RAM; Buffer parameters 1.0 days minimum + 0.5 days additional.
--Seti CPU ID=4774476--
Abbreviations used for active projects and resource shares:
S@H Seti@home 80 CPU + Nvidia work enabled
E@H Einstein@home 20 CPU + Nvidia work enabled
A@H Asteroids@home 5 CPU work enabled

S@H configured for 3 concurrent tasks, E@H and A@H configured for 1 concurrent task. If other tasks are suspended then S@H happily fetches CPU and Nvidia work. But if A@H, for example, is running its one allowed task then S@H does NOT request CPU work and eventually the buffer CPU work is depleted leaving 2 or 3 of the cores idle.

Example situation: Feb. 20, A@H running one CPU, S@H running one Nvidia, E@H waiting to run (only Nvidia work available). Work requests to S@H were asking for ZERO seconds CPU work. O.K., "suspend A@H project" and --IMMEDIATELY-- work request to S@H and 55 CPU tasks downloaded!

A@H -resumed- and there have been no further S@H CPU work downloads for 3 days now. All the 55 downloads have been processed and reported. The buffer has no S@H CPU work and is not requesting any.

Maybe related observation -- for at least a week now, since Feb. 17, A@H has been running one CPU task (all that app_config allows) full time despite the relatively low resource share setting. It's cranking out 9 work units per day.

I've added cc_config log flags for <work_fetch_debug>, <sched_op_debug>, and <cpu_sched_debug> and what I hope will be relevant output follows:

.../snip/...
23-Feb-2015 13:50:11 [---] [cpu_sched_debug] using 1.80 out of 4 CPUs
23-Feb-2015 13:50:11 [---] [work_fetch] Request work fetch: CPUs idle
23-Feb-2015 13:50:11 [Asteroids@home] [cpu_sched_debug] ps_150218b_1138_5_1 sched state 2 next 2 task state 1
23-Feb-2015 13:50:11 [SETI@home] [cpu_sched_debug] 03se12ac.22341.4566.438086664208.12.66_0 sched state 2 next 2 task state 1
23-Feb-2015 13:50:11 [---] [cpu_sched_debug] enforce_schedule: end
23-Feb-2015 13:50:14 [---] [work_fetch] entering choose_project()
23-Feb-2015 13:50:14 [---] [work_fetch] ------- start work fetch state -------
23-Feb-2015 13:50:14 [---] [work_fetch] target work buffer: 86400.00 + 43200.00 sec
23-Feb-2015 13:50:14 [---] [work_fetch] --- project states ---
23-Feb-2015 13:50:14 [NFS@Home] [work_fetch] REC 59.786 prio -0.000000 can't req work: "no new tasks" requested via Manager
23-Feb-2015 13:50:14 [Asteroids@home] [work_fetch] REC 495.453 prio -0.355004 can req work
23-Feb-2015 13:50:14 [SETI@home] [work_fetch] REC 24465.446 prio -1.127314 can req work
23-Feb-2015 13:50:14 [Einstein@Home] [work_fetch] REC 6830.041 prio -2.041003 can req work
23-Feb-2015 13:50:14 [---] [work_fetch] --- state for CPU ---
23-Feb-2015 13:50:14 [---] [work_fetch] shortfall 6051.51 nidle 0.00 saturated 126446.69 busy 0.00
23-Feb-2015 13:50:14 [NFS@Home] [work_fetch] fetch share 0.000
23-Feb-2015 13:50:14 [Asteroids@home] [work_fetch] fetch share 0.053
23-Feb-2015 13:50:14 [SETI@home] [work_fetch] fetch share 0.842
23-Feb-2015 13:50:14 [Einstein@Home] [work_fetch] fetch share 0.105
23-Feb-2015 13:50:14 [---] [work_fetch] --- state for NVIDIA ---
23-Feb-2015 13:50:14 [---] [work_fetch] shortfall 34731.81 nidle 0.00 saturated 94639.88 busy 0.00
23-Feb-2015 13:50:14 [NFS@Home] [work_fetch] fetch share 0.000 (no apps)
23-Feb-2015 13:50:14 [Asteroids@home] [work_fetch] fetch share 0.000 (blocked by prefs) (no apps)
23-Feb-2015 13:50:14 [SETI@home] [work_fetch] fetch share 0.889
23-Feb-2015 13:50:14 [Einstein@Home] [work_fetch] fetch share 0.111
23-Feb-2015 13:50:14 [---] [work_fetch] ------- end work fetch state -------
23-Feb-2015 13:50:14 [---] [work_fetch] No project chosen for work fetch
.../snip/...

I hope to resume S@H CPU work "soon" which can easily be done by momentarily suspending A@H but I'm willing to defer that action to do any further diagnosis of boinc in its present state (i.e. S@H work request = zero seconds cpu).
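Reading the [work_fetch] section above, my mental model of choose_project() is roughly this (a hypothetical simplification in Python, not the actual client logic; the names and selection rule are my guesses): for each resource with a shortfall, ask the highest-priority project allowed to fetch for it, scaled by its fetch share.

```python
# Hypothetical simplification of the choose_project() step seen in the
# [work_fetch] log above -- NOT the actual BOINC client source.
# For a resource with a shortfall, pick the project with the highest
# (least negative) priority that is allowed to request work; scale the
# request by that project's fetch share.

def choose_project(projects, shortfall):
    """projects: list of (name, priority, fetch_share, can_request)."""
    if shortfall <= 0:
        return None, 0.0
    candidates = [p for p in projects if p[3] and p[2] > 0]
    if not candidates:
        return None, 0.0
    # Higher (less negative) priority wins.
    name, prio, share, _ = max(candidates, key=lambda p: p[1])
    return name, shortfall * share

# CPU state copied from the 13:50:14 log excerpt:
cpu_projects = [
    ("NFS@Home", -0.0, 0.0, False),            # "no new tasks"
    ("Asteroids@home", -0.355004, 0.053, True),
    ("SETI@home", -1.127314, 0.842, True),
    ("Einstein@Home", -2.041003, 0.105, True),
]
print(choose_project(cpu_projects, 6051.51))
```

By this reading, Asteroids (least negative priority) is the project that would be asked to cover the CPU shortfall; my guess is the real client then declines because Asteroids' one allowed max_concurrent slot is already saturated with buffered work, and, rather than falling through to Seti, it stops there with "No project chosen for work fetch". That is only my interpretation of the log, though.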

Thanks for any help.
GeneAZ
25) Message boards : Questions and problems : Bad things happen on loss of network connectivity (Message 54774)
Posted 5 Jul 2014 by GeneAZ
Post:
Try "just once"? I beg to differ: boinc is not doing that! Looking at the event log (see previous posted message), the upload of the SAME result is _attempted_ at 10:39:32, 10:40:54, and 10:43:12 even though a previous result has failed the "upload failure limit" and has already set the project backoff to at least 15 minutes. And if left to continue running, this iteration of [try upload; fail; set retry between 1 and 2 minute bounds; re-calculate project backoff - but apparently ignore it] continues forever.
I can see the rationale for a "try just once" but boinc isn't doing that. I seem to remember some previous boinc version holding new uploads in a "pending" state if a project backoff is in effect. I like that strategy better.
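The behavior I would prefer, sketched in Python (hypothetical constants and function names, not the actual client code): each file keeps its own short exponential retry backoff, but no upload starts while the project-wide backoff is still in effect.

```python
# Sketch of the two backoff mechanisms as I understand them, with
# HYPOTHETICAL constants and names (the real client's values differ):
# each file gets its own short exponential backoff, while the project
# gets a separate, much longer project-wide transfer delay.  My point
# is that new uploads should honor BOTH timers, not just the first.

import random

def file_backoff(n_failures, base=60.0, cap=3600.0):
    """Per-file retry delay: roughly doubles per failure, with jitter."""
    delay = min(cap, base * 2 ** (n_failures - 1))
    return delay * random.uniform(0.5, 1.0)

def should_start_upload(now, file_retry_time, project_backoff_until):
    """The strategy I'd prefer: hold the upload 'pending' until both
    the per-file retry time AND the project-wide backoff have passed."""
    return now >= file_retry_time and now >= project_backoff_until
```

With that gate, the second and later results would sit "pending" through a network outage instead of hammering a dead connection every minute or two.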
26) Message boards : Questions and problems : Bad things happen on loss of network connectivity (Message 54743)
Posted 5 Jul 2014 by GeneAZ
Post:
More testing on this problem: it boils down to this -- when network connectivity is lost, the 2nd and subsequent tasks that finish DO NOT honor the project backoff interval. What is the point of having a project backoff if the uploader does not use it? A snipped/edited stdout log follows, in which one can see a result upload failing (because of the network disconnect) and a project backoff of many minutes being set. But then the result upload just keeps on retrying at its own 1 or 2 minute backoff.
The "05ja..." task was the SECOND task to finish after the network loss; the first task (04se...) made its three upload attempts and then did not try anymore, consistent with the project backoff invoked.

10:39:32 [SETI@home] Started upload of 05ja09ab.14765.11524.438086664195.12.209_0_0
10:39:53 [SETI@home] Temporarily failed upload of 05ja09ab.14765.11524.438086664195.12.209_0_0: can't resolve hostname
10:39:53 [SETI@home] [file_xfer] project-wide xfer delay for 945.028502 sec
10:39:53 [SETI@home] Backing off 00:01:00 on upload of 05ja09ab.14765.11524.438086664195.12.209_0_0
10:40:54 [SETI@home] [fxd] starting upload, upload_offset -1
10:40:54 [SETI@home] Started upload of 05ja09ab.14765.11524.438086664195.12.209_0_0
10:41:14 [SETI@home] [file_xfer] file transfer status -113 (can't resolve hostname)
10:41:14 [SETI@home] Temporarily failed upload of 05ja09ab.14765.11524.438086664195.12.209_0_0: can't resolve hostname
10:41:14 [SETI@home] [file_xfer] project-wide xfer delay for 1618.068131 sec
10:41:14 [SETI@home] Backing off 00:01:58 on upload of 05ja09ab.14765.11524.438086664195.12.209_0_0
10:43:12 [SETI@home] [fxd] starting upload, upload_offset -1
10:43:12 [SETI@home] Started upload of 05ja09ab.14765.11524.438086664195.12.209_0_0
10:43:12 [SETI@home] [file_xfer] URL: http://setiboincdata.ssl.berkeley.edu/sah_cgi/file_upload_handler
10:43:33 [SETI@home] [file_xfer] file transfer status -113 (can't resolve hostname)
10:43:33 [SETI@home] Temporarily failed upload of 05ja09ab.14765.11524.438086664195.12.209_0_0: can't resolve hostname
10:43:33 [SETI@home] [file_xfer] project-wide xfer delay for 3642.048836 sec
10:43:33 [SETI@home] Backing off 00:01:00 on upload of 05ja09ab.14765.11524.438086664195.12.209_0_0

I see the bug Trac (ticket #113) of 5 years ago regarding the "computation error" and other side effects of a network outage. But that issue seems to me to be an effect of, and not the cause of, uploads being attempted during a project backoff interval.
The above event log excerpt was "snipped" out of the stdoutdae.txt log created with the cc_config logfile options: file_xfer_debug, file_xfer, http_debug, and http_xfer_debug. The 27 kbyte file, for the 19 minutes of interest, is saved should anyone want to see it.
27) Message boards : Questions and problems : Bad things happen on loss of network connectivity (Message 54634)
Posted 28 Jun 2014 by GeneAZ
Post:
boinc/boincmgr version 7.2.42 (wxWidgets 2.8.12)
Linux 3.2.46 (x86_64)
AMD FX-4300, 4-core + Nvidia GTX-650
Only Setiathome project active; both CPU and GPU tasks enabled, configured for 3 concurrent CPU and 1 GPU (with CPU allocated to feed it).

On June 25, my internet router failed, unnoticed for several hours. The end result was 64 tasks running for a few seconds each and ending with computation error status. In Linux it was a signal 11 (segmentation violation). The event log was showing continuous cycles of: starting upload -> upload failed -> backoff 1:xx.

To try to recreate this failure mode today I "pulled the plug" on the system ethernet cable to the network router. The same boinc failure developed and here is the sequence of symptoms:
(1) 08:25:00 network manually disconnected
(2) 08:28:35 GPU task #1 finished
(3) 08:28:35 GPU task #2 started
(4) 08:28:37 start upload task #1
(5) 08:28:58 ..upload failed, backoff 03:51
(6) 08:29:18 Event log shows internet access lost
(7) 08:32:49 start upload task #1
(8) 08:33:10 ..upload failed, backoff 02:34
(9) 08:35:44 start upload task #1
(10) 08:36:05 ..upload failed, backoff 03:28
Not shown in Event log, but Project backoff approx. 9 minutes invoked.
(11) 08:49:18 start upload task #1
(12) 08:49:38 ..upload failed, backoff 02:18
Not shown in Event log, but Project backoff approx. 29 minutes invoked.
(13) 08:59:27 GPU task #2 finished
(14) 08:59:27 GPU task #3 started
(15) 08:59:29 start upload task #2
*** even though Project backoff is still active ***
(16) 08:59:50 ..upload failed, backoff 1:09
Not shown in Event log, but Project backoff went to ~1 hr 40 min.
(17) 09:03:20 start upload task #2
(18) 09:03:40 ..upload failed, backoff 1:28
Not shown in Event log, but Project backoff went to > 8 hours
(19) 09:05:09 start upload task #2
(20) 09:05:30 ..upload failed, backoff 1:15
Project backoff probably changed but not recorded.
(21) 09:06:47 start upload task #2
(22) 09:07:07 ..upload failed, backoff 1:58
Not shown in Event log, but Project backoff went to ~4 hr 57 min.
>>clip<< **more upload failures**
(23) 09:14:08 CPU task #3 finished
(24) 09:14:08 CPU task #4 started
(25) 09:14:10 upload task #3
(26) 09:14:31 ..upload failed, backoff 1:34
(27) 09:14:31 upload task #2
(28) 09:14:51 ..upload failed, backoff 1:01

*** At this time it is noticed that the upload of task #1 remains "pending" presumably because of the long Project Backoff interval.
HOWEVER, the subsequent tasks #2 & #3 do NOT seem to honor the Project backoff and continue to attempt uploads on their own (short) backoff intervals. ALSO, it is noted that the 1-second updates of the Tasks and Transfers displays, from boincmgr, freeze for the duration of the 20-second timeout during upload attempts.
For the rest of the Event log data there are MANY failed upload attempts, with backoffs of approx. 1 minute; with 3 tasks cycling through their 20-second timeouts, the boincmgr is nearly frozen. I am leaving out the upload messages.

(29) 09:30:04 GPU task #5 started
(30) 09:43:46 GPU task #5 "finished"
***exited with zero status*** HOWEVER, task Status still = Running
(31) 09:50:40 CPU task #? finished
(32) 09:50:40 CPU task #6 started
(33) 09:50:42 upload task (CPU#?)
(34) 09:51:03 ..upload failed, backoff 1:00
(35) 09:52:45 GPU task #5 "finished"
***exited with zero status*** HOWEVER, task Status still = Running
(36) 09:58:29 GPU task #5 finished
***exited with zero status*** ...still Running...
(37) 10:05:12 GPU task #5 finished
(38) 10:05:12 GPU task #6 started
(39) 10:05:52 GPU task #6 finished (computation error)
(40) 10:05:52 GPU task #7 started
(41) 10:06:33 GPU task #7 finished (computation error)

***At this time the test was terminated by suspending GPU activity and reconnecting the network cable.

My observation is that the second, and all subsequent, tasks in the upload queue do not honor the Project Backoff interval. As additional tasks finish their computation they flood the inoperative upload path and (possibly) block some essential boincmgr, or operating system, functions, eventually leading to fatal task terminations.
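A toy back-of-envelope calculation of the flood (assumed numbers: ~90 s average per-file retry interval, ~20 s name-resolution timeout per attempt, attempts not overlapping): the fraction of wall-clock time the client spends stuck inside upload timeouts grows with each finished task, and with a handful of queued results it approaches 100%, which would match the near-frozen Manager I saw.

```python
# Toy model with ASSUMED numbers (not measured from the client):
# each finished task retries on its own ~90 s backoff, and every
# attempt blocks for a ~20 s name-resolution timeout.

def timeout_fraction(n_tasks, retry_sec=90.0, timeout_sec=20.0):
    """Fraction of wall-clock time spent inside upload timeouts,
    assuming attempts don't overlap (capped at 100%)."""
    return min(1.0, n_tasks * timeout_sec / retry_sec)

for n in (1, 3, 6):
    print(n, f"{timeout_fraction(n):.0%}")
```

By this crude model, one stuck upload costs about a fifth of the time, three cost two-thirds, and six saturate it, so a GUI that polls once per second would appear frozen most of the time.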

Since this failure can be reproduced (relatively) easily I can re-run the test with appropriate cc_config.xml options which may be helpful. It just means I would need to babysit the process for about an hour to intervene when things start to go haywire.

Somebody tell me if this is a known bug. Or, if it may be specific to the Seti applications. It just looks to me like a boinc manager issue.
Thanks.



Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.