Beta BOINC 5.7.x/5.8.x discussion/problem report

Trog Dog
Joined: 6 May 06
Posts: 287
Australia
Message 7448 - Posted: 13 Jan 2007, 11:43:33 UTC

Resource limit exceeded sucks. This turns otherwise productive machines into cripples, even when all settings are set to 100%. The max memory usage doesn't take into account the swap partition.

Maybe it's something that the individual projects have to address - the main victims seem to be WCG and Rosetta.

Reverting back to 5.4.11

BTW, these settings don't limit usage (i.e. hold usage to a maximum level, which is what most crunchers would take them to mean); they abort the WU as soon as the limit is exceeded. Instead of acting as cruise control (limiting maximum speed), the journey is aborted as soon as the set speed is exceeded.


ID: 7448

Trog Dog
Joined: 6 May 06
Posts: 287
Australia
Message 7460 - Posted: 13 Jan 2007, 23:57:24 UTC - in response to Message 7449.  

Resource .... is aborted.


Help Defeat Cancer (HDC) at WCG needs a sizable amount of physical RAM to run at all, regardless of the size of the VM; otherwise the swapping would just cripple the machine for normal use. If you are talking about 'waiting for memory': I've set the 'busy' and 'idle' memory percentages to 60% / 90%. The default 50% would pause them at some point in the computation. This is on a machine with 1.5 GB of RAM.

You probably already know this, but the settings for WCG need to be put in global_prefs_override.xml, as the WCG website is not yet configured for the 5.8 final release.

<global_preferences>
<ram_max_used_busy_pct>60.0</ram_max_used_busy_pct>
<ram_max_used_idle_pct>90.0</ram_max_used_idle_pct>
</global_preferences>
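
To put those percentages into numbers, here is a rough sketch in Python. It assumes the limit is simply the percentage applied to physical RAM, which is how the preference names read but is not confirmed anywhere in this thread; the function name is made up for illustration.

```python
def ram_budget_mb(physical_mb: float, pct: float) -> float:
    """RAM available to BOINC, ASSUMING a plain percentage of physical RAM."""
    return physical_mb * pct / 100.0


# The 1.5 GB machine mentioned in the post above.
PHYSICAL_MB = 1.5 * 1024  # 1536 MB

for label, pct in [("default", 50.0), ("busy", 60.0), ("idle", 90.0)]:
    print(f"{label:>7} ({pct:.0f}%): {ram_budget_mb(PHYSICAL_MB, pct):.1f} MB")
```

Under that assumption the default 50% leaves 768 MB, which is only just above the 750 MB that HDC is said to need.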

BTW, on a virgin install these values were oddly set to 100% in the global_prefs.xml file, so maybe you don't meet particular minimum system requirements for WCG, such as the 750 MB of RAM for HDC. The projects at WCG check these minimums, including the VM settings, so nothing is likely to bomb; you'd simply not get a work unit.



The problem is not with HDC WUs; as you say, the amount of RAM is detected beforehand and the WUs are not issued. It seems to be only FAAH WUs (from WCG) that fall prey to this behaviour.

As I said, with settings at 100% (manually altered to these values after the first errors appeared) the errors still continue.

This occurs on machines with 512MB and lower with up to 1Gig swap partition, which prior to upgrading to 5.8.x ran without problems.

Anyway reverted back to 5.4.11
ID: 7460

biohazard326
Joined: 14 Jan 07
Posts: 2
United States
Message 7466 - Posted: 14 Jan 2007, 5:34:05 UTC

OK, I've been running betas here and there to test. Most worked fine with no problems, but the last two betas, 5.8.1 and 5.8.2, both showed some interesting work unit problems. I ran both betas and my work unit list would slowly but surely dry up, to the point where last night (after running 5.8.2 for a day or so) I had only 4 work units (and I run 20 or so projects). 5.8.1 had this problem too, but to a lesser extent. So after watching my work unit list dry up to nothing, I reinstalled 5.4.11 (stable) and, lo and behold, within 10 minutes of the install work units FLOODED in. I don't know if the work unit "part" of the betas is limited so that you don't receive too many projects and overload the beta, or if there is truly something borked with the betas that doesn't allow work units to come in. Either way, something is wrong, and I'm posting so that someone can fix this rather large problem or explain that it's the beta program limiting things. Thanks.
ID: 7466

KSMarksPsych
Joined: 30 Oct 05
Posts: 1239
United States
Message 7467 - Posted: 14 Jan 2007, 5:43:05 UTC - in response to Message 7466.  

OK, I've been running betas here and there to test. Most worked fine with no problems, but the last two betas, 5.8.1 and 5.8.2, both showed some interesting work unit problems. I ran both betas and my work unit list would slowly but surely dry up, to the point where last night (after running 5.8.2 for a day or so) I had only 4 work units (and I run 20 or so projects). 5.8.1 had this problem too, but to a lesser extent. So after watching my work unit list dry up to nothing, I reinstalled 5.4.11 (stable) and, lo and behold, within 10 minutes of the install work units FLOODED in. I don't know if the work unit "part" of the betas is limited so that you don't receive too many projects and overload the beta, or if there is truly something borked with the betas that doesn't allow work units to come in. Either way, something is wrong, and I'm posting so that someone can fix this rather large problem or explain that it's the beta program limiting things. Thanks.



The changes in the CPU scheduler between the 5.4.x series and the 5.8.x series are quite drastic.

This page describes the client scheduling policies.

I've always used a very small "connect to" interval (.01 days), so I haven't noticed a large change in the number of WUs in my cache.

But, your mileage may vary.
Kathryn :o)
ID: 7467

Rene
Joined: 20 Nov 06
Posts: 34
Netherlands
Message 7469 - Posted: 14 Jan 2007, 11:09:27 UTC - in response to Message 7340.  

I also saw some strange behaviour after viewing the Simple GUI.
Closing the BOINC Manager should close everything, but after viewing the Simple GUI everything stays active.
No similar behaviour without viewing the Simple GUI.
The Manager comes up with the "close confirmation" message and everything stops.


This Linux 5.8.1 "problem" seems to be fixed in the 5.8.2 build.

Also this one:
Message button closed the BOINC manager, but BOINC (and the app running) remained active.


;-)


ID: 7469

jgoldsti
Joined: 11 Nov 06
Posts: 12
United States
Message 7474 - Posted: 14 Jan 2007, 14:11:19 UTC - in response to Message 7471.  
Last modified: 14 Jan 2007, 14:21:35 UTC

I'm also seeing a problem with 5.8.2 running on Windows XP on one of my machines. I've set the profile for this machine to connect to the internet every 2.5 days (it is only running WCG work units). It pulls down about 12 hours of work rather than the amount it requested (261000 seconds) and it is completing all the work units in the buffer before requesting more work units. On my other machines running BOINC 5.8.2 (using other profiles), the machines pull down the correct amount of work and pull down more work units as they complete units and report back to WCG. I've tried letting all the work units complete on this machine and resetting, and this does not fix the problem. The DCF for this machine is around 1 (currently .97). This machine is a laptop that is suspended at night while the other machines are desktops running 24x7. I was not seeing this problem with 5.7.x and with 5.8.0/5.8.1.
ID: 7474

biohazard326
Joined: 14 Jan 07
Posts: 2
United States
Message 7481 - Posted: 14 Jan 2007, 18:41:31 UTC - in response to Message 7467.  
Last modified: 14 Jan 2007, 18:44:14 UTC

The changes in the CPU scheduler between the 5.4.x series and the 5.8.x series are quite drastic.

This page describes the client scheduling policies.

I've always used a very small "connect to" interval (.01 days), so I haven't noticed a large change in the number of WUs in my cache.

But, your mileage may vary.



Yeah, I read that, and I can't really see where or why it would complete almost every WU in my list before even attempting to contact the projects and ask for more work. I looked over the logs before I reinstalled 5.4.11 and saw that over the course of the day it had barely polled any projects for work, but instead plowed through the WUs I had already downloaded. (BTW, a small update: since reinstalling 5.4.11 I've had approximately 50-100 WUs download in the one day since 5.8.2, so apparently something is wonky.)


I'm also seeing a problem with 5.8.2 running on Windows XP on one of my machines. I've set the profile for this machine to connect to the internet every 2.5 days (it is only running WCG work units). It pulls down about 12 hours of work rather than the amount it requested (261000 seconds) and it is completing all the work units in the buffer before requesting more work units. On my other machines running BOINC 5.8.2 (using other profiles), the machines pull down the correct amount of work and pull down more work units as they complete units and report back to WCG. I've tried letting all the work units complete on this machine and resetting, and this does not fix the problem. The DCF for this machine is around 1 (currently .97). This machine is a laptop that is suspended at night while the other machines are desktops running 24x7. I was not seeing this problem with 5.7.x and with 5.8.0/5.8.1.



Exactly. I've tested most if not all of the beta versions since the last stable release, and I think a few before that, and none (until the 5.7.xx series and up) have had this serious WU drop-off issue. There were times with 5.4.11 (and a few beta versions after it) when I was deluged with WUs, having 75+ WUs in my queue at a time, so the 4 from 5.8.2 was quite shocking.
ID: 7481

jgoldsti
Joined: 11 Nov 06
Posts: 12
United States
Message 7500 - Posted: 15 Jan 2007, 16:01:19 UTC - in response to Message 7474.  

Yes, I understood that. My concern is that it is not pulling down more work units until it has completed all the work units vs. pulling down work units as it reports completion so that I have about 2.5 days of work units always in the queue.

ID: 7500

jgoldsti
Joined: 11 Nov 06
Posts: 12
United States
Message 7505 - Posted: 15 Jan 2007, 19:04:05 UTC - in response to Message 7500.  

One of the folks on a World Community Grid bulletin board shared how to set up cc_config to see why it was not pulling more work.
I ran BOINC with logging on for a few minutes and it reported back the following:
"2007-01-15 13:51:26 [World Community Grid] [rr_sim] result faah1231_d105n643_x2BPZ_01_2 finishes after 168168.786883 (23573.056801/0.140175)
2007-01-15 13:51:26 [World Community Grid] [rr_sim] result faah1231_d105n643_x2BPZ_01_2 misses deadline by 118092.291263".

If this means that BOINC thinks that this work unit is going to miss its deadline, something is not correct. The work unit has a deadline of Jan 20, is about to start in a few hours, and is estimated to need 6:30 hours to complete. What diagnostic info can I provide to help isolate the problem and get it corrected?
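
The numbers in that log line can be checked with a short Python sketch. Reading the pair in parentheses as "remaining CPU seconds divided by the fraction of the CPU this result is expected to get" is an assumption on my part; the log itself does not label the two values, and the function name is only illustrative.

```python
import re

# First log line quoted above, copied verbatim.
LINE = ("2007-01-15 13:51:26 [World Community Grid] [rr_sim] result "
        "faah1231_d105n643_x2BPZ_01_2 finishes after 168168.786883 "
        "(23573.056801/0.140175)")

# Matches: finishes after <finish> (<remaining>/<rate>)
PATTERN = re.compile(r"finishes after ([\d.]+) \(([\d.]+)/([\d.]+)\)")


def parse_rr_sim(line: str):
    """Return (finish, remaining, rate) as floats from an rr_sim line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not an rr_sim 'finishes after' line")
    finish, remaining, rate = (float(g) for g in match.groups())
    return finish, remaining, rate


finish, remaining, rate = parse_rr_sim(LINE)
print(f"remaining / rate : {remaining / rate:.1f} s")
print(f"logged finish    : {finish:.1f} s ({finish / 86400:.2f} days)")
print(f"remaining alone  : {remaining / 3600:.2f} h")
```

The quotient matches the logged finish time to within rounding, so about 6.5 hours of remaining CPU time is being stretched to nearly two days by a rate of roughly 0.14. If that reading is right, the small rate rather than the work unit's own estimate is what drives the predicted deadline miss.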
ID: 7505

Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15490
Netherlands
Message 7519 - Posted: 16 Jan 2007, 2:25:02 UTC - in response to Message 7505.  

WCG only allows so many results to be downloaded at a time, and it doesn't care about the "connect to" setting. If you ask for more than WCG's limit, you only get up to the limit that WCG has set.

So you can't ask for 10 days of work.
And if you ask for 12 hours of work, it may well be that they give you 10 hours of work or less, unlike other projects, which give you 12+ hours of work.

I don't think in this case you can blame BOINC.
Blame WCG. :-)

ID: 7519

Marky-UK
Joined: 12 Jul 06
Posts: 35
United Kingdom
Message 7529 - Posted: 16 Jan 2007, 8:28:52 UTC

Another problem with 5.8.2: It doesn't forget host-specific resource shares.

5.8 supports host-specific resource shares and so does BAM. BAM also lets you remove them, which removes the setting from the account manager reply. But 5.8.2 doesn't remove the ams_resource_share setting from the client_state.xml file.

If you remove a host resource share in BAM, 5.8.2 goes back to using the project resource share instead, but it leaves the ams_resource_share setting in the client_state file. When the client is next restarted, it goes back to using the ams_resource_share instead.
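
For anyone wanting to check whether a stale entry is still in their file, here is a minimal Python sketch. The sample XML is hypothetical and heavily trimmed; I am assuming that ams_resource_share sits inside each project element next to master_url, and that the real client_state.xml parses as well-formed XML, neither of which is confirmed here.

```python
import xml.etree.ElementTree as ET

# HYPOTHETICAL sample; a real client_state.xml has many more fields.
SAMPLE = """<client_state>
  <project>
    <master_url>http://example.org/alpha/</master_url>
    <resource_share>100.000000</resource_share>
    <ams_resource_share>25.000000</ams_resource_share>
  </project>
  <project>
    <master_url>http://example.org/beta/</master_url>
    <resource_share>100.000000</resource_share>
  </project>
</client_state>"""


def leftover_ams_shares(xml_text: str) -> dict:
    """Map project URL -> ams_resource_share for projects that still carry one."""
    root = ET.fromstring(xml_text)
    found = {}
    for project in root.iter("project"):
        share = project.findtext("ams_resource_share")
        if share is not None:
            url = project.findtext("master_url", default="?")
            found[url] = float(share)
    return found


print(leftover_ams_shares(SAMPLE))
```

Any project listed in the result after the host share has been removed in BAM would be a candidate for the behaviour described above.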
ID: 7529

jgoldsti
Joined: 11 Nov 06
Posts: 12
United States
Message 7535 - Posted: 16 Jan 2007, 12:09:42 UTC - in response to Message 7519.  

The problem turned out to be that cpu_efficiency had been computed as below .25 on this machine.....not sure what had caused this but it is steadily rising so hopefully the problem will correct itself.
ID: 7535

jgoldsti
Joined: 11 Nov 06
Posts: 12
United States
Message 7541 - Posted: 16 Jan 2007, 19:01:53 UTC - in response to Message 7535.  

It turns out the problem was a beta version of another program that had a CPU utilization bug...it was consuming a fair amount of CPU in the background and I did not notice it. cpu_efficiency has risen to .38 and scheduling is starting to work more normally.
ID: 7541

[AF>Linux]Arnaud
Joined: 30 Aug 05
Posts: 58
Message 7542 - Posted: 16 Jan 2007, 19:20:14 UTC
Last modified: 16 Jan 2007, 19:34:15 UTC

Problem in 5.8.2-gnu:

BOINC tries to download the applications of suspended projects such as the Spinup project. (CPDN Spinup is now closed and the apps aren't on the server anymore, but I'm still attached to it and the project is suspended; BOINC still wants to download the apps.)

Tue 16 Jan 2007 20:15:14 CET||file projects/climateapps1.oucs.ox.ac.uk_hadcm3spinup/hadcm3spinup_4.09_i686-pc-linux-gnu not found
Tue 16 Jan 2007 20:15:14 CET||file projects/climateapps1.oucs.ox.ac.uk_hadcm3spinup/hadcm3spinup_4.09_i686-pc-linux-gnu.so not found
Tue 16 Jan 2007 20:15:14 CET||file projects/climateapps1.oucs.ox.ac.uk_hadcm3spinup/hadcm3spinupse_4.09_i686-pc-linux-gnu.zip not found
...............
Tue 16 Jan 2007 20:29:43 CET|CPDN HadCM3 Spinup|Backing off 41 minutes and 21 seconds on download of file hadcm3spinup_4.09_i686-pc-linux-gnu.so
Tue 16 Jan 2007 20:30:04 CET||[http_debug] HTTP_OP::init_get(): http://climateapps1.oucs.ox.ac.uk/hadcm3spinup/download/hadcm3spinup_4.09_i686-pc-linux-gnu
Tue 16 Jan 2007 20:30:04 CET|CPDN HadCM3 Spinup|[file_xfer] Started download of file hadcm3spinup_4.09_i686-pc-linux-gnu
Tue 16 Jan 2007 20:30:14 CET|CPDN HadCM3 Spinup|[file_xfer] Temporarily failed download of hadcm3spinup_4.09_i686-pc-linux-gnu: file not found
Tue 16 Jan 2007 20:30:14 CET|CPDN HadCM3 Spinup|Backing off 45 minutes and 6 seconds on download of file hadcm3spinup_4.09_i686-pc-linux-gnu
Tue 16 Jan 2007 20:30:19 CET||[http_debug] HTTP_OP::init_get(): http://climateapps1.oucs.ox.ac.uk/hadcm3spinup/download/hadcm3spinupse_4.09_i686-pc-linux-gnu.zip
Tue 16 Jan 2007 20:30:19 CET|CPDN HadCM3 Spinup|[file_xfer] Started download of file hadcm3spinupse_4.09_i686-pc-linux-gnu.zip
Tue 16 Jan 2007 20:30:20 CET|CPDN HadCM3 Spinup|[file_xfer] Temporarily failed download of hadcm3spinupse_4.09_i686-pc-linux-gnu.zip: file not found
Tue 16 Jan 2007 20:30:20 CET|CPDN HadCM3 Spinup|Backing off 42 minutes and 54 seconds on download of file hadcm3spinupse_4.09_i686-pc-linux-gnu.zip

ID: 7542

Marky-UK
Joined: 12 Jul 06
Posts: 35
United Kingdom
Message 7549 - Posted: 17 Jan 2007, 8:16:46 UTC - in response to Message 7440.  

Answer from David Anderson:

I checked in a change to use user-friendly names
-- David

I think this change is a little broken and introduces a couple of bugs when BOINC Manager 5.8.2 is used to connect to a remote BOINC Client 5.4.11:

- In Advanced View, the Tasks show no Application at all now, just the version number.

- In Simple View, the BOINC Manager shows no tasks at all! It just says "Error No work available to process".


It seems to me like the application name should fall back to the old application name if the user-friendly name is missing.

I'm not 100% sure if the Simple View isn't working because of this change or something else that changed in 5.8.2, but it worked with BM 5.8.1 connecting to a remote 5.4.11.

David told me he fixed this now as well. So it's available in the next version of BOINC, or the release version of 5.8.2 if that one is the RC. :)

Hi, just tested this with 5.8.3, and although the problem with the Advanced View is fixed when connecting to a remote 5.4.11, the issue with the Simple View is not - it still gives an error and shows no tasks.
ID: 7549

Bob Guy
Joined: 5 Mar 06
Posts: 16
United States
Message 7560 - Posted: 18 Jan 2007, 2:47:56 UTC - in response to Message 7403.  

JM7 fixed the bug that attempted to download 10 days of work for all 10 projects attached (100 days worth of work with the typical deadline being about 2 weeks - not a good idea).

I have no problem with the other fixes included with Boinc 5.8.x, the improved scheduler seems to work much better than 5.4.11 but the displayed estimated time to completion (TOC) needs to be fixed.

Regardless of the changes in the scheduler algorithm, the displayed TOC (re: DCF) ought to adjust itself to match the actual runtimes even if the scheduler uses a different value. I do run multiple projects, typically three at a time now are active.

Are you suggesting that the displayed TOC (re:DCF) will adjust itself (down) to the actual runtime if I were running only one project? The theory being that the scheduler is adjusting the TOC (re:DCF) to some arbitrary value so that I don't download too much work, assuming that I run two or more projects. Even if this is what is happening, the displayed TOC ought to converge to the actual runtime.

Regarding the scheduler and the actual amount of time that Boinc is allowed to run: I often shut down Boinc for part of the day (4 to 12 hours) when I cannot have Boinc affecting other work that the computer is doing. This seems to have absolutely no effect on the displayed TOC (re: DCF) or the amount of work that Boinc is allowed to download. According to what you've said, I might expect the scheduler to download less work, because the scheduler knows (or thinks) that I'm only going to allow Boinc to run part-time. I have not seen this effect at all; the amount of work queued is more or less equal to my connect time as determined by adding up the displayed TOCs. Currently the actual runtime is ~0.56 of the displayed TOC. So, for two active projects at a connect time of 0.5 days, I can queue about 0.25 days of WUs for each project. I'm sure that that is what the scheduler intends to do, and I think that's a nice feature, as it prevents a project from supplying too much work for my computer. In any case I always seem to have an amount of queued WUs equal to about half my connect time, and that amount is exactly equal to the amount that would be allowed if computed from the displayed TOC. However, I would still like the displayed TOC to converge on the actual runtime.
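
The arithmetic in that paragraph can be written out as a small Python sketch. It assumes the connect time is split evenly between the active projects (equal resource shares), which is a simplification; the function name is illustrative only.

```python
def per_project_queue_days(connect_days: float, n_projects: int) -> float:
    """Displayed work queued per project, ASSUMING an even split."""
    return connect_days / n_projects


CONNECT_DAYS = 0.5    # connect time from the paragraph above
N_PROJECTS = 2        # active projects
RUNTIME_RATIO = 0.56  # actual runtime / displayed TOC, as observed

displayed = per_project_queue_days(CONNECT_DAYS, N_PROJECTS)
actual = displayed * RUNTIME_RATIO
total_actual = actual * N_PROJECTS

print(f"displayed per project: {displayed:.2f} days")
print(f"actual per project   : {actual:.2f} days")
print(f"actual total         : {total_actual:.2f} days "
      f"({total_actual / CONNECT_DAYS:.0%} of connect time)")
```

With a ratio of 0.56, the real work on hand comes to a little over half the connect time, which fits the "about half" observation.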

I'll have to look carefully at the "Boinc allowed to run" time and see if that's what the problem is. I don't think that's the problem with the displayed TOC though.

According to the theory (I don't know if it's true) that the TOC is a reflection of the number of projects allowed to run, then if I have three projects running should I see the TOC increase to approximately 3 times the actual runtime? I don't think it will, I don't think it does now. In any case the displayed TOC should converge to the actual runtime, regardless of what the scheduler does with the values. I'll have to try running just one project (suspend the others) and see what happens to the displayed TOC.

Also, running Boinc for 24 hours continuously for a couple of days at a time has absolutely no effect on the displayed TOC as you might expect if the DCF was adjusted with regard to the "Boinc allowed to run" value. I do see how the "Boinc allowed to run" would have an effect on EDF and deadline times, I don't think that behavior has changed.

Thanks for putting up with my observations, I just think that once this Boinc version is let loose on the general community that most users are going to see this TOC disparity and wonder what's going on just as I've done. Then again there is blissful ignorance.
ID: 7560

jgoldsti
Joined: 11 Nov 06
Posts: 12
United States
Message 7562 - Posted: 18 Jan 2007, 4:54:46 UTC

I am running BOINC 5.8.3 for Windows on a Windows XP machine with the profile option "keep applications in memory when preempted" set to no. The running WCG work unit was preempted by another WCG work unit, and the preempted work unit is still loaded in memory according to the Windows Task Manager. Is this a bug in BOINC 5.8.3, or something that is being overridden by the WCG work unit setup and therefore needs to be reported on the WCG forum?
ID: 7562

KSMarksPsych
Joined: 30 Oct 05
Posts: 1239
United States
Message 7563 - Posted: 18 Jan 2007, 5:45:22 UTC
Last modified: 18 Jan 2007, 5:46:23 UTC

I think it is that the 5.8.x client will no longer swap out a result that hasn't checkpointed yet. I think I ran into that when I was testing memory preferences.

If you want to see when apps checkpoint, you'll need to make a cc_config.xml file and drop it in the BOINC directory after shutting down BOINC. I believe the flag you need to set is task_debug. Then restart BOINC. You'll see in the start up messages...

1/17/2007 9:42:12 PM||Starting BOINC client version 5.8.3 for windows_intelx86
1/17/2007 9:42:12 PM||log flags: task, file_xfer, sched_ops, cpu_sched, task_debug
1/17/2007 9:42:12 PM||Libraries: libcurl/7.16.0 OpenSSL/0.9.8a zlib/1.2.3
1/17/2007 9:42:12 PM||Data directory: C:\Program Files\BOINC
1/17/2007 9:42:12 PM||Processor: 1 GenuineIntel Intel(R) Pentium(R) 4 CPU 2.80GHz
1/17/2007 9:42:12 PM||Memory: 446.98 MB physical, 1.78 GB virtual
1/17/2007 9:42:12 PM||Disk: 55.88 GB total, 39.94 GB free




You'll see stuff like this in your logs...


1/17/2007 10:32:59 PM|QMC@HOME|[task_debug] result three_bench22a_jsch2005s22.518_0 checkpointed
1/17/2007 10:32:59 PM|DepSpid|[task_debug] result spider_24040_0 checkpointed
1/17/2007 10:33:38 PM|QMC@HOME|[task_debug] result three_bench22a_jsch2005s22.518_0 checkpointed
1/17/2007 10:33:39 PM|DepSpid|[task_debug] result spider_24039_0 checkpointed


[edit]If you need help setting up that file, let us know. I have one that I can post.[/edit]
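
For reference, a minimal sketch of what such a file could contain, generated here with Python. The cc_config / log_flags / task_debug layout is how I understand the format; treat it as an assumption and check it against the client documentation for your version. The helper name is made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_cc_config(flags) -> str:
    """Build cc_config.xml text with each named log flag switched on."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        # "1" turns the flag on; layout assumed, see the note above.
        ET.SubElement(log_flags, name).text = "1"
    return ET.tostring(root, encoding="unicode")


xml_text = build_cc_config(["task_debug"])
print(xml_text)
```

Save the printed text as cc_config.xml in the BOINC directory while the client is shut down, then restart BOINC as described above.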
Kathryn :o)
ID: 7563

rebirther
Joined: 21 Jun 06
Posts: 156
Germany
Message 7564 - Posted: 18 Jan 2007, 9:11:24 UTC

@KSMarksPsych: If you want to use the cc_config.xml file, you don't need to restart BOINC (>5.8.0). There is a good feature in BOINC, "read config file": you can edit the XML while BOINC is running and trigger the feature again and again ;)
ID: 7564

KSMarksPsych
Joined: 30 Oct 05
Posts: 1239
United States
Message 7569 - Posted: 18 Jan 2007, 14:40:11 UTC - in response to Message 7564.  

@KSMarksPsych: If you want to use the cc_config.xml file, you don't need to restart BOINC (>5.8.0). There is a good feature in BOINC, "read config file": you can edit the XML while BOINC is running and trigger the feature again and again ;)



Well, that's a nifty little feature! I shall have to tuck that away in the depths of my brain.

:)
Kathryn :o)
ID: 7569

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.