Benchmarking bug - indefinite suspension of computing

Les Bayliss
Help desk expert

Joined: 25 Nov 05
Posts: 1654
Australia
Message 16411 - Posted: 3 Apr 2008, 22:29:09 UTC

Would it be possible to have some code for BOINC to get the time from a net clock somewhere, and run its own clock? Do it each time it starts, and just before a benchmark.
Perhaps compare it to the system clock each of these times and store an offset, rather than run its own clock permanently.

Although I do think fiddling with the system clock instead of looking at a real calendar is weird. Even a perpetual calendar in the computer would be better.

ID: 16411
W-K ID 666

Joined: 30 Dec 05
Posts: 456
United Kingdom
Message 16421 - Posted: 4 Apr 2008, 0:49:11 UTC - in response to Message 16406.  

Therefore by default, even if running BOINC projects was not the original objective, BOINC is the computer's primary function. Therefore why not let them use a BOINC project to check, and if desired reset, the clock?


You seem to be saying give them an option. Then some will use the feature while others do not. Those that don't will still have the problem. Better to implement a solution that solves the problem for everybody but does not mess with the clock.


I agree fixing this problem may be the best option, but having the computer clock wrong also affects the scheduler. Therefore having the ability to identify that the clock is wrong, and either posting a warning or having an option to allow the clock to be corrected automatically, would be beneficial.


People tire of warnings popping up even if said warnings are intended to help and even if they truly do need help. They'll want an option to turn the warning off.

I'm with Nicolas and JM7... BOINC has no business adjusting the system clock. BOINC is supposed to just run quietly in the background. If BOINC starts messing with users' ability to screw up their clocks, it will be discovered. Then software reviewers and bloggers will leap on it and scream about BOINC taking over the computer, Big Brother and all that. What could BOINC devs say in their defense? The "we decided you needed to be helped" line won't wash even if they do need help.

Although it would take longer to implement, the better (more reliable) and politically safer fix is the other fix that has been suggested in this thread wherein BOINC, if I understand correctly, would keep its own "clock" and leave the system clock alone.

The thing is, although BOINC is supposed to run quietly in the background, as you put it, if the computer's date/clock is wrong then, as described earlier, BOINC stops running, or the scheduler could be in difficulties: units past deadline, or units that should be in priority mode and are not.
Windows and other OSs have options to sync with an NTP server, but Windows only does it once a week and will NOT adjust if the clock is more than 15 hours out.

And I don't really understand this idea of BOINC running its own clock: where is its point of reference on a computer that has been off, for any reason, and is on limited-use dial-up?

I agree that, on a business computer, adjusting the clock is not a good idea, and the clock is probably time-synced to a local server anyway.

But do you want BOINC to run when it should be running on your computer?
Do you want the scheduler to run the correct units in the correct order?
Would you like a piece of software that spots your stupid mistakes? (If you don't make stupid mistakes, I suppose we'd better tell SETI we just found ET.)

@Les,
As my friend just pointed out, how do you check the day of the week for 10 April 2005 if all you have is the computer in front of you? (Customer with receipt, faulty item with 3-year guarantee, business does not open on Sundays.)
ID: 16421
Nicolas

Joined: 19 Jan 07
Posts: 1179
Argentina
Message 16424 - Posted: 4 Apr 2008, 3:59:18 UTC - in response to Message 16421.  

@Les,
As my friend just pointed out, how do you check the day of the week for 10 April 2005 if all you have is the computer in front of you? (Customer with receipt, faulty item with 3-year guarantee, business does not open on Sundays.)

Not our fault that Windows didn't get that right until Vista.
ID: 16424
W-K ID 666

Joined: 30 Dec 05
Posts: 456
United Kingdom
Message 16425 - Posted: 4 Apr 2008, 4:19:16 UTC

My friend's problem was on a Windows PC running at the point of sale, with access to the company network only and no internet connection. They have a normal wall calendar, but that only shows the current, last and next years. Without leaving your position (and would you, with a person probably making a fraudulent claim?), what calendar do you suggest?

Also, if you are in an office, why bother with the internet? Most computers have an office suite, and all of them, as far as I know, have something similar to the MS Word calendar wizard. Lotus had it before MS, in the early '90s.

In your 'design' of the BOINC clock you make no mention of the network not being available, or of what happens when the computer is switched off. These are the points that confuse me.

Also I am not saying these clock inaccuracy messages and/or correction should be mandatory, but I do think it would be a good option.
ID: 16425
Les Bayliss
Help desk expert

Joined: 25 Nov 05
Posts: 1654
Australia
Message 16426 - Posted: 4 Apr 2008, 4:34:04 UTC

OK, there's 2 ways to check a date:

1) A perpetual calendar (1.5 million web sites!), such as Calendars for the Years 1901 to 2100, and Calendarhome.com. The 2nd looks interesting - 2/3rds down it has a link to Day-of-Week Calculator.

2)
a) (Menu)Suspend BOINC.
b) (Menu) Exit BOINC.
c) Fiddle with clock.
d) Reset clock to correct time.
e) Restart BOINC.
f) Set BOINC to Run.

***************

I've just seen your latest post.
I don't think that running BOINC on a point-of-sale computer is a terribly good idea. Companies get a bit narky about this sort of thing.

ID: 16426
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5078
United Kingdom
Message 16433 - Posted: 4 Apr 2008, 8:24:36 UTC - in response to Message 16426.  

OK, there's 2 ways to check a date:

1) A perpetual calendar (1.5 million web sites!), such as Calendars for the Years 1901 to 2100, and Calendarhome.com. The 2nd looks interesting - 2/3rds down it has a link to Day-of-Week Calculator.

2)
a) (Menu)Suspend BOINC.
b) (Menu) Exit BOINC.
c) Fiddle with clock.
d) Reset clock to correct time.
e) Restart BOINC.
f) Set BOINC to Run.

There's a third way:

3)
a) Double-click on clock in system tray.
b) Fiddle with clock.
c) Click 'cancel'.

***************

I've just seen your latest post.
I don't think that running BOINC on a point-of-sale computer is a terribly good idea. Companies get a bit narky about this sort of thing.

Andy's friend wouldn't have been running BOINC on the POS, because it has no internet connection. For the same reason, he/she wouldn't have been able to Google for any of the proper tools.

I agree with everything that's been said about perpetual calendars. However, everyone who posts here is by definition a Nerd or a Geek, and we understand about things like system integrity.

Microsoft, on the other hand, has spent 12 years (since the release of Windows 95) providing end users with a little facility which looks and feels like a perpetual calendar, and which is always guaranteed to be visible onscreen and one doubleclick away from use (unless you're one of those people who hide the taskbar). Think back twelve years: would you have Googled for a perpetual calendar then? With all the overhead of establishing the dial-up internet connection first?

It's no wonder that people have got into an engrained habit of (ab)using the system clock for date look-ups. And because it's there, and because it's a habit, people will go on using it: and it will go on being a problem until the last copy of Windows XP is consigned to the great bit-bucket in the sky. BOINC just has to live in the real world.

Having a robust, independent, self-validating, self-correcting internal time reference for BOINC is obviously the way forward. But my betting is that that isn't going to be in place this year, for all the reasons that people who've got knowledge of the internal code of BOINC have explained already. In the meantime, can I remind you yet again that there is something that causes an indefinite hang between

Suspending computation - running CPU benchmarks

and

[benchmark_debug] Starting floating-point benchmark

Isn't that worth solving?
ID: 16433
W-K ID 666

Joined: 30 Dec 05
Posts: 456
United Kingdom
Message 16436 - Posted: 4 Apr 2008, 9:25:33 UTC

If BOINC has knowledge that the host's date/time has changed, how come this Odd graph picture is allowed to happen?

As for the problems that have been mentioned with starting BOINC when the network is not available: it could be made to time-sync at the next connection, probably some time in the next 24 hrs. And if the code is so bad that it needs a clock rather than a stopwatch at benchmark time, then only run benchmarks after the next resync at the next connection.
The clock can also be adjusted while BOINC is up and running, and cause some of these problems.
ID: 16436
Pepo
Joined: 3 Apr 06
Posts: 547
Slovakia
Message 16440 - Posted: 4 Apr 2008, 10:46:59 UTC - in response to Message 16433.  

In the meantime, can I remind you yet again that there is something that causes an indefinite hang between

Suspending computation - running CPU benchmarks

and

[benchmark_debug] Starting floating-point benchmark

Isn't that worth solving?

For sure it is.
(If I could just correctly link my compiled client executable...)

Peter
ID: 16440
W-K ID 666

Joined: 30 Dec 05
Posts: 456
United Kingdom
Message 16445 - Posted: 4 Apr 2008, 11:51:19 UTC - in response to Message 16444.  
Last modified: 4 Apr 2008, 11:53:10 UTC

If BOINC has knowledge that the host's date/time has changed, how come this Odd graph picture is allowed to happen?


BOINC does not have knowledge that the host's date/time has changed. The changes I propose would give BOINC that knowledge. The changes you propose would not give BOINC that knowledge.



I disagree because, from what you proposed, if the computer does not have access to an NTP server at startup, BOINC will not start.

In fact, in a lot of cases it probably will not start, because a lot of internet security programs inhibit network access until they have completed their checks.
ID: 16445
Nicolas

Joined: 19 Jan 07
Posts: 1179
Argentina
Message 16447 - Posted: 4 Apr 2008, 16:15:44 UTC - in response to Message 16445.  

BOINC does not have knowledge that the host's date/time has changed. The changes I propose would give BOINC that knowledge. The changes you propose would not give BOINC that knowledge.

I disagree because, from what you proposed, if the computer does not have access to an NTP server at startup, BOINC will not start.

In fact, in a lot of cases it probably will not start, because a lot of internet security programs inhibit network access until they have completed their checks.
How do you suggest a program can know when the time changes?

1. get current time
2. wait 1 second
3. get current time again
4. if both time measurements differ by more than 2 seconds (or if the one at 3. is *lower* than the one at 1.), time changed, so you know you need to do some corrections

That looks like it should work. But nope! What if you suspend your computer? When the computer comes out of suspend mode, it gets the time in step 3, and that would differ by some hours.

And it's also possible that it fails in a normal case, depending on how the "wait 1 second" works internally.
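
One way to sidestep the "wait 1 second" ambiguity in steps 1 to 4 is to compare the wall clock against a monotonic clock instead of against an assumed sleep length. The sketch below is only an illustration, not BOINC code: wall_clock_jumped, ClockSample, sample_clocks and jumped_between are hypothetical names, and the 2-second tolerance is simply the figure from step 4. Whether std::chrono::steady_clock keeps counting while the machine is suspended depends on the platform, so the suspend case may still need separate handling.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>

// Hypothetical helper (not BOINC code): decide whether the wall clock was
// stepped between two samples, given how much time a monotonic clock says
// has really elapsed. All values are in seconds.
bool wall_clock_jumped(double wall_before, double wall_after,
                       double monotonic_elapsed, double tolerance) {
    assert(tolerance >= 0.0);
    double wall_elapsed = wall_after - wall_before;
    // A forward or backward step shows up as a disagreement between the clocks.
    return std::fabs(wall_elapsed - monotonic_elapsed) > tolerance;
}

// One paired reading of both clocks.
struct ClockSample {
    std::chrono::system_clock::time_point wall;
    std::chrono::steady_clock::time_point mono;
};

ClockSample sample_clocks() {
    return { std::chrono::system_clock::now(), std::chrono::steady_clock::now() };
}

// Compare two paired readings taken at any interval apart.
bool jumped_between(const ClockSample& a, const ClockSample& b, double tolerance) {
    std::chrono::duration<double> wall = b.wall - a.wall;
    std::chrono::duration<double> mono = b.mono - a.mono;
    return wall_clock_jumped(0.0, wall.count(), mono.count(), tolerance);
}
```

Because the comparison is between two clocks, it does not matter how long the wait between samples really was, which removes the dependence on how "wait 1 second" works internally.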

ID: 16447
Les Bayliss
Help desk expert

Joined: 25 Nov 05
Posts: 1654
Australia
Message 16450 - Posted: 4 Apr 2008, 18:45:01 UTC

In the event that net access is not available for some reason (another reason: a laptop that is traveling without access, such as on a ship), how about issuing a warning message (popup?), something like: "The system clock appears to have changed, and BOINC needs to access the internet to get the correct time. Continuing for a long period without doing this may cause WUs to be late, and rejected."
Perhaps with an option to manually enter the correct time.

ID: 16450
W-K ID 666

Joined: 30 Dec 05
Posts: 456
United Kingdom
Message 16459 - Posted: 5 Apr 2008, 0:12:01 UTC

I'm glad we got that sorted.
ID: 16459
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5078
United Kingdom
Message 16467 - Posted: 5 Apr 2008, 10:28:02 UTC

Could anyone reading this thread comment on how/when BOINC updates its time stats, please?

When I performed the clock forward / clock back experiment that started this thread, one of the side effects that I noticed was a drop in time metrics:

    <active_frac>0.045279</active_frac>

Six days later, with 24/7 BOINC running (v5.10.45 service install under Windows XP - neither BOINC nor the computer have been restarted), the active_frac remains exactly the same to six decimal places - compare the code above, which is a current paste from client_state, with the figure in my message 16142.

???

I would have expected a similar sort of trap-door function as TDCF - in this case, quick to fall and slow to rise, but no recovery at all?

The frac is so low that even on this medium-speed machine (2.0GHz P4), new Einstein tasks at 50% share go immediately into high priority. If, as I suspect, there's no automatic recovery mechanism (or a broken mechanism) for active_frac, that might explain why so many people report problems with high priority and cache sizes on the various project message boards.

[Please, no sticking-plaster replies: I know what to change and how to change it, but I'm researching whether there's a need to put in another bug report]
ID: 16467
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5078
United Kingdom
Message 16469 - Posted: 5 Apr 2008, 11:29:45 UTC - in response to Message 16468.  
Last modified: 5 Apr 2008, 12:08:35 UTC

Why don't you post the whole top section of the client_state.xml plus some project DCFs, so we can have an integral view rather than this step-by-step guessing game?

From current client_state.xml:

<host_info>
    <timezone>3600</timezone>
    <domain_name>ANONYMOUS</domain_name>
    <ip_addr>192.168.173.13</ip_addr>
    <host_cpid>e90761a879d5bf174f2e7e32671872db</host_cpid>
    <p_ncpus>1</p_ncpus>
    <p_vendor>GenuineIntel</p_vendor>
    <p_model>              Intel(R) Pentium(R) 4 CPU 2.00GHz [x86 Family 15 Model 2 Stepping 4]</p_model>
    <p_features>fpu tsc sse sse2 mmx</p_features>
    <p_fpops>1050903119.868637</p_fpops>
    <p_iops>1698914891.321735</p_iops>
    <p_membw>1000000000.000000</p_membw>
    <p_calculated>1207315872.154749</p_calculated>
    <m_nbytes>536133632.000000</m_nbytes>
    <m_cache>1000000.000000</m_cache>
    <m_swap>1310920704.000000</m_swap>
    <d_total>39990591488.000000</d_total>
    <d_free>5179965440.000000</d_free>
    <os_name>Microsoft Windows XP</os_name>
    <os_version>Home Edition, Service Pack 2, (05.01.2600.00)</os_version>
    <accelerators>NVIDIA GeForce3 Ti 200</accelerators>
</host_info>
<time_stats>
    <on_frac>0.805489</on_frac>
    <connected_frac>-1.000000</connected_frac>
    <active_frac>0.045279</active_frac>
    <cpu_efficiency>0.937373</cpu_efficiency>
    <last_update>1209561940.387707</last_update>
</time_stats>
<net_stats>
    <bwup>6416.736822</bwup>
    <avg_up>29442415.841511</avg_up>
    <avg_time_up>1207383404.717249</avg_time_up>
    <bwdown>61101.566988</bwdown>
    <avg_down>1165019550.171246</avg_down>
    <avg_time_down>1207380038.842249</avg_time_down>
</net_stats>

Einstein:

   <duration_correction_factor>0.342887</duration_correction_factor>

SETI:

    <duration_correction_factor>0.260244</duration_correction_factor>

(both with Power/Optimised apps, respectively). No other Project entries.

From a client_state.xml.bak file dated 26 May 2006 - must have been the last time I used BoincDV to reset debts:

<time_stats>
    <on_frac>0.998095</on_frac>
    <connected_frac>1.000000</connected_frac>
    <active_frac>0.999851</active_frac>
    <cpu_efficiency>0.949493</cpu_efficiency>
    <last_update>1148667485.875000</last_update>
</time_stats>

- I would judge that to be pretty normal for this machine: _efficiency @ ~95% is partly because it's the BoincView monitoring host for my LAN.

Since there is no easy way to monitor changes in debt values over time (the subject of a different bug report), I wrote myself a small utility to record and graph project debt values. I'll adapt it to log time_stats over time, and report next weekend. Any other tags you would like a time series for?

Edit - here are the Einstein tasks for this host. The report/fetch contact at 5 Apr 2008 11:45:46 UTC today was triggered as the LTD from the last high priority run rose above -3600.

Initial metrics for the new task are:
Computation time to completion: 16 hours 34 minutes
'Work buffer' from BoincView: 16 days 12 hours

Project shares are equal (50% - 100::100)

Task immediately went into high priority:

05/04/2008 12:43:48|Einstein@Home|Sending scheduler request: To fetch work.  Requesting 1490 seconds of work, reporting 1 completed tasks
05/04/2008 12:43:58|Einstein@Home|Scheduler request succeeded: got 1 new tasks
05/04/2008 12:44:00|Einstein@Home|Starting h1_0907.30_S5R3__78_S5R3b_0
05/04/2008 12:44:08|Einstein@Home|Starting task h1_0907.30_S5R3__78_S5R3b_0 using einstein_S5R3 version 436

2nd. edit: Another observation - I run a 0.01 day CI, and a 1 day AC. That work fetch would normally be for (87264 - ε) seconds, where ε = ~4,363 seconds for a 95% CPU efficiency. That demonstrates how the active_frac corruption impinges on work fetch.
ID: 16469
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5078
United Kingdom
Message 16471 - Posted: 5 Apr 2008, 13:11:28 UTC - in response to Message 16470.  

For now, suggest to exit BOINC, open client_state.xml with an ASCII text editor and set that value to the March 26 one. Also set the DCFs to 1.000000 so at least crunching and work fetching return to normality. The DCFs are indicative of the situation slowly returning to normality... right now BOINC figures things complete much faster than the other parms indicate.

That's exactly the reply I was trying to avoid. In my original post (message 16467), I put a footnote in small print. If you had followed recommended Forum practice, and used 'Reply to Post' (for threading purposes), instead of 'Post to Thread', you would have seen it.

Since you clearly missed it, here it is for the visually-impaired:

Please, no sticking-plaster replies: I know what to change and how to change it, but I'm researching whether there's a need to put in another bug report
ID: 16471
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5078
United Kingdom
Message 16474 - Posted: 5 Apr 2008, 15:05:51 UTC
Last modified: 5 Apr 2008, 15:32:31 UTC

OK, this is a new post - the previous poster wasn't worth replying to, so I won't reply.

He was, however, right to say that I have all the data available, and he was also right to say that I should have posted the whole story.

The key datum is

<last_update>1209561940.387707</last_update>

in the <time_stats> in my post of 11:29 UTC.

Using http://www.onlineconversion.com/unix_time.htm, that equates to Wed, 30 Apr 2008 13:25:40 UTC - still 25 days in the future.
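
The conversion can also be reproduced locally without the web page. This is a minimal sketch, assuming the default "C" locale (so that %a and %b give English abbreviations); format_utc is a hypothetical helper name, and the fractional part of last_update is simply dropped.

```cpp
#include <cassert>
#include <ctime>
#include <string>

// Format a Unix timestamp (seconds since 1970-01-01 00:00:00 UTC)
// as a human-readable UTC date string.
std::string format_utc(std::time_t t) {
    std::tm* utc = std::gmtime(&t);   // broken-down time in UTC
    assert(utc != nullptr);
    char buf[64];
    std::size_t n = std::strftime(buf, sizeof buf,
                                  "%a, %d %b %Y %H:%M:%S UTC", utc);
    return std::string(buf, n);
}
```

Calling format_utc(1209561940) gives "Wed, 30 Apr 2008 13:25:40 UTC", the same date as above.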

Here are lines 127-151 of time_stats.C:
// Update time statistics based on current activities
// NOTE: we don't set the state-file dirty flag here,
// so these get written to disk only when other activities
// cause this to happen.  Maybe should change this.
//
void TIME_STATS::update(int suspend_reason) {
    double dt, w1, w2;

    bool is_active = !(suspend_reason & ~SUSPEND_REASON_CPU_USAGE_LIMIT);
    if (last_update == 0) {
        // this is the first time this client has executed.
        // Assume that everything is active

        on_frac = 1;
        connected_frac = 1;
        active_frac = 1;
        first = false;
        last_update = gstate.now;
        log_append("power_on", gstate.now);
    } else {
        dt = gstate.now - last_update;
        [color=red]if (dt <= 10) return;[/color]
        w1 = 1 - exp(-dt/ALPHA);    // weight for recent period
        w2 = 1 - w1;                // weight for everything before that
                                    // (close to zero if long gap)

(sorry, I can't use [code] for code, because of the indent bug on these boards)

I call BUG at line 148 (highlighted).

This contains an implied assumption that time is always monotonic (i.e. the clock hasn't been fiddled with - which is where we came in).

The intention of the test is clearly to reduce workload by only re-calculating active_frac at intervals of 10 time_units or more: the effect is to inhibit updating following a clock-fiddle until (MAX(clock) + 10) is reached.

The test should be

[color=red]if (ABS(dt) <= 10) return;[/color]
(or whatever the C construct is - sorry, I'm a VB programmer)
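
For what it's worth, the C/C++ construct for a double is fabs() (std::fabs in C++). The sketch below isolates just the guard so the two versions can be compared; skip_update_original, skip_update_proposed and demo_future_last_update are hypothetical stand-ins, not functions in time_stats.C, and the "now" value is an assumed timestamp in early April 2008.

```cpp
#include <cassert>
#include <cmath>

// Guard as it appears in the excerpt: returns true when the update is skipped.
bool skip_update_original(double now, double last_update) {
    double dt = now - last_update;
    return dt <= 10;               // also true for every negative dt
}

// Guard with the proposed change.
bool skip_update_proposed(double now, double last_update) {
    double dt = now - last_update;
    return std::fabs(dt) <= 10;    // skip only when within 10 s either way
}

// Usage: last_update left in the future by the clock experiment.
void demo_future_last_update() {
    const double last_update = 1209561940.387707; // value from client_state.xml
    const double now = 1207400000.0;              // assumed: about 5 Apr 2008
    assert(skip_update_original(now, last_update));   // skipped until the clock catches up
    assert(!skip_update_proposed(now, last_update));  // update is allowed to proceed
}
```

One caveat, which is a reading of the excerpt rather than anything confirmed in this thread: once a negative dt gets past the guard, the following line w1 = 1 - exp(-dt/ALPHA) would give a negative weight (assuming ALPHA is positive), so a complete patch would probably also have to decide what to do with a backwards step, for example by resetting last_update.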

Now, I suppose it's up to me to find the line number of the original benchmarking bug.

Correction: the 10 time_unit test is at line number 69 of http://boinc.berkeley.edu/trac/browser/trunk/boinc/client/time_stats.C?rev=4610 - I got the first number from my Visual Studio editor, working on the text version of the file which the BOINC/Wiki search function found first.
ID: 16474
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5078
United Kingdom
Message 16477 - Posted: 5 Apr 2008, 16:46:12 UTC

Found it!

(Subject to checking and validation - please confirm)

http://boinc.berkeley.edu/trac/browser/trunk/boinc/client/cs_benchmark.C?rev=12128
307 bool CLIENT_STATE::cpu_benchmarks_poll() { 
308     int i; 
309     static double last_time = 0; 
310     if (!benchmarks_running) return false; 
311  
312     if (now < last_time + 1) return false; 
313     last_time = now; 
314  
315     active_tasks.send_heartbeats(); 
If benchmarks have been run in the current BOINC session while the clock was set to some time in the future (as a result of the clock fumbling we've been talking about), the static variable last_time will have been set to that future time: for example, Wed, 30 Apr 2008 13:25:40 UTC.

So the test at line 312 will be satisfied, and the application will loop until the cows come home (or Wed, 30 Apr 2008 13:25:40 UTC, whichever comes sooner).

That explains why exiting BOINC and re-starting allows benchmarks to run properly: the static variable does not survive the restart, and is correctly initialised to zero again.

Solution: explicitly set the value of last_time to zero on all possible exit routes out of the benchmarking loop, so that it's properly initialised for next time.

NB that's OK: this is a timing variable for the benchmark duration, nothing to do with the 5-day interval between benchmarks. That's tested at
250     double diff = now - host_info.p_calculated; 
251     if (diff < 0) return true; 
252  
253     return ((run_cpu_benchmarks || diff > BENCHMARK_PERIOD)); 
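
The proposed reset of last_time can be sketched with a simplified model. This is not the real client code: BenchmarkPoller and its members are hypothetical, the function-static variable is modelled as a struct member so that both behaviours can be shown side by side, and the timestamps are illustrative.

```cpp
#include <cassert>

// Simplified, hypothetical model of the guard in cpu_benchmarks_poll().
struct BenchmarkPoller {
    double last_time = 0;            // plays the role of the static variable
    bool benchmarks_running = false;

    void start() { benchmarks_running = true; }

    // Current behaviour as described above: last_time keeps its old value.
    void stop_without_reset() { benchmarks_running = false; }

    // Proposed behaviour: clear last_time on the way out of the benchmark loop.
    void stop() { benchmarks_running = false; last_time = 0; }

    // Returns true when the poll gets past the timing guard.
    bool poll(double now) {
        if (!benchmarks_running) return false;
        if (now < last_time + 1) return false;   // the test at line 312
        last_time = now;
        return true;
    }
};

// Usage: benchmarks run once with the clock set ahead, then again after it is put back.
void demo_stuck_and_fixed() {
    BenchmarkPoller p;
    p.start();
    assert(p.poll(1209561940.0));     // clock set ahead: 30 Apr 2008
    p.stop_without_reset();
    p.start();
    assert(!p.poll(1207400000.0));    // clock back in early April: guard never passes
    p.stop();                         // proposed reset
    p.start();
    assert(p.poll(1207400000.0));     // polls normally again
}
```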
ID: 16477

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.