BOINC client exits.

Jean-David

Joined: 19 Dec 05
Posts: 89
United States
Message 17885 - Posted: 16 Jun 2008, 10:02:18 UTC

I am running Red Hat Enterprise Linux 5.2 on a dual hyperthreaded Xeon machine with 8 GBytes RAM. My boinc client was from boinc_5.10.45_i686-pc-linux-gnu.sh. It has been running fine for months if not longer. I normally run it 24/7.

Yesterday and today, the boinc client has exited with the following in the error log:

Skipping: 100
Skipping: /max_ncpus_pct
Unrecognized XML in GLOBAL_PREFS::parse_override: max_ncpus_pct
Skipping: 100
Skipping: /max_ncpus_pct
Unrecognized XML in GLOBAL_PREFS::parse_override: max_ncpus_pct
Skipping: 100
Skipping: /max_ncpus_pct
Unrecognized XML in GLOBAL_PREFS::parse_override: max_ncpus_pct
Skipping: 100
Skipping: /max_ncpus_pct
Unrecognized XML in GLOBAL_PREFS::parse_override: max_ncpus_pct
Skipping: 100
Skipping: /max_ncpus_pct
Unrecognized XML in GLOBAL_PREFS::parse_override: max_ncpus_pct
Skipping: 100
Skipping: /max_ncpus_pct
Unrecognized XML in GLOBAL_PREFS::parse_override: max_ncpus_pct
Skipping: 100
Skipping: /max_ncpus_pct
Unrecognized XML in GLOBAL_PREFS::parse_override: max_ncpus_pct
Skipping: 100
Skipping: /max_ncpus_pct
Unrecognized XML in GLOBAL_PREFS::parse_override: max_ncpus_pct
Skipping: 100
Skipping: /max_ncpus_pct
Unrecognized XML in GLOBAL_PREFS::parse_override: max_ncpus_pct
Skipping: 100
Skipping: /max_ncpus_pct
SIGSEGV: segmentation violation
Stack trace (2 frames):
/home/boinc/BOINC/boinc[0x808e90a]
[0xe38420]

Exiting...
Unrecognized XML in GLOBAL_PREFS::parse_override: max_ncpus_pct
Skipping: 100
Skipping: /max_ncpus_pct
SIGSEGV: segmentation violation
Stack trace (2 frames):
/home/boinc/BOINC/boinc[0x808e90a]
[0x679420]

Exiting...

The error.log file is listed as:

-rw-r--r-- 1 root root 1361 Jun 16 01:32 error.log

and this represents two crashes. I restarted it after the first one and it ran around 24 hours before the second one.

I have no clue what this means (other than that it exits). I know what a segmentation violation is, but I do not know how to interpret the error log. Should I reinstall the program, or what?

P.S.: I am running climateprediction, hydrogen@home, rosetta@home, seti@home, worldcommunitygrid, malariacontrol, and predictor@home (no work in a long time), if that matters.
ID: 17885
Jord
Volunteer tester
Help desk expert

Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 17886 - Posted: 16 Jun 2008, 10:14:21 UTC - in response to Message 17885.  

These are benign errors.
Skipping: 100
Skipping: /max_ncpus_pct
Unrecognized XML in GLOBAL_PREFS::parse_override: max_ncpus_pct

They occur because the science application isn't compiled with the latest BOINC version, while the back-end is running the latest version. The max_ncpus_pct only works on BOINC 6.
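
If you want to make the messages go away, you could check whether the local override preferences file is what carries the tag. A minimal sketch, assuming the standard override file name global_prefs_override.xml (adjust the path to your BOINC data directory):

grep -n max_ncpus_pct /home/boinc/BOINC/global_prefs_override.xml
# If the tag shows up, deleting that line (or rewriting the file from a
# BOINC 6 manager's preferences dialog) stops the "Unrecognized XML" noise.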

Which leaves this as the real fault.
SIGSEGV: segmentation violation

Possible causes include:

- Bad task.
- Bad RAM.
- Bad CPU.
- Bad Page File.
- Bad disk.
- Heat.

Since you had two in a row and both times the client crashed as well, go check your CPU and RAM first and foremost. Do check that your system isn't clogged up with dust.
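
If you are curious where in the client that address falls, you could feed it to addr2line from binutils. A sketch, assuming the /home/boinc/BOINC/boinc binary is the same one that crashed (without debug symbols it will only print ??):

addr2line -f -e /home/boinc/BOINC/boinc 0x808e90a
# prints the function name and source line for the stack-trace address,
# which tells the developers roughly where the client died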
ID: 17886
Jean-David

Joined: 19 Dec 05
Posts: 89
United States
Message 17890 - Posted: 16 Jun 2008, 11:58:36 UTC - in response to Message 17886.  

These are benign errors.
Skipping: 100
Skipping: /max_ncpus_pct
Unrecognized XML in GLOBAL_PREFS::parse_override: max_ncpus_pct

They occur because the science application isn't compiled with the latest BOINC version, while the back-end is running the latest version. The max_ncpus_pct only works on BOINC 6.


I see you must be right, both because these messages are absent in the second crash and because I restarted the boinc client, got one of them again, and everything is running fine.

Which leaves this as the real fault.
SIGSEGV: segmentation violation

Possible causes include:

- Bad task.

Is this something I can fix, e.g. by resetting the project? I do not know which project, if any, is causing the problem. The newest one I am running is hydrogen@home, so if I must suspect one, that is the one I would pick. I will suspend it and see what happens tomorrow.

- Bad RAM.

I ran memtest86 overnight on March 17 for 9 hours and it found no errors. Of course, the memory (half of it over 4 years old and the other half about 2 years old) could have gone bad since then.

- Bad CPU.

I have two of these. I do not really know how to test them other than by running a lot of stuff. Right now, running BOINC is the toughest test because it keeps the CPUs almost 100% busy 24/7, but as a test it is not much good. Nothing else is crashing.

- Bad Page File.

Possible, but unlikely. For one thing, this machine almost never (less than once a day) pages:

Mem: 8185240k total, 7858392k used, 326848k free, 175276k buffers
Swap: 4096496k total, 612k used, 4095884k free, 6547776k cached

$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
4 0 612 362244 175388 6547752 0 0 43 22 15 21 87 4 9 0 0
4 0 612 360508 175412 6547768 0 0 0 31 1115 653 95 4 1 0 0
4 0 612 333484 175436 6518692 0 0 0 7 1192 881 84 14 3 0 0
4 0 612 285976 175476 6540288 0 0 0 3854 1060 430 76 23 1 0 0
4 0 612 317520 175492 6542724 0 0 0 2996 1176 1088 86 6 8 0 0
5 0 612 330000 175520 6542720 0 0 0 14 1105 783 94 4 2 0 0
4 0 612 411532 175592 6543976 0 0 258 111 1118 1108 91 6 3 0 0

- Bad disk.

I have 6 hard drives, but 4 are used only for a database application. The temperatures of the hard drives around the times of the crashes were:

Jun 15 00:54:48 trillian smartd[3991]: Device: /dev/sda, Temperature changed -2 Celsius to 38 Celsius since last report
Jun 15 00:54:48 trillian smartd[3991]: Device: /dev/sdb, Temperature changed -2 Celsius to 40 Celsius since last report

Jun 15 20:24:48 trillian smartd[3991]: Device: /dev/sda, Temperature changed -2 Celsius to 38 Celsius since last report
Jun 15 20:24:48 trillian smartd[3991]: Device: /dev/sdb, Temperature changed -2 Celsius to 40 Celsius since last report
...
Jun 16 01:54:48 trillian smartd[3991]: Device: /dev/sda, Temperature changed -3 Celsius to 35 Celsius since last report
Jun 16 01:54:48 trillian smartd[3991]: Device: /dev/sdb, Temperature changed -2 Celsius to 38 Celsius since last report

so they are not too hot.

smartctl reveals they are working fine:

# /usr/sbin/smartctl -a /dev/sda
smartctl version 5.36 [i686-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: MAXTOR ATLAS10K5_73WLS Version: JNZH
Serial number: D21C7CZK
Device type: disk
Transport protocol: Parallel SCSI (SPI-4)
Local Time is: Mon Jun 16 07:46:04 2008 EDT
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK

Current Drive Temperature: 37 C
Manufactured in week 05 of year
Current start stop count: 1074003968 times
Recommended maximum start stop count: 1124401151 times
Elements in grown defect list: 0

Error counter log:
          Errors Corrected by           Total    Correction     Gigabytes    Total
              ECC          rereads/     errors   algorithm      processed    uncorrected
          fast | delayed   rewrites   corrected  invocations   [10^9 bytes]  errors
read:    694147        0          0          0            0       2315.327            0
write:        0        0          0          0            0       3568.492            0

Non-medium error count: 87

Last n error events log page
No self-tests have been logged
Long (extended) Self Test duration: 1440 seconds [24.0 minutes]

# /usr/sbin/smartctl -a /dev/sdb
smartctl version 5.36 [i686-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: MAXTOR ATLAS10K5_73WLS Version: JNZH
Serial number: D21C6NBK
Device type: disk
Transport protocol: Parallel SCSI (SPI-4)
Local Time is: Mon Jun 16 07:46:20 2008 EDT
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK

Current Drive Temperature: 39 C
Manufactured in week 05 of year
Current start stop count: 1074003968 times
Recommended maximum start stop count: 1124401151 times
Elements in grown defect list: 0

Error counter log:
          Errors Corrected by           Total    Correction     Gigabytes    Total
              ECC          rereads/     errors   algorithm      processed    uncorrected
          fast | delayed   rewrites   corrected  invocations   [10^9 bytes]  errors
read:     14017       70          0          0            0       1169.196            0
write:        0        0          0          0            0       1945.042            0

Non-medium error count: 129

Last n error events log page
No self-tests have been logged
Long (extended) Self Test duration: 1440 seconds [24.0 minutes]

- Heat.

Sun Jun 15 04:00:01 EDT 2008
w83627hf-isa-0290
Adapter: ISA adapter
VCore: +1.46 V (min = +1.36 V, max = +1.47 V)
+3.3V: +3.31 V (min = +3.14 V, max = +3.46 V)
VBat: +3.17 V (min = +2.40 V, max = +3.60 V)
+5V: +4.92 V (min = +4.84 V, max = +5.24 V)
+12V: +11.86 V (min = +11.49 V, max = +12.59 V)
-12V: -11.78 V (min = -13.02 V, max = -11.37 V)
V5SB: +5.43 V (min = +4.84 V, max = +5.24 V)
CPU0 fan: 3516 RPM (min = 1592 RPM, div = 8)
CPU1 fan: 2556 RPM (min = 1592 RPM, div = 8)
System: +42 C (high = +50 C, hyst = +48 C) sensor = thermistor
CPU0: +56.5 C (high = +60 C, hyst = +58 C) sensor = thermistor
CPU1: +54.5 C (high = +60 C, hyst = +58 C) sensor = thermistor
vid: +1.525 V (VRM Version 9.0)
alarms: Chassis intrusion detection
beep_enable:
Sound alarm disabled

Sun Jun 15 04:15:01 EDT 2008
w83627hf-isa-0290
Adapter: ISA adapter
VCore: +1.46 V (min = +1.36 V, max = +1.47 V)
+3.3V: +3.31 V (min = +3.14 V, max = +3.46 V)
VBat: +3.18 V (min = +2.40 V, max = +3.60 V)
+5V: +4.95 V (min = +4.84 V, max = +5.24 V)
+12V: +11.98 V (min = +11.49 V, max = +12.59 V)
-12V: -11.78 V (min = -13.02 V, max = -11.37 V)
V5SB: +5.43 V (min = +4.84 V, max = +5.24 V)
CPU0 fan: 2909 RPM (min = 1592 RPM, div = 8)
CPU1 fan: 2410 RPM (min = 1592 RPM, div = 8)
System: +41 C (high = +50 C, hyst = +48 C) sensor = thermistor
CPU0: +41.0 C (high = +60 C, hyst = +58 C) sensor = thermistor
CPU1: +40.5 C (high = +60 C, hyst = +58 C) sensor = thermistor
vid: +1.525 V (VRM Version 9.0)
alarms: Chassis intrusion detection
beep_enable:
Sound alarm disabled


Mon Jun 16 01:30:01 EDT 2008
w83627hf-isa-0290
Adapter: ISA adapter
VCore: +1.44 V (min = +1.36 V, max = +1.47 V)
+3.3V: +3.31 V (min = +3.14 V, max = +3.46 V)
VBat: +3.18 V (min = +2.40 V, max = +3.60 V)
+5V: +4.92 V (min = +4.84 V, max = +5.24 V)
+12V: +11.86 V (min = +11.49 V, max = +12.59 V)
-12V: -11.87 V (min = -13.02 V, max = -11.37 V)
V5SB: +5.43 V (min = +4.84 V, max = +5.24 V)
CPU0 fan: 3375 RPM (min = 1592 RPM, div = 8)
CPU1 fan: 2482 RPM (min = 1592 RPM, div = 8)
System: +41 C (high = +50 C, hyst = +48 C) sensor = thermistor
CPU0: +56.0 C (high = +60 C, hyst = +58 C) sensor = thermistor
CPU1: +54.0 C (high = +60 C, hyst = +58 C) sensor = thermistor
vid: +1.525 V (VRM Version 9.0)
alarms: Chassis intrusion detection
beep_enable:
Sound alarm disabled

Mon Jun 16 01:45:01 EDT 2008
w83627hf-isa-0290
Adapter: ISA adapter
VCore: +1.47 V (min = +1.36 V, max = +1.47 V)
+3.3V: +3.31 V (min = +3.14 V, max = +3.46 V)
VBat: +3.18 V (min = +2.40 V, max = +3.60 V)
+5V: +4.97 V (min = +4.84 V, max = +5.24 V)
+12V: +11.92 V (min = +11.49 V, max = +12.59 V)
-12V: -11.70 V (min = -13.02 V, max = -11.37 V)
V5SB: +5.46 V (min = +4.84 V, max = +5.24 V)
CPU0 fan: 2482 RPM (min = 1592 RPM, div = 8)
CPU1 fan: 2220 RPM (min = 1592 RPM, div = 8)
System: +39 C (high = +50 C, hyst = +48 C) sensor = thermistor
CPU0: +37.0 C (high = +60 C, hyst = +58 C) sensor = thermistor
CPU1: +36.5 C (high = +60 C, hyst = +58 C) sensor = thermistor
vid: +1.525 V (VRM Version 9.0)
alarms: Chassis intrusion detection
beep_enable:
Sound alarm disabled

So they were not too hot either.

Since you had two in a row and both times the client crashed as well, go check your CPU and RAM first and foremost. Do check that your system isn't clogged up with dust.


I just looked at the regular log (not the error log) at the times of the crashes (they were not at the same time of day, so cron is probably not causing the problem). At about the times of the crashes, it says:

15-Jun-2008 04:10:20 [Hydrogen@Home] Sending scheduler request: To report completed tasks. Requesting 25264 seconds of work, reporting 1 completed tasks
15-Jun-2008 04:10:25 [Hydrogen@Home] Scheduler request succeeded: got 0 new tasks
15-Jun-2008 04:10:25 [Hydrogen@Home] Message from server: No work sent
Resuming CPDN!
Resuming CPDN!
Resuming CPDN!
hadcm3istd_0bkp_1920_160_15936335 - PH 1 TS 1400977 A - 19/12/1974 00:30 - H:M:S=1587:15:10 AVG= 4.08 DLT= 0.99
Cleaning up graphics data...
Detaching shared memory...
Cleaning up graphics data...
Detaching shared memory...
15-Jun-2008 07:02:09 [---] Starting BOINC client version 5.10.45 for i686-pc-linux-gnu

and

16-Jun-2008 01:32:26 [Hydrogen@Home] Sending scheduler request: To fetch work. Requesting 26832 seconds of work, reporting 0 completed tasks
16-Jun-2008 01:32:31 [Hydrogen@Home] Scheduler request succeeded: got 0 new tasks
16-Jun-2008 01:32:31 [Hydrogen@Home] Message from server: No work sent
hadam3h_c_61s04_2000_2000_0 - PH 1 TS 0027517 A - 12/10/2000 02:10 - H:M:S=0374:17:26 AVG=48.97 DLT=35.81
hadam3h_c_61s04_2000_2000_0 - PH 1 TS 0027518 A - 12/10/2000 02:20 - H:M:S=0374:18:00 AVG=48.97 DLT=34.06
Resuming CPDN!
Resuming CPDN!
Cleaning up graphics data...
Detaching shared memory...
Cleaning up graphics data...
Detaching shared memory...
16-Jun-2008 06:05:46 [---] Starting BOINC client version 5.10.45 for i686-pc-linux-gnu



ID: 17890
Jean-David

Joined: 19 Dec 05
Posts: 89
United States
Message 17910 - Posted: 17 Jun 2008, 10:28:19 UTC - in response to Message 17886.  

These are benign errors.

Which leaves this as the real fault.
SIGSEGV: segmentation violation

Possible causes include:

- Bad task.
- Bad RAM.
- Bad CPU.
- Bad Page File.
- Bad disk.
- Heat.

Since you had two in a row and both times the client crashed as well, go check your CPU and RAM first and foremost. Do check that your system isn't clogged up with dust.


It has stayed up all day and all night, and at this instant it is still up. The only change I made was to suspend the hydrogen@home project. I do not know if this is a red herring or not.

P.S.: I note that SIGSEGV is also raised on a stack overflow. For me, the stack limit is:

$ ulimit -a

data seg size (kbytes, -d) unlimited

max locked memory (kbytes, -l) 32
max memory size (kbytes, -m) unlimited

stack size (kbytes, -s) 10240 <---<<<

This, too, may not mean anything. 10 megabytes seems like a lot.
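
A stack overflow is easy enough to rule out by raising the limit in the shell that launches the client before the next run. A minimal sketch (10240 KiB is the usual default):

ulimit -s            # show the current limit; 10240 on this system
ulimit -s 65536      # raise it to 64 MB for this shell and its children
# then start the boinc client from this same shell
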
I doubt the death of a child will kill the boinc client. Is this correct (that it will not)?



ID: 17910
Jord
Volunteer tester
Help desk expert

Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 17911 - Posted: 17 Jun 2008, 10:36:12 UTC - in response to Message 17910.  
Last modified: 17 Jun 2008, 10:38:35 UTC

I doubt the death of a child will kill the boinc client. Is this correct (that it will not)?

It shouldn't do that anymore, but it is still possible. Sometimes if an application goes it takes the whole client along.

Would you care to test if it's hydrogen that's causing it, by running mainly that project for a bit? It would help Jack fix possible problems with his application(s).

(What you could try is exit BOINC, delete all the files in the \projects\hydrogenathome.org\ directory and restart BOINC, then allow H2 to fetch work again for the test. In the mean time I'll ask Jack to come take a look here).
ID: 17911
Jean-David

Joined: 19 Dec 05
Posts: 89
United States
Message 18085 - Posted: 28 Jun 2008, 17:32:03 UTC - in response to Message 17911.  
Last modified: 28 Jun 2008, 18:28:08 UTC

I doubt the death of a child will kill the boinc client. Is this correct (that it will not)?

It shouldn't do that anymore, but it is still possible. Sometimes if an application goes it takes the whole client along.

Would you care to test if it's hydrogen that's causing it, by running mainly that project for a bit? It would help Jack fix possible problems with his application(s).

(What you could try is exit BOINC, delete all the files in the \projects\hydrogenathome.org\ directory and restart BOINC, then allow H2 to fetch work again for the test. In the mean time I'll ask Jack to come take a look here).

I did not notice your post until just now.

I then stopped the BOINC client, got into ~/BOINC/projects and did
rm -fr hydrogenathome.org.
I then restarted the BOINC client and less than a minute later, the hydrogenathome.org directory was refilled with files -- with the current date.

I have set the other projects to get no new tasks. So pretty soon, all but climate prediction should stop running. I hope hydrogen will supply me with enough work.

BTW, it had not crashed for over a week until this morning. It was not hot, either outside or inside my computer, at the time of the crash.
ID: 18085
Jean-David

Joined: 19 Dec 05
Posts: 89
United States
Message 18166 - Posted: 2 Jul 2008, 17:25:33 UTC - in response to Message 18085.  

I doubt the death of a child will kill the boinc client. Is this correct (that it will not)?

It shouldn't do that anymore, but it is still possible. Sometimes if an application goes it takes the whole client along.

Would you care to test if it's hydrogen that's causing it, by running mainly that project for a bit? It would help Jack fix possible problems with his application(s).

(What you could try is exit BOINC, delete all the files in the \projects\hydrogenathome.org\ directory and restart BOINC, then allow H2 to fetch work again for the test. In the mean time I'll ask Jack to come take a look here).


Today, the BOINC client crashed on my other machine as well. So it is not a hardware problem with my first machine. I cannot prove it was hydrogen@home, because that client was running anything that came along from this list:

[SETI@home]
[climateprediction.net]
[Predictor @ Home]
[rosetta@home]
[Hydrogen@Home]
[malariacontrol.net]
[World Community Grid]

However, the last lines of my client log were:

02-Jul-2008 12:10:04 [Hydrogen@Home] Sending scheduler request: To fetch work. Requesting 13357 seconds of work, reporting 0 completed tasks
02-Jul-2008 12:10:09 [Hydrogen@Home] Scheduler request succeeded: got 1 new tasks
02-Jul-2008 12:10:11 [Hydrogen@Home] Started download of 1215014998_nsc1800.mol2
02-Jul-2008 12:10:11 [Hydrogen@Home] Started download of 1215014998_pdb1c5z.pdb
02-Jul-2008 12:10:12 [Hydrogen@Home] Finished download of 1215014998_nsc1800.mol2
02-Jul-2008 12:10:12 [Hydrogen@Home] Started download of ad_nsc1800.mol2_pdb1c5z.pdb_1215014998
02-Jul-2008 12:10:13 [Hydrogen@Home] Finished download of 1215014998_pdb1c5z.pdb
02-Jul-2008 12:10:13 [Hydrogen@Home] Finished download of ad_nsc1800.mol2_pdb1c5z.pdb_1215014998
02-Jul-2008 12:10:19 [Hydrogen@Home] Sending scheduler request: To fetch work. Requesting 13270 seconds of work, reporting 0 completed tasks
02-Jul-2008 12:10:24 [Hydrogen@Home] Scheduler request succeeded: got 0 new tasks
02-Jul-2008 12:10:24 [Hydrogen@Home] Message from server: No work sent
02-Jul-2008 12:11:25 [Hydrogen@Home] Sending scheduler request: To fetch work. Requesting 13270 seconds of work, reporting 0 completed tasks
02-Jul-2008 12:11:30 [Hydrogen@Home] Scheduler request succeeded: got 0 new tasks
02-Jul-2008 12:11:30 [Hydrogen@Home] Message from server: No work sent
02-Jul-2008 12:12:30 [Hydrogen@Home] Sending scheduler request: To report completed tasks. Requesting 13358 seconds of work, reporting 1 completed tasks
02-Jul-2008 12:12:35 [Hydrogen@Home] Scheduler request succeeded: got 0 new tasks
02-Jul-2008 12:12:35 [Hydrogen@Home] Message from server: No work sent
02-Jul-2008 12:13:35 [Hydrogen@Home] Sending scheduler request: To fetch work. Requesting 13358 seconds of work, reporting 0 completed tasks
02-Jul-2008 12:13:40 [Hydrogen@Home] Scheduler request succeeded: got 0 new tasks
02-Jul-2008 12:13:40 [Hydrogen@Home] Message from server: No work sent
02-Jul-2008 12:14:41 [Hydrogen@Home] Sending scheduler request: To fetch work. Requesting 13358 seconds of work, reporting 0 completed tasks
02-Jul-2008 12:14:46 [Hydrogen@Home] Scheduler request succeeded: got 0 new tasks
02-Jul-2008 12:14:46 [Hydrogen@Home] Message from server: No work sent

hadsm3fub_jqg3_005952406 - PH 1 TS 0095329 A - 07/06/1816 00:30 - H:M:S=0392:42:50 AVG=14.83 DLT= 6.96
02-Jul-2008 12:16:01 [Hydrogen@Home] Sending scheduler request: To fetch work. Requesting 13358 seconds of work, reporting 0 completed tasks
02-Jul-2008 12:16:06 [Hydrogen@Home] Scheduler request succeeded: got 0 new tasks
02-Jul-2008 12:16:06 [Hydrogen@Home] Message from server: No work sent
02-Jul-2008 12:17:39 [Hydrogen@Home] Starting <B0>*<98><DD>mol2_pdb1c5z.pdb_1215014998_0
Cleaning up graphics data...
Detaching shared memory...

and the last few lines from the error log were:

SIGSEGV: segmentation violation
Stack trace (2 frames):
/boinc/BOINC/boinc[0x808e90a]
/lib/tls/libc.so.6[0xa4e908]

Exiting...

These files were both last written when I was at lunch, viz.:

Jul 2 12:17 valinuxl.error.log
-

ID: 18166
Jean-David

Joined: 19 Dec 05
Posts: 89
United States
Message 18173 - Posted: 2 Jul 2008, 23:05:50 UTC - in response to Message 18166.  

I doubt the death of a child will kill the boinc client. Is this correct (that it will not)?

It shouldn't do that anymore, but it is still possible. Sometimes if an application goes it takes the whole client along.

Would you care to test if it's hydrogen that's causing it, by running mainly that project for a bit? It would help Jack fix possible problems with his application(s).

(What you could try is exit BOINC, delete all the files in the \projects\hydrogenathome.org\ directory and restart BOINC, then allow H2 to fetch work again for the test. In the mean time I'll ask Jack to come take a look here).


Today, the BOINC client crashed on my other machine as well. So it is not a hardware problem with my first machine. I cannot prove it was hydrogen@home, because that client was running anything that came along from this list:

[SETI@home]
[climateprediction.net]
[Predictor @ Home]
[rosetta@home]
[Hydrogen@Home]
[malariacontrol.net]
[World Community Grid]



My other machine did it again while I was at dinner.

$ tail valinuxl.error.log
Exiting...
Unrecognized XML in GLOBAL_PREFS::parse_override: max_ncpus_pct
Skipping: 100
Skipping: /max_ncpus_pct
SIGSEGV: segmentation violation
Stack trace (2 frames):
/boinc/BOINC/boinc[0x808e90a]
/lib/tls/libc.so.6[0xa4e908]

Exiting...

$ tail -20 valinuxl.boinc.log
02-Jul-2008 18:30:31 [Hydrogen@Home] Scheduler request succeeded: got 1 new tasks
02-Jul-2008 18:30:33 [Hydrogen@Home] Started download of 1215037815_nsc36937.mol2
02-Jul-2008 18:30:33 [Hydrogen@Home] Started download of 1215037815_pdb1gow.pdb
02-Jul-2008 18:30:34 [Hydrogen@Home] Finished download of 1215037815_nsc36937.mol2
02-Jul-2008 18:30:34 [Hydrogen@Home] Started download of ad_nsc36937.mol2_pdb1gow.pdb_1215037815
02-Jul-2008 18:30:35 [Hydrogen@Home] Finished download of 1215037815_pdb1gow.pdb
02-Jul-2008 18:30:35 [Hydrogen@Home] Finished download of ad_nsc36937.mol2_pdb1gow.pdb_1215037815
02-Jul-2008 18:30:41 [Hydrogen@Home] Sending scheduler request: To fetch work. Requesting 13268 seconds of work, reporting 0 completed tasks
02-Jul-2008 18:30:46 [Hydrogen@Home] Scheduler request succeeded: got 0 new tasks
02-Jul-2008 18:30:46 [Hydrogen@Home] Message from server: No work sent
02-Jul-2008 18:31:47 [Hydrogen@Home] Sending scheduler request: To fetch work. Requesting 13268 seconds of work, reporting 0 completed tasks
02-Jul-2008 18:31:52 [Hydrogen@Home] Scheduler request succeeded: got 0 new tasks
02-Jul-2008 18:31:52 [Hydrogen@Home] Message from server: No work sent
02-Jul-2008 18:32:52 [Hydrogen@Home] Sending scheduler request: To fetch work. Requesting 13304 seconds of work, reporting 1 completed tasks
02-Jul-2008 18:32:57 [Hydrogen@Home] Scheduler request succeeded: got 0 new tasks
02-Jul-2008 18:32:57 [Hydrogen@Home] Message from server: No work sent
Resuming CPDN!
Resuming CPDN!
Cleaning up graphics data...
Detaching shared memory...


ID: 18173
Jord
Volunteer tester
Help desk expert

Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 18174 - Posted: 2 Jul 2008, 23:37:26 UTC

Looks like it's CPDN that's causing it. Let me advertise this thread on the CPDN forums, get some of their people over here to check things with you.
ID: 18174
MikeMarsUK

Joined: 16 Apr 06
Posts: 386
United Kingdom
Message 18176 - Posted: 3 Jul 2008, 8:14:21 UTC
Last modified: 3 Jul 2008, 8:15:10 UTC

It might be worth running a stress test on the machine. I like to use Prime95 (which can be downloaded from http://www.mersenne.org/). The Linux version is called mprime.

You need to run one copy per core, making sure that you use the affinity option to keep them from stepping on each other's toes. If you can run it for 24 hours without any errors, that is a good way to demonstrate that the hardware side of things is working OK.

(Look in the 'hardware' section of the following post for some ideas, although it is written more with MS-XP in mind: http://www.climateprediction.net/board/viewtopic.php?t=5896. sensors / lm-sensors can be used to monitor the CPU temperature, since overheating can also cause intermittent hardware faults.)
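
On Linux, a minimal sketch of that setup with two copies pinned to separate CPUs (taskset comes with util-linux; the mprime path is just an example, adjust it to wherever you unpacked the program):

cd /path/to/mprime
taskset -c 0 ./mprime -t &    # torture test pinned to CPU 0
taskset -c 1 ./mprime -t &    # second copy pinned to CPU 1
# let them run for ~24 hours; any hardware error is recorded in results.txt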
ID: 18176
Jean-David

Joined: 19 Dec 05
Posts: 89
United States
Message 18202 - Posted: 4 Jul 2008, 10:27:34 UTC - in response to Message 18174.  

Looks like it's CPDN that's causing it. Let me advertise this thread on the CPDN forums, get some of their people over here to check things with you.


Whatever it is, it crashed the boinc client on both machines last night.

It did not crash the machines themselves, and they continued running all night, doing the things they do. In fact, I do not remember either of these machines ever crashing. The only machine I ever had that crashed (new in 1996) did so mostly when running Windows 95 and Red Hat Linux 5.0, 6.0, and maybe 6.2. By the time Red Hat Linux 7.3 came out, it no longer crashed when running Linux. That machine has been retired for a few years now; I do not need three machines.

Both machines have APC Smart-UPSs supplying power, and neither recorded a power event overnight.

If the machines are experiencing hardware problems, it seems strange to me that they should both start doing it at about the same time. The machines are different. And the problem affects only boinc-related processes.

The older one (new in early 2000) is by VA Linux Systems and has dual 550 MHz Pentium III processors and 512 MBytes of RAM. The newer one (new in early 2004) I put together myself; it has dual hyperthreaded 3.06 GHz Xeon processors (32-bit) and 8 GBytes of RAM. They have different chip sets.

I restarted the boinc client on the new machine and it started right up.
Likewise, the old machine.

I now have them both set to get new tasks only for setiathome, and I have suspended all climate prediction work units. So in a few hours or a day or so, they should be running only setiathome. Then we may see.

ID: 18202
Jean-David

Joined: 19 Dec 05
Posts: 89
United States
Message 18203 - Posted: 4 Jul 2008, 10:37:17 UTC - in response to Message 18176.  

It might be worth running a stress test on the machine. I like to use Prime95 (which can be downloaded from http://www.mersenne.org/). The Linux version is called mprime.


I really doubt it is a hardware problem since both of my machines are having the same problem. And it is unlikely to be a power problem because both machines are on APC Smart-UPS power and neither recorded any power events.

You need to run one copy per core, making sure that you use the affinity option to keep them from stepping on each other's toes. If you can run it for 24 hours without any errors, that is a good way to demonstrate that the hardware side of things is working OK.

(Look in the 'hardware' section of the following post for some ideas, although it is written more with MS-XP in mind: http://www.climateprediction.net/board/viewtopic.php?t=5896. sensors / lm-sensors can be used to monitor the CPU temperature, since overheating can also cause intermittent hardware faults.)


It is unlikely to be a temperature problem because it was cool last night and lm_sensors recorded no unusual temperatures or fan speeds.

Here is how it came down:

04-Jul-2008 02:34:58 [Hydrogen@Home] Sending scheduler request: To fetch work. Requesting 72352 seconds of work, reporting 0 completed tasks
04-Jul-2008 02:35:03 [Hydrogen@Home] Scheduler request succeeded: got 0 new tasks
04-Jul-2008 02:35:03 [Hydrogen@Home] Message from server: No work sent
Resuming CPDN!
hadam3h_c_62s02_2000_2000_1 - PH 1 TS 0014904 A - 14/07/2000 12:00 - H:M:S=0209:51:33 AVG=50.69 DLT=40.99
hadam3h_c_62s02_2000_2000_1 - PH 1 TS 0014905 A - 14/07/2000 12:10 - H:M:S=0209:52:14 AVG=50.69 DLT=41.00
04-Jul-2008 02:36:20 [Hydrogen@Home] Sending scheduler request: To fetch work. Requesting 73761 seconds of work, reporting 0 completed tasks
04-Jul-2008 02:36:25 [Hydrogen@Home] Scheduler request succeeded: got 1 new tasks
Cleaning up graphics data...
Detaching shared memory...

And here are the temperatures, processor fans, and voltages:

Fri Jul 4 02:30:01 EDT 2008
w83627hf-isa-0290
Adapter: ISA adapter
VCore: +1.44 V (min = +1.36 V, max = +1.47 V)
+3.3V: +3.31 V (min = +3.14 V, max = +3.46 V)
VBat: +3.18 V (min = +2.40 V, max = +3.60 V)
+5V: +4.92 V (min = +4.84 V, max = +5.24 V)
+12V: +11.92 V (min = +11.49 V, max = +12.59 V)
-12V: -11.78 V (min = -13.02 V, max = -11.37 V)
V5SB: +5.43 V (min = +4.84 V, max = +5.24 V)
CPU0 fan: 3668 RPM (min = 1592 RPM, div = 8)
CPU1 fan: 2722 RPM (min = 1592 RPM, div = 8)
System: +42 C (high = +50 C, hyst = +48 C) sensor = thermistor
CPU0: +55.5 C (high = +60 C, hyst = +58 C) sensor = thermistor
CPU1: +54.0 C (high = +60 C, hyst = +58 C) sensor = thermistor
vid: +1.525 V (VRM Version 9.0)
alarms: Chassis intrusion detection
beep_enable:
Sound alarm disabled

Fri Jul 4 02:45:01 EDT 2008
w83627hf-isa-0290
Adapter: ISA adapter
VCore: +1.46 V (min = +1.36 V, max = +1.47 V)
+3.3V: +3.31 V (min = +3.14 V, max = +3.46 V)
VBat: +3.18 V (min = +2.40 V, max = +3.60 V)
+5V: +4.97 V (min = +4.84 V, max = +5.24 V)
+12V: +12.04 V (min = +11.49 V, max = +12.59 V)
-12V: -11.70 V (min = -13.02 V, max = -11.37 V)
V5SB: +5.46 V (min = +4.84 V, max = +5.24 V)
CPU0 fan: 2909 RPM (min = 1592 RPM, div = 8)
CPU1 fan: 2482 RPM (min = 1592 RPM, div = 8)
System: +41 C (high = +50 C, hyst = +48 C) sensor = thermistor
CPU0: +38.5 C (high = +60 C, hyst = +58 C) sensor = thermistor
CPU1: +38.5 C (high = +60 C, hyst = +58 C) sensor = thermistor
vid: +1.525 V (VRM Version 9.0)
alarms: Chassis intrusion detection
beep_enable:
Sound alarm disabled

ID: 18203
Jord
Volunteer tester
Help desk expert

Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 18204 - Posted: 4 Jul 2008, 11:13:00 UTC

One thing I see now that I have checked through all your supplied logs: it seems to 'crash' each time right after you either just started a task (regardless of project) or had a scheduler moment. At all these times there is reading from and writing to disk going on.

So could it be your disk controller?

Would you be willing to test the BOINC 6.2 that is about to be released? See if it does this as well?
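
On the disk controller question, one cheap check is to look for I/O, reset, or timeout messages in the kernel log around the crash times. A sketch (the exact wording varies by driver):

dmesg | egrep -i 'i/o error|reset|timeout|dma' | tail -20
grep -i error /var/log/messages | egrep -i 'scsi|ata|sd[a-z]' | tail -20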
ID: 18204
Jean-David

Joined: 19 Dec 05
Posts: 89
United States
Message 18217 - Posted: 4 Jul 2008, 15:56:20 UTC - in response to Message 18204.  

One thing I see now that I have checked through all your supplied logs: it seems to 'crash' each time right after you either just started a task (regardless of project) or had a scheduler moment. At all these times there is reading from and writing to disk going on.

So could it be your disk controller?

Would you be willing to test the BOINC 6.2 that is about to be released? See if it does this as well?


Recall that I am running two different machines and get the same problems on each. So if it is a disk controller problem, it would have to be two different kinds of controller: on one machine, all disks are on SCSI controllers, and on the other, the disk used by the BOINC client is on an EIDE controller.

On the machine with the EIDE controller, that hard drive has had two errors, both at power on, over the last 8+ years, as measured with the smartctl program. One was at 986 days and the other at 1006 days from initial installation. Since the machine is about 2920 days old and the problems are recent, I would guess it is not a hard drive problem.

On the machine with the SCSI controllers, there have been no uncorrected errors in the life of the hard drive that BOINC is on.

I downloaded BOINC 6.2 onto my old machine. The package says Ubuntu and I am running CentOS 4 on that machine, so I hope it will be OK. I am currently running two instances of mprime on that machine (one for each processor) and want to let it run some more. It has found no errors in over 4 hours (on each processor), but I think I should let it run a few more hours before deciding that the memory and processors are OK. To be honest, I have already decided that, but I might as well let them run a little longer.

ID: 18217
Jean-David

Joined: 19 Dec 05
Posts: 89
United States
Message 18218 - Posted: 4 Jul 2008, 17:30:47 UTC - in response to Message 18217.  


I downloaded BOINC 6.2 onto my old machine. The package says Ubuntu and I am running CentOS 4 on that machine, so I hope it will be OK. I am currently running two instances of mprime on that machine (one for each processor) and want to let it run some more. It has found no errors in over 4 hours (on each processor), but I think I should let it run a few more hours before deciding that the memory and processors are OK. To be honest, I have already decided that, but I might as well let them run a little longer.


OK: the mersenne prime torture test ran on both processors with no problems for 6 hours each.

I then tried to run BOINC 6.2, but it will not start because it wants glibc 2.4 and mine is glibc-2.3.4-2.39.
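
For reference, two quick ways to see which glibc a Red Hat style system has (a sketch):

rpm -q glibc       # reports e.g. glibc-2.3.4-2.39 on this CentOS 4 box
/lib/libc.so.6     # executing the library directly also prints its version banner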
ID: 18218
Jord
Volunteer tester
Help desk expert

Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 18227 - Posted: 4 Jul 2008, 21:44:37 UTC - in response to Message 18218.  

Can you update glibc on your distro? The developers came back to me and are very interested to hear whether you manage to crash the 6.2 client, especially since no application should ever crash the client.
ID: 18227
Jord
Volunteer tester
Help desk expert

Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 18229 - Posted: 4 Jul 2008, 22:16:57 UTC

OK, scratch the former post. Please first try the 5.10.45 debug version.
For any and all crashes you get with it, post the stack trace that it leaves. The information should be saved in stderrdae.txt, so you may want to clear that file before testing this version (just delete it; it will be made again by BOINC).

You can still try 6.2.11 but need the compatible version, not the Ubuntu version. Only the Ubuntu version needs glibc 2.4
6.2.11 compatible version
6.2.11 compatible debug version

As far as I know, these two do not come with a BOINC Manager, although you should be able to use the BM from 5.10.45 if you keep it around.
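
In shell terms, the test amounts to roughly this. A sketch, assuming the paths shown in your stack traces (data directory /boinc, binary at /boinc/BOINC/boinc):

cd /boinc                  # the client's data directory
rm -f stderrdae.txt        # clear the old file; BOINC recreates it
./BOINC/boinc --daemon     # run the debug client in the background
# after the next crash, the stack trace should be at the end of stderrdae.txt:
tail -40 stderrdae.txt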
ID: 18229
Jean-David

Joined: 19 Dec 05
Posts: 89
United States
Message 18231 - Posted: 5 Jul 2008, 0:55:36 UTC - in response to Message 18229.  

OK, scratch the former post. Please first try the 5.10.45 debug version.
For any and all crashes you get with it, post the stack trace that it leaves. The information should be saved in stderrdae.txt, so you may want to clear that file before testing this version (just delete it; it will be made again by BOINC).

You can still try 6.2.11 but need the compatible version, not the Ubuntu version. Only the Ubuntu version needs glibc 2.4
6.2.11 compatible version
6.2.11 compatible debug version

As far as I know, these two do not come with a BOINC Manager, although you should be able to use the BM from 5.10.45 if you keep it around.


OK, I have boinc_6.2.11_i686-pc-linux-gnu_debug.sh (6.2.11) running on my older machine. That machine will fetch no new work except from setiathome. Do you want me to change that and get work from other projects as well? I have suspended the only climateprediction work unit. Right now it has a World Community Grid task, a rosetta task, and two setiathome work units in it.

boinc client did not create the error file. Does it wait until it needs to write it?

ID: 18231
KSMarksPsych

Joined: 30 Oct 05
Posts: 1239
United States
Message 18232 - Posted: 5 Jul 2008, 1:10:27 UTC - in response to Message 18231.  

boinc client did not create the error file. Does it wait until it needs to write it?


How did you set up BOINC? The Linux version doesn't create the logs if you just do ./boincmgr from the directory it's installed to. You can try ./boinc --daemon and then ./boincmgr (I think it's still called boinc, not boinc_client, but it'll be obvious what you want, it's not the one with cmd or mgr in it).
Kathryn :o)
ID: 18232
Jean-David

Joined: 19 Dec 05
Posts: 89
United States
Message 18233 - Posted: 5 Jul 2008, 2:42:41 UTC - in response to Message 18232.  

boinc client did not create the error file. Does it wait until it needs to write it?


How did you set up BOINC? The Linux version doesn't create the logs if you just do ./boincmgr from the directory it's installed to. You can try ./boinc --daemon and then ./boincmgr (I think it's still called boinc, not boinc_client, but it'll be obvious what you want, it's not the one with cmd or mgr in it).


I do it so it comes up automatically when I boot the system.

So there is stuff in /etc/sysconfig/boinc and /etc/rc.d/init.d/boinc to do it.

So the stuff in init.d/boinc includes:

case "$1" in
start)
cd $BOINCDIR
if [ ! -f client_state.xml ] ; then
echo -n "BOINC client requires initialization first."
echo_failure
exit 3
fi
echo -n "Starting BOINC client: "


su - boinc -c "$BOINCDIR/BOINC/boinc >>$BOINCDIR/$LOGFILE 2>>$BOINCDIR/$ERRORLOG &"
# su - boinc -c "$BOINCDIR/BOINC/boinc >/dev/null 2>>$BOINCDIR/$ERRORLOG &"
echo_success
echo
;;

And some of those definitions come from /etc/sysconfig/boinc, as follows:

# Configuration for boinc client.

BOINCUSER=boinc
BOINCDIR=/boinc
BUILD_ARCH=i686-pc-linux-gnu
LOGFILE=valinuxl.boinc.log
ERRORLOG=valinuxl.error.log

I just restarted it and the error log is empty and the logfile is:

04-Jul-2008 22:36:47 [---] Starting BOINC client version 6.2.11 for i686-pc-linux-gnu
04-Jul-2008 22:36:47 [---] This a development version of BOINC and may not function properly
04-Jul-2008 22:36:47 [---] log flags: task, file_xfer, sched_ops
04-Jul-2008 22:36:47 [---] Libraries: libcurl/7.18.0 OpenSSL/0.9.8g zlib/1.2.1.2 c-ares/1.5.1
04-Jul-2008 22:36:47 [---] Data directory: /boinc
04-Jul-2008 22:36:47 [---] Processor: 2 GenuineIntel Pentium III (Katmai) [Family 6 Model 7 Stepping 3]
04-Jul-2008 22:36:47 [---] Processor features: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 mmx fxsr sse
04-Jul-2008 22:36:47 [---] OS: Linux: 2.6.9-67.0.20.ELsmp
04-Jul-2008 22:36:47 [---] Memory: 502.41 MB physical, 1.00 GB virtual
04-Jul-2008 22:36:47 [---] Disk: 7.88 GB total, 6.99 GB free
04-Jul-2008 22:36:47 [---] Local time is UTC -4 hours
04-Jul-2008 22:36:47 [---] No coprocessors
04-Jul-2008 22:36:47 [---] Version change (5.10.45 -> 6.2.11)
04-Jul-2008 22:36:47 [SETI@home] URL: http://setiathome.berkeley.edu/; Computer ID: 859259; location: home; project prefs: default
04-Jul-2008 22:36:47 [climateprediction.net] URL: http://climateprediction.net/; Computer ID: 164427; location: home; project prefs: default
04-Jul-2008 22:36:47 [Predictor @ Home] URL: http://predictor.chem.lsa.umich.edu/; Computer ID: 101216; location: home; project prefs: default
04-Jul-2008 22:36:47 [rosetta@home] URL: http://boinc.bakerlab.org/rosetta/; Computer ID: 52404; location: home; project prefs: default
04-Jul-2008 22:36:47 [Hydrogen@Home] URL: http://hydrogenathome.org/; Computer ID: 4152; location: (none); project prefs: default
04-Jul-2008 22:36:47 [malariacontrol.net] URL: http://www.malariacontrol.net/; Computer ID: 12250; location: home; project prefs: default
04-Jul-2008 22:36:47 [World Community Grid] URL: http://www.worldcommunitygrid.org/; Computer ID: 603471; location: (none); project prefs: default
04-Jul-2008 22:36:47 [---] General prefs: from malariacontrol.net (last modified 21-May-2008 09:19:36)
04-Jul-2008 22:36:47 [---] Computer location: home
04-Jul-2008 22:36:47 [---] General prefs: no separate prefs for home; using your defaults
04-Jul-2008 22:36:47 [---] Preferences limit memory usage when active to 376.80MB
04-Jul-2008 22:36:47 [---] Preferences limit memory usage when idle to 452.17MB
04-Jul-2008 22:36:47 [---] Preferences limit disk usage to 6.90GB
04-Jul-2008 22:36:47 [---] Running CPU benchmarks
04-Jul-2008 22:37:18 [---] Benchmark results:
04-Jul-2008 22:37:18 [---] Number of CPUs: 2
04-Jul-2008 22:37:18 [---] 299 floating point MIPS (Whetstone) per CPU
04-Jul-2008 22:37:18 [---] 550 integer MIPS (Dhrystone) per CPU
04-Jul-2008 22:37:19 [World Community Grid] Restarting task faah4143_AB3_MIN3_xmd06240_02_0 using faah version 605
04-Jul-2008 22:37:21 [rosetta@home] Restarting task FRA_t454_CASP8_2CIR_5_axnew.0167_0003_4170_854_0 using rosetta_beta version 598

which looks harmless.

climateprediction is still suspended and the only project that will get new tasks is setiathome. Do you want me to enable other projects?
ID: 18233