Message boards : BOINC client : BOINC client exits.
Author | Message |
---|---|
Send message Joined: 19 Dec 05 Posts: 93 |
I am running Red Hat Enterprise Linux 5.2 on a dual hyperthreaded Xeon machine with 8 GBytes RAM. My boinc client was from boinc_5.10.45_i686-pc-linux-gnu.sh. It has been running fine for months if not longer. I normally run it 24/7. Yesterday and today, the boinc client has exited with the following in the error log: Skipping: 100 Skipping: /max_ncpus_pct Unrecognized XML in GLOBAL_PREFS::parse_override: max_ncpus_pct Skipping: 100 Skipping: /max_ncpus_pct Unrecognized XML in GLOBAL_PREFS::parse_override: max_ncpus_pct Skipping: 100 Skipping: /max_ncpus_pct Unrecognized XML in GLOBAL_PREFS::parse_override: max_ncpus_pct Skipping: 100 Skipping: /max_ncpus_pct Unrecognized XML in GLOBAL_PREFS::parse_override: max_ncpus_pct Skipping: 100 Skipping: /max_ncpus_pct Unrecognized XML in GLOBAL_PREFS::parse_override: max_ncpus_pct Skipping: 100 Skipping: /max_ncpus_pct Unrecognized XML in GLOBAL_PREFS::parse_override: max_ncpus_pct Skipping: 100 Skipping: /max_ncpus_pct Unrecognized XML in GLOBAL_PREFS::parse_override: max_ncpus_pct Skipping: 100 Skipping: /max_ncpus_pct Unrecognized XML in GLOBAL_PREFS::parse_override: max_ncpus_pct Skipping: 100 Skipping: /max_ncpus_pct Unrecognized XML in GLOBAL_PREFS::parse_override: max_ncpus_pct Skipping: 100 Skipping: /max_ncpus_pct SIGSEGV: segmentation violation Stack trace (2 frames): /home/boinc/BOINC/boinc[0x808e90a] [0xe38420] Exiting... Unrecognized XML in GLOBAL_PREFS::parse_override: max_ncpus_pct Skipping: 100 Skipping: /max_ncpus_pct SIGSEGV: segmentation violation Stack trace (2 frames): /home/boinc/BOINC/boinc[0x808e90a] [0x679420] Exiting... The error.log file is listed as: -rw-r--r-- 1 root root 1361 Jun 16 01:32 error.log and this represents two crashes. I restarted it after the first one and it ran around 24 hours before the second one. I have no clue what this means (other than it exits. I know what a segmentation violation is, but I do not know how to interpret the error log. 
Should I reinstall the program, or what? P.S.: I am running climateprediction, hydrogen@home, rosetta@home, seti@home, worldcommunitygrid, malariacontrol, and predictor@home (which has had no work in a long time), if that matters. |
Send message Joined: 29 Aug 05 Posts: 15569 |
These are benign errors: Skipping: 100
They occur because the science application isn't compiled with the latest BOINC version, while the back-end is running the latest version. The max_ncpus_pct preference only works on BOINC 6.
Which leaves this as the real fault: SIGSEGV: segmentation violation
Possible causes include:
- Bad task.
- Bad RAM.
- Bad CPU.
- Bad Page File.
- Bad disk.
- Heat.
Since you had two in a row and the client crashed as well both times, go check your CPU and RAM first and foremost. Do check that your system isn't clogged up with dust. |
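For anyone sorting the benign lines from the real fault in a log like this, here is a minimal sketch. It assumes the crash marker is exactly the `SIGSEGV: segmentation violation` string shown in the excerpt above; the function names are made up for illustration and are not part of BOINC.

```shell
# Count crash entries in a BOINC error log.
# The marker string is copied from the log excerpt in this thread.
count_segv() {
    # grep -c prints the number of matching lines (0 if there are none).
    grep -c 'SIGSEGV: segmentation violation' "$1"
}

# Show each crash line with a few lines of context on either side,
# so the stack trace and whatever preceded it can be read together.
show_segv_context() {
    grep -n -B 3 -A 3 'SIGSEGV: segmentation violation' "$1"
}
```

Usage would be something like `count_segv error.log` followed by `show_segv_context error.log`, with the file name adjusted to wherever the client's stderr is redirected.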
Send message Joined: 19 Dec 05 Posts: 93 |
These are benign errors. I see you must be right. Both because they are absent in the second crash and because I have restarted the boinc client and I got one, but everything is running fine. Which leaves this as the real fault. Is this something I can fix; e.g., by resetting the project? I do not know which project, if any, is causing the problem. The newest one I am running is hydrogen@home, so if I must suspect one, that is what I would pick. I will suspend it and see what happens tomorrow. - Bad RAM. I ran memtest86 overnight March 17 for 9 hours and it was OK. Of course, the memory (half over 4 years old and the other half about 2 years old) could have gone bad since then. - Bad CPU. I have two of these. I do not really know how to test them other than running a lot of stuff. Right now, running BOINC is the toughest test because it keeps the cpu's almost 100% busy 24/7, but as a test it is not much good. Nothing else is crashing. - Bad Page File. Possible, but unlikely. For one thing, this machine almost never (less than once a day) pages: Mem: 8185240k total, 7858392k used, 326848k free, 175276k buffers Swap: 4096496k total, 612k used, 4095884k free, 6547776k cached $ vmstat 5 procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------ r b swpd free buff cache si so bi bo in cs us sy id wa st 4 0 612 362244 175388 6547752 0 0 43 22 15 21 87 4 9 0 0 4 0 612 360508 175412 6547768 0 0 0 31 1115 653 95 4 1 0 0 4 0 612 333484 175436 6518692 0 0 0 7 1192 881 84 14 3 0 0 4 0 612 285976 175476 6540288 0 0 0 3854 1060 430 76 23 1 0 0 4 0 612 317520 175492 6542724 0 0 0 2996 1176 1088 86 6 8 0 0 5 0 612 330000 175520 6542720 0 0 0 14 1105 783 94 4 2 0 0 4 0 612 411532 175592 6543976 0 0 258 111 1118 1108 91 6 3 0 0 - Bad disk. I have 6 hard drives, but 4 are used only for a database application. 
The temperatures of the hard drives around the times of the crashes were: Jun 15 00:54:48 trillian smartd[3991]: Device: /dev/sda, Temperature changed -2 Celsius to 38 Celsius since last report Jun 15 00:54:48 trillian smartd[3991]: Device: /dev/sdb, Temperature changed -2 Celsius to 40 Celsius since last report Jun 15 20:24:48 trillian smartd[3991]: Device: /dev/sda, Temperature changed -2 Celsius to 38 Celsius since last report Jun 15 20:24:48 trillian smartd[3991]: Device: /dev/sdb, Temperature changed -2 Celsius to 40 Celsius since last report ... Jun 16 01:54:48 trillian smartd[3991]: Device: /dev/sda, Temperature changed -3 Celsius to 35 Celsius since last report Jun 16 01:54:48 trillian smartd[3991]: Device: /dev/sdb, Temperature changed -2 Celsius to 38 Celsius since last report so they are not too hot. smartctl reveals they are working fine: # /usr/sbin/smartctl -a /dev/sda smartctl version 5.36 [i686-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen Home page is http://smartmontools.sourceforge.net/ Device: MAXTOR ATLAS10K5_73WLS Version: JNZH Serial number: D21C7CZK Device type: disk Transport protocol: Parallel SCSI (SPI-4) Local Time is: Mon Jun 16 07:46:04 2008 EDT Device supports SMART and is Enabled Temperature Warning Enabled SMART Health Status: OK Current Drive Temperature: 37 C Manufactured in week 05 of year Current start stop count: 1074003968 times Recommended maximum start stop count: 1124401151 times Elements in grown defect list: 0 Error counter log: Errors Corrected by Total Correction Gigabytes Total ECC rereads/ errors algorithm processed uncorrected fast | delayed rewrites corrected invocations [10^9 bytes] errors read: 694147 0 0 0 0 2315.327 0 write: 0 0 0 0 0 3568.492 0 Non-medium error count: 87 Last n error events log page No self-tests have been logged Long (extended) Self Test duration: 1440 seconds [24.0 minutes] # /usr/sbin/smartctl -a /dev/sdb smartctl version 5.36 [i686-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen 
Home page is http://smartmontools.sourceforge.net/ Device: MAXTOR ATLAS10K5_73WLS Version: JNZH Serial number: D21C6NBK Device type: disk Transport protocol: Parallel SCSI (SPI-4) Local Time is: Mon Jun 16 07:46:20 2008 EDT Device supports SMART and is Enabled Temperature Warning Enabled SMART Health Status: OK Current Drive Temperature: 39 C Manufactured in week 05 of year Current start stop count: 1074003968 times Recommended maximum start stop count: 1124401151 times Elements in grown defect list: 0 Error counter log: Errors Corrected by Total Correction Gigabytes Total ECC rereads/ errors algorithm processed uncorrected fast | delayed rewrites corrected invocations [10^9 bytes] errors read: 14017 70 0 0 0 1169.196 0 write: 0 0 0 0 0 1945.042 0 Non-medium error count: 129 Last n error events log page No self-tests have been logged Long (extended) Self Test duration: 1440 seconds [24.0 minutes] - Heat. Sun Jun 15 04:00:01 EDT 2008 w83627hf-isa-0290 Adapter: ISA adapter VCore: +1.46 V (min = +1.36 V, max = +1.47 V) +3.3V: +3.31 V (min = +3.14 V, max = +3.46 V) VBat: +3.17 V (min = +2.40 V, max = +3.60 V) +5V: +4.92 V (min = +4.84 V, max = +5.24 V) +12V: +11.86 V (min = +11.49 V, max = +12.59 V) -12V: -11.78 V (min = -13.02 V, max = -11.37 V) V5SB: +5.43 V (min = +4.84 V, max = +5.24 V) CPU0 fan: 3516 RPM (min = 1592 RPM, div = 8) CPU1 fan: 2556 RPM (min = 1592 RPM, div = 8) System: +42 C (high = +50 C, hyst = +48 C) sensor = thermistor CPU0: +56.5 C (high = +60 C, hyst = +58 C) sensor = thermistor CPU1: +54.5 C (high = +60 C, hyst = +58 C) sensor = thermistor vid: +1.525 V (VRM Version 9.0) alarms: Chassis intrusion detection beep_enable: Sound alarm disabled Sun Jun 15 04:15:01 EDT 2008 w83627hf-isa-0290 Adapter: ISA adapter VCore: +1.46 V (min = +1.36 V, max = +1.47 V) +3.3V: +3.31 V (min = +3.14 V, max = +3.46 V) VBat: +3.18 V (min = +2.40 V, max = +3.60 V) +5V: +4.95 V (min = +4.84 V, max = +5.24 V) +12V: +11.98 V (min = +11.49 V, max = +12.59 V) -12V: -11.78 
V (min = -13.02 V, max = -11.37 V) V5SB: +5.43 V (min = +4.84 V, max = +5.24 V) CPU0 fan: 2909 RPM (min = 1592 RPM, div = 8) CPU1 fan: 2410 RPM (min = 1592 RPM, div = 8) System: +41 C (high = +50 C, hyst = +48 C) sensor = thermistor CPU0: +41.0 C (high = +60 C, hyst = +58 C) sensor = thermistor CPU1: +40.5 C (high = +60 C, hyst = +58 C) sensor = thermistor vid: +1.525 V (VRM Version 9.0) alarms: Chassis intrusion detection beep_enable: Sound alarm disabled Mon Jun 16 01:30:01 EDT 2008 w83627hf-isa-0290 Adapter: ISA adapter VCore: +1.44 V (min = +1.36 V, max = +1.47 V) +3.3V: +3.31 V (min = +3.14 V, max = +3.46 V) VBat: +3.18 V (min = +2.40 V, max = +3.60 V) +5V: +4.92 V (min = +4.84 V, max = +5.24 V) +12V: +11.86 V (min = +11.49 V, max = +12.59 V) -12V: -11.87 V (min = -13.02 V, max = -11.37 V) V5SB: +5.43 V (min = +4.84 V, max = +5.24 V) CPU0 fan: 3375 RPM (min = 1592 RPM, div = 8) CPU1 fan: 2482 RPM (min = 1592 RPM, div = 8) System: +41 C (high = +50 C, hyst = +48 C) sensor = thermistor CPU0: +56.0 C (high = +60 C, hyst = +58 C) sensor = thermistor CPU1: +54.0 C (high = +60 C, hyst = +58 C) sensor = thermistor vid: +1.525 V (VRM Version 9.0) alarms: Chassis intrusion detection beep_enable: Sound alarm disabled Mon Jun 16 01:45:01 EDT 2008 w83627hf-isa-0290 Adapter: ISA adapter VCore: +1.47 V (min = +1.36 V, max = +1.47 V) +3.3V: +3.31 V (min = +3.14 V, max = +3.46 V) VBat: +3.18 V (min = +2.40 V, max = +3.60 V) +5V: +4.97 V (min = +4.84 V, max = +5.24 V) +12V: +11.92 V (min = +11.49 V, max = +12.59 V) -12V: -11.70 V (min = -13.02 V, max = -11.37 V) V5SB: +5.46 V (min = +4.84 V, max = +5.24 V) CPU0 fan: 2482 RPM (min = 1592 RPM, div = 8) CPU1 fan: 2220 RPM (min = 1592 RPM, div = 8) System: +39 C (high = +50 C, hyst = +48 C) sensor = thermistor CPU0: +37.0 C (high = +60 C, hyst = +58 C) sensor = thermistor CPU1: +36.5 C (high = +60 C, hyst = +58 C) sensor = thermistor vid: +1.525 V (VRM Version 9.0) alarms: Chassis intrusion detection beep_enable: Sound alarm 
disabled So they were not too hot either. Since you had two in a row and both times the client crashed as well, go check your CPU and RAM first and foremost. Do check that your system isn't clogged up with dust. I just looked at the regular log (not the error log) at the times of the crashes (not the same time, so cron probably not causing the problem. At about the times of the crashes, they say: 15-Jun-2008 04:10:20 [Hydrogen@Home] Sending scheduler request: To report completed tasks. Requesting 25264 seconds of work, reporting 1 completed tasks 15-Jun-2008 04:10:25 [Hydrogen@Home] Scheduler request succeeded: got 0 new tasks 15-Jun-2008 04:10:25 [Hydrogen@Home] Message from server: No work sent Resuming CPDN! Resuming CPDN! Resuming CPDN! hadcm3istd_0bkp_1920_160_15936335 - PH 1 TS 1400977 A - 19/12/1974 00:30 - H:M:S=1587:15:10 AVG= 4.08 DLT= 0.99 Cleaning up graphics data... Detaching shared memory... Cleaning up graphics data... Detaching shared memory... 15-Jun-2008 07:02:09 [---] Starting BOINC client version 5.10.45 for i686-pc-linux-gnu and 16-Jun-2008 01:32:26 [Hydrogen@Home] Sending scheduler request: To fetch work. Requesting 26832 seconds of work, reporting 0 completed tasks 16-Jun-2008 01:32:31 [Hydrogen@Home] Scheduler request succeeded: got 0 new tasks 16-Jun-2008 01:32:31 [Hydrogen@Home] Message from server: No work sent hadam3h_c_61s04_2000_2000_0 - PH 1 TS 0027517 A - 12/10/2000 02:10 - H:M:S=0374:17:26 AVG=48.97 DLT=35.81 hadam3h_c_61s04_2000_2000_0 - PH 1 TS 0027518 A - 12/10/2000 02:20 - H:M:S=0374:18:00 AVG=48.97 DLT=34.06 Resuming CPDN! Resuming CPDN! Cleaning up graphics data... Detaching shared memory... Cleaning up graphics data... Detaching shared memory... 16-Jun-2008 06:05:46 [---] Starting BOINC client version 5.10.45 for i686-pc-linux-gnu |
Send message Joined: 19 Dec 05 Posts: 93 |
"These are benign errors." It has stayed up all day and all night and is still up as I write this. The only change I made was to suspend the hydrogen@home project. I do not know whether this is a red herring or not. P.S.: I note that SIGSEGV is also raised on stack overflow. For me, the stack limit is:
$ ulimit -a
data seg size (kbytes, -d) unlimited
max locked memory (kbytes, -l) 32
max memory size (kbytes, -m) unlimited
stack size (kbytes, -s) 10240 <---<<<
This, too, may not mean anything; 10 megabytes seems like a lot. I doubt the death of a child will kill the boinc client. Is this correct (that it will not)? |
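As a quick check of the figure above, a small sketch that reads the soft stack limit of the current shell and converts it from kilobytes to megabytes. The helper names are illustrative only, and the limit seen by the BOINC client may differ if it is started from another account or from an init script.

```shell
# Convert a size in kilobytes to whole megabytes.
kb_to_mb() {
    echo "$(( $1 / 1024 ))"
}

# Report the soft stack limit of the current shell.
stack_limit_mb() {
    limit=$(ulimit -s)    # kilobytes, or the word "unlimited"
    if [ "$limit" = "unlimited" ]; then
        echo "unlimited"
    else
        echo "$(kb_to_mb "$limit") MB"
    fi
}
```

With the 10240 kB reported in the post, `kb_to_mb 10240` gives 10, which matches the "10 megabytes" estimate.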
Send message Joined: 29 Aug 05 Posts: 15569 |
"I doubt the death of a child will kill the boinc client. Is this correct (that it will not)?" It shouldn't do that anymore, but it is still possible. Sometimes when an application goes down, it takes the whole client along. Would you care to test whether it's hydrogen that's causing it, by running mainly that project for a bit? It would help Jack fix possible problems with his application(s). (What you could try is to exit BOINC, delete all the files in the \projects\hydrogenathome.org\ directory and restart BOINC, then allow H2 to fetch work again for the test. In the meantime I'll ask Jack to come take a look here.) |
Send message Joined: 19 Dec 05 Posts: 93 |
"I doubt the death of a child will kill the boinc client. Is this correct (that it will not)?" I did not notice your post until just now. I then stopped the BOINC client, went into ~/BOINC/projects and did rm -fr hydrogenathome.org. I then restarted the BOINC client and, less than a minute later, the hydrogenathome.org directory was refilled with files carrying the current date. I have set the other projects to get no new tasks, so pretty soon all but climate prediction should stop running. I hope hydrogen will supply me with enough work. BTW, it had not crashed for over a week until this morning. It was not hot either outside or inside my computer at the time of the crash. |
Send message Joined: 19 Dec 05 Posts: 93 |
[quote]I doubt the death of a child will kill the boinc client. Is this correct (that it will not)? Today, my other machine just crashed BOINC client. So it is not a hardware problem of my first machine. I cannot prove it was hydrogen@home because that client was running anything that came along from the list of [SETI@home] [climateprediction.net] [Predictor @ Home] [rosetta@home] [Hydrogen@Home] [malariacontrol.net] [World Community Grid] However, the last lines of my client log were: 02-Jul-2008 12:10:04 [Hydrogen@Home] Sending scheduler request: To fetch work. Requesting 13357 seconds of work, reporting 0 completed tasks 02-Jul-2008 12:10:09 [Hydrogen@Home] Scheduler request succeeded: got 1 new tasks 02-Jul-2008 12:10:11 [Hydrogen@Home] Started download of 1215014998_nsc1800.mol2 02-Jul-2008 12:10:11 [Hydrogen@Home] Started download of 1215014998_pdb1c5z.pdb 02-Jul-2008 12:10:12 [Hydrogen@Home] Finished download of 1215014998_nsc1800.mol2 02-Jul-2008 12:10:12 [Hydrogen@Home] Started download of ad_nsc1800.mol2_pdb1c5z.pdb_1215014998 02-Jul-2008 12:10:13 [Hydrogen@Home] Finished download of 1215014998_pdb1c5z.pdb 02-Jul-2008 12:10:13 [Hydrogen@Home] Finished download of ad_nsc1800.mol2_pdb1c5z.pdb_1215014998 02-Jul-2008 12:10:19 [Hydrogen@Home] Sending scheduler request: To fetch work. Requesting 13270 seconds of work, reporting 0 completed tasks 02-Jul-2008 12:10:24 [Hydrogen@Home] Scheduler request succeeded: got 0 new tasks 02-Jul-2008 12:10:24 [Hydrogen@Home] Message from server: No work sent 02-Jul-2008 12:11:25 [Hydrogen@Home] Sending scheduler request: To fetch work. Requesting 13270 seconds of work, reporting 0 completed tasks 02-Jul-2008 12:11:30 [Hydrogen@Home] Scheduler request succeeded: got 0 new tasks 02-Jul-2008 12:11:30 [Hydrogen@Home] Message from server: No work sent 02-Jul-2008 12:12:30 [Hydrogen@Home] Sending scheduler request: To report completed tasks. 
Requesting 13358 seconds of work, reporting 1 completed tasks 02-Jul-2008 12:12:35 [Hydrogen@Home] Scheduler request succeeded: got 0 new tasks 02-Jul-2008 12:12:35 [Hydrogen@Home] Message from server: No work sent 02-Jul-2008 12:13:35 [Hydrogen@Home] Sending scheduler request: To fetch work. Requesting 13358 seconds of work, reporting 0 completed tasks 02-Jul-2008 12:13:40 [Hydrogen@Home] Scheduler request succeeded: got 0 new tasks 02-Jul-2008 12:13:40 [Hydrogen@Home] Message from server: No work sent 02-Jul-2008 12:14:41 [Hydrogen@Home] Sending scheduler request: To fetch work. Requesting 13358 seconds of work, reporting 0 completed tasks 02-Jul-2008 12:14:46 [Hydrogen@Home] Scheduler request succeeded: got 0 new tasks 02-Jul-2008 12:14:46 [Hydrogen@Home] Message from server: No work sent hadsm3fub_jqg3_005952406 - PH 1 TS 0095329 A - 07/06/1816 00:30 - H:M:S=0392:42:50 AVG=14.83 DLT= 6.96 02-Jul-2008 12:16:01 [Hydrogen@Home] Sending scheduler request: To fetch work. Requesting 13358 seconds of work, reporting 0 completed tasks 02-Jul-2008 12:16:06 [Hydrogen@Home] Scheduler request succeeded: got 0 new tasks 02-Jul-2008 12:16:06 [Hydrogen@Home] Message from server: No work sent 02-Jul-2008 12:17:39 [Hydrogen@Home] Starting <B0>*<98><DD>mol2_pdb1c5z.pdb_1215014998_0 Cleaning up graphics data... Detaching shared memory... and the last few lines from the error log were: SIGSEGV: segmentation violation Stack trace (2 frames): /boinc/BOINC/boinc[0x808e90a] /lib/tls/libc.so.6[0xa4e908] Exiting... These files were both last written when I was at lunch, viz.: Jul 2 12:17 valinuxl.error.log - |
Send message Joined: 19 Dec 05 Posts: 93 |
[quote]I doubt the death of a child will kill the boinc client. Is this correct (that it will not)? My other machine did it again while I was at dinner. $ tail valinuxl.error.log Exiting... Unrecognized XML in GLOBAL_PREFS::parse_override: max_ncpus_pct Skipping: 100 Skipping: /max_ncpus_pct SIGSEGV: segmentation violation Stack trace (2 frames): /boinc/BOINC/boinc[0x808e90a] /lib/tls/libc.so.6[0xa4e908] Exiting... $ tail -20 valinuxl.boinc.log 02-Jul-2008 18:30:31 [Hydrogen@Home] Scheduler request succeeded: got 1 new tasks 02-Jul-2008 18:30:33 [Hydrogen@Home] Started download of 1215037815_nsc36937.mol2 02-Jul-2008 18:30:33 [Hydrogen@Home] Started download of 1215037815_pdb1gow.pdb 02-Jul-2008 18:30:34 [Hydrogen@Home] Finished download of 1215037815_nsc36937.mol2 02-Jul-2008 18:30:34 [Hydrogen@Home] Started download of ad_nsc36937.mol2_pdb1gow.pdb_1215037815 02-Jul-2008 18:30:35 [Hydrogen@Home] Finished download of 1215037815_pdb1gow.pdb 02-Jul-2008 18:30:35 [Hydrogen@Home] Finished download of ad_nsc36937.mol2_pdb1gow.pdb_1215037815 02-Jul-2008 18:30:41 [Hydrogen@Home] Sending scheduler request: To fetch work. Requesting 13268 seconds of work, reporting 0 completed tasks 02-Jul-2008 18:30:46 [Hydrogen@Home] Scheduler request succeeded: got 0 new tasks 02-Jul-2008 18:30:46 [Hydrogen@Home] Message from server: No work sent 02-Jul-2008 18:31:47 [Hydrogen@Home] Sending scheduler request: To fetch work. Requesting 13268 seconds of work, reporting 0 completed tasks 02-Jul-2008 18:31:52 [Hydrogen@Home] Scheduler request succeeded: got 0 new tasks 02-Jul-2008 18:31:52 [Hydrogen@Home] Message from server: No work sent 02-Jul-2008 18:32:52 [Hydrogen@Home] Sending scheduler request: To fetch work. Requesting 13304 seconds of work, reporting 1 completed tasks 02-Jul-2008 18:32:57 [Hydrogen@Home] Scheduler request succeeded: got 0 new tasks 02-Jul-2008 18:32:57 [Hydrogen@Home] Message from server: No work sent Resuming CPDN! Resuming CPDN! Cleaning up graphics data... 
Detaching shared memory... |
Send message Joined: 29 Aug 05 Posts: 15569 |
Looks like it's CPDN that's causing it. Let me advertise this thread on the CPDN forums, get some of their people over here to check things with you. |
Send message Joined: 16 Apr 06 Posts: 386 |
It might be worth running a stress test on the machine. I like to use Prime95 (which can be downloaded from http://www.mersenne.org/); the Linux version is called mprime. You need to run one copy per core, using the affinity option to keep the copies from stepping on each other's toes. If it runs for 24 hours without any errors, that is a good way to demonstrate that the hardware side of things is working OK. (Look in the 'hardware' section of the following post for some ideas, although it is written more with MS-XP in mind: http://www.climateprediction.net/board/viewtopic.php?t=5896. sensors / lm-sensors can be used to monitor the CPU temperature, since overheating can also cause intermittent hardware faults.) |
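The "one copy per core" idea can be sketched roughly as below. This only prints the commands and runs nothing. It pins each copy with `taskset -c` from util-linux rather than mprime's own affinity option, and the `./mprime -t` torture-test invocation is an assumption that should be checked against the mprime readme; each copy will probably also need its own working directory.

```shell
# Print one pinned stress-test command per core (nothing is executed).
# "./mprime -t" is an assumed invocation; verify it in the mprime readme.
print_stress_commands() {
    ncores=$1
    core=0
    while [ "$core" -lt "$ncores" ]; do
        # taskset -c N restricts the command to logical CPU N.
        echo "taskset -c $core ./mprime -t"
        core=$((core + 1))
    done
}
```

For a two-processor box, `print_stress_commands 2` lists one command for CPU 0 and one for CPU 1, which can then be started in separate terminals.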
Send message Joined: 19 Dec 05 Posts: 93 |
"Looks like it's CPDN that's causing it. Let me advertise this thread on the CPDN forums, get some of their people over here to check things with you." Whatever it is, it crashed the boinc client on both machines last night. It did not crash the machines themselves, and they continued running all night, doing the things they do. In fact, I do not remember either of these machines ever crashing. The only machine I ever had (new in 1996) that crashed did so mostly when running Windows 95 and Red Hat Linux 5.0 and 6.0, and maybe 6.2. By the time Red Hat Linux 7.3 came out, it never crashed when running Linux. That machine has been retired for a few years now; I do not need three machines. Both machines have APC Smart-UPSs supplying power, and neither recorded a power event overnight. If the machines are experiencing hardware problems, it seems strange to me that they should both start doing so at about the same time. The machines are different, and the problem targets only boinc-related processes. The older one (new in early 2000) is by VA Linux Systems and has dual 550 MHz Pentium III processors and 512 Megabytes of RAM. The newer one (new in early 2004) I put together myself; it has dual hyperthreaded 3.06 GHz Xeon processors (32-bit) and 8 GBytes of RAM. They have different chip sets. I restarted the boinc client on the new machine and it started right up. Likewise on the old machine. I now have them both set to get new tasks only for setiathome, and I have suspended all climate prediction work units. So in a few hours or a day or so, they should be running only setiathome. Then we may see. |
Send message Joined: 19 Dec 05 Posts: 93 |
"It might be worth running a stress test on the machine. I like to use Prime95 (which can be downloaded from http://www.mersenne.org/). The Linux version is called mprime." I really doubt it is a hardware problem, since both my machines are having problems. And it is unlikely to be a power problem, because both machines are on APC Smart-UPS power and neither recorded any power events. "You need to run one copy per core, making sure that you use the affinity option to keep them from stepping on their own toes. If you can run it for 24 hours without any errors then it is a good way to demonstrate that the hardware side of things is working OK." It is unlikely to be a temperature problem, because it was cool last night and lm_sensors recorded no unusual temperatures or fan speeds. Here is how it came down:
04-Jul-2008 02:34:58 [Hydrogen@Home] Sending scheduler request: To fetch work. Requesting 72352 seconds of work, reporting 0 completed tasks
04-Jul-2008 02:35:03 [Hydrogen@Home] Scheduler request succeeded: got 0 new tasks
04-Jul-2008 02:35:03 [Hydrogen@Home] Message from server: No work sent
Resuming CPDN!
hadam3h_c_62s02_2000_2000_1 - PH 1 TS 0014904 A - 14/07/2000 12:00 - H:M:S=0209:51:33 AVG=50.69 DLT=40.99
hadam3h_c_62s02_2000_2000_1 - PH 1 TS 0014905 A - 14/07/2000 12:10 - H:M:S=0209:52:14 AVG=50.69 DLT=41.00
04-Jul-2008 02:36:20 [Hydrogen@Home] Sending scheduler request: To fetch work. Requesting 73761 seconds of work, reporting 0 completed tasks
04-Jul-2008 02:36:25 [Hydrogen@Home] Scheduler request succeeded: got 1 new tasks
Cleaning up graphics data...
Detaching shared memory...
And here are the temperatures, processor fans, and voltages: Fri Jul 4 02:30:01 EDT 2008 w83627hf-isa-0290 Adapter: ISA adapter VCore: +1.44 V (min = +1.36 V, max = +1.47 V) +3.3V: +3.31 V (min = +3.14 V, max = +3.46 V) VBat: +3.18 V (min = +2.40 V, max = +3.60 V) +5V: +4.92 V (min = +4.84 V, max = +5.24 V) +12V: +11.92 V (min = +11.49 V, max = +12.59 V) -12V: -11.78 V (min = -13.02 V, max = -11.37 V) V5SB: +5.43 V (min = +4.84 V, max = +5.24 V) CPU0 fan: 3668 RPM (min = 1592 RPM, div = 8) CPU1 fan: 2722 RPM (min = 1592 RPM, div = 8) System: +42 C (high = +50 C, hyst = +48 C) sensor = thermistor CPU0: +55.5 C (high = +60 C, hyst = +58 C) sensor = thermistor CPU1: +54.0 C (high = +60 C, hyst = +58 C) sensor = thermistor vid: +1.525 V (VRM Version 9.0) alarms: Chassis intrusion detection beep_enable: Sound alarm disabled Fri Jul 4 02:45:01 EDT 2008 w83627hf-isa-0290 Adapter: ISA adapter VCore: +1.46 V (min = +1.36 V, max = +1.47 V) +3.3V: +3.31 V (min = +3.14 V, max = +3.46 V) VBat: +3.18 V (min = +2.40 V, max = +3.60 V) +5V: +4.97 V (min = +4.84 V, max = +5.24 V) +12V: +12.04 V (min = +11.49 V, max = +12.59 V) -12V: -11.70 V (min = -13.02 V, max = -11.37 V) V5SB: +5.46 V (min = +4.84 V, max = +5.24 V) CPU0 fan: 2909 RPM (min = 1592 RPM, div = 8) CPU1 fan: 2482 RPM (min = 1592 RPM, div = 8) System: +41 C (high = +50 C, hyst = +48 C) sensor = thermistor CPU0: +38.5 C (high = +60 C, hyst = +58 C) sensor = thermistor CPU1: +38.5 C (high = +60 C, hyst = +58 C) sensor = thermistor vid: +1.525 V (VRM Version 9.0) alarms: Chassis intrusion detection beep_enable: Sound alarm disabled |
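To make "no unusual temperatures" a little more concrete, here is a rough sketch that reads sensor lines in the format pasted above and prints how far each temperature is below its "high" limit. It assumes the one-reading-per-line layout that `sensors` normally prints and the plain " C" unit shown in this thread; other lm-sensors versions format the unit differently, so the pattern may need adjusting. The function name is made up.

```shell
# Read lm-sensors style lines on stdin and print, for each temperature,
# the sensor name, the reading, the "high" limit and the remaining margin.
temp_margins() {
    awk '/ C .*high = / {
        name = $1; sub(/:$/, "", name)
        temp = $2 + 0                      # "+55.5" becomes 55.5
        high = 0
        for (i = 2; i < NF; i++)
            if ($(i-1) == "(high" && $i == "=") { high = $(i+1) + 0; break }
        printf "%s %.1f %.1f %.1f\n", name, temp, high, high - temp
    }'
}
```

Fed the 02:30 reading above (for example via `sensors | temp_margins`), the CPU0 line comes out as a 55.5 C reading against a 60 C limit, a margin of 4.5 C. Voltage and fan lines are ignored.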
Send message Joined: 29 Aug 05 Posts: 15569 |
One thing I notice now that I have checked through all the logs you supplied: it seems to 'crash' each time after you either just started a task (regardless of project) or had a scheduler request. At all these times there's reading from and writing to disk going on. So could it be your disk controller? Would you be willing to test the BOINC 6.2 that is about to be released, to see if it does this as well? |
Send message Joined: 19 Dec 05 Posts: 93 |
"One thing I see now I check through all your supplied logs. It seems to 'crash' each time after you either just started a task (independent of whichever project) or had a scheduler moment. At all these times there's reading from & writing to disk going on." Recall that I am running two different machines and get the same problems on each. So if it is the disk controller, it would have to be failing on both: on one machine, all disks are on SCSI controllers, and on the other, the disk used by the BOINC client is on an EIDE controller. On the machine with the EIDE controller, that hard drive has had two errors, both at power on, over the last 8+ years, as measured with the smartctl program. One was 986 days and the other 1006 days after initial installation. Since the machine is about 2920 days old and the problems are recent, I would guess it is not a hard drive problem. On the machine with the SCSI controllers, there have been no uncorrected errors in the life of the hard drive with BOINC on it. I downloaded BOINC 6.2 onto my old machine. It says ubuntu, and I am running CentOS 4 on that machine. I hope it will be OK. I am currently running two instances of mprime on that machine (one for each processor) and want to let them run some more. They have found no errors in over 4 hours (each processor), but I think I should let them run a few more hours before deciding that the memory and processors are OK. To be honest, I have already decided that, but I might as well let them run a little longer. |
Send message Joined: 19 Dec 05 Posts: 93 |
OK: the Mersenne prime torture test ran on both processors with no problems for 6 hours each. I then tried to start BOINC 6.2, but it will not start because it wants glibc 2.4 and mine is glibc-2.3.4-2.39. |
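A version mismatch like this can be checked before installing. The sketch below assumes a glibc-based system where `getconf GNU_LIBC_VERSION` is available and a GNU `sort` that supports `-V` (version sort), which older distributions may lack. The function names are illustrative.

```shell
# Succeed if version $1 is at least version $2 (relies on GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Print the glibc version of the running system, e.g. "2.3.4".
glibc_version() {
    getconf GNU_LIBC_VERSION | awk '{print $2}'
}
```

For example, `version_ge "$(glibc_version)" 2.4 || echo "glibc too old for this build"` would have flagged the 2.3.4 library mentioned above, since 2.3.4 sorts before 2.4.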
Send message Joined: 29 Aug 05 Posts: 15569 |
Can you update glibc on your distro? The developers got back to me and would be very interested to know whether you manage to crash the 6.2 client, especially since no application should ever crash the client. |
Send message Joined: 29 Aug 05 Posts: 15569 |
OK, scratch the former post. Please first try the 5.10.45 debug version. For any and all crashes you get with it, post the stack trace that it leaves. The information should be saved in stderrdae.txt, so perhaps you want to clear it before testing this version (just delete it, it'll be made again by BOINC). You can still try 6.2.11, but you need the compatible version, not the Ubuntu version. Only the Ubuntu version needs glibc 2.4. The two downloads are the 6.2.11 compatible version and the 6.2.11 compatible debug version. As far as I know, these two don't include a BOINC Manager, although you should be able to use the BM of 5.10.45 if you keep it around. |
Send message Joined: 19 Dec 05 Posts: 93 |
"OK, scratch the former post. Please first try the 5.10.45 debug version." OK, I have 6.2.11 (boinc_6.2.11_i686-pc-linux-gnu_debug.sh) running on my older machine. That machine will fetch no new work except from setiathome. Do you want me to change that to get any other applications? I have suspended the only climateprediction work unit. Right now it has one world community grid, one rosetta, and two setiathome work units in it. The boinc client did not create the error file. Does it wait until it needs to write it? |
Send message Joined: 30 Oct 05 Posts: 1239 |
boinc client did not create the error file. Does it wait until it needs to write it? How did you set up BOINC? The Linux version doesn't create the logs if you just do ./boincmgr from the directory it's installed to. You can try ./boinc --daemon and then ./boincmgr (I think it's still called boinc, not boinc_client, but it'll be obvious what you want, it's not the one with cmd or mgr in it). Kathryn :o) |
Send message Joined: 19 Dec 05 Posts: 93 |
boinc client did not create the error file. Does it wait until it needs to write it? I do it so it comes up automatically when I boot the system. So there is stuff in /etc/sysconfig/boinc and /etc/rc.d/init.d/boinc to do it. So the stuff in init.d/boinc includes: case "$1" in start) cd $BOINCDIR if [ ! -f client_state.xml ] ; then echo -n "BOINC client requires initialization first." echo_failure exit 3 fi echo -n "Starting BOINC client: " su - boinc -c "$BOINCDIR/BOINC/boinc >>$BOINCDIR/$LOGFILE 2>>$BOINCDIR/$ERRORLOG &" # su - boinc -c "$BOINCDIR/BOINC/boinc >/dev/null 2>>$BOINCDIR/$ERRORLOG &" echo_success echo ;; And some of those definitions come from /etc/sysconfig/boinc, as follows: # Configuration for boinc client. BOINCUSER=boinc BOINCDIR=/boinc BUILD_ARCH=i686-pc-linux-gnu LOGFILE=valinuxl.boinc.log ERRORLOG=valinuxl.error.log I just restarted it and the error log is empty and the logfile is: 04-Jul-2008 22:36:47 [---] Starting BOINC client version 6.2.11 for i686-pc-linux-gnu 04-Jul-2008 22:36:47 [---] This a development version of BOINC and may not function properly 04-Jul-2008 22:36:47 [---] log flags: task, file_xfer, sched_ops 04-Jul-2008 22:36:47 [---] Libraries: libcurl/7.18.0 OpenSSL/0.9.8g zlib/1.2.1.2 c-ares/1.5.1 04-Jul-2008 22:36:47 [---] Data directory: /boinc 04-Jul-2008 22:36:47 [---] Processor: 2 GenuineIntel Pentium III (Katmai) [Family 6 Model 7 Stepping 3] 04-Jul-2008 22:36:47 [---] Processor features: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 mmx fxsr sse 04-Jul-2008 22:36:47 [---] OS: Linux: 2.6.9-67.0.20.ELsmp 04-Jul-2008 22:36:47 [---] Memory: 502.41 MB physical, 1.00 GB virtual 04-Jul-2008 22:36:47 [---] Disk: 7.88 GB total, 6.99 GB free 04-Jul-2008 22:36:47 [---] Local time is UTC -4 hours 04-Jul-2008 22:36:47 [---] No coprocessors 04-Jul-2008 22:36:47 [---] Version change (5.10.45 -> 6.2.11) 04-Jul-2008 22:36:47 [SETI@home] URL: http://setiathome.berkeley.edu/; Computer ID: 859259; location: home; 
project prefs: default 04-Jul-2008 22:36:47 [climateprediction.net] URL: http://climateprediction.net/; Computer ID: 164427; location: home; project prefs: default 04-Jul-2008 22:36:47 [Predictor @ Home] URL: http://predictor.chem.lsa.umich.edu/; Computer ID: 101216; location: home; project prefs: default 04-Jul-2008 22:36:47 [rosetta@home] URL: http://boinc.bakerlab.org/rosetta/; Computer ID: 52404; location: home; project prefs: default 04-Jul-2008 22:36:47 [Hydrogen@Home] URL: http://hydrogenathome.org/; Computer ID: 4152; location: (none); project prefs: default 04-Jul-2008 22:36:47 [malariacontrol.net] URL: http://www.malariacontrol.net/; Computer ID: 12250; location: home; project prefs: default 04-Jul-2008 22:36:47 [World Community Grid] URL: http://www.worldcommunitygrid.org/; Computer ID: 603471; location: (none); project prefs: default 04-Jul-2008 22:36:47 [---] General prefs: from malariacontrol.net (last modified 21-May-2008 09:19:36) 04-Jul-2008 22:36:47 [---] Computer location: home 04-Jul-2008 22:36:47 [---] General prefs: no separate prefs for home; using your defaults 04-Jul-2008 22:36:47 [---] Preferences limit memory usage when active to 376.80MB 04-Jul-2008 22:36:47 [---] Preferences limit memory usage when idle to 452.17MB 04-Jul-2008 22:36:47 [---] Preferences limit disk usage to 6.90GB 04-Jul-2008 22:36:47 [---] Running CPU benchmarks 04-Jul-2008 22:37:18 [---] Benchmark results: 04-Jul-2008 22:37:18 [---] Number of CPUs: 2 04-Jul-2008 22:37:18 [---] 299 floating point MIPS (Whetstone) per CPU 04-Jul-2008 22:37:18 [---] 550 integer MIPS (Dhrystone) per CPU 04-Jul-2008 22:37:19 [World Community Grid] Restarting task faah4143_AB3_MIN3_xmd06240_02_0 using faah version 605 04-Jul-2008 22:37:21 [rosetta@home] Restarting task FRA_t454_CASP8_2CIR_5_axnew.0167_0003_4170_854_0 using rosetta_beta version 598 which looks harmless. climate prediction is still suspended and the only work units that will get new tasks come from setiathome. 
Do you want me to enable other work units? |
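On the earlier question of whether the error file only appears once something is written to it: when the client is launched through `>>` and `2>>` redirections like the ones in the init script quoted above, it is the shell that creates both files as soon as the command starts, so an empty error log is expected until the client writes to stderr. A small sketch of that behaviour, using a stand-in command and made-up file names in a temporary directory rather than the real client:

```shell
# Demonstrate that ">>" and "2>>" create their target files when the
# command starts, even if the command never writes anything to stderr.
demo_dir=$(mktemp -d)

# Stand-in for the client: writes one line to stdout, nothing to stderr.
sh -c 'echo "client started"' >>"$demo_dir/boinc.log" 2>>"$demo_dir/error.log"

# Both files now exist; error.log has size 0.
ls -l "$demo_dir"
```

This says nothing about what BOINC does when it is started without redirection, for example from the manager, which is the case discussed earlier in the thread.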