BOINC 6.2.xx - crashes all over the place

Message boards : BOINC client : BOINC 6.2.xx - crashes all over the place
me

Joined: 7 Aug 08
Posts: 1
Kazakhstan
Message 19283 - Posted: 7 Aug 2008, 13:58:52 UTC

Hi,

Not sure if someone else already posted it, but I do think 6.2.xx is broken like hell.

I've seen this on numerous projects while being paired with some wingmen that use 6.2.xx.
It's always the same message...

<core_client_version>6.2.14</core_client_version>
<![CDATA[
<message>
Can't get shared memory segment name: shmget() failed
</message>
]]>

example:
http://genlife.is-a-geek.org/genlife/workunit.php?wuid=18194

seen some crashes on Milkyway and S@H too... always the same pattern/error message...
ID: 19283
Jean-David

Joined: 19 Dec 05
Posts: 89
United States
Message 19323 - Posted: 8 Aug 2008, 10:24:36 UTC - in response to Message 19283.  

Hi,

Not sure if someone else already posted it, but I do think 6.2.xx is broken like hell.

I've seen this on numerous projects while being paired with some wingmen that use 6.2.xx.
It's always the same message...

<core_client_version>6.2.14</core_client_version>
<![CDATA[
<message>
Can't get shared memory segment name: shmget() failed
</message>
]]>

example:
http://genlife.is-a-geek.org/genlife/workunit.php?wuid=18194

seen some crashes on Milkyway and S@H too... always the same pattern/error message...


I ran 6.2.11 for several weeks (2 or maybe 3) on two different 32-bit machines. One on CentOS4 and the other on Red Hat Enterprise Linux 5. Both worked fine without that message. I upgraded the RHEL5 machine to 6.2.14 and it, too, works fine.

I could not upgrade the CentOS4 machine to 6.2.14 but that was because the CentOS4 library for something (glibc?) is too old. I should upgrade that machine to CentOS5, but I have not gotten to that yet.

ID: 19323
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5082
United Kingdom
Message 20075 - Posted: 10 Sep 2008, 17:35:02 UTC

Is anyone still performing post-mortems on v6.2.14, to get to the bottom of what caused all those "Can't get shared memory segment name: shmget() failed" messages?

I have one 6.2.14 client for testing purposes. It was running absolutely smoothly, with no errors at all, until this happened:

10-Sep-2008 11:10:54 [lhcathome] Sending scheduler request: To fetch work.  Requesting 57827 seconds of work, reporting 0 completed tasks
10-Sep-2008 11:11:16 [---] Project communication failed: attempting access to reference site
10-Sep-2008 11:11:18 [---] Internet access OK - project servers may be temporarily down.
10-Sep-2008 11:11:19 [lhcathome] Scheduler request failed: Couldn't connect to server
10-Sep-2008 11:28:04 [SETI@home] Computation for task 14au08af.23089.18477.5.8.223_1 finished
10-Sep-2008 11:28:04 [SETI@home] Starting ap_14au08aa_B0_P1_00115_20080909_26999.wu_0
10-Sep-2008 11:28:04 [SETI@home] Starting task ap_14au08aa_B0_P1_00115_20080909_26999.wu_0 using astropulse version 435
10-Sep-2008 11:28:06 [SETI@home] Started upload of 14au08af.23089.18477.5.8.223_1_0
10-Sep-2008 11:28:14 [SETI@home] Finished upload of 14au08af.23089.18477.5.8.223_1_0
10-Sep-2008 12:02:55 [SETI@home] Computation for task 15au08aa.16292.17250.5.8.155_1 finished
10-Sep-2008 12:02:55 [SETI@home] Starting 14au08af.23089.24203.5.8.28_0
10-Sep-2008 12:02:55 [SETI@home] Starting task 14au08af.23089.24203.5.8.28_0 using setiathome_enhanced version 528
10-Sep-2008 12:02:57 [SETI@home] Started upload of 15au08aa.16292.17250.5.8.155_1_0
10-Sep-2008 12:03:04 [SETI@home] Finished upload of 15au08aa.16292.17250.5.8.155_1_0
10-Sep-2008 12:57:31 [SETI@home] Computation for task 14au08af.23089.24203.5.8.28_0 finished
10-Sep-2008 12:57:31 [SETI@home] Starting 15au08aa.16292.20931.5.8.4_0
10-Sep-2008 12:57:31 [SETI@home] Starting task 15au08aa.16292.20931.5.8.4_0 using setiathome_enhanced version 528
10-Sep-2008 12:57:33 [SETI@home] Started upload of 14au08af.23089.24203.5.8.28_0_0
10-Sep-2008 12:57:40 [SETI@home] Finished upload of 14au08af.23089.24203.5.8.28_0_0
10-Sep-2008 13:21:00 [lhcathome] Sending scheduler request: To fetch work.  Requesting 57797 seconds of work, reporting 0 completed tasks
10-Sep-2008 13:21:05 [lhcathome] Scheduler request succeeded: got 0 new tasks
10-Sep-2008 13:52:08 [SETI@home] Computation for task 15au08aa.16292.20931.5.8.4_0 finished
10-Sep-2008 13:52:08 [SETI@home] Starting 14au08ae.28085.72.13.8.135_0
10-Sep-2008 13:52:08 [SETI@home] Starting task 14au08ae.28085.72.13.8.135_0 using setiathome_enhanced version 528
10-Sep-2008 13:52:10 [SETI@home] Started upload of 15au08aa.16292.20931.5.8.4_0_0
10-Sep-2008 13:52:18 [SETI@home] Finished upload of 15au08aa.16292.20931.5.8.4_0_0
10-Sep-2008 14:46:06 [SETI@home] Computation for task 14au08ae.28085.72.13.8.135_0 finished
[b][color=red]10-Sep-2008 14:46:06 [SETI@home] Starting 14au08ae.28085.890.13.8.242_0
10-Sep-2008 14:46:06 [SETI@home] Starting 14au08ae.28085.3344.13.8.113_1[/color][/b]
10-Sep-2008 14:46:07 [SETI@home] Computation for task 14au08ae.28085.890.13.8.242_0 finished
10-Sep-2008 14:46:07 [SETI@home] Output file 14au08ae.28085.890.13.8.242_0_0 for task 14au08ae.28085.890.13.8.242_0 absent
10-Sep-2008 14:46:07 [SETI@home] Computation for task 14au08ae.28085.3344.13.8.113_1 finished
10-Sep-2008 14:46:07 [SETI@home] Output file 14au08ae.28085.3344.13.8.113_1_0 for task 14au08ae.28085.3344.13.8.113_1 absent
10-Sep-2008 14:46:07 [SETI@home] Starting 14au08ae.28085.3753.13.8.163_1
10-Sep-2008 14:46:08 [SETI@home] Started upload of 14au08ae.28085.72.13.8.135_0_0
10-Sep-2008 14:46:08 [SETI@home] Computation for task 14au08ae.28085.3753.13.8.163_1 finished
10-Sep-2008 14:46:08 [SETI@home] Output file 14au08ae.28085.3753.13.8.163_1_0 for task 14au08ae.28085.3753.13.8.163_1 absent
10-Sep-2008 14:46:08 [SETI@home] Starting 14au08af.1803.6207.6.8.251_0
10-Sep-2008 14:46:09 [SETI@home] Computation for task 14au08af.1803.6207.6.8.251_0 finished
10-Sep-2008 14:46:09 [SETI@home] Output file 14au08af.1803.6207.6.8.251_0_0 for task 14au08af.1803.6207.6.8.251_0 absent

This is a quad core, and is attached to a variety of projects: however as of today, every project is set to NNT except SETI and LHC. LHC had no work at the time, so effectively the host had become a SETI-only cruncher.

Further, it had (and still has) three Astropulse tasks running - you can see where the third AP task started, at 11:28:04. They are 40-hour plus tasks, so that means that only one core remained available for SETI MB work, and you can see how the tasks start one at a time - at 12:02:55, 12:57:31, 13:52:08 etc. I run with a conservative 1 day cache, so there is no question of tasks being pre-empted for EDF.

Then, at 14:46:06 (highlighted), BOINC tried to start two tasks at once. They both crashed with the "shmget() failed" error, and BOINC then proceeded to trash the remaining 74 tasks in the cache, one per second.

Fortunately, it didn't trash the running Astropulse tasks, and it did go into a 24-hour backoff on scheduler contact with SETI (no reason apparent in the logs - the only scheduler contacts are:

10-Sep-2008 09:36:57 [SETI@home] Sending scheduler request: To fetch work.  Requesting 101 seconds of work, reporting 2 completed tasks
10-Sep-2008 09:37:02 [SETI@home] Scheduler request succeeded: got 1 new tasks

and

10-Sep-2008 17:41:07 [SETI@home] Fetching scheduler list
10-Sep-2008 17:41:12 [SETI@home] Master file download succeeded
10-Sep-2008 17:41:17 [SETI@home] Sending scheduler request: Requested by user.  Requesting 0 seconds of work, reporting 86 completed tasks
10-Sep-2008 17:41:22 [SETI@home] Scheduler request succeeded: got 0 new tasks
10-Sep-2008 17:47:41 [---] Exit requested by user

when I got home).

So the only oddity I can see is that double task start at 14:46:06, which would have meant five tasks running on a four-core CPU. Host ID 4292666 at SETI, now upgraded to BOINC v6.2.18 (service install, as before).
ID: 20075
Thyme Lawn

Joined: 2 Sep 05
Posts: 103
United Kingdom
Message 20077 - Posted: 10 Sep 2008, 19:52:25 UTC - in response to Message 20075.  

Is anyone still performing post-mortems on v6.2.14, to get to the bottom of what caused all those Can't get shared memory segment name: shmget() failed messages?

It's worth checking if there are any extra messages in stderrdae.txt. If there was a problem setting up the shared memory security descriptors the error messages will have been written directly to the stderr file stream (you won't see them in stdoutdae.txt or the BOINC Manager message tab).
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer
ID: 20077
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5082
United Kingdom
Message 20078 - Posted: 10 Sep 2008, 21:30:44 UTC - in response to Message 20077.  

Is anyone still performing post-mortems on v6.2.14, to get to the bottom of what caused all those Can't get shared memory segment name: shmget() failed messages?

It's worth checking if there are any extra messages in stderrdae.txt. If there was a problem setting up the shared memory security descriptors the error messages will have been written directly to the stderr file stream (you won't see them in stdoutdae.txt or the BOINC Manager message tab).

Worth a look - but it seems stderrdae.txt hasn't been written to since 7 May 2008, and only contains (multiple iterations of):

UNRECOGNIZED: suspend_if_no_recent_input
UNRECOGNIZED: max_ncpus_pct
ID: 20078
Jord
Volunteer tester
Help desk expert

Joined: 29 Aug 05
Posts: 15482
Netherlands
Message 20141 - Posted: 11 Sep 2008, 21:40:32 UTC - in response to Message 20075.  

Is anyone still performing post-mortems on v6.2.14, to get to the bottom of what caused all those Can't get shared memory segment name: shmget() failed messages?

Not really, as it was fixed in 6.2.18. See its change log, which says:

I was able to verify the BOINCTray.exe issue and the shared-mem and handle leaks. I’m not sure how any of us could test the client crash scenario, I ran through the basic battery of tests against BOINC Alpha. I guess we’ll just have to let the people who discovered it, let us know if the problem is fixed.

- client: don't leak handles to shared-mem files

- client: don't leak process handles when abort jobs

- client: if an app exits or we kill it, always destroy the shmem segment.

ID: 20141
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5082
United Kingdom
Message 20155 - Posted: 11 Sep 2008, 23:40:59 UTC - in response to Message 20141.  

Is anyone still performing post-mortems on v6.2.14, to get to the bottom of what caused all those Can't get shared memory segment name: shmget() failed messages?

Not really, as it was fixed in 6.2.18. See its change log, which says:

I was able to verify the BOINCTray.exe issue and the shared-mem and handle leaks. I’m not sure how any of us could test the client crash scenario, I ran through the basic battery of tests against BOINC Alpha. I guess we’ll just have to let the people who discovered it, let us know if the problem is fixed.

- client: don't leak handles to shared-mem files

- client: don't leak process handles when abort jobs

- client: if an app exits or we kill it, always destroy the shmem segment.

None of those actually addresses what happens when we try to start more concurrent tasks than we have cores.
ID: 20155
Jord
Volunteer tester
Help desk expert

Joined: 29 Aug 05
Posts: 15482
Netherlands
Message 20162 - Posted: 12 Sep 2008, 1:41:54 UTC - in response to Message 20155.  

None of those actually addresses what happens when we try to start more concurrent tasks than we have cores.

Um... why would you want to do that anyway? Or am I missing something?
ID: 20162
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5082
United Kingdom
Message 20172 - Posted: 12 Sep 2008, 8:14:25 UTC - in response to Message 20162.  
Last modified: 12 Sep 2008, 8:14:38 UTC

None of those actually addresses what happens when we try to start more concurrent tasks than we have cores.

Um... why would you want to do that anyway? or am I missing something?

Well, I don't want to - but it seems my CC v6.2.14 did (at 10-Sep-2008 14:46:06, see the log in my earlier post), and that's what provoked the first attack of shmget() failures.
ID: 20172


Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.