BOINC 6.2.xx - crashes all over the place

Message boards : BOINC client : BOINC 6.2.xx - crashes all over the place
me

Joined: 7 Aug 08
Posts: 1
Kazakhstan
Message 19283 - Posted: 7 Aug 2008, 13:58:52 UTC

Hi,

Not sure if someone else already posted it, but I do think 6.2.xx is broken like hell.

I've seen this on numerous projects while being paired with some wingmen that use 6.2.xx.
It's always the same message...

<core_client_version>6.2.14</core_client_version>
<![CDATA[
<message>
Can't get shared memory segment name: shmget() failed
</message>
]]>

example:
http://genlife.is-a-geek.org/genlife/workunit.php?wuid=18194

seen some crashes on Milkyway and S@H too... always the same pattern/error message...
ID: 19283
Jean-David

Joined: 19 Dec 05
Posts: 89
United States
Message 19323 - Posted: 8 Aug 2008, 10:24:36 UTC - in response to Message 19283.  

Hi,

Not sure if someone else already posted it, but I do think 6.2.xx is broken like hell.

I've seen this on numerous projects while being paired with some wingmen that use 6.2.xx.
It's always the same message...

<core_client_version>6.2.14</core_client_version>
<![CDATA[
<message>
Can't get shared memory segment name: shmget() failed
</message>
]]>

example:
http://genlife.is-a-geek.org/genlife/workunit.php?wuid=18194

seen some crashes on Milkyway and S@H too... always the same pattern/error message...


I ran 6.2.11 for several weeks (2 or maybe 3) on two different 32-bit machines. One on CentOS4 and the other on Red Hat Enterprise Linux 5. Both worked fine without that message. I upgraded the RHEL5 machine to 6.2.14 and it, too, works fine.

I could not upgrade the CentOS4 machine to 6.2.14 but that was because the CentOS4 library for something (glibc?) is too old. I should upgrade that machine to CentOS5, but I have not gotten to that yet.

ID: 19323
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5082
United Kingdom
Message 20075 - Posted: 10 Sep 2008, 17:35:02 UTC

Is anyone still performing post-mortems on v6.2.14, to get to the bottom of what caused all those "Can't get shared memory segment name: shmget() failed" messages?

I have one 6.2.14 client for testing purposes. It was running absolutely smoothly, with no errors at all, until this happened:

10-Sep-2008 11:10:54 [lhcathome] Sending scheduler request: To fetch work.  Requesting 57827 seconds of work, reporting 0 completed tasks
10-Sep-2008 11:11:16 [---] Project communication failed: attempting access to reference site
10-Sep-2008 11:11:18 [---] Internet access OK - project servers may be temporarily down.
10-Sep-2008 11:11:19 [lhcathome] Scheduler request failed: Couldn't connect to server
10-Sep-2008 11:28:04 [SETI@home] Computation for task 14au08af.23089.18477.5.8.223_1 finished
10-Sep-2008 11:28:04 [SETI@home] Starting ap_14au08aa_B0_P1_00115_20080909_26999.wu_0
10-Sep-2008 11:28:04 [SETI@home] Starting task ap_14au08aa_B0_P1_00115_20080909_26999.wu_0 using astropulse version 435
10-Sep-2008 11:28:06 [SETI@home] Started upload of 14au08af.23089.18477.5.8.223_1_0
10-Sep-2008 11:28:14 [SETI@home] Finished upload of 14au08af.23089.18477.5.8.223_1_0
10-Sep-2008 12:02:55 [SETI@home] Computation for task 15au08aa.16292.17250.5.8.155_1 finished
10-Sep-2008 12:02:55 [SETI@home] Starting 14au08af.23089.24203.5.8.28_0
10-Sep-2008 12:02:55 [SETI@home] Starting task 14au08af.23089.24203.5.8.28_0 using setiathome_enhanced version 528
10-Sep-2008 12:02:57 [SETI@home] Started upload of 15au08aa.16292.17250.5.8.155_1_0
10-Sep-2008 12:03:04 [SETI@home] Finished upload of 15au08aa.16292.17250.5.8.155_1_0
10-Sep-2008 12:57:31 [SETI@home] Computation for task 14au08af.23089.24203.5.8.28_0 finished
10-Sep-2008 12:57:31 [SETI@home] Starting 15au08aa.16292.20931.5.8.4_0
10-Sep-2008 12:57:31 [SETI@home] Starting task 15au08aa.16292.20931.5.8.4_0 using setiathome_enhanced version 528
10-Sep-2008 12:57:33 [SETI@home] Started upload of 14au08af.23089.24203.5.8.28_0_0
10-Sep-2008 12:57:40 [SETI@home] Finished upload of 14au08af.23089.24203.5.8.28_0_0
10-Sep-2008 13:21:00 [lhcathome] Sending scheduler request: To fetch work.  Requesting 57797 seconds of work, reporting 0 completed tasks
10-Sep-2008 13:21:05 [lhcathome] Scheduler request succeeded: got 0 new tasks
10-Sep-2008 13:52:08 [SETI@home] Computation for task 15au08aa.16292.20931.5.8.4_0 finished
10-Sep-2008 13:52:08 [SETI@home] Starting 14au08ae.28085.72.13.8.135_0
10-Sep-2008 13:52:08 [SETI@home] Starting task 14au08ae.28085.72.13.8.135_0 using setiathome_enhanced version 528
10-Sep-2008 13:52:10 [SETI@home] Started upload of 15au08aa.16292.20931.5.8.4_0_0
10-Sep-2008 13:52:18 [SETI@home] Finished upload of 15au08aa.16292.20931.5.8.4_0_0
10-Sep-2008 14:46:06 [SETI@home] Computation for task 14au08ae.28085.72.13.8.135_0 finished
[b][color=red]10-Sep-2008 14:46:06 [SETI@home] Starting 14au08ae.28085.890.13.8.242_0
10-Sep-2008 14:46:06 [SETI@home] Starting 14au08ae.28085.3344.13.8.113_1[/color][/b]
10-Sep-2008 14:46:07 [SETI@home] Computation for task 14au08ae.28085.890.13.8.242_0 finished
10-Sep-2008 14:46:07 [SETI@home] Output file 14au08ae.28085.890.13.8.242_0_0 for task 14au08ae.28085.890.13.8.242_0 absent
10-Sep-2008 14:46:07 [SETI@home] Computation for task 14au08ae.28085.3344.13.8.113_1 finished
10-Sep-2008 14:46:07 [SETI@home] Output file 14au08ae.28085.3344.13.8.113_1_0 for task 14au08ae.28085.3344.13.8.113_1 absent
10-Sep-2008 14:46:07 [SETI@home] Starting 14au08ae.28085.3753.13.8.163_1
10-Sep-2008 14:46:08 [SETI@home] Started upload of 14au08ae.28085.72.13.8.135_0_0
10-Sep-2008 14:46:08 [SETI@home] Computation for task 14au08ae.28085.3753.13.8.163_1 finished
10-Sep-2008 14:46:08 [SETI@home] Output file 14au08ae.28085.3753.13.8.163_1_0 for task 14au08ae.28085.3753.13.8.163_1 absent
10-Sep-2008 14:46:08 [SETI@home] Starting 14au08af.1803.6207.6.8.251_0
10-Sep-2008 14:46:09 [SETI@home] Computation for task 14au08af.1803.6207.6.8.251_0 finished
10-Sep-2008 14:46:09 [SETI@home] Output file 14au08af.1803.6207.6.8.251_0_0 for task 14au08af.1803.6207.6.8.251_0 absent

This is a quad core, and is attached to a variety of projects: however as of today, every project is set to NNT except SETI and LHC. LHC had no work at the time, so effectively the host had become a SETI-only cruncher.

Further, it had (and still has) three Astropulse tasks running - you can see where the third AP task started, at 11:28:04. They are 40-hour plus tasks, so that means that only one core remained available for SETI MB work, and you can see how the tasks start one at a time - at 12:02:55, 12:57:31, 13:52:08 etc. I run with a conservative 1 day cache, so there is no question of tasks being pre-empted for EDF.

Then, at 14:46:06 (highlighted), BOINC tried to start two tasks at once. They both crashed with the "shmget() failed" error, and BOINC then proceeded to trash the remaining 74 tasks in the cache, one per second.

Fortunately, it didn't trash the running Astropulse tasks, and it did go into a 24-hour backoff on scheduler contact with SETI (no reason apparent in the logs - the only scheduler contacts are:

10-Sep-2008 09:36:57 [SETI@home] Sending scheduler request: To fetch work.  Requesting 101 seconds of work, reporting 2 completed tasks
10-Sep-2008 09:37:02 [SETI@home] Scheduler request succeeded: got 1 new tasks

and

10-Sep-2008 17:41:07 [SETI@home] Fetching scheduler list
10-Sep-2008 17:41:12 [SETI@home] Master file download succeeded
10-Sep-2008 17:41:17 [SETI@home] Sending scheduler request: Requested by user.  Requesting 0 seconds of work, reporting 86 completed tasks
10-Sep-2008 17:41:22 [SETI@home] Scheduler request succeeded: got 0 new tasks
10-Sep-2008 17:47:41 [---] Exit requested by user

when I got home).

So the only oddity I can see is that double task start at 14:46:06, which would have meant five tasks running on a four-core CPU. Host ID 4292666 at SETI, now upgraded to BOINC v6.2.18 (service install, as before).
ID: 20075
Thyme Lawn

Joined: 2 Sep 05
Posts: 103
United Kingdom
Message 20077 - Posted: 10 Sep 2008, 19:52:25 UTC - in response to Message 20075.  

Is anyone still performing post-mortems on v6.2.14, to get to the bottom of what caused all those Can't get shared memory segment name: shmget() failed messages?

It's worth checking if there are any extra messages in stderrdae.txt. If there was a problem setting up the shared memory security descriptors the error messages will have been written directly to the stderr file stream (you won't see them in stdoutdae.txt or the BOINC Manager message tab).
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer
ID: 20077
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5082
United Kingdom
Message 20078 - Posted: 10 Sep 2008, 21:30:44 UTC - in response to Message 20077.  

Is anyone still performing post-mortems on v6.2.14, to get to the bottom of what caused all those Can't get shared memory segment name: shmget() failed messages?

It's worth checking if there are any extra messages in stderrdae.txt. If there was a problem setting up the shared memory security descriptors the error messages will have been written directly to the stderr file stream (you won't see them in stdoutdae.txt or the BOINC Manager message tab).

Worth a look - but it seems stderrdae.txt hasn't been written to since 7 May 2008, and only contains (multiple iterations of):

UNRECOGNIZED: suspend_if_no_recent_input
UNRECOGNIZED: max_ncpus_pct
ID: 20078
Jord
Volunteer tester
Help desk expert

Joined: 29 Aug 05
Posts: 15482
Netherlands
Message 20141 - Posted: 11 Sep 2008, 21:40:32 UTC - in response to Message 20075.  

Is anyone still performing post-mortems on v6.2.14, to get to the bottom of what caused all those Can't get shared memory segment name: shmget() failed messages?

Not really, as it was fixed in 6.2.18. See its change log, which says:

I was able to verify the BOINCTray.exe issue and the shared-mem and handle leaks. I’m not sure how any of us could test the client crash scenario, I ran through the basic battery of tests against BOINC Alpha. I guess we’ll just have to let the people who discovered it, let us know if the problem is fixed.

- client: don't leak handles to shared-mem files

- client: don't leak process handles when abort jobs

- client: if an app exits or we kill it, always destroy the shmem segment.

ID: 20141
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5082
United Kingdom
Message 20155 - Posted: 11 Sep 2008, 23:40:59 UTC - in response to Message 20141.  

Is anyone still performing post-mortems on v6.2.14, to get to the bottom of what caused all those Can't get shared memory segment name: shmget() failed messages?

Not really, as it was fixed in 6.2.18. See its change log, which says:

I was able to verify the BOINCTray.exe issue and the shared-mem and handle leaks. I’m not sure how any of us could test the client crash scenario, I ran through the basic battery of tests against BOINC Alpha. I guess we’ll just have to let the people who discovered it, let us know if the problem is fixed.

- client: don't leak handles to shared-mem files

- client: don't leak process handles when abort jobs

- client: if an app exits or we kill it, always destroy the shmem segment.

None of those actually addresses what happens when we try to start more concurrent tasks than we have cores.
ID: 20155
Jord
Volunteer tester
Help desk expert

Joined: 29 Aug 05
Posts: 15482
Netherlands
Message 20162 - Posted: 12 Sep 2008, 1:41:54 UTC - in response to Message 20155.  

None of those actually addresses what happens when we try to start more concurrent tasks than we have cores.

Um... why would you want to do that anyway? Or am I missing something?
ID: 20162
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5082
United Kingdom
Message 20172 - Posted: 12 Sep 2008, 8:14:25 UTC - in response to Message 20162.  
Last modified: 12 Sep 2008, 8:14:38 UTC

None of those actually addresses what happens when we try to start more concurrent tasks than we have cores.

Um... why would you want to do that anyway? or am I missing something?

Well, I don't want to - but it seems my CC v6.2.14 did (at 10-Sep-2008 14:46:06, see the log in my earlier post), and that's what provoked the first attack of shmget() failures.
ID: 20172


Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.