(temporarily) Solving the LHC/BOINC crashing problem.

Message boards : BOINC client : (temporarily) Solving the LHC/BOINC crashing problem.
Profile Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 16245 - Posted: 31 Mar 2008, 23:17:37 UTC
Last modified: 1 Apr 2008, 17:05:56 UTC

At this moment LHC's scheduler is running again, so the below is no longer necessary.


I will start with a warning. By following the steps below you will rid yourself of any LHC work that you still have uploading, ready to start, downloading or ready to report. It will be lost.
------------------------------------------------

If you do not want to lose any work from LHC, force BOINC to use the Network Activity Suspended option. This will let BOINC run, but no projects will upload/download/report.

To do so, exit BOINC, navigate to your BOINC directory and open client_state.xml in a text editor. Scroll to the bottom of the file, find <user_network_request></user_network_request> and change the value between the tags to 3. Save client_state.xml and restart BOINC.
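
For anyone who prefers to script that edit, here is a minimal Python sketch of the same change. It works on the file contents as plain text so nothing else in client_state.xml gets rewritten. The function name suspend_network and its mode parameter are made up for illustration; they are not part of BOINC.

```python
import re

def suspend_network(state_text, mode=3):
    """Return state_text with the <user_network_request> value set to mode.

    Hypothetical helper: only the text between the tags is replaced,
    everything else in the file is left exactly as it was.
    """
    pattern = r"<user_network_request>.*?</user_network_request>"
    replacement = f"<user_network_request>{mode}</user_network_request>"
    new_text, count = re.subn(pattern, replacement, state_text, flags=re.DOTALL)
    if count == 0:
        # Do not guess where the tag should go if it is missing.
        raise ValueError("no <user_network_request> tag found")
    return new_text
```

To use it, read client_state.xml into a string, pass it through the function and write the result back. As with the manual edit, do this only while BOINC is fully shut down.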

------------------------------------------------
So only use the procedure below if you truly want to and don't mind losing the work. Otherwise wait for LHC to come back online, while keeping your network activity suspended.
------------------------------------------------

Exit BOINC.
Navigate to your BOINC directory.
Open client_state.xml with a text editor.
Use the Find option (usually under the F3 key)
Type in lhc and click Find.

Move your cursor to the left side of the screen and make sure it sits just before the <project> tag:
<project>
<master_url>http://lhcathome.cern.ch/lhcathome/</master_url>

Now hold down Shift.
Scroll down until you see the next <project> tag.
Still holding Shift, click just after the last </result> tag you see before that next <project> tag.

You have now selected all of LHC.
Hit Delete.
Everything you selected will now be deleted.
Save client_state.xml through the File->Save menu. (!! Don't use the Save As... option. !!)

Still in your BOINC directory, rename client_state_prev.xml to client_state_prev.xml.backup
(If you want to be rid of LHC for the time being, delete or rename account_lhcathome.cern.ch_lhcathome.xml.)

Restart BOINC.
Set LHC to No New tasks and/or Suspend.

Why the backup of client_state_prev.xml? It still has the LHC entries in it. If LHC comes back, you may be able to edit client_state.xml again and add the information from the backup file. I don't think it'll work, but it never hurts to try. At that time, just copy the material from the backup file back into client_state.xml and delete the new client_state_prev.xml file.

Do NOT change anything else in client_state.xml !!
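
The select-and-delete steps above can be sketched in Python for those who would rather not do it by hand. This is only a sketch under the assumptions in this post: each project's entries start at its <project> tag and run up to the last </result> before the next <project> tag. The function name remove_project_section and the url_fragment parameter are made up for illustration, and the sketch does not touch anything outside that span (for example any <active_task> entries).

```python
import re

def remove_project_section(state_text, url_fragment="lhcathome.cern.ch"):
    """Return state_text without the section of the matching project.

    Hypothetical helper: cuts from the project's <project> tag to just
    after the last </result> before the next <project> tag.
    """
    # Positions of every <project> opening tag (does not match <project_name>).
    starts = [m.start() for m in re.finditer(r"<project>", state_text)]
    for i, start in enumerate(starts):
        limit = starts[i + 1] if i + 1 < len(starts) else len(state_text)
        section = state_text[start:limit]
        master = re.search(r"<master_url>(.*?)</master_url>", section)
        if master is None or url_fragment not in master.group(1):
            continue
        # Prefer the last </result>; fall back to </project> if no results exist.
        close = "</result>"
        cut = section.rfind(close)
        if cut == -1:
            close = "</project>"
            cut = section.rfind(close)
        if cut == -1:
            return state_text  # malformed section, leave the file alone
        end = start + cut + len(close)
        return state_text[:start] + state_text[end:]
    return state_text  # project not found, nothing to remove
```

Run it on a copy of client_state.xml first and compare the result with the original before putting it in place, with BOINC shut down the whole time.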

And if the above doesn't help you, or you still have questions, or you want me to do it for you, just ask for help. Please!
ID: 16245
Professor Ray

Joined: 31 Mar 08
Posts: 59
United States
Message 16251 - Posted: 31 Mar 2008, 23:36:33 UTC

The above solution will hack out the LHC WUs that are waiting for transfer.

To disable network connectivity while WUs are pending transfer, find the string below at the bottom of client_state.xml (open it with Notepad):

<user_network_request>1</user_network_request>

Replace the numeral one (or whatever value is there) in the above string with a numeral three. Save the file. Then restart BOINC.

Do this, and the aforementioned procedure by Ageless, only after BOINC has been terminated. Ensure that no BOINC client is running either (for me, Rosetta would load and begin execution, and then BOINC manager would crash trying to upload an LHC WU).

FWIW, use Ageless's suggested "hack" if and ONLY if you have WUs for other projects that MUST be uploaded for credit (or if you have no other WUs to crunch).

Thanks Ageless. Hope these band-aids help somebody.

ID: 16251
John McLeod VII
Joined: 29 Aug 05
Posts: 147
Message 16254 - Posted: 31 Mar 2008, 23:50:33 UTC - in response to Message 16251.  


To avoid getting work from LHC you could suspend the project. You still have to elide all of the information from LHC for a bit, but having the account file present will auto attach you to the project.

BOINC WIKI
ID: 16254
Professor Ray

Joined: 31 Mar 08
Posts: 59
United States
Message 16257 - Posted: 1 Apr 2008, 0:13:55 UTC

Right, what you said. However, the aforementioned "band-aids" address the specific situation of BOINC manager crashing at this time for users who have completed LHC work units stuck in the transfer queue due to a crashed LHC server.

To fix the BOINC manager crash, network connectivity must be disabled, thereby preventing BOINC access to the LHC server. To disable network connectivity one has to hack the client_state.xml file as indicated. Doing so in the GUI is futile, in that the BOINC manager crashes on start (as soon as the upload of completed LHC WUs stuck in the transfer queue is initiated).

In my case, a download was pending for another BOINC client, a different BOINC client began execution, LHC attempted to upload a completed result, and then BOINC manager crashed. The temp fix for this problem is to disable network communication. This, however, shuts the door to communication for ALL projects.

To resolve THAT issue, hacking out the completed LHC WUs helps users who have completed WUs for BOINC clients OTHER than LHC that are approaching deadline and need to upload (or whose LHC WUs awaiting upload are already past deadline), OR who have no other WUs to crunch in the meantime. If one is a dedicated LHC client, then they're sort of stuck for the time being.
ID: 16257
Profile Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 15477
Netherlands
Message 16258 - Posted: 1 Apr 2008, 0:16:59 UTC - in response to Message 16257.  

Doing so in the GUI is futile, in that the BOINC manager crashes on start

Actually, BOINC Manager works fine. It's the core client (boinc.exe) that crashes. But that's peanuts. ;-)
ID: 16258
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 16281 - Posted: 1 Apr 2008, 9:23:06 UTC

I removed the sections in client_state.xml that Ageless suggests (<project>...</project> and <active_task>...</active_task>), so there were no references to LHC in the file. BOINC started just fine: I have other projects on the box, so I wanted to keep networking enabled.

However, I hadn't read JM7's comment about the 'account_' file, so BOINC tried to reconnect to LHC with a 'project initialisation' request. I suspended the LHC project, which had reappeared in the projects list, but BOINC kept sending the initialisation requests (BUG? v5.10.13, as usual) and eventually crashed.

I closed it down, removed the (largely empty) <project>...</project> which had reappeared in client_state, and parked the account_ file in a handy folder out of the way. This time, when I restarted BOINC, all seemed to work normally.
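
Parking the account_ file can be scripted as well. The sketch below moves it into a separate folder so BOINC no longer finds it at startup. The function name park_account_file and the folder arguments are made up for illustration; only the file name comes from this thread.

```python
import shutil
from pathlib import Path

def park_account_file(boinc_dir, parked_dir,
                      name="account_lhcathome.cern.ch_lhcathome.xml"):
    """Move the project's account file out of the BOINC directory.

    Hypothetical helper: returns the new path, or None if the file
    was not there to begin with.
    """
    source = Path(boinc_dir) / name
    if not source.exists():
        return None
    target_dir = Path(parked_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    # Move rather than delete, so the file can be put back later.
    shutil.move(str(source), str(target))
    return target
```

Moving the file back into the BOINC directory later should let BOINC re-attach to the project once it is reachable again.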

I found two files in the BOINC folder, "master_lhcathome.cern.ch_lhcathome.xml" and "sched_reply_lhcathome.cern.ch_lhcathome.xml", both datestamped at the time BOINC was trying to re-initialise the project. Both of them appear to be copies of a recent front page of the LHC website. I would expect that for 'master_', but 'sched_reply_'??????
ID: 16281
Graeme Hewson

Joined: 6 Jun 06
Posts: 12
United Kingdom
Message 16313 - Posted: 1 Apr 2008, 18:54:14 UTC - in response to Message 16295.  

I've put lhcathome.cern.ch into my /etc/hosts with a dummy address. After re-enabling networking, my other projects are uploading and downloading fine.

The entry looks like this:

172.20.1.1 lhcathome.cern.ch


That address should be OK for most people, but if you feel you need to change it for any reason, please be careful and use RFC3330 as a guide.
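
As a rough aid for picking a different dummy address, the Python sketch below checks that a candidate falls inside one of the private-use blocks listed in RFC 3330 (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) before formatting the hosts line. The function names are made up for illustration, and restricting the check to these three blocks is an assumption of the sketch, not a rule from the RFC.

```python
import ipaddress

# Private-use IPv4 blocks listed in RFC 3330.
PRIVATE_BLOCKS = [
    ipaddress.ip_network(block)
    for block in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]

def is_private_dummy(address):
    """Return True if address lies inside one of the private-use blocks."""
    ip = ipaddress.ip_address(address)
    return any(ip in block for block in PRIVATE_BLOCKS)

def hosts_entry(address, hostname):
    """Format a hosts-file line, refusing addresses outside the private blocks."""
    if not is_private_dummy(address):
        raise ValueError(f"{address} is not in a private-use block")
    return f"{address} {hostname}"

print(hosts_entry("172.20.1.1", "lhcathome.cern.ch"))
```

The point of the check is simply to avoid redirecting the hostname to an address that belongs to somebody else on the public Internet.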

Unfortunately, this means I can't go to the project home page from my machine to check the status. Perhaps it should be a principle that hostnames for WU transfers should be different from those for home pages (even if they resolve to the same IP address).
ID: 16313
The Gas Giant

Joined: 30 Aug 05
Posts: 65
Message 16317 - Posted: 1 Apr 2008, 19:36:03 UTC - in response to Message 16313.  


This worked a treat for me. LHC work tried to upload and immediately failed with a non-response, and a day's worth of MalariaControl.Net was able to upload fine. Thanks for the suggestion, it sure got me out of a bind!

Live long and BOINC!

Paul.
ID: 16317
Graeme Hewson

Joined: 6 Jun 06
Posts: 12
United Kingdom
Message 16464 - Posted: 5 Apr 2008, 7:53:10 UTC - in response to Message 16317.  

I've just come back to this, because my LHC WUs from Monday were still not being uploaded. I wasn't overly concerned, because the LHC@HOME Web site is still down for maintenance.

However, tracing with Wireshark I found that even though I removed the dummy entry from /etc/hosts on Tuesday, the BOINC client was still trying to connect to the dummy host. I run nscd, but the same happened when I stopped it. When I restarted the client, my WUs uploaded fine.

It seems the client caches host addresses (indefinitely?). This is a serious problem. I vaguely recall being unable to upload WUs for some project a year or two ago, I think after a host address change, until I restarted the client.
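
One way to separate the two possible culprits is to ask the operating system's resolver directly what the hostname resolves to now. The Python sketch below does that. If the system no longer returns the dummy address while the client still connects to it, the stale address is being held inside the client process. The function name resolves_to and its resolver parameter are made up for illustration; the parameter exists only so the check can be exercised without a network.

```python
import socket

def resolves_to(hostname, expected_ip, resolver=socket.gethostbyname):
    """Return True if the resolver currently maps hostname to expected_ip.

    Hypothetical helper: by default it asks the system resolver, which
    consults /etc/hosts and DNS according to the system configuration.
    """
    try:
        current = resolver(hostname)
    except OSError:
        # Lookup failed entirely (socket.gaierror is a subclass of OSError).
        return False
    return current == expected_ip
```

For example, resolves_to("lhcathome.cern.ch", "172.20.1.1") returning False after the /etc/hosts entry was removed would suggest the system side is clean.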
ID: 16464
Graeme Hewson

Joined: 6 Jun 06
Posts: 12
United Kingdom
Message 16476 - Posted: 5 Apr 2008, 15:53:59 UTC - in response to Message 16464.  

I'm running the current Ubuntu package, 5.10.8. Is this a known problem? I don't see anything like it at http://boinc.berkeley.edu/trac/query.
ID: 16476
John McLeod VII
Joined: 29 Aug 05
Posts: 147
Message 16478 - Posted: 5 Apr 2008, 18:37:35 UTC - in response to Message 16464.  


The problem is that the library used caches the IP addresses in one mode and does even worse things in other modes. I know that this was being worked on, but I don't know the current status.

BOINC WIKI
ID: 16478


Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.