BOINC client crashes immediately at startup (macOS 10.12.6)

Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 4470
United Kingdom
Message 81373 - Posted: 18 Sep 2017, 14:30:44 UTC - in response to Message 81372.  
Last modified: 18 Sep 2017, 15:06:49 UTC

Got it, thanks, and opened OK. I'll start looking, but I may be some time...

Well, the first obvious clue is from stdoutdae.txt: the client has been restarting continuously. It seems normal, until this happens:

18-Sep-2017 15:16:39 [---]    (to change preferences, visit a project web site or select Preferences in the Manager)
18-Sep-2017 15:16:39 [---] Using account manager BOINCstatsBAM!
18-Sep-2017 15:16:39 Initialization completed
18-Sep-2017 15:16:39 [---] Running CPU benchmarks
18-Sep-2017 15:16:39 [---] Starting BOINC client version 7.8.2 for x86_64-apple-darwin
That sequence repeats about 120 times in the snippet you sent me. So: initialization completes, the benchmarks start, and the client crashes.

The next line, on my v7.8.2 for Windows (and many previous versions), is

09-Sep-2017 21:35:39 [---] Suspending computation - CPU benchmarks in progress
so the crash is very quick. BTW, BAM! is not implicated - about half the crashes happened before BAM! was attached.
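
For anyone who wants to check their own stdoutdae.txt for the same pattern, here is a minimal Python sketch. It counts how often a "Running CPU benchmarks" line is followed directly by a fresh "Starting BOINC client version" line, which is the signature described above. The function name is illustrative, the log is assumed to be in chronological order, and the first "healthy" sample line is made up to pair with the Windows line quoted above.

```python
def count_benchmark_crashes(log_lines):
    """Count restarts that directly follow a 'Running CPU benchmarks' line."""
    crashes = 0
    previous = ""
    for line in log_lines:
        if not line.strip():
            continue  # ignore blank lines
        if ("Starting BOINC client version" in line
                and "Running CPU benchmarks" in previous):
            crashes += 1
        previous = line
    return crashes


# Crash pattern, using lines from the excerpt in this post.
crashing = [
    "18-Sep-2017 15:16:39 Initialization completed",
    "18-Sep-2017 15:16:39 [---] Running CPU benchmarks",
    "18-Sep-2017 15:16:39 [---] Starting BOINC client version 7.8.2 for x86_64-apple-darwin",
]
# Normal pattern: the benchmark line is followed by the suspend message.
healthy = [
    "09-Sep-2017 21:35:39 [---] Running CPU benchmarks",  # illustrative line
    "09-Sep-2017 21:35:39 [---] Suspending computation - CPU benchmarks in progress",
]

print(count_benchmark_crashes(crashing), count_benchmark_crashes(healthy))
```

On a real log, the function can be fed an open file directly, for example count_benchmark_crashes(open("stdoutdae.txt", errors="replace")).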
ID: 81373
pbro
Joined: 8 Mar 12
Posts: 7
United Kingdom
Message 81375 - Posted: 18 Sep 2017, 15:09:15 UTC - in response to Message 81373.  

Richard Haselgrove wrote: "about 120 times in the snippet you sent me."

Yes, I may have pressed "Yes, try again" a couple of times :)

On my machine, successfully running CPU benchmarks takes around 30 seconds. It does not get much done before crashing.
ID: 81375
Juha
Volunteer developer
Volunteer tester
Help desk expert
Joined: 20 Nov 12
Posts: 801
Finland
Message 81376 - Posted: 18 Sep 2017, 15:19:52 UTC
Last modified: 18 Sep 2017, 15:30:33 UTC

In "*** buffer overflow detected ***: boinc_client terminated", the problem was a CPDN task that failed, and the client's error reporting wasn't prepared to deal with 100+ missing output files.

pbro's message 81364 shows that even 50 files are too many.

The problem is that the error message is so long that it overflows the buffer allocated for it. For several years now Debian/Ubuntu, and probably Fedora as well, have used compiler settings that detect such buffer overflows. I think a similar compiler setting is used on Windows too. So we should have seen people having this problem much earlier.

So, question for those with more in-depth knowledge of CPDN. Are these tasks with tens of output files something new?

The fix should be in "client: eliminate possible buffer overflow in reporting result errors" and "client: use snprintf() instead of sprintf() in a few places", which didn't make it into 7.8.2.


edit:

With a message like this per file:
<file_xfer_error>
  <file_name>wah2_wus25_ti5c_200309_25_583_011070828_1_r1095091230_5.zip</file_name>
  <error_code>-161 (not found)</error_code>
</file_xfer_error>


The safe limit on the number of output files per task is about 25. The "r1095091230" part is a relatively new addition to the server software, but it lowers the limit by only about one file.
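
As a rough cross-check of that estimate, the Python sketch below measures one such entry and divides an assumed buffer size by it. The 4096-byte size is an assumption chosen for illustration, not a value taken from the BOINC source, and the file name without the "r1095091230" part is only a guess at the older naming scheme.

```python
# One error entry per missing output file, laid out as in the sample above.
ENTRY_TEMPLATE = (
    "<file_xfer_error>\n"
    "  <file_name>{name}</file_name>\n"
    "  <error_code>-161 (not found)</error_code>\n"
    "</file_xfer_error>\n"
)

# Assumption: a 4 KB message buffer. Not taken from the BOINC source.
ASSUMED_BUFFER_SIZE = 4096

NEW_NAME = "wah2_wus25_ti5c_200309_25_583_011070828_1_r1095091230_5.zip"
# Hypothetical older-style name without the "r..." component.
OLD_NAME = "wah2_wus25_ti5c_200309_25_583_011070828_1_5.zip"


def entries_that_fit(file_name, buffer_size):
    """How many whole error entries for file_name fit in buffer_size bytes."""
    entry_len = len(ENTRY_TEMPLATE.format(name=file_name))
    return buffer_size // entry_len


for label, name in (("with r-part", NEW_NAME), ("without r-part", OLD_NAME)):
    print(label, entries_that_fit(name, ASSUMED_BUFFER_SIZE))
```

Under these assumptions the count lands in the mid-twenties, in line with the estimate above, and a task with 50 or more missing files is well past it.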
ID: 81376
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 4470
United Kingdom
Message 81377 - Posted: 18 Sep 2017, 15:36:13 UTC - in response to Message 81376.  
Last modified: 18 Sep 2017, 16:00:43 UTC

We might have something there. I'm looking through pbro's client_state.xml, and a second user - MR, you know who you are - has also sent me a set of files including client_state.xml. Both users have a failed CPDN WAH2 PNW task in their client_state, showing 50 files (51 actually - zips 1 to 49, restart, out) plus a crash dump - about 14 KB in total for the <result> section.

MR also mentions suffering the "attached to Einstein@home twice" problem due to the changed master url - that's another complication we could do without.

Edit - I've posted in Jim1348's thread at CPDN, asking if users have failures/successes with the 51-file WAH2 PNW workunits. Most Linux CPDN failure reports tend to come from bad 32-bit library installation - I haven't found any discussion specific to these tasks yet.
ID: 81377
pbro
Joined: 8 Mar 12
Posts: 7
United Kingdom
Message 81379 - Posted: 18 Sep 2017, 16:28:37 UTC - in response to Message 81377.  

Richard, a shot into the blue, but the site this work was done at was having internet connection issues this weekend. Might this be related, i.e. only a temporarily failed upload triggers the client suicide?

Secondly, is there anything but a clean reset that I can do to get my installation running again?
ID: 81379
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 4470
United Kingdom
Message 81380 - Posted: 18 Sep 2017, 16:31:41 UTC

One other small anomaly in pbro's bundle of files is a zero-length file called 'all_projects_list_temp.xml' (there's a full-size 'all_projects_list.xml' as well). We had problems with a zero-length RPC (different file) recently - could this indicate another problem causing or caused by the client crashing at startup? File is dated 18/09/2017 10:21, which is in the middle of the sequence of crashes with v7.8.2 - in fact the file coincided with "Version change (7.6.34 -> 7.8.2)", so it's part of the testing just before I got the files.
ID: 81380
pbro
Joined: 8 Mar 12
Posts: 7
United Kingdom
Message 81383 - Posted: 18 Sep 2017, 16:50:15 UTC - in response to Message 81380.  

I did try a BOINC upgrade to see whether it would resolve the issue before contacting the forum. That explains the version change.
ID: 81383
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 4470
United Kingdom
Message 81384 - Posted: 18 Sep 2017, 16:50:42 UTC - in response to Message 81379.  

pbro wrote: "Richard, a shot into the blue, but the site this work was done at was having internet connection issues this weekend. Might this be related, i.e. only a temporarily failed upload triggers the client suicide?"
I think it's unlikely - BOINC is designed to cope with that sort of thing. My ISP resets my line once a week, and that doesn't cause any problems while it's down. But thanks for adding it to the report - we're probably going to need every scrap of information we can get.

pbro wrote: "Secondly, is there anything but a clean reset that I can do to get my installation running again?"
The reset will be the quickest and easiest, but if you feel up to performing a little experiment first, it would be very helpful.

Could you open the file 'client_state.xml', please, with a plain text editor - not a fancy XML editor.

Then search for the line

<name>wah2_pnw25_c4ci_190312_49_658_011241397_0</name>
It will be in a block of XML that looks like

    ...
    </file_ref>
    <file_ref>
        <file_name>ozone_hist_N96_1899_1910v2.gz</file_name>
        <open_name>ozone_hist_N96_1899_1910v2.gz</open_name>
    </file_ref>
</workunit>
<result>
    <name>wah2_pnw25_c4ci_190312_49_658_011241397_0</name>
    <final_cpu_time>15880.410000</final_cpu_time>
    <final_elapsed_time>16233.776360</final_elapsed_time>
    <exit_status>0</exit_status>
    <state>3</state>
    ...
If you could go back to the beginning of the <workunit> section, and down to the end of the </result> section, and remove absolutely everything - the whole

<workunit>
...
</workunit>
<result>
...
</result>
segment, including those lines.

Then save the file, and try BOINC one more time. If it starts normally, we have our smoking gun. Remember to stop work fetch from CPDN as soon as you get control again!

If that doesn't work, just reset things - I doubt I'm going to come up with any more ideas tonight.
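
The manual edit described above can also be sketched as a small Python function. It works on plain lines rather than an XML parser, mirroring the plain-text-editor advice, and it assumes the <workunit> block sits directly before the matching <result> block, as in the excerpt. The function name and the placeholder entries in the sample are illustrative; stop BOINC and keep a copy of client_state.xml before trying anything like this on a real file.

```python
def remove_task_block(lines, result_name):
    """Return lines without the <workunit>...</result> segment for result_name."""
    stripped = [line.strip() for line in lines]
    marker = f"<name>{result_name}</name>"

    def last_before(tag, end):
        # Index of the last line equal to tag that comes before position end.
        for i in range(end - 1, -1, -1):
            if stripped[i] == tag:
                return i
        raise ValueError(f"{tag} not found before line {end}")

    name_at = stripped.index(marker)  # raises ValueError if the name is absent
    result_start = last_before("<result>", name_at)
    if "</result>" in stripped[result_start:name_at]:
        raise ValueError("name is not inside a <result> block")
    # Assumption: the workunit block ends right where the result block begins.
    if result_start == 0 or stripped[result_start - 1] != "</workunit>":
        raise ValueError("layout differs from the excerpt; edit by hand")
    workunit_start = last_before("<workunit>", result_start)
    result_end = stripped.index("</result>", name_at)
    return lines[:workunit_start] + lines[result_end + 1:]


# Tiny stand-in for client_state.xml; the first pair and the last line
# are placeholders, not real BOINC entries.
sample = [
    "<workunit>",
    "    <name>placeholder_wu</name>",
    "</workunit>",
    "<result>",
    "    <name>placeholder_wu_0</name>",
    "</result>",
    "<workunit>",
    "    <name>placeholder_failed_wu</name>",
    "</workunit>",
    "<result>",
    "    <name>wah2_pnw25_c4ci_190312_49_658_011241397_0</name>",
    "    <exit_status>0</exit_status>",
    "</result>",
    "<placeholder_trailing_section/>",
]
cleaned = remove_task_block(sample, "wah2_pnw25_c4ci_190312_49_658_011241397_0")
print("\n".join(cleaned))
```

The function refuses to guess when the layout differs from the excerpt, which is the point at which editing by hand (or a reset) is the safer route.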
ID: 81384
Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 14573
Netherlands
Message 81385 - Posted: 18 Sep 2017, 16:52:26 UTC

I've asked CPDN to come take a look in this thread and advise, or at least tell if they think it's probable or coincidence.
ID: 81385
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 4470
United Kingdom
Message 81386 - Posted: 18 Sep 2017, 16:55:00 UTC - in response to Message 81385.  
Last modified: 18 Sep 2017, 17:18:05 UTC

I've re-activated my CPDN account, and with any luck I should get a WAH2 PNW in about 10 minutes (end of 1 hour backoff). If I disappear off the face of the earth, you'll know where I've gone...

Otherwise, I might be able to see how it works under v7.8.2 for Windows.

Edit - well, that worked better than I expected:

18/09/2017 18:05:52 | climateprediction.net | [sched_op] CPU work request: 10726.27 seconds; 4.00 devices
18/09/2017 18:05:54 | climateprediction.net | Scheduler request completed: got 4 new tasks
18/09/2017 18:05:54 | climateprediction.net | [sched_op] estimated total CPU task duration: 2309063 seconds
One and three spares. Why does the client request work for 4 devices, when all cores are busy with three CPU tasks and one OpenCL support job?

Edit2 - actually, handy. One CAM model with 12 upload files, two AFR with 14 files, and one PNW with the 51 files. I'll start with an easy one.
ID: 81386
pbro
Joined: 8 Mar 12
Posts: 7
United Kingdom
Message 81387 - Posted: 18 Sep 2017, 17:21:07 UTC - in response to Message 81384.  

That indeed fixed it. Back to happily crunching numbers (not CPDN though).

Thanks for your help!
ID: 81387
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 4470
United Kingdom
Message 81390 - Posted: 18 Sep 2017, 17:49:41 UTC - in response to Message 81387.  

And thank you for yours. We know what we need now, and the fix is already available - we just need a new build so we can test it properly.
ID: 81390
Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 14573
Netherlands
Message 81391 - Posted: 18 Sep 2017, 17:50:50 UTC

Answer from the CPDN moderators list:
gdp wrote:
I've seen this in Linux, and I'm not running 7.8.2. I've been running 7.4.22 forever. It's been happening in the last month or two when a task crashes and something gets corrupted. I don't know whether some OS update changed something that boinc utilizes, or what (I'm running on Ubuntu 16.04 or higher). The only way I've been able to get the installation to work again without removing and reinstalling boinc is to remove all traces of the crashed task from client_state.xml. I can then start boinc back up and it will continue with whatever other tasks were running. Unfortunately that doesn't resolve why the problem is happening and why it's only been happening for the last couple months. I couldn't see a corruption in client_state.xml when I looked at it, but I am not an expert in what that file looks like at all times.

I don't think any cpdn application updates have occurred during that time so it is not tied to that.

ID: 81391
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 4470
United Kingdom
Message 81392 - Posted: 18 Sep 2017, 18:04:34 UTC - in response to Message 81391.  
Last modified: 18 Sep 2017, 18:22:11 UTC

That's more-or-less what we expected. Ask him to check batch 658 - that's the PNW that I've got. Data, not application.

Edit - note on the CPDN front page that batch 658 was submitted on 15 September. That matches - and my first one crashed, though that was a 651.
ID: 81392
geophi
Joined: 18 Sep 17
Posts: 2
United States
Message 81397 - Posted: 18 Sep 2017, 18:58:45 UTC - in response to Message 81392.  
Last modified: 18 Sep 2017, 19:06:26 UTC

FYI, I am a moderator at cpdn and run mostly linux.

Certain wah2 science app batches of cpdn tasks crash on otherwise stable Mac and Linux PCs. These batches will mostly crash after 1 model month, on Jan 1 of the next model year as the regional worker takes over after the global worker finishes that day. These batches run okay on Windows PCs. There is some problem that happens on these batches at that point. The common theme for these batches is "naturalized" parameter sets. This dates back to April.

However, in the last couple months, sometimes when this type of crash occurs, the boinc client can no longer communicate, and restarting boinc results in errors similar to the ones posted in this thread. The only way to recover from this for me has been to edit client_state.xml and remove all entries related to the crashed task.

I'm thinking some OS update occurred in the last couple months that changed some files in how boinc works with the science app, or writes something, or who knows... I'm not a programmer or system person, just an IT enthusiast.

The cpdn programmers know of the crash problem with the naturalized parameter sets, but have not been able to isolate the cause as to why it only occurs on Linux and Mac. The input files should be the same for the Windows app.

Richard, 658 tasks will crash on Mac and Linux after one month, no matter which boinc client they use. Whether the boinc "corruption" problem occurs may depend on the OS distribution, version, and what updates have been run on it.

Edit...I see you got AFR and CAM tasks as well. The CAM tasks should work fine. Not sure about the AFR ones.

George
ID: 81397
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 4470
United Kingdom
Message 81399 - Posted: 18 Sep 2017, 19:12:52 UTC - in response to Message 81397.  

Thanks George.

There seem to be two problems there:

1) Why do Linux and Mac CPDN tasks fail after one month? Dunno, but you might talk to the CPDN programmers about vsyscall - mentioned in this thread, I think, else search recent threads. That function is certainly being removed from Linux: if it's being called during the month-end file shuffle, that might cause the failure on recently updated Linux kernels.

2) Why does BOINC crash when the CPDN app crashes? We seem to have established via this thread (and the final test was exactly a repeat of your own procedure) that BOINC can't run when it has a huge stderr_txt and a huge pile of failed uploads for an unreported result. That seems to be the result of old and neglected code, perhaps dating back several versions. There is a fix in the pipeline, but no date for a test build yet: it was omitted from v7.8.2, despite being available then.
ID: 81399
Juha
Volunteer developer
Volunteer tester
Help desk expert
Joined: 20 Nov 12
Posts: 801
Finland
Message 81402 - Posted: 18 Sep 2017, 19:22:51 UTC

An alternative to editing client_state.xml would be deleting account_climateprediction.net.xml from BOINC's data directory. This should be approximately equivalent to using BOINC Manager to remove CPDN. All CPDN tasks would be lost, but it doesn't carry the risk of losing tasks for other projects.
ID: 81402
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 4470
United Kingdom
Message 81404 - Posted: 18 Sep 2017, 19:26:41 UTC - in response to Message 81402.  

Good idea.

My other correspondent has sent me some log files. That client is also failing immediately after starting the 'run benchmark' process, but with one extra line:

17-Sep-2017 14:17:10 [---] Running CPU benchmarks
17-Sep-2017 14:17:10 [---] Received signal 15
17-Sep-2017 14:17:09 [---] cc_config.xml not found - using defaults
17-Sep-2017 14:17:09 [---] Starting BOINC client version 7.6.34 for x86_64-apple-darwin
With extra 'Signal 15', if that helps anyone.
ID: 81404
geophi
Joined: 18 Sep 17
Posts: 2
United States
Message 81426 - Posted: 19 Sep 2017, 2:49:21 UTC - in response to Message 81399.  

Richard Haselgrove wrote: "2) Why does BOINC crash when the CPDN app crashes? We seem to have established via this thread (and the final test was exactly a repeat of your own procedure) that BOINC can't run when it has a huge stderr_txt and a huge pile of failed uploads for an unreported result. That seems to be the result of old and neglected code, perhaps dating back several versions. There is a fix in the pipeline, but no date for a test build yet: it was omitted from v7.8.2, despite being available then."


Now I have thoroughly read this thread and can see why it has only been happening recently. The wah2 batches with naturalized parameters have only recently started to regularly run for many model months. Prior to that, runs of 1, 3, 10, 12, 13 or 18 months were the norm. Now a number of batches have more than 18 months in them, and some of those have the naturalized parameters. Those die early, and thus stderr and client_state are large, with a long list of upload files that were never sent.

I thought because these boinc problems were recent, some OS change was the culprit. Obviously not. Thanks Richard and those others troubleshooting for identifying the main problem with boinc under this scenario.
ID: 81426
jfw25
Joined: 16 Sep 17
Posts: 3
United States
Message 81427 - Posted: 19 Sep 2017, 3:26:32 UTC

I finally came back to check the thread, and tried out the "edit the client_state.xml" suggestion -- success!

Thanks!
ID: 81427

Copyright © 2021 University of California. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.