Need help identifying bug(s): ubuntu or boinc or both?

Profile Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 93924 - Posted: 27 Nov 2019, 16:52:42 UTC

I am using 7.16.3 from the recommended repository on a newly reconfigured system with an assortment of cheap, used eBay NVIDIA boards. After a few hours gpu8 got stuck, obviously a hardware problem. I edited cc_config to exclude gpu8 but used <device>8</device> instead of <device_num>8</device_num>. That caused all NVIDIA boards to be excluded, which is perfectly understandable. I then corrected my mistake, issued another "read config" command, and got the message that 8 was excluded. Unfortunately, that did not un-exclude gpu0..gpu7. I assume this is a bug. Looking at the references, I read:
If you change GPU exclusions, you must restart the BOINC client for these changes to take effect

So it seems I can exclude a GPU, but if I "change the exclusion" I must restart the client. I am guessing that this could be fixed by a feature added in a future version.***
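
For anyone checking their own file: the BOINC client configuration documentation says that an exclude_gpu block without a device_num excludes every GPU of the given type, which would be consistent with what the mistyped <device> tag did here. Below is a minimal shell sketch of a sanity check. The name check_exclude_gpu is a hypothetical helper, not part of BOINC, and the path in the usage comment is an assumption based on the data directory shown later in this post.

```shell
# Hypothetical helper (not part of BOINC): report every <exclude_gpu>
# block in a cc_config.xml that contains no <device_num> element.
# Returns non-zero if at least one such block is found.
check_exclude_gpu() {
  awk '
    BEGIN              { bad = 0 }
    /<exclude_gpu>/    { inside = 1; has_num = 0; start = NR }
    inside && /<device_num>/ { has_num = 1 }
    /<\/exclude_gpu>/  {
      if (inside && !has_num) {
        printf "line %d: exclude_gpu block without device_num\n", start
        bad = 1
      }
      inside = 0
    }
    END                { exit bad }
  ' "$1"
}

# Usage (path is an assumption, adjust to your installation):
# check_exclude_gpu /var/lib/boinc/cc_config.xml
```

This is only a line-oriented text check, not an XML parser, so it will not catch every malformed file. It is meant to flag the specific typo described above before issuing "read config".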

Restarting the client caused a problem
sudo /etc/init.d/boinc-client restart
did not exit and I had to ctrl-c to get back to the bash command prompt.
I then did a "stop" and looked at htop to confirm that boinc was stopped. Things got worse from there on:


There were 9 processes, all still executing: the cpu% and shared mem were changing, not just the elapsed time. All of them were accessing the stuck gpu8. I am guessing they timed out during the night, as I see 8 errored tasks listed at the SETI web site, but the tasks continue to execute in the background. I tried killing them:

jstateson@h110btc:/usr/bin$ boinccmd --quit
can't connect to local host

root@h110btc:/var/lib/boinc/projects# sudo killall -v boinc
boinc: no process found

sudo kill -9 12374


None of those worked, not even the kill -9.
Looking at htop, I see PID 12374 getting a time slice: the cpu% changes between 77 and 99 percent frequently, as it does for all the other instances. Pretty sure this is a problem in Ubuntu. The only thing I can think of is to reboot, and I suspect the reboot will hang and I will have to power it off.
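
One possible explanation, not confirmed anywhere in this thread: on Linux a SIGKILL is only acted on once the process leaves kernel code, so a task that is blocked or spinning inside a driver call can appear to ignore kill -9. Below is a minimal shell sketch for checking the process state, assuming a Linux /proc filesystem. The name proc_state is a hypothetical helper.

```shell
# Hypothetical helper: print the one-letter scheduler state of a PID,
# taken from the "State:" line of /proc/<pid>/status.
# R = running, S = sleeping, D = uninterruptible sleep, Z = zombie.
proc_state() {
  awk '/^State:/ { print $2 }' "/proc/$1/status"
}

# Example: inspect the shell that is running this snippet.
proc_state "$$"
```

For the stuck task above the call would be proc_state 12374. A state of D would suggest the task is waiting inside the kernel (here presumably in the NVIDIA driver), in which case a reboot may indeed be the only way out.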

***There is a discussion at github "computing prefs 2.0" issue #2993


https://github.com/BOINC/boinc/issues/2993

about adding useful features for enabling or disabling a GPU. I added a comment a couple of weeks ago. It looks like the disable tool "exclude_gpu" works but re-enabling has a problem. Obviously, if a GPU has a hardware issue it should not be re-enabled, but one should be able to re-enable the GPUs that were working.

I was wondering if the Ubuntu experts here can shed some light on why the kill -9 didn't work. Also, if the task was timed out by the project, shouldn't it have been killed? If it could not be killed, one would hope it would not be assigned subsequent tasks.

I spent some time looking at where tasks are assigned to GPUs, and it is not clear to me where that is done. Some information, such as which GPUs to ignore, is passed to the co_proc handler, which only runs when boinc is started. That could explain why a restart is required. The command to "read cc_config" calls a gpu handler, but it seems that code can only disable or remove a gpu and not add it back in.
ID: 93924
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 5081
United Kingdom
Message 93926 - Posted: 27 Nov 2019, 18:59:15 UTC - in response to Message 93924.  

Restarting the client caused a problem
sudo /etc/init.d/boinc-client restart
did not exit ...
Are you sure that's the right command for that software combination?

On my Mint box (a derivative of Ubuntu), loaded from the LocutusOfBorg PPA, the incantation is

sudo systemctl stop boinc-client
with 'start' and, I expect, 'restart' variants.
ID: 93926
Profile Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 93927 - Posted: 27 Nov 2019, 19:14:12 UTC - in response to Message 93926.  
Last modified: 27 Nov 2019, 19:27:43 UTC

It normally works fine but can definitely be a problem when a GPU has hardware problems.

Ubuntu 18.04

Will have to go to AskUbuntu and see what can cause a kill -9 to be ignored. Hopefully there is NOT a simple explanation that will cause me to lose the few "reputation points" I got there. I always thought that the only stupid question is the one not asked, but asking can cost points with some critical moderators.

However, I will try that in the future.

Running /etc/init.d/boinc-client actually passes the arguments to "start-stop-daemon", whatever that is.
My guess is that it works its way back to systemctl.
ID: 93927
Profile Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 93928 - Posted: 27 Nov 2019, 22:02:52 UTC

Just ran another test of disabling and enabling a GPU through the exclude_gpu mechanism in cc_config.

It worked - or at least it behaved differently than I expected.
I booted up with gpu8 disabled,
edited cc_config to change the 8 to 18 (there is no gpu 18),
and when I issued a "read cc_config", the #8 gpu was enabled. I had expected to have to restart boinc to get it to work.

This has become more complicated. My guess is that the various failures (like the failure to enable gpu0..gpu7) were caused by the NVIDIA driver not being able to handle the stuck gpu8 and therefore being unable to complete the request to enable 0..7. That is just a guess.
Will have to run some more tests to see WTF is going on.
ID: 93928


Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.