permission problem: if client cannot run the app (wrong owner) why does it delete the work unit?

Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 93191 - Posted: 14 Oct 2019, 4:15:26 UTC
Last modified: 14 Oct 2019, 4:21:02 UTC

I started BOINC manually and failed to use sudo. Consequently, the client was running under my name instead of root (I am guessing this is the problem).

It could not find any app, even though I can see them with "ls -l" (Ubuntu). OK, I forgot the sudo. But it appears the client can delete the work units!!! So how can it get away with deleting the work units after it finds it cannot access the app? It seems like it should have just reported that the work units were not there! That caused 100 work units to be trashed.

ls -l
-rwxr-xr-x 1 boinc boinc 181979256 Oct 12 19:06 setiathome_x41p_V0.98b1_x86_64-pc-linux-gnu_cuda90


13-Oct-2019 22:14:24 [SETI@home] State file error: missing application file setiathome_x41p_V0.98b1_x86_64-pc-linux-gnu_cuda90

13-Oct-2019 22:14:24 [SETI@home] No application found for task blc64_2bit_guppi_58692_57863_HIP21594_0020.9424.818.21.44.28.vlar: platform x86_64-pc-linux-gnu version 801 plan class cuda90; discarding


Maybe permissions and ownership are wrong. Maybe it could not access anything inside "/projects". I am not an expert on Linux. Since I am admin, it seems boinc would be able to find and run that SETI app even if I forgot to use sudo.
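A quick way to sanity-check the permission theory from a shell is something like this sketch. The temp path and mode bits here are illustrative only, not from my actual setup:

```shell
# Sketch: verify that a file's mode bits allow execution, mimicking the
# -rwxr-xr-x shown by ls -l above. The temp path is illustrative only.
tmpdir=$(mktemp -d)
APP="$tmpdir/fake_app"
touch "$APP"
chmod 0755 "$APP"    # -rwxr-xr-x: owner, group, and others may execute
if [ -x "$APP" ]; then
    echo "app is executable by the current user"
fi
# Note: every directory on the path also needs the execute (search) bit
# for the user running the client, or the file is unreachable even if
# the file itself is 0755.
```

On the real box the equivalent check would be run as the boinc user against the actual file, e.g. `sudo -u boinc test -x projects/setiathome.berkeley.edu/<app>` (with `<app>` standing in for the real filename).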

Is it necessary to use sudo when running boinc? Maybe I should not have used sudo in the first place when I caused the boinc folders to get created.
This is what I used to create the boinc folders:
sudo /usr/bin/boinc --gui_rpc_port 31418 --dir /home/jstateson/nuke1 --detach

The above created an alternate instance of boinc at "nuke1" in my home directory. I am running some tests using multiple instances of the client, after reading this Windows discussion as a reference. There is no "sudo" problem in Windows.

What is the best way in Linux to set up multiple clients? Is sudo necessary? Maybe I am going where not many have gone before!

Is there any way to recover the work units? I recall reading something about recovering "ghost" work units over at SETI. They call it the "ghost protocol". Is that applicable here?
ID: 93191
Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 93192 - Posted: 14 Oct 2019, 7:08:42 UTC
Last modified: 14 Oct 2019, 7:24:34 UTC

Follow up.

I looked deeper into stdoutdae.txt and picked the first file that was downloaded and then disappeared. The filename was
blc64_2bit_guppi_58692_57863_HIP21594_0020.9424.818.21.44.28.vlar

root@jyslinux1:/home/jstateson/nuke1# grep -i "blc64_2bit_guppi_58692_57863_HIP21594_0020.9424.818.21.44.28.vlar" *.txt
stdoutdae.txt:13-Oct-2019 19:15:09 [SETI@home] Started download of blc64_2bit_guppi_58692_57863_HIP21594_0020.9424.818.21.44.28.vlar
stdoutdae.txt:13-Oct-2019 19:15:12 [SETI@home] Finished download of blc64_2bit_guppi_58692_57863_HIP21594_0020.9424.818.21.44.28.vlar
stdoutdae.txt:13-Oct-2019 19:29:01 [SETI@home] Starting task blc64_2bit_guppi_58692_57863_HIP21594_0020.9424.818.21.44.28.vlar_0
stdoutdae.txt:13-Oct-2019 22:14:24 [SETI@home] No application found for task blc64_2bit_guppi_58692_57863_HIP21594_0020.9424.818.21.44.28.vlar: platform x86_64-pc-linux-gnu version 801 plan class cuda90; discarding
stdoutdae.txt:13-Oct-2019 22:14:24 [SETI@home] State file error: result blc64_2bit_guppi_58692_57863_HIP21594_0020.9424.818.21.44.28.vlar_0 not found for task
stdoutdae.txt:13-Oct-2019 22:20:29 [SETI@home] Couldn't delete file projects/setiathome.berkeley.edu/blc64_2bit_guppi_58692_57863_HIP21594_0020.9424.818.21.44.28.vlar.gzt


The above info from the error file makes it look like boinc was unable to delete the file after all. However, it was attempting to delete
blc64_2bit_guppi_58692_57863_HIP21594_0020.9424.818.21.44.28.vlar.gzt
I assume .gzt is the compressed file and it must get uncompressed before being stuffed into projects/setiathome.berkeley.edu. I have never seen a .vlar.gzt, only the .vlar.

Be that as it may, while it appears the client was unable to delete the file, THE FILE, ALONG WITH ANOTHER 100, WAS ACTUALLY DELETED. I am guessing that it deleted the uncompressed file and then, for who knows why, tried to delete the compressed file, which probably did not exist anymore. Just a hunch.

According to Keith Myers, lost tasks can be recovered. Sure enough, I was able to recover 2 tasks out of the 100. One has about 0.75 to 1.50 seconds to click "no network activity" after seeing "Reported xx tasks" but before "Scheduler request completed", and it took me 5 tries before I got the timing correct. I only got 2 lost tasks recovered because I had restored the resource share to 100 from 0, but boinc checks the current resource share, sees "0" before it discovers that the project had specified "100", and by then it is too late: only one lost task for each GPU gets downloaded.

Anyway, I can eventually get the rest of the lost tasks if I want to. Click on Keith's name above to see his procedure, which I tried.

That brings me to a second question. Instead of quickly clicking on "no network activity" within a second (after waiting maybe 15 minutes for the chance), could the program scheduler_op.cpp be modified as follows:
        if (p->nresults_returned) {
            msg_printf(p, MSG_INFO,
                "Reporting %d completed tasks", p->nresults_returned
            );
        }
        request_string(buf, sizeof(buf));
        if (strlen(buf)) {
            msg_printf(p, MSG_INFO, "Requesting new tasks for %s", buf);
            // ===========> PUT IN AN EXIT HERE <==========
        } else {
            if (p->pwf.project_reason) {
                msg_printf(p, MSG_INFO,
                    "Not requesting tasks: %s", project_reason_string(p, buf, sizeof(buf))
                );
            } else {
                msg_printf(p, MSG_INFO, "Not requesting tasks");
            }
        }

I do not know how that "ghost protocol" came about, but if one is supposed to stop all network activity immediately after seeing that "Requesting new tasks for …" message, it seems to me an easy fix is a special build of the client that simply exits right after that message is displayed. I have been able to build the Linux version of boinc and was wondering if my idea would work. I have found it very difficult to perform Keith's procedure to recover lost tasks, as I have a fast computer and the system is remote, which adds some latency.

[EDIT] Amazing - I totally forgot to post this message after previewing it. I then went off to other web pages and discovered it had not been posted when I clicked on my back link. I hit the back arrow until I got an "are you sure you want to re-submit" prompt. It worked! I did not lose my post! I guess I have Microsoft to thank for this.
ID: 93192
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 93193 - Posted: 14 Oct 2019, 9:53:20 UTC

I'd be a little bit careful about taking this discussion too far.

BOINC is designed to work quietly and automatically in the background. If things go wrong, there are automatic recovery processes in place - like cleaning up lost tasks when no application can be found to run them.

If you choose to drive it manually - you own your own mistakes. Hopefully, that teaches you not to make the same mistake next time.

In this case, a mistake caused the loss of a number of tasks from a cache. Shit happens. On most projects, BOINC would automatically recover: Job retransmission is a normal and recommended project option:

<resend_lost_results> 0|1 </resend_lost_results>
If set, and a <other_results> list is present in scheduler request, resend any in-progress results not in the list. This is recommended; it may increase the efficiency of your project. For reasons that are not well understood, a BOINC client sometimes fails to receive the scheduler reply. This flag addresses that issue: it causes the SAME results to be resent by the scheduler, if the client has failed to receive them. Note: this will increase the load on your DB server; you can minimize this by creating an index:

alter table result add index res_host_state (hostid, server_state);
In this case, the tasks were lost from the SETI@Home project. SETI happens to be a special case where "this will increase the load on your DB server" applies - it's been tried, and deliberately turned off because the load was unmanageable.

As people have found over the years, there's a loophole - quite possibly a programming bug - which enables lost tasks to be resent by using the 'ghost task recovery protocol'. Keith's protocol is benign: it relies on precise timing, but doesn't involve editing any files. [editing files usually causes more problems than it solves]

But as the discussion and revised protocol around 30 August makes clear, the actual trigger for the loophole is reporting the same task for a second time. Keith achieves this by deliberately not listening for the scheduler reply which contains the acknowledgement. But the converse is: not hearing a scheduler reply is exactly what causes a task to be 'ghosted' in the first place. If you miss a step in the protocol, you risk making things worse, rather than better.

I'd be very careful about automating this outside the control of a tight user group. If an automated tool became widely available and increased the server load as before, it could cause the devs to hunt down and fix the server programming bug which opened the loophole in the first place.

But if you do want to automate it (which I did once consider doing myself), I think I'd handle it by discarding one of the 'acks' in the scheduler reply, so that the client retains one of the 'ready to report' tasks for re-reporting. You must NOT do this on every report: the DB load is caused by checking every entry in your (possibly long) list of other tasks in progress. Only do this when you have to - so it needs a manual trigger and automatic 'return to normal' afterwards.
ID: 93193
Keith Myers
Volunteer tester
Help desk expert
Joined: 17 Nov 16
Posts: 863
United States
Message 93196 - Posted: 14 Oct 2019, 14:23:43 UTC

From my impression of the comments, I think the project would rather you just ignore the problem and let the "ghost" tasks expire in the database at their normally scheduled deadline, to be re-issued in replication. I don't like adding more bloat to the db if I can avoid it, so I reclaim the tasks as soon as possible. It seems a lot of people are unable to activate the suspend-connectivity property fast enough to get ahead of the scheduler ack. But I don't have any better suggestion.
ID: 93196
Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 93197 - Posted: 14 Oct 2019, 15:55:00 UTC - in response to Message 93193.  
Last modified: 14 Oct 2019, 16:02:47 UTC

I'd be very careful about automating this, outside the control of a tight user group. If an automated tool became widely available and increases the server load as before, it could cause the devs to hunt down and fix the server programming bug which opened the loophole in the first place


Thanks for explaining this! I was unaware of any server problem. I thought things were hunky-dory and did not know chasing down lost work units was akin to running amok through the database. Sorry for my American slang.

So, if the programmers decide to chase down that bug, maybe they might consider "fixing" the "feature" that allows users to fake or spoof the number of GPUs from just 1 physical GPU up to 96 or as many as they want. Something along the lines of
 <number_virtual_gpus>128</number_virtual_gpus>

would make it a lot easier to download a lot more work units than is normally allowed for just 1 or 2 GPUs.

The problem of my lost tasks originated from my attempt to duplicate the "bunkering" of work units that a few (?) users do before the SETI WOW event, which is held yearly. Discussion of that topic is buried midway down this thread.

I was able to duplicate the work unit bunkering using a pair of boinc clients on the same machine and 1 or 2 GPUs for testing. The idea is to set "NumClients" to something like 1000 or more and download, accumulate, and process (but not upload) work units during the approximately 2 or 3 months before the WOW event starts:
let NumClients=2
let BasePort=31416
for (( n = 0; n < NumClients; n++ )); do
    NumPort=$(( BasePort + n ))
    echo sudo /usr/bin/boinc --gui_rpc_port $NumPort --dir /home/jstateson/nuke$n --detach
done
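Once instances are running, each one can be addressed separately through its own RPC port, since boinccmd accepts --host hostname:port. A dry-run sketch along the same lines as the loop above (the host, ports, and chosen command are my assumptions):

```shell
# Print one boinccmd invocation per client instance, each aimed at a
# different GUI RPC port. Echoed rather than executed, like the loop above.
NumClients=2
BasePort=31416
for (( n = 0; n < NumClients; n++ )); do
    NumPort=$(( BasePort + n ))
    # e.g. query client state; swap in other boinccmd operations as needed
    echo boinccmd --host localhost:$NumPort --get_state
done
```

Removing the echo would run the commands for real against each instance.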


The process of holding back work units until the WOW event, while releasing any others before and after the event time period, can be handled by a BoincTasks "rule" and an app I have.
In the process of testing out my idea (I have nothing better or more interesting to do), some s**t did hit the fan, so to say, but once it gets working it can be upscaled from 2 clients to whatever is best for my Linux box.
I thought I was helping the project out by recovering my lost tasks, but it was a waste of time, though a learning experience.

[EDIT] As far as the original question on this forum goes, it seems to me that the client should determine whether the app is truly missing. If the app is merely not accessible, it should report that in an event message or notification and NOT delete the work units.
ID: 93197
Dave
Help desk expert
Joined: 28 Jun 10
Posts: 2516
United Kingdom
Message 93198 - Posted: 14 Oct 2019, 17:21:33 UTC

So, if the programmers decide to chase down that bug, maybe they might consider "fixing" the "feature" that allows users to fake or spoof the number of GPUs from just 1 physical GPU up to 96 or as many as they want. Something along the lines of

<number_virtual_gpus>128</number_virtual_gpus>


I never understood why it is possible to spoof the number of CPU cores either. From memory, when I tried it to see what would happen, it resulted in tasks crashing, but that may just be my experience. I didn't actually try to download extra work with it, though.
ID: 93198
Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 93199 - Posted: 14 Oct 2019, 18:24:29 UTC - in response to Message 93198.  
Last modified: 14 Oct 2019, 18:28:23 UTC

So, if the programmers decide to chase down that bug, maybe they might consider "fixing" the "feature" that allows users to fake or spoof the number of GPUs from just 1 physical GPU up to 96 or as many as they want. Something along the lines of

<number_virtual_gpus>128</number_virtual_gpus>


I never understood why it is possible to spoof the number of CPU cores either. From memory, when I tried it to see what would happen, it resulted in tasks crashing, but that may just be my experience. I didn't actually try to download extra work with it, though.


I did figure out the "how" but, like you, did not want to run any work units.

In cs_scheduler.cpp the following code segment obtains actual system information, including the number of GPUs:
    // send master global preferences if present and not host-specific
    //
    if (!global_prefs.host_specific && boinc_file_exists(GLOBAL_PREFS_FILE_NAME)) {
        FILE* fprefs = fopen(GLOBAL_PREFS_FILE_NAME, "r");
        if (fprefs) {
            copy_stream(fprefs, f);
            fclose(fprefs);
        }
    }


Further down, where the client sends messages to the project, I wrote over that data:
        FILE* fprefs = fopen("spoof.txt", "r");
        if (fprefs) {
            copy_stream(fprefs, f);
            fclose(fprefs);
        }


The file "spoof.txt" had a fake number of GPUs, but it probably also had a lot of wrong stuff, as I was guessing; at this point in cs_scheduler it seems only the number of GPUs is used.

Like the SETI GPU Users group I cannot give away all my secrets so the contents of spoof.txt is my little secret.

I have not decided whether to follow through on my "calling" to bunker up work units before the next WOW event, but I have already come up with a system name: "NumberOfBeasts". While I have not changed my domain name yet, I have picked an appropriate boinc client version number, as shown here: https://setiathome.berkeley.edu/show_host_detail.php?hostid=8830364
ID: 93199
Richard Haselgrove
Volunteer tester
Help desk expert
Joined: 5 Oct 06
Posts: 5077
United Kingdom
Message 93200 - Posted: 14 Oct 2019, 18:45:39 UTC - in response to Message 93198.  
Last modified: 14 Oct 2019, 19:25:18 UTC

I never understood why it is possible to spoof the number of cpu cores either. I think from memory when I tried it to see what happens it resulted in tasks crashing but that may just be my experience. I didn't actually try to download extra work with it though.
It's always been described in the documentation as a programming and debugging aid for the project scientists who develop the apps to run under BOINC. Remember that BOINC was being developed in the earliest years of the new millennium: I guess that at that time, a research postgrad programmer would feel pretty blessed to be offered a Pentium D computer. But their institutions would have been looking forward to the coming developments in multi-core hardware and multi-threaded software.

<ncpus> would have been provided to allow simulation of multi-core server or workstation hardware. It would have run unbearably slowly, and it would have thrashed memory as program contexts were swapped in and out - but it was better than nothing.

We don't have any equivalent need for multi-GPU simulation. I've never heard of a scientific programming language (CUDA, OpenCL, or whatever) utilising SLI or CrossFire: BOINC (currently) doesn't have any support for any equivalent of multithreading in the GPU arena. Except, of course, for the zillions of kernels that are executing in parallel on the hardware within each single GPU. That's enough.

The only pressure for spoofing GPUs has come, again, from the SETI@Home project - and the perceived need arises, again, from the self-defence mechanisms which that project has felt necessary to wrap around itself to keep the hordes of users at bay.

SETI normally has a relatively steady eight million tasks stored in its database. They come and go at the rate of about 35 a second - day in, day out. Just roll those figures around on your tongue for a moment. Their servers can cope - just, most of the time. But they do try quite hard to stop things getting out of hand.

There is a spoofed GPU client in existence, but its use is tightly restricted to a team which helps Eric Korpela (project scientist) with hardware and fundraising. I've been granted permission to test it out, and I can see the problem: my machine is a fairly standard commercial gaming machine, and it spits out a task every 50 seconds with no non-standard hardware at all. It's currently at #55 in the 'top hosts' list at SETI. Cost? Under 1,000 GBP for the basic machine (assembled, with a 3-year warranty), plus another 270 GBP for a second GPU. Just the two in total.

The only advantage of spoofing the client on a machine like that (it reports 16 GPUs) is to ride out the maintenance outages in a production environment. Unlike the ncpus spoof, it confers no programming assistance: I think GPU spoofing won't be accepted into production versions of BOINC any time soon.
ID: 93200
Dave
Help desk expert
Joined: 28 Jun 10
Posts: 2516
United Kingdom
Message 93204 - Posted: 15 Oct 2019, 6:01:33 UTC - in response to Message 93200.  

Thanks Richard,
it was a while ago that I tried it, and I suspect lack of memory may have been a factor in the tasks crashing; as I say, it was pure curiosity that led me to try it. But it is good to have the background to it all.
ID: 93204


Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.