strange error: garbage_collect ?cannot collect?

Message boards : Questions and problems : strange error: garbage_collect ?cannot collect?
Profile Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 92453 - Posted: 10 Aug 2019, 21:28:54 UTC
Last modified: 10 Aug 2019, 22:03:10 UTC

Two of my GPUs on a 10-GPU mining rig are stuck: 0% utilization, with the work unit showing 100% done.

Error messages:

7209	SETI@home	8/10/2019 3:34:45 PM	[error] garbage_collect(); still have active task for acked result blc32_2bit_guppi_58643_76143_HIP73005_0101.26078.409.23.46.97.vlar_0; state 5	
10233	SETI@home	8/10/2019 4:20:49 PM	[error] garbage_collect(); still have active task for acked result blc33_2bit_guppi_58643_86349_HIP33332_0131.3725.0.23.46.188.vlar_0; state 5	
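
For anyone who wants to watch the event log for this error, here is a minimal sketch that pulls the result name and the state number out of a line like the two above. The struct and function names are hypothetical helpers, not part of BOINC, and the pattern is matched only against the message text shown in this post.

```cpp
#include <regex>
#include <string>

// Hypothetical helper type (not part of BOINC): the two fields of interest.
struct StuckResult {
    std::string name;  // result name as printed in the message
    int state;         // numeric state as printed in the message
};

// Returns true and fills `out` if `line` contains the garbage_collect()
// error text quoted above; returns false and leaves `out` untouched otherwise.
bool parse_garbage_collect_line(const std::string& line, StuckResult& out) {
    static const std::regex pattern(
        R"(garbage_collect\(\); still have active task for acked result (\S+); state (\d+))");
    std::smatch m;
    if (!std::regex_search(line, m, pattern)) {
        return false;
    }
    out.name = m[1].str();
    out.state = std::stoi(m[2].str());
    return true;
}
```

Feeding each log line through `parse_garbage_collect_line` would make it easy to count how often the error occurs and for which results.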


What's happening?

Googling, I found a previous report dated 2010 over at SETI.

[EDIT] I cannot even kill boinc. I tried sudo kill -9 8109 (boinc) and plain kill 8109, and process 8109 never disappears from top or htop. Its command line shows boinc with --detectgpu, so it (7.16.1) seems stuck trying to detect the GPU and is not acting on the kill signal.

This was after using /etc/init.d/boinc-client stop to try to stop it.

Going to reboot.

[EDIT 2] Suspended, set NNT (no new tasks), and rebooted. The two "stuck" tasks were assigned GPUs 0 and 1 and finished within minutes. After resuming, the rest of the tasks look back to normal.

Maybe I ran out of memory with only 8 GB and 10 GPUs.
Profile Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 641
United States
Message 92454 - Posted: 11 Aug 2019, 12:51:17 UTC
Last modified: 11 Aug 2019, 13:04:21 UTC

I am trying to debug the problem, as it is happening once or twice a day.

It would appear that memory is not the problem.

Looking here:

        if (rp->got_server_ack) {
            // see if - for some reason - there's an active task
            // for this result.  don't want to create dangling ptr.
            //
            ACTIVE_TASK* atp = active_tasks.lookup_result(rp);
            if (atp) {
                msg_printf(rp->project, MSG_INTERNAL_ERROR,
                    "garbage_collect(); still have active task for acked result %s; state %d",
                    rp->name, atp->task_state()
                );
                // ... (rest of the function not quoted here)


State 5 means finished OK, from what I understand. It looks like the Linux SETI app does not realize it has finished.
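
To make the condition behind the message easier to reason about, here is a stripped-down model of the check in the excerpt. It is only a sketch: `Result`, `ActiveTask`, and `acked_but_still_active` are simplified stand-ins made up for illustration, and only the names `got_server_ack` and `lookup_result` come from the quoted source.

```cpp
#include <string>
#include <vector>

// Simplified stand-ins for the client's RESULT and ACTIVE_TASK types.
// Only got_server_ack and lookup_result echo the excerpt; the rest is assumed.
struct Result {
    std::string name;
    bool got_server_ack;
};

struct ActiveTask {
    const Result* result;  // the result this task is computing
    int state;             // whatever task_state() would report
};

// Plays the role of active_tasks.lookup_result(rp): find the task, if any,
// that is still bound to the given result.
const ActiveTask* lookup_result(const std::vector<ActiveTask>& tasks, const Result& r) {
    for (const ActiveTask& t : tasks) {
        if (t.result == &r) {
            return &t;
        }
    }
    return nullptr;
}

// True exactly when the logged error would apply: the server has acked
// the result, yet a task for it is still in the active list.
bool acked_but_still_active(const std::vector<ActiveTask>& tasks, const Result& r) {
    return r.got_server_ack && lookup_result(tasks, r) != nullptr;
}
```

In this model the error corresponds to a result that is both acked and still present in the task list, which matches the "dangling ptr" concern in the source comment.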

In my BOINC Manager, under status, I see the following typical behavior:
....running....uploading....ready-to-report

(1) At what point is the status set to 5? Is it after the upload? After the "ready to report"?
I am guessing the error occurs because the 5 is generated just after the "running" stage finishes, but "uploading" does not take place for some reason. So the result has got the server ack but is still marked as running, i.e. it is a dangling "active task".

(2) what exactly does "uploading" mean?

(3) what exactly does "reporting" mean?

Could there be a timing problem in the app when looking for the ack from the server? Who handles the ack: BOINC or the app?
Even if this is not a BOINC problem, I would like to know the answers to 1, 2 and 3 before going over to SETI and stirring the pot.

==============some other observations=============
kill and kill -9 do not kill the "dangling" task, even under sudo. I am not an expert, but kill -9 has always worked for me. I do see that "boinc" is the owner of the dangling task. Is that what is keeping me from being able to kill it? I would rather kill it than reboot. boinccmd --quit stops boinc but not that dangling task. A restart of the service fails: I see the task with the command "boinc --detectgpu xx" (I don't remember it exactly), and the task disappears and reappears as the service keeps trying to start, but boinc never gets past that detectgpu. I end up rebooting the system, and often have to power it off and on as it never totally shuts down.
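
One thing that may be worth checking the next time this happens (an assumption, nothing in this thread confirms it): on Linux, a process in uninterruptible sleep (state D) generally does not act on signals, kill -9 included, until the kernel call it is blocked in returns. The sketch below reads the state letter from /proc/<pid>/stat. The function names are hypothetical, and the parsing relies only on the documented "pid (comm) state ..." layout of that file.

```cpp
#include <fstream>
#include <string>

// Extracts the one-letter process state from the text of /proc/<pid>/stat.
// The layout is "pid (comm) state ...". The command name may itself contain
// spaces or parentheses, so the state is taken from just after the LAST ')'.
// Returns '?' if the text does not have the expected shape.
char parse_proc_state(const std::string& stat_text) {
    const std::string::size_type close = stat_text.rfind(')');
    if (close == std::string::npos || close + 2 >= stat_text.size()) {
        return '?';
    }
    if (stat_text[close + 1] != ' ') {
        return '?';
    }
    return stat_text[close + 2];
}

// Reads /proc/<pid>/stat (Linux only) and returns the state letter,
// or '?' if the file cannot be read.
char read_proc_state(int pid) {
    std::ifstream in("/proc/" + std::to_string(pid) + "/stat");
    std::string text;
    if (!in || !std::getline(in, text)) {
        return '?';
    }
    return parse_proc_state(text);
}
```

Calling `read_proc_state(8109)` on the stuck process would show whether it reports D (uninterruptible sleep), Z (zombie), or something else. A D here would point toward the process being blocked inside the kernel, for example in a driver call, rather than toward a permissions problem.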
Profile Dave
Help desk expert
Joined: 28 Jun 10
Posts: 2531
United Kingdom
Message 92455 - Posted: 11 Aug 2019, 13:03:19 UTC

(2) what exactly does "uploading" mean?

(3) what exactly does "reporting" mean?


My understanding is that uploading is sending the zip file(s) with the data back to the server, and reporting is telling the project whether the task ended in success or failure. At least, that is what it means with CPDN.
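
The order of those steps, as described in this thread, can be written down as a tiny state sequence. This is only an illustration: the stage names are taken from the Manager status text quoted earlier and are hypothetical, not the client's internal constants.

```cpp
#include <string>

// Hypothetical stage names based on the status text quoted in this thread.
// These are NOT the BOINC client's internal state values.
enum class Stage { Running, Uploading, ReadyToReport, Acked };

// Returns the stage that follows `s` in the normal, error-free sequence.
Stage next_stage(Stage s) {
    switch (s) {
        case Stage::Running:       return Stage::Uploading;      // work done, output files go to the server
        case Stage::Uploading:     return Stage::ReadyToReport;  // files sent, outcome not yet reported
        case Stage::ReadyToReport: return Stage::Acked;          // outcome reported and acknowledged
        case Stage::Acked:         return Stage::Acked;          // final stage, nothing follows
    }
    return s;
}

// Short human-readable description of each stage.
std::string describe(Stage s) {
    switch (s) {
        case Stage::Running:       return "running";
        case Stage::Uploading:     return "uploading output files";
        case Stage::ReadyToReport: return "ready to report";
        case Stage::Acked:         return "reported and acknowledged";
    }
    return "unknown";
}
```

Seen this way, the error in the first post describes a result that has already reached the acknowledged stage while its task has not left the running stage.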


Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.