GPUgrid not always resuming tasks correctly

Keith Myers Volunteer tester Help desk expert Send message Joined: 17 Nov 16 Posts: 879	Message 94205 - Posted: 11 Dec 2019, 22:13:06 UTC Haven't found the answer in the GPUGrid forums or the BOINC forums. The acemd3 application at GPUGrid does not survive a restart on a different card. The application does survive a restart when there is only one gpu or all gpus are the same type. I have hosts with dissimilar cards. What I need to know is: does suspending a currently running gpu task write out a saved state file that allows the task to resume on the same card? Or does the suspended task simply restart on whatever card first becomes available? Every time I have restarted BOINC or rebooted the host with an acemd3 task running, the task has errored out and been dumped, throwing away many hours of computation. I have not tried suspending a running task yet and wonder if that might be the solution. ID: 94205 ·

Richard Haselgrove Volunteer tester Help desk expert Send message Joined: 5 Oct 06 Posts: 5103	Message 94206 - Posted: 11 Dec 2019, 22:38:57 UTC - in response to Message 94205. It will have written the normal checkpoint file, if the project has that capability. That enables calculations to resume from the point reached prior to the checkpoint. If the task is to be resumed, BOINC will resume it when 'a' GPU becomes available. BOINC will direct that the task runs on the free GPU - by a command line, or by an init_data.xml file, depending on the API version in use. BOINC does not wait until the previously used, or even an identical, GPU becomes available. ID: 94206 ·

Keith Myers Volunteer tester Help desk expert Send message Joined: 17 Nov 16 Posts: 879	Message 94213 - Posted: 12 Dec 2019, 4:22:11 UTC - in response to Message 94206. Thanks Richard. That is what I assumed would happen. Nothing in the saved state file forces the task to resume on the same gpu it started on. It just resumes on whatever gpu first becomes available. For my host with three identical GTX 1070 Ti cards, that has not been a problem. A task can start on Device 0 and finish on Device 1 with no errors because BOINC states that all devices are equivalent. However, on my other hosts, with at least one card dissimilar from the rest, you take your chances on whether the task restarts on the same card type or not. And with the restart comes the chance that the task errors out when restarted on a different device type. I just waited until the last GPUGrid task finished up today before switching off the computer for some maintenance involving switching out cpu blocks, extra fan installations and other changes. Just got a lot later start on the renovation than anticipated. All back together and crunching again. Jury still out on whether the changes were beneficial or not. ID: 94213 ·

Joseph Stateson Volunteer tester Send message Joined: 27 Jun 08 Posts: 641	Message 94222 - Posted: 13 Dec 2019, 13:22:39 UTC Last modified: 13 Dec 2019, 13:27:46 UTC I can shed some light on this problem and offer a possible solution, but I think it is up to the project to do a proper resume. I have been looking at the possibility of removing a defective GPU from the pool of available GPUs and learned a few things about which modules are responsible for assigning a gpu to an app. Being able to assign the same app to the same GPU on a resume is similar to assigning it to a different GPU due to failure of the one it was on. I looked at the slots on my system, which is currently running two gpugrid tasks (lucky me!). The stderr file in slot 0 shows "boinc input --device 1". The one in slot 1 shows "boinc input --device 0". I am guessing that when the system reboots, boinc has lost track of which app was running in which slot and assigns the first device it finds to the first suspended app. I am pretty sure this is the case, as module "app_start" calls "coproc_cmdline" and that module finds the first "N" through iteration and puts it into "--device N". Later on a slot is assigned. coproc_cmdline simply gets the first # and only checks to see if it is "out of range". I am guessing, though it could be verified, that the problem of the failing gpugrid task could be solved by a cut and paste of the contents of the "slot" into a different slot based on ascending number. KISS: if stderr of slot 0 shows device 1 and stderr of slot 1 shows device 0, then swap the contents of the slots after stopping boinc and before rebooting. ID: 94222 ·

Richard Haselgrove Volunteer tester Help desk expert Send message Joined: 5 Oct 06 Posts: 5103	Message 94223 - Posted: 13 Dec 2019, 13:37:21 UTC - in response to Message 94222. Check the timestamps and other clues. --device n is an old way of doing things. With GPUGrid, you might be looking at the start instruction given by the wrapper to the science app. In turn, BOINC will have passed the device instruction to the wrapper in init_data.xml - which will probably have been re-written at the start of the 'resume' session. And the slot number is of no effect. It's just the first scratch folder which happened to be available when the task was first launched - which may have been hours earlier. ID: 94223 ·

Joseph Stateson Volunteer tester Send message Joined: 27 Jun 08 Posts: 641	Message 94225 - Posted: 13 Dec 2019, 13:42:20 UTC - in response to Message 94223. Last modified: 13 Dec 2019, 14:08:48 UTC
> Check the timestamps and other clues. --device n is an old way of doing things. With GPUGrid, you might be looking at the start instruction given by the wrapper to the science app. In turn, BOINC will have passed the device instruction to the wrapper in init_data.xml - which will probably have been re-written at the start of the 'resume' session. And the slot number is of no effect. It's just the first scratch folder which happened to be available when the task was first launched - which may have been hours earlier.
The checkpoint file is in the slot. Clearly, the device assigned must be getting the wrong slot on restart. The dates of all files are current except the app.
12/13/2019 07:42 AM 24,184,238 restart.chk
12/13/2019 07:40 AM 24,184,238 restart.chk.bkp
12/13/2019 06:28 AM 123 stderr.txt
As mentioned, this can easily be tested. I cannot do it myself as I have identical GPUs and it is difficult to even get gpugrid work units.
> And the slot number is of no effect. It's just the first scratch folder which happened to be available when the task was first launched - which may have been hours earlier.
Actually, I think that is the problem. On restart the app is assigned the first slot and gets the wrong checkpoint file. ID: 94225 ·

Richard Haselgrove Volunteer tester Help desk expert Send message Joined: 5 Oct 06 Posts: 5103	Message 94227 - Posted: 13 Dec 2019, 14:43:47 UTC - in response to Message 94225.
>> And the slot number is of no effect. It's just the first scratch folder which happened to be available when the task was first launched - which may have been hours earlier.
> Actually, I think that is the problem. on restart the app is assigned the first slot and gets the wrong checkpoint file.
No, if it did that it would probably try to start a task from a different project. As it happens, I've got a GPUGrid task on this machine. The xml files as I start this reply are:
13/12/2019 14:19 511 boinc_task_state.xml
13/12/2019 08:58 14,807 init_data.xml
After suspending it and allowing SETI to take a turn on the GPU, they are:
13/12/2019 14:25 511 boinc_task_state.xml
13/12/2019 14:39 14,815 init_data.xml
'Task_state' has been updated, because I waited until the next checkpoint before pausing it: init_data was re-written on restart to reflect the new device number - although on this machine GPUGrid always runs on device 0, and this time ran in slot 1.
13/12/2019 14:39:08 | GPUGRID | [cpu_sched] Restarting task initial_1706-ELISA_GSN4V1-35-100-RND9825_0 using acemd3 version 210 (cuda101) in slot 1
ID: 94227 ·

Joseph Stateson Volunteer tester Send message Joined: 27 Jun 08 Posts: 641	Message 94228 - Posted: 13 Dec 2019, 15:31:32 UTC - in response to Message 94227.
> if it did that it would probably try to start a task from a different project.
I follow you on all of this, but the problem seems to be with the project, and it is the project's responsibility to do the resume.
Directory of D:\ProgramData\Boinc\slots\0
12/13/2019 09:13 AM 24,184,239 restart.chk
1 File(s) 24,184,239 bytes
Directory of D:\ProgramData\Boinc\slots\1
12/13/2019 09:14 AM 24,184,239 restart.chk
1 File(s) 24,184,239 bytes
The files are identical in size but have different binary contents. It seems logical to me that the files are used to restore the state of the app. I am guessing that the app looks for "restart.chk" and, if it is there, attempts to resume and gets the wrong state info. One nice thing about my "theory" is that it is falsifiable. You showed that if put into a different slot it ran without failing. Did it restart from the checkpoint, or did it just start over when it could not find the checkpoint file? Does it even need that file to resume from where it left off? If that file is needed to resume from where it left off, then how does it find it? What is the "working directory" of the app? I don't have the answers to all of these. Unlike the global warming theory that cannot be falsified (too little snow proves global warming and so does too much snow), if swapping the slots still causes gpugrid to fail then my "guess" was definitely wrong. ID: 94228 ·
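
The "same size, different contents" observation above can be checked without opening the files by hashing them. The following is only a minimal sketch: the function names are hypothetical, the checkpoint name restart.chk is taken from the listing above, and the slots path has to be supplied by whoever runs it.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_checkpoints(slots_dir, name="restart.chk"):
    """Map each slot directory name to (size, digest) of its checkpoint file.

    Slots without a file called `name` are skipped.
    """
    result = {}
    for slot in sorted(Path(slots_dir).iterdir()):
        chk = slot / name
        if chk.is_file():
            result[slot.name] = (chk.stat().st_size, file_digest(chk))
    return result
```

Calling compare_checkpoints on the BOINC slots directory (for example D:\ProgramData\Boinc\slots on the host above) returns one entry per slot that holds a checkpoint. Equal sizes with different digests are what you would expect from two separate tasks, so the result by itself says nothing about whether a task is reading the wrong file.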

Richard Haselgrove Volunteer tester Help desk expert Send message Joined: 5 Oct 06 Posts: 5103	Message 94229 - Posted: 13 Dec 2019, 15:48:53 UTC - in response to Message 94228. Again, no. It's BOINC which manages the slot directories, not the project. When a task is first started, all the files are put into the first directory which BOINC determines is currently empty. In this case, it was slot 1:
13-Dec-2019 08:58:53 [GPUGRID] [cpu_sched] Starting task initial_1706-ELISA_GSN4V1-35-100-RND9825_0 using acemd3 version 210 (cuda101) in slot 1
And everything stays there until the task finally finishes, at which point BOINC deletes everything so it's clean for the next task (which might be from a completely different project). If there's no empty directory, BOINC creates a new one:
13-Dec-2019 14:25:18 [GPUGRID] task initial_1706-ELISA_GSN4V1-35-100-RND9825_0 suspended by user
13-Dec-2019 14:25:24 [SETI@home] [cpu_sched] Starting task 11dc19aa.13973.1294.14.41.146.vlar_2 using setiathome_v8 version 800 (opencl_nvidia_SoG) in slot 5
BOINC couldn't re-use slot 1, because it was still full of the files copied at 08:58. After SETI had finished with slot 5, BOINC cleaned it up:
Directory of D:\BOINCdata\slots\5
13/12/2019 14:39 <DIR> .
13/12/2019 14:39 <DIR> ..
0 File(s) 0 bytes
2 Dir(s) 940,707,196,928 bytes free
I should point you to the init_data.xml file in the GPUGrid slot (slot 1):
<gpu_type>NVIDIA</gpu_type>
<gpu_device_num>0</gpu_device_num>
<gpu_opencl_dev_index>0</gpu_opencl_dev_index>
<gpu_usage>1.000000</gpu_usage>
and in my normal SETI working slot (slot 2):
<gpu_type>NVIDIA</gpu_type>
<gpu_device_num>1</gpu_device_num>
<gpu_opencl_dev_index>1</gpu_opencl_dev_index>
<gpu_usage>1.000000</gpu_usage>
ID: 94229 ·
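
To see at a glance which device BOINC has assigned to each slot, the four init_data.xml fields quoted above can be collected for every slot. This is a rough sketch under two assumptions: init_data.xml parses as well-formed XML, and the helper names (read_gpu_assignment, gpu_assignments) are invented here and are not part of BOINC.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Field names as they appear in the init_data.xml excerpts above.
GPU_FIELDS = ("gpu_type", "gpu_device_num", "gpu_opencl_dev_index", "gpu_usage")


def read_gpu_assignment(init_data_path):
    """Return the GPU-related fields of one init_data.xml as a dict of strings."""
    root = ET.parse(init_data_path).getroot()
    # ".//name" searches below the root at any depth, so the sketch does not
    # depend on the exact nesting of the file. Missing fields come back as None.
    return {name: root.findtext(f".//{name}") for name in GPU_FIELDS}


def gpu_assignments(slots_dir):
    """Map each slot directory name to the GPU fields of its init_data.xml."""
    out = {}
    for slot in sorted(Path(slots_dir).iterdir()):
        init_data = slot / "init_data.xml"
        if init_data.is_file():
            out[slot.name] = read_gpu_assignment(init_data)
    return out
```

Running gpu_assignments before a suspend and again after the resume would show whether gpu_device_num changed while the slot stayed the same, which is the behaviour described in this thread.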

Joseph Stateson Volunteer tester Send message Joined: 27 Jun 08 Posts: 641	Message 94230 - Posted: 13 Dec 2019, 15:56:31 UTC - in response to Message 94229. Last modified: 13 Dec 2019, 16:08:51 UTC
> Again, no. It's BOINC which manages the slot directories, not the project.
Richard: I don't have a problem with anything you have written here. It is the project that has to manage the resume, and it is not working correctly due to some design fault on their part. I assume it is finding the wrong checkpoint files and that is causing the failure. [edit] I got the idea of the wrong checkpoint file being used from something Keith told me last week. He said that you need to unsuspend the gpugrid tasks in the same order they were suspended. That implies they are finding the "right stuff" in the "right place". This whole problem is the project's fault, and they have a lot bigger problems than this. ID: 94230 ·

Richard Haselgrove Volunteer tester Help desk expert Send message Joined: 5 Oct 06 Posts: 5103	Message 94231 - Posted: 13 Dec 2019, 16:09:07 UTC - in response to Message 94230.
>> Again, no. It's BOINC which manages the slot directories, not the project.
> Richard: I don't have a problem with anything you have written here. It is the project that has to manage the resume and it is not working correctly due to some designed fault on their part. I assume it is finding the wrong checkpoint files and that is causing the failure.
BOINC tells the app which GPU to use. BOINC may say 'use device 0' when it first starts, and it may say 'use device 1' after a restart. But the files will be in slot 17 or whatever, both times. Both the initial data files, and the checkpoint files, will be in that slot the whole time - neither BOINC nor I moved them. The problem with GPUGrid is that their new app wants to run on the same model of card after a restart, and BOINC doesn't guarantee that: all it guarantees is that CUDA tasks will run on 'a' NVidia GPU - any NVidia GPU. The machine I'm pulling examples from has a GTX 970 and a GTX 750 Ti. The second card is too slow to use for GPUGrid, so I have
<exclude_gpu>
<url>http://www.gpugrid.net/</url>
<device_num>1</device_num>
<type>NVIDIA</type>
</exclude_gpu>
in cc_config.xml - that avoids the problem you're describing. GPUGrid runs on device zero, period. ID: 94231 ·
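
The same exclusion idea can be extended to a host with several dissimilar cards: keep GPUGrid on one model and exclude every device of any other model. The sketch below only generates the XML text. The function name exclude_other_models is hypothetical, the device-to-model mapping is typed in by hand rather than detected, and the assumption that device 0 is the GTX 970 follows from the posts in this thread.

```python
import xml.etree.ElementTree as ET


def exclude_other_models(devices, keep_model,
                         url="http://www.gpugrid.net/", gpu_type="NVIDIA"):
    """Build one <exclude_gpu> element per device whose model differs from keep_model.

    `devices` maps BOINC device numbers to model names (entered by hand here).
    """
    elements = []
    for device_num, model in sorted(devices.items()):
        if model == keep_model:
            continue
        excl = ET.Element("exclude_gpu")
        ET.SubElement(excl, "url").text = url
        ET.SubElement(excl, "device_num").text = str(device_num)
        ET.SubElement(excl, "type").text = gpu_type
        elements.append(excl)
    return elements


# The two-card host described above: keep GPUGrid on the GTX 970 only.
for element in exclude_other_models({0: "GTX 970", 1: "GTX 750 Ti"},
                                    keep_model="GTX 970"):
    print(ET.tostring(element, encoding="unicode"))
```

For that host the loop prints a single element for device 1, matching the hand-written entry above. The generated elements still have to be pasted into the <options> section of cc_config.xml by hand, and the client has to re-read its config files (or be restarted) before they take effect.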

Joseph Stateson Volunteer tester Send message Joined: 27 Jun 08 Posts: 641	Message 94232 - Posted: 13 Dec 2019, 16:20:06 UTC - in response to Message 94231. Last modified: 13 Dec 2019, 16:24:35 UTC
> Both the initial data files, and the checkpoint files, will be in that slot the whole time - neither BOINC nor I moved them. The problem with GPUGrid is that their new app wants to run on the same model of card after a restart, and BOINC doesn't guarantee that: all it guarantees is that CUDA tasks will run on 'a' NVidia GPU - any NVidia GPU.
This is no different from what I have been saying. A different GPU is given a working directory of slot "17". The files have not been moved, so they are the same checkpoint files that were created on the previous, different, GPU, and when read in they cause problems resuming. ID: 94232 ·

Richard Haselgrove Volunteer tester Help desk expert Send message Joined: 5 Oct 06 Posts: 5103	Message 94233 - Posted: 13 Dec 2019, 16:35:10 UTC - in response to Message 94222. Going back to the very beginning of this conversation, you said
> I am guessing that when the system reboots boinc has lost track of which app was running in which slot and assign the first device it finds to the first suspended app.
If you'd written "... boinc has lost track of which task was running on which device ..." I'd have agreed with you. BOINC doesn't lose track of which task's files are in which slot. ID: 94233 ·

Joseph Stateson Volunteer tester Send message Joined: 27 Jun 08 Posts: 641	Message 94234 - Posted: 13 Dec 2019, 16:37:39 UTC - in response to Message 94233.
> Going back to the very beginning of this conversation, you said
>> I am guessing that when the system reboots boinc has lost track of which app was running in which slot and assign the first device it finds to the first suspended app.
> If you'd written "... boinc has lost track of which task was running on which device ..." I'd have agreed with you. BOINC doesn't lose track of which task's files are in which slot.
That was a bad choice of words, and I have done worse. Unlike the project, which seems not to give a hoot, I "own" my mistakes. ID: 94234 ·

Richard Haselgrove Volunteer tester Help desk expert Send message Joined: 5 Oct 06 Posts: 5103	Message 94235 - Posted: 13 Dec 2019, 16:50:34 UTC - in response to Message 94234. OK, so the real problem seems to be that GPUGrid's 'New version of ACEMD' app evaluates the hardware it's running on when it first starts, and then remembers it. If restart hardware doesn't match the original evaluation, it crashes. That means their new application is "not fit for BOINC". We need to convince them that the hardware evaluation has to be re-done from scratch when resuming from a pause, so that computation can continue. Is that a fair form of words? If so, we have to work out whether they "don't care", or "don't understand". I suspect it's the latter, for which the appropriate penalty is re-education. ID: 94235 ·

Richard Haselgrove Volunteer tester Help desk expert Send message Joined: 5 Oct 06 Posts: 5103	Message 94237 - Posted: 13 Dec 2019, 17:03:17 UTC Last modified: 13 Dec 2019, 17:06:16 UTC Now, here's a thought. All those hexadecimal hash files in the slot directory are actually plain text content, and they start
//
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-26218862
// Cuda compilation tools, release 10.1, V10.1.168
// Based on LLVM 3.4svn
//
.version 6.4
.target sm_52
.address_size 64
Just as a test, we could try deleting those for a paused task. My guess is that the app will re-compile them if it finds they're missing. And if the .target sm_52 is different on a different device, the binary compiler output might be different, and might run on the new hardware. Worth a punt? (edit - that target value is on my GTX 970. Is yours different, for a different card?) ID: 94237 ·
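
Finding those files by hand is tedious, so here is a hedged sketch that lists every file in a slot whose header looks like the one above and reports its .target value. It assumes the files start with the "Generated by NVIDIA NVVM Compiler" comment shown in the post. The function name ptx_targets is made up for illustration, and nothing is deleted.

```python
import re
from pathlib import Path

# Matches the ".target sm_NN" directive from the header shown above.
TARGET_RE = re.compile(r"\.target\s+(sm_\d+)")
MARKER = "Generated by NVIDIA NVVM Compiler"


def ptx_targets(slot_dir, header_bytes=4096):
    """Map file names in a slot directory to the .target value in their header.

    Only the first `header_bytes` bytes of each file are read, so large binary
    files such as checkpoints are cheap to skip.
    """
    targets = {}
    for path in sorted(Path(slot_dir).iterdir()):
        if not path.is_file():
            continue
        with open(path, "rb") as f:
            head = f.read(header_bytes).decode("ascii", errors="ignore")
        if MARKER not in head:
            continue
        match = TARGET_RE.search(head)
        if match:
            targets[path.name] = match.group(1)
    return targets
```

Comparing the output on two hosts with different cards would answer the question of whether the target value differs per card. Whether the app actually regenerates these files when they are missing is still the open guess from the post, not something this sketch can confirm.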

Joseph Stateson Volunteer tester Send message Joined: 27 Jun 08 Posts: 641	Message 94238 - Posted: 13 Dec 2019, 17:17:25 UTC - in response to Message 94235. Last modified: 13 Dec 2019, 17:53:26 UTC
> OK, so the real problem seems to be that GPUGrid's 'New version of ACEMD' app evaluates the hardware it's running on when it first starts, and then remembers it. If restart hardware doesn't match the original evaluation, it crashes. That means their new application is "not fit for BOINC". We need to convince them that the hardware evaluation has to be re-done from scratch when resuming from a pause, so that computation can continue. Is that a fair form of words? If so, we have to work out whether they "don't care", or "don't understand". I suspect it's the latter, for which the appropriate penalty is re-education.
Yeah, they open the checkpoint file and read in OpenCL values like "compute units: 28", but the board actually has only 14 "compute units", and their algorithm does not compensate for the change, so the process quickly dies. I don't think their code is public, and even if it was, I have had bad experiences compiling project code. My guess is that the easy fix is to delete the checkpoint file. Either way all is lost. [edit] The app is CUDA, not OpenCL, but the idea is the same: parameters in the checkpoint file are incompatible with the new gpu. ID: 94238 ·

Joseph Stateson Volunteer tester Send message Joined: 27 Jun 08 Posts: 641	Message 94239 - Posted: 13 Dec 2019, 17:25:32 UTC - in response to Message 94237. Last modified: 13 Dec 2019, 18:03:38 UTC
> Now, here's a thought. All those hexadecimal hash files in the slot directory are actually plain text content, and they start
> //
> // Generated by NVIDIA NVVM Compiler
> //
> // Compiler Build ID: CL-26218862
> // Cuda compilation tools, release 10.1, V10.1.168
> // Based on LLVM 3.4svn
> //
> .version 6.4
> .target sm_52
> .address_size 64
> Just as a test, we could try deleting those for a paused task. My guess is that the app will re-compile them if it finds they're missing. And if the .target sm_52 is different on a different device, the binary compiler output might be different, and might run on the new hardware. Worth a punt? (edit - that target value is on my GTX 970. Is yours different, for a different card?)
Just saw this. If the machine class is not sm_52 then deleting the checkpoint file will not help. In addition to the delete, the class needs to be changed as you mentioned, or it won't run on the card. I saw this problem on the new SETI app. The SETI app includes everything above sm_30, as 30 and below is not CUDA 5.0, and won't run on the older boards. Maybe the staff at gpugrid built the app for exactly one class of device: they used sm_52 just for those devices and did not include the libraries for sm_60 and higher like the SETI folks did. If they put all the libraries in then it would work. Maybe this is the problem and not the checkpoint file? [edit] Both that checkpoint and that header file have to match the gpu. I would assume the libraries for various classes of co-processors are embedded in the executable, as that makes configuration control easier (one app), but who knows. The SETI app has just about everything and is 229 MB in size. There is nothing that big in the gpugrid folder, but adding up all the DLLs gets high enough.
All my gpugrid tasks finished and the Einstein backup is at work. The slots were wiped clean of any gpugrid residuals. ID: 94239 ·

Richard Haselgrove Volunteer tester Help desk expert Send message Joined: 5 Oct 06 Posts: 5103	Message 94240 - Posted: 13 Dec 2019, 17:56:14 UTC The problem is that ACEMD spits out so many blasted files - I was trying to find the right ones. To me, a "checkpoint" file is written out (or added to) as the science progresses. I'd expect that device hardware enumeration would take place only once, at the start of the run - and the most likely candidate is the compilation stage. If we can prove that, we have something to offer the admins. ID: 94240 ·

Joseph Stateson Volunteer tester Send message Joined: 27 Jun 08 Posts: 641	Message 94241 - Posted: 13 Dec 2019, 18:05:11 UTC - in response to Message 94240. Last modified: 13 Dec 2019, 18:32:33 UTC
> The problem is that ACEMD spits out so many blasted files - I was trying to find the right ones. To me, a "checkpoint" file is written out (or added to) as the science progresses. I'd expect that device hardware enumeration would take place only once, at the start of the run - and the most likely candidate is the compilation stage. If we can prove that, we have something to offer the admins.
Sounds good! Hopefully I will get some gpugrid tasks in to look at. [EDIT] This link shows which boards handle which CUDA architecture: https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/ sm_52 is only good for the GTX 970 class or "below", although at some depth "below" will no longer be an option. Clearly, the file you have works only with the 970 board. Petri over at SETI just built an app that uses the latest 10.2 CUDA libraries and works with all boards CUDA 5 or later. Maybe they can hire Petri, or convince the SETI folks that gpugrid can help find ET. More accurately: Tbar built the executable and Petri coded up the app using new features in CUDA. ID: 94241 ·

Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.