Posts by Wedge009

1) Message boards : Questions and problems : BOINC 7.18.x and later: Computation error oddly specific to ROCm (Message 110044)
Posted 8 Oct 2022 by Wedge009
Post:
It turns out that disabling some of the systemd hardening is a work-around for this issue. I consider it only a work-around because it wasn't necessary for BOINC 7.16.17, and presumably the hardening is there for good reason.

https://github.com/BOINC/boinc/issues/4948
2) Message boards : Questions and problems : BOINC 7.18.x and later: Computation error oddly specific to ROCm (Message 109915)
Posted 30 Sep 2022 by Wedge009
Post:
I did some digging - initialize_ocl() seems to be a function in Einstein code, not BOINC. For whatever reason, though, newer BOINC versions trigger a failure in it. According to the source code for Einstein BRP (which may well be out of date), error code 2013 is defined in demod_binary.h as RADPUL_OCL_MEM_ALLOC_DEVICE. It's one of the error codes returned when clCreateCommandQueue(), an OpenCL function, fails. Error code -6 corresponds to CL_OUT_OF_HOST_MEMORY. That seems to be a common error code returned for a variety of reasons, so I suspect the host isn't really out of memory and this is just some weird interaction between potentially old Einstein code and new BOINC versions. Why and how newer BOINC versions are causing this is still a mystery to me, however.
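
For reading logs like this one, a small lookup from OpenCL status numbers to names can help. The sketch below is only an illustration: the function name opencl_error_name is hypothetical, the numeric values are those in the Khronos CL/cl.h header, and the table is deliberately incomplete.

```cpp
#include <string>

// Hypothetical helper: maps a few OpenCL status values to their names.
// Values follow the Khronos CL/cl.h header; the list is not complete.
std::string opencl_error_name(int code) {
    switch (code) {
        case 0:   return "CL_SUCCESS";
        case -1:  return "CL_DEVICE_NOT_FOUND";
        case -2:  return "CL_DEVICE_NOT_AVAILABLE";
        case -4:  return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
        case -5:  return "CL_OUT_OF_RESOURCES";
        case -6:  return "CL_OUT_OF_HOST_MEMORY";
        case -30: return "CL_INVALID_VALUE";
        case -33: return "CL_INVALID_DEVICE";
        case -34: return "CL_INVALID_CONTEXT";
        case -35: return "CL_INVALID_QUEUE_PROPERTIES";
        // Anything not listed is reported with its raw number.
        default:  return "unknown (" + std::to_string(code) + ")";
    }
}
```

Calling opencl_error_name(-6) gives the name matching the "error: -6" line in the task's stderr output.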
3) Message boards : Questions and problems : BOINC 7.18.x and later: Computation error oddly specific to ROCm (Message 109857)
Posted 22 Sep 2022 by Wedge009
Post:
This issue is oddly specific to AMD GPUs running ROCr-based OpenCL on Linux. It doesn't appear to be a problem for NV GPUs or AMD's legacy OpenCL on Linux, or for any Windows-based set-up. (AMD OpenCL support for Linux requires ROCm for Vega GPUs and later.)

When attempting to run GPU tasks for Einstein@Home, the task fails with 'computation error' within ~10 seconds. An example:
<message>
process exited with code 69 (0x45, -187)</message>
<stderr_txt>
09:31:22 (11580): [normal]: This Einstein@home App was built at: Jan 16 2017 08:09:16

09:31:22 (11580): [normal]: Start of BOINC application '../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.18_x86_64-pc-linux-gnu__FGRPopencl1K-ati'.
09:31:22 (11580): [debug]: 1e+16 fp, 5.9e+09 fp/s, 1785112 s, 495h51m52s17
command line: ../../projects/einstein.phys.uwm.edu/hsgamma_FGRPB1G_1.18_x86_64-pc-linux-gnu__FGRPopencl1K-ati --inputfile ../../projects/einstein.phys.uwm.edu/LATeah3012L12220912.dat --alpha 2.59819959601 --delta -0.694603692878 --skyRadius 1.890770e-06 --ldiBins 15 --f0start 836.0 --f0Band 8.0 --firstSkyPoint 0 --numSkyPoints 1 --f1dot -1e-13 --f1dotBand 1e-13 --df1dot 1.69860773e-15 --ephemdir ../../projects/einstein.phys.uwm.edu/JPLEPH --Tcoh 2097152.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.1 --reftime 56100 --model 0 --f0orbit 0.005 --mismatch 0.1 --demodbinary 1 --BinaryPointFile ../../projects/einstein.phys.uwm.edu/templates_LATeah3012L12220912_0844_11462382.dat --debug 0 --device 1 -o LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out
output files: 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out' '../../projects/einstein.phys.uwm.edu/LATeah3012L12220912_844.0_0_0.0_11462382_1_0' 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah3012L12220912_844.0_0_0.0_11462382_1_1'
09:31:22 (11580): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
09:31:22 (11580): [debug]: glibc version/release: 2.35/stable
09:31:22 (11580): [debug]: Set up communication with graphics process.
boinc_get_opencl_ids returned [0x1e97b40 , 0x7fabc0742d90]
Using OpenCL platform provided by: Advanced Micro Devices, Inc.
Using OpenCL device "gfx900:xnack-" by: Advanced Micro Devices, Inc.
Max allocation limit: 7287183768
Global mem size: 8573157376
Couldn't create OpenCL command queue (error: -6)!
OpenCL shutdown complete!
initialize_ocl returned error [2013]
OCL context null
OCL queue null
Error generating generic FFT context object [5]
09:31:22 (11580): [CRITICAL]: ERROR: MAIN() returned with error '5'
FPU status flags:
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out': No such file or directory
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out': No such file or directory
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out': No such file or directory
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out': No such file or directory
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out': No such file or directory
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out': No such file or directory
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out': No such file or directory
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out': No such file or directory
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out.cohfu': No such file or directory
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out.cohfu': No such file or directory
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out.cohfu': No such file or directory
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out.cohfu': No such file or directory
mv: cannot stat 'LATeah3012L12220912_844.0_0_0.0_11462382_1_0.out.cohfu': No such file or directory
09:31:34 (11580): [normal]: done. calling boinc_finish(69).
09:31:34 (11580): called boinc_finish

</stderr_txt>
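
The three numbers in the exit message above are consistent with one 8-bit status shown three ways: 69 in decimal, 0x45 in hex, and 69 - 256 = -187 (the same byte with all higher bits set). That last reading is an inference from the numbers, not something confirmed from the BOINC source. A minimal sketch of the arithmetic, with hypothetical function names:

```cpp
// Low 8 bits of a process status; 0x45 and 69 are the same value.
constexpr int exit_code_low_byte(int status) {
    return status & 0xff;
}

// The same byte read as if every bit above the low 8 were set,
// which is equivalent to subtracting 256 from the low byte.
constexpr int exit_code_sign_filled(int status) {
    return (status & 0xff) - 256;
}
```

With a status of 69, the first function returns 69 and the second returns -187, matching "code 69 (0x45, -187)".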

This issue appears to be specific to BOINC: I can confirm it occurs with BOINC 7.18.1 and 7.20.2 but not with 7.16.17. All other hardware and software remains the same - even the same desktop session (ie no rebooting between BOINC installations). I wonder if it's a permissions issue, because of all the 'No such file or directory' messages - is there a change in how BOINC runs GPU tasks between 7.16.x and 7.18.x that ROCm might be sensitive to?

Here are some of my Linux hosts:
Ubuntu 20.04, ROCr-based OpenCL, can only run successfully up to BOINC 7.16.17:
https://einsteinathome.org/host/12803029

Ubuntu 22.04 (issue also occurs on 20.04), ROCr-based OpenCL, can only run successfully up to BOINC 7.16.17:
https://einsteinathome.org/host/12918837

Ubuntu 22.04, legacy OpenCL, running just fine with BOINC 7.20.2:
https://einsteinathome.org/host/12887570

On the other hand, I've found a host that's using BOINC 7.18.1 and appears to be running AMD GPU tasks fine, but I can't tell what the amdgpu set-up is. (I've attempted to contact the owner before but never got an answer.)
https://einsteinathome.org/host/12941414
4) Message boards : GPUs : BOINC 7.10.2 - Windows 7 - OpenCL GPU Detection (Message 86522)
Posted 10 Jun 2018 by Wedge009
Post:
For future reference: for whatever reason, in my particular dual-GPU set-up, a 64-bit compilation fails in COPROCS::get_opencl(), gpu_opencl.cpp:343. After finding two OpenCL GPU platform IDs, it fails when attempting to find the AMD GPU Device ID:

        ciErrNum = (*p_clGetDeviceIDs)(
            platforms[platform_index],
            (CL_DEVICE_TYPE_GPU | CL_DEVICE_TYPE_ACCELERATOR),
            MAX_COPROC_INSTANCES, devices, &num_devices
        );

        if (ciErrNum == CL_DEVICE_NOT_FOUND) continue;  // No devices

It doesn't seem like it should be necessary, but I'll stick with 32-bit compilation as a work-around.
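
Because the quoted loop simply continues on CL_DEVICE_NOT_FOUND, a platform that reports no devices leaves no trace in the output. The sketch below shows one way a stand-alone debugging tool could record every platform's status instead. It is not BOINC code: PlatformReport, DeviceQuery and probe_platforms are hypothetical names, and the query callback only mimics the shape of a clGetDeviceIDs call so the example stays self-contained.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; not BOINC or OpenCL types.
constexpr int kSuccess = 0;          // same value as CL_SUCCESS
constexpr int kDeviceNotFound = -1;  // same value as CL_DEVICE_NOT_FOUND

struct PlatformReport {
    std::string platform;   // platform name used for the query
    int status;             // status code returned by the query
    unsigned device_count;  // devices found; 0 if the query failed
};

// query(platform, count) mimics a clGetDeviceIDs call: it returns a
// status code and writes the number of devices found into count.
using DeviceQuery = std::function<int(const std::string&, unsigned&)>;

// Record every platform, including those that report no devices, so a
// missing GPU is visible in the results rather than silently skipped.
std::vector<PlatformReport> probe_platforms(
    const std::vector<std::string>& platforms, const DeviceQuery& query) {
    std::vector<PlatformReport> reports;
    for (const auto& name : platforms) {
        unsigned count = 0;
        const int status = query(name, count);
        if (status != kSuccess) count = 0;  // ignore counts from failed calls
        reports.push_back({name, status, count});
    }
    return reports;
}
```

Printing each report's status next to its platform name would show directly whether the AMD platform returned CL_DEVICE_NOT_FOUND or some other error in the 64-bit build.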
5) Message boards : GPUs : BOINC 7.10.2 - Windows 7 - OpenCL GPU Detection (Message 86521)
Posted 10 Jun 2018 by Wedge009
Post:
Okay, totally bizarre but I've found why my compilation didn't match the BOINC release - my compilations were 32-bit only while the BOINC release I was using was the 64-bit version. I just tried a 32-bit release and the GPUs are being detected okay.

Might anyone know why 64-bit BOINC can't detect Vega properly (apparently)?
6) Message boards : GPUs : BOINC 7.10.2 - Windows 7 - OpenCL GPU Detection (Message 86520)
Posted 10 Jun 2018 by Wedge009
Post:
So I managed to compile the boinc client from branch client_release/7/7.10.2 (commit f6033b09) and it's detecting the Vega while the BOINC 7.10.2 binary release does not. I'm getting quite confused by now. Does anyone have any information on who does the compilation for the releases on boinc.berkeley.edu?
7) Message boards : GPUs : BOINC 7.10.2 - Windows 7 - OpenCL GPU Detection (Message 86512)
Posted 9 Jun 2018 by Wedge009
Post:
Thanks for the quick responses - I didn't have time to go into too much detail earlier.

I don't recall ever having an issue with GPU detection in Windows, only in Linux and that was years ago - I think the driver situation has improved much since then.

Under BOINC I've run NV Pascal, Maxwell, Kepler, and Fermi of all sizes without issue in Windows (OpenCL as well as CUDA), and AMD Fiji, Hawaii, Bonaire, Tahiti, Pitcairn, and even pre-GCN Cayman, Barts, RV730 and RV630. I don't remember BOINC having issues detecting any of those under Windows. In my current case, GPU1 is NV Pascal and GPU2 is AMD Vega. Vega has been around for nearly a year now, so I'd be surprised if there's something wrong with the driver side - as I said I can run OpenCL applications off-line apart from BOINC just fine, plus my cobbled-together stand-alone BOINC GPU detection could read both the Pascal and the Vega.

That pre-compiled clinfo reproduced the curious scenario I described earlier - the May 2018 clinfo that was put into my Windows/System32 directory (I think that's from the AMD installation) only picks up the NV Pascal and the CPU. The 2011 copy you linked to picks up those two as well as the AMD Vega.

I also did a quick test under Linux with AMD Fiji and AMD Vega together. Only the Fiji was detected by BOINC so I doubt mixed-vendor set-up is a relevant concern. Really puzzled why BOINC is having difficulty with Vega when there is more than one GPU involved.
8) Message boards : GPUs : BOINC 7.10.2 - Windows 7 - OpenCL GPU Detection (Message 86503)
Posted 9 Jun 2018 by Wedge009
Post:
I have been running BOINC for years and am currently in the middle of upgrading a dual-GPU system. Quick summary:

I have two GPUs - GPU1 and GPU2. Each is detected just fine on its own. GPU1 is also detected fine together with any other GPU. GPU2, however, isn't detected in combination with GPU1 (I have swapped physical PCIe slots with the same result).

coproc_debug doesn't give much info. I pulled the OpenCL detection routines (gpu_opencl.cpp) from BOINC into a stand-alone application for debugging purposes (I didn't have the time to compile and debug the entire BOINC application). The crazy thing is that even with the hardware set-up that BOINC doesn't detect correctly, my stand-alone application does.

I'm running the current releases of the NV and AMD drivers. Another curiosity is that the current clinfo doesn't appear to detect GPU2, but an older version of clinfo does. Both GPUs concerned here are current generation, current architectures.

I have to rush off now but I'll provide more details later - just wondering if anyone has any ideas/thoughts on this because I'm starting to run out of them.

Edit: I'll add that stand-alone OpenCL applications work just fine as well. BOINC just isn't detecting the GPU for some reason.



