Message boards : GPUs : Specifications for NVidia RTX 30x0 range?
Author | Message |
---|---|
Joined: 29 Aug 05 Posts: 15569 |
Errors may also be due to cheap components causing internal corruption in the 3080s made by third-party manufacturers. |
Joined: 17 Nov 16 Posts: 891 |
The error at GPUGrid has nothing to do with card hardware. The problem is that the apps don't recognise the new architecture and its Compute Capability of SM_8.6, which they reject as "out of range" when the app is runtime-compiled by the nvrtc module in the drivers. |
Joined: 5 Oct 06 Posts: 5130 |
Agreed. They've used a funny sort of CUDA app development which requires explicit prior knowledge of the card characteristics. The exact error message (on an A100, a cc8.0 datacentre GPU) is: "# Engine failed: Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)" |
Joined: 17 Nov 16 Posts: 891 |
Maybe they are waiting on the CUDA 11.1 drivers to be made available in the distros and PPAs. From a Phoronix news article: "CUDA 11.1 also brings a new PTX compiler static library, version 7.1 of the Parallel Thread Execution (PTX) ISA, support for Fedora 32 and Debian 10.3, new unified programming models, hardware-accelerated sparse texture support, multi-threaded launch to different CUDA streams, improvements to CUDA Graphs, and various other enhancements. GCC 10.0 and Clang 10.0 are also now supported as host compilers." That PTX module seems interesting. I wonder if that will allow apps to make use of the dormant FP32 pipeline. [Edit] From the CUDA 11.1 Ampere docs: "Devices of compute capability 8.6 have 2x more FP32 operations per cycle per SM than devices of compute capability 8.0. While a binary compiled for 8.0 will run as is on 8.6, it is recommended to compile explicitly for 8.6 to benefit from the increased FP32 throughput." |
Joined: 8 Nov 19 Posts: 718 |
"How do you want to set that experiment up? What parameters are you looking for?" "This is just for baseline BOINC users, not fancy optimisers. Ideally a single 30x0 card, in a host with plenty of power and cooling (so nothing gets throttled). Run a known - preferably CUDA - app for long enough to get a good idea of performance. Slap in an app_config.xml file with <gpu_usage>.5</gpu_usage>, and record what happens." On my 2080Ti I run some Einstein@home WUs at 0.333. I can imagine that the 3080 would be able to run them at 0.25. However, when doing so, I'd be interested in what the minimum PCIe bandwidth should be. PCIe 3.0 x8 'should' be fine, but I'd be interested in someone testing x16 vs x8 vs x4 on those cards... |
Joined: 24 Dec 19 Posts: 229 |
As an added data point: peak_flops for my 3070 in BOINC gets detected as roughly 10 TFLOPS, when it's really about 20 TFLOPS. It's definitely under-reporting by half, due to the change in cores/SM. |
Joined: 31 Jan 22 Posts: 1 |
Hi! I'm not an expert on the configuration of BOINC projects, nor on the architecture. However, I have both a 3080 and a Titan Volta, and the 3080 has almost the same efficiency as the Titan Volta... and BOINC doesn't report the different cards under "coprocessors"; it just says two 3080s. https://einsteinathome.org/es/host/12916614 Of course, I have several questions about a more efficient configuration. Regards. |
Joined: 24 Dec 19 Posts: 229 |
Hi! Yes, the Titan V and Ampere (GDDR6X versions) have about the same efficiency (performance per watt) on Einstein. This is mainly due to memory performance. The Titan V has HBM2 memory, which has very low latency, and the GDDR6X cards can achieve close to the same performance due to raw speed (19+ Gbps). I think the Titan V is a little more power efficient, but the faster 3080Ti and 3090 models are more productive overall, at the cost of more power. About the host reporting two 3080s: this is an idiosyncrasy in how BOINC works. It will decide the “best” GPU you have, and for Nvidia cards, the strongest determinant of “best” is the compute capability. The Ampere cards have CC 8.6 and the Volta card has CC 7.0. So the system chooses the Ampere card to display, and appends that you have 2 total Nvidia GPUs. With the current LAT3000 series tasks being distributed, you’ll probably find max production by running 3-4 tasks at a time on each GPU. However, if the project goes back to crunching LAT4000 series tasks, 1x task per GPU will be best. Just watch what tasks are being distributed and adapt as necessary. |
Joined: 8 Nov 19 Posts: 718 |
"...the GA102 (and above, but not the A100) benefit from both an increase from 64 to 128 cores per SM, and the ability to process two FP32 streams concurrently." It was my understanding that the RTX 3060 has CUDA cores (shaders) operating at either 32-bit INT or 32-bit FP. It's not 2 streams concurrently; it's either/or. 32-bit INT works faster if you don't need numbers that are as precise. 32-bit FP is more precise, and a bit slower (and less hardware supports it). It's funny, because it reminds me of my audio modelling days. I'd run samples and effects, and there was this effect that used a 32-bit INT reverb, and the reverb sounded a bit more metallic. Meanwhile 32-bit float sounded like a 'perfect' reverb. So the human ear was able to differentiate between the two, much like the eye can see the difference between 32-bit INT (255 values of RGB) and 32-bit float (255 values of RGB, and 64 bit per pixel shaded). It's probably close to the ear's and eye's maximum perceivable range of colors and sounds, which is why I never understood why some digital stomp box pedals were sold with 24-bit reverbs. They sound like trash. Things like 3D polygons run just fine on INT. Not sure if 32-bit float would work in a 3D game environment. Anyway, that's off topic. |
Joined: 24 Dec 19 Posts: 229 |
"...the GA102 (and above, but not the A100) benefit from both an increase from 64 to 128 cores per SM, and the ability to process two FP32 streams concurrently." This is an incorrect understanding. Both Turing and Ampere have concurrent FP32/INT processing. Page 11 of the Turing whitepaper (source): "Turing implements a major revamping of the core execution datapaths. Modern shader ..." Ampere added onto this by making that second data path FP32-capable as well. It's two streams concurrently: one is FP32, the other is either FP32 or INT32. |
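
For anyone wanting to check the peak_flops point from earlier in the thread (roughly 10 TFLOPS detected on a 3070 versus about 20 TFLOPS real), here is a minimal Python sketch of the arithmetic. It is not BOINC's actual detection code. It assumes, purely for illustration, that a client which doesn't know compute capability 8.6 falls back to 64 cores per SM. The 46 SMs and 1.725 GHz boost clock are NVIDIA's reference figures for the RTX 3070, and the function and table names are made up for this example.

```python
# Rough peak-FP32 estimate: a sketch of the arithmetic only, not BOINC's code.

# FP32 cores per SM, keyed by compute capability (major, minor).
CORES_PER_SM = {
    (7, 0): 64,   # Volta (e.g. Titan V)
    (7, 5): 64,   # Turing
    (8, 0): 64,   # A100
    (8, 6): 128,  # consumer Ampere (GA10x)
}


def peak_fp32_tflops(sm_count, clock_ghz, cores_per_sm):
    """Peak FP32 TFLOPS, counting one fused multiply-add as 2 operations."""
    return sm_count * cores_per_sm * 2 * clock_ghz / 1000.0


# RTX 3070 reference figures (boards with a factory overclock will differ).
SM_COUNT = 46
BOOST_GHZ = 1.725

# Using the real cores/SM for compute capability 8.6.
actual = peak_fp32_tflops(SM_COUNT, BOOST_GHZ, CORES_PER_SM[(8, 6)])

# Assumption for illustration: a client unaware of cc 8.6 uses 64 cores/SM.
assumed_64 = peak_fp32_tflops(SM_COUNT, BOOST_GHZ, 64)

print(f"128 cores/SM: {actual:.1f} TFLOPS")
print(f" 64 cores/SM: {assumed_64:.1f} TFLOPS")
```

With those figures the two estimates come out at about 20.3 and 10.2 TFLOPS, which matches the factor-of-two gap reported above.
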
Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.