PCI express risers to use multiple GPUs on one motherboard - not detecting card?

Message boards : GPUs : PCI express risers to use multiple GPUs on one motherboard - not detecting card?

Previous · 1 · 2 · 3 · 4 · 5 · 6 · 7 · 8 . . . 20 · Next

Ian&Steve C.

Joined: 24 Dec 19
Posts: 156
United States
Message 95280 - Posted: 18 Jan 2020, 19:29:32 UTC - in response to Message 95278.  

I think there's a limit with USB hubs, and with the old non-switched network hubs you could only chain about 4 in a row. I assume a similar limit may exist with these splitters. Or perhaps a splitter will not tolerate its CPU-side being already split and having to wait. I guess I'll eventually find out. But still, a modern board has around 7 PCI Express slots of some kind, so putting one splitter on each gives 28 GPUs - and I can't physically fit any more than that around one machine!

P.S. That block diagram is incorrect. The board does not have 5 PCI slots; it has 2. It also only has 2 PATA ports, not 4.


You still don't understand.

You can't just add more splitters. The motherboard can't keep track of all the memory addresses in that many GPUs! Once you reach the limit of your motherboard's ability to map the memory, it will simply fail to boot. Just because you can "plug it in" does NOT mean that it will work. I urge you to do some more research on this. I've done multi-GPU setups more than probably anyone else on this board, and have set up several multi-GPU systems for BOINC processing as well as mining.

The block diagram is the diagram for the chipset on your motherboard, the Nvidia nForce 750i SLI chipset. It represents what is "possible" in a full load-out. Not all manufacturers use all available I/O when building boards with varying features, but it shows what resources are available in your topology. You can't have more than is shown, but you can have less, and it correctly reflects the features available on your board. The point being that your x1 slots, without a doubt, run at PCIe 1.0 speeds.

Additionally, since no one here seems to know how to check PCIe bus utilization with an in-progress WU, I checked on my system so I can answer for you. I run Ubuntu 18.04.3 and used a program called 'gmonitor', which is similar to htop and gives graphical and numerical info in a terminal window. Under Windows, I would assume a program like GPU-Z or MSI Afterburner might show you bus utilization, but my crunchers don't run Windows, so you'll have to find a program yourself.

Project: SETI@home
Application: CUDA 10.2 special application (highly optimized)
WU type: BLC61 vlar
GPU: GTX 1650 4GB
GPU link speed/width: PCIe 3.0 x16 (16,000 MB/s max)

For about the first half of the WU's progress, PCIe bus utilization showed about 4-5% (640-800 MB/s), and the remaining half ran at about 1-2% PCIe bus utilization.

Since the tool's resolution is only in whole percents, I moved the GPU to a PCIe 3.0 x4 (4,000 MB/s max) slot for more resolution. There I saw a relatively expected 14-16% (560-640 MB/s) PCIe bus utilization on the first half of the WU, and about 5% on the second half.
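The conversion from the tool's whole-percent readings to MB/s is just a proportion of the slot's maximum rate. A small Python sketch using the figures above (the helper name is mine, not from gmonitor):

```python
def pcie_throughput_mb_s(utilization_pct, link_max_mb_s):
    """Approximate bus throughput implied by a utilization percentage."""
    return utilization_pct / 100.0 * link_max_mb_s

# PCIe 3.0 x16 slot (16,000 MB/s max) at 4-5% utilization:
print(pcie_throughput_mb_s(4, 16000), pcie_throughput_mb_s(5, 16000))  # 640.0 800.0

# PCIe 3.0 x4 slot (4,000 MB/s max) at 14-16% utilization:
print(pcie_throughput_mb_s(14, 4000), pcie_throughput_mb_s(16, 4000))  # 560.0 640.0
```

Note that the same GPU workload produces roughly the same absolute MB/s in both slots; only the percentage changes with the link's maximum.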

This confirms my previous experience comparing run times of SETI WUs on PCIe 2.0 x1 and PCIe 3.0 x1: no slowdown on a PCIe 3.0 x1 link, since it's not being bottlenecked, but a small slowdown on the 2.0 link, since it's being slightly bottlenecked (2.0 x1 is limited to 500 MB/s) for about half of the WU run. There's not just a small amount of data passed at the beginning and end of the WU; there can be a constant flow of data across the whole WU. And SETI WUs are tiny compared to some projects.
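The per-generation x1 limits behind this comparison can be sketched as follows. These are the rule-of-thumb per-lane figures used in this thread (250/500/1000 MB/s for PCIe 1.0/2.0/3.0); real usable throughput is slightly lower after protocol overhead:

```python
# Rule-of-thumb per-lane bandwidth in MB/s, keyed by PCIe generation.
PER_LANE_MB_S = {1.0: 250, 2.0: 500, 3.0: 1000}

def link_bandwidth(gen, lanes):
    """Approximate max bandwidth of a PCIe link of the given generation/width."""
    return PER_LANE_MB_S[gen] * lanes

def bottlenecked(demand_mb_s, gen, lanes):
    """True if a workload's measured demand exceeds what the link can carry."""
    return demand_mb_s > link_bandwidth(gen, lanes)

# The ~640 MB/s measured in the first half of a WU:
print(bottlenecked(640, 3.0, 1))  # False - 3.0 x1 (1000 MB/s) keeps up
print(bottlenecked(640, 2.0, 1))  # True  - 2.0 x1 (500 MB/s) falls slightly short
print(bottlenecked(640, 1.0, 1))  # True  - 1.0 x1 (250 MB/s) falls well short
```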

This is just an example of SETI, as that's the only project I run. You need to do your own independent testing on the projects you run.
Profile Joseph Stateson
Volunteer tester

Joined: 27 Jun 08
Posts: 551
United States
Message 95281 - Posted: 18 Jan 2020, 19:30:20 UTC - in response to Message 95275.  
Last modified: 18 Jan 2020, 19:31:55 UTC

Also, why do people bother mining? I've tried it on GPUs and ASICs, it just is not profitable. The electricity cost is approximately twice the coins you earn.


I have been "mining" since classic SETI, though it was not called that back in 1999. Three (?) years ago I quit the Texas A&M club and joined the Gridcoin club. At the time I joined, a single GRC was just under a quarter USD, as I recall. If it had risen to a full quarter I would have 61,000 * 0.25 = $15,250. Unfortunately it is currently worth less than 1/4 cent; I will let you do the math. The conclusion of this exercise is that I get a small return of something more valuable than just mining for "credits".
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 4498
United Kingdom
Message 95282 - Posted: 18 Jan 2020, 19:43:16 UTC - in response to Message 95280.  

additionally since no one here seems to know how to check PCIe bus utilization with an in-progress WU. I checked on my system so I can answer for you. I run Linux Ubuntu 18.04.3 and I used a program called 'gmonitor' which is similar to htop and gives graphical and numerical info in a terminal window. Under Windows, I would assume that a program like GPUz or MSI afterburner might show you a bus utilization. but my crunchers do not run windows so you'll have to find a program yourself.
Windows 10 (later releases only) has an enhanced Task Manager, with GPU monitoring on the Performance tab.

The GPU monitor can visualise many different performance metrics (11 on the Intel iGPU I've just checked), but only has 4 mini-graphs to display them. The components used for distributed computing aren't among the default four shown at startup - you have to search through the drop-down lists to find the applicable one(s).
Ian&Steve C.

Joined: 24 Dec 19
Posts: 156
United States
Message 95286 - Posted: 18 Jan 2020, 20:04:33 UTC - in response to Message 95283.  

How exactly is PCIe 1.0 x1 "almost enough"? 250 MB/s is less than half of what was being measured. This can and will cause big slowdowns. I have measured it and saw a repeatable difference when switching between PCIe 2.0 and 3.0, which have 2x and 4x the bandwidth of PCIe 1.0 respectively. In my opinion, if PCIe 2.0 x1 isn't enough for SETI, then PCIe 1.0 x1 certainly isn't "almost enough".

I do not run more than 1 WU at a time (for processing), as this application is optimized to the point that 1 alone maxes out the card as much as it can: 100% GPU utilization and 100% memory bus utilization. Running 2 just causes the WUs to take twice as long, which is not really helpful.
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 4498
United Kingdom
Message 95290 - Posted: 18 Jan 2020, 20:45:37 UTC - in response to Message 95289.  

Just how big are those SETI tasks? Surely you can't need half a GB to be transferred every second, that would severely max out your internet connection giving the result back. Can't the memory on the GPU be used during the task? Or is the CPU doing a lot of assistance and needs to communicate with the GPU regularly?
GPUs - taking NVidia CUDA as an example - are limited to 5 seconds runtime per program launch, with a tighter TDR limit of 2 seconds under Windows. You will have noticed that SETI tasks take longer than this...

The whole art (and it is an art) of GPU programming is to break the original task into a myriad of tasklets, or kernels - as many as possible of which are running in parallel. The GPU itself doesn't have the decision-making hardware to manage this: the CPU is responsible for ensuring that each new kernel - and its associated data - is present and correct at the precise millisecond (or is that microsecond?) that the shader becomes available to process it. You can't just measure the data flow as a one-time load: you have to know the average memory used by each kernel, and the number of kernel launches needed to process the whole meta-task. I can't begin to estimate those numbers.
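That accounting can be illustrated with a toy calculation. All numbers here are invented purely for illustration; the point is only that total bus traffic scales with the number of kernel launches, not with the size of the downloaded task file:

```python
def total_bus_traffic_mb(n_kernel_launches, mb_moved_per_launch):
    """Total PCIe traffic over a task's lifetime, summed across all launches."""
    return n_kernel_launches * mb_moved_per_launch

# Even tiny per-launch transfers add up over hundreds of thousands of launches:
print(total_bus_traffic_mb(500_000, 0.1))  # 50000.0 MB over the whole task,
                                           # far more than the input file size
```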
Ian&Steve C.

Joined: 24 Dec 19
Posts: 156
United States
Message 95291 - Posted: 18 Jan 2020, 20:45:53 UTC - in response to Message 95288.  

how exactly is PCIe 1.0 x1 "almost enough"? 250MB/s is less than half of what was being measured.


Only for half your work unit. If you run more than one unit on each card, then you can get it averaged out.

I do not run more than 1 WU at a time (for processing), as this application is optimized to the point that 1 alone will max out the card as much as it can. 100% GPU utilization and 100% memory bus utilization. running 2 just causes the WUs to take 2x as long, not really helpful.


It would be if you had an uneven bus bottleneck. I do it because I have an uneven CPU bottleneck. Both Einstein and Milkyway sometimes do CPU processing, which holds up the GPU, so with two tasks loaded, that virtually never happens.


I'm not trying to be rude, but frankly you don't know what you're talking about. There is no "uneven bus bottleneck"; that doesn't even make sense.

Running 2 WUs will not help when the PCIe bus is already maxed; there is only a finite amount of bandwidth available. If running 1 WU uses ~600 MB/s, 2 with no restriction will want 1,200 MB/s. Shoving 2 WUs at a time doesn't just increase load on the GPU, it increases load on the PCIe bus as well. So you'd have 2 WUs fighting over only 250 MB/s, slowing things down even FURTHER.
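The arithmetic here can be sketched in a few lines (the 600 MB/s demand figure comes from the measurements earlier in the thread; the helper is just illustrative):

```python
def per_task_bandwidth(link_mb_s, n_tasks):
    """Bandwidth each task gets when n_tasks share one saturated link equally."""
    return link_mb_s / n_tasks

demand_per_task = 600               # MB/s one unrestricted WU wants
share = per_task_bandwidth(250, 2)  # two WUs on a PCIe 1.0 x1 link
print(share)                        # 125.0 MB/s each
print(share / demand_per_task)      # ~0.21 - each WU gets roughly a fifth to a
                                    # quarter of the bandwidth it wants
```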

This application is highly optimized, and I can't stress that enough; it's rather incredible. Processing times are about 4x faster than the optimized OpenCL app over at SETI. Additionally, I run certain command line parameters (specific to this application) which enable higher CPU support to the GPU: each GPU WU uses 100% of a CPU thread during the entire run. This increases processing speed further and makes processing times more consistent, keeping GPU utilization at 100%. So my 7-GPU system running on an 8-thread CPU uses almost all of the CPU for GPU support alone; the 1650 system runs on a 4-thread CPU and uses about 25% CPU, etc.
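For anyone wanting the BOINC client's scheduler to account for a full CPU thread per GPU task, the standard mechanism is an app_config.xml in the project's directory. This is only a sketch: the app name below is a placeholder, and you'd substitute the short name your project actually reports (check the client event log or the project's applications page):

```xml
<!-- app_config.xml, placed in the project's directory under the BOINC data
     folder. "setiathome_v8" is a placeholder application name. -->
<app_config>
  <app>
    <name>setiathome_v8</name>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>  <!-- run one task per GPU -->
      <cpu_usage>1.0</cpu_usage>  <!-- budget a full CPU thread per GPU task -->
    </gpu_versions>
  </app>
</app_config>
```

Note this only tells the scheduler what the task consumes; any application-specific command line tuning is separate.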

The WU files downloaded from the project are not that large - maybe about 720 KB. But once extracted to GPU memory, each WU uses about 1.3 GB of GPU memory space.

If you truly aren't seeing a slowdown on the 1.0 x1 slot, then great. But I also suspect you could apply some optimizations (you'd have to inquire on the Einstein or MW forums whether there are any) to make those jobs run faster, at the expense of using more resources.
robsmith
Volunteer tester
Help desk expert

Joined: 25 May 09
Posts: 964
United Kingdom
Message 95294 - Posted: 18 Jan 2020, 21:45:37 UTC

Picking up from what Richard said - the SETI CUDA application has very efficient task slicing and management, which reduces the demands on the bus. However, the SETI SoG application is very much more demanding on the bus; in part this is down to the technology used (CUDA vs. OpenCL). So when considering the performance required of the bus, one has to be aware of what the application's needs are.

If I&S thinks he's seen some high GPU count computers - how about 256 Quadro rtx 800s in one computer, each one is sitting on its own x16 PCI link, to keep that lot fed and under control there are a whole host of Xeons, ARMs and FPGAs, not to mention the cooling system that shifts over 100kW of heat away (and no doubt warms a fair bit of the rest of the building....)
Ian&Steve C.

Joined: 24 Dec 19
Posts: 156
United States
Message 95296 - Posted: 18 Jan 2020, 22:05:16 UTC - in response to Message 95293.  

You forgot to divide by two. If a GPU runs one SETI task and needs 600MB/sec, running two tasks will still need 600MB/s, as each task is running at half the speed. Now consider the second half of the task only needing 150MB/sec. If one task is in the first half and one is in the second half, the data transfer is the average of the two, 375MB/sec. Or another way to work it out, task 1 in the first half of computation is needing half of 600MB/sec, as it only has half a GPU to work with and runs at half speed. Task 2 in the second half of computation is needing half of 150MB/sec. So 600/2+150/2=375.


That's not how this works, and you're forgetting that I said "at no restriction", meaning when there is more than enough bandwidth. Running 2 tasks on a 250 MB/s link effectively leaves each WU with 125 MB/s, roughly 1/4 of the bandwidth it would otherwise need to run at full speed. You also aren't realizing (and that's not entirely your fault - I didn't mention it) that 50% task completion is NOT 50% of the run time. I was basing this on the completion percentage as reported by BOINC. On these tasks, with this app, the WU % processing rate increases almost exponentially from start to finish: it does the first 50% of completion in about 75-80% of the total run time. So it's actually most of the TIME that requires the increased bandwidth.
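The difference between weighting by completion percentage and weighting by time can be shown with a small sketch (the 600 and 150 MB/s phase demands are the round figures used in the exchange above; the 77.5% time share is the midpoint of the 75-80% quoted):

```python
def time_weighted_demand(phase_time_pcts, phase_demands_mb_s):
    """Average bandwidth demand, weighting each phase by its share of run TIME."""
    return sum(t * d for t, d in zip(phase_time_pcts, phase_demands_mb_s)) / 100

# Naive assumption: each half of *completion* is half the *time*:
naive = time_weighted_demand([50, 50], [600, 150])
# Reality per the post: the first 50% of completion takes ~77.5% of the time:
actual = time_weighted_demand([77.5, 22.5], [600, 150])

print(naive, actual)  # 375.0 498.75 - the true average sits much closer
                      # to the high-demand phase than the naive split suggests
```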

So how come somebody wrote this optimisation only for one card? It would be great if he could do it for others, he could make a lot of money.


It's not for one card. This SETI application works on any Nvidia GPU with compute capability 5.0 (Maxwell generation, GTX 900+) or greater, running Linux; that's a lot of different cards. No one is making money from this: it was developed by volunteers over at SETI and given to the community totally free. This application will not work for anything other than SETI. The same algorithms used for SETI can't be used for other projects, which are doing totally different types of calculations.


Do you know if you can use longer USB cables? I don't want to bother buying long USB 3 A-A cables if it's known they won't work. They're not the kind of thing that could be used elsewhere.


You probably can use longer ones; I think I've used 2-3 ft ones before. But YMMV - shorter is better for signal integrity. Since good quality USB 3.0 cables can handle PCIe 3.0 signaling, 2.0 and 1.0 should be rather easy to handle.

I shall ask. Although your SETI one is the only one I've ever heard of. What's the name of it so I can refer to it so they know what I'm talking about?

The SETI app runs CUDA; most other projects run OpenCL as far as I know, so it's probably not helpful to mention it. I would just ask whether there are any command line optimizations that can be applied to the applications at those projects, usually set in a config file or a text file. They will be specific to your projects and the applications they use.
robsmith
Volunteer tester
Help desk expert

Joined: 25 May 09
Posts: 964
United Kingdom
Message 95297 - Posted: 18 Jan 2020, 22:19:03 UTC - in response to Message 95295.  

Well, there are no publishable pictures of the complete beast. Indeed the complete beast is very boring to look at, just a large box with power, data and cooling connections.
Initially a small system was tested using RTX 2080s to give an idea of what feeders were going to be needed. The next tests were with earlier Quadros, which left the RTX 2080 behind; after six months (and some mods to the cooling) the RTX 8000s were installed, and they are a step up again. The trouble with benchmarks and specs is that they don't always reflect what happens in real life under very high stress.

Being a totally air-cooled system, the GPUs were obtained without their fans, etc.; blast air at ~4C keeps everything in check.

But we digress.
Ian&Steve C.

Joined: 24 Dec 19
Posts: 156
United States
Message 95298 - Posted: 18 Jan 2020, 22:19:46 UTC - in response to Message 95294.  
Last modified: 18 Jan 2020, 22:24:42 UTC



If I&S thinks he's seen some high GPU count computers - how about 256 Quadro rtx 800s in one computer, each one is sitting on its own x16 PCI link, to keep that lot fed and under control there are a whole host of Xeons, ARMs and FPGAs, not to mention the cooling system that shifts over 100kW of heat away (and no doubt warms a fair bit of the rest of the building....)


Let me know when something like that is built by one person using their hobby money, lol. A multi-million <insert currency> supercomputer built by a team of engineers from one or more companies for a very specific use case is a bit outside the scope of this topic, I think. Not to mention that a system like that couldn't even be recognized by BOINC as a single host anyway (hard cap of 64 GPUs per host). "In one rack" or "in one room" might be more appropriate than "in one computer", and the resources would have to be split up virtually to ever run these BOINC projects.

Copyright © 2021 University of California. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.