PCI express risers to use multiple GPUs on one motherboard - not detecting card?

Peter Hucker
Joined: 6 Oct 06
Posts: 1144
United Kingdom
Message 95278 - Posted: 18 Jan 2020, 18:51:48 UTC - in response to Message 95276.  
Last modified: 18 Jan 2020, 18:53:33 UTC

Look at the information I posted in my previous comment, and look at the block diagram for your board that I also posted.

You have 2x PCIe 2.0 x16 slots and 2x PCIe 1.0 x1 slots. When the documentation "doesn't list" the generation and only says PCIe, that means it's 1.0.

You will never be able to "infinitely" daisy-chain those splitters. Even if you didn't run into communication issues over the PCIe bus, the motherboard will not be able to map all of the VRAM; such old boards just can't handle it. "Lane sharing" does take place with the 4-in-1 splitters via a PLX switch on a single lane. It's constantly cycling through the different inputs, only allowing the data from one card at a time; it does not send all data from all cards over the bus at the same time. When you daisy-chain them you will undoubtedly run into communication issues.


I think there's a limit with USB hubs, and with the old non-switched network hubs, where you can only chain them up to 4 in a row or something. I assume a similar thing may exist with these splitters. Or perhaps a splitter will not tolerate its CPU side being already split and having to wait. I guess I'll eventually find out. But still, a modern board has around 7 PCI Express slots of some kind, so putting one splitter on each gives 28 GPUs - and I can't physically fit any more than that around one machine!

P.S. That block diagram is incorrect. The board does not have 5 PCI slots, it has 2. It also only has 2 PATA ports, not 4.
Ian&Steve C.

Joined: 24 Dec 19
Posts: 159
United States
Message 95280 - Posted: 18 Jan 2020, 19:29:32 UTC - in response to Message 95278.  

I think there's a limit with USB hubs, and with the old non-switched network hubs, where you can only chain them up to 4 in a row or something. I assume a similar thing may exist with these splitters. Or perhaps a splitter will not tolerate its CPU side being already split and having to wait. I guess I'll eventually find out. But still, a modern board has around 7 PCI Express slots of some kind, so putting one splitter on each gives 28 GPUs - and I can't physically fit any more than that around one machine!

P.S. That block diagram is incorrect. The board does not have 5 PCI slots, it has 2. It also only has 2 PATA ports, not 4.


You still don't understand.

You can't just add more splitters. The motherboard can't keep track of all the memory addresses in that many GPUs! Once you reach the limit of your motherboard's ability to map the memory it will simply fail to boot. Just because you can "plug it in" does NOT mean that it will work. I urge you to do some more research on this. I've done multi-GPU setups more than probably anyone else on this board, and have set up several multi-GPU systems for BOINC processing as well as mining.
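
To put rough numbers on that mapping limit, here is a back-of-envelope sketch in Python. Both constants are assumptions, not specs for this board: a typical consumer GPU asks for ~256 MB of BAR/MMIO space, and an old 32-bit-only BIOS has maybe 2 GB of address window left below 4 GB after RAM and chipset resources are mapped.

    # Back-of-envelope model of why an old board stops booting past a few GPUs.
    # Both constants are assumptions, not measurements for this motherboard.
    BAR_PER_GPU_MB = 256    # typical BAR/MMIO claim for one consumer GPU
    MMIO_WINDOW_MB = 2048   # assumed usable 32-bit address window below 4 GB

    max_gpus = MMIO_WINDOW_MB // BAR_PER_GPU_MB
    print(f"Roughly {max_gpus} GPUs before the BIOS runs out of address space")

Newer boards escape this with an "Above 4G Decoding" BIOS option (which is exactly what mining boards advertise); a 2008-era board almost certainly lacks it.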

The block diagram is the diagram for the chipset that you have on your motherboard, the Nvidia nForce 750i SLI chipset. It represents what is "possible" to have in a full load-out. Not all manufacturers will use all available I/O when building boards with varying features, but it represents what resources are available in your topology. You can't have more than is shown, but you can have less. It correctly reflects the features available to your board. The point being that your x1 slots without a doubt run at PCIe 1.0 speed.

Additionally, since no one here seems to know how to check PCIe bus utilization with an in-progress WU, I checked on my system so I can answer for you. I run Linux Ubuntu 18.04.3 and I used a program called 'gmonitor', which is similar to htop and gives graphical and numerical info in a terminal window. Under Windows, I would assume that a program like GPU-Z or MSI Afterburner might show you bus utilization, but my crunchers do not run Windows so you'll have to find a program yourself.
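
If anyone wants the same counter without a full-screen tool, NVML exposes it directly; a minimal sketch, assuming the nvidia-ml-py package (imported as pynvml) and an NVIDIA driver are installed:

    # Poll PCIe throughput once a second via NVML.
    # Assumes: pip install nvidia-ml-py (import name pynvml), NVIDIA driver.
    import time
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU
    try:
        for _ in range(10):
            # NVML reports KB/s, sampled by the driver over a ~20 ms window.
            tx = pynvml.nvmlDeviceGetPcieThroughput(handle, pynvml.NVML_PCIE_UTIL_TX_BYTES)
            rx = pynvml.nvmlDeviceGetPcieThroughput(handle, pynvml.NVML_PCIE_UTIL_RX_BYTES)
            print(f"TX {tx / 1024:6.0f} MB/s   RX {rx / 1024:6.0f} MB/s")
            time.sleep(1)
    finally:
        pynvml.nvmlShutdown()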

Project: SETI@home
Application: CUDA 10.2 special application (highly optimized)
WU type: BLC61 vlar
GPU: GTX 1650 4GB
GPU link speed/width: PCIe 3.0 x16 (16,000 MB/s max)

Here, for about the first half of the WU progress, PCIe bus utilization shows about 4-5% (640-800 MB/s), and the remaining half of the WU progress ran at about 1-2% PCIe bus utilization.

Since the resolution of the tool is only in whole percent, I decided to move the GPU to a PCIe 3.0 x4 (4,000 MB/s max) slot for some more resolution, and here I saw a relatively expected 14-16% (560-640 MB/s) PCIe bus utilization on the first half of the WU, and about 5% on the second half.

This confirms my previous experience when comparing run times on SETI WUs on PCIe 2.0 x1 and PCIe 3.0 x1: no slowdown on a PCIe 3.0 x1 link since it's not being bottlenecked, but a small slowdown on the 2.0 link since it's being slightly bottlenecked (2.0 x1 is limited to 500 MB/s) for about half of the WU run. There's not just a small amount of data passed at the beginning and end of the WU; there can certainly be a constant flow of data across the whole WU. And SETI WUs are tiny compared to some projects.
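
For reference, the link-rate figures used above fall straight out of the encoding math; a quick sketch for checking other generation/width combinations (per direction, ignoring packet overhead):

    # Per-direction PCIe bandwidth = transfer rate (GT/s) x encoding
    # efficiency x lane count, converted from Gb/s to MB/s.
    GENS = {
        "1.0": (2.5, 8 / 10),     # 2.5 GT/s, 8b/10b encoding
        "2.0": (5.0, 8 / 10),     # 5.0 GT/s, 8b/10b encoding
        "3.0": (8.0, 128 / 130),  # 8.0 GT/s, 128b/130b encoding
    }

    def mb_per_s(gen, lanes):
        gt_s, eff = GENS[gen]
        return gt_s * eff * lanes * 1000 / 8

    for gen, lanes in [("1.0", 1), ("2.0", 1), ("3.0", 1), ("3.0", 4), ("3.0", 16)]:
        print(f"PCIe {gen} x{lanes}: {mb_per_s(gen, lanes):,.0f} MB/s")
    # Prints ~250, 500, 985, 3,938 and 15,754 MB/s - the 4,000 and 16,000
    # figures quoted above are the usual rounded values.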

This is just an example of SETI, as that's the only project I run. You need to do your own independent testing on the projects you run.
Joseph Stateson
Volunteer tester
Joined: 27 Jun 08
Posts: 551
United States
Message 95281 - Posted: 18 Jan 2020, 19:30:20 UTC - in response to Message 95275.  
Last modified: 18 Jan 2020, 19:31:55 UTC

Also, why do people bother mining? I've tried it on GPUs and ASICs, and it just is not profitable. The electricity cost is approximately twice the coins you earn.


I have been "mining" since classic SETI, but it was not called that back in 1999. Three (?) years ago I quit the Texas A&M club and joined the Gridcoin club. At the time I joined, a single GRC was worth just under a quarter USD as I recall. If it had risen to a full quarter I would have 61,000 * 0.25 = $15,250. Unfortunately it is currently worth less than 1/4 cent; I will let you do the math. The conclusion of this exercise is that I get a small return of something more valuable than just mining for "credits".
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 4536
United Kingdom
Message 95282 - Posted: 18 Jan 2020, 19:43:16 UTC - in response to Message 95280.  

Additionally, since no one here seems to know how to check PCIe bus utilization with an in-progress WU, I checked on my system so I can answer for you. I run Linux Ubuntu 18.04.3 and I used a program called 'gmonitor', which is similar to htop and gives graphical and numerical info in a terminal window. Under Windows, I would assume that a program like GPU-Z or MSI Afterburner might show you bus utilization, but my crunchers do not run Windows so you'll have to find a program yourself.
Windows 10 (later releases only) has an enhanced Task Manager, with GPU monitoring on the Performance tab.

The GPU monitor can visualise many different performance metrics (11 on the Intel iGPU I've just checked), but only has 4 mini-graphs to display them. The components used for distributed computing aren't among the default four shown at startup - you have to search through the drop-down lists to find the applicable one(s).
Peter Hucker
Joined: 6 Oct 06
Posts: 1144
United Kingdom
Message 95283 - Posted: 18 Jan 2020, 19:54:37 UTC - in response to Message 95280.  

You can't just add more splitters. The motherboard can't keep track of all the memory addresses in that many GPUs! Once you reach the limit of your motherboard's ability to map the memory it will simply fail to boot. Just because you can "plug it in" does NOT mean that it will work. I urge you to do some more research on this. I've done multi-GPU setups more than probably anyone else on this board, and have set up several multi-GPU systems for BOINC processing as well as mining.


The best research is trial and error. As I buy more GPUs, I will attempt to connect them to this board. When I cannot do so any longer, I'll have to buy a 2nd board.

The block diagram is the diagram for the chipset that you have on your motherboard, the Nvidia nForce 750i SLI chipset. It represents what is "possible" to have in a full load-out. Not all manufacturers will use all available I/O when building boards with varying features, but it represents what resources are available in your topology. You can't have more than is shown, but you can have less. It correctly reflects the features available to your board. The point being that your x1 slots without a doubt run at PCIe 1.0 speed.


Well, they do 2 Einstein or 2 Milkyway tasks per GPU on the 1.0 x1 slots without slowing down at all.

Additionally, since no one here seems to know how to check PCIe bus utilization with an in-progress WU, I checked on my system so I can answer for you. I run Linux Ubuntu 18.04.3 and I used a program called 'gmonitor', which is similar to htop and gives graphical and numerical info in a terminal window. Under Windows, I would assume that a program like GPU-Z or MSI Afterburner might show you bus utilization, but my crunchers do not run Windows so you'll have to find a program yourself.


GPU-Z doesn't show that. I can only see on-card GPU and memory utilisation, and with some cards, power usage, volts and amps into and out of the VRM.

Project: SETI@home
Application: CUDA 10.2 special application (highly optimized)
WU type: BLC61 vlar
GPU: GTX 1650 4GB
GPU link speed/width: PCIe 3.0 x16 (16,000 MB/s max)

Here, for about the first half of the WU progress, PCIe bus utilization shows about 4-5% (640-800 MB/s), and the remaining half of the WU progress ran at about 1-2% PCIe bus utilization.

Since the resolution of the tool is only in whole percent, I decided to move the GPU to a PCIe 3.0 x4 (4,000 MB/s max) slot for some more resolution, and here I saw a relatively expected 14-16% (560-640 MB/s) PCIe bus utilization on the first half of the WU, and about 5% on the second half.

This confirms my previous experience when comparing run times on SETI WUs on PCIe 2.0 x1 and PCIe 3.0 x1: no slowdown on a PCIe 3.0 x1 link since it's not being bottlenecked, but a small slowdown on the 2.0 link since it's being slightly bottlenecked (2.0 x1 is limited to 500 MB/s) for about half of the WU run. There's not just a small amount of data passed at the beginning and end of the WU; there can certainly be a constant flow of data across the whole WU. And SETI WUs are tiny compared to some projects.

This is just an example of SETI, as that's the only project I run. You need to do your own independent testing on the projects you run.


PCI Express 1.0 x1 goes at 250 MB/sec, which is almost enough for SETI on your card (assuming you ran more than one at a time, so if one task was waiting in its first half, the GPU could be computing the second half of the other task). Your card is three quarters of the speed of mine for single precision, so I should see more of a bottleneck. I can only assume that Einstein and Milkyway use a lot less data transfer, as I see the tasks completing in precisely the same time on the 1.0 x1 socket as on the 2.0 x16 socket. Can you try that tool running Einstein and Milkyway?
Peter Hucker
Joined: 6 Oct 06
Posts: 1144
United Kingdom
Message 95285 - Posted: 18 Jan 2020, 19:59:01 UTC - in response to Message 95281.  

Also, why do people bother mining? I've tried it on GPUs and ASICs, and it just is not profitable. The electricity cost is approximately twice the coins you earn.


I have been "mining" since classic SETI, but it was not called that back in 1999. Three (?) years ago I quit the Texas A&M club and joined the Gridcoin club. At the time I joined, a single GRC was worth just under a quarter USD as I recall. If it had risen to a full quarter I would have 61,000 * 0.25 = $15,250. Unfortunately it is currently worth less than 1/4 cent; I will let you do the math. The conclusion of this exercise is that I get a small return of something more valuable than just mining for "credits".


That's the problem: the value of coins plummets continuously. If one is valuable, people will all start mining it, and it will drop like a stone, or the difficulty rises so you get fewer of them. Either that or people continuously invent faster and faster mining chips, so the one you own needs replacing very often, wasting most of the money. Then when you add the cost of the electricity, you end up making a loss.

I run BOINC to contribute to science. Mining doesn't contribute anything to anything; it's just a pointless waste of electricity, and I'm surprised the environmentalists haven't had the whole farce shut down.
Ian&Steve C.

Joined: 24 Dec 19
Posts: 159
United States
Message 95286 - Posted: 18 Jan 2020, 20:04:33 UTC - in response to Message 95283.  

How exactly is PCIe 1.0 x1 "almost enough"? 250 MB/s is less than half of what was being measured. This can and will cause big slowdowns. I have measured it and saw a repeatable difference when switching between PCIe 2.0 and 3.0, which have 2 and 4 times the bandwidth of PCIe 1.0 respectively. In my opinion, if PCIe 2.0 x1 isn't enough for SETI, then PCIe 1.0 x1 certainly isn't "almost enough".

I do not run more than 1 WU at a time (for processing), as this application is optimized to the point that 1 alone will max out the card as much as it can: 100% GPU utilization and 100% memory bus utilization. Running 2 just causes the WUs to take twice as long, which is not really helpful.
Peter Hucker
Joined: 6 Oct 06
Posts: 1144
United Kingdom
Message 95287 - Posted: 18 Jan 2020, 20:10:20 UTC - in response to Message 95282.  
Last modified: 18 Jan 2020, 20:11:26 UTC

Additionally, since no one here seems to know how to check PCIe bus utilization with an in-progress WU, I checked on my system so I can answer for you. I run Linux Ubuntu 18.04.3 and I used a program called 'gmonitor', which is similar to htop and gives graphical and numerical info in a terminal window. Under Windows, I would assume that a program like GPU-Z or MSI Afterburner might show you bus utilization, but my crunchers do not run Windows so you'll have to find a program yourself.
Windows 10 (later releases only) has an enhanced Task Manager, with GPU monitoring on the Performance tab.

The GPU monitor can visualise many different performance metrics (11 on the Intel iGPU I've just checked), but only has 4 mini-graphs to display them. The components used for distributed computing aren't among the default four shown at startup - you have to search through the drop-down lists to find the applicable one(s).


I get "compute 1" at almost 100%, "copy" at spikes of 100% but only about a seventh of the time, and "compute 0" more or less matching the "copy" graph. I assume "copy" means it's transferring across the PCI Express bus? Or does it include transfers to the GPU's own RAM? This is running two Milkyway tasks on a 280X. I wonder what compute 0 and 1 are - 64-bit and 32-bit? Can't be that; Einstein and Milkyway both only show on the compute 0 graph. Strange how compute 0 almost exactly matches copy.

If copy really does mean data transfer on the PCI Express bus, then I'm nowhere near maxing it out. I could run 7 cards on each 1.0 x1 slot!
Peter Hucker
Joined: 6 Oct 06
Posts: 1144
United Kingdom
Message 95288 - Posted: 18 Jan 2020, 20:14:12 UTC - in response to Message 95286.  

How exactly is PCIe 1.0 x1 "almost enough"? 250 MB/s is less than half of what was being measured.


Only for half your work unit. If you run more than one unit on each card, then you can get it averaged out.

I do not run more than 1 WU at a time (for processing), as this application is optimized to the point that 1 alone will max out the card as much as it can: 100% GPU utilization and 100% memory bus utilization. Running 2 just causes the WUs to take twice as long, which is not really helpful.


It would be if you had an uneven bus bottleneck. I do it because I have an uneven CPU bottleneck. Both Einstein and Milkyway sometimes do CPU processing, which holds up the GPU, so with two tasks loaded, that virtually never happens.
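
For anyone wanting to copy the doubling up: it's BOINC's standard app_config.xml mechanism, dropped into the project's directory under projects/. A minimal sketch - the <name> value here is a placeholder; take the real app name from client_state.xml:

    <app_config>
      <app>
        <name>milkyway</name>          <!-- placeholder: use the app name from client_state.xml -->
        <gpu_versions>
          <gpu_usage>0.5</gpu_usage>   <!-- half a GPU per task = 2 tasks per GPU -->
          <cpu_usage>1.0</cpu_usage>   <!-- budget a full CPU thread per task -->
        </gpu_versions>
      </app>
    </app_config>

BOINC picks it up after Options > Read config files, or a client restart.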
Peter Hucker
Joined: 6 Oct 06
Posts: 1144
United Kingdom
Message 95289 - Posted: 18 Jan 2020, 20:25:55 UTC - in response to Message 95286.  

How exactly is PCIe 1.0 x1 "almost enough"? 250 MB/s is less than half of what was being measured. This can and will cause big slowdowns. I have measured it and saw a repeatable difference when switching between PCIe 2.0 and 3.0, which have 2 and 4 times the bandwidth of PCIe 1.0 respectively. In my opinion, if PCIe 2.0 x1 isn't enough for SETI, then PCIe 1.0 x1 certainly isn't "almost enough".

I do not run more than 1 WU at a time (for processing), as this application is optimized to the point that 1 alone will max out the card as much as it can: 100% GPU utilization and 100% memory bus utilization. Running 2 just causes the WUs to take twice as long, which is not really helpful.


Just how big are those SETI tasks? Surely you can't need half a GB to be transferred every second; that would severely max out your internet connection when giving the result back. Can't the memory on the GPU be used during the task? Or is the CPU doing a lot of assistance and needing to communicate with the GPU regularly?
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 4536
United Kingdom
Message 95290 - Posted: 18 Jan 2020, 20:45:37 UTC - in response to Message 95289.  

Just how big are those SETI tasks? Surely you can't need half a GB to be transferred every second; that would severely max out your internet connection when giving the result back. Can't the memory on the GPU be used during the task? Or is the CPU doing a lot of assistance and needing to communicate with the GPU regularly?
GPUs - taking NVidia CUDA as an example - are limited to 5 seconds runtime per program launch, with a tighter TDR limit of 2 seconds under Windows. You will have noticed that SETI tasks take longer than this...

The whole art (and it is an art) of GPU programming is to break the original task into a myriad of tasklets, or kernels - as many as possible of which are running in parallel. The GPU itself doesn't have the decision-making hardware to manage this: the CPU is responsible for ensuring that each new kernel - and its associated data - is present and correct at the precise millisecond (or is that microsecond?) that the shader becomes available to process it. You can't just measure the data flow as a one-time load: you have to know the average memory used by each kernel, and the number of kernel launches needed to process the whole meta-task. I can't begin to estimate those numbers.
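
To make the shape of that concrete, here is a toy sketch in Python using Numba's CUDA support - the kernel, sizes and arithmetic are purely illustrative, nothing like the real SETI code. The CPU queues a stream of short launches over slices of the data, each finishing well inside any watchdog limit, instead of one monolithic launch:

    # Toy illustration of slicing one long job into many short kernel launches.
    # Assumes numba with CUDA support and an NVIDIA GPU; illustrative only.
    import numpy as np
    from numba import cuda

    @cuda.jit
    def scale_chunk(data, start, length, factor):
        i = cuda.grid(1)
        if i < length:                       # stay inside this slice
            data[start + i] *= factor

    n = 10_000_000
    signal = cuda.to_device(np.random.rand(n).astype(np.float32))

    chunk, threads = 1_000_000, 256          # small slices, each a fast launch
    for start in range(0, n, chunk):         # the CPU feeds kernel after kernel
        length = min(chunk, n - start)
        blocks = (length + threads - 1) // threads
        scale_chunk[blocks, threads](signal, start, length, 1.0001)
    cuda.synchronize()                       # wait for the queued launches to drain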
Ian&Steve C.

Joined: 24 Dec 19
Posts: 159
United States
Message 95291 - Posted: 18 Jan 2020, 20:45:53 UTC - in response to Message 95288.  

How exactly is PCIe 1.0 x1 "almost enough"? 250 MB/s is less than half of what was being measured.


Only for half your work unit. If you run more than one unit on each card, then you can get it averaged out.

I do not run more than 1 WU at a time (for processing), as this application is optimized to the point that 1 alone will max out the card as much as it can: 100% GPU utilization and 100% memory bus utilization. Running 2 just causes the WUs to take twice as long, which is not really helpful.


It would be if you had an uneven bus bottleneck. I do it because I have an uneven CPU bottleneck. Both Einstein and Milkyway sometimes do CPU processing, which holds up the GPU, so with two tasks loaded, that virtually never happens.


I'm not trying to be rude, but frankly you don't know what you're talking about. There is no "uneven bus bottleneck"; that doesn't even make sense.

Running 2 WUs will not help when the PCIe bus is already maxed; there is only a finite amount of bandwidth available. If running 1 WU uses ~600 MB/s, 2 with no restriction will want 1,200 MB/s. Shoving 2 WUs through at a time doesn't just increase load on the GPU, it increases load on the PCIe bus as well. So you'd have 2 WUs fighting over only 250 MB/s, slowing things down even FURTHER.

This application is highly optimized, and I can't stress that enough; it's rather incredible. Processing times are about 4x faster than the optimized OpenCL app over at SETI. Additionally I run certain command-line parameters (specific to this application) which enable higher CPU support to the GPU. Each GPU WU will use 100% of a CPU thread during the entire WU run. This increases the speed of processing further and makes processing times more consistent, keeping GPU utilization at 100%. So my 7-GPU system running on an 8-thread CPU is using almost all of the CPU for GPU support only. The 1650 system is running a 4-thread CPU and using about 25% CPU, etc.

The WU files downloaded from the project are not that large, maybe about 720 KB, but once extracted to GPU memory each WU uses about 1.3 GB of GPU memory space.

If you truly aren't seeing a slowdown on the 1.0 x1 slot, then great, but I also suspect you can apply some optimizations (you'd have to inquire at the Einstein or MW forums if there are any) to make those jobs run faster, at the expense of using more resources.
Peter Hucker
Joined: 6 Oct 06
Posts: 1144
United Kingdom
Message 95292 - Posted: 18 Jan 2020, 21:30:01 UTC - in response to Message 95290.  
Last modified: 18 Jan 2020, 21:47:41 UTC

Just how big are those SETI tasks? Surely you can't need half a GB to be transferred every second; that would severely max out your internet connection when giving the result back. Can't the memory on the GPU be used during the task? Or is the CPU doing a lot of assistance and needing to communicate with the GPU regularly?
GPUs - taking NVidia CUDA as an example - are limited to 5 seconds runtime per program launch, with a tighter TDR limit of 2 seconds under Windows. You will have noticed that SETI tasks take longer than this...

The whole art (and it is an art) of GPU programming is to break the original task into a myriad of tasklets, or kernels - as many as possible of which are running in parallel. The GPU itself doesn't have the decision-making hardware to manage this: the CPU is responsible for ensuring that each new kernel - and its associated data - is present and correct at the precise millisecond (or is that microsecond?) that the shader becomes available to process it. You can't just measure the data flow as a one-time load: you have to know the average memory used by each kernel, and the number of kernel launches needed to process the whole meta-task. I can't begin to estimate those numbers.


Why do they have the 5-second limit? Is it a real limitation of the GPU, or a silly decision made by the CUDA inventors? And why can't everything be stored in GPU memory, with just calls by the CPU for the GPU to access it? I thought that was the whole point of a GPU having memory on board, so it didn't have to access main RAM and a) get in the way of the CPU trying to use it for something else, and b) wait on bus transfers.
Peter Hucker
Joined: 6 Oct 06
Posts: 1144
United Kingdom
Message 95293 - Posted: 18 Jan 2020, 21:41:25 UTC - in response to Message 95291.  
Last modified: 18 Jan 2020, 21:45:31 UTC

I'm not trying to be rude, but frankly you don't know what you're talking about. There is no "uneven bus bottleneck"; that doesn't even make sense.


You gave me the data yourself: the bus is being used three times as much for the first half of your tasks. Usage is therefore not even across the whole task. So, in a similar way to my problem of Einstein tasks needing a lot more CPU power at the start, if I run more than one at a time the GPU has something to do while the CPU is thinking about the first task. You would be averaging out how much the bus needs to be used, and I'm averaging out how much the GPU is waiting for the CPU.

Running 2 WUs will not help when the PCIe bus is already maxed; there is only a finite amount of bandwidth available. If running 1 WU uses ~600 MB/s, 2 with no restriction will want 1,200 MB/s. Shoving 2 WUs through at a time doesn't just increase load on the GPU, it increases load on the PCIe bus as well. So you'd have 2 WUs fighting over only 250 MB/s, slowing things down even FURTHER.


You forgot to divide by two. If a GPU runs one SETI task and needs 600 MB/sec, running two tasks will still need 600 MB/sec, as each task is running at half the speed. Now consider the second half of the task only needing 150 MB/sec. If one task is in the first half and one is in the second half, the data transfer is the average of the two, 375 MB/sec. Or, another way to work it out: task 1, in the first half of computation, needs half of 600 MB/sec, as it only has half a GPU to work with and runs at half speed. Task 2, in the second half of computation, needs half of 150 MB/sec. So 600/2 + 150/2 = 375.
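
The same arithmetic as a sketch, assuming (as a simplification) that each task's bandwidth demand scales with its share of the GPU:

    # Two tasks share one GPU, so each runs at half speed and therefore
    # demands half the bandwidth it would need with the GPU to itself.
    FIRST_HALF_MB_S = 600.0    # demand of a task in its first half, whole GPU
    SECOND_HALF_MB_S = 150.0   # demand of a task in its second half, whole GPU

    combined = FIRST_HALF_MB_S / 2 + SECOND_HALF_MB_S / 2
    print(f"Combined demand: {combined:.0f} MB/s")  # 375 MB/s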

This application is highly optimized, and I can't stress that enough; it's rather incredible. Processing times are about 4x faster than the optimized OpenCL app over at SETI. Additionally I run certain command-line parameters (specific to this application) which enable higher CPU support to the GPU. Each GPU WU will use 100% of a CPU thread during the entire WU run. This increases the speed of processing further and makes processing times more consistent, keeping GPU utilization at 100%. So my 7-GPU system running on an 8-thread CPU is using almost all of the CPU for GPU support only. The 1650 system is running a 4-thread CPU and using about 25% CPU, etc.


Yes, in your case doubling up tasks wouldn't work.

So how come somebody wrote this optimisation only for one card? It would be great if he could do it for others; he could make a lot of money.

The WU files downloaded from the project are not that large, maybe about 720 KB, but once extracted to GPU memory each WU uses about 1.3 GB of GPU memory space.


I assume your optimized SETI code is passing a lot of data back and forth between CPU and GPU; I guess that isn't happening with the stock Einstein and Milkyway programs. Any data required by the GPU is in the GPU's own memory.

If you truly aren't seeing a slowdown on the 1.0 x1 slot, then great,


Definitely - I timed several similar tasks and they took identical times connected to either port. I haven't yet tried sharing the 1.0 x1 slot between both cards, as the 4-way adapter still isn't here. Even if it does slow down doing that, and if I can't daisy-chain adapters, I can still have 1 card on each of the x1 slots and 4 on each of the x16 slots. So 10 cards. Good enough for one old PC. And approaching the limit of physically connecting them.

Do you know if you can use longer USB cables? I don't want to bother buying long USB 3 A-A cables if it's known they won't work. They're not the kind of thing that could be used elsewhere.

but I also suspect you can apply some optimizations (you'd have to inquire at the Einstein or MW forums if there are any) to make those jobs run faster, at the expense of using more resources.


I shall ask, although your SETI one is the only one I've ever heard of. What's the name of it, so I can refer to it and they know what I'm talking about?
robsmith
Volunteer tester
Help desk expert

Joined: 25 May 09
Posts: 976
United Kingdom
Message 95294 - Posted: 18 Jan 2020, 21:45:37 UTC

Picking up from what Richard said - the SETI CUDA applications have very efficient task slicing and management, which reduces the demands on the bus. However, the SETI SoG application is very much more demanding on the bus; in part this is down to the technology used (CUDA vs. OpenCL). So when considering the performance required of the bus, one has to be aware of what the application's needs are.

If I&S thinks he's seen some high-GPU-count computers - how about 256 Quadro rtx 800s in one computer, each one sitting on its own x16 PCIe link? To keep that lot fed and under control there are a whole host of Xeons, ARMs and FPGAs, not to mention the cooling system that shifts over 100 kW of heat away (and no doubt warms a fair bit of the rest of the building...).
Peter Hucker
Joined: 6 Oct 06
Posts: 1144
United Kingdom
Message 95295 - Posted: 18 Jan 2020, 21:49:19 UTC - in response to Message 95294.  
Last modified: 18 Jan 2020, 21:53:02 UTC

Picking up from what Richard said - the SETI CUDA applications have very efficient task slicing and management, which reduces the demands on the bus. However, the SETI SoG application is very much more demanding on the bus; in part this is down to the technology used (CUDA vs. OpenCL). So when considering the performance required of the bus, one has to be aware of what the application's needs are.

If I&S thinks he's seen some high-GPU-count computers - how about 256 Quadro rtx 800s in one computer, each one sitting on its own x16 PCIe link? To keep that lot fed and under control there are a whole host of Xeons, ARMs and FPGAs, not to mention the cooling system that shifts over 100 kW of heat away (and no doubt warms a fair bit of the rest of the building...).


I would love to see that thing. What does it do and do you have pictures?

I assume you meant Quadro RTX 8000, like this: https://www.techpowerup.com/gpu-specs/quadro-rtx-8000.c3306 - pah! It's half the speed of my 280X on double precision.
Ian&Steve C.

Joined: 24 Dec 19
Posts: 159
United States
Message 95296 - Posted: 18 Jan 2020, 22:05:16 UTC - in response to Message 95293.  

You forgot to divide by two. If a GPU runs one SETI task and needs 600 MB/sec, running two tasks will still need 600 MB/sec, as each task is running at half the speed. Now consider the second half of the task only needing 150 MB/sec. If one task is in the first half and one is in the second half, the data transfer is the average of the two, 375 MB/sec. Or, another way to work it out: task 1, in the first half of computation, needs half of 600 MB/sec, as it only has half a GPU to work with and runs at half speed. Task 2, in the second half of computation, needs half of 150 MB/sec. So 600/2 + 150/2 = 375.


That's not how this works, and you're forgetting that I said "at no restriction", meaning when there is more than enough bandwidth. Running 2 tasks on a 250 MB/s link effectively leaves each WU using 125 MB/s, or roughly 1/4 of the speed it could otherwise run at. You also aren't realizing (and it's not entirely your fault, I didn't mention it) that 50% task completion is NOT 50% of the time it takes. This was based on the completion percentage as reported by BOINC. On these tasks, with this app, the WU % processing rate increases almost exponentially from start to finish; the first 50% of completion takes about 75-80% of the total WU run time. So it's actually most of the TIME that requires the increased bandwidth.

So how come somebody wrote this optimisation only for one card? It would be great if he could do it for others; he could make a lot of money.


It's not for one card. This SETI application works on any Nvidia GPU with a compute capability of 5.0 (Maxwell generation, GTX 900+) or greater and running Linux; that's a lot of different cards that can be used. No one is making money from this. It was developed by volunteers over at SETI and given to the community totally free. This application will not work for anything other than SETI; the same algorithms used on SETI can't be used for other projects, which are doing totally different types of calculations.


Do you know if you can use longer USB cables? I don't want to bother buying long USB 3 A-A cables if it's known they won't work. They're not the kind of thing that could be used elsewhere.


You probably can use longer ones. I think I've used the 2ft-3ft ones before, but YMMV; shorter will be better for signal integrity. Since good-quality USB 3.0 cables can handle PCIe 3.0, 2.0 and 1.0 should be rather easy to handle.

I shall ask, although your SETI one is the only one I've ever heard of. What's the name of it, so I can refer to it and they know what I'm talking about?

The SETI app runs CUDA; most other projects run OpenCL as far as I know, so it's probably not helpful to mention it. I would just ask if there are any command-line optimizations that can be implemented for the applications at those projects, usually applied in a config file or a text file. They will be specific to your projects and the applications they use.
robsmith
Volunteer tester
Help desk expert

Joined: 25 May 09
Posts: 976
United Kingdom
Message 95297 - Posted: 18 Jan 2020, 22:19:03 UTC - in response to Message 95295.  

Well, there are no publishable pictures of the complete beast. Indeed the complete beast is very boring to look at, just a large box with power, data and cooling connections.
Initially a small system was tested using RTX 2080s to give an idea of what feeders were going to be needed. Next tests were with earlier Quadros, which left the RTX 2080 behind; after six months (and some mods to the cooling) the RTX 8000s were installed, and they are a step up again. The trouble with benchmarks and specs is they don't always reflect what happens in real life under very high stress.

The system being totally air-cooled, the GPUs were obtained without their fans; blast air at ~4°C keeps everything in check.

But we digress.
Ian&Steve C.

Joined: 24 Dec 19
Posts: 159
United States
Message 95298 - Posted: 18 Jan 2020, 22:19:46 UTC - in response to Message 95294.  
Last modified: 18 Jan 2020, 22:24:42 UTC



If I&S thinks he's seen some high-GPU-count computers - how about 256 Quadro rtx 800s in one computer, each one sitting on its own x16 PCIe link? To keep that lot fed and under control there are a whole host of Xeons, ARMs and FPGAs, not to mention the cooling system that shifts over 100 kW of heat away (and no doubt warms a fair bit of the rest of the building...).


Let me know when something like that is built by one person using their hobby money lol. A multi-million <insert currency> supercomputer built by a team of engineers from one or more companies for a very specific use case is a bit outside the scope of this topic, I think. Not to mention that a system like that couldn't even be recognized by BOINC as a single host anyway lol (hard cap of 64 GPUs per host). "In one rack" or "in one room" might be more appropriate than "in one computer", and the resources would have to be split up virtually to ever run these BOINC projects.
Peter Hucker
Joined: 6 Oct 06
Posts: 1144
United Kingdom
Message 95299 - Posted: 18 Jan 2020, 22:20:30 UTC - in response to Message 95296.  

That's not how this works, and you're forgetting that I said "at no restriction", meaning when there is more than enough bandwidth. Running 2 tasks on a 250 MB/s link effectively leaves each WU using 125 MB/s, or roughly 1/4 of the speed it could otherwise run at.


But if you run two tasks on the GPU, the data requirement of each task is halved, as they are running slower.

You also aren't realizing (and it's not entirely your fault, I didn't mention it) that 50% task completion is NOT 50% of the time it takes. This was based on the completion percentage as reported by BOINC. On these tasks, with this app, the WU % processing rate increases almost exponentially from start to finish; the first 50% of completion takes about 75-80% of the total WU run time. So it's actually most of the TIME that requires the increased bandwidth.


I see.

It's not for one card. This SETI application works on any Nvidia GPU with a compute capability of 5.0 (Maxwell generation, GTX 900+) or greater and running Linux; that's a lot of different cards that can be used. No one is making money from this. It was developed by volunteers over at SETI and given to the community totally free. This application will not work for anything other than SETI; the same algorithms used on SETI can't be used for other projects, which are doing totally different types of calculations.

Then they really should launch it mainstream.

You probably can use longer ones. I think I've used the 2ft-3ft ones before, but YMMV; shorter will be better for signal integrity. Since good-quality USB 3.0 cables can handle PCIe 3.0, 2.0 and 1.0 should be rather easy to handle.


The ones that came with them are 2 feet already. I'll try longer ones if I ever need to physically position things further away (like having a huge number of cards on one PC). I don't suppose it's possible to force PCI Express 1.0 if I have a long cable that can't manage 3.0?

The SETI app runs CUDA; most other projects run OpenCL as far as I know, so it's probably not helpful to mention it. I would just ask if there are any command-line optimizations that can be implemented for the applications at those projects, usually applied in a config file or a text file. They will be specific to your projects and the applications they use.


I thought it was a whole new app, not just a command or configuration you typed in.