BOINC doesn't utilize all CPUs on NUMA multi-node systems (>64 threads)

Profile Mumak
Joined: 14 Mar 15
Posts: 10
Slovakia
Message 60953 - Posted: 14 Mar 2015, 15:51:17 UTC
Last modified: 14 Mar 2015, 15:59:03 UTC

Hello, I have noticed a problem where BOINC doesn't seem to properly schedule threads on systems that have multiple CPU groups.

OS: Win 8.1 x64
CPU: 2 x Intel Xeon E5-2699 v3 (72 threads in total)
BOINC v7.4.36 x64
Tasks: WCG on 100% of threads

This system is set up with 2 CPU groups of 36 threads each, i.e. one group per CPU.
As per GetLogicalProcessorInformationEx:
Group[0]: MaximumProcessorCount=64, ActiveProcessorCount=36, ActiveProcessorMask=0000000FFFFFFFFF
Group[1]: MaximumProcessorCount=64, ActiveProcessorCount=36, ActiveProcessorMask=0000000FFFFFFFFF
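As a quick sanity check on the values above, the number of set bits in each ActiveProcessorMask should match ActiveProcessorCount. A minimal Python sketch, in which the helper name active_processors is purely illustrative and not part of any Windows API:

```python
def active_processors(mask_hex):
    """Return the indices of the logical processors enabled in a group mask."""
    mask = int(mask_hex, 16)
    # A processor group holds at most 64 logical processors, one bit each.
    return [bit for bit in range(64) if mask & (1 << bit)]


# ActiveProcessorMask values as reported for the two groups above.
groups = {0: "0000000FFFFFFFFF", 1: "0000000FFFFFFFFF"}

for group, mask_hex in groups.items():
    cpus = active_processors(mask_hex)
    print(f"Group[{group}]: {len(cpus)} active processors (bits {cpus[0]}..{cpus[-1]})")

total = sum(len(active_processors(m)) for m in groups.values())
print("Total logical processors:", total)
```

Each mask decodes to 36 processors (bits 0 to 35), which adds up to the 72 threads of the machine.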

The problem is that almost all threads have their affinity assigned to Group 0 (NUMA Node 0). This is clearly visible in Task Manager, where NUMA Node 0 usage is 100% and Node 1 is much lower, and it is confirmed by checking the affinity settings of individual threads.

I'm not sure whether BOINC assigns affinity to each thread manually, or whether it handles the assignment of threads to particular groups at all. But I think that if the originating process has Group 0 affinity, all subsequent threads are assigned to the same group by default unless you place them explicitly via SetThreadGroupAffinity/SetThreadIdealProcessorEx.


Thanks,
Martin
ID: 60953 · Report as offensive
Profile Mumak
Joined: 14 Mar 15
Posts: 10
Slovakia
Message 60988 - Posted: 16 Mar 2015, 7:46:53 UTC

Actually, by "threads" I meant processes in most cases.
So for a particular example, there are 72 WCG processes running and (almost) all of them have affinity set to Group (NUMA Node) 0. Group 1 is almost idle.
There are two possible solutions:
1. Either start as many instances of the BOINC process as there are groups, each set to a different group affinity (SetThreadGroupAffinity), and create the child processes from there.
2. Or manage all process/thread affinity explicitly.
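To illustrate what the second option could mean in practice, here is a minimal Python sketch of the bookkeeping only: each new task is planned onto the group with the lowest load per processor. The function name plan_group_assignment is hypothetical, this is not how the BOINC client works, and actually applying such a plan on Windows would still need the affinity APIs named above, which are not shown here.

```python
def plan_group_assignment(num_tasks, group_sizes):
    """Plan a processor group for each task, balancing load per processor.

    Returns a list where entry i is the group chosen for task i.
    """
    load = [0] * len(group_sizes)
    plan = []
    for _ in range(num_tasks):
        # Pick the group whose processors are least busy (ties go to the lowest index).
        group = min(range(len(group_sizes)), key=lambda g: load[g] / group_sizes[g])
        load[group] += 1
        plan.append(group)
    return plan


plan = plan_group_assignment(72, [36, 36])
print(plan.count(0), plan.count(1))  # 36 36
```

With two equal groups this simply alternates between them, so 72 tasks end up as 36 per group instead of piling onto one node.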
ID: 60988 · Report as offensive
Profile Mumak
Joined: 14 Mar 15
Posts: 10
Slovakia
Message 61069 - Posted: 19 Mar 2015, 7:11:57 UTC
Last modified: 19 Mar 2015, 7:12:11 UTC

Is there no interest from BOINC in utilizing more than 64 threads (on Windows)?!
ID: 61069 · Report as offensive
Profile Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 14593
Netherlands
Message 61070 - Posted: 19 Mar 2015, 10:42:57 UTC - in response to Message 61069.  
Last modified: 19 Mar 2015, 10:43:55 UTC

A couple of problems with your posts:
1) No BOINC start-up messages showing what BOINC is showing.
2) No explanation on what NUMA is. The only NUMA I know is the company that Dirk Pitt works for, it's doubtful you mean that one. And since you now have to look up what I mean, did you mean I have to go look up what you mean?
3) Other than saying you run WCG, you don't specify sub-project and amount of memory. Can you even run 64 single threads based on the amount of memory?
4) Or do you want to run work divided over the 64 threads? In that case, you have to ask the project for an application (OpenCL or MT) that can do so.

You may want to peruse this thread as well, for minimum amounts of useful info to post to get help around here.
ID: 61070 · Report as offensive
Profile Mumak
Joined: 14 Mar 15
Posts: 10
Slovakia
Message 61071 - Posted: 19 Mar 2015, 10:58:40 UTC - in response to Message 61070.  
Last modified: 19 Mar 2015, 10:59:56 UTC

I thought that the information I provided is sufficient for technicians to understand the issue at first look. I have even provided API functions which should be checked/used.

1. BOINC startup (is that relevant in this case?):
 | Starting BOINC client version 7.4.36 for windows_x86_64
 | log flags: file_xfer, sched_ops, task
 | Libraries: libcurl/7.39.0 OpenSSL/1.0.1j zlib/1.2.8
 | Data directory: C:\ProgramData\BOINC
 | Running under account Administrator
 | CUDA: NVIDIA GPU 0: Quadro K5200 (driver version 341.21, CUDA version 6.5, compute capability 3.5, 4096MB, 4096MB available, 3553 GFLOPS peak)
 | OpenCL: NVIDIA GPU 0: Quadro K5200 (driver version 341.21, device version OpenCL 1.1 CUDA, 8192MB, 4096MB available, 3553 GFLOPS peak)
 | Host name: WIN-99OC5SRSPAM
 | Processor: 72 GenuineIntel Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz [Family 6 Model 63 Stepping 2]
 | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 fma cx16 sse4_1 sse4_2 movebe popcnt aes f16c rdrandsyscall nx lm avx avx2 vmx smx tm2 dca pbe fsgsbase bmi1 smep bmi2
 | OS: Microsoft Windows Server 2012 R2: Standard x64 Edition, (06.03.9600.00)
 | Memory: 255.89 GB physical, 257.85 GB virtual
 | Disk: 185.97 GB total, 133.88 GB free
 | Local time is UTC -7 hours
 | VirtualBox version: 4.3.12
 | Config: simulate 72 CPUs
World Community Grid | URL http://www.worldcommunitygrid.org/; Computer ID 3277834; resource share 100
World Community Grid | General prefs: from World Community Grid (last modified 07-Mar-2015 15:09:57)
World Community Grid | Host location: none
World Community Grid | General prefs: using your defaults
 | Reading preferences override file
 | Preferences:
 | max memory usage when active: 196525.90MB
 | max memory usage when idle: 235831.09MB
 | max disk usage: 10.00GB
 | (to change preferences, visit a project web site or select Preferences in the Manager)
 | Not using a proxy


2. NUMA = Non-uniform memory access - http://en.wikipedia.org/wiki/Non-uniform_memory_access; MS: https://msdn.microsoft.com/en-us/library/windows/desktop/aa363804%28v=vs.85%29.aspx

3. Running 72 WCG/MCM tasks. There's enough memory and this issue is not related to memory capacity either.

4. No, as per 3. there are as many tasks running as there are logical CPUs in the system. The issue is that almost all of them are scheduled (by BOINC) onto the same NUMA node (CPU group), which overloads that node while the other one is almost idle.

The situation can be simulated even on systems which don't have >64 CPUs, since it should be possible to configure Windows to use multiple groups on any machine.
ID: 61071 · Report as offensive
Profile Jord
Volunteer tester
Help desk expert
Joined: 29 Aug 05
Posts: 14593
Netherlands
Message 61072 - Posted: 19 Mar 2015, 12:20:28 UTC - in response to Message 61071.  
Last modified: 19 Mar 2015, 12:22:09 UTC

"I thought that the information I provided is sufficient for technicians to understand the issue at first look."

From the forum index, second line: "These message boards are frequented by volunteers. It's likely (but not guaranteed) that they'll be able to respond to your questions or suggestions."

The BOINC developers don't read here, unless pointed out by me or others. Between us volunteers we can solve quite a lot of problems before we need to ask the developers. We're not all technicians, rocket scientists, and such but we do know our BOINC and are better at helping when we have a basic understanding of what's actually being asked.

"1. BOINC startup (is that relevant in this case?)"

Yes, that is relevant, because it shows one thing:

Processor: 72 GenuineIntel Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz [Family 6 Model 63 Stepping 2]
All 72 cores are detected by BOINC.

Config: simulate 72 CPUs
I see you set <ncpus>72</ncpus>, which is totally unnecessary when BOINC has already detected the 72 cores. With that option you can tell BOINC to simulate that it has more cores than it actually has, e.g. on a 2 core system set it to 4 and it'll run 4 instances of tasks on those 2 cores.

If you want to juggle the number of CPU cores, use the "Use at most N% of the CPUs" option in the preferences.

"3. Running 72 WCG/MCM tasks."

So then the BOINC client works as it should. You have 72 cores, you have 72 single threads taking up one task and one WCG science application per thread. The science applications determine how they use the memory and how much, not the BOINC client.

Meaning that if you want this changed, you'll have to ask WCG to change their way of application memory management. And then if WCG thinks this is a viable request, they can ask BOINC to come up with a different API to make science applications that way. But that's a different BOINC than the client.
ID: 61072 · Report as offensive
Profile Mumak
Joined: 14 Mar 15
Posts: 10
Slovakia
Message 61073 - Posted: 19 Mar 2015, 12:36:46 UTC - in response to Message 61072.  
Last modified: 19 Mar 2015, 12:40:02 UTC

Ah, I thought the developers watch this forum.
I expected you to ask about the <ncpus> directive ;-) It was only added later, in a desperate attempt to get things working properly. In fact it doesn't matter whether it's present or not; the effect is always the same.

I have to disagree with you on the conclusion, because:
1. Even though I'm not familiar with BOINC internals and who's actually starting science apps/processes, I think it's the BOINC client - as seen in the process structure. So the BOINC client is actually responsible for starting a process and managing its affinity to particular resource (CPU Group, CPU thread). I'm not sure whether there's any affinity management implemented now, but I'd rather believe there's none, since it's not needed for systems that have a single NUMA node.

2. This is not just a problem of WCG or particular project application. I'm sure the same would happen to any other tasks from other projects as well.

3. Sure, you can shift the burden to project/application developers, but then ALL projects would have to do the same: create new apps that are NUMA-aware (to adjust CPU affinity for their processes). Moreover, if multiple CPU projects were running on such a machine (which is a common case), the different projects would somehow have to talk to each other so that their affinities don't collide. I believe such a solution is almost impossible, so it would be best if BOINC managed the CPU group affinity globally for each application it starts. It's neither a big nor a difficult task.

I believe this discussion is worth to be checked by the BOINC devs.
ID: 61073 · Report as offensive
Richard Haselgrove
Volunteer tester
Help desk expert

Joined: 5 Oct 06
Posts: 4478
United Kingdom
Message 61074 - Posted: 19 Mar 2015, 12:55:31 UTC - in response to Message 61073.  

"I have to disagree with you on the conclusion, because:
1. Even though I'm not familiar with BOINC internals and who's actually starting science apps/processes, I think it's the BOINC client - as seen in the process structure. So the BOINC client is actually responsible for starting a process and managing its affinity to particular resource (CPU Group, CPU thread). I'm not sure whether there's any affinity management implemented now, but I'd rather believe there's none, since it's not needed for systems that have a single NUMA node."

There is device management at the BOINC client level, for GPUs and other co-processors, but even that is very rudimentary - for example, it isn't (yet) possible for the client to manage tasks automatically if they have different requirements, e.g. if a computer has a mixture of single-precision-only and double-precision-capable GPUs from the same manufacturer.

I'm pretty sure that for the time being, the BOINC client treats all CPUs as identical replicas of each other, and delegates all core management to the operating system.

"I believe this discussion is worth to be checked by the BOINC devs."

Agreed. They'll probably need to consider it as a feature request.
ID: 61074 · Report as offensive
Profile Mumak
Joined: 14 Mar 15
Posts: 10
Slovakia
Message 61075 - Posted: 19 Mar 2015, 13:06:48 UTC - in response to Message 61074.  


"I'm pretty sure that for the time being, the BOINC client treats all CPUs as identical replicas of each other, and delegates all core management to the operating system."


Yes. For systems with a single CPU group (NUMA node) nothing special is needed: the CPU affinity covers all threads by default and the OS manages it. But for systems with multiple groups it's different, since each process has its affinity set to one given group (not all groups), so it can run only within that node. My assumption is that the child processes inherit the group affinity of the parent BOINC process, so I feel the responsibility is on the BOINC client.
ID: 61075 · Report as offensive
Profile Mumak
Joined: 14 Mar 15
Posts: 10
Slovakia
Message 61077 - Posted: 19 Mar 2015, 16:14:30 UTC
Last modified: 19 Mar 2015, 16:14:39 UTC

Issue submitted to BOINC Git: https://github.com/BOINC/boinc/issues/1357
ID: 61077 · Report as offensive
Rom Walton
Project developer
Joined: 26 Aug 05
Posts: 164
Message 61078 - Posted: 19 Mar 2015, 16:36:29 UTC

"My assumption is that the child processes inherit the group affinity of the parent BOINC process, so I feel the responsibility is on the BOINC client."


According to the published information, applications will be distributed across all the NUMA nodes.

See:
https://msdn.microsoft.com/en-us/library/windows/hardware/dn653313%28v=vs.85%29.aspx
----- Rom
BOINC Development Team, U.C. Berkeley
ID: 61078 · Report as offensive
Profile Mumak
Joined: 14 Mar 15
Posts: 10
Slovakia
Message 61079 - Posted: 19 Mar 2015, 17:19:25 UTC - in response to Message 61078.  

But that document says otherwise:

"Each newly created thread is by default assigned to the same group as the thread that created it."

"Only the system process is assigned a multigroup affinity at startup time. All other processes must explicitly assign threads to a different group to use the full set of processors in the system."

"An application that requires the use of multiple groups so that it can run on more than 64 processors must intentionally determine where to run its threads. The application is responsible for setting thread affinities to the desired groups."


Also my experience and testing on that system confirms the above facts.
If you need any details from such a system let me know and I will provide them.
ID: 61079 · Report as offensive
Rom Walton
Project developer
Joined: 26 Aug 05
Posts: 164
Message 61081 - Posted: 19 Mar 2015, 17:34:06 UTC - in response to Message 61079.  

"But that document says otherwise:

'Each newly created thread is by default assigned to the same group as the thread that created it.'

'Only the system process is assigned a multigroup affinity at startup time. All other processes must explicitly assign threads to a different group to use the full set of processors in the system.'

'An application that requires the use of multiple groups so that it can run on more than 64 processors must intentionally determine where to run its threads. The application is responsible for setting thread affinities to the desired groups.'

Also my experience and testing on that system confirms the above facts.
If you need any details from such a system let me know and I will provide them."


It also says:

"Windows 7 initially assigns each process to a single group in a round-robin manner across the groups in the system. A process starts its execution assigned to exactly one group."


Which implies that each time BOINC starts a new process, it'll be assigned to a different NUMA node.
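The two readings of the documentation discussed in this thread predict very different distributions for 72 single-threaded tasks on a 2-group machine. A toy Python model makes the difference concrete; both function names are illustrative, and neither model is a measurement of what Windows or BOINC actually does.

```python
from collections import Counter


def inherited_placement(num_processes, parent_group):
    """Every child stays in the parent's group (the inheritance hypothesis)."""
    return [parent_group] * num_processes


def round_robin_placement(num_processes, num_groups, first_group=0):
    """Each new process goes to the next group in turn (the round-robin reading)."""
    return [(first_group + i) % num_groups for i in range(num_processes)]


for name, placement in [
    ("inherited", inherited_placement(72, parent_group=0)),
    ("round-robin", round_robin_placement(72, num_groups=2)),
]:
    counts = Counter(placement)
    # Report every group, including those that received no process at all.
    print(name, {group: counts.get(group, 0) for group in range(2)})
```

Under inheritance all 72 processes share the 36 processors of one group; under round-robin each group receives 36. Which of the two matches the real system is exactly what has to be measured.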

I guess we'll need to set up a demo project with an app that can print out which group it has been assigned to, to know what is going on.
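A rough idea of what such a probe might look like, sketched in Python with ctypes rather than as a real BOINC application. It calls the Win32 function GetCurrentProcessorNumberEx, which reports the group and number of the processor the calling thread is currently running on; for a process that has not changed its own affinity, that group should be the one it was assigned to. The wrapper name current_processor is hypothetical, and on non-Windows systems the sketch simply returns None.

```python
import ctypes
import sys


class PROCESSOR_NUMBER(ctypes.Structure):
    """Mirror of the Win32 PROCESSOR_NUMBER structure."""
    _fields_ = [
        ("Group", ctypes.c_ushort),
        ("Number", ctypes.c_ubyte),
        ("Reserved", ctypes.c_ubyte),
    ]


def current_processor():
    """Return (group, number) of the processor running this thread, or None off Windows."""
    if sys.platform != "win32":
        return None
    kernel32 = ctypes.WinDLL("kernel32")
    func = kernel32.GetCurrentProcessorNumberEx
    func.argtypes = [ctypes.POINTER(PROCESSOR_NUMBER)]
    func.restype = None  # the Win32 function returns VOID
    number = PROCESSOR_NUMBER()
    func(ctypes.byref(number))
    return number.Group, number.Number


if __name__ == "__main__":
    location = current_processor()
    if location is None:
        print("Processor groups are a Windows concept; nothing to report here.")
    else:
        print(f"Running in group {location[0]} on logical processor {location[1]}")
```

Started once per task slot and logged, a probe along these lines would show directly whether new processes land in one group or are spread across both.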
----- Rom
BOINC Development Team, U.C. Berkeley
ID: 61081 · Report as offensive
Profile Mumak
Joined: 14 Mar 15
Posts: 10
Slovakia
Message 61082 - Posted: 19 Mar 2015, 17:49:11 UTC - in response to Message 61081.  

I'm not sure whether that applies to the way BOINC creates processes. Sometimes it seems that they are spread over nodes, but most of the time they are not.

Here is a screenshot that shows what it looks like when running 72 WCG MCM tasks:
[screenshot not preserved]


As you can see, the usage in Group0 is almost 0, while Group1 is 100%.
When manually checking the affinity of particular MCM processes (via Task Manager), they all seem to be assigned to Group1.
I haven't found a tool that shows the group affinities of all processes in a nice list, and I don't have time to write one.
ID: 61082 · Report as offensive
Rom Walton
Project developer
Joined: 26 Aug 05
Posts: 164
Message 61094 - Posted: 19 Mar 2015, 23:50:41 UTC

Could you run this private drop of BOINC:
http://www.romwnet.org/files/boinc.190315.x64.zip

Thanks in advance.
----- Rom
BOINC Development Team, U.C. Berkeley
ID: 61094 · Report as offensive
Profile Mumak
Joined: 14 Mar 15
Posts: 10
Slovakia
Message 61097 - Posted: 20 Mar 2015, 7:15:19 UTC - in response to Message 61094.  

I'm sorry, but this doesn't seem to have changed anything.
Node0 is at 100%, Node1 at ~0%. A brief check of process affinities confirms that all are running in Group 0.
What have you changed there?
ID: 61097 · Report as offensive


Copyright © 2021 University of California. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.