BOINC doesn't utilize all CPUs on NUMA multi-node systems (>64 threads)
Author | Message |
---|---|
Send message Joined: 14 Mar 15 Posts: 10 |
Hello, I have noticed a problem where BOINC doesn't seem to properly schedule threads on systems that have multiple CPU groups. OS: Win 8.1 x64; CPU: 2 x Intel Xeon E5-2699 v3 (72 threads in total); BOINC v7.4.36 x64; Tasks: WCG on 100% of threads. This system is set up to have 2 CPU groups of 36 threads each, so 1 group per CPU. As per GetLogicalProcessorInformationEx: Group[0]: MaximumProcessorCount=64, ActiveProcessorCount=36, ActiveProcessorMask=0000000FFFFFFFFF; Group[1]: MaximumProcessorCount=64, ActiveProcessorCount=36, ActiveProcessorMask=0000000FFFFFFFFF. Now the problem is that almost all threads are assigned affinity to Group 0 (NUMA Node 0). This can be clearly seen in Task Manager (NUMA Node 0 usage is 100%, Node 1 much lower) and is also confirmed by checking individual threads' affinity settings. I'm not sure whether BOINC manually assigns affinity to each thread, or whether it handles assigning threads to particular groups at all. But I think that if the originating process is set to Group 0 affinity, then all subsequent threads are by default assigned to the same group unless you assign them explicitly via SetThreadGroupAffinity/SetThreadIdealProcessorEx. Thanks, Martin |
Send message Joined: 14 Mar 15 Posts: 10 |
Actually, by "threads" I meant processes in most cases. As a concrete example, there are 72 WCG processes running and (almost) all of them have affinity set to Group (NUMA Node) 0. Group 1 is almost idle. There might be 2 solutions: 1. either start as many instances of the BOINC process as there are groups, each set to a different group affinity (SetThreadGroupAffinity), and create child processes from there; 2. or manage all process/thread affinity explicitly. |
Send message Joined: 14 Mar 15 Posts: 10 |
Is there no interest from BOINC in utilizing more than 64 threads (on Windows)?! |
Send message Joined: 29 Aug 05 Posts: 15477 |
A couple of problems with your posts: 1) No BOINC start-up messages, so we can't see what BOINC is showing. 2) No explanation of what NUMA is. The only NUMA I know is the company that Dirk Pitt works for; it's doubtful you mean that one. And since you now have to look up what I mean, did you mean I have to go look up what you mean? 3) Other than saying you run WCG, you don't specify the sub-project or the amount of memory. Can you even run 64 single threads based on the amount of memory? 4) Or do you want to run work divided over the 64 threads? In that case, you have to ask the project for an application (OpenCL or MT) that can do so. You may want to peruse this thread as well, for the minimum amount of useful info to post to get help around here. |
Send message Joined: 14 Mar 15 Posts: 10 |
I thought that the information I provided is sufficient for technicians to understand the issue at first look. I have even provided the API functions which should be checked/used. 1. BOINC startup (is that relevant in this case?): | Starting BOINC client version 7.4.36 for windows_x86_64 | log flags: file_xfer, sched_ops, task | Libraries: libcurl/7.39.0 OpenSSL/1.0.1j zlib/1.2.8 | Data directory: C:\ProgramData\BOINC | Running under account Administrator | CUDA: NVIDIA GPU 0: Quadro K5200 (driver version 341.21, CUDA version 6.5, compute capability 3.5, 4096MB, 4096MB available, 3553 GFLOPS peak) | OpenCL: NVIDIA GPU 0: Quadro K5200 (driver version 341.21, device version OpenCL 1.1 CUDA, 8192MB, 4096MB available, 3553 GFLOPS peak) | Host name: WIN-99OC5SRSPAM | Processor: 72 GenuineIntel Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz [Family 6 Model 63 Stepping 2] | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 fma cx16 sse4_1 sse4_2 movebe popcnt aes f16c rdrandsyscall nx lm avx avx2 vmx smx tm2 dca pbe fsgsbase bmi1 smep bmi2 | OS: Microsoft Windows Server 2012 R2: Standard x64 Edition, (06.03.9600.00) | Memory: 255.89 GB physical, 257.85 GB virtual | Disk: 185.97 GB total, 133.88 GB free | Local time is UTC -7 hours | VirtualBox version: 4.3.12 | Config: simulate 72 CPUs World Community Grid | URL http://www.worldcommunitygrid.org/; Computer ID 3277834; resource share 100 World Community Grid | General prefs: from World Community Grid (last modified 07-Mar-2015 15:09:57) World Community Grid | Host location: none World Community Grid | General prefs: using your defaults | Reading preferences override file | Preferences: | max memory usage when active: 196525.90MB | max memory usage when idle: 235831.09MB | max disk usage: 10.00GB | (to change preferences, visit a project web site or select Preferences in the Manager) | Not using a proxy
2. NUMA = Non-uniform memory access - http://en.wikipedia.org/wiki/Non-uniform_memory_access; MS: https://msdn.microsoft.com/en-us/library/windows/desktop/aa363804%28v=vs.85%29.aspx 3. Running 72 WCG/MCM tasks. There's enough memory and this issue is not related to memory capacity either. 4. No, as per 3. there are as many tasks running as there are logical CPUs in the system. The issue is that almost all of them are scheduled (by BOINC) on the same NUMA node (CPU group), which overloads that node while the other node is almost idle. The situation can be reproduced even on systems which don't have >64 CPUs, since it should be possible to configure Windows to use multiple groups on any system. |
Send message Joined: 29 Aug 05 Posts: 15477 |
"I thought that the information I provided is sufficient for technicians to understand the issue at first look." From the forum index, second line: "These message boards are frequented by volunteers. It's likely (but not guaranteed) that they'll be able to respond to your questions or suggestions." The BOINC developers don't read here, unless pointed out by me or others. Between us volunteers we can solve quite a lot of problems before we need to ask the developers. We're not all technicians, rocket scientists, and such but we do know our BOINC and are better at helping when we have a basic understanding of what's actually being asked. "1. BOINC startup (is that relevant in this case?)" Yes, that is relevant, because it shows one thing: "Processor: 72 GenuineIntel Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz [Family 6 Model 63 Stepping 2]" All 72 cores are detected by BOINC. "Config: simulate 72 CPUs" I see you set <ncpus>72</ncpus>, which is totally unnecessary when BOINC has already detected the 72 cores. With that option you can tell BOINC to simulate that it has more cores than it actually has, e.g. on a 2 core system set it to 4 and it'll run 4 instances of tasks on those 2 cores. If you want to juggle amount of CPU cores, use the "Use at most N% of the CPUs" option in the preferences. "3. Running 72 WCG/MCM tasks." So then the BOINC client works as it should. You have 72 cores, you have 72 single threads taking up one task and one WCG science application per thread. The science applications determine how they use the memory and how much, not the BOINC client. Meaning that if you want this changed, you'll have to ask WCG to change their way of application memory management. And then if WCG thinks this is a viable request, they can ask BOINC to come up with a different API to make science applications that way. But that's a different BOINC than the client. |
Send message Joined: 14 Mar 15 Posts: 10 |
Ah, I thought the developers watched this forum. I expected you to ask about the <ncpus> directive ;-) It was only added later as a desperate attempt to get things working properly. In fact it doesn't matter whether it's present or not, the effect is always the same. I have to disagree with you on the conclusion, because: 1. Even though I'm not familiar with BOINC internals and who is actually starting science apps/processes, I think it's the BOINC client, as seen in the process tree. So the BOINC client is actually responsible for starting a process and managing its affinity to a particular resource (CPU group, CPU thread). I'm not sure whether any affinity management is implemented now, but I'd rather believe there's none, since it's not needed on systems that have a single NUMA node. 2. This is not just a problem of WCG or of a particular project application. I'm sure the same would happen to tasks from any other project as well. 3. Sure, you can shift the burden to project/application developers, but this way ALL projects would have to do the same: create new apps that are NUMA-aware (to adjust CPU affinity for their processes). Moreover, if multiple CPU projects were running on such a machine (which is a common case), the different projects would somehow have to talk to each other in order not to collide on affinities. I believe such a solution is almost impossible, so it would be best if BOINC managed the CPU group affinity globally for each started application. It's neither a big nor a difficult task. I believe this discussion is worth to be checked by the BOINC devs. |
Send message Joined: 5 Oct 06 Posts: 5077 |
"I have to disagree with you on the conclusion, because:" There is device management at the BOINC client level, for GPUs and other co-processors, but even that is very rudimentary - for example, it isn't (yet) possible for the client to manage tasks automatically if they have different requirements, e.g. if a computer has a mixture of single-precision-only and double-precision-capable GPUs from the same manufacturer. I'm pretty sure that for the time being, the BOINC client treats all CPUs as identical replicas of each other, and delegates all core management to the operating system. "I believe this discussion is worth to be checked by the BOINC devs." Agreed. They'll probably need to consider it as a feature request. |
Send message Joined: 14 Mar 15 Posts: 10 |
Yes - on systems with a single CPU group (NUMA node) nothing special needs to be done: by default a process's affinity covers all hardware threads, and the OS manages the rest. But on systems with multiple groups it's different, since each process has its affinity set to one given group (not all groups), so it can run only within that node. My assumption is that the child processes inherit the group affinity of the parent BOINC process, so I feel the responsibility is on the BOINC client. |
Send message Joined: 14 Mar 15 Posts: 10 |
Issue submitted to BOINC Git: https://github.com/BOINC/boinc/issues/1357 |
Send message Joined: 26 Aug 05 Posts: 164 |
"My assumption is that the child processes inherit the group affinity of the parent BOINC process, so I feel the responsibility is on the BOINC client." According to the published information, applications will be distributed across all the NUMA nodes. See: https://msdn.microsoft.com/en-us/library/windows/hardware/dn653313%28v=vs.85%29.aspx ----- Rom BOINC Development Team, U.C. Berkeley My Blog |
Send message Joined: 14 Mar 15 Posts: 10 |
But that document says otherwise: "Each newly created thread is by default assigned to the same group as the thread that created it. Only the system process is assigned a multigroup affinity at startup time. All other processes must explicitly assign threads to a different group to use the full set of processors in the system. An application that requires the use of multiple groups so that it can run on more than 64 processors must intentionally determine where to run its threads. The application is responsible for setting thread affinities to the desired groups." My experience and testing on that system also confirm the above. If you need any details from such a system, let me know and I will provide them. |
Send message Joined: 26 Aug 05 Posts: 164 |
"But that document says otherwise:" It also says:
Which implies that each time BOINC starts a new process, it'll be assigned to a different NUMA node. I guess we'll need to set up a demo project with an app that can print out which group it has been assigned to, to know what is going on. ----- Rom BOINC Development Team, U.C. Berkeley My Blog |
Send message Joined: 14 Mar 15 Posts: 10 |
I'm not sure whether that applies to the way BOINC creates processes. Sometimes it seems that they are spread over nodes, but most of the time they are not. Here is a screenshot that shows what it looks like when running 72 WCG MCM tasks: As you can see, the usage in Group0 is almost 0, while Group1 is 100%. When manually checking the affinity of particular MCM processes (via Task Manager), they all seem to be assigned to Group1. I haven't found a tool that shows group affinities for all processes in a nice list, and I don't have time to write one. |
Send message Joined: 26 Aug 05 Posts: 164 |
Could you run this private drop of BOINC: http://www.romwnet.org/files/boinc.190315.x64.zip Thanks in advance. ----- Rom BOINC Development Team, U.C. Berkeley My Blog |
Send message Joined: 14 Mar 15 Posts: 10 |
I'm sorry, but this doesn't seem to have changed anything. Node0 is at 100%, Node1 at ~0%. Briefly checking process affinities confirms that all are running in Group 0. What have you changed there? |
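
For readers following the thread, here is a minimal sketch in Python (not BOINC code, and not something any poster supplied) of two points raised above: reading the ActiveProcessorMask values quoted in the first post, and the second workaround proposed there, where the client decides explicitly which processor group each task goes to. The function names `active_processors` and `plan_group_assignment` are hypothetical, and the balancing rule (pick the least-loaded group relative to its size) is an assumption. Applying such a plan on Windows would still need a call such as SetThreadGroupAffinity, which is not shown here.

```python
def active_processors(mask_hex: str) -> list[int]:
    """Return the processor indices whose bits are set in a hex affinity mask.

    The mask is the per-group ActiveProcessorMask as printed in the first
    post, e.g. "0000000FFFFFFFFF".
    """
    mask = int(mask_hex, 16)
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


def plan_group_assignment(num_tasks: int, group_sizes: list[int]) -> list[int]:
    """Return one group index per task, spreading tasks across groups.

    Hypothetical policy (an assumption, not BOINC behaviour): each task goes
    to the group with the lowest load relative to its active processor count,
    with ties broken by the lower group index.
    """
    if any(size <= 0 for size in group_sizes):
        raise ValueError("every group must have at least one active processor")
    loads = [0] * len(group_sizes)
    plan = []
    for _ in range(num_tasks):
        group = min(
            range(len(group_sizes)),
            key=lambda g: (loads[g] / group_sizes[g], g),
        )
        loads[group] += 1
        plan.append(group)
    return plan


if __name__ == "__main__":
    # Layout reported in the thread: two groups with identical masks.
    masks = ["0000000FFFFFFFFF", "0000000FFFFFFFFF"]
    sizes = [len(active_processors(m)) for m in masks]
    plan = plan_group_assignment(72, sizes)
    for group, size in enumerate(sizes):
        print(f"group {group}: {size} processors, {plan.count(group)} tasks")
```

With the layout reported here (two groups of 36 processors), `plan_group_assignment(72, [36, 36])` alternates between the groups and places 36 tasks in each, which is the distribution the original poster expected to see in Task Manager.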
Copyright © 2024 University of California.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License,
Version 1.2 or any later version published by the Free Software Foundation.