aboutsummaryrefslogtreecommitdiff
path: root/include/linux/sched/topology.h
AgeCommit message (Collapse)AuthorFilesLines
2026-03-24sched/topology: Fix sched_domain_span()Peter Zijlstra1-6/+18
Commit 8e8e23dea43e ("sched/topology: Compute sd_weight considering cpuset partitions") ends up relying on the fact that structure initialization should not touch the flexible array. However, the official GCC specification for "Arrays of Length Zero" [*] says: Although the size of a zero-length array is zero, an array member of this kind may increase the size of the enclosing type as a result of tail padding. Additionally, structure initialization will zero tail padding. With the end result that since offsetof(*type, member) < sizeof(*type), array initialization will clobber the flex array. Luckily, the way flexible array sizes are calculated is: sizeof(*type) + count * sizeof(*type->member) This means we have the complete size of the flex array *outside* of sizeof(*type), so use that instead of relying on the broken flex array definition. [*] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html Fixes: 8e8e23dea43e ("sched/topology: Compute sd_weight considering cpuset partitions") Reported-by: Nathan Chancellor <nathan@kernel.org> Debugged-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Jon Hunter <jonathanh@nvidia.com> Tested-by: Chen Yu <yu.c.chen@intel.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: Nathan Chancellor <nathan@kernel.org> Link: https://patch.msgid.link/20260323093627.GY3738010@noisy.programming.kicks-ass.net
2026-03-18sched/topology: Remove sched_domain_shared allocation with sd_dataK Prateek Nayak1-1/+0
Now that "sd->shared" assignments are using the sched_domain_shared objects allocated with s_data, remove the sd_data based allocations. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://patch.msgid.link/20260312044434.1974-6-kprateek.nayak@amd.com
2026-02-23sched/fair: More complex proportional newidle balancePeter Zijlstra1-0/+1
It turns out that a few workloads (easyWave, fio) have a fairly low success rate on newidle balance, but still benefit greatly from having it anyway. Luckliky these workloads have a faily low newidle rate, so the cost if doing the newidle is relatively low, even if unsuccessfull. Add a simple rate based part to the newidle ratio compute, such that low rate newidle will still have a high newidle ratio. This cures the easyWave and fio workloads while not affecting the schbench numbers either (which have a very high newidle rate). Reported-by: Mario Roy <marioeroy@gmail.com> Reported-by: "Mohamed Abuelfotoh, Hazem" <abuehaze@amazon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Mario Roy <marioeroy@gmail.com> Tested-by: "Mohamed Abuelfotoh, Hazem" <abuehaze@amazon.com> Link: https://patch.msgid.link/20260127151748.GA1079264@noisy.programming.kicks-ass.net
2025-11-17sched/fair: Proportional newidle balancePeter Zijlstra1-0/+3
Add a randomized algorithm that runs newidle balancing proportional to its success rate. This improves schbench significantly: 6.18-rc4: 2.22 Mrps/s 6.18-rc4+revert: 2.04 Mrps/s 6.18-rc4+revert+random: 2.18 Mrps/S Conversely, per Adam Li this affects SpecJBB slightly, reducing it by 1%: 6.17: -6% 6.17+revert: 0% 6.17+revert+random: -1% Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Chris Mason <clm@meta.com> Link: https://lkml.kernel.org/r/6825c50d-7fa7-45d8-9b81-c6e7e25738e2@meta.com Link: https://patch.msgid.link/20251107161739.770122091@infradead.org
2025-09-03sched: Move STDL_INIT() functions out-of-linePeter Zijlstra1-42/+7
Since all these functions are address-taken in SDTL_INIT() and called indirectly, it doesn't really make sense for them to be inline. Suggested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-09-03sched/fair: Get rid of sched_domains_curr_level hack for tl->cpumask()Peter Zijlstra1-1/+27
Leon [1] and Vinicius [2] noted a topology_span_sane() warning during their testing starting from v6.16-rc1. Debug that followed pointed to the tl->mask() for the NODE domain being incorrectly resolved to that of the highest NUMA domain. tl->mask() for NODE is set to the sd_numa_mask() which depends on the global "sched_domains_curr_level" hack. "sched_domains_curr_level" is set to the "tl->numa_level" during tl traversal in build_sched_domains() calling sd_init() but was not reset before topology_span_sane(). Since "tl->numa_level" still reflected the old value from build_sched_domains(), topology_span_sane() for the NODE domain trips when the span of the last NUMA domain overlaps. Instead of replicating the "sched_domains_curr_level" hack, get rid of it entirely and instead, pass the entire "sched_domain_topology_level" object to tl->cpumask() function to prevent such mishap in the future. sd_numa_mask() now directly references "tl->numa_level" instead of relying on the global "sched_domains_curr_level" hack to index into sched_domains_numa_masks[]. The original warning was reproducible on the following NUMA topology reported by Leon: $ sudo numactl -H available: 5 nodes (0-4) node 0 cpus: 0 1 node 0 size: 2927 MB node 0 free: 1603 MB node 1 cpus: 2 3 node 1 size: 3023 MB node 1 free: 3008 MB node 2 cpus: 4 5 node 2 size: 3023 MB node 2 free: 3007 MB node 3 cpus: 6 7 node 3 size: 3023 MB node 3 free: 3002 MB node 4 cpus: 8 9 node 4 size: 3022 MB node 4 free: 2718 MB node distances: node 0 1 2 3 4 0: 10 39 38 37 36 1: 39 10 38 37 36 2: 38 38 10 37 36 3: 37 37 37 10 36 4: 36 36 36 36 10 The above topology can be mimicked using the following QEMU cmd that was used to reproduce the warning and test the fix: sudo qemu-system-x86_64 -enable-kvm -cpu host \ -m 20G -smp cpus=10,sockets=10 -machine q35 \ -object memory-backend-ram,size=4G,id=m0 \ -object memory-backend-ram,size=4G,id=m1 \ -object memory-backend-ram,size=4G,id=m2 \ -object memory-backend-ram,size=4G,id=m3 \ -object memory-backend-ram,size=4G,id=m4 \ -numa node,cpus=0-1,memdev=m0,nodeid=0 \ -numa node,cpus=2-3,memdev=m1,nodeid=1 \ -numa node,cpus=4-5,memdev=m2,nodeid=2 \ -numa node,cpus=6-7,memdev=m3,nodeid=3 \ -numa node,cpus=8-9,memdev=m4,nodeid=4 \ -numa dist,src=0,dst=1,val=39 \ -numa dist,src=0,dst=2,val=38 \ -numa dist,src=0,dst=3,val=37 \ -numa dist,src=0,dst=4,val=36 \ -numa dist,src=1,dst=0,val=39 \ -numa dist,src=1,dst=2,val=38 \ -numa dist,src=1,dst=3,val=37 \ -numa dist,src=1,dst=4,val=36 \ -numa dist,src=2,dst=0,val=38 \ -numa dist,src=2,dst=1,val=38 \ -numa dist,src=2,dst=3,val=37 \ -numa dist,src=2,dst=4,val=36 \ -numa dist,src=3,dst=0,val=37 \ -numa dist,src=3,dst=1,val=37 \ -numa dist,src=3,dst=2,val=37 \ -numa dist,src=3,dst=4,val=36 \ -numa dist,src=4,dst=0,val=36 \ -numa dist,src=4,dst=1,val=36 \ -numa dist,src=4,dst=2,val=36 \ -numa dist,src=4,dst=3,val=36 \ ... [ prateek: Moved common functions to include/linux/sched/topology.h, reuse the common bits for s390 and ppc, commit message ] Closes: https://lore.kernel.org/lkml/20250610110701.GA256154@unreal/ [1] Fixes: ccf74128d66c ("sched/topology: Assert non-NUMA topology masks don't (partially) overlap") # ce29a7da84cd, f55dac1dafb3 Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reported-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Tested-by: Valentin Schneider <vschneid@redhat.com> # x86 Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> # powerpc Link: https://lore.kernel.org/lkml/a3de98387abad28592e6ab591f3ff6107fe01dc1.1755893468.git.tim.c.chen@linux.intel.com/ [2]
2025-07-14sched/topology: Remove sched_domain_topology_level::flagsK Prateek Nayak1-3/+0
Support for overlapping domains added in commit e3589f6c81e4 ("sched: Allow for overlapping sched_domain spans") also allowed forcefully setting SD_OVERLAP for !NUMA domains via FORCE_SD_OVERLAP sched_feat(). Since NUMA domains had to be presumed overlapping to ensure correct behavior, "sched_domain_topology_level::flags" was introduced. NUMA domains added the SDTL_OVERLAP flag would ensure SD_OVERLAP was always added during build_sched_domains() for these domains, even when FORCE_SD_OVERLAP was off. Condition for adding the SD_OVERLAP flag at the aforementioned commit was as follows: if (tl->flags & SDTL_OVERLAP || sched_feat(FORCE_SD_OVERLAP)) sd->flags |= SD_OVERLAP; The FORCE_SD_OVERLAP debug feature was removed in commit af85596c74de ("sched/topology: Remove FORCE_SD_OVERLAP") which left the NUMA domains as the exclusive users of SDTL_OVERLAP, SD_OVERLAP, and SD_NUMA flags. Get rid of SDTL_OVERLAP and SD_OVERLAP as they have become redundant and instead rely on SD_NUMA to detect the only overlapping domain currently supported. Since SDTL_OVERLAP was the only user of "tl->flags", get rid of "sched_domain_topology_level::flags" too. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/ba4dbdf8-bc37-493d-b2e0-2efb00ea3e19@amd.com
2025-07-14smpboot: introduce SDTL_INIT() helper to tidy sched topology setupLi Chen1-2/+2
Define a small SDTL_INIT(maskfn, flagsfn, name) macro and use it to build the sched_domain_topology_level array. Purely a cleanup; behaviour is unchanged. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Li Chen <chenl311@chinatelecom.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20250710105715.66594-2-me@linux.beauty
2025-06-13sched/smp: Make SMP unconditionalIngo Molnar1-32/+0
Simplify the scheduler by making CONFIG_SMP=y primitives and data structures unconditional. Introduce transitory wrappers for functionality not yet converted to SMP. Note that this patch is pretty large, because there's no clear separation between various aspects of the SMP scheduler, it's basically a huge block of #ifdef CONFIG_SMP. A fair amount of it has to be switched on for it to boot and work on UP systems. Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20250528080924.2273858-21-mingo@kernel.org
2025-04-16sched/topology: Introduce sched_update_asym_prefer_cpu()K Prateek Nayak1-0/+6
A subset of AMD Processors supporting Preferred Core Rankings also feature the ability to dynamically switch these rankings at runtime to bias load balancing towards or away from the LLC domain with larger cache. To support dynamically updating "sg->asym_prefer_cpu" without needing to rebuild the sched domain, introduce sched_update_asym_prefer_cpu() which recomutes the "asym_prefer_cpu" when the core-ranking of a CPU changes. sched_update_asym_prefer_cpu() swaps the "sg->asym_prefer_cpu" with the CPU whose ranking has changed if the new ranking is greater than that of the "asym_prefer_cpu". If CPU whose ranking has changed is the current "asym_prefer_cpu", it scans the CPUs of the sched groups to find the new "asym_prefer_cpu" and sets it accordingly. get_group() for non-overlapping sched domains returns the sched group for the first CPU in the sched_group_span() which ensures all CPUs in the group see the updated value of "asym_prefer_cpu". Overlapping groups are allocated differently and will require moving the "asym_prefer_cpu" to "sg->sgc" but since the current implementations do not set "SD_ASYM_PACKING" at NUMA domains, skip additional indirection and place a SCHED_WARN_ON() to alert any future users. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250409053446.23367-3-kprateek.nayak@amd.com
2025-03-19sched/debug: Make CONFIG_SCHED_DEBUG functionality unconditionalIngo Molnar1-4/+0
All the big Linux distros enable CONFIG_SCHED_DEBUG, because the various features it provides help not just with kernel development, but with system administration and user-space software development as well. Reflect this reality and enable this functionality unconditionally. Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250317104257.3496611-4-mingo@kernel.org
2025-03-17include/{topology,cpuset}: Move dl_rebuild_rd_accounting to cpuset.hJuri Lelli1-2/+0
dl_rebuild_rd_accounting() is defined in cpuset.c, so it makes more sense to move related declarations to cpuset.h. Implement the move. Suggested-by: Waiman Long <llong@redhat.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Waiman Long <llong@redhat.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Waiman Long <longman@redhat.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/Z9MSOVMpU7jpVrMU@jlelli-thinkpadt14gen4.remote.csb
2025-03-17sched/topology: Stop exposing partition_sched_domains_lockedJuri Lelli1-10/+0
The are no callers of partition_sched_domains_locked() outside topology.c. Stop exposing such function. Suggested-by: Waiman Long <llong@redhat.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Waiman Long <longman@redhat.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/Z9MSC96a8FcqWV3G@jlelli-thinkpadt14gen4.remote.csb
2025-03-17sched/deadline: Rebuild root domain accounting after every updateJuri Lelli1-0/+2
Rebuilding of root domains accounting information (total_bw) is currently broken on some cases, e.g. suspend/resume on aarch64. Problem is that the way we keep track of domain changes and try to add bandwidth back is convoluted and fragile. Fix it by simplify things by making sure bandwidth accounting is cleared and completely restored after root domains changes (after root domains are again stable). To be sure we always call dl_rebuild_rd_accounting while holding cpuset_mutex we also add cpuset_reset_sched_domains() wrapper. Fixes: 53916d5fd3c0 ("sched/deadline: Check bandwidth overflow earlier for hotplug") Reported-by: Jon Hunter <jonathanh@nvidia.com> Co-developed-by: Waiman Long <llong@redhat.com> Signed-off-by: Waiman Long <llong@redhat.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/Z9MRfeJKJUOyUSto@jlelli-thinkpadt14gen4.remote.csb
2024-12-20sched: Move sched domain name out of CONFIG_SCHED_DEBUGSwapnil Sapkal1-8/+0
/proc/schedstat file shows cpu and sched domain level scheduler statistics. It does not show domain name instead shows domain level. It will be very useful for tools like `perf sched stats`[1] to aggragate domain level stats if domain names are shown in /proc/schedstat. But sched domain name is guarded by CONFIG_SCHED_DEBUG. As per the discussion[2], move sched domain name out of CONFIG_SCHED_DEBUG. [1] https://lore.kernel.org/lkml/20241122084452.1064968-1-swapnil.sapkal@amd.com/ [2] https://lore.kernel.org/lkml/fcefeb4d-3acb-462d-9c9b-3df8d927e522@amd.com/ Suggested-by: "Gautham R. Shenoy" <gautham.shenoy@amd.com> Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20241220063224.17767-5-swapnil.sapkal@amd.com
2024-12-20sched: Report the different kinds of imbalances in /proc/schedstatSwapnil Sapkal1-1/+4
In /proc/schedstat, lb_imbalance reports the sum of imbalances discovered in sched domains with each call to sched_balance_rq(), which is not very useful because lb_imbalance does not mention whether the imbalance is due to load, utilization, nr_tasks or misfit_tasks. Remove this field from /proc/schedstat. Currently there is no field in /proc/schedstat to report different types of imbalances. Introduce new fields in /proc/schedstat to report the total imbalances in load, utilization, nr_tasks or misfit_tasks. Added fields to /proc/schedstat: - lb_imbalance_load: Total imbalance due to load. - lb_imbalance_util: Total imbalance due to utilization. - lb_imbalance_task: Total imbalance due to number of tasks. - lb_imbalance_misfit: Total imbalance due to misfit tasks. Signed-off-by: Swapnil Sapkal <swapnil.sapkal@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://lore.kernel.org/r/20241220063224.17767-4-swapnil.sapkal@amd.com
2024-04-24sched/cpufreq: Rename arch_update_thermal_pressure() => ↵Vincent Guittot1-4/+4
arch_update_hw_pressure() Now that cpufreq provides a pressure value to the scheduler, rename arch_update_thermal_pressure into HW pressure to reflect that it returns a pressure applied by HW (i.e. with a high frequency change) and not always related to thermal mitigation but also generated by max current limitation as an example. Such high frequency signal needs filtering to be smoothed and provide an value that reflects the average available capacity into the scheduler time scale. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://lore.kernel.org/r/20240326091616.3696851-5-vincent.guittot@linaro.org
2024-03-12sched/balancing: Rename load_balance() => sched_balance_rq()Ingo Molnar1-1/+1
Standardize scheduler load-balancing function names on the sched_balance_() prefix. Also load_balance() has become somewhat of a misnomer: historically it was the first and primary load-balancing function that was called, but with the introduction of sched domains, it's become a lower layer function that balances runqueues. Rename it to sched_balance_rq() accordingly. Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://lore.kernel.org/r/20240308111819.1101550-6-mingo@kernel.org
2024-03-11Merge tag 'sched-core-2024-03-11' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - Fix inconsistency in misfit task load-balancing - Fix CPU isolation bugs in the task-wakeup logic - Rework and unify the sched_use_asym_prio() and sched_asym_prefer() logic - Clean up and simplify ->avg_* accesses - Misc cleanups and fixes * tag 'sched-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/topology: Rename SD_SHARE_PKG_RESOURCES to SD_SHARE_LLC sched/fair: Check the SD_ASYM_PACKING flag in sched_use_asym_prio() sched/fair: Rework sched_use_asym_prio() and sched_asym_prefer() sched/fair: Remove unused parameter from sched_asym() sched/topology: Remove duplicate descriptions from TOPOLOGY_SD_FLAGS sched/fair: Simplify the update_sd_pick_busiest() logic sched/fair: Do strict inequality check for busiest misfit task group sched/fair: Remove unnecessary goto in update_sd_lb_stats() sched/fair: Take the scheduling domain into account in select_idle_core() sched/fair: Take the scheduling domain into account in select_idle_smt() sched/fair: Add READ_ONCE() and use existing helper function to access ->avg_irq sched/fair: Use existing helper functions to access ->avg_rt and ->avg_dl sched/core: Simplify code by removing duplicate #ifdefs
2024-02-28sched/topology: Rename SD_SHARE_PKG_RESOURCES to SD_SHARE_LLCAlex Shi1-3/+3
SD_SHARE_PKG_RESOURCES is a bit of a misnomer: its naming suggests that it's sharing all 'package resources' - while in reality it's specifically for sharing the LLC only. Rename it to SD_SHARE_LLC to reduce confusion. [ mingo: Rewrote the confusing changelog as well. ] Suggested-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Alex Shi <alexs@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Reviewed-by: Barry Song <baohua@kernel.org> Link: https://lore.kernel.org/r/20240210113924.1130448-5-alexs@kernel.org
2024-02-24sched: Add a new function to compare if two cpus have the same capacityQais Yousef1-0/+6
The new helper function is needed to help blk-mq check if it needs to dispatch the softirq on another CPU to match the performance level the IO requester is running at. This is important on HMP systems where not all CPUs have the same compute capacity. Signed-off-by: Qais Yousef <qyousef@layalina.io> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240223155749.2958009-2-qyousef@layalina.io Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-23sched/topology: Add a new arch_scale_freq_ref() methodVincent Guittot1-0/+8
Create a new method to get a unique and fixed max frequency. Currently cpuinfo.max_freq or the highest (or last) state of performance domain are used as the max frequency when computing the frequency for a level of utilization, but: - cpuinfo_max_freq can change at runtime. boost is one example of such change. - cpuinfo.max_freq and last item of the PD can be different leading to different results between cpufreq and energy model. We need to save the reference frequency that has been used when computing the CPUs capacity and use this fixed and coherent value to convert between frequency and CPU's capacity. In fact, we already save the frequency that has been used when computing the capacity of each CPU. We extend the precision to save kHz instead of MHz currently and we modify the type to be aligned with other variables used when converting frequency to capacity and the other way. [ mingo: Minor edits. ] Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Lukasz Luba <lukasz.luba@arm.com> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Acked-by: Sudeep Holla <sudeep.holla@arm.com> Link: https://lore.kernel.org/r/20231211104855.558096-2-vincent.guittot@linaro.org
2023-10-24sched/fair: Remove SIS_PROPPeter Zijlstra1-2/+0
SIS_UTIL seems to work well, lets remove the old thing. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20231020134337.GD33965@noisy.programming.kicks-ass.net
2023-10-24sched: Add cpus_share_resources APIBarry Song1-1/+7
Add cpus_share_resources() API. This is the preparation for the optimization of select_idle_cpu() on platforms with cluster scheduler level. On a machine with clusters cpus_share_resources() will test whether two cpus are within the same cluster. On a non-cluster machine it will behaves the same as cpus_share_cache(). So we use "resources" here for cache resources. Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-and-reviewed-by: Chen Yu <yu.c.chen@intel.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lkml.kernel.org/r/20231019033323.54147-2-yangyicong@huawei.com
2023-06-16sched/topology: Mark set_sched_topology() __initMiaohe Lin1-1/+1
All callers of set_sched_topology() are within __init section. Mark it __init too. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Link: https://lore.kernel.org/r/20230603073645.1173332-1-linmiaohe@huawei.com
2022-06-28sched/fair: Introduce SIS_UTIL to search idle CPU based on sum of util_avgChen Yu1-0/+1
[Problem Statement] select_idle_cpu() might spend too much time searching for an idle CPU, when the system is overloaded. The following histogram is the time spent in select_idle_cpu(), when running 224 instances of netperf on a system with 112 CPUs per LLC domain: @usecs: [0] 533 | | [1] 5495 | | [2, 4) 12008 | | [4, 8) 239252 | | [8, 16) 4041924 |@@@@@@@@@@@@@@ | [16, 32) 12357398 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [32, 64) 14820255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [64, 128) 13047682 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [128, 256) 8235013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | [256, 512) 4507667 |@@@@@@@@@@@@@@@ | [512, 1K) 2600472 |@@@@@@@@@ | [1K, 2K) 927912 |@@@ | [2K, 4K) 218720 | | [4K, 8K) 98161 | | [8K, 16K) 37722 | | [16K, 32K) 6715 | | [32K, 64K) 477 | | [64K, 128K) 7 | | netperf latency usecs: ======= case load Lat_99th std% TCP_RR thread-224 257.39 ( 0.21) The time spent in select_idle_cpu() is visible to netperf and might have a negative impact. [Symptom analysis] The patch [1] from Mel Gorman has been applied to track the efficiency of select_idle_sibling. Copy the indicators here: SIS Search Efficiency(se_eff%): A ratio expressed as a percentage of runqueues scanned versus idle CPUs found. A 100% efficiency indicates that the target, prev or recent CPU of a task was idle at wakeup. The lower the efficiency, the more runqueues were scanned before an idle CPU was found. SIS Domain Search Efficiency(dom_eff%): Similar, except only for the slower SIS patch. SIS Fast Success Rate(fast_rate%): Percentage of SIS that used target, prev or recent CPUs. SIS Success rate(success_rate%): Percentage of scans that found an idle CPU. The test is based on Aubrey's schedtests tool, including netperf, hackbench, schbench and tbench. Test on vanilla kernel: schedstat_parse.py -f netperf_vanilla.log case load se_eff% dom_eff% fast_rate% success_rate% TCP_RR 28 threads 99.978 18.535 99.995 100.000 TCP_RR 56 threads 99.397 5.671 99.964 100.000 TCP_RR 84 threads 21.721 6.818 73.632 100.000 TCP_RR 112 threads 12.500 5.533 59.000 100.000 TCP_RR 140 threads 8.524 4.535 49.020 100.000 TCP_RR 168 threads 6.438 3.945 40.309 99.999 TCP_RR 196 threads 5.397 3.718 32.320 99.982 TCP_RR 224 threads 4.874 3.661 25.775 99.767 UDP_RR 28 threads 99.988 17.704 99.997 100.000 UDP_RR 56 threads 99.528 5.977 99.970 100.000 UDP_RR 84 threads 24.219 6.992 76.479 100.000 UDP_RR 112 threads 13.907 5.706 62.538 100.000 UDP_RR 140 threads 9.408 4.699 52.519 100.000 UDP_RR 168 threads 7.095 4.077 44.352 100.000 UDP_RR 196 threads 5.757 3.775 35.764 99.991 UDP_RR 224 threads 5.124 3.704 28.748 99.860 schedstat_parse.py -f schbench_vanilla.log (each group has 28 tasks) case load se_eff% dom_eff% fast_rate% success_rate% normal 1 mthread 99.152 6.400 99.941 100.000 normal 2 mthreads 97.844 4.003 99.908 100.000 normal 3 mthreads 96.395 2.118 99.917 99.998 normal 4 mthreads 55.288 1.451 98.615 99.804 normal 5 mthreads 7.004 1.870 45.597 61.036 normal 6 mthreads 3.354 1.346 20.777 34.230 normal 7 mthreads 2.183 1.028 11.257 21.055 normal 8 mthreads 1.653 0.825 7.849 15.549 schedstat_parse.py -f hackbench_vanilla.log (each group has 28 tasks) case load se_eff% dom_eff% fast_rate% success_rate% process-pipe 1 group 99.991 7.692 99.999 100.000 process-pipe 2 groups 99.934 4.615 99.997 100.000 process-pipe 3 groups 99.597 3.198 99.987 100.000 process-pipe 4 groups 98.378 2.464 99.958 100.000 process-pipe 5 groups 27.474 3.653 89.811 99.800 process-pipe 6 groups 20.201 4.098 82.763 99.570 process-pipe 7 groups 16.423 4.156 77.398 99.316 process-pipe 8 groups 13.165 3.920 72.232 98.828 process-sockets 1 group 99.977 5.882 99.999 100.000 process-sockets 2 groups 99.927 5.505 99.996 100.000 process-sockets 3 groups 99.397 3.250 99.980 100.000 process-sockets 4 groups 79.680 4.258 98.864 99.998 process-sockets 5 groups 7.673 2.503 63.659 92.115 process-sockets 6 groups 4.642 1.584 58.946 88.048 process-sockets 7 groups 3.493 1.379 49.816 81.164 process-sockets 8 groups 3.015 1.407 40.845 75.500 threads-pipe 1 group 99.997 0.000 100.000 100.000 threads-pipe 2 groups 99.894 2.932 99.997 100.000 threads-pipe 3 groups 99.611 4.117 99.983 100.000 threads-pipe 4 groups 97.703 2.624 99.937 100.000 threads-pipe 5 groups 22.919 3.623 87.150 99.764 threads-pipe 6 groups 18.016 4.038 80.491 99.557 threads-pipe 7 groups 14.663 3.991 75.239 99.247 threads-pipe 8 groups 12.242 3.808 70.651 98.644 threads-sockets 1 group 99.990 6.667 99.999 100.000 threads-sockets 2 groups 99.940 5.114 99.997 100.000 threads-sockets 3 groups 99.469 4.115 99.977 100.000 threads-sockets 4 groups 87.528 4.038 99.400 100.000 threads-sockets 5 groups 6.942 2.398 59.244 88.337 threads-sockets 6 groups 4.359 1.954 49.448 87.860 threads-sockets 7 groups 2.845 1.345 41.198 77.102 threads-sockets 8 groups 2.871 1.404 38.512 74.312 schedstat_parse.py -f tbench_vanilla.log case load se_eff% dom_eff% fast_rate% success_rate% loopback 28 threads 99.976 18.369 99.995 100.000 loopback 56 threads 99.222 7.799 99.934 100.000 loopback 84 threads 19.723 6.819 70.215 100.000 loopback 112 threads 11.283 5.371 55.371 99.999 loopback 140 threads 0.000 0.000 0.000 0.000 loopback 168 threads 0.000 0.000 0.000 0.000 loopback 196 threads 0.000 0.000 0.000 0.000 loopback 224 threads 0.000 0.000 0.000 0.000 According to the test above, if the system becomes busy, the SIS Search Efficiency(se_eff%) drops significantly. Although some benchmarks would finally find an idle CPU(success_rate% = 100%), it is doubtful whether it is worth it to search the whole LLC domain. [Proposal] It would be ideal to have a crystal ball to answer this question: How many CPUs must a wakeup path walk down, before it can find an idle CPU? Many potential metrics could be used to predict the number. One candidate is the sum of util_avg in this LLC domain. The benefit of choosing util_avg is that it is a metric of accumulated historic activity, which seems to be smoother than instantaneous metrics (such as rq->nr_running). Besides, choosing the sum of util_avg would help predict the load of the LLC domain more precisely, because SIS_PROP uses one CPU's idle time to estimate the total LLC domain idle time. In summary, the lower the util_avg is, the more select_idle_cpu() should scan for idle CPU, and vice versa. When the sum of util_avg in this LLC domain hits 85% or above, the scan stops. The reason to choose 85% as the threshold is that this is the imbalance_pct(117) when a LLC sched group is overloaded. Introduce the quadratic function: y = SCHED_CAPACITY_SCALE - p * x^2 and y'= y / SCHED_CAPACITY_SCALE x is the ratio of sum_util compared to the CPU capacity: x = sum_util / (llc_weight * SCHED_CAPACITY_SCALE) y' is the ratio of CPUs to be scanned in the LLC domain, and the number of CPUs to scan is calculated by: nr_scan = llc_weight * y' Choosing quadratic function is because: [1] Compared to the linear function, it scans more aggressively when the sum_util is low. [2] Compared to the exponential function, it is easier to calculate. [3] It seems that there is no accurate mapping between the sum of util_avg and the number of CPUs to be scanned. Use heuristic scan for now. For a platform with 112 CPUs per LLC, the number of CPUs to scan is: sum_util% 0 5 15 25 35 45 55 65 75 85 86 ... scan_nr 112 111 108 102 93 81 65 47 25 1 0 ... For a platform with 16 CPUs per LLC, the number of CPUs to scan is: sum_util% 0 5 15 25 35 45 55 65 75 85 86 ... scan_nr 16 15 15 14 13 11 9 6 3 0 0 ... Furthermore, to minimize the overhead of calculating the metrics in select_idle_cpu(), borrow the statistics from periodic load balance. As mentioned by Abel, on a platform with 112 CPUs per LLC, the sum_util calculated by periodic load balance after 112 ms would decay to about 0.5 * 0.5 * 0.5 * 0.7 = 8.75%, thus bringing a delay in reflecting the latest utilization. But it is a trade-off. Checking the util_avg in newidle load balance would be more frequent, but it brings overhead - multiple CPUs write/read the per-LLC shared variable and introduces cache contention. Tim also mentioned that, it is allowed to be non-optimal in terms of scheduling for the short-term variations, but if there is a long-term trend in the load behavior, the scheduler can adjust for that. When SIS_UTIL is enabled, the select_idle_cpu() uses the nr_scan calculated by SIS_UTIL instead of the one from SIS_PROP. As Peter and Mel suggested, SIS_UTIL should be enabled by default. This patch is based on the util_avg, which is very sensitive to the CPU frequency invariance. There is an issue that, when the max frequency has been clamp, the util_avg would decay insanely fast when the CPU is idle. Commit addca285120b ("cpufreq: intel_pstate: Handle no_turbo in frequency invariance") could be used to mitigate this symptom, by adjusting the arch_max_freq_ratio when turbo is disabled. But this issue is still not thoroughly fixed, because the current code is unaware of the user-specified max CPU frequency. [Test result] netperf and tbench were launched with 25% 50% 75% 100% 125% 150% 175% 200% of CPU number respectively. Hackbench and schbench were launched by 1, 2 ,4, 8 groups. Each test lasts for 100 seconds and repeats 3 times. The following is the benchmark result comparison between baseline:vanilla v5.19-rc1 and compare:patched kernel. Positive compare% indicates better performance. Each netperf test is a: netperf -4 -H 127.0.1 -t TCP/UDP_RR -c -C -l 100 netperf.throughput ======= case load baseline(std%) compare%( std%) TCP_RR 28 threads 1.00 ( 0.34) -0.16 ( 0.40) TCP_RR 56 threads 1.00 ( 0.19) -0.02 ( 0.20) TCP_RR 84 threads 1.00 ( 0.39) -0.47 ( 0.40) TCP_RR 112 threads 1.00 ( 0.21) -0.66 ( 0.22) TCP_RR 140 threads 1.00 ( 0.19) -0.69 ( 0.19) TCP_RR 168 threads 1.00 ( 0.18) -0.48 ( 0.18) TCP_RR 196 threads 1.00 ( 0.16) +194.70 ( 16.43) TCP_RR 224 threads 1.00 ( 0.16) +197.30 ( 7.85) UDP_RR 28 threads 1.00 ( 0.37) +0.35 ( 0.33) UDP_RR 56 threads 1.00 ( 11.18) -0.32 ( 0.21) UDP_RR 84 threads 1.00 ( 1.46) -0.98 ( 0.32) UDP_RR 112 threads 1.00 ( 28.85) -2.48 ( 19.61) UDP_RR 140 threads 1.00 ( 0.70) -0.71 ( 14.04) UDP_RR 168 threads 1.00 ( 14.33) -0.26 ( 11.16) UDP_RR 196 threads 1.00 ( 12.92) +186.92 ( 20.93) UDP_RR 224 threads 1.00 ( 11.74) +196.79 ( 18.62) Take the 224 threads as an example, the SIS search metrics changes are illustrated below: vanilla patched 4544492 +237.5% 15338634 sched_debug.cpu.sis_domain_search.avg 38539 +39686.8% 15333634 sched_debug.cpu.sis_failed.avg 128300000 -87.9% 15551326 sched_debug.cpu.sis_scanned.avg 5842896 +162.7% 15347978 sched_debug.cpu.sis_search.avg There is -87.9% less CPU scans after patched, which indicates lower overhead. Besides, with this patch applied, there is -13% less rq lock contention in perf-profile.calltrace.cycles-pp._raw_spin_lock.raw_spin_rq_lock_nested .try_to_wake_up.default_wake_function.woken_wake_function. This might help explain the performance improvement - Because this patch allows the waking task to remain on the previous CPU, rather than grabbing other CPUs' lock. Each hackbench test is a: hackbench -g $job --process/threads --pipe/sockets -l 1000000 -s 100 hackbench.throughput ========= case load baseline(std%) compare%( std%) process-pipe 1 group 1.00 ( 1.29) +0.57 ( 0.47) process-pipe 2 groups 1.00 ( 0.27) +0.77 ( 0.81) process-pipe 4 groups 1.00 ( 0.26) +1.17 ( 0.02) process-pipe 8 groups 1.00 ( 0.15) -4.79 ( 0.02) process-sockets 1 group 1.00 ( 0.63) -0.92 ( 0.13) process-sockets 2 groups 1.00 ( 0.03) -0.83 ( 0.14) process-sockets 4 groups 1.00 ( 0.40) +5.20 ( 0.26) process-sockets 8 groups 1.00 ( 0.04) +3.52 ( 0.03) threads-pipe 1 group 1.00 ( 1.28) +0.07 ( 0.14) threads-pipe 2 groups 1.00 ( 0.22) -0.49 ( 0.74) threads-pipe 4 groups 1.00 ( 0.05) +1.88 ( 0.13) threads-pipe 8 groups 1.00 ( 0.09) -4.90 ( 0.06) threads-sockets 1 group 1.00 ( 0.25) -0.70 ( 0.53) threads-sockets 2 groups 1.00 ( 0.10) -0.63 ( 0.26) threads-sockets 4 groups 1.00 ( 0.19) +11.92 ( 0.24) threads-sockets 8 groups 1.00 ( 0.08) +4.31 ( 0.11) Each tbench test is a: tbench -t 100 $job 127.0.0.1 tbench.throughput ====== case load baseline(std%) compare%( std%) loopback 28 threads 1.00 ( 0.06) -0.14 ( 0.09) loopback 56 threads 1.00 ( 0.03) -0.04 ( 0.17) loopback 84 threads 1.00 ( 0.05) +0.36 ( 0.13) loopback 112 threads 1.00 ( 0.03) +0.51 ( 0.03) loopback 140 threads 1.00 ( 0.02) -1.67 ( 0.19) loopback 168 threads 1.00 ( 0.38) +1.27 ( 0.27) loopback 196 threads 1.00 ( 0.11) +1.34 ( 0.17) loopback 224 threads 1.00 ( 0.11) +1.67 ( 0.22) Each schbench test is a: schbench -m $job -t 28 -r 100 -s 30000 -c 30000 schbench.latency_90%_us ======== case load baseline(std%) compare%( std%) normal 1 mthread 1.00 ( 31.22) -7.36 ( 20.25)* normal 2 mthreads 1.00 ( 2.45) -0.48 ( 1.79) normal 4 mthreads 1.00 ( 1.69) +0.45 ( 0.64) normal 8 mthreads 1.00 ( 5.47) +9.81 ( 14.28) *Consider the Standard Deviation, this -7.36% regression might not be valid. Also, a OLTP workload with a commercial RDBMS has been tested, and there is no significant change. There were concerns that unbalanced tasks among CPUs would cause problems. For example, suppose the LLC domain is composed of 8 CPUs, and 7 tasks are bound to CPU0~CPU6, while CPU7 is idle: CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 util_avg 1024 1024 1024 1024 1024 1024 1024 0 Since the util_avg ratio is 87.5%( = 7/8 ), which is higher than 85%, select_idle_cpu() will not scan, thus CPU7 is undetected during scan. But according to Mel, it is unlikely the CPU7 will be idle all the time because CPU7 could pull some tasks via CPU_NEWLY_IDLE. lkp(kernel test robot) has reported a regression on stress-ng.sock on a very busy system. According to the sched_debug statistics, it might be caused by SIS_UTIL terminates the scan and chooses a previous CPU earlier, and this might introduce more context switch, especially involuntary preemption, which impacts a busy stress-ng. This regression has shown that, not all benchmarks in every scenario benefit from idle CPU scan limit, and it needs further investigation. Besides, there is slight regression in hackbench's 16 groups case when the LLC domain has 16 CPUs. Prateek mentioned that we should scan aggressively in an LLC domain with 16 CPUs. Because the cost to search for an idle one among 16 CPUs is negligible. The current patch aims to propose a generic solution and only considers the util_avg. Something like the below could be applied on top of the current patch to fulfill the requirement: if (llc_weight <= 16) nr_scan = nr_scan * 32 / llc_weight; For LLC domain with 16 CPUs, the nr_scan will be expanded to 2 times large. The smaller the CPU number this LLC domain has, the larger nr_scan will be expanded. This needs further investigation. There is also ongoing work[2] from Abel to filter out the busy CPUs during wakeup, to further speed up the idle CPU scan. And it could be a following-up optimization on top of this change. Suggested-by: Tim Chen <tim.c.chen@intel.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Yicong Yang <yangyicong@hisilicon.com> Tested-by: Mohini Narkhede <mohini.narkhede@intel.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20220612163428.849378-1-yu.c.chen@intel.com
2022-02-11sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCsMel Gorman1-0/+1
Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA nodes") allowed an imbalance between NUMA nodes such that communicating tasks would not be pulled apart by the load balancer. This works fine when there is a 1:1 relationship between LLC and node but can be suboptimal for multiple LLCs if independent tasks prematurely use CPUs sharing cache. Zen* has multiple LLCs per node with local memory channels and due to the allowed imbalance, it's far harder to tune some workloads to run optimally than it is on hardware that has 1 LLC per node. This patch allows an imbalance to exist up to the point where LLCs should be balanced between nodes. On a Zen3 machine running STREAM parallelised with OMP to have on instance per LLC the results and without binding, the results are 5.17.0-rc0 5.17.0-rc0 vanilla sched-numaimb-v6 MB/sec copy-16 162596.94 ( 0.00%) 580559.74 ( 257.05%) MB/sec scale-16 136901.28 ( 0.00%) 374450.52 ( 173.52%) MB/sec add-16 157300.70 ( 0.00%) 564113.76 ( 258.62%) MB/sec triad-16 151446.88 ( 0.00%) 564304.24 ( 272.61%) STREAM can use directives to force the spread if the OpenMP is new enough but that doesn't help if an application uses threads and it's not known in advance how many threads will be created. Coremark is a CPU and cache intensive benchmark parallelised with threads. When running with 1 thread per core, the vanilla kernel allows threads to contend on cache. With the patch; 5.17.0-rc0 5.17.0-rc0 vanilla sched-numaimb-v5 Min Score-16 368239.36 ( 0.00%) 389816.06 ( 5.86%) Hmean Score-16 388607.33 ( 0.00%) 427877.08 * 10.11%* Max Score-16 408945.69 ( 0.00%) 481022.17 ( 17.62%) Stddev Score-16 15247.04 ( 0.00%) 24966.82 ( -63.75%) CoeffVar Score-16 3.92 ( 0.00%) 5.82 ( -48.48%) It can also make a big difference for semi-realistic workloads like specjbb which can execute arbitrary numbers of threads without advance knowledge of how they should be placed. Even in cases where the average performance is neutral, the results are more stable. 5.17.0-rc0 5.17.0-rc0 vanilla sched-numaimb-v6 Hmean tput-1 71631.55 ( 0.00%) 73065.57 ( 2.00%) Hmean tput-8 582758.78 ( 0.00%) 556777.23 ( -4.46%) Hmean tput-16 1020372.75 ( 0.00%) 1009995.26 ( -1.02%) Hmean tput-24 1416430.67 ( 0.00%) 1398700.11 ( -1.25%) Hmean tput-32 1687702.72 ( 0.00%) 1671357.04 ( -0.97%) Hmean tput-40 1798094.90 ( 0.00%) 2015616.46 * 12.10%* Hmean tput-48 1972731.77 ( 0.00%) 2333233.72 ( 18.27%) Hmean tput-56 2386872.38 ( 0.00%) 2759483.38 ( 15.61%) Hmean tput-64 2909475.33 ( 0.00%) 2925074.69 ( 0.54%) Hmean tput-72 2585071.36 ( 0.00%) 2962443.97 ( 14.60%) Hmean tput-80 2994387.24 ( 0.00%) 3015980.59 ( 0.72%) Hmean tput-88 3061408.57 ( 0.00%) 3010296.16 ( -1.67%) Hmean tput-96 3052394.82 ( 0.00%) 2784743.41 ( -8.77%) Hmean tput-104 2997814.76 ( 0.00%) 2758184.50 ( -7.99%) Hmean tput-112 2955353.29 ( 0.00%) 2859705.09 ( -3.24%) Hmean tput-120 2889770.71 ( 0.00%) 2764478.46 ( -4.34%) Hmean tput-128 2871713.84 ( 0.00%) 2750136.73 ( -4.23%) Stddev tput-1 5325.93 ( 0.00%) 2002.53 ( 62.40%) Stddev tput-8 6630.54 ( 0.00%) 10905.00 ( -64.47%) Stddev tput-16 25608.58 ( 0.00%) 6851.16 ( 73.25%) Stddev tput-24 12117.69 ( 0.00%) 4227.79 ( 65.11%) Stddev tput-32 27577.16 ( 0.00%) 8761.05 ( 68.23%) Stddev tput-40 59505.86 ( 0.00%) 2048.49 ( 96.56%) Stddev tput-48 168330.30 ( 0.00%) 93058.08 ( 44.72%) Stddev tput-56 219540.39 ( 0.00%) 30687.02 ( 86.02%) Stddev tput-64 121750.35 ( 0.00%) 9617.36 ( 92.10%) Stddev tput-72 223387.05 ( 0.00%) 34081.13 ( 84.74%) Stddev tput-80 128198.46 ( 0.00%) 22565.19 ( 82.40%) Stddev tput-88 136665.36 ( 0.00%) 27905.97 ( 79.58%) Stddev tput-96 111925.81 ( 0.00%) 99615.79 ( 11.00%) Stddev tput-104 146455.96 ( 0.00%) 28861.98 ( 80.29%) Stddev tput-112 88740.49 ( 0.00%) 58288.23 ( 34.32%) Stddev tp