aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2025-09-03sched_ext: Introduce scx_bpf_cpu_curr()Christian Loehle1-0/+14
Provide scx_bpf_cpu_curr() as a way for scx schedulers to check the curr task of a remote rq without assuming its lock is held. Many scx schedulers make use of scx_bpf_cpu_rq() to check a remote curr (e.g. to see if it should be preempted). This is problematic because scx_bpf_cpu_rq() provides access to all fields of struct rq, most of which aren't safe to use without holding the associated rq lock. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-03sched_ext: Introduce scx_bpf_locked_rq()Christian Loehle1-0/+23
Most fields in scx_bpf_cpu_rq() assume that its rq_lock is held. Furthermore they become meaningless without rq lock, too. Make a safer version of scx_bpf_cpu_rq() that only returns a rq if we hold rq lock of that rq. Also mark the new scx_bpf_locked_rq() as returning NULL as scx_bpf_cpu_rq() should've been too. Signed-off-by: Christian Loehle <christian.loehle@arm.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-03sched_ext: Use cgroup_lock/unlock() to synchronize against cgroup operationsTejun Heo3-56/+14
SCX hooks into CPU cgroup controller operations and read-locks scx_cgroup_rwsem to exclude them while enabling and disable schedulers. While this works, it's unnecessarily complicated given that cgroup_[un]lock() are available and thus the cgroup operations can be locked out that way. Drop scx_cgroup_rwsem locking from the tg on/offline and cgroup [can_]attach operations. Instead, grab cgroup_lock() from scx_cgroup_lock(). Drop scx_cgroup_finish_attach() which is no longer necessary. Drop the now unnecessary rcu locking and css ref bumping in scx_cgroup_init() and scx_cgroup_exit(). As scx_cgroup_set_weight/bandwidth() paths aren't protected by cgroup_lock(), rename scx_cgroup_rwsem to scx_cgroup_ops_rwsem and retain the locking there. This is overall simpler and will also allow enable/disable paths to synchronize against cgroup changes independent of the CPU controller. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-09-03sched_ext: Put event_stats_cpu in struct scx_sched_pcpuTejun Heo2-16/+19
scx_sched.event_stats_cpu is the percpu counters that are used to track stats. Introduce struct scx_sched_pcpu and move the counters inside. This will ease adding more per-cpu fields. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-09-03sched_ext: Move internal type and accessor definitions to ext_internal.hTejun Heo4-1057/+1062
There currently isn't a place to place SCX-internal types and accessors to be shared between ext.c and ext_idle.c. Create kernel/sched/ext_internal.h and move internal type and accessor definitions there. This trims ext.c a bit and makes future additions easier. Pure code reorganization. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-09-03sched_ext: Keep bypass on between enable failure and scx_disable_workfn()Tejun Heo1-1/+1
scx_enable() turns on the bypass mode while enable is in progress. If enabling fails, it turns off the bypass mode and then triggers scx_error(). scx_error() will trigger scx_disable_workfn() which will turn on the bypass mode again and unload the failed scheduler. This moves the system out of bypass mode between the enable error path and the disable path, which is unnecessary and can be brittle - e.g. the thread running scx_enable() may already be on the failed scheduler and can be switched out before it triggers scx_error() leading to a stall. The watchdog would eventually kick in, so the situation isn't critical but is still suboptimal. There is nothing to be gained by turning off the bypass mode between scx_enable() failure and scx_disable_workfn(). Keep bypass on. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-09-03sched_ext: Make explicit scx_task_iter_relock() calls unnecessaryTejun Heo1-20/+23
During tasks iteration, the locks can be dropped using scx_task_iter_unlock() to perform e.g. sleepable allocations. Afterwards, scx_task_iter_relock() has to be called prior to other iteration operations, which is error-prone. This can be easily automated by tracking whether scx_tasks_lock is held in scx_task_iter and re-acquiring when necessary. It already tracks whether the task's rq is locked after all. - Add scx_task_iter->list_locked which remembers whether scx_tasks_lock is held. - Rename scx_task_iter->locked to scx_task_iter->locked_task to better distinguish it from ->list_locked. - Replace scx_task_iter_relock() with __scx_task_iter_maybe_relock() which is automatically called by scx_task_iter_next() and scx_task_iter_stop(). - Drop explicit scx_task_iter_relock() calls. The resulting behavior should be equivalent. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-09-03audit: fix out-of-bounds read in audit_compare_dname_path()Stanislav Fort1-1/+1
When a watch on dir=/ is combined with an fsnotify event for a single-character name directly under / (e.g., creating /a), an out-of-bounds read can occur in audit_compare_dname_path(). The helper parent_len() returns 1 for "/". In audit_compare_dname_path(), when parentlen equals the full path length (1), the code sets p = path + 1 and pathlen = 1 - 1 = 0. The subsequent loop then dereferences p[pathlen - 1] (i.e., p[-1]), causing an out-of-bounds read. Fix this by adding a pathlen > 0 check to the while loop condition to prevent the out-of-bounds access. Cc: stable@vger.kernel.org Fixes: e92eebb0d611 ("audit: fix suffixed '/' filename matching") Reported-by: Stanislav Fort <disclosure@aisle.com> Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org> Signed-off-by: Stanislav Fort <stanislav.fort@aisle.com> [PM: subject tweak, sign-off email fixes] Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-09-03cgroup/cpuset: Prevent NULL pointer access in free_tmpmasks()Waiman Long1-0/+3
Commit 5806b3d05165 ("cpuset: decouple tmpmasks and cpumasks freeing in cgroup") separates out the freeing of tmpmasks into a new free_tmpmask() helper but removes the NULL pointer check in the process. Unfortunately a NULL pointer can be passed to free_tmpmasks() in cpuset_handle_hotplug() if cpuset v1 is active. This can cause segmentation fault and crash the kernel. Fix that by adding the NULL pointer check to free_tmpmasks(). Fixes: 5806b3d05165 ("cpuset: decouple tmpmasks and cpumasks freeing in cgroup") Reported-by: Ashay Jaiswal <quic_ashayj@quicinc.com> Closes: https://lore.kernel.org/lkml/20250902-cpuset-free-on-condition-v1-1-f46ffab53eac@quicinc.com/ Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-03sched: Fix sched_numa_find_nth_cpu() if mask offlineChristian Loehle1-0/+2
sched_numa_find_nth_cpu() uses a bsearch to look for the 'closest' CPU in sched_domains_numa_masks and given cpus mask. However they might not intersect if all CPUs in the cpus mask are offline. bsearch will return NULL in that case, bail out instead of dereferencing a bogus pointer. The previous behaviour lead to this bug when using maxcpus=4 on an rk3399 (LLLLbb) (i.e. booting with all big CPUs offline): [ 1.422922] Unable to handle kernel paging request at virtual address ffffff8000000000 [ 1.423635] Mem abort info: [ 1.423889] ESR = 0x0000000096000006 [ 1.424227] EC = 0x25: DABT (current EL), IL = 32 bits [ 1.424715] SET = 0, FnV = 0 [ 1.424995] EA = 0, S1PTW = 0 [ 1.425279] FSC = 0x06: level 2 translation fault [ 1.425735] Data abort info: [ 1.425998] ISV = 0, ISS = 0x00000006, ISS2 = 0x00000000 [ 1.426499] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 [ 1.426952] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 1.427428] swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000004a9f000 [ 1.428038] [ffffff8000000000] pgd=18000000f7fff403, p4d=18000000f7fff403, pud=18000000f7fff403, pmd=0000000000000000 [ 1.429014] Internal error: Oops: 0000000096000006 [#1] SMP [ 1.429525] Modules linked in: [ 1.429813] CPU: 3 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.17.0-rc4-dirty #343 PREEMPT [ 1.430559] Hardware name: Pine64 RockPro64 v2.1 (DT) [ 1.431012] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 1.431634] pc : sched_numa_find_nth_cpu+0x2a0/0x488 [ 1.432094] lr : sched_numa_find_nth_cpu+0x284/0x488 [ 1.432543] sp : ffffffc084e1b960 [ 1.432843] x29: ffffffc084e1b960 x28: ffffff80078a8800 x27: ffffffc0846eb1d0 [ 1.433495] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000 [ 1.434144] x23: 0000000000000000 x22: fffffffffff7f093 x21: ffffffc081de6378 [ 1.434792] x20: 0000000000000000 x19: 0000000ffff7f093 x18: 00000000ffffffff [ 1.435441] x17: 3030303866666666 x16: 66663d736b73616d x15: ffffffc104e1b5b7 [ 1.436091] x14: 0000000000000000 x13: ffffffc084712860 x12: 0000000000000372 [ 1.436739] x11: 0000000000000126 x10: ffffffc08476a860 x9 : ffffffc084712860 [ 1.437389] x8 : 00000000ffffefff x7 : ffffffc08476a860 x6 : 0000000000000000 [ 1.438036] x5 : 000000000000bff4 x4 : 0000000000000000 x3 : 0000000000000000 [ 1.438683] x2 : 0000000000000000 x1 : ffffffc0846eb000 x0 : ffffff8000407b68 [ 1.439332] Call trace: [ 1.439559] sched_numa_find_nth_cpu+0x2a0/0x488 (P) [ 1.440016] smp_call_function_any+0xc8/0xd0 [ 1.440416] armv8_pmu_init+0x58/0x27c [ 1.440770] armv8_cortex_a72_pmu_init+0x20/0x2c [ 1.441199] arm_pmu_device_probe+0x1e4/0x5e8 [ 1.441603] armv8_pmu_device_probe+0x1c/0x28 [ 1.442007] platform_probe+0x5c/0xac [ 1.442347] really_probe+0xbc/0x298 [ 1.442683] __driver_probe_device+0x78/0x12c [ 1.443087] driver_probe_device+0xdc/0x160 [ 1.443475] __driver_attach+0x94/0x19c [ 1.443833] bus_for_each_dev+0x74/0xd4 [ 1.444190] driver_attach+0x24/0x30 [ 1.444525] bus_add_driver+0xe4/0x208 [ 1.444874] driver_register+0x60/0x128 [ 1.445233] __platform_driver_register+0x24/0x30 [ 1.445662] armv8_pmu_driver_init+0x28/0x4c [ 1.446059] do_one_initcall+0x44/0x25c [ 1.446416] kernel_init_freeable+0x1dc/0x3bc [ 1.446820] kernel_init+0x20/0x1d8 [ 1.447151] ret_from_fork+0x10/0x20 [ 1.447493] Code: 90022e21 f000e5f5 910de2b5 2a1703e2 (f8767803) [ 1.448040] ---[ end trace 0000000000000000 ]--- [ 1.448483] note: swapper/0[1] exited with preempt_count 1 [ 1.449047] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b [ 1.449741] SMP: stopping secondary CPUs [ 1.450105] Kernel Offset: disabled [ 1.450419] CPU features: 0x000000,00080000,20002001,0400421b [ 1.450935] Memory Limit: none [ 1.451217] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]--- Yury: with the fix, the function returns cpu == nr_cpu_ids, and later in smp_call_function_any -> smp_call_function_single -> generic_exec_single we test the cpu for '>= nr_cpu_ids' and return -ENXIO. So everything is handled correctly. Fixes: cd7f55359c90 ("sched: add sched_numa_find_nth_cpu()") Cc: stable@vger.kernel.org Signed-off-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
2025-09-03genirq/test: Ensure CPU 1 is online for hotplug testBrian Norris1-0/+2
It's possible to run these tests on platforms that think they have a hotpluggable CPU1, but for whatever reason, CPU1 is not online and can't be brought online: # irq_cpuhotplug_test: EXPECTATION FAILED at kernel/irq/irq_test.c:210 Expected remove_cpu(1) == 0, but remove_cpu(1) == 1 (0x1) CPU1: failed to boot: -38 # irq_cpuhotplug_test: EXPECTATION FAILED at kernel/irq/irq_test.c:214 Expected add_cpu(1) == 0, but add_cpu(1) == -38 (0xffffffffffffffda) Check that CPU1 is actually online before trying to run the test. Fixes: 66067c3c8a1e ("genirq: Add kunit tests for depth counts") Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Brian Norris <briannorris@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: David Gow <davidgow@google.com> Link: https://lore.kernel.org/all/20250822190140.2154646-7-briannorris@chromium.org
2025-09-03genirq/test: Drop CONFIG_GENERIC_IRQ_MIGRATION assumptionsBrian Norris1-4/+0
Not all platforms use the generic IRQ migration code, even if they select GENERIC_IRQ_MIGRATION. (See, for example, powerpc / pseries_cpu_disable().) If such platforms don't perform managed shutdown the same way, the interrupt may not actually shut down, and these tests fail: [ 4.357022][ T101] # irq_cpuhotplug_test: EXPECTATION FAILED at kernel/irq/irq_test.c:211 [ 4.357022][ T101] Expected irqd_is_activated(data) to be false, but is true [ 4.358128][ T101] # irq_cpuhotplug_test: EXPECTATION FAILED at kernel/irq/irq_test.c:212 [ 4.358128][ T101] Expected irqd_is_started(data) to be false, but is true [ 4.375558][ T101] # irq_cpuhotplug_test: EXPECTATION FAILED at kernel/irq/irq_test.c:216 [ 4.375558][ T101] Expected irqd_is_activated(data) to be false, but is true [ 4.376088][ T101] # irq_cpuhotplug_test: EXPECTATION FAILED at kernel/irq/irq_test.c:217 [ 4.376088][ T101] Expected irqd_is_started(data) to be false, but is true [ 4.377851][ T1] # irq_cpuhotplug_test: pass:0 fail:1 skip:0 total:1 [ 4.377901][ T1] not ok 4 irq_cpuhotplug_test [ 4.378073][ T1] # irq_test_cases: pass:3 fail:1 skip:0 total:4 Rather than test that PowerPC performs migration the same way as the unterrupt core, just drop the state checks. The point of the test was to ensure that the code kept |depth| balanced, which still can be tested for. Fixes: 66067c3c8a1e ("genirq: Add kunit tests for depth counts") Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Brian Norris <briannorris@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: David Gow <davidgow@google.com> Link: https://lore.kernel.org/all/20250822190140.2154646-6-briannorris@chromium.org
2025-09-03genirq/test: Depend on SPARSE_IRQBrian Norris1-0/+1
Some architectures have a static interrupt layout, with a limited number of interrupts. Without SPARSE_IRQ, the test may not be able to allocate any fake interrupts, and the test will fail. (This occurs on ARCH=m68k, for example.) Additionally, managed-affinity is only supported with CONFIG_SPARSE_IRQ=y, so irq_shutdown_depth_test() and irq_cpuhotplug_test() would fail without it. Add a 'SPARSE_IRQ' dependency to avoid these problems. Many architectures 'select SPARSE_IRQ', so this is easy to miss. Notably, this also excludes ARCH=um from running any of these tests, even though some of them might work. Fixes: 66067c3c8a1e ("genirq: Add kunit tests for depth counts") Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Brian Norris <briannorris@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: David Gow <davidgow@google.com> Link: https://lore.kernel.org/all/20250822190140.2154646-5-briannorris@chromium.org
2025-09-03genirq/test: Fail early if interrupt request failsBrian Norris1-5/+5
Requesting an interrupt is part of the basic test setup. If it fails, most of the subsequent tests are likely to fail, and the output gets noisy. Use "assert" to fail early. Signed-off-by: Brian Norris <briannorris@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: David Gow <davidgow@google.com> Link: https://lore.kernel.org/all/20250822190140.2154646-4-briannorris@chromium.org
2025-09-03genirq/test: Factor out fake-virq setupBrian Norris1-25/+20
A few things need to be repeated in tests. Factor out the creation of fake interrupts. Signed-off-by: Brian Norris <briannorris@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: David Gow <davidgow@google.com> Link: https://lore.kernel.org/all/20250822190140.2154646-3-briannorris@chromium.org
2025-09-03genirq/test: Select IRQ_DOMAINBrian Norris1-0/+1
These tests use irq_domain_alloc_descs() and so require CONFIG_IRQ_DOMAIN. Fixes: 66067c3c8a1e ("genirq: Add kunit tests for depth counts") Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Brian Norris <briannorris@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: David Gow <davidgow@google.com> Link: https://lore.kernel.org/all/20250822190140.2154646-2-briannorris@chromium.org Closes: https://lore.kernel.org/lkml/ded44edf-eeb7-420c-b8a8-d6543b955e6e@roeck-us.net/
2025-09-03genirq/test: Fix depth tests on architectures with NOREQUEST by default.David Gow1-0/+12
The new irq KUnit tests fail on some architectures (notably PowerPC and 32-bit ARM), as the request_irq() call fails due to the ARCH_IRQ_INIT_FLAGS containing IRQ_NOREQUEST, yielding the following errors: [10:17:45] # irq_free_disabled_test: EXPECTATION FAILED at kernel/irq/irq_test.c:88 [10:17:45] Expected ret == 0, but [10:17:45] ret == -22 (0xffffffffffffffea) [10:17:45] # irq_free_disabled_test: EXPECTATION FAILED at kernel/irq/irq_test.c:90 [10:17:45] Expected desc->depth == 0, but [10:17:45] desc->depth == 1 (0x1) [10:17:45] # irq_free_disabled_test: EXPECTATION FAILED at kernel/irq/irq_test.c:93 [10:17:45] Expected desc->depth == 1, but [10:17:45] desc->depth == 2 (0x2) By clearing IRQ_NOREQUEST from the interrupt descriptor, these tests now pass on ARM and PowerPC. Fixes: 66067c3c8a1e ("genirq: Add kunit tests for depth counts") Signed-off-by: David Gow <davidgow@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Brian Norris <briannorris@chromium.org> Reviewed-by: Brian Norris <briannorris@chromium.org> Link: https://lore.kernel.org/all/20250816094528.3560222-2-davidgow@google.com
2025-09-03genirq: Add support for warning on long-running interrupt handlersWladislav Wiebe1-1/+48
Introduce a mechanism to detect and warn about prolonged interrupt handlers. With a new command-line parameter (irqhandler.duration_warn_us=), users can configure the duration threshold in microseconds when a warning in such format should be emitted: "[CPU14] long duration of IRQ[159:bad_irq_handler [long_irq]], took: 1330 us" The implementation uses local_clock() to measure the execution duration of the generic IRQ per-CPU event handler. Signed-off-by: Wladislav Wiebe <wladislav.wiebe@nokia.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jiri Slaby <jirislaby@kernel.org> Link: https://lore.kernel.org/all/20250804093525.851-1-wladislav.wiebe@nokia.com
2025-09-03vdso/vsyscall: Avoid slow division loop in auxiliary clock updateThomas Weißschuh2-4/+10
The call to __iter_div_u64_rem() in vdso_time_update_aux() is a wrapper around subtraction. It cannot be used to divide large numbers, as that introduces long, computationally expensive delays. A regular u64 division is also not possible in the timekeeper update path as it can be too slow. Instead of splitting the ktime_t offset into into second and subsecond components during the timekeeper update fast-path, do it together with the adjustment of tk->offs_aux in the slow-path. Equivalent to the handling of offs_boot and monotonic_to_boot. Reuse the storage of monotonic_to_boot for the new field, as it is not used by auxiliary timekeepers. Fixes: 380b84e168e5 ("vdso/vsyscall: Update auxiliary clock data in the datapage") Reported-by: Miroslav Lichvar <mlichvar@redhat.com> Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250825-vdso-auxclock-division-v1-1-a1d32a16a313@linutronix.de Closes: https://lore.kernel.org/lkml/aKwsNNWsHJg8IKzj@localhost/
2025-09-03perf: Fix the POLL_HUP delivery breakageKan Liang1-0/+1
The event_limit can be set by the PERF_EVENT_IOC_REFRESH to limit the number of events. When the event_limit reaches 0, the POLL_HUP signal should be sent. But it's missed. The corresponding counter should be stopped when the event_limit reaches 0. It was implemented in the ARCH-specific code. However, since the commit 9734e25fbf5a ("perf: Fix the throttle logic for a group"), all the ARCH-specific code has been moved to the generic code. The code to handle the event_limit was lost. Add the event->pmu->stop(event, 0); back. Fixes: 9734e25fbf5a ("perf: Fix the throttle logic for a group") Closes: https://lore.kernel.org/lkml/aICYAqM5EQUlTqtX@li-2b55cdcc-350b-11b2-a85c-a78bff51fc11.ibm.com/ Reported-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Link: https://lkml.kernel.org/r/20250811182644.1305952-1-kan.liang@linux.intel.com
2025-09-03sched/fair: Get rid of throttled_lb_pair()Aaron Lu1-31/+4
Now that throttled tasks are dequeued and can not stay on rq's cfs_tasks list, there is no need to take special care of these throttled tasks anymore in load balance. Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20250829081120.806-6-ziqianlu@bytedance.com
2025-09-03sched/fair: Task based throttle time accountingAaron Lu2-25/+32
With task based throttle model, the previous way to check cfs_rq's nr_queued to decide if throttled time should be accounted doesn't work as expected, e.g. when a cfs_rq which has a single task is throttled, that task could later block in kernel mode instead of being dequeued on limbo list and accounting this as throttled time is not accurate. Rework throttle time accounting for a cfs_rq as follows: - start accounting when the first task gets throttled in its hierarchy; - stop accounting on unthrottle. Note that there will be a time gap between when a cfs_rq is throttled and when a task in its hierarchy is actually throttled. This accounting mechanism only starts accounting in the latter case. Suggested-by: Chengming Zhou <chengming.zhou@linux.dev> # accounting mechanism Co-developed-by: K Prateek Nayak <kprateek.nayak@amd.com> # simplify implementation Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20250829081120.806-5-ziqianlu@bytedance.com
2025-09-03sched/fair: Switch to task based throttle modelValentin Schneider3-167/+181
In current throttle model, when a cfs_rq is throttled, its entity will be dequeued from cpu's rq, making tasks attached to it not able to run, thus achiveing the throttle target. This has a drawback though: assume a task is a reader of percpu_rwsem and is waiting. When it gets woken, it can not run till its task group's next period comes, which can be a relatively long time. Waiting writer will have to wait longer due to this and it also makes further reader build up and eventually trigger task hung. To improve this situation, change the throttle model to task based, i.e. when a cfs_rq is throttled, record its throttled status but do not remove it from cpu's rq. Instead, for tasks that belong to this cfs_rq, when they get picked, add a task work to them so that when they return to user, they can be dequeued there. In this way, tasks throttled will not hold any kernel resources. And on unthrottle, enqueue back those tasks so they can continue to run. Throttled cfs_rq's PELT clock is handled differently now: previously the cfs_rq's PELT clock is stopped once it entered throttled state but since now tasks(in kernel mode) can continue to run, change the behaviour to stop PELT clock when the throttled cfs_rq has no tasks left. Suggested-by: Chengming Zhou <chengming.zhou@linux.dev> # tag on pick Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Chen Yu <yu.c.chen@intel.com> Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20250829081120.806-4-ziqianlu@bytedance.com
2025-09-03sched/fair: Implement throttle task work and related helpersValentin Schneider1-0/+65
Implement throttle_cfs_rq_work() task work which gets executed on task's ret2user path where the task is dequeued and marked as throttled. Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Tested-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20250829081120.806-3-ziqianlu@bytedance.com
2025-09-03sched/fair: Add related data structure for task based throttleValentin Schneider3-0/+19
Add related data structures for this new throttle functionality. Tesed-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Tested-by: Valentin Schneider <vschneid@redhat.com> Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk> Link: https://lore.kernel.org/r/20250829081120.806-2-ziqianlu@bytedance.com
2025-09-03sched: Move STDL_INIT() functions out-of-linePeter Zijlstra1-0/+45
Since all these functions are address-taken in SDTL_INIT() and called indirectly, it doesn't really make sense for them to be inline. Suggested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-09-03sched/fair: Get rid of sched_domains_curr_level hack for tl->cpumask()Peter Zijlstra1-18/+10
Leon [1] and Vinicius [2] noted a topology_span_sane() warning during their testing starting from v6.16-rc1. Debug that followed pointed to the tl->mask() for the NODE domain being incorrectly resolved to that of the highest NUMA domain. tl->mask() for NODE is set to the sd_numa_mask() which depends on the global "sched_domains_curr_level" hack. "sched_domains_curr_level" is set to the "tl->numa_level" during tl traversal in build_sched_domains() calling sd_init() but was not reset before topology_span_sane(). Since "tl->numa_level" still reflected the old value from build_sched_domains(), topology_span_sane() for the NODE domain trips when the span of the last NUMA domain overlaps. Instead of replicating the "sched_domains_curr_level" hack, get rid of it entirely and instead, pass the entire "sched_domain_topology_level" object to tl->cpumask() function to prevent such mishap in the future. sd_numa_mask() now directly references "tl->numa_level" instead of relying on the global "sched_domains_curr_level" hack to index into sched_domains_numa_masks[]. The original warning was reproducible on the following NUMA topology reported by Leon: $ sudo numactl -H available: 5 nodes (0-4) node 0 cpus: 0 1 node 0 size: 2927 MB node 0 free: 1603 MB node 1 cpus: 2 3 node 1 size: 3023 MB node 1 free: 3008 MB node 2 cpus: 4 5 node 2 size: 3023 MB node 2 free: 3007 MB node 3 cpus: 6 7 node 3 size: 3023 MB node 3 free: 3002 MB node 4 cpus: 8 9 node 4 size: 3022 MB node 4 free: 2718 MB node distances: node 0 1 2 3 4 0: 10 39 38 37 36 1: 39 10 38 37 36 2: 38 38 10 37 36 3: 37 37 37 10 36 4: 36 36 36 36 10 The above topology can be mimicked using the following QEMU cmd that was used to reproduce the warning and test the fix: sudo qemu-system-x86_64 -enable-kvm -cpu host \ -m 20G -smp cpus=10,sockets=10 -machine q35 \ -object memory-backend-ram,size=4G,id=m0 \ -object memory-backend-ram,size=4G,id=m1 \ -object memory-backend-ram,size=4G,id=m2 \ -object memory-backend-ram,size=4G,id=m3 \ -object memory-backend-ram,size=4G,id=m4 \ -numa node,cpus=0-1,memdev=m0,nodeid=0 \ -numa node,cpus=2-3,memdev=m1,nodeid=1 \ -numa node,cpus=4-5,memdev=m2,nodeid=2 \ -numa node,cpus=6-7,memdev=m3,nodeid=3 \ -numa node,cpus=8-9,memdev=m4,nodeid=4 \ -numa dist,src=0,dst=1,val=39 \ -numa dist,src=0,dst=2,val=38 \ -numa dist,src=0,dst=3,val=37 \ -numa dist,src=0,dst=4,val=36 \ -numa dist,src=1,dst=0,val=39 \ -numa dist,src=1,dst=2,val=38 \ -numa dist,src=1,dst=3,val=37 \ -numa dist,src=1,dst=4,val=36 \ -numa dist,src=2,dst=0,val=38 \ -numa dist,src=2,dst=1,val=38 \ -numa dist,src=2,dst=3,val=37 \ -numa dist,src=2,dst=4,val=36 \ -numa dist,src=3,dst=0,val=37 \ -numa dist,src=3,dst=1,val=37 \ -numa dist,src=3,dst=2,val=37 \ -numa dist,src=3,dst=4,val=36 \ -numa dist,src=4,dst=0,val=36 \ -numa dist,src=4,dst=1,val=36 \ -numa dist,src=4,dst=2,val=36 \ -numa dist,src=4,dst=3,val=36 \ ... [ prateek: Moved common functions to include/linux/sched/topology.h, reuse the common bits for s390 and ppc, commit message ] Closes: https://lore.kernel.org/lkml/20250610110701.GA256154@unreal/ [1] Fixes: ccf74128d66c ("sched/topology: Assert non-NUMA topology masks don't (partially) overlap") # ce29a7da84cd, f55dac1dafb3 Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reported-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com> Tested-by: Valentin Schneider <vschneid@redhat.com> # x86 Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> # powerpc Link: https://lore.kernel.org/lkml/a3de98387abad28592e6ab591f3ff6107fe01dc1.1755893468.git.tim.c.chen@linux.intel.com/ [2]
2025-09-03sched/deadline: Fix race in push_dl_task()Harshit Agarwal1-24/+49
When a CPU chooses to call push_dl_task and picks a task to push to another CPU's runqueue then it will call find_lock_later_rq method which would take a double lock on both CPUs' runqueues. If one of the locks aren't readily available, it may lead to dropping the current runqueue lock and reacquiring both the locks at once. During this window it is possible that the task is already migrated and is running on some other CPU. These cases are already handled. However, if the task is migrated and has already been executed and another CPU is now trying to wake it up (ttwu) such that it is queued again on the runqeue (on_rq is 1) and also if the task was run by the same CPU, then the current checks will pass even though the task was migrated out and is no longer in the pushable tasks list. Please go through the original rt change for more details on the issue. To fix this, after the lock is obtained inside the find_lock_later_rq, it ensures that the task is still at the head of pushable tasks list. Also removed some checks that are no longer needed with the addition of this new check. However, the new check of pushable tasks list only applies when find_lock_later_rq is called by push_dl_task. For the other caller i.e. dl_task_offline_migration, existing checks are used. Signed-off-by: Harshit Agarwal <harshit@nutanix.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250408045021.3283624-1-harshit@nutanix.com
2025-09-02tracing: Fix tracing_marker may trigger page fault during preempt_disableLuo Gengkun1-2/+2
Both tracing_mark_write and tracing_mark_raw_write call __copy_from_user_inatomic during preempt_disable. But in some case, __copy_from_user_inatomic may trigger page fault, and will call schedule() subtly. And if a task is migrated to other cpu, the following warning will be trigger: if (RB_WARN_ON(cpu_buffer, !local_read(&cpu_buffer->committing))) An example can illustrate this issue: process flow CPU --------------------------------------------------------------------- tracing_mark_raw_write(): cpu:0 ... ring_buffer_lock_reserve(): cpu:0 ... cpu = raw_smp_processor_id() cpu:0 cpu_buffer = buffer->buffers[cpu] cpu:0 ... ... __copy_from_user_inatomic(): cpu:0 ... # page fault do_mem_abort(): cpu:0 ... # Call schedule schedule() cpu:0 ... # the task schedule to cpu1 __buffer_unlock_commit(): cpu:1 ... ring_buffer_unlock_commit(): cpu:1 ... cpu = raw_smp_processor_id() cpu:1 cpu_buffer = buffer->buffers[cpu] cpu:1 As shown above, the process will acquire cpuid twice and the return values are not the same. To fix this problem using copy_from_user_nofault instead of __copy_from_user_inatomic, as the former performs 'access_ok' before copying. Link: https://lore.kernel.org/20250819105152.2766363-1-luogengkun@huaweicloud.com Fixes: 656c7f0d2d2b ("tracing: Replace kmap with copy_from_user() in trace_marker writing") Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-02trace: Remove redundant __GFP_NOWARNQianfeng Rong1-1/+1
Commit 16f5dfbc851b ("gfp: include __GFP_NOWARN in GFP_NOWAIT") made GFP_NOWAIT implicitly include __GFP_NOWARN. Therefore, explicit __GFP_NOWARN combined with GFP_NOWAIT (e.g., `GFP_NOWAIT | __GFP_NOWARN`) is now redundant. Let's clean up these redundant flags across subsystems. No functional changes. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250805023630.335719-1-rongqianfeng@vivo.com Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-09-02bpf: Replace kvfree with kfree for kzalloc memoryFeng Yang1-2/+2
These pointers are allocated by kzalloc. Therefore, replace kvfree() with kfree() to avoid unnecessary is_vmalloc_addr() check in kvfree(). This is the remaining unmodified part from [1]. Signed-off-by: Feng Yang <yangfeng@kylinos.cn> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20250811123949.552885-1-rongqianfeng@vivo.com [1] Link: https://lore.kernel.org/bpf/20250827032812.498216-1-yangfeng59949@163.com
2025-09-02pidns: move is-ancestor logic to helperAleksa Sarai1-8/+14
This check will be needed in later patches, and there's no point open-coding it each time. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Link: https://lore.kernel.org/20250805-procfs-pidns-api-v4-1-705f984940e7@cyphar.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-02dma-debug: don't enforce dma mapping check on noncoherent allocationsBaochen Qiang3-3/+69
As discussed in [1], there is no need to enforce dma mapping check on noncoherent allocations, a simple test on the returned CPU address is good enough. Add a new pair of debug helpers and use them for noncoherent alloc/free to fix this issue. Fixes: efa70f2fdc84 ("dma-mapping: add a new dma_alloc_pages API") Link: https://lore.kernel.org/all/ff6c1fe6-820f-4e58-8395-df06aa91706c@oss.qualcomm.com # 1 Signed-off-by: Baochen Qiang <baochen.qiang@oss.qualcomm.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20250828-dma-debug-fix-noncoherent-dma-check-v1-1-76e9be0dd7fc@oss.qualcomm.com
2025-09-01copy_process: pass clone_flags as u64 across calltreeSimon Schuster11-17/+17
With the introduction of clone3 in commit 7f192e3cd316 ("fork: add clone3") the effective bit width of clone_flags on all architectures was increased from 32-bit to 64-bit, with a new type of u64 for the flags. However, for most consumers of clone_flags the interface was not changed from the previous type of unsigned long. While this works fine as long as none of the new 64-bit flag bits (CLONE_CLEAR_SIGHAND and CLONE_INTO_CGROUP) are evaluated, this is still undesirable in terms of the principle of least surprise. Thus, this commit fixes all relevant interfaces of callees to sys_clone3/copy_process (excluding the architecture-specific copy_thread) to consistently pass clone_flags as u64, so that no truncation to 32-bit integers occurs on 32-bit architectures. Signed-off-by: Simon Schuster <schuster.simon@siemens-energy.com> Link: https://lore.kernel.org/20250901-nios2-implement-clone3-v2-2-53fcf5577d57@siemens-energy.com Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-01copy_sighand: Handle architectures where sizeof(unsigned long) < sizeof(u64)Simon Schuster1-1/+1
With the introduction of clone3 in commit 7f192e3cd316 ("fork: add clone3") the effective bit width of clone_flags on all architectures was increased from 32-bit to 64-bit. However, the signature of the copy_* helper functions (e.g., copy_sighand) used by copy_process was not adapted. As such, they truncate the flags on any 32-bit architectures that supports clone3 (arc, arm, csky, m68k, microblaze, mips32, openrisc, parisc32, powerpc32, riscv32, x86-32 and xtensa). For copy_sighand with CLONE_CLEAR_SIGHAND being an actual u64 constant, this triggers an observable bug in kernel selftest clone3_clear_sighand: if (clone_flags & CLONE_CLEAR_SIGHAND) in function copy_sighand within fork.c will always fail given: unsigned long /* == uint32_t */ clone_flags #define CLONE_CLEAR_SIGHAND 0x100000000ULL This commit fixes the bug by always passing clone_flags to copy_sighand via their declared u64 type, invariant of architecture-dependent integer sizes. Fixes: b612e5df4587 ("clone3: add CLONE_CLEAR_SIGHAND") Cc: stable@vger.kernel.org # linux-5.5+ Signed-off-by: Simon Schuster <schuster.simon@siemens-energy.com> Link: https://lore.kernel.org/20250901-nios2-implement-clone3-v2-1-53fcf5577d57@siemens-energy.com Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-09-01PM: hibernate: Fix typo in memory bitmaps description commentLi Jun1-1/+1
Correct 'leave' to 'leaf' in memory bitmaps description comment. Signed-off-by: Li Jun <lijun01@kylinos.cn> Link: https://patch.msgid.link/20250819104038.1596952-1-lijun01@kylinos.cn [ rjw: Subject and changelog edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-09-01PM: hibernate: Use vmalloc_array() and vcalloc() to improve codeQianfeng Rong1-3/+3
Remove array_size() calls and replace vmalloc() and vzalloc() with vmalloc_array() and vcalloc() respectively to simplify the code in save_compressed_image() and load_compressed_image(). vmalloc_array() is also optimized better, resulting in less instructions being used, and vmalloc_array() handling overhead is lower [1]. Link: https://lore.kernel.org/lkml/abc66ec5-85a4-47e1-9759-2f60ab111971@vivo.com/ [1] Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com> Link: https://patch.msgid.link/20250817083636.53872-1-rongqianfeng@vivo.com [ rjw: Changelog edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-08-31Merge tag 'sched_urgent_for_v6.17_rc4' of ↵Linus Torvalds2-10/+14
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Fix a stall on the CPU offline path due to mis-counting a deadline server task twice as part of the runqueue's running tasks count - Fix a realtime tasks starvation case where failure to enqueue a timer whose expiration time is already in the past would cause repeated attempts to re-enqueue a deadline server task which leads to starving the former, realtime one - Prevent a delayed deadline server task stop from breaking the per-runqueue bandwidth tracking - Have a function checking whether the deadline server task has stopped, return the correct value * tag 'sched_urgent_for_v6.17_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/deadline: Don't count nr_running for dl_server proxy tasks sched/deadline: Fix RT task potential starvation when expiry time passed sched/deadline: Always stop dl-server before changing parameters sched/deadline: Fix dl_server_stopped()
2025-08-31futex: Move futex_hash_free() back to __mmput()Sebastian Andrzej Siewior2-5/+13
To avoid a memory leak via mm_alloc() + mmdrop() the futex cleanup code has been moved to __mmdrop(). This resulted in a warnings if the futex hash table has been allocated via vmalloc() the mmdrop() was invoked from atomic context. The free path must stay in __mmput() to ensure it is invoked from preemptible context. In order to avoid the memory leak, delay the allocation of mm_struct::mm->futex_ref to futex_hash_allocate(). This works because neither the per-CPU counter nor the private hash has been allocated and therefore - futex_private_hash() callers (such as exit_pi_state_list()) don't acquire reference if there is no private hash yet. There is also no reference put. - Regular callers (futex_hash()) fallback to global hash. No reference counting here. The futex_ref member can be allocated in futex_hash_allocate() before the private hash itself is allocated. This happens either while the first thread is created or on request. In both cases the process has just a single thread so there can be either futex operation in progress or the request to create a private hash. Move futex_hash_free() back to __mmput(); Move the allocation of mm_struct::futex_ref to futex_hash_allocate(). [ bp: Fold a follow-up fix to prevent a use-after-free: https://lore.kernel.org/r/20250830213806.sEKuuGSm@linutronix.de ] Fixes: e703b7e247503 ("futex: Move futex cleanup to __mmdrop()") Closes: https://lore.kernel.org/all/20250821102721.6deae493@kernel.org/ Reported-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lkml.kernel.org/r/20250822141238.PfnkTjFb@linutronix.de
2025-08-30audit: add record for multiple object contextsCasey Schaufler2-32/+64
Create a new audit record AUDIT_MAC_OBJ_CONTEXTS. An example of the MAC_OBJ_CONTEXTS record is: type=MAC_OBJ_CONTEXTS msg=audit(1601152467.009:1050): obj_selinux=unconfined_u:object_r:user_home_t:s0 When an audit event includes a AUDIT_MAC_OBJ_CONTEXTS record the "obj=" field in other records in the event will be "obj=?". An AUDIT_MAC_OBJ_CONTEXTS record is supplied when the system has multiple security modules that may make access decisions based on an object security context. Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> [PM: subj tweak, audit example readability indents] Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-08-30audit: add record for multiple task security contextsCasey Schaufler1-33/+175
Replace the single skb pointer in an audit_buffer with a list of skb pointers. Add the audit_stamp information to the audit_buffer as there's no guarantee that there will be an audit_context containing the stamp associated with the event. At audit_log_end() time create auxiliary records as have been added to the list. Functions are created to manage the skb list in the audit_buffer. Create a new audit record AUDIT_MAC_TASK_CONTEXTS. An example of the MAC_TASK_CONTEXTS record is: type=MAC_TASK_CONTEXTS msg=audit(1600880931.832:113) subj_apparmor=unconfined subj_smack=_ When an audit event includes a AUDIT_MAC_TASK_CONTEXTS record the "subj=" field in other records in the event will be "subj=?". An AUDIT_MAC_TASK_CONTEXTS record is supplied when the system has multiple security modules that may make access decisions based on a subject security context. Refactor audit_log_task_context(), creating a new audit_log_subj_ctx(). This is used in