| Age | Commit message (Collapse) | Author | Files | Lines |
|
show_cpu_pool_hog() and show_cpu_pools_hogs() no longer only dump CPU
hogs — since commit 8823eaef45da ("workqueue: Show all busy workers in
stall diagnostics"), they dump every in-flight worker in the pool's
busy_hash.
Rename them to show_cpu_pool_busy_workers() and
show_cpu_pools_busy_workers() to accurately describe what they do.
Also fix the pr_info() message to say "stalled worker pools" instead of
"stalled CPU-bound worker pools", since sleeping/blocked workers are now
included.
No functional change.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
show_cpu_pool_hog() only prints workers whose task is currently running
on the CPU (task_is_running()). This misses workers that are busy
processing a work item but are sleeping or blocked — for example, a
worker that clears PF_WQ_WORKER and enters wait_event_idle(). Such a
worker still occupies a pool slot and prevents progress, yet produces
an empty backtrace section in the watchdog output.
This is happening on real arm64 systems, where
toggle_allocation_gate() IPIs every single CPU in the machine (which
lacks NMI), causing workqueue stalls that show empty backtraces because
toggle_allocation_gate() is sleeping in wait_event_idle().
Remove the task_is_running() filter so every in-flight worker in the
pool's busy_hash is dumped. The busy_hash is protected by pool->lock,
which is already held.
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
When diagnosing workqueue stalls, knowing how long each in-flight work
item has been executing is valuable. Add a current_start timestamp
(jiffies) to struct worker, set it when a work item begins execution in
process_one_work(), and print the elapsed wall-clock time in show_pwq().
Unlike current_at (which tracks CPU runtime and resets on wakeup for
CPU-intensive detection), current_start is never reset because the
diagnostic cares about total wall-clock time including sleeps.
Before: in-flight: 165:stall_work_fn [wq_stall]
After: in-flight: 165:stall_work_fn [wq_stall] for 100s
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The watchdog_ts name doesn't convey what the timestamp actually tracks.
This field tracks the last time a workqueue got progress.
Rename it to last_progress_ts to make it clear that it records when the
pool last made forward progress (started processing new work items).
No functional change.
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
pr_cont_worker_id() checks pool->flags against WQ_BH, which is a
workqueue-level flag (defined in workqueue.h). Pool flags use a
separate namespace with POOL_* constants (defined in workqueue.c).
The correct constant is POOL_BH. Both WQ_BH and POOL_BH are defined
as (1 << 0) so this has no behavioral impact, but it is semantically
wrong and inconsistent with every other pool-level BH check in the
file.
Fixes: 4cb1ef64609f ("workqueue: Implement BH workqueues to eventually replace tasklets")
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
This is the exact same thing as the 'alloc_obj()' version, only much
smaller because there are a lot fewer users of the *alloc_flex()
interface.
As with alloc_obj() version, this was done entirely with mindless brute
force, using the same script, except using 'flex' in the pattern rather
than 'objs*'.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This was done entirely with mindless brute force, using
git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with: kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
Pull workqueue updates from Tejun Heo:
- Rework the rescuer to process work items one-by-one instead of
slurping all pending work items in a single pass.
As there is only one rescuer per workqueue, a single long-blocking
work item could cause high latency for all tasks queued behind it,
even after memory pressure is relieved and regular kworkers become
available to service them.
- Add CONFIG_BOOTPARAM_WQ_STALL_PANIC build-time option and
workqueue.panic_on_stall_time parameter for time-based stall panic,
giving systems more control over workqueue stall handling.
- Replace BUG_ON() with panic() in the stall panic path for clearer
intent and more informative output.
* tag 'wq-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: replace BUG_ON with panic in panic_on_wq_watchdog
workqueue: add time-based panic for stalls
workqueue: add CONFIG_BOOTPARAM_WQ_STALL_PANIC option
workqueue: Process extra works in rescuer on memory pressure
workqueue: Process rescuer work items one-by-one using a cursor
workqueue: Make send_mayday() take a PWQ argument directly
|
|
Replace BUG_ON() with panic() in panic_on_wq_watchdog(). This is not
a bug condition but a deliberate forced panic requested by the user
via module parameters to crash the system for debugging purposes.
Using panic() instead of BUG_ON() makes this intent clearer and provides
more informative output about which threshold was exceeded and the actual
values, making it easier to diagnose the stall condition from crash dumps.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Add a new module parameter 'panic_on_stall_time' that triggers a panic
when a workqueue stall persists for longer than the specified duration
in seconds.
Unlike 'panic_on_stall' which counts accumulated stall events, this
parameter triggers based on the duration of a single continuous stall.
This is useful for catching truly stuck workqueues rather than
accumulating transient stalls.
Usage:
workqueue.panic_on_stall_time=120
This would panic if any workqueue pool has been stalled for 120 seconds
or more.
The stall duration is measured from the workqueue last progress
(poll_ts) which accounts for legitimate system stalls.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Add a kernel config option to set the default value of
workqueue.panic_on_stall, similar to CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC,
CONFIG_BOOTPARAM_HARDLOCKUP_PANIC and CONFIG_BOOTPARAM_HUNG_TASK_PANIC.
This allows setting the number of workqueue stalls before triggering
a kernel panic at build time, which is useful for high-availability
systems that need consistent panic-on-stall, in other words, those
servers which run with CONFIG_BOOTPARAM_*_PANIC=y already.
The default remains 0 (disabled). Setting it to 1 will panic on the
first stall, and higher values will panic after that many stall
warnings. The value can still be overridden at runtime via the
workqueue.panic_on_stall boot parameter or sysfs.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Until now, cpuset would propagate isolated partition changes to
workqueues so that unbound workers get properly reaffined.
Since housekeeping now centralizes, synchronize and propagates isolation
cpumask changes, perform the work from that subsystem for consolidation
and consistency purposes.
For simplification purpose, the target function is adapted to take the
new housekeeping mask instead of the isolated mask.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: "Michal Koutný" <mkoutny@suse.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Marco Crivellari <marco.crivellari@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Waiman Long <longman@redhat.com>
Cc: cgroups@vger.kernel.org
|
|
Make the rescuer process more work on the last pwq when there are no
more to rescue for the whole workqueue to help the regular workers in
case it is a temporary memory pressure relief and to reduce relapse.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Previously, the rescuer scanned for all matching work items at once and
processed them within a single rescuer thread, which could cause one
blocking work item to stall all others.
Make the rescuer process work items one-by-one instead of slurping all
matches in a single pass.
Break the rescuer loop after finding and processing the first matching
work item, then restart the search to pick up the next. This gives
normal worker threads a chance to process other items which gives them
the opportunity to be processed instead of waiting on the rescuer's
queue and prevents a blocking work item from stalling the rest once
memory pressure is relieved.
Introduce a dummy cursor work item to avoid potentially O(N^2)
rescans of the work list. The marker records the resume position for
the next scan, eliminating redundant traversals.
Also introduce RESCUER_BATCH to control the maximum number of work items
the rescuer processes in each turn, and move on to other PWQs when the
limit is reached.
Cc: ying chen <yc1082463@gmail.com>
Reported-by: ying chen <yc1082463@gmail.com>
Fixes: e22bee782b3b ("workqueue: implement concurrency managed dynamic worker pool")
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Make send_mayday() operate on a PWQ directly instead of taking a work
item, so that rescuer_thread() now calls send_mayday(pwq) instead of
open-coding the mayday list manipulation.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The commit1 def98c84b6cd ("workqueue: Fix spurious sanity check failures
in destroy_workqueue()") tries to fix spurious sanity check failures by
stopping send_mayday() via setting wq->rescuer to NULL.
But it fails to stop the pwq->mayday_node requeuing in the rescuer, and
the commit2 e66b39af00f4 ("workqueue: Fix pwq ref leak in
rescuer_thread()") fixes it by checking wq->rescuer which is the result
of commit1.
Both commits together really fix spurious sanity check failures caused
by the rescuer, but they both use a convoluted method by relying on
wq->rescuer state rather than the real count of work items.
Actually __WQ_DESTROYING and drain_workqueue() together already stop
send_mayday() by draining all the work items and ensuring no new work
item requeuing.
And the more proper fix to stop the pwq->mayday_node requeuing in the
rescuer is from commit3 4f3f4cf388f8 ("workqueue: avoid unneeded
requeuing the pwq in rescuer thread") and renders the checking of
wq->rescuer in commit2 unnecessary.
So __WQ_DESTROYING, drain_workqueue() and commit3 together fix spurious
sanity check failures introduced by the rescuer.
Just remove the convoluted code of using wq->rescuer.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
If the pwq does not need rescue (normal workers have been created or
become available), the rescuer can immediately move on to other stalled
pwqs.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Move the code to assign work to rescuer and assign_rescuer_work().
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The affinity to set to the rescuers should be consistent in all paths
when a rescuer is in detached state. The affinity could be either
wq_unbound_cpumask or unbound_effective_cpumask(wq).
Related paths:
rescuer's worker_detach_from_pool()
update wq_unbound_cpumask
update wq's cpumask
init_rescuer()
Both affinities are Ok as long as they are consistent in all paths.
In the commit 449b31ad2937 ("workqueue: Init rescuer's affinities as
the wq's effective cpumask") makes init_rescuer use
unbound_effective_cpumask(wq) which is consistent with then
apply_wqattrs_commit().
But using unbound_effective_cpumask(wq) requres much more code to
maintain the consistency, and it doesn't make much sense since the
affinity is only effective when the rescuer is not processing works.
wq_unbound_cpumask is more favorable.
So apply_wqattrs_commit() and the path of "updating wq's cpumask" had
been changed to not update the rescuer's affinity, and both the paths
of "updating wq_unbound_cpumask" and "rescuer's
worker_detach_from_pool()" had been changed to use wq_unbound_cpumask.
Now, make init_rescuer() use wq_unbound_cpumask for rescuer's affinity
and make all the paths consistent.
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
When workqueue cpumask changes are committed, the DISASSOCIATED workers
affinity is not touched and this might be a problem down the line for
isolated setups when the DISASSOCIATED pools still have works to run
after the cpu is offline.
Make sure the workers' affinity is updated every time a workqueue cpumask
changes, so these workers can't break isolation.
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
When a rescuer is attached to a pool, its affinity should be only
managed by the pool.
But updating the detached rescuer's affinity is still meaningful so
that it will not disrupt isolated CPUs when it is to be waken up.
But the commit d64f2fa064f8 ("kernel/workqueue: Let rescuers follow
unbound wq cpumask changes") updates the affinity unconditionally, and
causes some issues
1) it also changes the affinity when the rescuer is already attached to
a pool, which violates the affinity management.
2) the said commit tries to update the affinity of the rescuers, but it
misses the rescuers of the PERCPU workqueues, and isolated CPUs can
be possibly disrupted by these rescuers when they are summoned.
3) The affinity to set to the rescuers should be consistent in all paths
when a rescuer is in detached state. The affinity could be either
wq_unbound_cpumask or unbound_effective_cpumask(wq). Related paths:
rescuer's worker_detach_from_pool()
update wq_unbound_cpumask
update wq's cpumask
init_rescuer()
Both affinities are Ok as long as they are consistent in all paths.
But using unbound_effective_cpumask(wq) requres much more code to
maintain the consistency, and it doesn't make much sense since the
affinity is only effective when the rescuer is not processing works.
wq_unbound_cpumask is more favorable.
Fix the 1) issue by testing rescuer->pool before updating with
wq_pool_attach_mutex held.
Fix the 2) issue by moving the rescuer's affinity updating code to
the place updating wq_unbound_cpumask and make it also update for
PERCPU workqueues.
Partially cleanup the 3) consistency issue by using wq_unbound_cpumask.
So that the path of "updating wq's cpumask" doesn't need to maintain it.
and both the paths of "updating wq_unbound_cpumask" and "rescuer's
worker_detach_from_pool()" use wq_unbound_cpumask.
Cleanup for init_rescuer()'s consistency for affinity can be done in
future.
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
assert_rcu_or_wq_mutex_or_pool_mutex is never referenced in the code.
Just remove it.
Signed-off-by: zhang jiao <zhangjiao2@cmss.chinamobile.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistentcy cannot be addressed without refactoring the API.
alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.
This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.
This patch adds a new WQ_PERCPU flag to explicitly request the use of
the per-CPU behavior. Both flags coexist for one release cycle to allow
callers to transition their calls.
Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.
With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.
All existing users have been updated accordingly.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistentcy cannot be addressed without refactoring the API.
system_wq is a per-CPU worqueue, yet nothing in its name tells about that
CPU affinity constraint, which is very often not required by users. Make
it clear by adding a system_percpu_wq.
queue_work() / queue_delayed_work() mod_delayed_work() will now use the
new per-cpu wq: whether the user still stick on the old name a warn will
be printed along a wq redirect to the new one.
This patch add the new system_percpu_wq except for mm, fs and net
subsystem, whom are handled in separated patches.
The old wq will be kept for a few release cylces.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Currently if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistentcy cannot be addressed without refactoring the API.
system_unbound_wq should be the default workqueue so as not to enforce
locality constraints for random work whenever it's not required.
Adding system_dfl_wq to encourage its use when unbound work should be used.
queue_work() / queue_delayed_work() / mod_delayed_work() will now use the
new unbound wq: whether the user still use the old wq a warn will be
printed along with a wq redirect to the new one.
The old system_unbound_wq will be kept for a few release cycles.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
While a BH work item is canceled, the core code spins until it
determines that the item completed. On PREEMPT_RT the spinning relies on
a lock in local_bh_disable() to avoid a live lock if the canceling
thread has higher priority than the BH-worker and preempts it. This lock
ensures that the BH-worker makes progress by PI-boosting it.
This lock in local_bh_disable() is a central per-CPU BKL and about to be
removed.
To provide the required synchronisation add a per pool lock. The lock is
acquired by the bh_worker at the begin while the individual callbacks
are invoked. To enforce progress in case of interruption, __flush_work()
needs to acquire the lock.
This will flush all BH-work items assigned to that pool.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The wq_watchdog_timer_fn() is executed in the softirq context, this
is already in the RCU read critical section, this commit therefore
remove rcu_read_lock/unlock() in wq_watchdog_timer_fn().
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The preempt_disable/enable() has already formed RCU read crtical
section, this commit therefore remove rcu_read_lock/unlock() in
workqueue_congested().
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Pull workqueue updates from Tejun Heo:
- Prepare for defaulting to unbound workqueue. A separate branch was
created to ease pulling in from other trees but none of the
conversions have landed yet
- Memory allocation profiling support added
- Misc changes
* tag 'wq-for-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Use atomic_try_cmpxchg_relaxed() in tryinc_node_nr_active()
workqueue: Remove unused work_on_cpu_safe
workqueue: Add new WQ_PERCPU flag
workqueue: Add system_percpu_wq and system_dfl_wq
workqueue: Basic memory allocation profiling support
workqueue: fix opencoded cpumask_next_and_wrap() in wq_select_unbound_cpu()
|
|
Use try_cmpxchg() family of locking primitives instead of
cmpxchg(*ptr, old, new) == old.
The x86 CMPXCHG instruction returns success in the ZF flag, so this
change saves a compare after CMPXCHG (and related move instruction
in front of CMPXCHG).
Also, try_cmpxchg() implicitly assigns old *ptr value to "old" when
CMPXCHG fails. There is no need to re-read the value in the loop.
The generated assembly improves from:
3f7: 44 8b 0a mov (%rdx),%r9d
3fa: eb 12 jmp 40e <...>
3fc: 8d 79 01 lea 0x1(%rcx),%edi
3ff: 89 c8 mov %ecx,%eax
401: f0 0f b1 7a 04 lock cmpxchg %edi,0x4(%rdx)
406: 39 c1 cmp %eax,%ecx
408: 0f 84 83 00 00 00 je 491 <...>
40e: 8b 4a 04 mov 0x4(%rdx),%ecx
411: 41 39 c9 cmp %ecx,%r9d
414: 7f e6 jg 3fc <...>
to:
256b: 45 8b 08 mov (%r8),%r9d
256e: 41 8b 40 04 mov 0x4(%r8),%eax
2572: 41 39 c1 cmp %eax,%r9d
2575: 7e 10 jle 2587 <...>
2577: 8d 78 01 lea 0x1(%rax),%edi
257a: f0 41 0f b1 78 04 lock cmpxchg %edi,0x4(%r8)
2580: 75 f0 jne 2572 <...>
No functional change intended.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The last use of the work_on_cpu_safe() macro was removed recently by
commit 9cda46babdfe ("crypto: n2 - remove Niagara2 SPU driver")
Remove it, and the work_on_cpu_safe_key() function it calls.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Now when isolcpus is enabled via the cmdline, wq_isolated_cpumask does
not include these isolated CPUs, even wq_unbound_cpumask has already
excluded them. It is only when we successfully configure an isolate cpuset
partition that wq_isolated_cpumask gets overwritten by
workqueue_unbound_exclude_cpumask(), including both the cmdline-specified
isolated CPUs and the isolated CPUs within the cpuset partitions.
Fix this issue by initializing wq_isolated_cpumask properly in
workqueue_init_early().
Fixes: fe28f631fa94 ("workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask")
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
|
|
Currently, if a user enqueue a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work() that is using system_wq and queue_work(), that makes use
again of WORK_CPU_UNBOUND.
This lack of consistentcy cannot be addressed without refactoring the API.
system_wq is a per-CPU worqueue, yet nothing in its name tells about that
CPU affinity constraint, which is very often not required by users. Make it
clear by adding a system_percpu_wq.
system_unbound_wq should be the default workqueue so as not to enforce
locality constraints for random work whenever it's not required.
Adding system_dfl_wq to encourage its use when unbound work should be used.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Hook alloc_workqueue and alloc_workqueue_attrs() so that they're
accounted to the callsite. Since we're doing allocations on behalf of
another subsystem, this helps when using memory allocation profiling to
check for leaks.
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The dedicated helper is more verbose and effective comparing to
cpumask_first() followed by cpumask_next().
Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Move this API to the canonical timer_*() namespace.
[ tglx: Redone against pre rc1 ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
|
|
Pull workqueue updates from Tejun Heo:
"Fix statistic update race condition and a couple documentation
updates"
* tag 'wq-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: fix typo in comment
workqueue: Fix race condition in wq->stats incrementation
workqueue: Better document teardown for delayed_work
|
|
Move this API to the canonical timer_*() namespace.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250507175338.672442-10-mingo@kernel.org
|
|
Fixed a race condition in incrementing wq->stats[PWQ_STAT_COMPLETED] by
moving the operation under pool->lock.
Reported-by: syzbot+01affb1491750534256d@syzkaller.appspotmail.com
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
destroy_workqueue() does not ensure that non-pending work submitted with
queue_delayed_work() gets cancelled. The caller has to ensure that
manually.
Add this information about delayed_work in destroy_workqueue()'s
docstring.
Add a TODO for destroy_workqueue() to wait for all delayed_work.
Signed-off-by: Philipp Stanner <phasta@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree
over and remove the historical wrapper inlines.
Conversion was done with coccinelle plus manual fixups where necessary.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue update from Tejun Heo:
"This contains a patch improve debug visibility.
While it isn't a fix, the change carries virtually no risk and makes
it substantially easier to chase down a class of problems"
* tag 'wq-for-6.14-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Log additional details when rejecting work
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fix from Tejun Heo:
- Fix a regression where a worker pool can be freed before rescuer
workers are done with it leading to user-after-free
* tag 'wq-for-6.14-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Put the pwq after detaching the rescuer from the pool
|
|
Syzbot regularly runs into the following warning on arm64:
| WARNING: CPU: 1 PID: 6023 at kernel/workqueue.c:2257 current_wq_worker kernel/workqueue_internal.h:69 [inline]
| WARNING: CPU: 1 PID: 6023 at kernel/workqueue.c:2257 is_chained_work kernel/workqueue.c:2199 [inline]
| WARNING: CPU: 1 PID: 6023 at kernel/workqueue.c:2257 __queue_work+0xe50/0x1308 kernel/workqueue.c:2256
| Modules linked in:
| CPU: 1 UID: 0 PID: 6023 Comm: klogd Not tainted 6.13.0-rc2-syzkaller-g2e7aff49b5da #0
| Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
| pstate: 404000c5 (nZcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
| pc : __queue_work+0xe50/0x1308 kernel/workqueue_internal.h:69
| lr : current_wq_worker kernel/workqueue_internal.h:69 [inline]
| lr : is_chained_work kernel/workqueue.c:2199 [inline]
| lr : __queue_work+0xe50/0x1308 kernel/workqueue.c:2256
[...]
| __queue_work+0xe50/0x1308 kernel/workqueue.c:2256 (L)
| delayed_work_timer_fn+0x74/0x90 kernel/workqueue.c:2485
| call_timer_fn+0x1b4/0x8b8 kernel/time/timer.c:1793
| expire_timers kernel/time/timer.c:1839 [inline]
| __run_timers kernel/time/timer.c:2418 [inline]
| __run_timer_base+0x59c/0x7b4 kernel/time/timer.c:2430
| run_timer_base kernel/time/timer.c:2439 [inline]
| run_timer_softirq+0xcc/0x194 kernel/time/timer.c:2449
The warning is probably because we are trying to queue work into a
destroyed workqueue, but the softirq context makes it hard to pinpoint
the problematic caller.
Extend the warning diagnostics to print both the function we are trying
to queue as well as the name of the workqueue.
Cc: Tejun Heo <tj@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Link: https://syzkaller.appspot.com/bug?extid=e13e654d315d4da1277c
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"The various patchsets are summarized below. Plus of course many
indivudual patches which are described in their changelogs.
- "Allocate and free frozen pages" from Matthew Wilcox reorganizes
the page allocator so we end up with the ability to allocate and
free zero-refcount pages. So that callers (ie, slab) can avoid a
refcount inc & dec
- "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to
use large folios other than PMD-sized ones
- "Fix mm/rodata_test" from Petr Tesarik performs some maintenance
and fixes for this small built-in kernel selftest
- "mas_anode_descend() related cleanup" from Wei Yang tidies up part
of the mapletree code
- "mm: fix format issues and param types" from Keren Sun implements a
few minor code cleanups
- "simplify split calculation" from Wei Yang provides a few fixes and
a test for the mapletree code
- "mm/vma: make more mmap logic userland testable" from Lorenzo
Stoakes continues the work of moving vma-related code into the
(relatively) new mm/vma.c
- "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David
Hildenbrand cleans up and rationalizes handling of gfp flags in the
page allocator
- "readahead: Reintroduce fix for improper RA window sizing" from Jan
Kara is a second attempt at fixing a readahead window sizing issue.
It should reduce the amount of unnecessary reading
- "synchronously scan and reclaim empty user PTE pages" from Qi Zheng
addresses an issue where "huge" amounts of pte pagetables are
accumulated:
https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/
Qi's series addresses this windup by synchronously freeing PTE
memory within the context of madvise(MADV_DONTNEED)
- "selftest/mm: Remove warnings found by adding compiler flags" from
Muhammad Usama Anjum fixes some build warnings in the selftests
code when optional compiler warnings are enabled
- "mm: don't use __GFP_HARDWALL when migrating remote pages" from
David Hildenbrand tightens the allocator's observance of
__GFP_HARDWALL
- "pkeys kselftests improvements" from Kevin Brodsky implements
various fixes and cleanups in the MM selftests code, mainly
pertaining to the pkeys tests
- "mm/damon: add sample modules" from SeongJae Park enhances DAMON to
estimate application working set size
- "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn
provides some cleanups to memcg's hugetlb charging logic
- "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song
removes the global swap cgroup lock. A speedup of 10% for a
tmpfs-based kernel build was demonstrated
- "zram: split page type read/write handling" from Sergey Senozhatsky
has several fixes and cleaups for zram in the area of
zram_write_page(). A watchdog softlockup warning was eliminated
- "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin
Brodsky cleans up the pagetable destructor implementations. A rare
use-after-free race is fixed
- "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes
simplifies and cleans up the debugging code in the VMA merging
logic
- "Account page tables at all levels" from Kevin Brodsky cleans up
and regularizes the pagetable ctor/dtor handling. This results in
improvements in accounting accuracy
- "mm/damon: replace most damon_callback usages in sysfs with new
core functions" from SeongJae Park cleans up and generalizes
DAMON's sysfs file interface logic
- "mm/damon: enable page level properties based monitoring" from
SeongJae Park increases the amount of information which is
presented in response to DAMOS actions
- "mm/damon: remove DAMON debugfs interface" from SeongJae Park
removes DAMON's long-deprecated debugfs interfaces. Thus the
migration to sysfs is completed
- "mm/hugetlb: Refactor hugetlb allocation resv accounting" from
Peter Xu cleans up and generalizes the hugetlb reservation
accounting
- "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino
removes a never-used feature of the alloc_pages_bulk() interface
- "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park
extends DAMOS filters to support not only exclusion (rejecting),
but also inclusion (allowing) behavior
- "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi
introduces a new memory descriptor for zswap.zpool that currently
overlaps with struct page for now. This is part of the effort to
reduce the size of struct page and to enable dynamic allocation of
memory descriptors
- "mm, swap: rework of swap allocator locks" from Kairui Song redoes
and simplifies the swap allocator locking. A speedup of 400% was
demonstrated for one workload. As was a 35% reduction for kernel
build time with swap-on-zram
- "mm: update mips to use do_mmap(), make mmap_region() internal"
from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that
mmap_region() can be made MM-internal
- "mm/mglru: performance optimizations" from Yu Zhao fixes a few
MGLRU regressions and otherwise improves MGLRU performance
- "Docs/mm/damon: add tuning guide and misc updates" from SeongJae
Park updates DAMON documentation
- "Cleanup for memfd_create()" from Isaac Manjarres does that thing
- "mm: hugetlb+THP folio and migration cleanups" from David
Hildenbrand provides various cleanups in the areas of hugetlb
folios, THP folios and migration
- "Uncached buffered IO" from Jens Axboe implements the new
RWF_DONTCACHE flag which provides synchronous dropbehind for
pagecache reading and writing. To permite userspace to address
issues with massive buildup of useless pagecache when
reading/writing fast devices
- "selftests/mm: virtual_address_range: Reduce memory" from Thomas
Weißschuh fixes and optimizes some of the MM selftests"
* tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
mm/compaction: fix UBSAN shift-out-of-bounds warning
s390/mm: add missing ctor/dtor on page table upgrade
kasan: sw_tags: use str_on_off() helper |