|
Consider the following sequence on a CPU configured with nohz_full:
1) A task P runs in cgroup A, and cgroup A becomes throttled due to CFS
bandwidth control. The group scheduling entity (gse) of cgroup A, to
which task P is attached, is dequeued and the CPU switches to idle.
2) Before cgroup A is unthrottled, task P is migrated from cgroup A to
another cgroup B (not throttled).
During sched_move_task(), the task P is observed as queued but not
running, and therefore no resched_curr() is triggered.
3) Since the CPU is nohz_full, it remains in do_idle() waiting for an
explicit scheduling event, i.e., resched_curr().
4) For kernels <= 5.10: Later, cgroup A is unthrottled. However, task P
has already been migrated out of cgroup A, so unthrottle_cfs_rq()
may observe load_weight == 0 and return early without calling
resched_curr(). For kernels >= 6.6: The unthrottling path normally
triggers resched_curr() in almost all cases, even when no runnable
tasks remain in the unthrottled cgroup, preventing the idle stall
described above. However, if cgroup A is removed before it gets
unthrottled, the unthrottling path for cgroup A is never executed.
As a result, resched_curr() is never called.
5) At this point, task P is runnable in cgroup B (not throttled), but
the CPU remains in do_idle() with no pending reschedule point. The
system stays in this state until an unrelated event (e.g. a new task
wakeup) triggers a resched_curr() that breaks the nohz_full idle
state, and task P finally gets scheduled.
The root cause is that sched_move_task() may classify the task as only
queued, not running, and therefore fails to trigger a resched_curr(),
while the later unthrottling path no longer has visibility of the
migrated task.
Preserve the existing behavior for running tasks by issuing
resched_curr(), and explicitly invoke check_preempt_curr() for tasks
that were queued at the time of migration. This ensures that runnable
tasks are reconsidered for scheduling even when nohz_full suppresses
periodic ticks.
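A minimal sketch of the resulting logic at the end of sched_move_task()
(a simplification for illustration, not the literal patch):

    if (running)
        resched_curr(rq);                  /* preserved behaviour */
    else if (queued)
        check_preempt_curr(rq, tsk, 0);    /* new: re-evaluate preemption */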
Fixes: 29f59db3a74b ("sched: group-scheduler core")
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Aaron Lu <ziqianlu@bytedance.com>
Tested-by: Aaron Lu <ziqianlu@bytedance.com>
Link: https://patch.msgid.link/20260130083438.1122457-1-quzicheng@huawei.com
|
|
Use %pe format specifier for printing PTR_ERR() error values
to make error messages more readable.
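For illustration, the typical conversion looks like the following
(hypothetical context, not the exact hunk from cpufreq_schedutil.c):

    ptr = some_init_call();     /* hypothetical call returning ERR_PTR() */
    if (IS_ERR(ptr)) {
        /* before: prints a raw negative number */
        pr_warn("init failed: %ld\n", PTR_ERR(ptr));
        /* after: prints the symbolic name, e.g. -EINVAL */
        pr_warn("init failed: %pe\n", ptr);
    }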
Found by Coccinelle:
./cpufreq_schedutil.c:685:49-56: WARNING: Consider using %pe to print PTR_ERR()
Signed-off-by: zenghongling <zenghongling@kylinos.cn>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260120083333.148385-1-zenghongling@kylinos.cn
|
|
CPU0 becomes overloaded when hosting a CPU-bound RT task, a non-CPU-bound
RT task, and a CFS task stuck in kernel space. When other CPUs switch from
RT to non-RT tasks, RT load balancing (LB) is triggered; with
HAVE_RT_PUSH_IPI enabled, they send IPIs to CPU0 to drive the execution
of rto_push_irq_work_func. During push_rt_task on CPU0,
if next_task->prio < rq->donor->prio, resched_curr() sets NEED_RESCHED
and after the push operation completes, CPU0 calls rto_next_cpu().
Since only CPU0 is overloaded in this scenario, rto_next_cpu() should
ideally return -1 (no further IPI needed).
However, multiple CPUs invoke tell_cpu_to_push() during LB, each
incrementing rd->rto_loop_next. Even when rd->rto_cpu is set to -1, the mismatch between
rd->rto_loop and rd->rto_loop_next forces rto_next_cpu() to restart its
search from -1. With CPU0 remaining overloaded (satisfying rt_nr_migratory
&& rt_nr_total > 1), it gets reselected, causing CPU0 to queue irq_work to
itself and send self-IPIs repeatedly. As long as CPU0 stays overloaded and
other CPUs run pull_rt_tasks(), it falls into an infinite self-IPI loop,
which triggers a CPU hardlockup due to continuous self-interrupts.
The triggering scenario is as follows:
  cpu0                          cpu1                cpu2
                                pull_rt_task
                                tell_cpu_to_push
  <------------ irq_work_queue_on
  rto_push_irq_work_func
    push_rt_task
      resched_curr(rq)                              pull_rt_task
    rto_next_cpu                                    tell_cpu_to_push
  <------------------------------------------ atomic_inc(rto_loop_next)
    rd->rto_loop != next
    rto_next_cpu
    irq_work_queue_on
  rto_push_irq_work_func
Fix redundant self-IPI by filtering the initiating CPU in rto_next_cpu().
This solution has been verified to effectively eliminate spurious self-IPIs
and prevent CPU hardlockup scenarios.
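One way to express the filter, as a hedged sketch (the real
rto_next_cpu() loop differs in structure):

    /* inside rto_next_cpu()'s search loop, sketch only */
    cpu = cpumask_next(rd->rto_cpu, rd->rto_mask);
    /* never hand the IPI back to the CPU that is already pushing */
    if (cpu == smp_processor_id())
        cpu = cpumask_next(cpu, rd->rto_mask);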
Fixes: 4bdced5c9a29 ("sched/rt: Simplify the IPI based RT balancing logic")
Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://patch.msgid.link/20260122012533.673768-1-chenjinghuang2@huawei.com
|
|
Read-mostly sched_clock_irqtime may share a cacheline with the
frequently updated nohz struct. Convert it to a static key to avoid the
false sharing issue.
The only user of disable_sched_clock_irqtime() is tsc_.*mark_unstable(),
which may be invoked from atomic context and would therefore require a
workqueue to disable the static key. But both callers invoke
clear_sched_clock_stable() just before disable_sched_clock_irqtime(), so
we can reuse "sched_clock_work" to also disable sched_clock_irqtime().
One additional case needs handling: if the TSC is marked unstable before
the late_initcall() phase, sched_clock_work will not be invoked and
sched_clock_irqtime will stay enabled although the clock is unstable:
  tsc_init()
    enable_sched_clock_irqtime()        # irqtime accounting is enabled here
  ...
  if (unsynchronized_tsc())             # true
    mark_tsc_unstable()
      clear_sched_clock_stable()
        __sched_clock_stable_early = 0;
        ...
        if (static_key_count(&sched_clock_running.key) == 2)
          # Only happens at sched_clock_init_late()
          __clear_sched_clock_stable(); # Never executed
  ...
  # late_initcall() phase
  sched_clock_init_late()
    if (__sched_clock_stable_early)     # Already false
      __set_sched_clock_stable();       # sched_clock is never marked stable
  # TSC unstable, but sched_clock_work won't run to disable irqtime
So we need to call disable_sched_clock_irqtime() from
sched_clock_init_late() if the clock is unstable.
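A hedged sketch of the static-key form (the name follows the existing
sched_clock_irqtime variable; details may differ from the patch):

    DEFINE_STATIC_KEY_FALSE(sched_clock_irqtime);

    static inline bool irqtime_enabled(void)
    {
        return static_branch_likely(&sched_clock_irqtime);
    }

    void enable_sched_clock_irqtime(void)
    {
        static_branch_enable(&sched_clock_irqtime);
    }

    /* must not run in atomic context, hence the workqueue reuse above */
    void disable_sched_clock_irqtime(void)
    {
        static_branch_disable(&sched_clock_irqtime);
    }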
Reported-by: Benjamin Lei <benjamin.lei@intel.com>
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Wangyang Guo <wangyang.guo@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Tianyou Li <tianyou.li@intel.com>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://patch.msgid.link/20260127072509.2627346-1-wangyang.guo@intel.com
|
|
Add a new kselftest to verify that the total_bw value in
/sys/kernel/debug/sched/debug remains consistent across all CPUs
under different sched_ext BPF program states:
1. Before a BPF scheduler is loaded
2. While a BPF scheduler is loaded and active
3. After a BPF scheduler is unloaded
The test runs CPU stress threads to ensure DL server bandwidth
values stabilize before checking consistency. This helps catch
potential issues with DL server bandwidth accounting during
sched_ext transitions.
Co-developed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/20260126100050.3854740-8-arighi@nvidia.com
|
|
Add a selftest to validate the correct behavior of the deadline server
for the ext_sched_class.
Co-developed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/20260126100050.3854740-7-arighi@nvidia.com
|
|
There are two problems with sched_server_write_common() that can cause
the dl_server to malfunction when attempting to change its parameters:
1) After the dl_server has been disabled by setting runtime=0,
re-enabling it while tasks are already enqueued leaves is_active at
0, so dl_server_start() is not called.
2) When dl_server_apply_params() fails, the runtime is not applied and
does not reflect the new state.
Instead, have dl_server_start() check its actual dl_runtime, and have
sched_server_write_common() unconditionally (re)start the dl_server. It
will automatically stop if there isn't anything to do, so a spurious
activation is harmless -- while failing to start it is a problem.
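A hedged sketch of the new guard in dl_server_start() (not the literal
patch):

    void dl_server_start(struct sched_dl_entity *dl_se)
    {
        /* a server with no runtime is disabled; starting it is a no-op */
        if (!dl_se->dl_runtime)
            return;
        ...
    }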
While there, move the printk out of the locked region and make it symmetric,
also printing on enable.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260203103407.GK1282955@noisy.programming.kicks-ass.net
|
|
When a sched_ext server is loaded, tasks in the fair class are
automatically moved to the sched_ext class. Add support to modify the
ext server parameters similar to how the fair server parameters are
modified.
Re-use common code between ext and fair servers as needed.
Co-developed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/20260126100050.3854740-6-arighi@nvidia.com
|
|
sched_ext currently suffers starvation due to RT: the same workload,
when converted to EXT, can get zero runtime if RT is running 100% of
the time, causing EXT processes to stall. Fix it by adding a DL server
for EXT.
A kselftest is also included later to confirm that both DL servers are
functioning correctly:
# ./runner -t rt_stall
===== START =====
TEST: rt_stall
DESCRIPTION: Verify that RT tasks cannot stall SCHED_EXT tasks
OUTPUT:
TAP version 13
1..1
# Runtime of FAIR task (PID 1511) is 0.250000 seconds
# Runtime of RT task (PID 1512) is 4.750000 seconds
# FAIR task got 5.00% of total runtime
ok 1 PASS: FAIR task got more than 4.00% of runtime
TAP version 13
1..1
# Runtime of EXT task (PID 1514) is 0.250000 seconds
# Runtime of RT task (PID 1515) is 4.750000 seconds
# EXT task got 5.00% of total runtime
ok 2 PASS: EXT task got more than 4.00% of runtime
TAP version 13
1..1
# Runtime of FAIR task (PID 1517) is 0.250000 seconds
# Runtime of RT task (PID 1518) is 4.750000 seconds
# FAIR task got 5.00% of total runtime
ok 3 PASS: FAIR task got more than 4.00% of runtime
TAP version 13
1..1
# Runtime of EXT task (PID 1521) is 0.250000 seconds
# Runtime of RT task (PID 1522) is 4.750000 seconds
# EXT task got 5.00% of total runtime
ok 4 PASS: EXT task got more than 4.00% of runtime
ok 1 rt_stall #
===== END =====
Co-developed-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/20260126100050.3854740-5-arighi@nvidia.com
|
|
Currently the DL server interface for applying parameters checks CFS
internals to identify whether the server is active. This is error-prone
and makes it difficult to add new servers in the future.
Fix it, by using dl_server_active() which is also used by the DL server
code to determine if the DL server was started.
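Conceptually (hedged sketch, field names illustrative only):

    /* before: peeking at CFS internals to see if the server runs */
    if (rq->cfs.h_nr_queued) ...

    /* after: ask the DL server itself */
    if (dl_server_active(dl_se)) ...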
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Tejun Heo <tj@kernel.org>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/20260126100050.3854740-4-arighi@nvidia.com
|
|
Updating "ppos" on error conditions does not make much sense. The pattern
is to return the error code directly without modifying the position, or
modify the position on success and return the number of bytes written.
Since on success, the return value of apply is 0, there is no point in
modifying ppos either. Fix it by removing all this and just returning
error code or number of bytes written on success.
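The intended pattern, as a hedged sketch of such a write handler
(apply_params() is a hypothetical stand-in):

    static ssize_t server_write(struct file *filp, const char __user *ubuf,
                                size_t cnt, loff_t *ppos)
    {
        int ret = apply_params(filp, ubuf, cnt);

        if (ret)
            return ret;     /* error: leave *ppos untouched */

        *ppos += cnt;       /* success: advance and report bytes written */
        return cnt;
    }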
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Juri Lelli <juri.lelli@redhat.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Tejun Heo <tj@kernel.org>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/20260126100050.3854740-3-arighi@nvidia.com
|
|
The defer params were not cleared in __dl_clear_params. Clear them.
Without this, some of my test cases are flaky and the DL timer is
not starting correctly AFAICS.
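A hedged sketch of the fix (the defer field names come from the
deferrable server work and are assumed here):

    static void __dl_clear_params(struct sched_dl_entity *dl_se)
    {
        ...
        dl_se->dl_defer         = 0;
        dl_se->dl_defer_armed   = 0;
        dl_se->dl_defer_running = 0;
    }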
Fixes: a110a81c52a9 ("sched/deadline: Deferrable dl server")
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Tested-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/20260126100050.3854740-2-arighi@nvidia.com
|
|
Update to avoid conflicts with /urgent patches.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
|
|
Fix two memory leaks in the gfs2_fill_super() error handling path when
transitioning a filesystem to read-write mode fails.
First leak: kthread objects (thread_struct, task_struct, etc.)
When gfs2_freeze_lock_shared() fails after init_threads() succeeds, the
created kernel threads (logd and quotad) are never destroyed. This
occurs because the fail_per_node label doesn't call
gfs2_destroy_threads().
Second leak: quota bitmap buffer (8192 bytes)
When gfs2_make_fs_rw() fails after gfs2_quota_init() succeeds but
before other operations complete, the allocated quota bitmap is never
freed.
The fix moves thread cleanup to the fail_per_node label to handle all
error paths uniformly. gfs2_destroy_threads() is safe to call
unconditionally as it checks for NULL pointers. Quota cleanup is added
in gfs2_make_fs_rw() to properly handle the withdrawal case where
quota initialization succeeds but the filesystem is then withdrawn.
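A hedged sketch of the reworked error path (labels as named above,
exact placement may differ):

    fail_per_node:
        /* safe to call unconditionally: it checks for NULL task pointers */
        gfs2_destroy_threads(sdp);
        ...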
Thread leak backtrace (gfs2_freeze_lock_shared failure):
unreferenced object 0xffff88801d7bca80 (size 4480):
copy_process+0x3a1/0x4670 kernel/fork.c:2422
kernel_clone+0xf3/0x6e0 kernel/fork.c:2779
kthread_create_on_node+0x100/0x150 kernel/kthread.c:478
init_threads+0xab/0x350 fs/gfs2/ops_fstype.c:611
gfs2_fill_super+0xe5c/0x1240 fs/gfs2/ops_fstype.c:1265
Quota leak backtrace (gfs2_make_fs_rw failure):
unreferenced object 0xffff88812de7c000 (size 8192):
gfs2_quota_init+0xe5/0x820 fs/gfs2/quota.c:1409
gfs2_make_fs_rw+0x7a/0xe0 fs/gfs2/super.c:149
gfs2_fill_super+0xfbb/0x1240 fs/gfs2/ops_fstype.c:1275
Reported-by: syzbot+aac438d7a1c44071e04b@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=aac438d7a1c44071e04b
Fixes: 6c7410f44961 ("gfs2: gfs2_freeze_lock_shared cleanup")
Fixes: b66f723bb552 ("gfs2: Improve gfs2_make_fs_rw error handling")
Link: https://lore.kernel.org/all/20260131062509.77974-1-kartikey406@gmail.com/T/ [v1]
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
alloc_gcs() returns an error-encoded pointer on failure, which comes
from do_mmap(), not NULL.
The current NULL check fails to detect errors, which could lead to using
an invalid GCS address.
Use IS_ERR_VALUE() to properly detect errors, consistent with the
check in gcs_alloc_thread_stack().
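Illustration of the check (hedged sketch; the argument list of
alloc_gcs() is assumed):

    unsigned long addr;

    addr = alloc_gcs(base, size);
    if (IS_ERR_VALUE(addr))   /* catches error-encoded addresses from do_mmap() */
        return (int)addr;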
Fixes: b57180c75c7e ("arm64/gcs: Implement shadow stack prctl() interface")
Reviewed-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Will Deacon <will@kernel.org>
|
|
Clang warns (or errors with CONFIG_WERROR=y / W=e):
drivers/gpu/drm/panel/panel-ilitek-ili9882t.c:95:16: error: initializer overrides prior initialization of this subobject [-Werror,-Winitializer-overrides]
95 | .vbr_enable = 0,
| ^
drivers/gpu/drm/panel/panel-ilitek-ili9882t.c:90:16: note: previous initialization is here
90 | .vbr_enable = false,
| ^~~~~
drivers/gpu/drm/panel/panel-ilitek-ili9882t.c:97:19: error: initializer overrides prior initialization of this subobject [-Werror,-Winitializer-overrides]
97 | .rc_model_size = DSC_RC_MODEL_SIZE_CONST,
| ^~~~~~~~~~~~~~~~~~~~~~~
include/drm/display/drm_dsc.h:22:38: note: expanded from macro 'DSC_RC_MODEL_SIZE_CONST'
22 | #define DSC_RC_MODEL_SIZE_CONST 8192
| ^~~~
drivers/gpu/drm/panel/panel-ilitek-ili9882t.c:91:19: note: previous initialization is here
91 | .rc_model_size = DSC_RC_MODEL_SIZE_CONST,
| ^~~~~~~~~~~~~~~~~~~~~~~
include/drm/display/drm_dsc.h:22:38: note: expanded from macro 'DSC_RC_MODEL_SIZE_CONST'
22 | #define DSC_RC_MODEL_SIZE_CONST 8192
| ^~~~
drivers/gpu/drm/panel/panel-ilitek-ili9882t.c:132:25: error: initializer overrides prior initialization of this subobject [-Werror,-Winitializer-overrides]
132 | .initial_scale_value = 32,
| ^~
drivers/gpu/drm/panel/panel-ilitek-ili9882t.c:126:25: note: previous initialization is here
126 | .initial_scale_value = 32,
| ^~
drivers/gpu/drm/panel/panel-ilitek-ili9882t.c:133:20: error: initializer overrides prior initialization of this subobject [-Werror,-Winitializer-overrides]
133 | .nfl_bpg_offset = 3511,
| ^~~~
drivers/gpu/drm/panel/panel-ilitek-ili9882t.c:108:20: note: previous initialization is here
108 | .nfl_bpg_offset = 1402,
| ^~~~
GCC would warn about this in the same manner but its version of the
warning, -Woverride-init, is disabled for a normal kernel build in
scripts/Makefile.extrawarn. For clang, -Wextra in drivers/gpu/drm/Makefile
turns the warning back on, whereas GCC respects it being turned off
earlier in the command line.
Of all the duplicate fields in the initializer, only nfl_bpg_offset has
a different value. Clear up the duplicate initializers, keeping the
'false' value for .vbr_enable, as the member is a bool, and the second
value for .nfl_bpg_offset, assuming it is the correct one since it was
the one tested in the original change.
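The shape of the problem, reduced to a hedged example (not the driver's
actual initializer):

    static const struct drm_dsc_config cfg = {
        .vbr_enable     = false,
        .nfl_bpg_offset = 1402,
        /* ... many fields later, the same members appear again: */
        .vbr_enable     = 0,    /* silently overrides 'false' above */
        .nfl_bpg_offset = 3511, /* silently overrides 1402 above */
    };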
Fixes: 65ce1f5834e9 ("drm/panel: ilitek-ili9882t: Switch Tianma TL121BVMS07 to DSC 120Hz mode")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Link: https://patch.msgid.link/20260114-panel-ilitek-ili9882t-fix-override-init-v1-1-1d69a2b096df@kernel.org
Signed-off-by: Maxime Ripard <mripard@kernel.org>
|
|
The interrupt handler iio_trigger_generic_data_rdy_poll() will invoke
other interrupt handlers, and this is supposed to happen from hard
interrupt context. Use IRQF_NO_THREAD to forbid forced threading.
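The change amounts to passing the flag at request time, roughly (hedged
sketch; the name string and cookie are assumptions):

    ret = request_irq(irq, iio_trigger_generic_data_rdy_poll,
                      IRQF_NO_THREAD,   /* keep it in hard irq context */
                      "mgb4", trig);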
Fixes: 0ab13674a9bd1 ("media: pci: mgb4: Added Digiteq Automotive MGB4 driver")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260128095540.863589-21-bigeasy@linutronix.de
|
|
Now that there is no one utilizing compressed_bio::compressed_folios[],
we can safely remove it along with compressed_bio::nr_folios. The size
of the structure is reduced from 352 to 336 bytes on x86_64.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently only encoded writes use btrfs_submit_compressed_write(),
which relies on the compressed_bio::compressed_folios[] array.
Change the only call site to call the new helper,
btrfs_alloc_compressed_write(), to allocate a compressed bio, then queue
the needed folios into that bio, and finally call
btrfs_submit_compressed_write() to submit the compressed bio.
This change has one hidden benefit: previously we used
btrfs_alloc_folio_array() for the folios of
btrfs_submit_compressed_write(), which doesn't utilize the compression
page pool for bs == ps cases.
Now we call btrfs_alloc_compr_folio(), which benefits from the page pool.
The other obvious benefit is that we no longer need to allocate an array
to hold all those folios, thus one less error path.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently btrfs_submit_compressed_read() still uses the
compressed_bio::compressed_folios[] array.
Change it to allocate each folio and queue it into the compressed bio
so that we do not need to allocate that array.
Considering how small each compressed read bio is (less than 128KiB), we
do not benefit that much from btrfs_alloc_folio_array() anyway,
while we may benefit more from btrfs_alloc_compr_folio() by using
the global folio pool.
So changing from btrfs_alloc_folio_array() to btrfs_alloc_compr_folio()
in a loop should still be fine.
This removes one error path, and paves the way to completely remove
compressed_folios[] array.
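A hedged sketch of the resulting allocation loop:

    for (i = 0; i < nr_folios; i++) {
        struct folio *folio = btrfs_alloc_compr_folio();

        if (!folio)
            return -ENOMEM;
        /* queue the folio straight into the compressed bio */
        bio_add_folio(&cb->bbio.bio, folio, folio_size(folio), 0);
    }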
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since it's been replaced by btrfs_compress_bio(), remove all involved
functions.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This switch has the following benefits:
- A single structure to handle all compression
  No more extra members like compressed_folios[] or compress_type.
  This makes the async_extent structure much smaller.
- Simpler error handling
  A single cleanup_compressed_bio() will handle everything, with no
  extra compressed_folios[] array to bother with.
Some extra notes:
- Compressed folios releasing
  Now we use a bio_for_each_folio_all() loop to release the folios of
  the bio (see the sketch after these notes). This works for both the
  old compressed_folios[] array and the new pure bio method.
  For the old compressed_folios[] array, all folios of that array are
  queued into the bio, so releasing the folios from the bio is the same
  as releasing each folio of the array. We just need to make sure the
  folios are not released twice, once via the array and once via the
  bio.
  For the new pure bio method, the array is NULL, and the folios of the
  bio are released as usual.
  The only extra note is for end_bbio_compressed_read(): since its
  folios are allocated using btrfs_alloc_folio_array(), they should
  only be released by regular folio_put(), not btrfs_free_compr_folio().
- Rounding up the bio to block size
  We cannot simply increase bi_size, as that will not increase the
  length of the last bvec.
  Thus we have to properly add the last part into the bio. This is done
  by the helper round_up_last_block().
  The reason we do not round those bios up at compression time is to
  keep the unaligned compressed size, so that it can be utilized for
  inline extents.
  If we rounded the bios up in *_compress_bio(), every compressed bio
  would be larger than or equal to one fs block, resulting in no inline
  compressed extents.
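The release loop mentioned in the first note, as a hedged sketch for
the write path (where the folios come from the compression pool):

    struct folio_iter fi;

    /* release every folio that was queued into the compressed bio */
    bio_for_each_folio_all(fi, &cb->bbio.bio)
        btrfs_free_compr_folio(fi.folio);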
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The helper will allocate a new compressed_bio, do the compression, and
return it to the caller.
This greatly simplifies the compression path: we no longer need to
allocate a folio array, which removes one error path, and the
compressed bio structure can be utilized for submission with very minor
modifications (like rounding up the bi_size and populating the
bi_sector).
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The new helper has the following enhancements over the existing
zlib_compress_folios():
- Much smaller parameter list
  No more shared IN/OUT members, no need to pre-allocate a
  compressed_folios[] array.
  Just a workspace and a compressed_bio pointer; everything we need can
  be extracted from that @cb pointer.
- Ready-to-be-submitted compressed bio
  Although the caller still needs to do some common work, like rounding
  up and zeroing the tail of the last fs block.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The new helper has the following enhancements over the existing
zstd_compress_folios():
- Much smaller parameter list
  No more shared IN/OUT members, no need to pre-allocate a
  compressed_folios[] array.
  Just a workspace and a compressed_bio pointer; everything we need can
  be extracted from that @cb pointer.
- Ready-to-be-submitted compressed bio
  Although the caller still needs to do some common work, like rounding
  up and zeroing the tail of the last fs block.
Overall the workflow is the same as zstd_compress_folios(), but with
some minor changes:
- @start/@len are now constant
  For the current input file offset, use @start + @tot_in instead.
  The original code modified @start and @len, which made it pretty hard
  to know what value we're really comparing against.
- No more @cur_len
  It's only utilized when switching the input buffer.
  Directly use btrfs_calc_input_length() instead.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The new helper has the following enhancements over the existing
lzo_compress_folios():
- Much smaller parameter list
  No more shared IN/OUT members, no need to pre-allocate a
  compressed_folios[] array.
  Just a workspace list header and a compressed_bio pointer.
  Everything else can be fetched from that @cb pointer.
- Ready-to-be-submitted compressed bio
  Although the caller still needs to do some common work, like rounding
  up and zeroing the tail of the last fs block.
Some work is specific to LZO and not needed by the other multi-run
compression interfaces:
- Need to write an LZO header or segment header
  Use the new write_and_queue_folio() helper to do the bio_add_folio()
  call and folio switching.
- Need to update the LZO header after compression is done
  Use bio_first_folio_all() to grab the first folio and update the header.
- Extra corner case in error handling
  This can happen when we have queued part of a folio and then hit an
  error. In that case those folios will be released by the bio, so we
  can only release the folio that has no queued part.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Separate the btrfs_load_block_group_*() calling path into its own
function, so that it can serve as an entry point for unit tests.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Add auto-cleanup helper functions for btrfs_free_dummy_fs_info and
btrfs_free_dummy_block_group.
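Presumably these are built on the cleanup.h helpers, along these lines
(hedged sketch):

    DEFINE_FREE(btrfs_free_dummy_fs_info, struct btrfs_fs_info *,
                if (_T) btrfs_free_dummy_fs_info(_T))

    /* usage: freed automatically when the variable goes out of scope */
    struct btrfs_fs_info *fs_info __free(btrfs_free_dummy_fs_info) =
            btrfs_alloc_dummy_fs_info(nodesize, sectorsize);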
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We allocate the bitmap but we never free it in free_raid_bio_pointers().
Fix this by adding a bitmap_free() call against the stripe_uptodate_bitmap
of a raid bio.
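A hedged sketch of the fix:

    static void free_raid_bio_pointers(struct btrfs_raid_bio *rbio)
    {
        ...
        bitmap_free(rbio->stripe_uptodate_bitmap);
    }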
Fixes: 1810350b04ef ("btrfs: raid56: move sector_ptr::uptodate into a dedicated bitmap")
Reported-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/linux-btrfs/20260126045315.GA31641@lst.de/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Tested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
I ran into another sort of trivial bug in v1 of the patch and concluded
that these functions really ought to be unit tested.
These two functions form the core of searching the chunk allocation
pending extent bitmap and have relatively easily definable semantics, so
unit testing them can help ensure the correctness of chunk allocation.
I also made a minor unrelated fix in volumes.h to properly forward
declare btrfs_space_info. Because of the order of the includes in the
new test, this was actually hitting a latent build warning.
Note:
This is an early example for me of a commit authored in part by an AI
agent, so I wanted to be more clear about what I did. I defined a
trivial test and explained the set of tests I wanted to the agent, and
it produced the large set of test cases seen here. I then checked each
test case to make sure it matched the description and simplified the
constants and numbers until they looked reasonable to me. I then checked
the looping logic to make sure it stayed true to the original spirit of
the trivial test. Finally, I carefully combed over all the lines it
wrote to loop over the tests it generated to make sure they followed
our code style guide.
Assisted-by: Claude:claude-opus-4-5
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
I have been observing a number of systems aborting at
insert_dev_extents() in btrfs_create_pending_block_groups(). The
following is a sample stack trace of such an abort coming from forced
chunk allocation (typically behind CONFIG_BTRFS_EXPERIMENTAL) but this
can theoretically happen to any DUP chunk allocation.
[81.801] ------------[ cut here ]------------
[81.801] BTRFS: Transaction aborted (error -17)
[81.801] WARNING: fs/btrfs/block-group.c:2876 at btrfs_create_pending_block_groups+0x721/0x770 [btrfs], CPU#1: bash/319
[81.802] Modules linked in: virtio_net btrfs xor zstd_compress raid6_pq null_blk
[81.803] CPU: 1 UID: 0 PID: 319 Comm: bash Kdump: loaded Not tainted 6.19.0-rc6+ #319 NONE
[81.803] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.17.0-2-2 04/01/2014
[81.804] RIP: 0010:btrfs_create_pending_block_groups+0x723/0x770 [btrfs]
[81.806] RSP: 0018:ffffa36241a6bce8 EFLAGS: 00010282
[81.806] RAX: 000000000000000d RBX: ffff8e699921e400 RCX: 0000000000000000
[81.807] RDX: 0000000002040001 RSI: 00000000ffffffef RDI: ffffffffc0608bf0
[81.807] RBP: 00000000ffffffef R08: ffff8e69830f6000 R09: 0000000000000007
[81.808] R10: ffff8e699921e5e8 R11: 0000000000000000 R12: ffff8e6999228000
[81.808] R13: ffff8e6984d82000 R14: ffff8e69966a69c0 R15: ffff8e69aa47b000
[81.809] FS: 00007fec6bdd9740(0000) GS:ffff8e6b1b379000(0000) knlGS:0000000000000000
[81.809] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[81.810] CR2: 00005604833670f0 CR3: 0000000116679000 CR4: 00000000000006f0
[81.810] Call Trace:
[81.810] <TASK>
[81.810] __btrfs_end_transaction+0x3e/0x2b0 [btrfs]
[81.811] btrfs_force_chunk_alloc_store+0xcd/0x140 [btrfs]
[81.811] kernfs_fop_write_iter+0x15f/0x240
[81.812] vfs_write+0x264/0x500
[81.812] ksys_write+0x6c/0xe0
[81.812] do_syscall_64+0x66/0x770
[81.812] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[81.813] RIP: 0033:0x7fec6be66197
[81.814] RSP: 002b:00007fffb159dd30 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[81.815] RAX: ffffffffffffffda RBX: 00007fec6bdd9740 RCX: 00007fec6be66197
[81.815] RDX: 0000000000000002 RSI: 0000560483374f80 RDI: 0000000000000001
[81.816] RBP: 0000560483374f80 R08: 0000000000000000 R09: 0000000000000000
[81.816] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002
[81.817] R13: 00007fec6bfb85c0 R14: 00007fec6bfb5ee0 R15: 00005604833729c0
[81.817] </TASK>
[81.817] irq event stamp: 20039
[81.818] hardirqs last enabled at (20047): [<ffffffff99a68302>] __up_console_sem+0x52/0x60
[81.818] hardirqs last disabled at (20056): [<ffffffff99a682e7>] __up_console_sem+0x37/0x60
[81.819] softirqs last enabled at (19470): [<ffffffff999d2b46>] __irq_exit_rcu+0x96/0xc0
[81.819] softirqs last disabled at (19463): [<ffffffff999d2b46>] __irq_exit_rcu+0x96/0xc0
[81.820] ---[ end trace 0000000000000000 ]---
[81.820] BTRFS: error (device dm-7 state A) in btrfs_create_pending_block_groups:2876: errno=-17 Object already exists
Inspecting these aborts with drgn, I observed a pattern of overlapping
chunk_maps. Note how stripe 1 of the first chunk overlaps in physical
address with stripe 0 of the second chunk.
Physical Start Physical End Length Logical Type Stripe
----------------------------------------------------------------------------------------------------
0x0000000102500000 0x0000000142500000 1.0G 0x0000000641d00000 META|DUP 0/2
0x0000000142500000 0x0000000182500000 1.0G 0x0000000641d00000 META|DUP 1/2
0x0000000142500000 0x0000000182500000 1.0G 0x0000000601d00000 META|DUP 0/2
0x0000000182500000 0x00000001c2500000 1.0G 0x0000000601d00000 META|DUP 1/2
Now how could this possibly happen? All chunk allocation is protected
by the chunk_mutex, so racing allocations should see a consistent view
of the CHUNK_ALLOCATED bit in the chunk allocation extent-io-tree
(device->alloc_state, as set by chunk_map_device_set_bits()). The tree
itself is protected by a spin lock, and clearing/setting the bits is
always protected by fs_info->mapping_tree_lock, so no race is apparent.
It turns out that there is a subtle bug in the logic regarding chunk
allocations that have happened in the current transaction, known as
"pending extents". The chunk allocation as defined in
find_free_dev_extent() is a loop which searches the commit root of the
dev_root and looks for gaps between DEV_EXTENT items. For those gaps, it
then checks alloc_state bitmap for any pending extents and adjusts the
hole that it finds accordingly. However, the logic in that adjustment
assumes that the first pending extent is the only one in that range.
e.g., given a layout with two non-consecutive pending extents in a hole
passed to dev_extent_hole_check() via *hole_start and *hole_size:
  |----pending A----|   real hole   |----pending B----|
  |                 candidate hole                    |
  *hole_start                   *hole_start + *hole_size
the code incorrectly returns a "hole" from the end of pending extent A
until the passed in hole end, failing to account for pending B.
However, it is not entirely obvious that it is actually possible to
produce such a layout. I was able to reproduce it, but with some
contortions: I continued to use the force chunk allocation sysfs file
and I introduced a long delay (10 seconds) into the start of the cleaner
thread. I also prevented the unused bgs cleaning logic from ever
deleting metadata bgs. These help make it easier to deterministically
produce the condition but shouldn't really matter if you imagine the
conditions happening by race/luck. Allocations/frees can happen
concurrently with the cleaner thread preparing to process an unused
extent and both create some used chunks with an unused chunk
interleaved, all during one transaction. Then btrfs_delete_unused_bgs()
sees the unused one and clears it, leaving a range with several pending
chunk allocations and a gap in the middle.
The basic idea is that the unused_bgs cleanup work happens on a worker
so if we allocate 3 block groups in one transaction, then the cleaner
work kicked off by the previous transaction comes through and deletes
the middle one of the 3, then the commit root shows no dev extents and
we have the bad pattern in the extent-io-tree. One final consideration
is that the code happens to loop to the next hole if there are no more
extents at all, so we need one more dev extent way past the area we are
working in. Something like the following demonstrates the technique:
# push the BG frontier out to 20G
fallocate -l 20G $mnt/foo
# allocate one more that will prevent the "no more dev extents" luck
fallocate -l 1G $mnt/sticky
# sync
sync
# clear out the allocation area
rm $mnt/foo
sync
_cleaner
# let everything quiesce
sleep 20
sync
# dev tree should have one bg 20G out and the rest at the beginning..
# sort of like an empty FS but with a random sticky chunk.
# kick off the cleaner in the background, remember it will sleep 10s
# before doing interesting work
_cleaner &
sleep 3
# create 3 trivial block groups, all empty, all immediately marked as unused.
echo 1 > "$(_btrfs_sysfs_space_info $dev metadata)/force_chunk_alloc"
echo 1 > "$(_btrfs_sysfs_space_info $dev data)/force_chunk_alloc"
echo 1 > "$(_btrfs_sysfs_space_info $dev metadata)/force_chunk_alloc"
# let the cleaner thread definitely finish, it will remove the data bg
sleep 10
# this allocation sees the non-consecutive pending metadata chunks with
# data chunk gap of 1G and allocates a 2G extent in that hole. ENOSPC!
echo 1 > "$(_btrfs_sysfs_space_info $dev metadata)/force_chunk_alloc"
As for the fix, it is not that obvious. I could not see a trivial way to
do it even by adding backup loops into find_free_dev_extent(), so I
opted to change the semantics of dev_extent_hole_check() to not stop
looping until it finds a sufficiently big hole. For clarity, this also
required changing the helper function contains_pending_extent() into two
new helpers which find the first pending extent and the first suitable
hole in a range.
I attempted to clean up the documentation and range calculations to be
as consistent and clear as possible for the future.
I also looked at the zoned case and concluded that the loop there is
different and not to be unified with this one. As far as I can tell, the
zoned check will only further constrain the hole so looping back to find
more holes is acceptable. Though given that zoned really only appends, I
find it highly unlikely that it is susceptible to this bug.
Fixes: 1b9845081633 ("Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole")
Reported-by: Dimitrios Apostolou <jimis@gmx.net>
Closes: https://lore.kernel.org/linux-btrfs/q7760374-q1p4-029o-5149-26p28421s468@tzk.arg/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When trimming unallocated space, btrfs_trim_fs() holds the
device_list_mutex for the entire duration while iterating through all
devices. On large filesystems with significant unallocated space, this
operation can take minutes or even hours.
This causes a problem because btrfs_run_dev_stats(), which is called
during transaction commit, also requires device_list_mutex:
  btrfs_trim_fs()
    mutex_lock(&fs_devices->device_list_mutex)
    list_for_each_entry(device, ...)
      btrfs_trim_free_extents(device)
    mutex_unlock(&fs_devices->device_list_mutex)

  commit_transaction()
    btrfs_run_dev_stats()
      mutex_lock(&fs_devices->device_list_mutex) // blocked!
      ...
While trim is running, all transaction commits are blocked waiting for
the mutex.
Fix this by refactoring btrfs_trim_free_extents() to process each
device in bounded chunks (up to 2GB per iteration) and release the
device_list_mutex between chunks.
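Schematically, the per-device loop becomes something like this (hedged
sketch; the range-based helper signature is an assumption):

    u64 start = 0;

    while (start < device->total_bytes) {
        u64 len = min_t(u64, SZ_2G, device->total_bytes - start);

        mutex_lock(&fs_devices->device_list_mutex);
        ret = btrfs_trim_free_extents(device, start, len, &trimmed);
        mutex_unlock(&fs_devices->device_list_mutex);
        if (ret)
            break;
        start += len;
    }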
Signed-off-by: robbieko <robbieko@synology.com>
Signed-off-by: jinbaohong <jinbaohong@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When a fatal signal is pending or the process is freezing,
btrfs_trim_block_group() and btrfs_trim_free_extents() return -ERESTARTSYS.
Currently this is treated as a regular error: the loops continue to the
next iteration and count it as a block group or device failure.
Instead, break out of the loops immediately and return -ERESTARTSYS to
userspace without counting it as a failure. Also skip the device loop
entirely if the block group loop was interrupted.
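In the block group loop this looks roughly like (hedged sketch):

    ret = btrfs_trim_block_group(block_group, &group_trimmed,
                                 start, end, range->minlen);
    if (ret == -ERESTARTSYS)
        break;          /* interrupted: stop, do not count as failure */
    if (ret) {
        bg_failed++;
        bg_ret = ret;
        continue;
    }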
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: jinbaohong <jinbaohong@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When multiple block groups or devices fail during trim, preserve the
first error encountered rather than the last one. The first error is
typically more useful for debugging as it represents the original
failure, while subsequent errors may be cascading effects.
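I.e., conceptually (hedged sketch):

    if (ret) {
        dev_failed++;
        if (!dev_ret)   /* keep the first error, not the last */
            dev_ret = ret;
        continue;
    }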
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: jinbaohong <jinbaohong@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Commit 93bba24d4b5a ("btrfs: Enhance btrfs_trim_fs function to handle
error better") intended to make device trimming continue even if one
device fails, tracking failures and reporting them at the end. However,
it used 'break' instead of 'continue', causing the loop to exit on the
first device failure.
Fix this by replacing 'break' with 'continue'.
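The fix in essence (hedged sketch of the device loop):

    list_for_each_entry(device, &fs_devices->devices, dev_list) {
        ret = btrfs_trim_free_extents(device, &group_trimmed);
        if (ret) {
            dev_failed++;
            dev_ret = ret;
            continue;   /* was: break */
        }
        trimmed += group_trimmed;
    }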
Fixes: 93bba24d4b5a ("btrfs: Enhance btrfs_trim_fs function to handle error better")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: jinbaohong <jinbaohong@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There's no need to BUG_ON(), we can just abort the transaction and return
an error.
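I.e., the usual pattern (hedged sketch, diff style):

    -       BUG_ON(ret);
    +       if (ret) {
    +               btrfs_abort_transaction(trans, ret);
    +               return ret;
    +       }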
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When btrfs_remove_block_group() fails we abort the transaction in its
single caller (btrfs_remove_chunk()). This makes it harder to find out
where exactly the failure happened, as several steps inside
btrfs_remove_ |