| author | Baokun Li <libaokun@linux.alibaba.com> | 2026-05-21 17:50:16 +0800 |
|---|---|---|
| committer | Christian Brauner <brauner@kernel.org> | 2026-05-22 12:06:35 +0200 |
| commit | 31c1d19ead2c26a63859a2757d8b786765ba9cdd (patch) | |
| tree | eaf2bf1530d119356ec790fddb0175ac6f601842 /rust/kernel/alloc/kvec/errors.rs | |
| parent | e90a6d668e26e00a72df2d09c173b563468f09c9 (diff) | |
writeback: use a per-sb counter to drain inode wb switches at umount
Tracking in-flight inode wb switches with a single global counter
(isw_nr_in_flight) plus a synchronize_rcu() based wait in
cgroup_writeback_umount() forces every umount to take a global hit
whenever any other superblock on the system has wb switches in flight,
even if the superblock being unmounted has none of its own.
Replace the global synchronize_rcu()/flush_workqueue() pair with a
per-sb counter, s_isw_nr_in_flight, plus three small helpers:
- cgroup_writeback_pin(sb) - increment counter
- cgroup_writeback_unpin(sb) - decrement and wake drainer if last
- cgroup_writeback_drain(sb) - wait for counter to reach zero
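The three helpers can be modelled in userspace as a rough sketch. This is not the kernel code: struct sb_model and the model_* names are hypothetical, and a pthread mutex and condition variable stand in for the kernel's wait_var_event() wait and its matching wake-up.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical userspace stand-in for the per-sb state (not struct super_block). */
struct sb_model {
	atomic_int isw_nr_in_flight;	/* models s_isw_nr_in_flight */
	pthread_mutex_t lock;		/* protects the wait/wake handshake */
	pthread_cond_t drained;		/* signalled when the counter hits zero */
};

static void sb_model_init(struct sb_model *sb)
{
	atomic_init(&sb->isw_nr_in_flight, 0);
	pthread_mutex_init(&sb->lock, NULL);
	pthread_cond_init(&sb->drained, NULL);
}

/* Models cgroup_writeback_pin(): one more switch in flight on this sb. */
static void model_pin(struct sb_model *sb)
{
	atomic_fetch_add(&sb->isw_nr_in_flight, 1);
}

/* Models cgroup_writeback_unpin(): decrement, wake the drainer if last. */
static void model_unpin(struct sb_model *sb)
{
	/* atomic_fetch_sub() returns the old value; 1 means we were last. */
	if (atomic_fetch_sub(&sb->isw_nr_in_flight, 1) == 1) {
		pthread_mutex_lock(&sb->lock);
		pthread_cond_broadcast(&sb->drained);
		pthread_mutex_unlock(&sb->lock);
	}
}

/* Models cgroup_writeback_drain(): block until the counter is zero. */
static void model_drain(struct sb_model *sb)
{
	pthread_mutex_lock(&sb->lock);
	while (atomic_load(&sb->isw_nr_in_flight) != 0)
		pthread_cond_wait(&sb->drained, &sb->lock);
	pthread_mutex_unlock(&sb->lock);
}
```

In this model the drainer re-checks the counter under the lock, and the last unpinner takes the same lock before broadcasting, so a wake-up cannot slip in between the check and the wait. A drain on one sb_model never looks at another instance's counter, which is the property the per-sb design is after.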
The wiring is:
- inode_prepare_wbs_switch() pins before checking SB_ACTIVE and
  grabbing the inode; failure paths unpin before returning. A
  lockless SB_ACTIVE check at the top of the function lets us skip
  the atomic_inc/smp_mb dance once SB_ACTIVE has been cleared (it
  is monotonic and never set back).
- process_inode_switch_wbs() unpins after the matching iput().
- cgroup_writeback_umount() drains the per-sb counter via
  wait_var_event().
The smp_mb() pair between inode_prepare_wbs_switch() and
cgroup_writeback_umount() keeps the SB_ACTIVE / counter ordering:
either the umounter sees a non-zero counter and waits, or the
switcher sees SB_ACTIVE cleared and aborts before grabbing the
inode.
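The ordering argument above can be sketched with C11 atomics. This is an illustrative model only: struct sb_order_model and both functions are hypothetical, a sequentially consistent fence stands in for smp_mb(), and clearing the flag is folded into the umount-side function for brevity, although in the kernel SB_ACTIVE is cleared elsewhere in the shutdown path.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of the two fields involved in the ordering. */
struct sb_order_model {
	atomic_bool active;	/* models SB_ACTIVE */
	atomic_int in_flight;	/* models s_isw_nr_in_flight */
};

/* Switcher side: returns true if the switch may proceed with a pin held. */
static bool model_prepare_switch(struct sb_order_model *sb)
{
	/* Lockless fast path: once cleared, the flag is never set again. */
	if (!atomic_load_explicit(&sb->active, memory_order_relaxed))
		return false;

	/* Pin first, then a full barrier, then re-check the flag. */
	atomic_fetch_add_explicit(&sb->in_flight, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* stands in for smp_mb() */

	if (!atomic_load_explicit(&sb->active, memory_order_relaxed)) {
		/* Failure path: drop the pin (the wake-up is omitted here). */
		atomic_fetch_sub_explicit(&sb->in_flight, 1, memory_order_relaxed);
		return false;
	}
	return true;
}

/* Umount side: returns how many switches the drainer still has to wait for. */
static int model_begin_umount(struct sb_order_model *sb)
{
	atomic_store_explicit(&sb->active, false, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* pairs with the fence above */
	return atomic_load_explicit(&sb->in_flight, memory_order_relaxed);
}
```

With both fences in place, the outcome in which the switcher still sees the flag set while the umounter still sees a zero counter is ruled out, so at least one side notices the other.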
The global isw_nr_in_flight is left in place, since it is still used
to throttle in-flight switches via WB_FRN_MAX_IN_FLIGHT.
The rcu_read_lock() extension in inode_switch_wbs() and
cleanup_offline_cgwb() that the race fix added is no longer needed
and is reverted; the synchronize_rcu() that the race fix added to
cgroup_writeback_umount() is dropped as well.
The following numbers were measured on a 16 vCPU QEMU guest with 4
background superblocks each churning "create memcg -> write 1 MiB ->
rmdir memcg" to keep the global isw_nr_in_flight non-zero. Latencies
are wall-clock around umount(8); only the target sb's umount is
measured.
Target sb runs its own cgwb churn:

                              p50       p95       p99       max
  global synchronize_rcu()    67.6 ms   88.3 ms   88.3 ms   96.8 ms
  per-sb counter (this)        7.9 ms   10.0 ms   10.0 ms   10.1 ms

Idle target umount latency under cross-sb cgwb-switch pressure:

                              p50       p95       p99       max
  global synchronize_rcu()    62.7 ms   95.4 ms  108.1 ms  108.6 ms
  per-sb counter (this)        5.3 ms    6.9 ms    7.4 ms    7.4 ms
  no-pressure baseline         4.9 ms    5.9 ms    6.3 ms    6.7 ms

8 concurrent umounts of idle sbs under the same pressure:

                              p50       p95       max
  global synchronize_rcu()    61.3 ms   99.5 ms  113.7 ms
  per-sb counter (this)        8.1 ms    9.1 ms    9.5 ms

In-kernel cgroup_writeback_umount() time across the same run
(bpftrace, ~340 calls covering all scenarios):

  global synchronize_rcu()    12371 ms total  (~36 ms / call)
  per-sb counter (this)        1.37 ms total  ( ~4 us / call)
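The per-call figures follow from dividing each total by the call count. A small sketch of that arithmetic, taking the approximate count of 340 calls as exact (an assumption; the helper name is hypothetical):

```c
#include <assert.h>

/* Average time per call in microseconds, given a total in milliseconds. */
static double per_call_us(double total_ms, double calls)
{
	return total_ms * 1000.0 / calls;
}
```

With these inputs, per_call_us(12371.0, 340.0) is roughly 36400 us (about 36 ms) and per_call_us(1.37, 340.0) is roughly 4 us, matching the rounded values quoted above.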
Suggested-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/177910456953.488929.2169908940676707307.b4-review@b4
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun@linux.alibaba.com>
Link: https://patch.msgid.link/20260521095016.2791354-4-libaokun@linux.alibaba.com
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
