| Age | Commit message (Collapse) | Author | Files | Lines |
|
Pull fsverity updates from Eric Biggers:
"fsverity cleanups, speedup, and memory usage optimization from
Christoph Hellwig:
- Move some logic into common code
- Fix btrfs to reject truncates of fsverity files
- Improve the readahead implementation
- Store each inode's fsverity_info in a hash table instead of using a
pointer in the filesystem-specific part of the inode.
This optimizes for memory usage in the usual case where most files
don't have fsverity enabled.
- Look up the fsverity_info fewer times during verification, to
amortize the hash table overhead"
* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux:
fsverity: remove inode from fsverity_verification_ctx
fsverity: use a hashtable to find the fsverity_info
btrfs: consolidate fsverity_info lookup
f2fs: consolidate fsverity_info lookup
ext4: consolidate fsverity_info lookup
fs: consolidate fsverity_info lookup in buffer.c
fsverity: push out fsverity_info lookup
fsverity: deconstify the inode pointer in struct fsverity_info
fsverity: kick off hash readahead at data I/O submission time
ext4: move ->read_folio and ->readahead to readpage.c
readahead: push invalidate_lock out of page_cache_ra_unbounded
fsverity: don't issue readahead for non-ENOENT errors from __filemap_get_folio
fsverity: start consolidating pagecache code
fsverity: pass struct file to ->write_merkle_tree_block
f2fs: don't build the fsverity work handler for !CONFIG_FS_VERITY
ext4: don't build the fsverity work handler for !CONFIG_FS_VERITY
fs,fsverity: clear out fsverity_info from common code
fs,fsverity: reject size changes on fsverity files in setattr_prepare
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"New features and improvements for the ext4 file system
- Avoid unnecessary cache invalidation in the extent status cache
(es_cache) when adding extents to be cached in the es_cache and we
are not changing the extent tree
- Add a sysfs parameter, err_report_sec, to control how frequently to
log a warning message that file system inconsistency has been
detected (Previously we logged the warning message every 24 hours)
- Avoid unnecessary forced ordered writes when appending to a file
when delayed allocation is enabled
- Defer splitting unwritten extents to I/O completion to improve
write performance of concurrent direct I/O writes to multiple files
- Refactor and add kunit tests to the extent splitting and conversion
code paths
Various Bug Fixes:
- Fix a panic when the debugging DOUBLE_CHECK macro is defined
- Avoid using fast commit for rare and complex file system operations
to make fast commit easier to reason about. This can also avoid
some corner cases that could result in file system inconsistency if
there is a crash between the fast commit before a subsequent full
commit
- Fix memory leaks in error paths
- Fix a false positive reports caused when running stress tests using
mixed huge-page workloads caused by a race between page migration
and bitmap updates
- Fix a potential recursion into file system reclaim when evicting an
inode when fast commit is enabled
- Fix a warning caused by a potential double decrement to the dirty
clusters counter when executing FS_IOC_SHUTDOWN when running a
stress test
- Enable mballoc optimized scanning regardless whether the inode is
using indirect blocks or extent trees to map blocks"
* tag 'ext4_for_linus-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (45 commits)
et4: allow zeroout when doing written to unwritten split
ext4: refactor split and convert extents
ext4: refactor zeroout path and handle all cases
ext4: propagate flags to ext4_convert_unwritten_extents_endio()
ext4: propagate flags to convert_initialized_extent()
ext4: add extent status cache support to kunit tests
ext4: kunit tests for higher level extent manipulation functions
ext4: kunit tests for extent splitting and conversion
ext4: use optimized mballoc scanning regardless of inode format
ext4: always allocate blocks only from groups inode can use
ext4: fix dirtyclusters double decrement on fs shutdown
ext4: fast commit: make s_fc_lock reclaim-safe
ext4: fix e4b bitmap inconsistency reports
ext4: remove redundant NULL check after __GFP_NOFAIL
ext4: remove EXT4_GET_BLOCKS_IO_CREATE_EXT
ext4: simplify the mapping query logic in ext4_iomap_begin()
ext4: remove unused unwritten parameter in ext4_dio_write_iter()
ext4: remove useless ext4_iomap_overwrite_ops
ext4: avoid starting handle when dio writing an unwritten extent
ext4: don't split extent before submitting I/O
...
|
|
Use the kernel's resizable hash table (rhashtable) to find the
fsverity_info. This way file systems that want to support fsverity don't
have to bloat every inode in the system with an extra pointer. The
trade-off is that looking up the fsverity_info is a bit more expensive
now, but the main operations are still dominated by I/O and hashing
overhead.
The rhashtable implementations requires no external synchronization, and
the _fast versions of the APIs provide the RCU critical sections required
by the implementation. Because struct fsverity_info is only removed on
inode eviction and does not contain a reference count, there is no need
for an extended critical section to grab a reference or validate the
object state. The file open path uses rhashtable_lookup_get_insert_fast,
which can either find an existing object for the hash key or insert a
new one in a single atomic operation, so that concurrent opens never
instantiate duplicate fsverity_info structure. FS_IOC_ENABLE_VERITY must
already be synchronized by a combination of i_rwsem and file system flags
and uses rhashtable_lookup_insert_fast, which errors out on an existing
object for the hash key as an additional safety check.
Because insertion into the hash table now happens before S_VERITY is set,
fsverity just becomes a barrier and a flag check and doesn't have to look
up the fsverity_info at all, so there is only a single lookup per
->read_folio or ->readahead invocation. For btrfs there is an additional
one for each bio completion, while for ext4 and f2fs the fsverity_info
is stored in the per-I/O context and reused for the completion workqueue.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Link: https://lore.kernel.org/r/20260202060754.270269-12-hch@lst.de
[EB: folded in fix for missing fsverity_free_info()]
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
|
|
Keep all the read into pagecache code in a single file.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20260202060754.270269-4-hch@lst.de
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
|
|
Add more kunit tests to cover the high level caller
ext4_map_create_blocks(). We pass flags in a manner that covers
the below function:
1. ext4_ext_handle_unwritten_extents()
1.1 - Split/Convert unwritten extent to written in endio convtext.
1.2 - Split/Convert unwritten extent to written in non endio context.
1.3 - Zeroout tests for the above 2 cases
2. convert_initialized_extent() - Convert written extent to unwritten
during zero range
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/9d8ad32cb62f44999c0fe3545b44fc3113546c70.1769149131.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
s_fc_lock can be acquired from inode eviction and thus is
reclaim unsafe. Since the fast commit path holds s_fc_lock while writing
the commit log, allocations under the lock can enter reclaim and invert
the lock order with fs_reclaim. Add ext4_fc_lock()/ext4_fc_unlock()
helpers which acquire s_fc_lock under memalloc_nofs_save()/restore()
context and use them everywhere so allocations under the lock cannot
recurse into filesystem reclaim.
Fixes: 6593714d67ba ("ext4: hold s_fc_lock while during fast commit")
Signed-off-by: Li Chen <me@linux.beauty>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260106120621.440126-1-me@linux.beauty
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
We do not use EXT4_GET_BLOCKS_IO_CREATE_EXT or split extents before
submitting I/O; therefore, remove the related code.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/20260105014522.1937690-8-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
ext4_iomap_overwrite_ops was introduced in commit 8cd115bdda17 ("ext4:
Optimize ext4 DIO overwrites"), which can optimize pure overwrite
performance by dropping the IOMAP_WRITE flag to only query the mapped
mapping information. This avoids starting a new journal handle, thereby
improving speed. Later, commit 9faac62d4013 ("ext4: optimize file
overwrites") also optimized similar scenarios, but it performs the check
later, examining the mappings status only when the actual block mapping
is needed. Thus, it can handle the previous commit scenario. That means
in the case of an overwrite scenario, the condition
"offset + length <= i_size_read(inode)" in the write path must always be
true.
Therefore, it is acceptable to remove the ext4_iomap_overwrite_ops,
which will also clarify the write and read paths of ext4_iomap_begin.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://patch.msgid.link/20260105014522.1937690-5-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Add a new sysfs attribute "err_report_sec" to control the s_err_report
timer in ext4_sb_info. Writing '0' disables the timer, while writing
a non-zero value enables the timer and sets the timeout in seconds.
Signed-off-by: Baolin Liu <liubaolin@kylinos.cn>
Link: https://patch.msgid.link/20251211030256.28613-1-liubaolin12138@163.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Stop definining these privately and instead move them to the uapi
errno.h so that they become canonical instead of copy pasta.
Cc: linux-api@vger.kernel.org
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Link: https://patch.msgid.link/176826402587.3490369.17659117524205214600.stgit@frogsfrogsfrogs
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Supporting a block size greater than the page size (BS > PS) requires
support for large folios. However, several features (e.g., encrypt)
do not yet support large folios.
To prevent conflicts, this patch adds checks at mount time to prohibit
these features from being used when BS > PS. Since these features cannot
be changed on remount, there is no need to check on remount.
This patch adds s_max_folio_order, initialized during mount according to
filesystem features and mount options. If s_max_folio_order is 0, large
folios are disabled.
With this in place, ext4_set_inode_mapping_order() can be simplified by
checking s_max_folio_order, avoiding redundant checks.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Message-ID: <20251121090654.631996-24-libaokun@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
As BS > PS support is coming, all block number to page index (and
vice-versa) conversions must now go via bytes. Added EXT4_LBLK_TO_PG()
and EXT4_PG_TO_LBLK() macros to simplify these conversions and handle
both BS <= PS and BS > PS scenarios cleanly.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Message-ID: <20251121090654.631996-11-libaokun@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
No functional changes.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Message-ID: <20251121090654.631996-10-libaokun@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
This commit introduces the s_min_folio_order field to the ext4_sb_info
structure. This field will store the minimum folio order required by the
current filesystem, laying groundwork for future support of block sizes
greater than PAGE_SIZE.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Message-ID: <20251121090654.631996-7-libaokun@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Previously, ext4_rec_len_(to|from)_disk only performed complex rec_len
conversions when PAGE_SIZE >= 65536 to reduce complexity.
However, we are soon to support file system block sizes greater than
page size, which makes these conditional checks unnecessary. Thus, these
checks are now removed.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Message-ID: <20251121090654.631996-4-libaokun@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
This flag has been generalized to split an unwritten extent when we do
dio or dioread_nolock writeback, or to avoid merge new extents which was
created by extents split. Update some related comments too.
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Yang Erkun <yangerkun@huawei.com>
Message-ID: <20251112084538.1658232-2-yangerkun@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When creating or querying mapping blocks using the ext4_map_blocks() and
ext4_map_{query|create}_blocks() helpers, also pass out the extent
sequence number of the block mapping info through the ext4_map_blocks
structure. This sequence number can later serve as a valid cookie within
iomap infrastructure and the move extents procedure.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-ID: <20251013015128.499308-5-yi.zhang@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
In the iomap_write_iter(), the iomap buffered write frame does not hold
any locks between querying the inode extent mapping info and performing
page cache writes. As a result, the extent mapping can be changed due to
concurrent I/O in flight. Similarly, in the iomap_writepage_map(), the
write-back process faces a similar problem: concurrent changes can
invalidate the extent mapping before the I/O is submitted.
Therefore, both of these processes must recheck the mapping info after
acquiring the folio lock. To address this, similar to XFS, we propose
introducing an extent sequence number to serve as a validity cookie for
the extent. After commit 24b7a2331fcd ("ext4: clairfy the rules for
modifying extents"), we can ensure the extent information should always
be processed through the extent status tree, and the extent status tree
is always uptodate under i_rwsem or invalidate_lock or folio lock, so
it's safe to introduce this sequence number. The sequence number will be
increased whenever the extent status tree changes, preparing for the
buffered write iomap conversion.
Besides, this mechanism is also applicable for the moving extents case.
In move_extent_per_page(), it also needs to reacquire data_sem and check
the mapping info again under the folio lock.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-ID: <20251013015128.499308-3-yi.zhang@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"New ext4 features:
- Add support so tune2fs can modify/update the superblock using an
ioctl, without needing write access to the block device
- Add support for 32-bit reserved uid's and gid's
Bug fixes:
- Fix potential warnings and other failures caused by corrupted /
fuzzed file systems
- Fail unaligned direct I/O write with EINVAL instead of silently
falling back to buffered I/O
- Correectly handle fsmap queries for metadata mappings
- Avoid journal stalls caused by writeback throttling
- Add some missing GFP_NOFAIL flags to avoid potential deadlocks
under extremem memory pressure
Cleanups:
- Remove obsolete EXT3 Kconfigs"
* tag 'ext4_for_linus-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix checks for orphan inodes
ext4: validate ea_ino and size in check_xattrs
ext4: guard against EA inode refcount underflow in xattr update
ext4: implemet new ioctls to set and get superblock parameters
ext4: add support for 32-bit default reserved uid and gid values
ext4: avoid potential buffer over-read in parse_apply_sb_mount_options()
ext4: fix an off-by-one issue during moving extents
ext4: increase i_disksize to offset + len in ext4_update_disksize_before_punch()
ext4: verify orphan file size is not too big
ext4: fail unaligned direct IO write with EINVAL
ext4: correctly handle queries for metadata mappings
ext4: increase IO priority of fastcommit
ext4: remove obsolete EXT3 config options
jbd2: increase IO priority of checkpoint
ext4: fix potential null deref in ext4_mb_init()
ext4: add ext4_sb_bread_nofail() helper function for ext4_free_branches()
ext4: replace min/max nesting with clamp()
fs: ext4: change GFP_KERNEL to GFP_NOFS to avoid deadlock
|
|
When orphan file feature is enabled, inode can be tracked as orphan
either in the standard orphan list or in the orphan file. The first can
be tested by checking ei->i_orphan list head, the second is recorded by
EXT4_STATE_ORPHAN_FILE inode state flag. There are several places where
we want to check whether inode is tracked as orphan and only some of
them properly check for both possibilities. Luckily the consequences are
mostly minor, the worst that can happen is that we track an inode as
orphan although we don't need to and e2fsck then complains (resulting in
occasional ext4/307 xfstest failures). Fix the problem by introducing a
helper for checking whether an inode is tracked as orphan and use it in
appropriate places.
Fixes: 4a79a98c7b19 ("ext4: Improve scalability of ext4 orphan file handling")
Cc: stable@kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Message-ID: <20250925123038.20264-2-jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Support for specifying the default user id and group id that is
allowed to use the reserved block space was added way back when Linux
only supported 16-bit uid's and gid's. (Yeah, that long ago.) It's
not a commonly used feature, but let's add support for 32-bit user and
group id's.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Message-ID: <20250916-tune2fs-v2-2-d594dc7486f0@mit.edu>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The implicit __GFP_NOFAIL flag in ext4_sb_bread() was removed in commit
8a83ac54940d ("ext4: call bdev_getblk() from sb_getblk_gfp()"), meaning
the function can now fail under memory pressure.
Most callers of ext4_sb_bread() propagate the error to userspace and do not
remount the filesystem read-only. However, ext4_free_branches() handles
ext4_sb_bread() failure by remounting the filesystem read-only.
This implies that an ext3 filesystem (mounted via the ext4 driver) could be
forcibly remounted read-only due to a transient page allocation failure,
which is unacceptable.
To mitigate this, introduce a new helper function, ext4_sb_bread_nofail(),
which explicitly uses __GFP_NOFAIL, and use it in ext4_free_branches().
Fixes: 8a83ac54940d ("ext4: call bdev_getblk() from sb_getblk_gfp()")
Cc: stable@kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Move the fsverity_info pointer into the filesystem-specific part of the
inode by adding the field ext4_inode_info::i_verity_info and configuring
fsverity_operations::inode_info_offs accordingly.
This is a prerequisite for a later commit that removes
inode::i_verity_info, saving memory and improving cache efficiency on
filesystems that don't support fsverity.
Co-developed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Link: https://lore.kernel.org/20250810075706.172910-10-ebiggers@kernel.org
Acked-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Move the fscrypt_inode_info pointer into the filesystem-specific part of
the inode by adding the field ext4_inode_info::i_crypt_info and
configuring fscrypt_operations::inode_info_offs accordingly.
This is a prerequisite for a later commit that removes
inode::i_crypt_info, saving memory and improving cache efficiency with
filesystems that don't support fscrypt.
Co-developed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Link: https://lore.kernel.org/20250810075706.172910-4-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Major ext4 changes for 6.17:
- Better scalability for ext4 block allocation
- Fix insufficient credits when writing back large folios
Miscellaneous bug fixes, especially when handling exteded attriutes,
inline data, and fast commit"
* tag 'ext4_for_linus_6.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (39 commits)
ext4: do not BUG when INLINE_DATA_FL lacks system.data xattr
ext4: implement linear-like traversal across order xarrays
ext4: refactor choose group to scan group
ext4: convert free groups order lists to xarrays
ext4: factor out ext4_mb_scan_group()
ext4: factor out ext4_mb_might_prefetch()
ext4: factor out __ext4_mb_scan_group()
ext4: fix largest free orders lists corruption on mb_optimize_scan switch
ext4: fix zombie groups in average fragment size lists
ext4: merge freed extent with existing extents before insertion
ext4: convert sbi->s_mb_free_pending to atomic_t
ext4: fix typo in CR_GOAL_LEN_SLOW comment
ext4: get rid of some obsolete EXT4_MB_HINT flags
ext4: utilize multiple global goals to reduce contention
ext4: remove unnecessary s_md_lock on update s_mb_last_group
ext4: remove unnecessary s_mb_last_start
ext4: separate stream goal hits from s_bal_goals for better tracking
ext4: add ext4_try_lock_group() to skip busy groups
ext4: initialize superblock fields in the kballoc-test.c kunit tests
ext4: refactor the inline directory conversion and new directory codepaths
...
|
|
This commit converts the `choose group` logic to `scan group` using
previously prepared helper functions. This allows us to leverage xarrays
for ordered non-linear traversal, thereby mitigating the "bouncing" issue
inherent in the `choose group` mechanism.
This also decouples linear and non-linear traversals, leading to cleaner
and more readable code.
Key changes:
* ext4_mb_choose_next_group() is refactored to ext4_mb_scan_groups().
* Replaced ext4_mb_good_group() with ext4_mb_scan_group() in non-linear
traversals, and related functions now return error codes instead of
group info.
* Added ext4_mb_scan_groups_linear() for performing linear scans starting
from a specific group for a set number of times.
* Linear scans now execute up to sbi->s_mb_max_linear_groups times,
so ac_groups_linear_remaining is removed as it's no longer used.
* ac->ac_criteria is now used directly instead of passing cr around.
Also, ac->ac_criteria is incremented directly after groups scan fails
for the corresponding criteria.
* Since we're now directly scanning groups instead of finding a good group
then scanning, the following variables and flags are no longer needed,
s_bal_cX_groups_considered is sufficient.
s_bal_p2_aligned_bad_suggestions
s_bal_goal_fast_bad_suggestions
s_bal_best_avail_bad_suggestions
EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED
EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED
EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-17-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
While traversing the list, holding a spin_lock prevents load_buddy, making
direct use of ext4_try_lock_group impossible. This can lead to a bouncing
scenario where spin_is_locked(grp_A) succeeds, but ext4_try_lock_group()
fails, forcing the list traversal to repeatedly restart from grp_A.
In contrast, linear traversal directly uses ext4_try_lock_group(),
avoiding this bouncing. Therefore, we need a lockless, ordered traversal
to achieve linear-like efficiency.
Therefore, this commit converts both average fragment size lists and
largest free order lists into ordered xarrays.
In an xarray, the index represents the block group number and the value
holds the block group information; a non-empty value indicates the block
group's presence.
While insertion and deletion complexity remain O(1), lookup complexity
changes from O(1) to O(nlogn), which may slightly reduce single-threaded
performance.
Additionally, xarray insertions might fail, potentially due to memory
allocation issues. However, since we have linear traversal as a fallback,
this isn't a major problem. Therefore, we've only added a warning message
for insertion failures here.
A helper function ext4_mb_find_good_group_xarray() is added to find good
groups in the specified xarray starting at the specified position start,
and when it reaches ngroups-1, it wraps around to 0 and then to start-1.
This ensures an ordered traversal within the xarray.
Performance test results are as follows: Single-process operations
on an empty disk show negligible impact, while multi-process workloads
demonstrate a noticeable performance gain.
|CPU: Kunpeng 920 | P80 | P1 |
|Memory: 512GB |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 20097 | 19555 (-2.6%) | 316141 | 315636 (-0.2%) |
|mb_optimize_scan=1 | 13318 | 15496 (+16.3%) | 325273 | 323569 (-0.5%) |
|CPU: AMD 9654 * 2 | P96 | P1 |
|Memory: 1536GB |------------------------|-------------------------|
|960GB SSD (1GB/s) | base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 53603 | 53192 (-0.7%) | 214243 | 212678 (-0.7%) |
|mb_optimize_scan=1 | 20887 | 37636 (+80.1%) | 213632 | 214189 (+0.2%) |
[ Applied spelling fixes per discussion on the ext4-list see thread
referened in the Link tag. --tytso]
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-16-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Previously, s_md_lock was used to protect s_mb_free_pending during
modifications, while smp_mb() ensured fresh reads, so s_md_lock just
guarantees the atomicity of s_mb_free_pending. Thus we optimized it by
converting s_mb_free_pending into an atomic variable, thereby eliminating
s_md_lock and minimizing lock contention. This also prepares for future
lockless merging of free extents.
Following this modification, s_md_lock is exclusively responsible for
managing insertions and deletions within s_freed_data_list, along with
operations involving list_splice.
Performance test data follows:
Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.
|CPU: Kunpeng 920 | P80 | P1 |
|Memory: 512GB |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 19628 | 20043 (+2.1%) | 320885 | 314331 (-2.0%) |
|mb_optimize_scan=1 | 7129 | 7290 (+2.2%) | 321275 | 324226 (+0.9%) |
|CPU: AMD 9654 * 2 | P96 | P1 |
|Memory: 1536GB |------------------------|-------------------------|
|960GB SSD (1GB/s) | base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 53760 | 54999 (+2.3%) | 213145 | 214380 (+0.5%) |
|mb_optimize_scan=1 | 12716 | 13497 (+6.1%) | 215262 | 216276 (+0.4%) |
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-9-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Remove the superfluous "find_".
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-8-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Since nobody has used these EXT4_MB_HINT flags for ages,
let's remove them.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-7-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When allocating data blocks, if the first try (goal allocation) fails and
stream allocation is on, it tries a global goal starting from the last
group we used (s_mb_last_group). This helps cluster large files together
to reduce free space fragmentation, and the data block contiguity also
accelerates write-back to disk.
However, when multiple processes allocate blocks, having just one global
goal means they all fight over the same group. This drastically lowers
the chances of extents merging and leads to much worse file fragmentation.
To mitigate this multi-process contention, we now employ multiple global
goals, with the number of goals being the minimum between the number of
possible CPUs and one-quarter of the filesystem's total block group count.
To ensure a consistent goal for each inode, we select the corresponding
goal by taking the inode number modulo the total number of goals.
Performance test data follows:
Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.
|CPU: Kunpeng 920 | P80 | P1 |
|Memory: 512GB |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 9636 | 19628 (+103%) | 337597 | 320885 (-4.9%) |
|mb_optimize_scan=1 | 4834 | 7129 (+47.4%) | 341440 | 321275 (-5.9%) |
|CPU: AMD 9654 * 2 | P96 | P1 |
|Memory: 1536GB |------------------------|-------------------------|
|960GB SSD (1GB/s) | base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 22341 | 53760 (+140%) | 219707 | 213145 (-2.9%) |
|mb_optimize_scan=1 | 9177 | 12716 (+38.5%) | 215732 | 215262 (+0.2%) |
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-6-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
After we optimized the block group lock, we found another lock
contention issue when running will-it-scale/fallocate2 with multiple
processes. The fallocate's block allocation and the truncate's block
release were fighting over the s_md_lock. The problem is, this lock
protects totally different things in those two processes: the list of
freed data blocks (s_freed_data_list) when releasing, and where to start
looking for new blocks (mb_last_group) when allocating.
Now we only need to track s_mb_last_group and no longer need to track
s_mb_last_start, so we don't need the s_md_lock lock to ensure that the
two are consistent. Since s_mb_last_group is merely a hint and doesn't
require strong synchronization, READ_ONCE/WRITE_ONCE is sufficient.
Besides, the s_mb_last_group data type only requires ext4_group_t
(i.e., unsigned int), rendering unsigned long superfluous.
Performance test data follows:
Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.
|CPU: Kunpeng 920 | P80 | P1 |
|Memory: 512GB |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 4821 | 9636 (+99.8%) | 314065 | 337597 (+7.4%) |
|mb_optimize_scan=1 | 4784 | 4834 (+1.04%) | 316344 | 341440 (+7.9%) |
|CPU: AMD 9654 * 2 | P96 | P1 |
|Memory: 1536GB |------------------------|-------------------------|
|960GB SSD (1GB/s) | base | patched | base | patched |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 15371 | 22341 (+45.3%) | 205851 | 219707 (+6.7%) |
|mb_optimize_scan=1 | 6101 | 9177 (+50.4%) | 207373 | 215732 (+4.0%) |
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-5-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Since stream allocation does not use ac->ac_f_ex.fe_start, it is set to -1
by default, so the no longer needed sbi->s_mb_last_start is removed.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-4-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
In ext4_mb_regular_allocator(), after the call to ext4_mb_find_by_goal()
fails to achieve the inode goal, allocation continues with the stream
allocation global goal. Currently, hits for both are combined in
sbi->s_bal_goals, hindering accurate optimization.
This commit separates global goal hits into sbi->s_bal_stream_goals. Since
stream allocation doesn't use ac->ac_g_ex.fe_start, set fe_start to -1.
This prevents stream allocations from being counted in s_bal_goals. Also
clear EXT4_MB_HINT_TRY_GOAL to avoid calling ext4_mb_find_by_goal again.
After adding `stream_goal_hits`, `/proc/fs/ext4/sdx/mb_stats` will show:
mballoc:
reqs: 840347
success: 750992
groups_scanned: 1230506
cr_p2_aligned_stats:
hits: 21531
groups_considered: 411664
extents_scanned: 21531
useless_loops: 0
bad_suggestions: 6
cr_goal_fast_stats:
hits: 111222
groups_considered: 1806728
extents_scanned: 467908
useless_loops: 0
bad_suggestions: 13
cr_best_avail_stats:
hits: 36267
groups_considered: 1817631
extents_scanned: 156143
useless_loops: 0
bad_suggestions: 204
cr_goal_slow_stats:
hits: 106396
groups_considered: 5671710
extents_scanned: 22540056
useless_loops: 123747
cr_any_free_stats:
hits: 138071
groups_considered: 724692
extents_scanned: 23615593
useless_loops: 585
extents_scanned: 46804261
goal_hits: 1307
stream_goal_hits: 236317
len_goal_hits: 155549
2^n_hits: 21531
breaks: 225096
lost: 35062
buddies_generated: 40/40
buddies_time_used: 48004
preallocated: 5962467
discarded: 4847560
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-3-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When ext4 allocates blocks, we used to just go through the block groups
one by one to find a good one. But when there are tons of block groups
(like hundreds of thousands or even millions) and not many have free space
(meaning they're mostly full), it takes a really long time to check them
all, and performance gets bad. So, we added the "mb_optimize_scan" mount
option (which is on by default now). It keeps track of some group lists,
so when we need a free block, we can just grab a likely group from the
right list. This saves time and makes block allocation much faster.
But when multiple processes or containers are doing similar things, like
constantly allocating 8k blocks, they all try to use the same block group
in the same list. Even just two processes doing this can cut the IOPS in
half. For example, one container might do 300,000 IOPS, but if you run two
at the same time, the total is only 150,000.
Since we can already look at block groups in a non-linear way, the first
and last groups in the same list are basically the same for finding a block
right now. Therefore, add an ext4_try_lock_group() helper function to skip
the current group when it is locked by another process, thereby avoiding
contention with other processes. This helps ext4 make better use of having
multiple block groups.
Also, to make sure we don't skip all the groups that have free space
when allocating blocks, we won't try to skip busy groups anymore when
ac_criteria is CR_ANY_FREE.
Performance test data follows:
Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.
|CPU: Kunpeng 920 | P80 |
|Memory: 512GB |-------------------------|
|960GB SSD (0.5GB/s)| base | patched |
|-------------------|-------|-----------------|
|mb_optimize_scan=0 | 2667 | 4821 (+80.7%) |
|mb_optimize_scan=1 | 2643 | 4784 (+81.0%) |
|CPU: AMD 9654 * 2 | P96 |
|Memory: 1536GB |-------------------------|
|960GB SSD (1GB/s) | base | patched |
|-------------------|-------|-----------------|
|mb_optimize_scan=0 | 3450 | 15371 (+345%) |
|mb_optimize_scan=1 | 3209 | 6101 (+90.0%) |
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-2-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
There was a lot of common code in the codepaths used to convert an
inline directory and to creaet a new directory. To address this,
rename ext4_init_dot_dotdot() to ext4_init_dirblock() and then move
common code into that function.
This reduces the lines of code count in fs/ext4/inline.c and
fs/ext4/namei.c, as well as reducing the size of their object files.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://patch.msgid.link/20250712181249.434530-3-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
In environments with a page size of 64KB, the maximum size of a folio
can reach up to 128MB. Consequently, during the write-back of folios,
the 'rsv_blocks' will be overestimated to 1,577, which can make
pressure on the journal space where the journal is small. This can
easily exceed the limit of a single transaction. Besides, an excessively
large folio is meaningless and will instead increase the overhead of
traversing the bhs within the folio. Therefore, limit the maximum order
of a folio to 2048 filesystem blocks.
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reported-by: Joseph Qi <jiangqi903@gmail.com>
Closes: https://lore.kernel.org/linux-ext4/CA+G9fYsyYQ3ZL4xaSg1-Tt5Evto7Zd+hgNWZEa9cQLbahA1+xg@mail.gmail.com/
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250707140814.542883-12-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
After ext4 supports large folios, the semantics of reserving credits in
pages is no longer applicable. In most scenarios, reserving credits in
extents is sufficient. Therefore, introduce ext4_chunk_trans_extent()
to replace ext4_writepage_trans_blocks(). move_extent_per_page() is the
only remaining location where we are still processing extents in pages.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250707140814.542883-10-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Now that we expose struct file_attr as our uapi struct rename all the
internal struct to struct file_kattr to clearly communicate that it is a
kernel internal struct. This is similar to struct mount_{k}attr and
others.
Link: https://lore.kernel.org/20250703-restlaufzeit-baurecht-9ed44552b481@brauner
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|