| Age | Commit message (Collapse) | Author | Files | Lines |
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"Features:
- Reduce pipe->mutex contention by pre-allocating pages outside the
lock in anon_pipe_write().
anon_pipe_write() called alloc_page() once per page while holding
pipe->mutex. The allocation can sleep doing direct reclaim and runs
memcg charging, which extends the critical section and stalls any
concurrent reader on the same mutex. Now up to 8 pages are
pre-allocated before the mutex is taken, leftovers are recycled
into the per-pipe tmp_page[] cache before unlock, and any remainder
is released after unlock, keeping the allocator out of the critical
section on both sides. On a writers x readers sweep with 64KB
writes against a 1 MB pipe throughput improves 6-28% and average
write latency drops 5-22%; under memory pressure - when the cost of
holding the mutex across reclaim is highest - throughput improves
21-48% and latency drops 17-33%. The microbenchmark is added to
selftests.
- uaccess/sockptr: fix the ignored_trailing logic in
copy_struct_to_user() to behave as documented and the usize check
in copy_struct_from_sockptr() for user pointers, and add
copy_struct_{from,to}_bounce_buffer() and copy_struct_to_sockptr()
helpers for upcoming users (IPPROTO_SMBDIRECT, IPPROTO_QUIC).
- bpf: add a sleepable bpf_real_inode() kfunc that resolves the real
inode backing a dentry via d_real_inode(). On overlayfs the inode
attached to the dentry doesn't carry the underlying device
information; this is used by the filesystem restriction BPF program
that was merged into systemd.
- docs: add guidelines for submitting new filesystems, motivated by
the maintenance burden abandoned and untestable filesystems impose
on VFS developers, blocking infrastructure work like folio
conversions and iomap migration.
Fixes:
- libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo()
and drop the now-redundant assignments in callers. This began as a
one-line dma-buf fix for a path_noexec() warning; a pseudo
filesystem has no reason not to set SB_I_NOEXEC. All init_pseudo()
callers were audited: the only visible effect is on dma-buf where
SB_I_NOEXEC silences the warning.
- Handle set_blocksize() failures in legacy filesystems (bfs, hpfs,
qnx4, jfs, befs, affs, isofs, minix, ntfs3, omfs). Mounting a
device with a sector size > PAGE_SIZE crashed roughly half of them;
the rest had the same missing error handling pattern. Plus a
follow-up releasing the superblock buffer_head when setting the
minix v3 block size fails.
- mount: honour SB_NOUSER in the new mount API.
- fs/fcntl: fix a SOFTIRQ-unsafe lock order in fasync signaling by
switching the process-group paths of send_sigio() and send_sigurg()
from read_lock(&tasklist_lock) to RCU, matching the single-PID
path.
- vfs: add an FS_USERNS_DELEGATABLE flag and set it for NFS, fixing
delegated NFS mounts (fsopen() in a container with the mount
performed by a privileged daemon) that broke when non-init
s_user_ns was tied to FS_USERNS_MOUNT.
- selftests/namespaces: fix a hang in nsid_test where an unreaped
grandchild kept the TAP pipe write-end open, a waitpid(-1) race in
listns_efault_test, and a false FAIL on kernels without listns()
where the tests should SKIP.
- filelock: fix the break_lease() stub signature for
CONFIG_FILE_LOCKING=n.
- init/initramfs_test: wait for the async initramfs unpacking before
running; the test and do_populate_rootfs() share the parser state.
- fs/coredump: reduce redundant log noise in
validate_coredump_safety().
- iomap: pass the correct length to fserror_report_io() in
__iomap_write_begin().
- backing-file: fix the backing_file_open() kerneldoc.
Cleanups:
- initramfs: refactor the cpio hex header parsing to use hex2bin()
instead of the hand-rolled simple_strntoul() which is reverted, and
extend the initramfs KUnit tests to cover header fields with 0x
prefixes.
- Replace __get_free_pages() and friends with kmalloc()/kzalloc()
across quota, proc, ocfs2/dlm, nilfs2, nfs, nfsd, libfs, jfs, jbd2,
isofs, fuse, select, namespace, configfs, binfmt_misc, bfs, and the
do_mounts init code - part of the larger work of replacing page
allocator calls with kmalloc().
- Use clear_and_wake_up_bit() in unlock_buffer() and
journal_end_buffer_io_sync() instead of open-coding the sequence.
- Drop unused VFS exports: unexport drop_super_exclusive(), remove
start_removing_user_path_at(), and fold __start_removing_path()
into start_removing_path().
- fs/read_write: narrow the __kernel_write() export with
EXPORT_SYMBOL_FOR_MODULES().
- vfs: uapi: retire octal and hex constants in favor of (1 << n) for
the O_ flags. Finding a free bit for a new flag across the
architectures was needlessly hard with the mixed bases.
- dcache: add extra sanity checks of dead dentries in dentry_free()
via a new DENTRY_WARN_ONCE() that also prints d_flags.
- iov_iter: use kmemdup_array() in dup_iter() to harden the
allocation against multiplication overflow.
- fs/pipe: write to ->poll_usage only once.
- vfs: remove an always-taken if-branch in find_next_fd().
- dcache: use kmalloc_flex() for struct external_name in __d_alloc().
- namei: use QSTR() instead of QSTR_INIT() in path_pts().
- sync_file_range: delete dead S_ISLNK code.
- Comment fixes: retire a stale comment in fget_task_next() and fix
assorted spelling mistakes"
* tag 'vfs-7.2-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (73 commits)
backing-file: fix backing_file_open() kerneldoc parameter
iomap: pass the correct len to fserror_report_io in __iomap_write_begin
vfs: add FS_USERNS_DELEGATABLE flag and set it for NFS
filelock: fix break_lease() stub signature for CONFIG_FILE_LOCKING=n
vfs: uapi: retire octal and hex numbers in favor of (1 << n) for O_ flags
bpf: add bpf_real_inode() kfunc
fs/read_write: Do not export __kernel_write() to the entire world
libfs: drop redundant SB_I_NOEXEC/SB_I_NODEV in init_pseudo() callers
libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo()
mount: honour SB_NOUSER in the new mount API
fs/fcntl: fix SOFTIRQ-unsafe lock order in fasync signaling
selftests/pipe: add pipe_bench microbenchmark
fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
fs: retire stale comment in fget_task_next()
fs: fix spelling mistakes in comment
bfs: replace get_zeroed_page() with kzalloc()
binfmt_misc: replace __get_free_page() with kmalloc()
configfs: replace __get_free_pages() with kzalloc()
fs/namespace: use __getname() to allocate mntpath buffer
fs/select: replace __get_free_page() with kmalloc()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull iomap updates from Christian Brauner:
- Add the vfs infrastructure required to implement fs-verity support
for XFS with a post-EOF merkle tree: fsverity generates and stores a
zero-block hash, and iomap learns to verify data on buffered reads,
to handle fsverity during writeback via the new IOMAP_F_FSVERITY
flag, and to write fsverity metadata through iomap_fsverity_write().
- Skip the memset of the iomap in iomap_iter() once the iteration is
done. In high-IOPS scenarios (4k randread NVMe polling via io_uring)
the pointless memset wasted memory write bandwidth; this improves
IOPS by about 5% on ext4 and xfs.
- Add balance_dirty_pages_ratelimited() to iomap_zero_iter(), aligning
it with iomap_write_iter(). This prepares for the exFAT iomap
conversion where zeroing beyond valid_size can trigger large-scale
zeroing operations that caused memory pressure without throttling.
- Remove the over-strict inline data boundary check. If a filesystem
provides a valid inline_data pointer and length there is no reason to
require that inline data must not cross a page boundary.
- Don't make REQ_POLLED imply REQ_NOWAIT, matching the earlier
equivalent block layer fix: there are valid cases to poll for I/O
completion without REQ_NOWAIT, and REQ_NOWAIT for file system writes
is currently not supported as writes aren't idempotent.
- Introduce IOMAP_F_ZERO_TAIL for filesystems that maintain a separate
valid data length (exFAT, NTFS). For a write starting at or beyond
valid_size, __iomap_write_begin() now zeroes only the tail portion of
the block while preserving valid data before it, instead of leaving
stale data in the page cache. The flag is also added to the iomap
trace event strings.
* tag 'vfs-7.2-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
iomap: Add IOMAP_F_ZERO_TAIL flag to trace event strings
iomap: introduce iomap_fsverity_write() for writing fsverity metadata
iomap: teach iomap to read files with fsverity
iomap: introduce IOMAP_F_FSVERITY and teach writeback to handle fsverity
fsverity: generate and store zero-block hash
iomap: introduce IOMAP_F_ZERO_TAIL flag
iomap: don't make REQ_POLLED imply REQ_NOWAIT
iomap: remove over-strict inline data boundary check
iomap: add dirty page control to iomap_zero_iter
iomap: avoid memset iomap when iter is done
|
|
len is size of the (larger) write request, plen is the range for which
the read failed here.
Fixes: a9d573ee88af ("iomap: report file I/O errors to the VFS")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260610050642.1906695-1-hch@lst.de
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- Fix error handling in ovl_cache_get()
- Tighten access checks for exited tasks in pidfd_getfd()
- Fix selftests leak in __wait_for_test()
- Limit FUSE_NOTIFY_RETRIEVE to uptodate folios
- Reject fuse_notify() pagecache ops on directories
- Clear JOBCTL_PENDING_MASK for caller in zap_other_threads()
- Fix failure to unlock in nfsd4_create_file()
- Fix pointer arithmetic in qnx6 directory iteration
- Fix UAF due to unlocked ->mnt_ns read in may_decode_fh()
- Avoid potential null folio->mapping deref during iomap error
reporting
* tag 'vfs-7.1-rc7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
iomap: avoid potential null folio->mapping deref during error reporting
fhandle: fix UAF due to unlocked ->mnt_ns read in may_decode_fh()
fs/qnx6: fix pointer arithmetic in directory iteration
VFS: fix possible failure to unlock in nfsd4_create_file()
signal: clear JOBCTL_PENDING_MASK for caller in zap_other_threads()
fuse: reject fuse_notify() pagecache ops on directories
fuse: limit FUSE_NOTIFY_RETRIEVE to uptodate folios
selftests: harness: fix pidfd leak in __wait_for_test
pidfd: refuse access to tasks that have started exiting harder
ovl: keep err zero after successful ovl_cache_get()
|
|
Add IOMAP_F_ZERO_TAIL to the flag string mapping in iomap trace
events. This allows the new flag to be properly displayed in
ftrace output when iomap operations use it.
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Link: https://patch.msgid.link/20260603144031.7370-1-linkinjeon@kernel.org
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
|
|
This is just a wrapper around iomap_file_buffered_write() to create
necessary iterator over metadata.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Link: https://patch.msgid.link/20260520123722.405752-10-aalbersh@kernel.org
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
|
|
Obtain fsverity info for folios with file data and fsverity metadata.
Filesystem can pass vi down to ioend and then to fsverity for
verification. This is different from other filesystems ext4, f2fs, btrfs
supporting fsverity, these filesystems don't need fsverity_info for
reading fsverity metadata. While reading merkle tree iomap requires
fsverity info to synthesize hashes for zeroed data block.
fsverity metadata has two kinds of holes - ones in merkle tree and one
after fsverity descriptor.
Merkle tree holes are blocks full of hashes of zeroed data blocks. These
are not stored on the disk but synthesized on the fly. This saves a bit
of space for sparse files. Due to this iomap also need to lookup
fsverity_info for folios with fsverity metadata. ->vi has a hash of the
zeroed data block which will be used to fill the merkle tree block.
The hole past descriptor is interpreted as end of metadata region. As we
don't have EOF here we use this hole as an indication that rest of the
folio is empty. This patch marks rest of the folio beyond fsverity
descriptor as uptodate.
For file data, fsverity needs to verify consistency of the whole file
against the root hash, hashes of holes are included in the merkle tree.
Verify them too.
Issue reading of fsverity merkle tree on the fsverity inodes. This way
metadata will be available at I/O completion time.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Link: https://patch.msgid.link/20260520123722.405752-9-aalbersh@kernel.org
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
|
|
This flag indicates that I/O is for fsverity metadata.
In the write path skip i_size check and i_size updates as metadata is
past EOF. In writeback don't update i_size and continue writeback if
even folio is beyond EOF. In read path don't zero fsverity folios, again
they are past EOF.
The iomap_block_needs_zeroing() is also called from write path. For
folios of larger order we don't want to zero out pages in the folio as
these could contain other merkle tree blocks. For fsverity, filesystem
will request to read PAGE_SIZE memory regions. For data folios, iomap
will zero the rest of the folio for anything which is beyond EOF. We
don't want this for fsverity folios.
Christian Brauner <brauner@kernel.org> says:
Changed IOMAP_F_FSVERITY from (1U << 10) to (1U << 11) to avoid colliding
with IOMAP_F_ZERO_TAIL, which already uses (1U << 10).
Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
Link: https://patch.msgid.link/20260520123722.405752-8-aalbersh@kernel.org
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
|
|
In filesystems that maintain a separate Valid Data Length, such as exFAT
and NTFS, a partial write may start at or beyond the current valid_size and
extend it. In this case, the region after the previous valid_size but
within the same filesystem block is considered unwritten.
This patch introduces IOMAP_F_ZERO_TAIL. When this flag is set in iomap,
__iomap_write_begin() will zero only the tail portion while preserving any
valid data before it in the same block.
Without this tail zeroing, stale data in the unwritten portion of the block
can remain in the page cache. Subsequent reads can then return incorrect
contents from that region.
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Link: https://patch.msgid.link/20260518114705.9601-2-linkinjeon@kernel.org
Acked-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
|
|
When a buffered read fails, iomap_finish_folio_read() reports the error
with fserror_report_io(folio->mapping->host, ...). This is called after
ifs->read_bytes_pending has been decremented by the bytes attempted to
be read.
For a folio split across multiple read completions, the folio is only
guaranteed to stay locked while read_bytes_pending > 0. Once
iomap_finish_folio_read() decrements read_bytes_pending, another
in-flight read can complete and end the read on the folio, which unlocks
it. This allows truncate logic to run and detach the folio (set
folio->mapping to NULL). The error reporting path then can dereference a
NULL folio->mapping. As reported by Sam Sun, this is the race that can
occur:
CPU0: failed completion CPU1: final completion CPU2: truncate
----------------------- ---------------------- --------------
read_bytes_pending -= len
finished = false
/* preempted before
fserror_report_io() */
read_bytes_pending -= len
finished = true
folio_end_read()
truncate clears
folio->mapping
fserror_report_io(
folio->mapping->host, ...)
^ NULL deref
Fix this by reporting the error first before decrementing
ifs->read_bytes_pending.
Fixes: a9d573ee88af ("iomap: report file I/O errors to the VFS")
Cc: stable@vger.kernel.org
Reported-by: Sam Sun <samsun1006219@gmail.com>
Closes: https://lore.kernel.org/linux-fsdevel/CAEkJfYPhWdd59RKmuNLJg-bkypHz7xiOwaWyNVu3A8CUqQCnvg@mail.gmail.com/
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20260604011858.2297561-1-joannelkoong@gmail.com
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
|
|
As described in commit 2bc057692599 ("block: don't make REQ_POLLED imply
REQ_NOWAIT"), which fixed the same issue for the block device node, there
are valid cases to poll for I/O completion without REQ_NOWAIT.
Additionally, sing REQ_NOWAIT for file system writes is currently not
supported as file systems writes are not idempotent and would need a
retry of just the bio and not the entire operation to be fully supported.
Switch iomap to set REQ_POLLED and remove the now unused bio_set_polled
helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260518062917.506483-1-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
When bouncing for block size > PAGE_SIZE file systems that require
file system block size alignment (e.g. zoned XFS), the bio needs to
be big enough to fit an entire block.
Fixes: 8dd5e7c75d7b ("block: add helpers to bounce buffer an iov_iter into bios")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Link: https://patch.msgid.link/20260507050153.1298375-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The current iomap_inline_data_valid() check ensures that inline data
does not cross a PAGE_SIZE boundary. However, this is an unnecessarily
strict constraint. If a filesystem provides a valid iomap::inline_data
pointer and iomap::length, we should trust that the caller has mapped
sufficient memory for the range, even if it spans across page boundaries.
Removing this check allows filesystems to point directly to their
internal data structures without forced page-alignment or additional
redundant allocations. This remove iomap_inline_data_valid() and
its callers in buffered and direct I/O paths.
Suggested-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Link: https://patch.msgid.link/20260511141151.6021-1-linkinjeon@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
This patch prepares the iomap framework for exFAT's upcoming migration to
iomap. During testing of the exFAT iomap branch with xfstests generic/299 on
a VM with 8GB RAM and a 40GB disk, system unresponsiveness was observed.
iomap_zero_iter() lacked dirty page throttling, which could cause memory
pressure when exFAT's valid_size mechanism triggers large-scale zeroing
operations during writes beyond valid_size.
Align iomap_zero_iter() with iomap_write_iter() by adding
balance_dirty_pages_ratelimited() to throttle dirty page generation during
large zeroing operations
Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn>
Link: https://patch.msgid.link/20260511094007.728011-1-chizhiling@163.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
When iomap_iter() finishes its iteration (returns <= 0), it is no longer
necessary to memset the entire iomap and srcmap structures.
In high-IOPS scenarios (like 4k randread NVMe polling with io_uring),
where the majority of I/Os complete in a single extent map, this wasted
memory write bandwidth, as the caller will just discard the iterator.
Use this command to test:
taskset -c 30 ./t/io_uring -p1 -d512 -b4096 -s32 -c32 -F1 -B1 -R1 -X1
-n1 -P1 /mnt/testfile
IOPS improve about 5% on ext4 and XFS.
However, we MUST still call iomap_iter_reset_iomap() to release the
folio_batch if IOMAP_F_FOLIO_BATCH is set, otherwise we leak page
references. Therefore, split the cleanup logic: always release the
folio_batch, but skip the memset() when ret <= 0.
Signed-off-by: Fengnan Chang <changfengnan@bytedance.com>
Link: https://patch.msgid.link/20260420061630.62077-1-changfengnan@bytedance.com
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Pull fscrypt updates from Eric Biggers:
- Various cleanups for the interface between fs/crypto/ and
filesystems, from Christoph Hellwig
- Simplify and optimize the implementation of v1 key derivation by
using the AES library instead of the crypto_skcipher API
* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux:
fscrypt: use AES library for v1 key derivation
ext4: use a byte granularity cursor in ext4_mpage_readpages
fscrypt: pass a real sector_t to fscrypt_zeroout_range
fscrypt: pass a byte length to fscrypt_zeroout_range
fscrypt: pass a byte offset to fscrypt_zeroout_range
fscrypt: pass a byte length to fscrypt_zeroout_range_inline_crypt
fscrypt: pass a byte offset to fscrypt_zeroout_range_inline_crypt
fscrypt: pass a byte offset to fscrypt_set_bio_crypt_ctx
fscrypt: pass a byte offset to fscrypt_mergeable_bio
fscrypt: pass a byte offset to fscrypt_generate_dun
fscrypt: move fscrypt_set_bio_crypt_ctx_bh to buffer.c
ext4, fscrypt: merge fscrypt_mergeable_bio_bh into io_submit_need_new_bio
ext4: factor out a io_submit_need_new_bio helper
ext4: open code fscrypt_set_bio_crypt_ctx_bh
ext4: initialize the write hint in io_submit_init_bio
|
|
Pull xfs updates from Carlos Maiolino:
"There aren't any new features.
The whole series is just a collection of bug fixes and code
refactoring. There is some new information added a couple new
tracepoints, new data added to mountstats, but no big changes"
* tag 'xfs-merge-7.1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (41 commits)
xfs: fix number of GC bvecs
xfs: untangle the open zones reporting in mountinfo
xfs: expose the number of open zones in sysfs
xfs: reduce special casing for the open GC zone
xfs: streamline GC zone selection
xfs: refactor GC zone selection helpers
xfs: rename xfs_zone_gc_iter_next to xfs_zone_gc_iter_irec
xfs: put the open zone later xfs_open_zone_put
xfs: add a separate tracepoint for stealing an open zone for GC
xfs: delay initial open of the GC zone
xfs: fix a resource leak in xfs_alloc_buftarg()
xfs: handle too many open zones when mounting
xfs: refactor xfs_mount_zones
xfs: fix integer overflow in busy extent sort comparator
xfs: fix integer overflow in deferred intent sort comparators
xfs: fold xfs_setattr_size into xfs_vn_setattr_size
xfs: remove a duplicate assert in xfs_setattr_size
xfs: return default quota limits for IDs without a dquot
xfs: start gc on zonegc_low_space attribute updates
xfs: don't decrement the buffer LRU count for in-use buffers
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs i_ino updates from Christian Brauner:
"For historical reasons, the inode->i_ino field is an unsigned long,
which means that it's 32 bits on 32 bit architectures. This has caused
a number of filesystems to implement hacks to hash a 64-bit identifier
into a 32-bit field, and deprives us of a universal identifier field
for an inode.
This changes the inode->i_ino field from an unsigned long to a u64.
This shouldn't make any material difference on 64-bit hosts, but
32-bit hosts will see struct inode grow by at least 4 bytes. This
could have effects on slabcache sizes and field alignment.
The bulk of the changes are to format strings and tracepoints, since
the kernel itself doesn't care that much about the i_ino field. The
first patch changes some vfs function arguments, so check that one out
carefully.
With this change, we may be able to shrink some inode structures. For
instance, struct nfs_inode has a fileid field that holds the 64-bit
inode number. With this set of changes, that field could be
eliminated. I'd rather leave that sort of cleanups for later just to
keep this simple"
* tag 'vfs-7.1-rc1.kino' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
nilfs2: fix 64-bit division operations in nilfs_bmap_find_target_in_group()
EVM: add comment describing why ino field is still unsigned long
vfs: remove externs from fs.h on functions modified by i_ino widening
treewide: fix missed i_ino format specifier conversions
ext4: fix signed format specifier in ext4_load_inode trace event
treewide: change inode->i_ino from unsigned long to u64
nilfs2: widen trace event i_ino fields to u64
f2fs: widen trace event i_ino fields to u64
ext4: widen trace event i_ino fields to u64
zonefs: widen trace event i_ino fields to u64
hugetlbfs: widen trace event i_ino fields to u64
ext2: widen trace event i_ino fields to u64
cachefiles: widen trace event i_ino fields to u64
vfs: widen trace event i_ino fields to u64
net: change sock.sk_ino and sock_i_ino() to u64
audit: widen ino fields to u64
vfs: widen inode hash/lookup functions to u64
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs integrity updates from Christian Brauner:
"This adds support to generate and verify integrity information (aka
T10 PI) in the file system, instead of the automatic below the covers
support that is currently used.
The implementation is based on refactoring the existing block layer PI
code to be reusable for this use case, and then adding relatively
small wrappers for the file system use case. These are then used in
iomap to implement the semantics, and wired up in XFS with a small
amount of glue code.
Compared to the baseline this does not change performance for writes,
but increases read performance up to 15% for 4k I/O, with the benefit
decreasing with larger I/O sizes as even the baseline maxes out the
device quickly on my older enterprise SSD"
* tag 'vfs-7.1-rc1.integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
xfs: support T10 protection information
iomap: support T10 protection information
iomap: support ioends for buffered reads
iomap: add a bioset pointer to iomap_read_folio_ops
ntfs3: remove copy and pasted iomap code
iomap: allow file systems to hook into buffered read bio submission
iomap: only call into ->submit_read when there is a read_ctx
iomap: pass the iomap_iter to ->submit_read
iomap: refactor iomap_bio_read_folio_range
block: pass a maxlen argument to bio_iov_iter_bounce
block: add fs_bio_integrity helpers
block: make max_integrity_io_size public
block: prepare generation / verification helpers for fs usage
block: add a bdev_has_integrity_csum helper
block: factor out a bio_integrity_setup_default helper
block: factor out a bio_integrity_action helper
|
|
Zorro Lang reported the following lockdep splat:
"While running fstests xfs/556 on kernel 7.0.0-rc4+ (HEAD=04a9f1766954),
a lockdep warning was triggered indicating an inconsistent lock state
for sb->s_type->i_lock_key.
"The deadlock might occur because iomap_read_end_io (called from a
hardware interrupt completion path) invokes fserror_report, which then
calls igrab. igrab attempts to acquire the i_lock spinlock. However,
the i_lock is frequently acquired in process context with interrupts
enabled. If an interrupt occurs while a process holds the i_lock, and
that interrupt handler calls fserror_report, the system deadlocks.
"I hit this warning several times by running xfs/556 (mostly) or
generic/648 on xfs. More details refer to below console log."
along with this dmesg, for which I've cleaned up the stacktraces:
run fstests xfs/556 at 2026-03-18 20:05:30
XFS (sda3): Mounting V5 Filesystem 396e9164-c45a-4e05-be9d-b38c2c5c6477
XFS (sda3): Ending clean mount
XFS (sda3): Unmounting Filesystem 396e9164-c45a-4e05-be9d-b38c2c5c6477
XFS (sda3): Mounting V5 Filesystem bf3f89c3-3c45-4650-a9c7-744f39c0191e
XFS (sda3): Ending clean mount
XFS (sda3): Unmounting Filesystem bf3f89c3-3c45-4650-a9c7-744f39c0191e
XFS (dm-0): Mounting V5 Filesystem bf3f89c3-3c45-4650-a9c7-744f39c0191e
XFS (dm-0): Ending clean mount
device-mapper: table: 253:0: adding target device (start sect 209 len 1) caused an alignment inconsistency
device-mapper: table: 253:0: adding target device (start sect 210 len 62914350) caused an alignment inconsistency
buffer_io_error: 6 callbacks suppressed
Buffer I/O error on dev dm-0, logical block 209, async page read
Buffer I/O error on dev dm-0, logical block 209, async page read
XFS (dm-0): Unmounting Filesystem bf3f89c3-3c45-4650-a9c7-744f39c0191e
XFS (dm-0): Mounting V5 Filesystem bf3f89c3-3c45-4650-a9c7-744f39c0191e
XFS (dm-0): Ending clean mount
================================
WARNING: inconsistent lock state
7.0.0-rc4+ #1 Tainted: G S W
--------------------------------
inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
od/2368602 [HC1[1]:SC0[0]:HE0:SE1] takes:
ff1100069f2b4a98 (&sb->s_type->i_lock_key#31){?.+.}-{3:3}, at: igrab+0x28/0x1a0
{HARDIRQ-ON-W} state was registered at:
__lock_acquire+0x40d/0xbd0
lock_acquire.part.0+0xbd/0x260
_raw_spin_lock+0x37/0x80
unlock_new_inode+0x66/0x2a0
xfs_iget+0x67b/0x7b0 [xfs]
xfs_mountfs+0xde4/0x1c80 [xfs]
xfs_fs_fill_super+0xe86/0x17a0 [xfs]
get_tree_bdev_flags+0x312/0x590
vfs_get_tree+0x8d/0x2f0
vfs_cmd_create+0xb2/0x240
__do_sys_fsconfig+0x3d8/0x9a0
do_syscall_64+0x13a/0x1520
entry_SYSCALL_64_after_hwframe+0x76/0x7e
irq event stamp: 3118
hardirqs last enabled at (3117): [<ffffffffb54e4ad8>] _raw_spin_unlock_irq+0x28/0x50
hardirqs last disabled at (3118): [<ffffffffb54b84c9>] common_interrupt+0x19/0xe0
softirqs last enabled at (3040): [<ffffffffb290ca28>] handle_softirqs+0x6b8/0x950
softirqs last disabled at (3023): [<ffffffffb290ce4d>] __irq_exit_rcu+0xfd/0x250
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&sb->s_type->i_lock_key#31);
<Interrupt>
lock(&sb->s_type->i_lock_key#31);
*** DEADLOCK ***
1 lock held by od/2368602:
#0: ff1100069f2b4b58 (&sb->s_type->i_mutex_key#19){++++}-{4:4}, at: xfs_ilock+0x324/0x4b0 [xfs]
stack backtrace:
CPU: 15 UID: 0 PID: 2368602 Comm: od Kdump: loaded Tainted: G S W 7.0.0-rc4+ #1 PREEMPT(full)
Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN
Hardware name: Dell Inc. PowerEdge R660/0R5JJC, BIOS 2.1.5 03/14/2024
Call Trace:
<IRQ>
dump_stack_lvl+0x6f/0xb0
print_usage_bug.part.0+0x230/0x2c0
mark_lock_irq+0x3ce/0x5b0
mark_lock+0x1cb/0x3d0
mark_usage+0x109/0x120
__lock_acquire+0x40d/0xbd0
lock_acquire.part.0+0xbd/0x260
_raw_spin_lock+0x37/0x80
igrab+0x28/0x1a0
fserror_report+0x127/0x2d0
iomap_finish_folio_read+0x13c/0x280
iomap_read_end_io+0x10e/0x2c0
clone_endio+0x37e/0x780 [dm_mod]
blk_update_request+0x448/0xf00
scsi_end_request+0x74/0x750
scsi_io_completion+0xe9/0x7c0
_scsih_io_done+0x6ba/0x1ca0 [mpt3sas]
_base_process_reply_queue+0x249/0x15b0 [mpt3sas]
_base_interrupt+0x95/0xe0 [mpt3sas]
__handle_irq_event_percpu+0x1f0/0x780
handle_irq_event+0xa9/0x1c0
handle_edge_irq+0x2ef/0x8a0
__common_interrupt+0xa0/0x170
common_interrupt+0xb7/0xe0
</IRQ>
<TASK>
asm_common_interrupt+0x26/0x40
RIP: 0010:_raw_spin_unlock_irq+0x2e/0x50
Code: 0f 1f 44 00 00 53 48 8b 74 24 08 48 89 fb 48 83 c7 18 e8 b5 73 5e fd 48 89 df e8 ed e2 5e fd e8 08 78 8f fd fb bf 01 00 00 00 <e8> 8d 56 4d fd 65 8b 05 46 d5 1d 03 85 c0 74 06 5b c3 cc cc cc cc
RSP: 0018:ffa0000027d07538 EFLAGS: 00000206
RAX: 0000000000000c2d RBX: ffffffffb6614bc8 RCX: 0000000000000080
RDX: 0000000000000000 RSI: ffffffffb6306a01 RDI: 0000000000000001
RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
R10: ffffffffb75efc67 R11: 0000000000000001 R12: ff1100015ada0000
R13: 0000000000000083 R14: 0000000000000002 R15: ffffffffb6614c10
folio_wait_bit_common+0x407/0x780
filemap_update_page+0x8e7/0xbd0
filemap_get_pages+0x904/0xc50
filemap_read+0x320/0xc20
xfs_file_buffered_read+0x2aa/0x380 [xfs]
xfs_file_read_iter+0x263/0x4a0 [xfs]
vfs_read+0x6cb/0xb70
ksys_read+0xf9/0x1d0
do_syscall_64+0x13a/0x1520
Zorro's diagnosis makes sense, so the solution is to kick the failed
read handling to a workqueue much like we added for writeback ioends in
commit 294f54f849d846 ("fserror: fix lockdep complaint when igrabbing
inode").
Cc: Zorro Lang <zlang@redhat.com>
Link: https://lore.kernel.org/linux-xfs/20260319194303.efw4wcu7c4idhthz@doltdoltdolt/
Fixes: a9d573ee88af98 ("iomap: report file I/O errors to the VFS")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Link: https://patch.msgid.link/20260323210017.GL6223@frogsfrogsfrogs
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Now that the zero range pagecache flush is purely isolated to
providing zeroing correctness in this case, we can remove it and
replace it with the folio batch mechanism that is used for handling
unwritten extents.
This is still slightly odd in that XFS reports a hole vs. a mapping
that reflects the COW fork extents, but that has always been the
case in this situation and so a separate issue. We drop the iomap
warning that assumes the folio batch is always associated with
unwritten mappings, but this is mainly a development assertion as
otherwise the core iomap fbatch code doesn't care much about the
mapping type if it's handed the set of folios to process.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
iomap zero range has a wart in that it also flushes dirty pagecache
over hole mappings (rather than only unwritten mappings). This was
included to accommodate a quirk in XFS where COW fork preallocation
can exist over a hole in the data fork, and the associated range is
reported as a hole. This is because the range actually is a hole,
but XFS also has an optimization where if COW fork blocks exist for
a range being written to, those blocks are used regardless of
whether the data fork blocks are shared or not. For zeroing, COW
fork blocks over a data fork hole are only relevant if the range is
dirty in pagecache, otherwise the range is already considered
zeroed.
The easiest way to deal with this corner case is to flush the
pagecache to trigger COW remapping into the data fork, and then
operate on the updated on-disk state. The problem is that ext4
cannot accommodate a flush from this context due to being a
transaction deadlock vector.
Outside of the hole quirk, ext4 can avoid the flush for zero range
by using the recently introduced folio batch lookup mechanism for
unwritten mappings. Therefore, take the next logical step and lift
the hole handling logic into the XFS iomap_begin handler. iomap will
still flush on unwritten mappings without a folio batch, and XFS
will flush and retry mapping lookups in the case where it would
otherwise report a hole with dirty pagecache during a zero range.
Note that this is intended to be a fairly straightforward lift and
otherwise not change behavior. Now that the flush exists within XFS,
follow on patches can further optimize it.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Commit aa35dd5cbc06 ("iomap: fix invalid folio access after
folio_end_read()") partially addressed invalid folio access for folios
without an ifs attached, but it did not handle the case where
1 << inode->i_blkbits matches the folio size but is different from the
granularity used for the IO, which means IO can be submitted for less
than the full folio for the !ifs case.
In this case, the condition:
if (*bytes_submitted == folio_len)
ctx->cur_folio = NULL;
in iomap_read_folio_iter() will not invalidate ctx->cur_folio, and
iomap_read_end() will still be called on the folio even though the IO
helper owns it and will finish the read on it.
Fix this by unconditionally invalidating ctx->cur_folio for the !ifs
case.
Reported-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/linux-fsdevel/b3dfe271-4e3d-4922-b618-e73731242bca@wdc.com/
Fixes: b2f35ac4146d ("iomap: add caller-provided callbacks for read and readahead")
Cc: stable@vger.kernel.org
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://patch.msgid.link/20260317203935.830549-1-joannelkoong@gmail.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Add support for generating / verifying protection information in iomap.
This is done by hooking into the bio submission and then using the
generic PI helpers. Compared to just using the block layer auto PI
this extends the protection envelope and also prepares for eventually
passing through PI from userspace at least for direct I/O.
To generate or verify PI, the file system needs to set the
IOMAP_F_INTEGRITY flag on the iomap for the request, and ensure the
ioends are used for all integrity I/O. Additionally the file system
must defer read I/O completions to user context so that the guard
tag validation isn't run from interrupt context.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260223132021.292832-16-hch@lst.de
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Support using the ioend structure to defer I/O completion for
buffered reads by calling into the buffered read I/O completion
handler from iomap_finish_ioend.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260223132021.292832-15-hch@lst.de
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Optionally allocate the bio from the bioset provided in
iomap_read_folio_ops. If no bioset is provided, fs_bio_set is still
used, which is the standard bioset for file systems.
Based on a patch from Goldwyn Rodrigues <rgoldwyn@suse.com>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260223132021.292832-14-hch@lst.de
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
File systems such as btrfs have additional operations with bios such as
verifying data checksums. Allow file systems to hook into submission
of the bio to allow for this processing by replacing the direct
submit_bio call in iomap_read_alloc_bio with a call into ->submit_read
and exporting iomap_read_alloc_bio. Also add a new field to
struct iomap_read_folio_ctx to track the file logic offset of the current
read context.
Based on a patch from Goldwyn Rodrigues <rgoldwyn@suse.com>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260223132021.292832-12-hch@lst.de
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Move the NULL check into the callers to simplify the callees.
Fuse was missing this before, but has a constant read_ctx that is
never NULL or changed, so no change here either.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260223132021.292832-11-hch@lst.de
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
This provides additional context for file systems.
Rename the fuse instance to match the method name while we're at it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260223132021.292832-10-hch@lst.de
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Split out the logic to allocate a new bio and only keep the fast path
that adds more data to an existing bio in iomap_bio_read_folio_range.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260223132021.292832-9-hch@lst.de
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Bring in the shared branch with the block layer.
* 'for-7.1/block-integrity' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
block: pass a maxlen argument to bio_iov_iter_bounce
block: add fs_bio_integrity helpers
block: make max_integrity_io_size public
block: prepare generation / verification helpers for fs usage
block: add a bdev_has_integrity_csum helper
block: factor out a bio_integrity_setup_default helper
block: factor out a bio_integrity_action helper
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Logical offsets into an inode are usually expressed as bytes in the VFS.
Switch fscrypt_set_bio_crypt_ctx to that convention.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20260302141922.370070-9-hch@lst.de
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
|
|
Allow the file system to limit the size processed in a single
bounce operation. This is needed when generating integrity data
so that the size of a single integrity segment can't overflow.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
On 32-bit architectures, unsigned long is only 32 bits wide, which
causes 64-bit inode numbers to be silently truncated. Several
filesystems (NFS, XFS, BTRFS, etc.) can generate inode numbers that
exceed 32 bits, and this truncation can lead to inode number collisions
and other subtle bugs on 32-bit systems.
Change the type of inode->i_ino from unsigned long to u64 to ensure that
inode numbers are always represented as 64-bit values regardless of
architecture. Update all format specifiers treewide from %lu/%lx to
%llu/%llx to match the new type, along with corresponding local variable
types.
This is the bulk treewide conversion. Earlier patches in this series
handled trace events separately to allow trace field reordering for
better struct packing on 32-bit.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20260304-iino-u64-v3-12-2257ad83d372@kernel.org
Acked-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Update VFS-layer trace event definitions to use u64 instead of
ino_t/unsigned long for inode number fields. Update TP_printk format
strings to use %llu/%llx to match the widened field type. Remove
now-unnecessary (unsigned long) casts since __entry->ino is already
u64.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20260304-iino-u64-v3-4-2257ad83d37 |