path: root/fs
Age  Commit message  Author  Files  Lines
2026-01-28Merge tag 'scrub-syzbot-fixes-7.0_2026-01-25' of ↵Carlos Maiolino23-181/+115
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-7.0-merge xfs: syzbot fixes for online fsck [3/3] Fix various syzbot complaints about scrub that Jiaming Zhang found. With a bit of luck, this should all go splendidly. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-28Merge tag 'attr-pptr-speedup-7.0_2026-01-25' of ↵Carlos Maiolino6-17/+157
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-7.0-merge xfs: improve shortform attr performance [2/3] Improve performance of the xattr (and parent pointer) code when the attr structure is in short format and we can therefore perform all updates in a single transaction. Avoiding the attr intent code brings a very nice speedup in those operations. With a bit of luck, this should all go splendidly. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-28Merge tag 'attr-leaf-freemap-fixes-7.0_2026-01-25' of ↵Carlos Maiolino3-63/+155
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-7.0-merge xfs: fix problems in the attr leaf freemap code [1/3] Running generic/753 for hours revealed data corruption problems in the attr leaf block space management code. Under certain circumstances, freemap entries are left with zero size but a nonzero offset. If that offset happens to be the same offset as the end of the entries array during an attr set operation, the leaf entry table expansion will push the freemap record offset upwards without checking for overlap with any other freemap entries. If there happened to be a second freemap entry overlapping with the newly allocated leaf entry space, then the next attr set operation might find that space and overwrite the leaf entry, thereby corrupting the leaf block. Fix this by zeroing the freemap offset any time we set the size to zero. If a subsequent attr set operation finds no space in the freemap, it will compact the block and regenerate the freemaps. With a bit of luck, this should all go splendidly. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
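The freemap rule described in the merge message above can be illustrated with a small userspace sketch. The struct and function names here are hypothetical stand-ins, not the XFS on-disk definitions; the only behaviour taken from the message is that the offset is zeroed whenever the size drops to zero.

```c
#include <stdint.h>

/* Hypothetical stand-in for one attr leaf freemap record. */
struct freemap_entry {
	uint16_t base;	/* offset of the free region within the block */
	uint16_t size;	/* length of the free region in bytes */
};

/*
 * Take 'bytes' out of a freemap record. Returns 0 on success or -1 if the
 * record is too small. A record that becomes empty also has its offset
 * cleared, so no zero-size record is left with a stale nonzero offset.
 */
static int freemap_take(struct freemap_entry *fm, uint16_t bytes)
{
	if (bytes > fm->size)
		return -1;
	fm->size -= bytes;
	if (fm->size == 0)
		fm->base = 0;
	return 0;
}
```

With the offset cleared, a later table expansion has no stale offset to push upwards into another record's space.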
2026-01-28Merge tag 'health-monitoring-7.0_2026-01-20' of ↵Carlos Maiolino31-26/+3044
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-7.0-merge xfs: autonomous self healing of filesystems [v7] This patchset builds new functionality to deliver live information about filesystem health events to userspace. This is done by creating an anonymous file that can be read() for events by userspace programs. Events are captured by hooking various parts of XFS and iomap so that metadata health failures, file I/O errors, and major changes in filesystem state (unmounts, shutdowns, etc.) can be observed by programs. When an event occurs, the hook functions queue an event object to each event anonfd for later processing. Programs must have CAP_SYS_ADMIN to open the anonfd and there's a maximum event lag to prevent resource overconsumption. The events themselves can be read() from the anonfd as C structs for the xfs_healer daemon. In userspace, we create a new daemon program that will read the event objects and initiate repairs automatically. This daemon is managed entirely by systemd and will not block unmounting of the filesystem unless repairs are ongoing. They are auto-started by a starter service that uses fanotify. This patchset depends on the new fserror code that Christian Brauner has tentatively accepted for Linux 7.0: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs-7.0.fserror v7: more cleanups of the media verification ioctl, improve comments, and reuse the bio v6: fix pi-breaking bugs, make verify failures trigger health reports and filter bio status flags better v5: add verify-media ioctl, collapse small helper funcs with only one caller v4: drop multiple client support so we can make direct calls into healthmon instead of chasing pointers and doing indirect calls v3: drag out of rfc status With a bit of luck, this should all go splendidly. 
Conflicts: This merge required updates to the following files:
- fs/xfs/xfs_healthmon.c
- fs/xfs/xfs_verify_media.c
The change was required because a parallel development renamed the XFS header file xfs.h to xfs_platform.h, so the merge had to update those includes in both files. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-28isofs: support full length file names (255 instead of 253)Shawn Landden1-1/+1
Linux file names are in principle limited only to PATH_MAX (which is 4096) but the code in get_rock_ridge_filename() limits them to 253 characters. As mentioned by Jan Kara, the Rockridge standard to ECMA119/ISO9660 has no limit of file name length, but this limits file names to the traditional 255 NAME_MAX value. Signed-off-by: Shawn Landden <slandden@gmail.com> Link: https://patch.msgid.link/CA+49okq0ouJvAx0=txR_gyNKtZj55p3Zw4MB8jXZsGr4bEGjRA@mail.gmail.com Signed-off-by: Jan Kara <jack@suse.cz>
2026-01-28erofs: mark inodes without acls in erofs_read_inode()Gao Xiang3-1/+26
Similar to commit 91ef18b567da ("ext4: mark inodes without acls in __ext4_iget()"), the ACL state won't be read when the file owner performs a lookup, and the RCU fast path for lookups won't work because the ACL state remains unknown. If there are no extended attributes, or if the xattr filter indicates that no ACL xattr is present, call cache_no_acl() directly. Reviewed-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-01-27fs/ntfs3: Fix slab-out-of-bounds read in DeleteIndexEntryRootJiasheng Jiang1-0/+3
In the 'DeleteIndexEntryRoot' case of the 'do_action' function, the entry size ('esize') is retrieved from the log record without adequate bounds checking. Specifically, the code calculates the end of the entry ('e2') using: e2 = Add2Ptr(e1, esize); It then calculates the size for memmove using 'PtrOffset(e2, ...)', which subtracts the end pointer from the buffer limit. If 'esize' is maliciously large, 'e2' exceeds the used buffer size. This results in a negative offset which, when cast to size_t for memmove, interprets as a massive unsigned integer, leading to a heap buffer overflow. This commit adds a check to ensure that the entry size ('esize') strictly fits within the remaining used space of the index header before performing memory operations. Fixes: b46acd6a6a62 ("fs/ntfs3: Add NTFS journal") Signed-off-by: Jiasheng Jiang <jiashengjiangcool@gmail.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
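The kind of bounds check this commit describes can be sketched as follows. This is a minimal illustration with hypothetical names and plain size_t offsets instead of the ntfs3 pointer macros; it is not the code that was merged.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * 'used' is the number of bytes in use in the index buffer, 'entry_off' is
 * the offset of the entry to delete, and 'esize' is the entry size taken
 * from the untrusted log record. All three names are hypothetical.
 */
static bool entry_fits(size_t used, size_t entry_off, size_t esize)
{
	if (entry_off > used)
		return false;
	/*
	 * Compare against the remaining space rather than computing
	 * entry_off + esize, which could wrap around for a huge esize.
	 */
	return esize <= used - entry_off;
}
```

Only when the check passes is `used - entry_off - esize` a sensible, non-wrapping length to hand to memmove().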
2026-01-26fat: avoid parent link count underflow in rmdirZhiyu Zhang2-2/+12
Corrupted FAT images can leave a directory inode with an incorrect i_nlink (e.g. 2 even though subdirectories exist). rmdir then unconditionally calls drop_nlink(dir) and can drive i_nlink to 0, triggering the WARN_ON in drop_nlink(). Add a sanity check in vfat_rmdir() and msdos_rmdir(): only drop the parent link count when it is at least 3, otherwise report a filesystem error. Link: https://lkml.kernel.org/r/20260101111148.1437-1-zhiyuzhang999@gmail.com Fixes: 9a53c3a783c2 ("[PATCH] r/o bind mounts: unlink: monitor i_nlink") Signed-off-by: Zhiyu Zhang <zhiyuzhang999@gmail.com> Reported-by: Zhiyu Zhang <zhiyuzhang999@gmail.com> Closes: https://lore.kernel.org/linux-fsdevel/aVN06OKsKxZe6-Kv@casper.infradead.org/T/#t Tested-by: Zhiyu Zhang <zhiyuzhang999@gmail.com> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
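A userspace sketch of the sanity check described above, using a hypothetical struct in place of the real inode. The threshold of 3 comes from the commit message; the choice of -EIO as the error value is an assumption made for illustration.

```c
#include <errno.h>

/* Hypothetical stand-in for the parent directory inode. */
struct dir_inode {
	unsigned int nlink;
};

/*
 * A directory that still contains a subdirectory has at least three links:
 * its entry in its own parent, its "." entry, and the child's ".." entry.
 * Anything lower means the on-disk image is inconsistent.
 */
static int drop_parent_link(struct dir_inode *dir)
{
	if (dir->nlink < 3)
		return -EIO;	/* assumed error code; report instead of underflowing */
	dir->nlink--;
	return 0;
}
```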
2026-01-26ocfs2: add check for free bits before allocation in ocfs2_move_extent()Deepanshu Kartikey1-1/+6
Add a check to verify the group descriptor has enough free bits before attempting allocation in ocfs2_move_extent(). This prevents a kernel BUG_ON crash in ocfs2_block_group_set_bits() when the move_extents ioctl is called on a crafted or corrupted filesystem. The existing validation in ocfs2_validate_gd_self() only checks static metadata consistency (bg_free_bits_count <= bg_bits) when the descriptor is first read from disk. However, during move_extents operations, multiple allocations can exhaust the free bits count below the requested allocation size, triggering BUG_ON(le16_to_cpu(bg->bg_free_bits_count) < num_bits). The debug trace shows the issue clearly: - Block group 32 validated with bg_free_bits_count=427 - Repeated allocations decreased count: 427 -> 171 -> 43 -> ... -> 1 - Final request for 2 bits with only 1 available triggers BUG_ON By adding an early check in ocfs2_move_extent() right after ocfs2_find_victim_alloc_group(), we return -ENOSPC gracefully instead of crashing the kernel. This also avoids unnecessary work in ocfs2_probe_alloc_group() and __ocfs2_move_extent() when the allocation will fail. Link: https://lkml.kernel.org/r/20260104133504.14810-1-kartikey406@gmail.com Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com> Reported-by: syzbot+7960178e777909060224@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=7960178e777909060224 Link: https://lore.kernel.org/all/20251231115801.293726-1-kartikey406@gmail.com/T/ [v1] Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: Heming Zhao <heming.zhao@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
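The early check can be sketched in a few lines. The struct below is a simplified, host-endian stand-in for the group descriptor and the function name is hypothetical; only the "fail with -ENOSPC instead of hitting BUG_ON" behaviour is taken from the commit message.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical, host-endian stand-in for the relevant descriptor field. */
struct group_desc {
	uint16_t free_bits;
};

/* Claim num_bits from the group, or fail cleanly if not enough are free. */
static int claim_bits(struct group_desc *gd, uint16_t num_bits)
{
	if (gd->free_bits < num_bits)
		return -ENOSPC;
	gd->free_bits -= num_bits;
	return 0;
}
```

Replaying the trace from the message (427 free bits, repeated allocations down to 1, then a request for 2) ends in -ENOSPC with the counter left at 1.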
2026-01-26ocfs2: adjust function name referenceJulia Lawall1-1/+1
There is no function dlm_mast_regions(). However, dlm_match_regions() is passed the buffer "local", which it uses internally, so it seems like dlm_match_regions() was intended. Link: https://lkml.kernel.org/r/20251230142513.95467-1-Julia.Lawall@inria.fr Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: Heming Zhao <heming.zhao@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27f2fs: decrease maximum flush retry count in f2fs_enable_checkpoint()Chao Yu2-1/+3
It is rare for sync_inodes_sb() to keep skipping the flush of some dirty data, so three extra chances to flush data are enough. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-27f2fs: optimize NAT block loading during checkpoint writeYongpeng Yang1-1/+13
Under stress tests with frequent metadata operations, checkpoint write time can become excessively long. Analysis shows that the slowdown is caused by synchronous, one-by-one reads of NAT blocks during checkpoint processing.

The issue can be reproduced with the following workload:
1. seq 1 650000 | xargs -P 16 -n 1 touch
2. sync # avoid checkpoint write during deleting
3. delete 1 file every 455 files
4. echo 3 > /proc/sys/vm/drop_caches
5. sync # trigger checkpoint write

This patch submits read I/O for all NAT blocks required in the __flush_nat_entry_set() phase in advance, reducing the overhead of synchronous waiting for individual NAT block reads.

The NAT block flush latency before and after the change is as below:

|             |NAT blocks accessed|NAT blocks read|Flush time (ms)|
|-------------|-------------------|---------------|---------------|
|Before change|1205               |1191           |158            |
|After change |1264               |1242           |11             |

With a similar number of NAT blocks accessed and read from disk, adding NAT block readahead reduces the total NAT block flush time by more than 90%.

Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-27f2fs: change size parameter of __has_cursum_space() to unsigned intYongpeng Yang1-1/+1
All callers of __has_cursum_space() pass an unsigned int value as the size parameter. Change the parameter type to unsigned int accordingly. Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-27f2fs: add write latency stats for NAT and SIT blocks in f2fs_write_checkpointYongpeng Yang2-2/+6
This patch adds separate write latency accounting for NAT and SIT blocks in f2fs_write_checkpoint(). Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-27f2fs: pin files do not require sbi->writepages lock for orderingYongpeng Yang1-0/+2
For pinned files, the file mapping is already established before writing, and since the writes are in IPU, there is no need to acquire the sbi->writepages lock to guarantee write ordering. Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-27f2fs: fix to show simulate_lock_timeout correctlyChao Yu1-1/+2
Commit d36de29f4bb5 ("f2fs: sysfs: introduce inject_lock_timeout") introduces a bug as below, fix it. cat /sys/fs/f2fs/vdx/inject_lock_timeout s/fs/f2fs/vdx/inject_lock_timeout: Invalid argument Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-27f2fs: introduce FAULT_SKIP_WRITEChao Yu3-0/+6
In order to simulate skipped write during enable_checkpoint(). Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-27f2fs: check skipped write in f2fs_enable_checkpoint()Chao Yu4-4/+55
This patch introduces sbi->nr_pages[F2FS_SKIPPED_WRITE] to record any skipped write during data flush in f2fs_enable_checkpoint(). In the data flush loop, if the previous flush skipped any write, retry sync_inodes_sb(); otherwise all dirty data written before f2fs_enable_checkpoint() should have been persisted, so break out of the retry loop. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
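The retry logic described above can be sketched as a bounded loop. Everything here is hypothetical scaffolding (the callback type, the demo callback, the return values); it only illustrates the "retry while the previous pass skipped writes, stop as soon as a pass skips nothing" idea.

```c
/* Hypothetical flush callback: returns how many writes the pass skipped. */
typedef unsigned int (*flush_fn)(void *ctx);

/*
 * Run up to max_tries flush passes. Returns 0 as soon as a pass skips
 * nothing (all dirty data persisted), or -1 if every pass skipped writes.
 */
static int flush_until_clean(flush_fn flush, void *ctx, unsigned int max_tries)
{
	unsigned int i;

	for (i = 0; i < max_tries; i++) {
		if (flush(ctx) == 0)
			return 0;
	}
	return -1;
}

/* Demo callback: ctx points to the number of passes that still skip a write. */
static unsigned int demo_flush(void *ctx)
{
	unsigned int *skips_left = ctx;

	if (*skips_left == 0)
		return 0;
	(*skips_left)--;
	return 1;
}
```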
2026-01-26Merge tag 'vfs-6.19-rc8.fixes' of ↵Linus Torvalds13-39/+75
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:
- Fix the buggy conversion of fuse_reverse_inval_entry() introduced during the creation rework
- Disallow nfs delegation requests for directories by setting simple_nosetlease()
- Require an opt-in for getting readdir flag bits outside of S_DT_MASK set in d_type
- Fix scheduling delayed writeback work by only scheduling when the dirty time expiry interval is non-zero and cancel the delayed work if the interval is set to zero
- Use round_jiffies_relative() for dirty time work
- Check the return value of sb_set_blocksize() for romfs
- Wait for batched folios to be stable in __iomap_get_folio()
- Use private naming for fuse hash size
- Fix the stale dentry cleanup to prevent a race that causes a UAF

* tag 'vfs-6.19-rc8.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  vfs: document d_dispose_if_unused()
  fuse: shrink once after all buckets have been scanned
  fuse: clean up fuse_dentry_tree_work()
  fuse: add need_resched() before unlocking bucket
  fuse: make sure dentry is evicted if stale
  fuse: fix race when disposing stale dentries
  fuse: use private naming for fuse hash size
  writeback: use round_jiffies_relative for dirtytime_work
  iomap: wait for batched folios to be stable in __iomap_get_folio
  romfs: check sb_set_blocksize() return value
  docs: clarify that dirtytime_expire_seconds=0 disables writeback
  writeback: fix 100% CPU usage when dirtytime_expire_interval is 0
  readdir: require opt-in for d_type flags
  vboxsf: don't allow delegations to be set on directories
  ceph: don't allow delegations to be set on directories
  gfs2: don't allow delegations to be set on directories
  9p: don't allow delegations to be set on directories
  smb/client: properly disallow delegations on directories
  nfs: properly disallow delegation requests on directories
  fuse: fix conversion of fuse_reverse_inval_entry() to start_removing()
2026-01-26xdrgen: Add enum value validation to generated decodersChuck Lever2-20/+87
XDR enum decoders generated by xdrgen do not verify that incoming values are valid members of the enum. Incoming out-of-range values from malicious or buggy peers propagate through the system unchecked. Add validation logic to generated enum decoders using a switch statement that explicitly lists valid enumerator values. The compiler optimizes this to a simple range check when enum values are dense (contiguous), while correctly rejecting invalid values for sparse enums with gaps in their value ranges. The --no-enum-validation option on the source subcommand disables this validation when not needed. The minimum and maximum fields in _XdrEnum, which were previously unused placeholders for a range-based validation approach, have been removed since the switch-based validation handles both dense and sparse enums correctly. Because the new mechanism results in substantive changes to generated code, existing .x files are regenerated. Unrelated white space and semicolon changes in the generated code are due to recent commit 1c873a2fd110 ("xdrgen: Don't generate unnecessary semicolon") and commit 38c4df91242b ("xdrgen: Address some checkpatch whitespace complaints"). Reviewed-by: NeilBrown <neil@brown.name> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
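The switch-based validation described above might look roughly like this for a sparse enum. The enum and the decoder name are invented for illustration and are not taken from any real .x file or from xdrgen's actual output.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sparse enum with a gap between 2 and 8. */
enum color {
	COLOR_RED = 1,
	COLOR_GREEN = 2,
	COLOR_BLUE = 8,
};

/*
 * Accept a decoded wire value only if it names a real enumerator.
 * On rejection, *out is left untouched.
 */
static bool decode_color(uint32_t wire, enum color *out)
{
	switch (wire) {
	case COLOR_RED:
	case COLOR_GREEN:
	case COLOR_BLUE:
		break;
	default:
		return false;
	}
	*out = (enum color)wire;
	return true;
}
```

A pure min/max range check would accept 3 through 7 here, which is why listing the enumerators explicitly handles sparse enums correctly.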
2026-01-26nfsd: fix return error code for nfsd_map_name_to_[ug]idAnthony Iliopoulos1-0/+4
idmap lookups can time out while the cache is waiting for a userspace upcall reply. In that case cache_check() returns -ETIMEDOUT to callers. The nfsd_map_name_to_[ug]id functions currently proceed with attempting to map the id to a kuid despite a potentially temporary failure to perform the idmap lookup. This results in the code returning the error NFSERR_BADOWNER which can cause client operations to return to userspace with failure. Fix this by returning the failure status before attempting kuid mapping. This will return NFSERR_JUKEBOX on idmap lookup timeout so that clients can retry the operation instead of aborting it. Fixes: 65e10f6d0ab0 ("nfsd: Convert idmap to use kuids and kgids") Cc: stable@vger.kernel.org Signed-off-by: Anthony Iliopoulos <ailiop@suse.com> Reviewed-by: NeilBrown <neil@brown.name> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-01-26nfsd: never defer requests during idmap lookupAnthony Iliopoulos3-8/+58
During v4 request compound arg decoding, some ops (e.g. SETATTR) can trigger idmap lookup upcalls. When those upcall responses get delayed beyond the allowed time limit, cache_check() will mark the request for deferral and cause it to be dropped. This prevents nfs4svc_encode_compoundres from being executed, and thus the session slot flag NFSD4_SLOT_INUSE never gets cleared. Subsequent client requests will fail with NFSERR_JUKEBOX, given that the slot will be marked as in-use, making the SEQUENCE op fail. Fix this by making sure that the RQ_USEDEFERRAL flag is always clear during nfs4svc_decode_compoundargs(), since no v4 request should ever be deferred. Fixes: 2f425878b6a7 ("nfsd: don't use the deferral service, return NFS4ERR_DELAY") Signed-off-by: Anthony Iliopoulos <ailiop@suse.com> Reviewed-by: NeilBrown <neil@brown.name> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-01-26NFSD: fix setting FMODE_NOCMTIME in nfs4_open_delegationOlga Kornievskaia1-1/+2
fstests generic/215 and generic/407 were failing because the server wasn't updating mtime properly. When deleg attribute support is not compiled in and thus no attribute delegation was given, the server was skipping updating mtime and ctime because FMODE_NOCMTIME was unconditionally set for the write delegation. Fixes: e5e9b24ab8fa ("nfsd: freeze c/mtime updates with outstanding WRITE_ATTRS delegation") Cc: stable@vger.kernel.org Signed-off-by: Olga Kornievskaia <okorniev@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-01-26nfsd: fix nfs4_file refcount leak in nfsd_get_dir_deleg()Jeff Layton1-1/+4
Claude pointed out that there is a nfs4_file refcount leak in nfsd_get_dir_deleg(). Ensure that the reference to "fp" is released before returning. Fixes: 8b99f6a8c116 ("nfsd: wire up GET_DIR_DELEGATION handling") Cc: stable@vger.kernel.org Cc: Chris Mason <clm@meta.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-01-26nfsd: use workqueue enable/disable APIs for v4_end_grace syncNeilBrown2-14/+9
"nfsd: provide locking for v4_end_grace" introduced a client_tracking_active flag protected by nn->client_lock to prevent the laundromat from being scheduled before client tracking initialization or after shutdown begins. That commit is suitable for backporting to LTS kernels that predate commit 86898fa6b8cd ("workqueue: Implement disable/enable for (delayed) work items"). However, the workqueue subsystem in recent kernels provides enable_delayed_work() and disable_delayed_work_sync() for this purpose. Using this mechanism enables us to remove the client_tracking_active flag and associated spinlock operations while preserving the same synchronization guarantees, which is a cleaner long-term approach. Signed-off-by: NeilBrown <neil@brown.name> Tested-by: Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-01-26NFS: NFSERR_INVAL is not defined by NFSv2Chuck Lever2-2/+2
A documenting comment in include/uapi/linux/nfs.h claims incorrectly that NFSv2 defines NFSERR_INVAL. There is no such definition in either RFC 1094 or https://pubs.opengroup.org/onlinepubs/9629799/chap7.htm NFS3ERR_INVAL is introduced in RFC 1813. NFSD returns NFSERR_INVAL for PROC_GETACL, which has no specification (yet). However, nfsd_map_status() maps nfserr_symlink and nfserr_wrong_type to nfserr_inval, which does not align with RFC 1094. This logic was introduced only recently by commit 438f81e0e92a ("nfsd: move error choice for incorrect object types to version-specific code."). Given that we have no INVAL or SERVERFAULT status in NFSv2, probably the only choice is NFSERR_IO. Fixes: 438f81e0e92a ("nfsd: move error choice for incorrect object types to version-specific code.") Reviewed-by: NeilBrown <neil@brown.name> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-01-26nfsd: prefix notification in nfsd4_finalize_deleg_timestamps() with "nfsd: "Jeff Layton1-1/+1
Make it distinct that this message comes from nfsd. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-01-26locks: ensure vfs_test_lock() never returns FILE_LOCK_DEFERREDNeilBrown2-7/+14
FILE_LOCK_DEFERRED can be returned when creating or removing a lock, but not when testing for a lock. This support was explicitly removed in Commit 09802fd2a8ca ("lockd: rip out deferred lock handling from testlock codepath") However the test in nlmsvc_testlock() suggests that it *can* be returned, only nlm cannot handle it. To aid clarity, remove the test and instead put a similar test and warning in vfs_test_lock(). If the impossible happens, convert FILE_LOCK_DEFERRED to -EIO. Signed-off-by: NeilBrown <neil@brown.name> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
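The conversion described above amounts to filtering one return value. The sketch below uses a locally defined stand-in constant and a hypothetical function name; the kernel change also emits a warning, which is omitted here.

```c
#include <errno.h>

/* Local stand-in for FILE_LOCK_DEFERRED; the value is an assumption. */
#define DEMO_LOCK_DEFERRED 1

/*
 * A lock test should never be deferred. If the "impossible" value shows up,
 * turn it into -EIO; every other result is passed through unchanged.
 */
static int sanitize_test_lock_result(int ret)
{
	if (ret == DEMO_LOCK_DEFERRED)
		return -EIO;
	return ret;
}
```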
2026-01-26NFSD: Add instructions on how to deal with xdrgen filesChuck Lever1-1/+9
xdrgen requires a number of Python packages on the build system. We don't want to add these to the kernel build dependency list, which is long enough already. The generated files are generated manually using $ cd fs/nfsd && make xdrgen whenever the .x files are modified, then they are checked into the kernel repo so others do not need to rebuild them. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-01-26NFSD: Clean up nfsd4_check_open_attributes()Chuck Lever1-19/+21
The @rqstp parameter was introduced in commit 3c8e03166ae2 ("NFSv4: do exact check about attribute specified") but has never been used. Reduce indentation in callers to improve readability. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-01-26gfs2: Fix slab-use-after-free in qd_putAndreas Gruenbacher1-0/+1
Commit a475c5dd16e5 ("gfs2: Free quota data objects synchronously") started freeing quota data objects during filesystem shutdown instead of putting them back onto the LRU list, but it failed to remove these objects from the LRU list, causing LRU list corruption. This caused use-after-free when the shrinker (gfs2_qd_shrink_scan) tried to access already-freed objects on the LRU list. Fix this by removing qd objects from the LRU list before freeing them in qd_put(). Initial fix from Deepanshu Kartikey <kartikey406@gmail.com>. Fixes: a475c5dd16e5 ("gfs2: Free quota data objects synchronously") Reported-by: syzbot+046b605f01802054bff0@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=046b605f01802054bff0 Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2026-01-26gfs2: Introduce glock_{type,number,sbd} helpersAndreas Gruenbacher12-112/+124
Introduce glock_type(), glock_number(), and glock_sbd() helpers for accessing a glock's type, number, and super block pointer more easily.

Created with Coccinelle using the following semantic patch:

  @@
  struct gfs2_glock *gl;
  @@
  - gl->gl_name.ln_type
  + glock_type(gl)

  @@
  struct gfs2_glock *gl;
  @@
  - gl->gl_name.ln_number
  + glock_number(gl)

  @@
  struct gfs2_glock *gl;
  @@
  - gl->gl_name.ln_sbd
  + glock_sbd(gl)

glock_sbd() is a macro because it is used with const as well as non-const struct gfs2_glock * arguments. Instances in macro definitions, particularly in tracepoint definitions, replaced by hand.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2026-01-26gfs2: gfs2_glock_hold cleanupAndreas Gruenbacher1-2/+2
Use lockref_get_not_dead() instead of an unguarded __lockref_is_dead() check. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2026-01-26gfs: Use fixed GL_GLOCK_MIN_HOLD timeAndreas Gruenbacher1-1/+1
GL_GLOCK_MIN_HOLD represents the minimum time (in jiffies) that a glock should be held before being eligible for release. It is currently defined as 10, meaning that the duration depends on the timer interrupt frequency (CONFIG_HZ). Change that time to a constant 10ms independent of CONFIG_HZ. On CONFIG_HZ=1000 systems, the value remains the same. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
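The dependence on CONFIG_HZ is simple arithmetic, shown here with two standalone helpers. These are illustrative userspace functions, not the kernel's own conversion routines, and the HZ values used in the note after the block are common configurations chosen as examples.

```c
/* Milliseconds covered by 'ticks' timer ticks at 'hz' ticks per second. */
static unsigned int ticks_to_ms(unsigned int ticks, unsigned int hz)
{
	return ticks * 1000u / hz;
}

/* Ticks needed to cover at least 'ms' milliseconds, rounded up. */
static unsigned int ms_to_ticks(unsigned int ms, unsigned int hz)
{
	return (ms * hz + 999u) / 1000u;
}
```

A fixed count of 10 ticks lasts 10 ms at HZ=1000 but 40 ms at HZ=250 and 100 ms at HZ=100. Going the other way, a fixed 10 ms needs 10 ticks at HZ=1000, 3 at HZ=250 and 1 at HZ=100.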
2026-01-26gfs2: Fix gfs2_log_get_bio argument typeAndreas Gruenbacher1-3/+3
Fix the type of gfs2_log_get_bio()'s op argument: callers pass in a blk_opf_t value and the function passes that value on as a blk_opf_t value, so the current argument type makes no sense. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2026-01-26gfs2: gfs2_chain_bio start sector fixAndreas Gruenbacher1-3/+3
Pass the start sector into gfs2_chain_bio(): the new bio isn't necessarily contiguous with the previous one. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2026-01-26gfs2: Initialize bio->bi_opf earlyAndreas Gruenbacher3-22/+26
Pass the right blk_opf_t value to bio_alloc() so that ->bi_opf is initialized correctly and doesn't have to be changed later. Adjust the call chain to pass that value through to where it is needed (and only there). Add a separate blk_opf_t argument to gfs2_chain_bio() instead of copying the value from the previous bio. Fixes: 8a157e0a0aa5 ("gfs2: Fix use of bio_chain") Reported-by: syzbot+f6539d4ce3f775aee0cc@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=f6539d4ce3f775aee0cc Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2026-01-26gfs2: Rename gfs2_log_submit_{bio -> write}Andreas Gruenbacher3-6/+6
Rename gfs2_log_submit_bio() to gfs2_log_submit_write(): this function isn't used for submitting log reads. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2026-01-26gfs2: Do not cancel internal demote requestsAndreas Gruenbacher3-15/+35
Trying to change the state of a glock may result in a "conversion deadlock" error, indicating that the requested state transition would cause a deadlock. In this case, we unlock and retry the state change. It makes no sense to try canceling those unlock requests, but the current code is not aware of that. In addition, if a locking request is canceled shortly after it is made, the cancelation request can currently overtake the locking request. This may result in deadlocks. Fix both of these bugs by repurposing the GLF_PENDING_REPLY flag into a GLF_MAY_CANCEL flag which is set only when the current locking request can be canceled. When trying to cancel a locking request in gfs2_glock_dq(), wait for this flag to be set. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2026-01-26gfs2: run_queue cleanupAndreas Gruenbacher1-13/+7
In run_queue(), instead of always setting the GLF_LOCK flag, only set it when the flag is actually needed. This avoids having to undo the flag setting later. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2026-01-26gfs2: Retries missing in gfs2_{rename,exchange}Andreas Gruenbacher3-14/+43
Fix a bug in gfs2's asynchronous glock handling for rename and exchange operations. The original async implementation from commit ad26967b9afa ("gfs2: Use async glocks for rename") mentioned that retries were needed but never implemented them, causing operations to fail with -ESTALE instead of retrying on timeout. Also make the waiting interruptible. In addition, the timeouts used were too high for situations in which timing out is a rare but expected scenario. Switch to shorter timeouts with randomization and exponential backoff. Fixes: ad26967b9afa ("gfs2: Use async glocks for rename") Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
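Randomized exponential backoff, as mentioned above, can be sketched generically. The base, cap, and jitter scheme below are assumptions chosen for illustration and are not the values or the formula used by gfs2; the sketch also assumes max_ms is small enough that doubling does not overflow.

```c
#include <stdlib.h>

/*
 * Timeout for a given retry attempt: base_ms doubled once per attempt,
 * capped at max_ms, then randomized to a value between half of that delay
 * and the full delay so that competing nodes do not retry in lockstep.
 */
static unsigned int backoff_ms(unsigned int attempt, unsigned int base_ms,
			       unsigned int max_ms)
{
	unsigned int delay = base_ms;
	unsigned int i;

	for (i = 0; i < attempt && delay < max_ms; i++)
		delay *= 2;
	if (delay > max_ms)
		delay = max_ms;
	return delay / 2 + (unsigned int)rand() % (delay / 2 + 1);
}
```

A caller would wait for backoff_ms(attempt, ...) milliseconds, retry the lock request on timeout, and increment attempt each time.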
2026-01-26gfs2: glock cancelation flag fixAndreas Gruenbacher1-0/+2
When an asynchronous glock holder is dequeued that hasn't been granted yet (HIF_HOLDER not set) and no dlm operation is in progress on behalf of that holder (GLF_LOCK not set), the dequeuing takes place in __gfs2_glock_dq(). There, we are not clearing the HIF_WAIT flag and waking up waiters. Fix that. This bug prevents the same holder from being enqueued later (gfs2_glock_nq()) without first reinitializing it (gfs2_holder_reinit()). The code doesn't currently use this pattern, but this will change in the next commit. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2026-01-24xfs: check for deleted cursors when revalidating two btreesDarrick J. Wong2-3/+32
The free space and inode btree repair functions will rebuild both btrees at the same time, after which they need to evaluate both btrees to confirm that the corruptions are gone. However, Jiaming Zhang ran syzbot and produced a crash in the second xchk_allocbt call. His root-cause analysis is as follows (with minor corrections):

In xrep_revalidate_allocbt(), xchk_allocbt() is called twice (first for BNOBT, second for CNTBT). The cause of this issue is that the first call nullified the cursor required by the second call.

Let's first enter xrep_revalidate_allocbt() via the following call chain:

xfs_file_ioctl() -> xfs_ioc_scrubv_metadata() -> xfs_scrub_metadata() -> `sc->ops->repair_eval(sc)` -> xrep_revalidate_allocbt()

xchk_allocbt() is called twice in this function. In the first call:

/* Note that sc->sm->sm_type is XFS_SCRUB_TYPE_BNOBT now */
xchk_allocbt() -> xchk_btree() -> `bs->scrub_rec(bs, recp)` -> xchk_allocbt_rec() -> xchk_allocbt_xref() -> xchk_allocbt_xref_other()

since sm_type is XFS_SCRUB_TYPE_BNOBT, pcur is set to &sc->sa.cnt_cur. Kernel called xfs_alloc_get_rec() and returned -EFSCORRUPTED. Call chain:

xfs_alloc_get_rec() -> xfs_btree_get_rec() -> xfs_btree_check_block() -> (XFS_IS_CORRUPT || XFS_TEST_ERROR), the former is false and the latter is true, return -EFSCORRUPTED. This should be caused by ioctl$XFS_IOC_ERROR_INJECTION I guess.

Back to xchk_allocbt_xref_other(), after receiving -EFSCORRUPTED from xfs_alloc_get_rec(), kernel called xchk_should_check_xref(). In this function, *curpp (points to sc->sa.cnt_cur) is nullified.

Back to xrep_revalidate_allocbt(), since sc->sa.cnt_cur has been nullified, it then triggered null-ptr-deref via xchk_allocbt() (second call) -> xchk_btree().

So. The bnobt revalidation failed on a cross-reference attempt, so we deleted the cntbt cursor, and then crashed when we tried to revalidate the cntbt. Therefore, check for a null cntbt cursor before that revalidation, and mark the repair incomplete.
Also we can ignore the second tree entirely if the first tree was rebuilt but is already corrupt. Apply the same fix to xrep_revalidate_iallocbt because it has the same problem. Cc: r772577952@gmail.com Link: https://lore.kernel.org/linux-xfs/CANypQFYU5rRPkTy=iG5m1Lp4RWasSgrHXAh3p8YJojxV0X15dQ@mail.gmail.com/T/#m520c7835fad637eccf843c7936c200589427cc7e Cc: <stable@vger.kernel.org> # v6.8 Fixes: dbfbf3bdf639a2 ("xfs: repair inode btrees") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Jiaming Zhang <r772577952@gmail.com>
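The guard described in this commit can be sketched in userspace C. This is only an illustration under stated assumptions: `struct cursor`, `struct scrub_ctx`, `check_first_tree()`, `check_second_tree()`, `revalidate_both()` and `demo_revalidate()` are hypothetical stand-ins, not the real XFS scrub types or functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; not the real XFS scrub types. */
struct cursor {
	int records_checked;
};

struct scrub_ctx {
	struct cursor *bno_cur;	/* cursor for the first tree */
	struct cursor *cnt_cur;	/* cursor for the second tree */
	bool corrupt;		/* first tree still looks corrupt */
	bool incomplete;	/* revalidation could not finish */
};

/* Check the first tree; a failed cross-reference deletes the other cursor. */
static void check_first_tree(struct scrub_ctx *sc, bool xref_fails)
{
	sc->bno_cur->records_checked++;
	if (xref_fails) {
		free(sc->cnt_cur);
		sc->cnt_cur = NULL;
	}
}

/* Check the second tree; this is where a null cursor would crash. */
static void check_second_tree(struct scrub_ctx *sc)
{
	assert(sc->cnt_cur != NULL);
	sc->cnt_cur->records_checked++;
}

static void revalidate_both(struct scrub_ctx *sc, bool xref_fails)
{
	check_first_tree(sc, xref_fails);

	/* The first tree is already bad again: skip the second tree. */
	if (sc->corrupt)
		return;

	/* The first check deleted the second cursor: give up politely. */
	if (!sc->cnt_cur) {
		sc->incomplete = true;
		return;
	}

	check_second_tree(sc);
}

/* Returns 1 if marked incomplete, 0 otherwise, -1 on allocation failure. */
static int demo_revalidate(bool xref_fails)
{
	struct scrub_ctx sc = { 0 };
	int ret;

	sc.bno_cur = calloc(1, sizeof(*sc.bno_cur));
	sc.cnt_cur = calloc(1, sizeof(*sc.cnt_cur));
	if (!sc.bno_cur || !sc.cnt_cur) {
		free(sc.bno_cur);
		free(sc.cnt_cur);
		return -1;
	}

	revalidate_both(&sc, xref_fails);
	ret = sc.incomplete ? 1 : 0;

	free(sc.bno_cur);
	free(sc.cnt_cur);	/* free(NULL) is a no-op */
	return ret;
}
```

Without the null check in `revalidate_both()`, the call with `xref_fails` set would reach `check_second_tree()` with a null cursor, which is the crash pattern the commit message describes.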
2026-01-24xfs: fix UAF in xchk_btree_check_block_ownerDarrick J. Wong1-2/+5
We cannot dereference bs->cur when trying to determine if bs->cur aliases bs->sc->sa.{bno,rmap}_cur after the latter has been freed. Fix this by sampling the btree type before any freeing can happen. The correct temporal ordering was broken when we removed xfs_btnum_t. Cc: r772577952@gmail.com Cc: <stable@vger.kernel.org> # v6.9 Fixes: ec793e690f801d ("xfs: remove xfs_btnum_t") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Jiaming Zhang <r772577952@gmail.com>
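The ordering issue can be shown with a small userspace C sketch. All names here (`struct cursor`, `struct btree_scrub`, `xref_and_maybe_free()`, `check_block_owner()`, `demo_owner_check()`) are hypothetical and only model the pattern of reading a value from an aliased pointer before the call that may free it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins; not the real XFS btree cursor types. */
enum tree_type { TREE_BNO, TREE_RMAP, TREE_OTHER };

struct cursor {
	enum tree_type type;
};

struct scrub_ctx {
	struct cursor *bno_cur;
};

struct btree_scrub {
	struct scrub_ctx *sc;
	struct cursor *cur;	/* may alias sc->bno_cur */
};

/* Models a cross-reference that fails and frees the shared cursor. */
static void xref_and_maybe_free(struct scrub_ctx *sc, bool fail)
{
	if (fail) {
		free(sc->bno_cur);
		sc->bno_cur = NULL;
	}
}

static void check_block_owner(struct btree_scrub *bs, bool xref_fails)
{
	bool is_bno;

	assert(bs->cur != NULL);

	/* Sample the type first: bs->cur may be freed by the call below. */
	is_bno = (bs->cur->type == TREE_BNO);

	xref_and_maybe_free(bs->sc, xref_fails);

	/* Use only the sampled value; bs->cur may now be dangling. */
	if (!bs->sc->bno_cur && is_bno)
		bs->cur = NULL;
}

/* Returns 1 if the alias was cleared, 0 if it survived, -1 on failure. */
static int demo_owner_check(bool xref_fails)
{
	struct scrub_ctx sc = { 0 };
	struct btree_scrub bs = { &sc, NULL };
	int ret;

	sc.bno_cur = calloc(1, sizeof(*sc.bno_cur));
	if (!sc.bno_cur)
		return -1;
	sc.bno_cur->type = TREE_BNO;
	bs.cur = sc.bno_cur;	/* the scrubber's cursor aliases bno_cur */

	check_block_owner(&bs, xref_fails);
	ret = (bs.cur == NULL) ? 1 : 0;

	free(sc.bno_cur);	/* no-op if the xref already freed it */
	return ret;
}
```

Reading `bs->cur->type` after `xref_and_maybe_free()` instead of before it would be a use-after-free whenever the two pointers alias.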
2026-01-24xfs: check return value of xchk_scrub_create_subordDarrick J. Wong3-1/+7
Fix this function to return NULL instead of a mangled ENOMEM, then fix the callers to actually check for a null pointer and return ENOMEM. Most of the corrections here are for code merged between 6.2 and 6.10. Cc: r772577952@gmail.com Cc: <stable@vger.kernel.org> # v6.12 Fixes: 1a5f6e08d4e379 ("xfs: create subordinate scrub contexts for xchk_metadata_inode_subtype") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Jiaming Zhang <r772577952@gmail.com>
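A minimal userspace sketch of the convention described here: the constructor reports failure with a plain null pointer, and each caller translates that into -ENOMEM. `struct sub_ctx`, `create_subord()` and `scrub_subtype()` are hypothetical names, and the `simulate_oom` parameter exists only so the failure path can be exercised.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical subordinate context; not the real XFS structure. */
struct sub_ctx {
	int type;
};

/* Returns a new context, or NULL if memory could not be allocated. */
static struct sub_ctx *create_subord(int type, bool simulate_oom)
{
	struct sub_ctx *sub;

	if (simulate_oom)
		return NULL;

	sub = calloc(1, sizeof(*sub));
	if (!sub)
		return NULL;

	sub->type = type;
	return sub;
}

/* Caller checks for NULL and converts it to a negative errno. */
static int scrub_subtype(int type, bool simulate_oom)
{
	struct sub_ctx *sub = create_subord(type, simulate_oom);

	if (!sub)
		return -ENOMEM;

	assert(sub->type == type);
	free(sub);
	return 0;
}
```

Keeping a single failure representation (NULL) means callers need only one check and cannot mistake an encoded error value for a valid pointer.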
2026-01-24xfs: only call xf{array,blob}_destroy if we have a valid pointerDarrick J. Wong5-9/+24
Only call the xfarray and xfblob destructor if we have a valid pointer, and be sure to null out that pointer afterwards. Note that this patch fixes a large number of commits, most of which were merged between 6.9 and 6.10. Cc: r772577952@gmail.com Cc: <stable@vger.kernel.org> # v6.12 Fixes: ab97f4b1c03075 ("xfs: repair AGI unlinked inode bucket lists") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Jiaming Zhang <r772577952@gmail.com>
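The destroy-then-null pattern can be sketched as follows. This is a userspace illustration only: `struct store`, `store_create()`, `store_destroy()`, `repair_teardown()` and the `destroy_calls` counter are hypothetical and do not reflect the xfarray or xfblob APIs.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for an xfarray/xfblob-like object. */
struct store {
	int *slots;
};

struct repair_ctx {
	struct store *unlinked;	/* may be NULL if setup never ran */
};

static int destroy_calls;	/* counts destructor invocations */

static struct store *store_create(size_t nr)
{
	struct store *s = calloc(1, sizeof(*s));

	if (!s)
		return NULL;
	s->slots = calloc(nr, sizeof(*s->slots));
	if (!s->slots) {
		free(s);
		return NULL;
	}
	return s;
}

/* The destructor itself requires a valid pointer. */
static void store_destroy(struct store *s)
{
	assert(s != NULL);
	destroy_calls++;
	free(s->slots);
	free(s);
}

/* Safe to call repeatedly, and safe if setup failed part-way. */
static void repair_teardown(struct repair_ctx *rc)
{
	if (rc->unlinked) {
		store_destroy(rc->unlinked);
		rc->unlinked = NULL;
	}
}

/* Returns how many times the destructor ran across two teardowns. */
static int demo_teardown_twice(void)
{
	struct repair_ctx rc = { store_create(4) };

	if (!rc.unlinked)
		return -1;

	destroy_calls = 0;
	repair_teardown(&rc);
	repair_teardown(&rc);	/* pointer is NULL now, so this is a no-op */
	return destroy_calls;
}
```

Nulling the pointer after destruction is what makes the second teardown harmless; without it the second call would free the same object twice.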
2026-01-23ext4: allow zeroout when doing written to unwritten splitOjaswin Mujoo2-9/+122
Currently, when we are doing an extent split and convert operation of written to unwritten extent (for example, as done by ZERO_RANGE), we don't allow the zeroout fallback in case the extent tree manipulation fails. This is mostly because zeroout might take unusually long and the fact that this code path is more tolerant to failures than endio. Since we have zeroout machinery in place, we might as well use it, hence lift this restriction. To mitigate zeroout taking too long, respect the max zeroout limit here so that the operation finishes relatively fast. Also, add kunit tests for this case. Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/1c3349020b8e098a63f293b84bc8a9b56011cef4.1769149131.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-23ext4: refactor split and convert extentsOjaswin Mujoo1-165/+112
ext4_split_convert_extents() has been historically prone to subtle bugs and inconsistent behavior due to the way all the various flags interact with the extent split and conversion process. For example, callers like ext4_convert_unwritten_extents_endio() and convert_initialized_extents() needed to open code extent conversion despite passing CONVERT or CONVERT_UNWRITTEN flags because ext4_split_convert_extents() wasn't performing the conversion. Hence, refactor ext4_split_convert_extents() to clearly enforce the semantics of each flag. The major changes here are: * Clearly separate the split and convert process: * ext4_split_extent() and ext4_split_extent_at() are now only responsible for performing the split. * ext4_split_convert_extents() is now responsible for performing extent conversion after calling ext4_split_extent() for splitting. * This helps get rid of all the MARK_UNWRIT* flags. * Clearly enforce the semantics of flags passed to ext4_split_convert_extents(): * EXT4_GET_BLOCKS_CONVERT: Will convert the split extent to written * EXT4_GET_BLOCKS_CONVERT_UNWRITTEN: Will convert the split extent to unwritten * Modify all callers to enforce the above semantics. * Use ext4_split_convert_extents() instead of ext4_split_extents() in ext4_ext_convert_to_initialized() for uniformity. * Now that ext4_split_convert_extents() is handling caching to es, we don't need to do it in ext4_split_extent_zeroout(). * Cleanup all callers open coding the conversion logic. Further, modify kunit tests to pass flags based on the new semantics. From an end user point of view, we should not see any changes in behavior of ext4. Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/2084a383d69ceefbaa293b8fcf725365eca0a349.1769149131.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-23ext4: refactor zeroout path and handle all casesOjaswin Mujoo1-98/+185
Currently, zeroout is used as a fallback in case we fail to split/convert extents in the "traditional" modify-the-extent-tree way. This is essential to mitigate failures in critical paths like extent splitting during endio. However, the logic is very messy and not easy to follow. Further, the fragile use of various flags has made it prone to errors. Refactor the zeroout logic by moving it up to ext4_split_extents(). Further, zeroout correctly based on the type of conversion we want, i.e.: - unwritten to written: Zeroout everything around the mapped range. - written to unwritten: Zeroout only the mapped range. Also, ext4_ext_convert_to_initialized() now passes EXT4_GET_BLOCKS_CONVERT to make the intention clear. Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/e1b51dedeca7c0b1f702141d91edfe4230560e7b.1769149131.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
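The two zeroout rules above can be illustrated with a small range computation in userspace C. This is a sketch under stated assumptions, not the ext4 implementation: `enum conversion`, `struct blk_range` and `zeroout_ranges()` are hypothetical, blocks are plain integers, and the mapped range is assumed to lie entirely inside the extent.

```c
#include <assert.h>

/* Hypothetical conversion kinds; not the real ext4 flags. */
enum conversion { TO_WRITTEN, TO_UNWRITTEN };

struct blk_range {
	unsigned int start;
	unsigned int len;
};

/*
 * Compute which block ranges of an extent would be zeroed out.
 * Fills out[] with up to two ranges and returns how many were filled.
 */
static int zeroout_ranges(enum conversion conv,
			  unsigned int ext_start, unsigned int ext_len,
			  unsigned int map_start, unsigned int map_len,
			  struct blk_range out[2])
{
	unsigned int ext_end = ext_start + ext_len;
	unsigned int map_end = map_start + map_len;
	int n = 0;

	/* Assumption: the mapped range is non-empty and inside the extent. */
	assert(map_len > 0 && map_start >= ext_start && map_end <= ext_end);

	if (conv == TO_UNWRITTEN) {
		/* written to unwritten: zero only the mapped range */
		out[n].start = map_start;
		out[n].len = map_len;
		return n + 1;
	}

	/* unwritten to written: zero everything around the mapped range */
	if (map_start > ext_start) {
		out[n].start = ext_start;
		out[n].len = map_start - ext_start;
		n++;
	}
	if (map_end < ext_end) {
		out[n].start = map_end;
		out[n].len = ext_end - map_end;
		n++;
	}
	return n;
}
```

For an extent covering blocks 100 to 119 with blocks 105 to 109 mapped, the unwritten-to-written case yields the two surrounding ranges, while the written-to-unwritten case yields only the mapped range.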
2026-01-23ext4: propagate flags to ext4_convert_unwritten_extents_endio()Ojaswin Mujoo1-6/+3
Currently, callers like ext4_convert_unwritten_extents() pass the EXT4_EX_NOCACHE flag to avoid caching extents; however, this is not respected by ext4_convert_unwritten_extents_endio(). Hence, modify it to accept flags from the caller and to pass the flags on to other extent manipulation functions it calls. This makes sure the NOCACHE flag is respected throughout the code path. Also, since the caller already passes the METADATA_NOFAIL and CONVERT flags, we don't need to explicitly pass them anymore. Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://patch.msgid.link/7c2139e0ad32c49c19b194f72219e15d613de284.1769149131.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>