path: root/fs
Age | Commit message | Author | Files | Lines
2026-01-23 | erofs: make z_erofs_crypto[] static | Ferry Meng | 1 | -1/+1
Reduce the scope of 'z_erofs_crypto[]' that is not used outside of 'decompressor_crypto.c'. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202512102025.4mWeBSsf-lkp@intel.com/ Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2026-01-23 | erofs: Use %pe format specifier for error pointers | Ferry Meng | 1 | -2/+2
%pe prints a symbolic error name (e.g., -ENOMEM), as opposed to the raw errno (e.g., -12) produced by PTR_ERR(). Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
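The difference between the two output styles can be sketched in userspace C. This is only an illustration: `%pe` is a kernel printk extension and is not available in libc's printf, so the sketch emulates it with a hypothetical helper `err_name()`, and `err_ptr()`/`ptr_err()` are local stand-ins for the kernel's ERR_PTR()/PTR_ERR(). The lookup table covers just a few errnos.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Userspace stand-ins for the kernel's ERR_PTR() and PTR_ERR(). */
static void *err_ptr(long err) { return (void *)(intptr_t)err; }
static long ptr_err(const void *p) { return (long)(intptr_t)p; }

/* Hypothetical helper: symbolic name for a few errnos, in the style of %pe. */
static const char *err_name(const void *p)
{
	switch (-ptr_err(p)) {
	case ENOMEM: return "-ENOMEM";
	case EIO:    return "-EIO";
	case EINVAL: return "-EINVAL";
	default:     return "(unknown)";
	}
}

/* Format both styles into buf: raw errno first, symbolic name second. */
static void format_both(char *buf, size_t len, const void *p)
{
	assert(buf && len > 0);
	snprintf(buf, len, "raw=%ld symbolic=%s", ptr_err(p), err_name(p));
}
```

Calling format_both() on err_ptr(-ENOMEM) yields a string holding both the numeric and the symbolic form, which is the readability gain the commit is after.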
2026-01-22 | NFS/localio: switch nfs_local_do_read and nfs_local_do_write to return void | Mike Snitzer | 1 | -19/+13
Both nfs_local_do_read and nfs_local_do_write only return 0 at the end, so switch them to returning void. Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2026-01-22 | NFS/localio: remove -EAGAIN handling in nfs_local_doio() | Mike Snitzer | 1 | -2/+0
Handling -EAGAIN in nfs_local_doio() was introduced with commit 0978e5b85fc08 (nfs_do_local_{read,write} were made to have negative checks for corresponding iter method) but commit e43e9a3a3d66 since eliminated the possibility for this -EAGAIN early return. So remove nfs_local_doio()'s -EAGAIN handling that calls nfs_localio_disable_client() -- while it should never happen from nfs_do_local_{read,write} this particular -EAGAIN handling is now "dead" and so it has become a liability. Fixes: e43e9a3a3d66 ("nfs/localio: refactor iocb initialization") Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2026-01-22 | NFS/localio: use GFP_NOIO and non-memreclaim workqueue in nfs_local_commit | Mike Snitzer | 1 | -3/+8
nfslocaliod_workqueue is a non-memreclaim workqueue (it isn't initialized with WQ_MEM_RECLAIM), see commit b9f5dd57f4a5 ("nfs/localio: use dedicated workqueues for filesystem read and write"). Use nfslocaliod_workqueue for LOCALIO's SYNC work. Also, set PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO in nfs_local_fsync_work. Fixes: b9f5dd57f4a5 ("nfs/localio: use dedicated workqueues for filesystem read and write") Signed-off-by: Mike Snitzer <snitzer@hammerspace.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2026-01-22 | NFS/localio: prevent direct reclaim recursion into NFS via nfs_writepages | Mike Snitzer | 1 | -0/+15
LOCALIO is an NFS loopback mount optimization that avoids using the network for READ, WRITE and COMMIT if the NFS client and server are determined to be on the same system. But because LOCALIO is still fundamentally "just NFS loopback mount" it is susceptible to recursion deadlock via direct reclaim, e.g.: NFS LOCALIO down to XFS and then back into NFS via nfs_writepages. Fix LOCALIO's potential for direct reclaim deadlock by ensuring that all its page cache allocations are done from GFP_NOFS context. Thanks to Ben Coddington for pointing out commit ad22c7a043c2 ("xfs: prevent stack overflows from page cache allocation"). Reported-by: John Cagle <john.cagle@hammerspace.com> Tested-by: Allen Lu <allen.lu@hammerspace.com> Suggested-by: Benjamin Coddington <bcodding@hammerspace.com> Fixes: 70ba381e1a43 ("nfs: add LOCALIO support") Signed-off-by: Mike Snitzer <snitzer@hammerspace.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2026-01-22 | NFS/localio: Cleanup the nfs_local_pgio_done() parameters | Trond Myklebust | 1 | -9/+5
Remove the redundant 'force' parameter. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2026-01-22 | NFS/localio: Handle short writes by retrying | Trond Myklebust | 1 | -17/+47
The current code for handling short writes in localio just truncates the I/O and then sets an error. While that is close to how the ordinary NFS code behaves, it does mean there is a chance the data that got written is lost because it isn't persisted. To fix this, change localio so that the upper layers can direct the behaviour to persist any unstable data by rewriting it, and then continuing writing until an ENOSPC is hit. Fixes: 70ba381e1a43 ("nfs: add LOCALIO support") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2026-01-21 | btrfs: zlib: fix the folio leak on S390 hardware acceleration | Qu Wenruo | 1 | -0/+1
[BUG] After commit aa60fe12b4f4 ("btrfs: zlib: refactor S390x HW acceleration buffer preparation"), we no longer release the page cache folio returned by btrfs_compress_filemap_get_folio() for the S390 hardware acceleration path. [CAUSE] Before that commit, we called kunmap_local() and folio_put() after handling each folio. Although the timing was not ideal (it released the previous folio at the beginning of the loop, and relied on some extra cleanup outside the loop), it at least handled the folio release correctly. While the refactored code is easier to read, it lacks the call to release the filemap folio. [FIX] Add the missing folio_put() for copy_data_into_buffer(). CC: linux-s390@vger.kernel.org # 6.18+ Fixes: aa60fe12b4f4 ("btrfs: zlib: refactor S390x HW acceleration buffer preparation") Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-01-21 | btrfs: do not strictly require dirty metadata threshold for metadata writepages | Qu Wenruo | 3 | -26/+2
[BUG] There is an internal report that over 1000 processes are waiting at the io_schedule_timeout() of balance_dirty_pages(), causing a system hang and trigger a kernel coredump. The kernel is v6.4 kernel based, but the root problem still applies to any upstream kernel before v6.18. [CAUSE] From Jan Kara for his wisdom on the dirty page balance behavior first. This cgroup dirty limit was what was actually playing the role here because the cgroup had only a small amount of memory and so the dirty limit for it was something like 16MB. Dirty throttling is responsible for enforcing that nobody can dirty (significantly) more dirty memory than there's dirty limit. Thus when a task is dirtying pages it periodically enters into balance_dirty_pages() and we let it sleep there to slow down the dirtying. When the system is over dirty limit already (either globally or within a cgroup of the running task), we will not let the task exit from balance_dirty_pages() until the number of dirty pages drops below the limit. So in this particular case, as I already mentioned, there was a cgroup with relatively small amount of memory and as a result with dirty limit set at 16MB. A task from that cgroup has dirtied about 28MB worth of pages in btrfs btree inode and these were practically the only dirty pages in that cgroup. So that means the only way to reduce the dirty pages of that cgroup is to writeback the dirty pages of btrfs btree inode, and only after that those processes can exit balance_dirty_pages(). Now back to the btrfs part, btree_writepages() is responsible for writing back dirty btree inode pages. The problem here is, there is a btrfs internal threshold that if the btree inode's dirty bytes are below the 32M threshold, it will not do any writeback. This behavior is to batch as much metadata as possible so we won't write back those tree blocks and then later re-COW them again for another modification. 
This internal 32MiB is higher than the existing dirty page size (28MiB), meaning no writeback will happen, causing a deadlock between btrfs and cgroup: - Btrfs doesn't want to write back btree inode until more dirty pages - Cgroup/MM doesn't want more dirty pages for btrfs btree inode Thus any process touching that btree inode is put into sleep until the number of dirty pages is reduced. Thanks Jan Kara a lot for the analysis of the root cause. [ENHANCEMENT] Since kernel commit b55102826d7d ("btrfs: set AS_KERNEL_FILE on the btree_inode"), btrfs btree inode pages will only be charged to the root cgroup which should have a much larger limit than btrfs' 32MiB threshold. So it should not affect newer kernels. But for all current LTS kernels, they are all affected by this problem, and backporting the whole AS_KERNEL_FILE may not be a good idea. Even for newer kernels I still think it's a good idea to get rid of the internal threshold at btree_writepages(), since for most cases cgroup/MM has a better view of full system memory usage than btrfs' fixed threshold. For internal callers using btrfs_btree_balance_dirty() since that function is already doing internal threshold check, we don't need to bother them. But for external callers of btree_writepages(), just respect their requests and write back whatever they want, ignoring the internal btrfs threshold to avoid such deadlock on btree inode dirty page balancing. CC: stable@vger.kernel.org CC: Jan Kara <jack@suse.cz> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
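The deadlock described above can be modelled with a few lines of userspace C. This is a sketch under stated assumptions, not the btrfs code: the two policy functions and `task_stuck()` are hypothetical names, and the model reduces dirty throttling to a single comparison against the cgroup limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MiB (1024ULL * 1024ULL)
/* The btrfs-internal batching threshold named in the commit message. */
#define BTREE_DIRTY_THRESHOLD (32 * MiB)

/* Old policy: skip metadata writeback while below the internal threshold. */
static bool old_should_writeback(uint64_t dirty_bytes)
{
	return dirty_bytes >= BTREE_DIRTY_THRESHOLD;
}

/* New policy for external callers: write back whenever anything is dirty. */
static bool new_should_writeback(uint64_t dirty_bytes)
{
	return dirty_bytes > 0;
}

/*
 * A throttled task can make progress only once dirty memory drops below
 * the cgroup limit, which requires writeback to actually happen.
 */
static bool task_stuck(uint64_t dirty_bytes, uint64_t cgroup_limit,
		       bool (*should_writeback)(uint64_t))
{
	assert(should_writeback);
	return dirty_bytes > cgroup_limit && !should_writeback(dirty_bytes);
}
```

With the reported numbers (28MiB dirty, 16MiB cgroup limit), the old policy leaves the task stuck, while the new policy lets writeback proceed.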
2026-01-21 | Merge tag 'for-6.19-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux | Linus Torvalds | 5 | -2/+73
Pull btrfs fixes from David Sterba: - protect reading super block vs setting block size externally (found by syzbot) - make sure no transaction is started in read-only mode even with some rescue mount option combinations - fix checksum calculation of backup super blocks when block-group-tree is enabled - more extensive mount-time checks of device items that could be left after device replace and attempting degraded mount - fix build warning with -Wmaybe-uninitialized on loongarch64-gcc 12 * tag 'for-6.19-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: add extra device item checks at mount btrfs: fix missing fields in superblock backup with BLOCK_GROUP_TREE btrfs: reject new transactions if the fs is fully read-only btrfs: sync read disk super and set block size btrfs: fix Wmaybe-uninitialized warning in replay_one_buffer()
2026-01-21 | xfs: fix incorrect context handling in xfs_trans_roll | Wenwu Hou | 2 | -11/+6
The memalloc_nofs_save() and memalloc_nofs_restore() calls are incorrectly paired in xfs_trans_roll. Call path: xfs_trans_alloc() __xfs_trans_alloc() // tp->t_pflags = memalloc_nofs_save(); xfs_trans_set_context() ... xfs_defer_trans_roll() xfs_trans_roll() xfs_trans_dup() // old_tp->t_pflags = 0; xfs_trans_switch_context() __xfs_trans_commit() xfs_trans_free() // memalloc_nofs_restore(tp->t_pflags); xfs_trans_clear_context() The code passes 0 to memalloc_nofs_restore() when committing the original transaction, but memalloc_nofs_restore() should always receive the flags returned from the paired memalloc_nofs_save() call. Before commit 3f6d5e6a468d ("mm: introduce memalloc_flags_{save,restore}"), calling memalloc_nofs_restore(0) would unset the PF_MEMALLOC_NOFS flag, which could cause memory allocation deadlocks[1]. Fortunately, after that commit, memalloc_nofs_restore(0) does nothing, so this issue is currently harmless. Fixes: 756b1c343333 ("xfs: use current->journal_info for detecting transaction recursion") Link: https://lore.kernel.org/linux-xfs/20251104131857.1587584-1-leo.lilong@huawei.com [1] Signed-off-by: Wenwu Hou <hwenwur@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
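The two restore behaviours mentioned above can be contrasted in a small userspace model. It is a simplified sketch based on my reading of the helpers, not the kernel source: `task_flags` stands in for current->flags, the bit value is illustrative, and the `old_`/`new_` prefixes are hypothetical names for the semantics before and after commit 3f6d5e6a468d.

```c
#include <assert.h>

#define PF_MEMALLOC_NOFS 0x1u   /* illustrative bit value, not the kernel's */

static unsigned int task_flags; /* stands in for current->flags */

/* Older semantics: save returns the previous state of the NOFS bit... */
static unsigned int old_nofs_save(void)
{
	unsigned int flags = task_flags & PF_MEMALLOC_NOFS;

	task_flags |= PF_MEMALLOC_NOFS;
	return flags;
}

/* ...and restore writes that state back, so restore(0) clears the bit. */
static void old_nofs_restore(unsigned int flags)
{
	assert((flags & ~PF_MEMALLOC_NOFS) == 0);
	task_flags = (task_flags & ~PF_MEMALLOC_NOFS) | flags;
}

/* Newer semantics: save returns only the bits it newly set... */
static unsigned int new_nofs_save(void)
{
	unsigned int newly_set = ~task_flags & PF_MEMALLOC_NOFS;

	task_flags |= PF_MEMALLOC_NOFS;
	return newly_set;
}

/* ...and restore clears exactly those bits, so restore(0) is a no-op. */
static void new_nofs_restore(unsigned int flags)
{
	task_flags &= ~flags;
}
```

In this model a mispaired restore(0) drops NOFS protection early under the older semantics and is harmless under the newer ones, which matches the commit's assessment that the bug is currently benign.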
2026-01-21 | xfs: always allocate the free zone with the lowest index | Hans Holmberg | 2 | -31/+17
Zones in the beginning of the address space are typically mapped to higher bandwidth tracks on HDDs than those at the end of the address space. So, instead of allocating zones "round robin" across the whole address space, always allocate the zone with the lowest index. This increases average write bandwidth for overwrite workloads when less than the full capacity is being used. At ~50% utilization this improves bandwidth for a random file overwrite benchmark with 128MiB files and 256MiB zone capacity by 30%. Running the same benchmark with small 2-8 MiB files at 67% capacity shows no significant difference in performance. Due to heavy fragmentation the whole zone range is in use, greatly limiting the number of free zones with high bw. Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
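The two allocation policies can be compared with a minimal sketch. It assumes free zones are tracked in a plain boolean array, and the function names and the NO_FREE_ZONE sentinel are hypothetical. The XFS code uses its own data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NO_FREE_ZONE ((size_t)-1)

/* Return the lowest-indexed free zone, or NO_FREE_ZONE if none is free. */
static size_t pick_lowest_free_zone(const bool *zone_free, size_t nr_zones)
{
	assert(zone_free || nr_zones == 0);
	for (size_t i = 0; i < nr_zones; i++)
		if (zone_free[i])
			return i;
	return NO_FREE_ZONE;
}

/* Round-robin for comparison: resume scanning after the last allocation. */
static size_t pick_round_robin_zone(const bool *zone_free, size_t nr_zones,
				    size_t last)
{
	assert(zone_free || nr_zones == 0);
	for (size_t n = 1; n <= nr_zones; n++) {
		size_t i = (last + n) % nr_zones;

		if (zone_free[i])
			return i;
	}
	return NO_FREE_ZONE;
}
```

On a partially used device the lowest-index policy keeps returning zones near the start of the address space, whereas round robin drifts toward the slower end.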
2026-01-21 | xfs: promote metadata directories and large block support | Darrick J. Wong | 3 | -14/+0
Large block support was merged upstream in 6.12 (Dec 2024) and metadata directories were merged in 6.13 (Jan 2025). We've not received any serious complaints about the ondisk formats of these two features in the past year, so let's remove the experimental warnings. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 | xfs: use blkdev_get_zone_info to simplify zone reporting | Christoph Hellwig | 1 | -64/+50
Unwind the callback based programming model by querying the cached zone information using blkdev_get_zone_info. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 | xfs: check that used blocks are smaller than the write pointer | Christoph Hellwig | 1 | -0/+7
Any used block must have been written, so reject used blocks > write pointer. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 | xfs: split and refactor zone validation | Christoph Hellwig | 3 | -114/+68
Currently xfs_zone_validate mixes validating the software zone state in the XFS realtime group with validating the hardware state reported in struct blk_zone and deriving the write pointer from that. Move all code that works on the realtime group to xfs_init_zone, and only keep the hardware state validation in xfs_zone_validate. This makes the code more clear, and allows for better reuse in userspace. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 | xfs: pass the write pointer to xfs_init_zone | Christoph Hellwig | 1 | -29/+37
Move the two methods to query the write pointer out of xfs_init_zone into the callers, so that xfs_init_zone doesn't have to bother with the blk_zone structure and instead operates purely at the XFS realtime group level. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 | xfs: add a xfs_rtgroup_raw_size helper | Christoph Hellwig | 1 | -0/+15
Add a helper to figure the on-disk size of a group, accounting for the XFS_SB_FEAT_INCOMPAT_ZONE_GAPS feature if needed. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 | xfs: add missing forward declaration in xfs_zones.h | Damien Le Moal | 1 | -0/+1
Add the missing forward declaration for struct blk_zone in xfs_zones.h. This avoids headaches with the order of header file inclusion to avoid compilation errors. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 | xfs: remove xfs_attr_leaf_hasname | Christoph Hellwig | 1 | -51/+24
The calling convention of xfs_attr_leaf_hasname() is problematic, because it returns a NULL buffer when xfs_attr3_leaf_read fails, a valid buffer when xfs_attr3_leaf_lookup_int returns -ENOATTR or -EEXIST, and a non-NULL buffer pointer for an already released buffer when xfs_attr3_leaf_lookup_int fails with other error values. Fix this by simply open coding xfs_attr_leaf_hasname in the callers, so that the buffer release code is done by each caller of xfs_attr3_leaf_read. Cc: stable@vger.kernel.org # v5.19+ Fixes: 07120f1abdff ("xfs: Add xfs_has_attr and subroutines") Reported-by: Mark Tinguely <mark.tinguely@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 | xfs: mark data structures corrupt on EIO and ENODATA | Darrick J. Wong | 3 | -0/+8
I learned a few things this year: first, blk_status_to_errno can return ENODATA for critical media errors; and second, the scrub code doesn't mark data structures as corrupt on ENODATA or EIO. Currently, scrub failing to capture these errors isn't all that impactful -- the checking code will exit to userspace with EIO/ENODATA, and xfs_scrub will log a complaint and exit with nonzero status. Most people treat fsck tools failing as a sign that the fs is corrupt, but online fsck should mark the metadata bad and keep moving. Cc: stable@vger.kernel.org # v4.15 Fixes: 4700d22980d459 ("xfs: create helpers to record and deal with scrub problems") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 | xfs: rework zone GC buffer management | Christoph Hellwig | 1 | -47/+59
The double buffering where just one scratch area is used at a time does not efficiently use the available memory. It was originally implemented when GC I/O could happen out of order, but that was removed before upstream submission to avoid fragmentation. Now that all GC I/Os are processed in order, just use a number of buffers as a simple ring buffer. For a synthetic benchmark that fills 256MiB HDD zones and punches out holes to free half the space this leads to a decrease of GC time by a little more than 25%. Thanks to Hans Holmberg <hans.holmberg@wdc.com> for testing and benchmarking. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
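The idea of cycling through several scratch buffers in order can be sketched as a small ring. This is an illustrative model only: the struct, the function names, and the buffer count GC_NR_BUFS are hypothetical, and it tracks just the indices rather than any actual I/O.

```c
#include <assert.h>
#include <stdbool.h>

#define GC_NR_BUFS 4            /* hypothetical buffer count */

struct gc_ring {
	unsigned int head;      /* next buffer to hand out */
	unsigned int tail;      /* oldest buffer still in flight */
	unsigned int in_use;    /* buffers currently handed out */
};

/* Claim the next buffer in ring order; returns false when all are busy. */
static bool gc_ring_get(struct gc_ring *r, unsigned int *idx)
{
	assert(r && idx);
	if (r->in_use == GC_NR_BUFS)
		return false;
	*idx = r->head;
	r->head = (r->head + 1) % GC_NR_BUFS;
	r->in_use++;
	return true;
}

/* Release the oldest buffer; with in-order completion it is always the tail. */
static unsigned int gc_ring_put(struct gc_ring *r)
{
	unsigned int idx;

	assert(r && r->in_use > 0);
	idx = r->tail;
	r->tail = (r->tail + 1) % GC_NR_BUFS;
	r->in_use--;
	return idx;
}
```

Because completions arrive in submission order, no per-buffer bookkeeping beyond the head and tail positions is needed, and every buffer can be in flight at once instead of only one of two.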
2026-01-21 | xfs: use bio_reuse in the zone GC code | Christoph Hellwig | 1 | -6/+1
Replace our somewhat fragile code to reuse the bio, which caused a regression in the past, with the block layer bio_reuse helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 | xfs: directly include xfs_platform.h | Christoph Hellwig | 194 | -204/+193
The xfs.h header conflicts with the public xfs.h in xfsprogs, leading to a spurious difference in all shared libxfs files that have to include libxfs_priv.h in userspace. Directly include xfs_platform.h so that we can add a header of the same name to xfsprogs and remove this major annoyance for the shared code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 | xfs: move the remaining content from xfs.h to xfs_platform.h | Christoph Hellwig | 2 | -17/+16
Move the global defines from xfs.h to xfs_platform.h to prepare for removing xfs.h. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 | xfs: include global headers first in xfs_platform.h | Christoph Hellwig | 1 | -14/+10
Ensure we have all kernel headers included by the time we do our own thing, just like the rest of the tree. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 | xfs: rename xfs_linux.h to xfs_platform.h | Christoph Hellwig | 2 | -4/+4
Rename xfs_linux.h to prepare for including it directly from source files, including those shared with xfsprogs. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 | xfs: factor out a xlog_write_space_advance helper | Christoph Hellwig | 1 | -12/+20
Add a new xlog_write_space_advance helper that returns the current place in the iclog that data is written to, and advances the various counters by the amount taken from xlog_write_iovec, and also use it in xlog_write_partial, which open codes the counter adjustments, but misses the asserts. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 | xfs: improve the iclog space assert in xlog_write_iovec | Christoph Hellwig | 1 | -1/+1
We need enough space for the length we copy into the iclog, not just some space, so tighten up the check a bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 | xfs: add a xlog_write_space_left helper | Christoph Hellwig | 1 | -9/+13
Various places check how much space is left in the current iclog, add a helper for that. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 | xfs: improve the calling convention for the xlog_write helpers | Christoph Hellwig | 1 | -110/+77
The xlog_write chain passes around the same seven variables that are often passed by reference. Add a xlog_write_data structure to contain them to improve code generation and readability. This change increases the generated code size by about 140 bytes for my x86_64 build, which is hopefully worth the much easier to follow code: $ size fs/xfs/xfs_log.o* text data bss dec hex filename 29300 1730 176 31206 79e6 fs/xfs/xfs_log.o 29160 1730 176 31066 795a fs/xfs/xfs_log.o.old Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 | xfs: regularize iclog space accounting in xlog_write_partial | Christoph Hellwig | 1 | -3/+4
When xlog_write_partial splits a log region over multiple iclogs, it has to include the continuation ophdr in the length requested for the new iclog. Currently it simply adds that to the request, which makes the accounting of the used space below look slightly different from the other users of iclog space that decrement it. To prepare for more code sharing, add the ophdr size to the len variable that tracks the number of bytes still left in this xlog_write operation before calling xlog_write_get_more_iclog_space, and then decrement it later when consuming that space. This changes the value of len when xlog_write_get_more_iclog_space returns an error, but as nothing looks at len in that case the difference doesn't matter. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 | xfs: move struct xfs_log_vec to xfs_log_priv.h | Christoph Hellwig | 2 | -12/+12
The log_vec is a private type for the log/CIL code and should not be exposed to anything else. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 | xfs: move struct xfs_log_iovec to xfs_log_priv.h | Christoph Hellwig | 2 | -7/+6
This structure is now only used by the core logging and CIL code. Also remove the unused typedef. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 | xfs: improve the ->iop_format interface | Christoph Hellwig | 14 | -194/+180
Export a higher level interface to format log items. The xlog_format_buf structure is hidden inside xfs_log_cil.c and only accessed using two helpers (and a wrapper built on top), hiding details of log iovecs from the log items. This also allows simply using an index into lv_iovecp instead of keeping a cursor vec. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 | xfs: set lv_bytes in xlog_write_one_vec | Christoph Hellwig | 1 | -2/+3
lv_bytes is mostly just used by the CIL code, but has crept into the low-level log writing code to decide on a full or partial iclog write. Ensure it is valid even for the special log writes that don't go through the CIL by initializing it in xlog_write_one_vec. Note that even without this fix, the checkpoint commits would never trigger a partial iclog write, as they have no payload beyond the opheader. The unmount record on the other hand could in theory trigger an overflow of the iclog, but given that it has never been seen in the wild this has probably been masked by the small size of it and the fact that the unmount process does multiple log forces before writing the unmount record and we thus usually operate on an empty or almost empty iclog. Fixes: 110dc24ad2ae ("xfs: log vector rounding leaks log space") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-21 | xfs: add a xlog_write_one_vec helper | Christoph Hellwig | 3 | -24/+24
Add a wrapper for xlog_write for the two callers who need to build a log_vec and add it to a single-entry chain instead of duplicating the code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2026-01-20 | fat: remove unused parameter | Lalit Shankar Chowdhury | 1 | -2/+2
Remove unused inode parameter from fat_cache_alloc(). Link: https://lkml.kernel.org/r/20251201214403.90604-2-lalitshankarch@gmail.com Signed-off-by: Lalit Shankar Chowdhury <lalitshankarch@gmail.com> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 | kernel.h: drop hex.h and update all hex.h users | Randy Dunlap | 13 | -0/+13
Remove <linux/hex.h> from <linux/kernel.h> and update all users/callers of hex.h interfaces to directly #include <linux/hex.h> as part of the process of putting kernel.h on a diet. Removing hex.h from kernel.h means that 36K C source files don't have to pay the price of parsing hex.h for the roughly 120 C source files that need it. This change has been build-tested with allmodconfig on most ARCHes. Also, all users/callers of <linux/hex.h> in the entire source tree have been updated if needed (if not already #included). Link: https://lkml.kernel.org/r/20251215005206.2362276-1-rdunlap@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Yury Norov (NVIDIA) <yury.norov@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 | ocfs2: annotate more flexible array members with __counted_by_le() | Dmitry Antipov | 2 | -5/+3
Annotate flexible array members of 'struct ocfs2_local_alloc' and 'struct ocfs2_inline_data' with '__counted_by_le()' attribute to improve array bounds checking when CONFIG_UBSAN_BOUNDS is enabled, and prefer the convenient 'memset()' over an explicit loop to simplify 'ocfs2_clear_local_alloc()'. Link: https://lkml.kernel.org/r/20251021105518.119953-1-dmantipov@yandex.ru Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Reviewed-by: Heming Zhao <heming.zhao@suse.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 | ocfs2: fix oob in __ocfs2_find_path | Edward Adam Davis | 1 | -4/+5
syzbot constructed a corrupted image, which resulted in el->l_count from the b-tree extent block being 0. Since the length of the l_recs array depends on l_count, reading its member e_blkno triggered the out-of-bounds access reported by syzbot in [1]. The loop terminates when l_count is 0, similar to when next_free is 0. [1] UBSAN: array-index-out-of-bounds in fs/ocfs2/alloc.c:1838:11 index 0 is out of range for type 'struct ocfs2_extent_rec[] __counted_by(l_count)' (aka 'struct ocfs2_extent_rec[]') Call Trace: __ocfs2_find_path+0x606/0xa40 fs/ocfs2/alloc.c:1838 ocfs2_find_leaf+0xab/0x1c0 fs/ocfs2/alloc.c:1946 ocfs2_get_clusters_nocache+0x172/0xc60 fs/ocfs2/extent_map.c:418 ocfs2_get_clusters+0x505/0xa70 fs/ocfs2/extent_map.c:631 ocfs2_extent_map_get_blocks+0x202/0x6a0 fs/ocfs2/extent_map.c:678 ocfs2_read_virt_blocks+0x286/0x930 fs/ocfs2/extent_map.c:1001 ocfs2_read_dir_block fs/ocfs2/dir.c:521 [inline] ocfs2_find_entry_el fs/ocfs2/dir.c:728 [inline] ocfs2_find_entry+0x3e4/0x2090 fs/ocfs2/dir.c:1120 ocfs2_find_files_on_disk+0xdf/0x310 fs/ocfs2/dir.c:2023 ocfs2_lookup_ino_from_name+0x52/0x100 fs/ocfs2/dir.c:2045 _ocfs2_get_system_file_inode fs/ocfs2/sysfile.c:136 [inline] ocfs2_get_system_file_inode+0x326/0x770 fs/ocfs2/sysfile.c:112 ocfs2_init_global_system_inodes+0x319/0x660 fs/ocfs2/super.c:461 ocfs2_initialize_super fs/ocfs2/super.c:2196 [inline] ocfs2_fill_super+0x4432/0x65b0 fs/ocfs2/super.c:993 get_tree_bdev_flags+0x40e/0x4d0 fs/super.c:1691 vfs_get_tree+0x92/0x2a0 fs/super.c:1751 fc_mount fs/namespace.c:1199 [inline] Link: https://lkml.kernel.org/r/tencent_4D99464FA28D9225BE0DBA923F5DF6DD8C07@qq.com Signed-off-by: Edward Adam Davis <eadavis@qq.com> Reported-by: syzbot+151afab124dfbc5f15e6@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=151afab124dfbc5f15e6 Reviewed-by: Heming Zhao <heming.zhao@suse.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: 
Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
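The guard described in the __ocfs2_find_path fix can be illustrated with a reduced header structure. This is a hedged sketch, not the ocfs2 code: the struct is a simplified stand-in, `extent_list_walkable()` is a hypothetical name, and the final `l_next_free_rec <= l_count` comparison is an extra sanity check added here for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for an extent list header. */
struct extent_list_hdr {
	uint16_t l_count;          /* capacity of the record array */
	uint16_t l_next_free_rec;  /* number of records in use */
};

/* Indexing record 0 is only safe if the list has capacity and used records. */
static bool extent_list_walkable(const struct extent_list_hdr *el)
{
	assert(el);
	if (el->l_count == 0)
		return false;   /* corrupted: zero-length record array */
	if (el->l_next_free_rec == 0)
		return false;   /* empty list, nothing to descend into */
	/* Assumption for this sketch: used records never exceed capacity. */
	return el->l_next_free_rec <= el->l_count;
}
```

A caller would stop walking the tree as soon as this returns false, so a crafted image with l_count set to 0 never reaches the array access.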
2026-01-20 | ocfs2: add validate function for slot map blocks | Prithvi Tambewagh | 1 | -2/+25
When the filesystem is being mounted, the kernel panics while the data regarding slot map allocation to the local node, is being written to the disk. This occurs because the value of slot map buffer head block number, which should have been greater than or equal to `OCFS2_SUPER_BLOCK_BLKNO` (evaluating to 2) is less than it, indicative of disk metadata corruption. This triggers BUG_ON(bh->b_blocknr < OCFS2_SUPER_BLOCK_BLKNO) in ocfs2_write_block(), causing the kernel to panic. This is fixed by introducing function ocfs2_validate_slot_map_block() to validate slot map blocks. It first checks if the buffer head passed to it is up to date and valid, else it panics the kernel at that point itself. Further, it contains an if condition block, which checks if `bh->b_blocknr` is lesser than `OCFS2_SUPER_BLOCK_BLKNO`; if yes, then ocfs2_error is called, which prints the error log, for debugging purposes, and the return value of ocfs2_error() is returned. If the if condition is false, value 0 is returned by ocfs2_validate_slot_map_block(). This function is used as validate function in calls to ocfs2_read_blocks() in ocfs2_refresh_slot_info() and ocfs2_map_slot_buffers(). Link: https://lkml.kernel.org/r/20251215184600.13147-1-activprithvi@gmail.com Signed-off-by: Prithvi Tambewagh <activprithvi@gmail.com> Reported-by: syzbot+c818e5c4559444f88aa0@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=c818e5c4559444f88aa0 Tested-by: <syzbot+c818e5c4559444f88aa0@syzkaller.appspotmail.com> Reviewed-by: Heming Zhao <heming.zhao@suse.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
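The block number check at the heart of the validator can be sketched in isolation. This is a simplified illustration under stated assumptions: `struct fake_bh` is a hypothetical stand-in for the one buffer head field the check reads, and returning -EIO merely stands in for whatever ocfs2_error() returns in the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define OCFS2_SUPER_BLOCK_BLKNO 2   /* value stated in the commit message */

/* Hypothetical stand-in for the buffer head field used by the check. */
struct fake_bh {
	uint64_t b_blocknr;
};

/* Return 0 for a plausible slot map block, a negative errno otherwise. */
static int validate_slot_map_block(const struct fake_bh *bh)
{
	assert(bh);
	if (bh->b_blocknr < OCFS2_SUPER_BLOCK_BLKNO)
		return -EIO;    /* assumption: stands in for ocfs2_error()'s result */
	return 0;
}
```

Rejecting the block at read time turns what used to be a BUG_ON() in the write path into an ordinary error return.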
2026-01-20ocfs2: adjust ocfs2_xa_remove_entry() to match UBSAN boundary checksDmitry Antipov1-2/+3
After introducing 2f26f58df041 ("ocfs2: annotate flexible array members with __counted_by_le()"), syzbot has reported the following issue:

UBSAN: array-index-out-of-bounds in fs/ocfs2/xattr.c:1955:3
index 2 is out of range for type 'struct ocfs2_xattr_entry[] __counted_by(xh_count)' (aka 'struct ocfs2_xattr_entry[]')
...
Call Trace:
 <TASK>
 dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
 ubsan_epilogue+0xa/0x40 lib/ubsan.c:233
 __ubsan_handle_out_of_bounds+0xe9/0xf0 lib/ubsan.c:455
 ocfs2_xa_remove_entry+0x36d/0x3e0 fs/ocfs2/xattr.c:1955
 ...

To address this issue, removal of the 'xh_entries[]' member should be performed before actually changing 'xh_count', thus making sure that all array accesses match the boundary checks performed by UBSAN.

Link: https://lkml.kernel.org/r/20251211155949.774485-1-dmantipov@yandex.ru
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Reported-by: syzbot+cf96bc82a588a27346a8@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=cf96bc82a588a27346a8
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Deepanshu Kartikey <kartikey406@gmail.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
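The ordering issue above can be illustrated with a small userspace sketch. It is a simplified model under stated assumptions, not the ocfs2 code: `struct header`, `struct entry`, `MAX_ENTRIES`, and `remove_entry()` are hypothetical, the array is fixed-size instead of a flexible array member, and the special handling of the last remaining entry is left out. What it shows is only the order of operations: every array access happens while the count still covers the index being touched, and the count is lowered last.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_ENTRIES 8  /* hypothetical capacity for this sketch */

struct entry {
    uint32_t name_hash;
};

struct header {
    uint16_t count;                     /* number of live entries */
    struct entry entries[MAX_ENTRIES];  /* stands in for a counted array */
};

/* Remove entries[index], keeping all accesses within the current count. */
static void remove_entry(struct header *h, unsigned int index)
{
    unsigned int count = h->count;  /* still the old, larger count */

    assert(index < count);

    /* Close the gap: shift the tail down by one slot. */
    memmove(&h->entries[index], &h->entries[index + 1],
            (count - index - 1) * sizeof(struct entry));

    /* Clear the now-unused last slot; count - 1 is in bounds of the old count. */
    memset(&h->entries[count - 1], 0, sizeof(struct entry));

    /* Only now shrink the count that a bounds checker would compare against. */
    h->count = (uint16_t)(count - 1);
}
```

If the count were decremented first, the `memset()` on the old last slot would index one element past the new count, which is the kind of access a `__counted_by` bounds check flags.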
2026-01-20ocfs2: validate inline data i_size during inode readDeepanshu Kartikey1-6/+19
When reading an inode from disk, ocfs2_validate_inode_block() performs various sanity checks but does not validate the size of inline data. If the filesystem is corrupted, an inode's i_size can exceed the actual inline data capacity (id_count).

This causes ocfs2_dir_foreach_blk_id() to iterate beyond the inline data buffer, triggering a use-after-free when accessing directory entries from freed memory.

In the syzbot report:
- i_size was 1099511627576 bytes (~1TB)
- Actual inline data capacity (id_count) is typically <256 bytes
- A garbage rec_len (54648) caused ctx->pos to jump out of bounds
- This triggered a UAF in ocfs2_check_dir_entry()

Fix by adding a validation check in ocfs2_validate_inode_block() to ensure inodes with inline data have i_size <= id_count. This catches the corruption early during inode read and prevents all downstream code from operating on invalid data.

Link: https://lkml.kernel.org/r/20251212052132.16750-1-kartikey406@gmail.com
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
Reported-by: syzbot+c897823f699449cc3eb4@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c897823f699449cc3eb4
Tested-by: syzbot+c897823f699449cc3eb4@syzkaller.appspotmail.com
Link: https://lore.kernel.org/all/20251211115231.3560028-1-kartikey406@gmail.com/T/ [v1]
Link: https://lore.kernel.org/all/20251212040400.6377-1-kartikey406@gmail.com/T/ [v2]
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
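A minimal sketch of the i_size <= id_count rule, assuming a simplified in-memory view of the inode: `struct inline_inode`, its `has_inline_data` field, and `validate_inline_size()` are hypothetical names, and `-EINVAL` is an assumed error code, not necessarily what the kernel returns.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, flattened view of the fields the check compares. */
struct inline_inode {
    uint64_t i_size;      /* file size recorded in the inode */
    uint16_t id_count;    /* bytes of inline data the inode can hold */
    int has_inline_data;  /* non-zero if the data lives inside the inode */
};

/* Returns 0 if the sizes are consistent, a negative errno otherwise. */
static int validate_inline_size(const struct inline_inode *di)
{
    assert(di != NULL);

    /* Inline data cannot be larger than the space reserved for it. */
    if (di->has_inline_data && di->i_size > di->id_count)
        return -EINVAL;  /* assumed error code for this sketch */

    return 0;
}
```

With the values from the syzbot report (i_size of 1099511627576 against a capacity below 256 bytes), the check fails immediately, so directory iteration never starts on the oversized inode.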
2026-01-20ocfs2: validate i_refcount_loc when refcount flag is setDeepanshu Kartikey1-0/+7
Add validation in ocfs2_validate_inode_block() to check that if an inode has OCFS2_HAS_REFCOUNT_FL set, it must also have a valid i_refcount_loc.

A corrupted filesystem image can have this inconsistent state, which later triggers a BUG_ON in ocfs2_remove_refcount_tree() when the inode is being wiped during unlink.

Catch this corruption early during inode validation to fail gracefully instead of crashing the kernel.

Link: https://lkml.kernel.org/r/20251212055826.20929-1-kartikey406@gmail.com
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
Reported-by: syzbot+6d832e79d3efe1c46743@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=6d832e79d3efe1c46743
Tested-by: syzbot+6d832e79d3efe1c46743@syzkaller.appspotmail.com
Link: https://lore.kernel.org/all/20251208084407.3021466-1-kartikey406@gmail.com/T/ [v1]
Link: https://lore.kernel.org/all/20251212045646.9988-1-kartikey406@gmail.com/T/ [v2]
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20ocfs2: constify struct configfs_item_operations and configfs_group_operationsChristophe JAILLET2-6/+6
'struct configfs_item_operations' and 'configfs_group_operations' are not modified in this driver.

Constifying these structures moves some data to a read-only section, which increases overall security, especially when the structure holds some function pointers.

On x86_64, with allmodconfig, as an example:

Before:
======
   text    data     bss     dec     hex filename
  74011   19312    5280   98603   1812b fs/ocfs2/cluster/heartbeat.o

After:
=====
   text    data     bss     dec     hex filename
  74171   19152    5280   98603   1812b fs/ocfs2/cluster/heartbeat.o

Link: https://lkml.kernel.org/r/7c7c00ba328e5e514d8debee698154039e9640dd.1765708880.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20ocfs2: detect released suballocator BG for fh_to_[dentry|parent]Heming Zhao2-11/+21
After ocfs2 gained the ability to reclaim free suballocator block groups (BGs), a suballocator block group may be released. This change causes the xfstest case generic/426 to fail.

generic/426 expects return value -ENOENT or -ESTALE, but the current code triggers -EROFS.

Call stack before ocfs2 gained the ability to reclaim bg:

ocfs2_fh_to_dentry //or ocfs2_fh_to_parent
 ocfs2_get_dentry
  + ocfs2_test_inode_bit
  |  ocfs2_test_suballoc_bit
  |   + ocfs2_read_group_descriptor //Since ocfs2 never releases the bg,
  |   |                             //the bg block was always found.
  |   + *res = ocfs2_test_bit //unlink was called, and the bit is zero
  |
  + if (!set) //because the above *res is 0
     status = -ESTALE //the generic/426 expected return value

Current call stack that triggers -EROFS:

ocfs2_get_dentry
 ocfs2_test_inode_bit
  ocfs2_test_suballoc_bit
   ocfs2_read_group_descriptor
    + if reading a released bg, validation fails and triggers -EROFS

How to fix:
Since the BG being read has already been released, we must avoid triggering -EROFS. With this commit, we use ocfs2_read_hint_group_descriptor() to detect the released BG block. This approach quietly handles this type of error and returns -EINVAL, which triggers the caller's existing conversion path to -ESTALE.

[dan.carpenter@linaro.org: fix uninitialized variable]
Link: https://lkml.kernel.org/r/dc37519fd2470909f8c65e26c5131b8b6dde2a5c.1766043917.git.dan.carpenter@linaro.org
Link: https://lkml.kernel.org/r/20251212074505.25962-3-heming.zhao@suse.com
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Su Yue <glass.su@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
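The error flow described above can be modelled in a few lines of userspace C. This is a rough sketch of the control flow only: `test_suballoc_bit()` and `get_dentry_status()` are hypothetical simplifications of the functions in the call stacks, and the `bg_released` and `bit_set` parameters replace the real on-disk lookups.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical model of the bit test: a released block group is reported
 * quietly as -EINVAL instead of going through a validation failure that
 * would end in -EROFS.
 */
static int test_suballoc_bit(int bg_released, int bit_set, int *res)
{
    assert(res != NULL);

    if (bg_released)
        return -EINVAL;

    *res = bit_set;
    return 0;
}

/* Hypothetical model of the caller: map "inode is gone" cases to -ESTALE. */
static int get_dentry_status(int bg_released, int bit_set)
{
    int set = 0;
    int status = test_suballoc_bit(bg_released, bit_set, &set);

    if (status == -EINVAL)
        return -ESTALE;  /* released BG: the file handle is stale */
    if (status < 0)
        return status;   /* any other error is passed through */
    if (!set)
        return -ESTALE;  /* bit cleared by unlink: also stale */

    return 0;
}
```

Both "the block group was released" and "the bit was cleared by unlink" end in -ESTALE, which is one of the results generic/426 accepts.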
2026-01-20ocfs2: give ocfs2 the ability to reclaim suballocator free bgHeming Zhao1-9/+299
Patch series "ocfs2: give ocfs2 the ability to reclaim suballocator free bg", v6.

This patch (of 2):

The current ocfs2 code can't reclaim suballocator block group space. In some cases, this causes ocfs2 to hold onto a lot of space. For example, when creating lots of small files, the space is held/managed by the '//inode_alloc'. After the user deletes all the small files, the space never returns to the '//global_bitmap'. This issue prevents ocfs2 from providing the needed space even when there is enough free space in a small ocfs2 volume.

This patch gives ocfs2 the ability to reclaim suballocator free space when the block group is freed. For performance reasons, this patch keeps the first suballocator block group active.

Link: https://lkml.kernel.org/r/20251212074505.25962-2-heming.zhao@suse.com
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Su Yue <glass.su@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20mm/block/fs: remove laptop_modeJohannes Weiner3-13/+1