path: root/fs
Age | Commit message | Author | Files | Lines
2026-01-20 | ocfs2: give ocfs2 the ability to reclaim suballocator free bg | Heming Zhao | 1 | -9/+299
Patch series "ocfs2: give ocfs2 the ability to reclaim suballocator free bg", v6. This patch (of 2): The current ocfs2 code can't reclaim suballocator block group space. In some cases, this causes ocfs2 to hold onto a lot of space. For example, when creating lots of small files, the space is held/managed by the '//inode_alloc'. After the user deletes all the small files, the space never returns to the '//global_bitmap'. This issue prevents ocfs2 from providing the needed space even when there is enough free space in a small ocfs2 volume. This patch gives ocfs2 the ability to reclaim suballocator free space when the block group is freed. For performance reasons, this patch keeps the first suballocator block group active. Link: https://lkml.kernel.org/r/20251212074505.25962-2-heming.zhao@suse.com Signed-off-by: Heming Zhao <heming.zhao@suse.com> Reviewed-by: Su Yue <glass.su@suse.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 | mm/block/fs: remove laptop_mode | Johannes Weiner | 3 | -13/+1
Laptop mode was introduced to save battery by delaying and consolidating writes, thereby maximizing the time rotating hard drives wouldn't have to spin. Luckily, rotating hard drives, with their high spin-up times and power draw, are a thing of the past for battery-powered devices. Reclaim has also since changed to not write single filesystem pages anymore, and regular filesystem writeback is lumpy by design. The juice doesn't appear worth the squeeze anymore. The footprint of the feature is small, but nevertheless it's a complicating factor in mm, block, and filesystems. Developers don't think about it, and it likely hasn't been tested with new reclaim and writeback changes in years. Let's sunset it. Keep the sysctl with a deprecation warning around for a few more cycles, but remove all functionality behind it. [akpm@linux-foundation.org: fix Documentation/admin-guide/laptops/index.rst] Link: https://lkml.kernel.org/r/20251216185201.GH905277@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 | mm: introduce generic lazy_mmu helpers | Kevin Brodsky | 1 | -2/+2
The implementation of the lazy MMU mode is currently entirely arch-specific; core code directly calls arch helpers: arch_{enter,leave}_lazy_mmu_mode(). We are about to introduce support for nested lazy MMU sections. As things stand we'd have to duplicate that logic in every arch implementing lazy_mmu - adding to a fair amount of logic already duplicated across lazy_mmu implementations. This patch therefore introduces a new generic layer that calls the existing arch_* helpers. Two pair of calls are introduced: * lazy_mmu_mode_enable() ... lazy_mmu_mode_disable() This is the standard case where the mode is enabled for a given block of code by surrounding it with enable() and disable() calls. * lazy_mmu_mode_pause() ... lazy_mmu_mode_resume() This is for situations where the mode is temporarily disabled by first calling pause() and then resume() (e.g. to prevent any batching from occurring in a critical section). The documentation in <linux/pgtable.h> will be updated in a subsequent patch. No functional change should be introduced at this stage. The implementation of enable()/resume() and disable()/pause() is currently identical, but nesting support will change that. Most of the call sites have been updated using the following Coccinelle script: @@ @@ { ... - arch_enter_lazy_mmu_mode(); + lazy_mmu_mode_enable(); ... - arch_leave_lazy_mmu_mode(); + lazy_mmu_mode_disable(); ... } @@ @@ { ... - arch_leave_lazy_mmu_mode(); + lazy_mmu_mode_pause(); ... - arch_enter_lazy_mmu_mode(); + lazy_mmu_mode_resume(); ... } A couple of notes regarding x86: * Xen is currently the only case where explicit handling is required for lazy MMU when context-switching. This is purely an implementation detail and using the generic lazy_mmu_mode_* functions would cause trouble when nesting support is introduced, because the generic functions must be called from the current task. For that reason we still use arch_leave() and arch_enter() there. 
* x86 calls arch_flush_lazy_mmu_mode() unconditionally in a few places, but only defines it if PARAVIRT_XXL is selected, and we are removing the fallback in <linux/pgtable.h>. Add a new fallback definition to <asm/pgtable.h> to keep things building. Link: https://lkml.kernel.org/r/20251215150323.2218608-8-kevin.brodsky@arm.com Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Yeoreum Yun <yeoreum.yun@arm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: David Hildenbrand (Red Hat) <david@kernel.org> Cc: David S. Miller <davem@davemloft.net> Cc: David Woodhouse <dwmw2@infradead.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Juergen Gross <jgross@suse.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20 | xfs: add media verification ioctl | Darrick J. Wong | 6 | -0/+590
Add a new privileged ioctl so that xfs_scrub can ask the kernel to verify the media of the devices backing an xfs filesystem, and have any resulting media errors reported to fsnotify and xfs_healer. To accomplish this, the kernel allocates a folio between the base page size and 1MB, and issues read IOs to a gradually incrementing range of one of the storage devices underlying an xfs filesystem. If any error occurs, that raw error is reported to the calling process. If the error happens to be one of the ones that the kernel considers indicative of data loss, then it will also be reported to xfs_healthmon and fsnotify. Driving the verification from the kernel enables xfs (and by extension xfs_scrub) to have precise control over the size and error handling of IOs that are issued to the underlying block device, and to emit notifications about problems to other relevant kernel subsystems immediately. Note that the caller is also allowed to reduce the size of the IO and to ask for a relaxation period after each IO. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2026-01-20 | xfs: check if an open file is on the health monitored fs | Darrick J. Wong | 2 | -1/+45
Create a new ioctl for the healthmon file that checks that a given fd points to the same filesystem that the healthmon file is monitoring. This allows xfs_healer to check that when it reopens a mountpoint to perform repairs, the file that it gets matches the filesystem that generated the corruption report. (Note that xfs_healer doesn't maintain an open fd to a filesystem that it's monitoring so that it doesn't pin the mount.) Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2026-01-20 | xfs: allow toggling verbose logging on the health monitoring file | Darrick J. Wong | 1 | -0/+44
Make it so that we can reconfigure the health monitoring device by calling the XFS_IOC_HEALTH_MONITOR ioctl on it. As of right now we can only toggle the verbose flag, but this is less annoying than having to close the monitor fd and reopen it. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2026-01-20 | xfs: convey file I/O errors to the health monitor | Darrick J. Wong | 6 | -0/+198
Connect the fserror reporting to the health monitor so that xfs can send events about file I/O errors to the xfs_healer daemon. These events are entirely informational because xfs cannot regenerate user data, so hopefully the fsnotify I/O error event gets noticed by the relevant management systems. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2026-01-20 | xfs: convey externally discovered fsdax media errors to the health monitor | Darrick J. Wong | 6 | -5/+148
Connect the fsdax media failure notification code to the health monitor so that xfs can send events about that to the xfs_healer daemon. Later on we'll add the ability for the xfs_scrub media scan (phase 6) to report the errors that it finds to the kernel so that those are also logged by xfs_healer. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2026-01-20 | xfs: convey filesystem shutdown events to the health monitor | Darrick J. Wong | 5 | -1/+121
Connect the filesystem shutdown code to the health monitor so that xfs can send events about that to the xfs_healer daemon. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2026-01-20 | xfs: convey metadata health events to the health monitor | Darrick J. Wong | 6 | -2/+511
Connect the filesystem metadata health event collection system to the health monitor so that xfs can send events to xfs_healer as it collects information. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2026-01-20 | xfs: convey filesystem unmount events to the health monitor | Darrick J. Wong | 4 | -2/+43
In xfs_healthmon_unmount, send events to xfs_healer so that it knows that nothing further can be done for the filesystem. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2026-01-20 | xfs: create event queuing, formatting, and discovery infrastructure | Darrick J. Wong | 5 | -6/+768
Create the basic infrastructure that we need to report health events to userspace. We need a compact form for recording critical information about an event and queueing them; a means to notice that we've lost some events; and a means to format the events into something that userspace can handle. Make the kernel export C structures via read(). In a previous iteration of this new subsystem, I wanted to explore data exchange formats that are more flexible and easier for humans to read than C structures. The thought being that when we want to rev (or worse, enlarge) the event format, it ought to be trivially easy to do that in a way that doesn't break old userspace. I looked at formats such as protobufs and capnproto. These look really nice in that extending the wire format is fairly easy, you can give it a data schema and it generates the serialization code for you, handles endianness problems, etc. The huge downside is that neither support C all that well. Too hard, and didn't want to port either of those huge sprawling libraries first to the kernel and then again to xfsprogs. Then I thought, how about JSON? Javascript objects are human readable, the kernel can emit json without much fuss (it's all just strings!) and there are plenty of interpreters for python/rust/c/etc. There's a proposed schema format for json, which means that xfs can publish a description of the events that kernel will emit. Userspace consumers (e.g. xfsprogs/xfs_healer) can embed the same schema document and use it to validate the incoming events from the kernel, which means it can discard events that it doesn't understand, or garbage being emitted due to bugs. However, json has a huge crutch -- javascript is well known for its vague definitions of what are numbers. This makes expressing a large number rather fraught, because the runtime is free to represent a number in nearly any way it wants. 
Stupider ones will truncate values to word size, others will roll out doubles for uint52_t (yes, fifty-two) with the resulting loss of precision. Not good when you're dealing with discrete units. It just so happens that python's json library is smart enough to see a sequence of digits and put them in a u64 (at least on x86_64/aarch64) but an actual javascript interpreter (pasting into Firefox) isn't necessarily so clever. It turns out that none of the proposed json schemas were ever ratified even in an open-consensus way, so json blobs are still just loosely structured blobs. The parsing in userspace was also noticeably slow and memory-consumptive. Hence only the C interface survives. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
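The precision concern above can be checked directly. A minimal sketch, assuming the JSON consumer stores every number as an IEEE 754 double (53-bit significand, 52 bits stored); `u64_survives_double` is a hypothetical helper name and is not part of xfs or xfsprogs:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if a u64 survives a round trip through an IEEE 754 double,
 * which is how many JSON parsers store every number.
 * Assumption: double is IEEE 754 binary64 on this platform.
 */
bool u64_survives_double(uint64_t v)
{
	double d = (double)v;

	/* A double at or above 2^64 cannot be converted back to uint64_t. */
	if (d >= 18446744073709551616.0)
		return false;
	return (uint64_t)d == v;
}
```

Every integer up to 2^53 round-trips, while 2^53 + 1 does not. A u64 byte offset or block count can exceed that bound, which is the case the message worries about.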
2026-01-20 | xfs: start creating infrastructure for health monitoring | Darrick J. Wong | 8 | -0/+317
Start creating helper functions and infrastructure to pass filesystem health events to a health monitoring file. Since this is an administrative interface, we only support a single health monitor process per filesystem, so we don't need to use anything fancy such as notifier chains (== tons of indirect calls). Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2026-01-20 | Revert "f2fs: add timeout in f2fs_enable_checkpoint()" | Jaegeuk Kim | 2 | -13/+4
This reverts commit 4bc347779698b5e67e1514bab105c2c083e55502. Let's apply a better approach that flushes only the dirty pages committed by the user, to avoid the delay caused by unnecessary incoming ones. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-20 | NFS: make nfs_mark_return_unreferenced_delegations less aggressive | Christoph Hellwig | 1 | -7/+17
Currently nfs_mark_return_unreferenced_delegations marks all open but not referenced delegations (i.e., those that were found by a previous pass) as return on close, which means that we'll return them on close without a way out. Replace this with only iterating delegations that are on the LRU list, and skip delegations that are in use by an open file. Delegations that were never referenced while open are still prime candidates for return from the LRU if the number of delegations is over the watermark, or otherwise will be returned by the next nfs_mark_return_unreferenced_delegations pass after they are closed. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2026-01-20 | NFS: return delegations from the end of a LRU when over the watermark | Christoph Hellwig | 2 | -3/+59
Directly returning delegations on close when over the watermark is rather suboptimal as these delegations are much more likely to be reused than those that have been unused for a long time. Switch to returning unused delegations from a new LRU list when we are above the threshold and there are reclaimable delegations instead. Pass over referenced delegations during the first pass to give delegations that aren't in active use but are frequently used for stat() or similar another chance to not be instantly reclaimed. This scheme works the same as the referenced flags in the VFS inode and dentry caches. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
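The "referenced" second-chance scheme described above can be sketched in plain C. This is only an illustration under simplifying assumptions: the LRU is an array ordered oldest first, and `struct lru_entry` and `lru_reclaim_pass` are hypothetical names, not the NFS client's data structures:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a delegation sitting on an LRU list. */
struct lru_entry {
	int id;
	bool referenced;	/* set when the entry is used, e.g. by stat() */
	bool reclaimed;
};

/*
 * Walk entries from oldest (index 0) to newest and reclaim up to
 * nr_to_reclaim of them.  A referenced entry is not reclaimed; its flag
 * is cleared so that it becomes a candidate on the next pass.
 * Returns the number of entries reclaimed.
 */
size_t lru_reclaim_pass(struct lru_entry *lru, size_t n, size_t nr_to_reclaim)
{
	size_t reclaimed = 0;

	for (size_t i = 0; i < n && reclaimed < nr_to_reclaim; i++) {
		if (lru[i].reclaimed)
			continue;
		if (lru[i].referenced) {
			lru[i].referenced = false;	/* second chance */
			continue;
		}
		lru[i].reclaimed = true;
		reclaimed++;
	}
	return reclaimed;
}
```

An entry that is touched between two passes has its flag set again and survives; one that stays idle is reclaimed on the following pass.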
2026-01-20 | NFS: add a separate delegation return list | Christoph Hellwig | 4 | -84/+81
Searching for returnable delegations in the per-server delegations list can be very expensive. While commit e04bbf6b1bbe ("NFS: Avoid quadratic search when freeing delegations.") reduced the overhead a bit, the fact that all the non-returnable delegations have to be searched limits the amount of optimizations that can be done. Fix this by introducing a separate list that only contains delegations scheduled for return. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2026-01-20 | NFS: reformat nfs_mark_delegation_revoked | Christoph Hellwig | 1 | -6/+7
Remove a level of indentation for the main code path. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2026-01-20 | NFS: use a local RCU critical section in nfs_start_delegation_return | Christoph Hellwig | 1 | -6/+5
Nested RCU critical sections are fine and very cheap. Have a local one in nfs_start_delegation_return so that the function is self-contained and to prepare for simplifying the callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2026-01-20 | NFS: use refcount_inc_not_zero nfs_start_delegation_return | Christoph Hellwig | 1 | -12/+10
Using the unconditional reference increment means we can take a reference to a delegation already in the RCU grace period, which could cause a use after free under very unlikely conditions. Switch to use refcount_inc_not_zero instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2026-01-20 | NFS: don't consume a delegation reference in nfs_end_delegation_return | Christoph Hellwig | 1 | -22/+24
All callers now hold references to the delegation as part of the lookup, removing the need for an extra reference for those that are actually returned which is then dropped in nfs_end_delegation_return. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2026-01-20 | NFS: take a delegation reference in nfs4_get_valid_delegation | Christoph Hellwig | 4 | -46/+56
Currently most work on struct nfs_delegation happens directly under RCU protection. This is generally fine, although long RCU sections are not good for performance. But for operations that later take a reference to the delegation to perform blocking work, refcount_inc is used, which can race against dropping the last reference and thus lead to use after frees in extremely rare cases. Fix this by taking a reference in nfs4_get_valid_delegation using refcount_inc_not_zero so that the callers have a stabilized reference they can work with and can be moved outside the RCU critical section. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
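The difference between an unconditional increment and an increment-if-not-zero can be shown with a userspace sketch built on C11 atomics. `struct ref`, `ref_get_not_zero` and `ref_put` are hypothetical stand-ins written for this illustration; they are not the kernel's refcount_t implementation:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for the kernel's refcount_t; not the kernel type. */
struct ref {
	atomic_uint count;
};

/*
 * Take a reference only if the object is still alive.  Returns false when
 * the count has already dropped to zero, i.e. the object is on its way to
 * being freed and must not be touched.
 */
bool ref_get_not_zero(struct ref *r)
{
	unsigned int old = atomic_load(&r->count);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if this was the last one. */
bool ref_put(struct ref *r)
{
	return atomic_fetch_sub(&r->count, 1) == 1;
}
```

A lookup running concurrently with the final put either gets a valid reference or sees the failure and skips the object. A plain increment would instead revive a count of zero while the object is waiting to be freed.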
2026-01-20 | NFS: simplify the detached delegation check in update_open_stateid | Christoph Hellwig | 1 | -2/+1
When nfs_detach_delegation_locked detaches a delegation from an inode, it clears both nfsi->delegation and delegation->inode. Use the latter in update_open_stateid to check for a detached inode, as that avoids an extra local variable and removes the need for an RCU dereference, since we already hold the lock in the delegation. This prepares for removing the surrounding RCU critical section. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2026-01-20 | NFS: move the deleg_cur check out of nfs_detach_delegation_locked | Christoph Hellwig | 1 | -8/+7
nfs_inode_set_delegation, as the only direct caller of nfs_detach_delegation_locked, already checks this under cl_lock, so don't repeat it. Replace the lockdep coverage for the lock that was implicitly provided by the removed rcu_dereference_protected call with an explicit lockdep assert to keep the coverage. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2026-01-20 | NFS: return bool from nfs_detach_delegation{,_locked} | Christoph Hellwig | 1 | -13/+14
nfs_detach_delegation always returns either the passed in delegation or NULL, simplify this to a bool return. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2026-01-20 | NFS: move delegation lookup into can_open_delegated | Christoph Hellwig | 1 | -32/+33
Keep the delegation handling in a single place, and just return the stateid in an optional argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2026-01-20 | NFS: rewrite nfs_delegations_present in terms of nr_active_delegations | Christoph Hellwig | 1 | -1/+1
Renewal only cares for active delegations and not revoked ones. Replace the list empty check with reading the active delegation counter to implement this. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2026-01-20 | NFS: remove nfs_free_delegation | Christoph Hellwig | 1 | -11/+8
Open code nfs_free_delegation in the callers, because having a "free" function that wraps a revoke and put operation is a bit confusing, especially when the __free version does the actual freeing triggered by the last put. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2026-01-20 | NFS: open code nfs_delegation_need_return | Christoph Hellwig | 1 | -17/+7
There is only a single caller, and the function can be condensed into a single if statement, making it more clear what is being tested there. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2026-01-20 | NFS: remove NFS_DELEGATION_INODE_FREEING | Christoph Hellwig | 3 | -13/+2
This essentially reverts commit 6f9449be53f3 ("NFS: Fix a soft lockup in the delegation recovery code") because the code walking the per-server delegation list has been fixed to just skip inodes for which nfs_delegation_grab_inode fails, instead of having to restart the entire series in commit f92214e4c312 ("NFS: Avoid unnecessary rescanning of the per-server delegation list"). Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2026-01-20 | NFS: drop the _locked postfix from nfs_start_delegation_return | Christoph Hellwig | 1 | -6/+6
Now that nfs_start_delegation_return is gone, and we have RCU locking asserts, drop the extra postfix. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2026-01-20 | NFS: assert rcu_read_lock is held in nfs_start_delegation_return_locked | Christoph Hellwig | 1 | -4/+7
And clean up the dereference of the delegation a bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2026-01-20 | NFS: remove nfs_start_delegation_return | Christoph Hellwig | 1 | -20/+12
There is only one caller, so fold it into that. With that, nfs_start_delegation_return Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2026-01-20 | NFS: remove nfs_inode_detach_delegation | Christoph Hellwig | 1 | -22/+15
Fold it into the only caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2026-01-20 | NFS: remove the NULL inode check in nfs4_inode_return_delegation_on_close | Christoph Hellwig | 1 | -2/+0
The only caller dereferences a field in the inode just before calling nfs4_inode_return_delegation_on_close. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2026-01-20 | NFS: remove nfs_client_mark_return_all_delegations | Christoph Hellwig | 1 | -11/+7
Fold nfs_client_mark_return_all_delegations into nfs_expire_all_delegations, which is the only caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2026-01-20 | NFS: remove nfs_client_mark_return_unused_delegation_types | Christoph Hellwig | 1 | -14/+3
nfs_client_mark_return_unused_delegation_types is only called by nfs_expire_unused_delegation_types, so merge the two. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2026-01-20 | NFS: remove __nfs_client_for_each_server | Christoph Hellwig | 1 | -11/+3
__nfs_client_for_each_server is only called by nfs_client_for_each_server, so merge the two. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2026-01-20 | fs/dlm/dir: remove unused variable count_match | Alex Shi | 1 | -3/+3
The variable was never used after it was introduced. Better to comment it out if we want to keep the info. fs/dlm/dir.c:65:26: error: variable 'count_match' set but not used [-Werror,-Wunused-but-set-variable] 65 | unsigned int count = 0, count_match = 0, count_bad = 0, count_add = 0; | ^ 1 error generated. Signed-off-by: Alex Shi <alexs@kernel.org> Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2026-01-20 | dlm: Constify struct configfs_item_operations and configfs_group_operations | Christophe JAILLET | 1 | -8/+8
'struct configfs_item_operations' and 'configfs_group_operations' are not modified in this driver. Constifying these structures moves some data to a read-only section, so increases overall security, especially when the structure holds some function pointers. On a x86_64, with allmodconfig, as an example:
Before:
======
   text    data     bss     dec     hex filename
  29436   12952     384   42772    a714 fs/dlm/config.o
After:
=====
   text    data     bss     dec     hex filename
  30076   12312     384   42772    a714 fs/dlm/config.o
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2026-01-20 | fs/dlm: use list_add_tail() instead of open-coding list insertion | Shaurya Rane | 1 | -5/+1
Replace the manual list pointer manipulation in add_ordered_member() with the standard list_add_tail() helper. The original code explicitly updated ->prev and ->next pointers to insert @newlist before @tmp, which is exactly what list_add_tail(newlist, tmp) provides. Using the list macro improves readability, removes a source of potential pointer bugs, and satisfies the existing FIXME requesting conversion to the list helpers. No functional change in the ordering logic for DLM members. Signed-off-by: Shaurya Rane <ssrane_b23@ee.vjti.ac.in> Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
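Why list_add_tail(newlist, tmp) inserts before tmp can be seen in a small userspace sketch. The code below is a minimal reimplementation written for illustration (with `init_list_head` as a hypothetical helper name); it is not the kernel's <linux/list.h>:

```c
/* Minimal userspace copy of a circular doubly linked list node. */
struct list_head {
	struct list_head *next, *prev;
};

void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/*
 * Insert new_node immediately before pos.  When pos is the list head this
 * appends at the tail, which is where the helper gets its name; when pos
 * is an arbitrary member it is an "insert before" operation.
 */
void list_add_tail(struct list_head *new_node, struct list_head *pos)
{
	struct list_head *prev = pos->prev;

	new_node->next = pos;
	new_node->prev = prev;
	prev->next = new_node;
	pos->prev = new_node;
}
```

The four pointer assignments are the same updates that an open-coded insertion has to perform by hand, which is why the conversion does not change the resulting order.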
2026-01-20 | dlm: validate length in dlm_search_rsb_tree | Ezrak1e | 1 | -1/+2
The len parameter in dlm_dump_rsb_name() is not validated and comes from network messages. When it exceeds DLM_RESNAME_MAXLEN, it can cause out-of-bounds write in dlm_search_rsb_tree(). Add length validation to prevent potential buffer overflow. Signed-off-by: Ezrak1e <ezrakiez@gmail.com> Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
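The general pattern of checking an untrusted length before using it as a copy size can be sketched as follows. This is an assumption-laden illustration: `copy_resname`, the `RESNAME_MAXLEN` value and the -EINVAL return are hypothetical and do not reproduce the actual DLM change:

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Illustrative limit standing in for DLM_RESNAME_MAXLEN. */
#define RESNAME_MAXLEN 64

/*
 * Copy a resource name received from the network into a fixed-size buffer.
 * The length is untrusted, so it is checked before it is used as a copy
 * size.  Returns 0 on success or -EINVAL if the name is too long.
 */
int copy_resname(char dst[RESNAME_MAXLEN], const void *name, size_t len)
{
	if (len > RESNAME_MAXLEN)
		return -EINVAL;

	memset(dst, 0, RESNAME_MAXLEN);
	memcpy(dst, name, len);
	return 0;
}
```

Rejecting the request outright, rather than truncating the name, avoids silently looking up a different resource than the one the sender named.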
2026-01-20 | dlm: fix recovery pending middle conversion | Alexander Aring | 1 | -18/+1
During a workload involving conversions between lock modes PR and CW, lock recovery can create a "conversion deadlock" state between locks that have been recovered. When this occurs, kernel warning messages are logged, e.g. "dlm: WARN: pending deadlock 1e node 0 2 1bf21" "dlm: receive_rcom_lock_args 2e middle convert gr 3 rq 2 remote 2 1e" After this occurs, the deadlocked conversions both appear on the convert queue of the resource being locked, and the conversion requests do not complete. Outside of recovery, conversions that would produce a deadlock are resolved immediately, and return -EDEADLK. The locks are not placed on the convert queue in the deadlocked state. To fix this problem, an lkb under conversion between PR/CW is rebuilt during recovery on a new master's granted queue, with the currently granted mode, rather than being rebuilt on the new master's convert queue, with the currently granted mode and the newly requested mode. The in-progress convert is then resent to the new master after recovery, so the conversion deadlock will be processed outside of the recovery context and handled as described above. Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>
2026-01-20 | btrfs: add extra device item checks at mount | Qu Wenruo | 3 | -0/+48
[BUG] There is a bug report where after a dev-replace, the replace source device with devid 4 is properly erased (dump tree shows it's the old devid 4), but the target device is still using devid 0. When the user tries to mount the fs degraded, the mount fails with the following errors: BTRFS: device fsid 84a1ed4a-365c-45c3-a9ee-a7df525dc3c9 devid 5 transid 1394395 /dev/sda (8:0) scanned by btrfs (261) BTRFS: device fsid 84a1ed4a-365c-45c3-a9ee-a7df525dc3c9 devid 6 transid 1394395 /dev/sde (8:64) scanned by btrfs (261) BTRFS: device fsid 84a1ed4a-365c-45c3-a9ee-a7df525dc3c9 devid 0 transid 1394395 /dev/sdd (8:48) scanned by btrfs (261) BTRFS: device fsid 84a1ed4a-365c-45c3-a9ee-a7df525dc3c9 devid 3 transid 1394395 /dev/sdf (8:80) scanned by btrfs (261) BTRFS info (device sdd): first mount of filesystem 84a1ed4a-365c-45c3-a9ee-a7df525dc3c9 BTRFS info (device sdd): using crc32c (crc32c-intel) checksum algorithm BTRFS warning (device sdd): devid 4 uuid 01e2081c-9c2a-4071-b9f4-e1b27e571ff5 is missing BTRFS info (device sdd): bdev <missing disk> errs: wr 84994544, rd 15567, flush 65872, corrupt 0, gen 0 BTRFS info (device sdd): bdev /dev/sdd errs: wr 71489901, rd 0, flush 30001, corrupt 0, gen 0 BTRFS error (device sdd): replace without active item, run 'device scan --forget' on the target device BTRFS error (device sdd): failed to init dev_replace: -117 BTRFS error (device sdd): open_ctree failed: -117 [CAUSE] That devid 0 didn't get its devid updated is a separate problem; here I'm only focusing on the mount failure itself. The mount failure is not caused by the missing device, as the fs has RAID1C3 for metadata and RAID10 for data, and is thus completely able to tolerate one missing device. 
The device tree shows the dev-replace has properly finished: item 7 key (0 DEV_REPLACE 0) itemoff 15931 itemsize 72 src devid -1 cursor left 11091821199360 cursor right 11091821199360 mode ALWAYS state FINISHED write errors 0 uncorrectable read errors 0 ^^^^^^^^ And the chunk tree shows there is no devid 0: leaf 37980736602112 items 23 free space 12548 generation 1394388 owner CHUNK_TREE leaf 37980736602112 flags 0x1(WRITTEN) backref revision 1 fs uuid 84a1ed4a-365c-45c3-a9ee-a7df525dc3c9 chunk uuid d074c661-6311-4570-b59f-a5c83fd37f8e item 0 key (DEV_ITEMS DEV_ITEM 3) itemoff 16185 itemsize 98 devid 3 total_bytes 20000588955648 bytes_used 8282877984768 io_align 4096 io_width 4096 sector_size 4096 type 0 generation 0 start_offset 0 dev_group 0 seek_speed 0 bandwidth 0 uuid 0d596b69-fb0d-4031-b4af-a301d0868b8b fsid 84a1ed4a-365c-45c3-a9ee-a7df525dc3c9 ... Which shows the first device is devid 3. But there is indeed /dev/sdd with devid 0: superblock: bytenr=65536, device=/dev/sdd --------------------------------------------------------- csum_type 0 (crc32c) csum_size 4 csum 0xd4bed87e [match] bytenr 65536 flags 0x1 ( WRITTEN ) magic _BHRfS_M [match] fsid 84a1ed4a-365c-45c3-a9ee-a7df525dc3c9 ... uuid_tree_generation 1394388 dev_item.uuid ee6532ad-5442-45f7-87fb-7703e29ed934 dev_item.fsid 84a1ed4a-365c-45c3-a9ee-a7df525dc3c9 [match] dev_item.type 0 dev_item.total_bytes 20000588955648 dev_item.bytes_used 8292541661184 dev_item.io_align 0 dev_item.io_width 0 dev_item.sector_size 0 dev_item.devid 0 <<< So this means device scan will register sdd as devid 0 into the fs, then during btrfs_init_dev_replace(), we located the replace progress item, found the previous replace is finished, but we still need to check if the dev-replace target device (devid 0) exists. If that device exists, we error out showing that error message. But to be honest the end user may not really remember which device is the replace target device, thus not sure what to do in the next step. 
[ENHANCEMENT]
To make the error more obvious, and tell the end user which devices should be unregistered:

- Introduce BTRFS_DEV_STATE_ITEM_FOUND flag
  During device item read from the chunk tree, set the flag for each found device item.

- Verify there is no device without the above flag during mount
  Even a missing device should have that flag set. If we find a device without that flag set, it means it's an unexpected one and should be rejected.

- More detailed error message on what to do next
  This will show all unexpected devices and tell the end user to use 'btrfs dev scan --forget' to forget them or remove them before mount.

Here is an example dmesg where a device of a valid filesystem is modified to have devid 0, followed by a degraded mount attempt:

  BTRFS info (device dm-6): first mount of filesystem 7c873869-844c-4b39-bd75-a96148bf4656
  BTRFS info (device dm-6): using crc32c checksum algorithm
  BTRFS warning (device dm-6): devid 3 uuid b4a9f35b-db42-4ac4-b55a-cbf81d3b9683 is missing
  BTRFS error (device dm-6): devid 0 path /dev/mapper/test-scratch3 is registered but not found in chunk tree
  BTRFS error (device dm-6): please remove above devices or use 'btrfs device scan --forget <dev>' to unregister them before mount
  BTRFS error (device dm-6): open_ctree failed: -117

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
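The flag-and-verify scheme described in the [ENHANCEMENT] section can be sketched in user space as follows. This is only a model under stated assumptions, not the kernel change itself: struct model_device, mark_found_items() and count_unexpected() are hypothetical names, the boolean item_found merely stands in for BTRFS_DEV_STATE_ITEM_FOUND, and the printed message imitates the example dmesg line.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical user-space model; this is not the kernel's struct btrfs_device. */
struct model_device {
	unsigned long long devid;
	const char *path;	/* NULL models a missing device */
	bool item_found;	/* stands in for BTRFS_DEV_STATE_ITEM_FOUND */
};

/* Step 1: flag every registered device that has a dev item in the chunk tree. */
static void mark_found_items(struct model_device *devs, size_t ndevs,
			     const unsigned long long *item_devids, size_t nitems)
{
	assert(devs != NULL || ndevs == 0);
	for (size_t i = 0; i < ndevs; i++)
		for (size_t j = 0; j < nitems; j++)
			if (devs[i].devid == item_devids[j])
				devs[i].item_found = true;
}

/* Step 2: report and count devices that were registered but never flagged. */
static size_t count_unexpected(const struct model_device *devs, size_t ndevs)
{
	size_t unexpected = 0;

	for (size_t i = 0; i < ndevs; i++) {
		if (devs[i].item_found)
			continue;
		fprintf(stderr,
			"devid %llu path %s is registered but not found in chunk tree\n",
			devs[i].devid,
			devs[i].path ? devs[i].path : "<missing disk>");
		unexpected++;
	}
	return unexpected;
}
```

In this model a mount would be rejected whenever count_unexpected() returns a non-zero value. A missing device still passes the check because its dev item exists in the chunk tree, so it gets flagged in step 1 like any other.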
2026-01-20  btrfs: fix missing fields in superblock backup with BLOCK_GROUP_TREE  Mark Harmstone  1  -1/+1
When the BLOCK_GROUP_TREE compat_ro flag is set, the extent root and csum root fields are getting missed. This is because EXTENT_TREE_V2 treated these differently, and when they were split off this special-casing was mistakenly assigned to BGT rather than the rump EXTENT_TREE_V2. There's no reason why the existence of the block group tree should mean that we don't record the details of the last commit's extent root and csum root. Fix the code in backup_super_roots() so that the correct check gets made. Fixes: 1c56ab991903 ("btrfs: separate BLOCK_GROUP_TREE compat RO flag from EXTENT_TREE_V2") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
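The difference between the mistaken check and the intended one can be illustrated with a small model. This is a sketch based only on the description above: struct model_features and both helper functions are hypothetical, and the real condition in backup_super_roots() is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two feature flags discussed above. */
struct model_features {
	bool block_group_tree;	/* BLOCK_GROUP_TREE compat_ro flag */
	bool extent_tree_v2;	/* EXTENT_TREE_V2 */
};

/* Mistaken behaviour: extent root and csum root are skipped whenever BGT is set. */
static bool records_roots_before_fix(const struct model_features *f)
{
	assert(f != NULL);
	return !f->block_group_tree;
}

/* Intended behaviour: only EXTENT_TREE_V2 gets the special-casing. */
static bool records_roots_after_fix(const struct model_features *f)
{
	assert(f != NULL);
	return !f->extent_tree_v2;
}
```

For a filesystem with BLOCK_GROUP_TREE set and EXTENT_TREE_V2 clear, the first helper returns false (the fields are missed) while the second returns true.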
2026-01-20  btrfs: reject new transactions if the fs is fully read-only  Qu Wenruo  2  -0/+21
[BUG]
There is a bug report where a heavily fuzzed fs is mounted with all rescue mount options, which leads to the following warnings during unmount:

  BTRFS: Transaction aborted (error -22)
  Modules linked in:
  CPU: 0 UID: 0 PID: 9758 Comm: repro.out Not tainted 6.19.0-rc5-00002-gb71e635feefc #7 PREEMPT(full)
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
  RIP: 0010:find_free_extent_update_loop fs/btrfs/extent-tree.c:4208 [inline]
  RIP: 0010:find_free_extent+0x52f0/0x5d20 fs/btrfs/extent-tree.c:4611
  Call Trace:
   <TASK>
   btrfs_reserve_extent+0x2cd/0x790 fs/btrfs/extent-tree.c:4705
   btrfs_alloc_tree_block+0x1e1/0x10e0 fs/btrfs/extent-tree.c:5157
   btrfs_force_cow_block+0x578/0x2410 fs/btrfs/ctree.c:517
   btrfs_cow_block+0x3c4/0xa80 fs/btrfs/ctree.c:708
   btrfs_search_slot+0xcad/0x2b50 fs/btrfs/ctree.c:2130
   btrfs_truncate_inode_items+0x45d/0x2350 fs/btrfs/inode-item.c:499
   btrfs_evict_inode+0x923/0xe70 fs/btrfs/inode.c:5628
   evict+0x5f4/0xae0 fs/inode.c:837
   __dentry_kill+0x209/0x660 fs/dcache.c:670
   finish_dput+0xc9/0x480 fs/dcache.c:879
   shrink_dcache_for_umount+0xa0/0x170 fs/dcache.c:1661
   generic_shutdown_super+0x67/0x2c0 fs/super.c:621
   kill_anon_super+0x3b/0x70 fs/super.c:1289
   btrfs_kill_super+0x41/0x50 fs/btrfs/super.c:2127
   deactivate_locked_super+0xbc/0x130 fs/super.c:474
   cleanup_mnt+0x425/0x4c0 fs/namespace.c:1318
   task_work_run+0x1d4/0x260 kernel/task_work.c:233
   exit_task_work include/linux/task_work.h:40 [inline]
   do_exit+0x694/0x22f0 kernel/exit.c:971
   do_group_exit+0x21c/0x2d0 kernel/exit.c:1112
   __do_sys_exit_group kernel/exit.c:1123 [inline]
   __se_sys_exit_group kernel/exit.c:1121 [inline]
   __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1121
   x64_sys_call+0x2210/0x2210 arch/x86/include/generated/asm/syscalls_64.h:232
   do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
   do_syscall_64+0xe8/0xf80 arch/x86/entry/syscall_64.c:94
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
  RIP: 0033:0x44f639
  Code: Unable to access opcode bytes at 0x44f60f.
  RSP: 002b:00007ffc15c4e088 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
  RAX: ffffffffffffffda RBX: 00000000004c32f0 RCX: 000000000044f639
  RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000001
  RBP: 0000000000000001 R08: ffffffffffffffc0 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004c32f0
  R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000001
   </TASK>

Since rescue mount options will mark the full fs read-only, there should be no new transaction triggered.

But during unmount we will evict all inodes, which can trigger a new transaction, and this triggers warnings on a heavily corrupted fs.

[CAUSE]
Btrfs allows new transactions even on a read-only fs; this is to allow log replay to happen even on read-only mounts, just like ext4/xfs do.

However, with rescue mount options the fs is fully read-only and cannot be remounted read-write, so in that case we should also reject any new transactions.

[FIX]
If we find the fs has rescue mount options, we should treat the fs as being in an error state, so that no new transaction can be started.

Reported-by: Jiaming Zhang <r772577952@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CANypQFYw8Nt8stgbhoycFojOoUmt+BoZ-z8WJOZVxcogDdwm=Q@mail.gmail.com/
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
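The idea in the [FIX] section can be modelled in a few lines of user-space C. This is a sketch under assumptions, not the patch itself: struct model_fs, model_mount_setup() and model_start_transaction() are hypothetical, and returning -EROFS for a filesystem in the error state is an assumption about how the rejection surfaces.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the relevant filesystem state. */
struct model_fs {
	bool read_only;		/* ordinary read-only mount */
	bool rescue_opts;	/* mounted with rescue mount options */
	bool error_state;	/* fs is treated as being in an error state */
};

/* At mount time, a fully read-only (rescue) fs is put into the error state. */
static void model_mount_setup(struct model_fs *fs)
{
	assert(fs != NULL);
	if (fs->rescue_opts)
		fs->error_state = true;
}

/*
 * A plain read-only mount may still start a transaction (needed for log
 * replay), but an fs in the error state may not.
 */
static int model_start_transaction(const struct model_fs *fs)
{
	assert(fs != NULL);
	if (fs->error_state)
		return -EROFS;
	return 0;
}
```

With this model, the inode eviction during unmount of a rescue-mounted fs gets an error back from model_start_transaction() instead of attempting to allocate tree blocks on a corrupted fs.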
2026-01-20  btrfs: sync read disk super and set block size  Edward Adam Davis  1  -0/+2
When the user performs a btrfs mount, the block device is not set correctly. The user sets the block size of the block device to 0x4000 by executing the BLKBSZSET command. Since the block size change also changes the mapping->flags value, this further affects the result of the mapping_min_folio_order() calculation.

Let's analyze the following two scenarios:

Scenario 1: Without executing the BLKBSZSET command, the block size is 0x1000, and mapping_min_folio_order() returns 0;
Scenario 2: After executing the BLKBSZSET command, the block size is 0x4000, and mapping_min_folio_order() returns 2.

do_read_cache_folio() allocates a folio before the BLKBSZSET command is executed. This results in the allocated folio having an order value of 0. Later, after BLKBSZSET is executed, the block size increases to 0x4000, and the mapping_min_folio_order() calculation result becomes 2.

This leads to two undesirable consequences:

1. filemap_add_folio() triggers a VM_BUG_ON_FOLIO(folio_order(folio) < mapping_min_folio_order(mapping)) assertion.
2. The syzbot report [1] shows a null pointer dereference in create_empty_buffers() due to a buffer head allocation failure.

Synchronization should be established based on the inode between the BLKBSZSET command and read cache page to prevent inconsistencies in block size or mapping flags before and after folio allocation.
[1]
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
RIP: 0010:create_empty_buffers+0x4d/0x480 fs/buffer.c:1694
Call Trace:
 folio_create_buffers+0x109/0x150 fs/buffer.c:1802
 block_read_full_folio+0x14c/0x850 fs/buffer.c:2403
 filemap_read_folio+0xc8/0x2a0 mm/filemap.c:2496
 do_read_cache_folio+0x266/0x5c0 mm/filemap.c:4096
 do_read_cache_page mm/filemap.c:4162 [inline]
 read_cache_page_gfp+0x29/0x120 mm/filemap.c:4195
 btrfs_read_disk_super+0x192/0x500 fs/btrfs/volumes.c:1367

Reported-by: syzbot+b4a2af3000eaa84d95d5@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=b4a2af3000eaa84d95d5
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
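The arithmetic behind the two scenarios in this commit can be checked with a small user-space model. It is a sketch, not kernel code: model_min_folio_order() and model_folio_order_ok() are hypothetical helpers, and the 4 KiB page size is an assumption (it is the page size for which block sizes 0x1000 and 0x4000 give orders 0 and 2).

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: 4 KiB pages, consistent with the orders 0 and 2 quoted above. */
#define MODEL_PAGE_SIZE 4096UL

/* Smallest order such that (MODEL_PAGE_SIZE << order) >= block_size. */
static unsigned int model_min_folio_order(unsigned long block_size)
{
	unsigned int order = 0;

	assert(block_size != 0);
	while ((MODEL_PAGE_SIZE << order) < block_size)
		order++;
	return order;
}

/* Negation of the condition that the quoted VM_BUG_ON_FOLIO() fires on. */
static bool model_folio_order_ok(unsigned int folio_order,
				 unsigned long block_size)
{
	return folio_order >= model_min_folio_order(block_size);
}
```

An order-0 folio allocated while the block size is 0x1000 satisfies the check, but the same folio fails it once the block size has been raised to 0x4000, which is the race the commit closes by synchronizing the two paths.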
2026-01-20  fs/namei: fix kernel-doc markup for dentry_create  Jay Winston  1  -2/+2
O_ is interpreted as a broken hyperlink target. Escape _ with a backslash. The asterisk in "struct file *" is interpreted as an opening emphasis string that never closes. Replace double quotes with rST backticks. Change "a ERR_PTR" to "an ERR_PTR". Signed-off-by: Jay Winston <jaybenjaminwinston@gmail.com> Link: https://patch.msgid.link/20260118110401.2651-1-jaybenjaminwinston@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-01-19  ext4: use optimized mballoc scanning regardless of inode format  Jan Kara  1  -2/+0
Currently we don't use mballoc optimized scanning (using max free extent order and avg free extent order group lists) for inodes with indirect block based format. This is confusing for users and I don't see a good reason for that. Even with indirect block based inode format we can spend a large amount of time searching for free blocks on large filesystems with fragmented free space. To add to the confusion, before commit 077d0c2c78df ("ext4: make mb_optimize_scan performance mount option work with extents") optimized scanning was applied *only* to indirect block based inodes, so that commit appears as a performance regression to some users. Just use optimized scanning whenever it is enabled by mount options.

Reviewed-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org
Link: https://patch.msgid.link/20260114182836.14120-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2026-01-19  ext4: always allocate blocks only from groups inode can use  Jan Kara  1