Age | Commit message | Author | Files | Lines
8 days | btrfs: tracepoints: trace transaction states during commit phase | Filipe Manana | 2 | -4/+21
Currently the trace event is fired only when a transaction is fully complete (its state is TRANS_STATE_COMPLETED). However during a transaction commit we go through several states and as soon as the state reaches TRANS_STATE_UNBLOCKED, another transaction can start. Therefore it's useful to track every transaction state change during the commit of a transaction, so that we can see if a new transaction is started before the current one is completed. Add the transaction state to the transaction commit event and call the event every time we change the transaction state during commit. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 days | btrfs: tracepoints: add trace event for the start of a new transaction | Filipe Manana | 2 | -0/+19
While tracing it's useful to know not just when a transaction is committed or aborted, but also when a new one is started. So add a trace event for transaction starts. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 days | btrfs: tracepoints: add trace event for transaction aborts | Filipe Manana | 2 | -0/+23
While tracing it's useful to know not just when a transaction is committed but also when one is aborted. So add a trace event for transaction aborts. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 days | btrfs: tracepoints: add in_fsync field to transaction commit event | Filipe Manana | 1 | -1/+4
Include the in_fsync value from the transaction handle so that we can know if a transaction commit was triggered by a fsync call. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 days | btrfs: tracepoints: pass a transaction handle to transaction commit event | Filipe Manana | 2 | -6/+7
The transaction commit tracepoint prints fs_info->generation as if it were the ID of the committed transaction but this does not always match that ID. This is because the trace point is called in the transaction commit path after the transaction is in the TRANS_STATE_COMPLETED state, which means another transaction may have already started (which can happen as soon as the transaction state was set to TRANS_STATE_UNBLOCKED), in which case fs_info->generation was incremented and does not correspond to the committed transaction anymore. So fix this by passing a transaction handle to the trace event instead of fs_info. This will also allow later for the trace event to dump other useful information about the transaction. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
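The mismatch described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: the types and helpers (`model_fs_info`, `model_trans`, `model_start_transaction()` and the two `traced_id_*()` functions) are hypothetical stand-ins and are not the kernel's structures or tracepoint code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace model, not the kernel structures. */
struct model_fs_info {
	uint64_t generation;	/* bumped whenever a new transaction starts */
};

struct model_trans {
	uint64_t transid;	/* fixed for the lifetime of the transaction */
};

/* Starting a transaction bumps the global generation and records it. */
static struct model_trans model_start_transaction(struct model_fs_info *fs)
{
	struct model_trans t = { .transid = ++fs->generation };

	return t;
}

/* What a trace event that only sees fs_info can report. */
static uint64_t traced_id_from_fs_info(const struct model_fs_info *fs)
{
	return fs->generation;
}

/* What a trace event that is handed the transaction can report. */
static uint64_t traced_id_from_trans(const struct model_trans *t)
{
	return t->transid;
}
```

If a second transaction starts while the first one is still finishing its commit, the fs_info based value already refers to the second transaction, while the value taken from the transaction itself stays stable.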
8 days | btrfs: remove call to transaction commit trace in btrfs_cleanup_transaction() | Filipe Manana | 1 | -1/+0
We are not committing a transaction there, plus in subsequent patches we want to change the argument for the trace event to be a transaction handle instead of fs_info and in this context we don't have a transaction handle (struct btrfs_trans_handle, only a struct btrfs_transaction). So remove the call to the trace point. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 days | btrfs: remove call to transaction commit trace in warn_about_uncommitted_trans() | Filipe Manana | 1 | -1/+0
We are not committing a transaction there, plus in subsequent patches we want to change the argument for the trace event to be a transaction handle instead of fs_info and in this context we don't have a transaction handle (struct btrfs_trans_handle, only a struct btrfs_transaction). So remove the call to the trace point. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 days | btrfs: tracepoints: remove pointless root field from transaction commit event | Filipe Manana | 1 | -5/+1
A transaction commit is global, not per root, yet we currently always emit a root id field matching the root tree for no good reason, which only causes confusion. So remove the root field. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 days | btrfs: tracepoints: remove double negation in finish ordered extent event | Filipe Manana | 1 | -1/+1
There is no need to add a double negation (!!) to the update field because the field has a boolean type. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 days | btrfs: tree-checker: add more cross checks for free space tree | Qu Wenruo | 1 | -7/+60
This introduces extra checks using the previous key. If there is a previous key, we can do extra validations:

- The previous key is FREE_SPACE_INFO
  This means the current extent/bitmap should be inside the free space info key range, and its type must match the type recorded in the free space info.

- The previous key is FREE_SPACE_EXTENT or FREE_SPACE_BITMAP
  In that case both the current and previous key should belong to the same block group. Thus the key types must match, and there must be no overlap between the two keys.

These extra checks are inspired by the recently added type checks during free space tree loading.

The new tree-checker checks will allow earlier detection, but the loading time checks are still needed, as the tree-checker checks are still inside the same leaf, not matching per-bg level checks.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days | btrfs: tree-checker: ensure free space tree entries won't overflow | Qu Wenruo | 1 | -0/+14
Add an extra check to ensure the free space extent/bitmap and space info keys won't overflow. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
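As an illustration of the kind of overflow check described here, a free space key can be modelled as a (start, length) pair whose end must still fit in 64 bits. This is a userspace sketch, not the tree-checker code: the helper name `free_space_range_ok()` is hypothetical, and it relies on the GCC/Clang builtin `__builtin_add_overflow()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: @start and @length stand for the key's objectid and
 * offset. Returns true if start + length does not wrap around in a u64.
 */
static bool free_space_range_ok(uint64_t start, uint64_t length)
{
	uint64_t end;

	/* The builtin returns true when the addition overflows. */
	if (__builtin_add_overflow(start, length, &end))
		return false;
	return true;
}
```

A range that ends exactly at the maximum u64 value still passes, while anything that wraps is rejected.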
8 days | btrfs: tree-checker: extract the shared key check for free space entries | Qu Wenruo | 1 | -23/+23
Currently both check_free_space_extent() and check_free_space_bitmap() share a very common validation on the keys. Extract them into a helper, check_free_space_common_key(), and change the output string ("extent" or "bitmap") depending on the key type. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 days | btrfs: remove folio ordered flag and subpage bitmap | Qu Wenruo | 6 | -91/+7
Btrfs has an internal flag/subpage bitmap called ordered, which is to indicate that a block has corresponding ordered extent covering it. However this requires extra synchronization between the inode ordered tree, and the folio flag/subpage bitmap, not to mention we need to maintain the extra folio flag with subpage bitmap. As a step to align btrfs_folio_state more closely to iomap_folio_state, remove the btrfs specific ordered flag/bitmap. This will also save us 64 bytes for the bitmap of a huge folio. Since we're here, also update the ASCII graph of the bitmap, as there are only 3 sub-bitmaps now, show all sub-bitmaps directly. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 days | btrfs: remove folio_test_ordered() usage | Qu Wenruo | 2 | -13/+0
This involves:

- The ASSERT() inside end_bbio_data_write()
  It's only an ASSERT() and it has never been triggered as far as I know.

- btrfs_migrate_folio()
  Since all folio_test_ordered() usage will be removed, there is no need to copy the folio ordered flag.

- The ASSERT() inside btrfs_invalidate_folio()
  This one has its usefulness as it indeed caught some bugs during development. But that's the last user and will not be worth the folio flag or the subpage bitmap.

This will allow btrfs to finally remove the ordered flags.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days | btrfs: use dirty flag to check if an ordered extent needs to be truncated | Qu Wenruo | 1 | -8/+13
Currently there are only two folio ordered flag users:

- extent_writepage_io()
  To ensure the folio range has an ordered extent covering it. This is from the legacy COW fixup mechanism, which is already removed and only a simple check is left.

- btrfs_invalidate_folio()
  This is to avoid race with end_bbio_data_write(), where btrfs_finish_ordered_extent() will be called to handle the OE finishing.

But for btrfs_invalidate_folio() we have already waited for the folio writeback to finish, and locked the folio. This means we can use the dirty flag to check if a range is already submitted or not.

If the OE range is not dirty, it means the range has been submitted and its dirty flag was cleared. And since we have already waited for writeback, the endio function will handle the OE finishing. Thus if the range is not dirty, we must skip the range.

If the OE range is dirty, it means we have allocated an ordered extent but have not yet submitted the range. And that's exactly the case where we need to truncate the ordered extent.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days | btrfs: unify folio dirty flag clearing | Qu Wenruo | 3 | -13/+26
Currently during folio writeback, we call folio_clear_dirty_for_io() before extent_writepage(), which causes folio dirty flag to be cleared, but without touching the subpage bitmaps.

This works fine for the bio submission path, as we always call btrfs_folio_clear_dirty() to clear the subpage bitmap.

But this is far from consistent, thus this patch is going to unify the behavior to always use btrfs_folio_clear_dirty() helper to clear both folio flag and subpage bitmap.

This involves:

- Replace folio_clear_dirty_for_io() with folio_test_dirty()
  There is only one call site calling folio_clear_dirty_for_io() outside of subpage.c, that's inside extent_write_cache_pages() just before extent_writepage().

- Make btrfs_invalidate_folio() clear dirty range for the whole folio
  The function btrfs_invalidate_folio() is also called during extent_writepage(). If we had a folio completely beyond isize, we call folio_invalidate() -> btrfs_invalidate_folio() to free the folio.

  Since we no longer have folio_clear_dirty_for_io() to clear the folio dirty flag, we must manually clear the folio dirty flag for the to-be-invalidated folio, and also clear the PAGECACHE_TAG_DIRTY tag.

  The tag clearing is done using a new helper, btrfs_clear_folio_dirty_tag(), which is almost the same as the old btree_clear_folio_dirty_tag(), but with minor improvements including:

  * Remove the folio_test_dirty() check
    We have already done an ASSERT().

  * Add an ASSERT() to make sure folio is mapped

- Add extra ASSERT()s before clearing folio private
  During development I hit dirty folios without the private flag set, and that caused a lot of ASSERT()s. The reason is that btrfs_invalidate_folio() is relying on the dirty flag being cleared when it's called from extent_writepage().

  Add extra ASSERT()s inside clear_folio_extent_mapped() to catch wild dirty/writeback flags.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days | btrfs: detect dirty blocks without an ordered extent more reliably | Qu Wenruo | 1 | -31/+54
Currently btrfs detects dirty folio which doesn't have an ordered extent at extent_writepage_io(), but that is not ideal:

- The check is not handling all dirty blocks
  We can have multiple blocks inside a large folio, but the whole folio is marked ordered as long as there is one ordered extent in the range. We can still hit cases where some dirty blocks do not have corresponding ordered extents.

Instead of checking the folio ordered flags, do the check at alloc_new_bio(), where we're already searching for ordered extents for writebacks.

If we didn't find an ordered extent, we should already give an error message and notify the caller there is something wrong.

This allows us to check every block that goes through submit_extent_folio().

With this new and more reliable check, we can remove the old check.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days | btrfs: remove locked subpage bitmap | Qu Wenruo | 2 | -45/+10
Currently there are two members inside btrfs_folio_state that are related to the locked bitmap:

- The locked sub-bitmap inside btrfs_folio_state::bitmaps[]
  The enum btrfs_bitmap_nr_locked determines the sub-bitmap.

- btrfs_folio_state::nr_locked
  Which records how many blocks are locked inside the folio.

The locked sub-bitmap is a btrfs specific per-block tracking mechanism, which is mostly for async-submission, utilized by compressed writes.

The sub-bitmap itself is a superset of nr_locked, as it can provide more reliable tracking. But the sub-bitmap can be pretty large for the incoming huge folio support: a 2M sized folio with 4K page size means 512 bits for one sub-bitmap.

Furthermore, in the long run compression will be reworked to get rid of async-submission completely, so there is not much need for a full sub-bitmap to track the locked status.

This patch removes the locked sub-bitmap and only relies on the @nr_locked atomic to do the tracking. This can also save 64 bytes from btrfs_folio_state::bitmaps[] for a huge folio.

This will reduce some safety checks, as previously if a block was not locked, btrfs_folio_end_lock()/btrfs_folio_end_lock_bitmap() would detect that, skip reducing @nr_locked for that block, and avoid an underflow.

But this safety net itself shouldn't be necessary in the first place. If we're unlocking a block that is not locked, it's a bug in the logic, and we should catch it, not silently ignore it.

Thus I believe the removal of the extra safety net should not be a problem.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
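The size argument and the counter-only tracking can be sketched in userspace C11. Everything here is an assumption made for illustration: `sub_bitmap_bytes()`, `model_folio_state`, `model_lock_blocks()` and `model_unlock_blocks()` are hypothetical names, and the underflow handling only shows one possible way of catching an unbalanced unlock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Bytes needed for one per-block sub-bitmap of a folio. */
static size_t sub_bitmap_bytes(size_t folio_size, size_t block_size)
{
	size_t bits = folio_size / block_size;

	return (bits + 7) / 8;
}

/* Hypothetical model of a folio tracked only by a lock counter. */
struct model_folio_state {
	atomic_int nr_locked;
};

static void model_lock_blocks(struct model_folio_state *s, int nr)
{
	atomic_fetch_add(&s->nr_locked, nr);
}

/*
 * Returns true if the unlock was balanced. Without a bitmap, unlocking more
 * blocks than are locked cannot be skipped silently, so it is reported.
 */
static bool model_unlock_blocks(struct model_folio_state *s, int nr)
{
	int old = atomic_fetch_sub(&s->nr_locked, nr);

	if (old < nr) {
		/* Underflow: restore the counter and flag the logic bug. */
		atomic_fetch_add(&s->nr_locked, nr);
		return false;
	}
	return true;
}
```

For a 2M folio with 4K blocks, `sub_bitmap_bytes(2 * 1024 * 1024, 4096)` gives the 64 bytes mentioned in the commit message.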
8 days | btrfs: tree-checker: validate names in ROOT_REF and ROOT_BACKREF | Zhang Cen | 2 | -6/+40
ROOT_REF and ROOT_BACKREF items contain a struct btrfs_root_ref followed by the subvolume name. Several readers assume that this layout is already valid and then use the on-disk name length directly. A corrupted item can therefore make those readers address bytes outside the item, and BTRFS_IOC_GET_SUBVOL_INFO can copy too many bytes into its fixed-size UAPI name buffer. Validate ROOT_REF and ROOT_BACKREF items in tree-checker before any reader uses them. Reject records that do not contain a non-empty name, whose name_len does not exactly describe the remaining item payload, or whose name exceeds BTRFS_NAME_LEN. For BTRFS_IOC_GET_SUBVOL_INFO, copy only the validated on-disk name_len instead of deriving the copy length from the item size. The ioctl result is zeroed when allocated. That leaves the existing trailing zero byte untouched. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Zhang Cen <rollkingzzc@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
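The three rejection rules can be sketched as a standalone check. This is a userspace illustration rather than the tree-checker implementation: `model_root_ref` mirrors the packed on-disk layout of struct btrfs_root_ref (two 64-bit fields and a 16-bit name length), `MODEL_NAME_LEN` mirrors BTRFS_NAME_LEN, the function name is hypothetical, and endianness conversion is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_NAME_LEN 255	/* mirrors BTRFS_NAME_LEN */

/* Mirrors the packed on-disk layout; byte order is ignored in this sketch. */
struct model_root_ref {
	uint64_t dirid;
	uint64_t sequence;
	uint16_t name_len;
} __attribute__((packed));

/*
 * Hypothetical check: @item_size is the size of the whole item and @name_len
 * is the value read from the header at the start of the item.
 */
static bool model_root_ref_valid(uint32_t item_size, uint16_t name_len)
{
	/* The item must have room for the header plus a non-empty name. */
	if (item_size <= sizeof(struct model_root_ref))
		return false;
	if (name_len == 0 || name_len > MODEL_NAME_LEN)
		return false;
	/* name_len must describe exactly the remaining payload. */
	return item_size == sizeof(struct model_root_ref) + name_len;
}
```

Once an item has passed a check like this, a reader can copy exactly name_len bytes without deriving the length from the item size.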
8 days | btrfs: free-space-tree: reject mismatched extent and bitmap items | Zhang Cen | 2 | -5/+33
btrfs_load_free_space_tree() reads FREE_SPACE_INFO once and then chooses the bitmap or extent loader for all following free-space records until the next FREE_SPACE_INFO item. Those loaders currently enforce the selected record type only with ASSERT().

On production builds without CONFIG_BTRFS_ASSERT, a malformed free-space tree can therefore be decoded in the wrong mode. An EXTENT item can reach btrfs_free_space_test_bit() as bitmap data, while a BITMAP item can be added as a full free extent. The latter corrupts the in-memory free-space cache and the former can read beyond the item payload.

Sanitizer validation reported:

  general protection fault
  Call trace:
    assert_eb_folio_uptodate() (fs/btrfs/extent_io.c:4134)
    extent_buffer_test_bit() (?:?)
    btrfs_free_space_test_bit() (fs/btrfs/free-space-tree.c:518)
    srso_alias_return_thunk() (arch/x86/include/asm/nospec-branch.h:375)
    __entry_text_end() (?:?)
    __asan_memcpy() (mm/kasan/shadow.c:103)
    read_extent_buffer() (?:?)
    load_free_space_bitmaps() (fs/btrfs/free-space-tree.c:1548)
    btrfs_get_32() (fs/btrfs/free-space-tree.c:?)
    btrfs_set_16() (fs/btrfs/free-space-tree.c:?)
    kmem_cache_alloc_noprof() (?:?)
    btrfs_load_free_space_tree() (fs/btrfs/free-space-tree.c:1685)
    load_free_space_tree_for_test() (?:?)
    rcu_disable_urgency_upon_qs() (kernel/rcu/tree.c:721)
    vprintk_emit() (?:?)
    __up_write() (kernel/locking/rwsem.c:1401)
    clone_commit_root_for_test() (?:?)
    test_extent_as_bitmap_mode_mismatch() (?:?)
    kmem_cache_free() (?:?)
    btrfs_free_path() (fs/btrfs/free-space-tree.c:1449)
    __add_block_group_free_space() (fs/btrfs/free-space-tree.c:20)
    run_test() (?:?)
    do_raw_spin_unlock() (?:?)
    btrfs_test_free_space_tree() (fs/btrfs/tests/free-space-tree-tests.c:547)
    btrfs_test_qgroups() (fs/btrfs/tests/qgroup-tests.c:462)
    btrfs_run_sanity_tests() (fs/btrfs/free-space-tree.c:?)
    init_btrfs_fs() (fs/btrfs/super.c:2690)
    do_one_initcall() (init/main.c:1382)
    __kasan_kmalloc() (?:?)
    rcu_is_watching() (?:?)
    do_initcalls() (init/main.c:1457)
    kernel_init_freeable() (init/main.c:1674)
    kernel_init() (init/main.c:1584)
    ret_from_fork() (?:?)
    __switch_to() (?:?)
    ret_from_fork_asm() (?:?)

Validate every post-info key before decoding it. Reject keys whose type does not match the mode selected by FREE_SPACE_INFO, and reject keys whose range extends past the block group, returning -EUCLEAN instead of feeding the wrong record type to the bitmap or extent decoder.

Also reject zero-length FREE_SPACE_EXTENT items in tree-checker, matching the existing FREE_SPACE_BITMAP zero-length check. This keeps the loader range check simple and prevents a zero-length extent item from being a valid on-disk free-space record.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Zhang Cen <rollkingzzc@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
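The per-key validation described above can be sketched as follows. This is a userspace model, not the loader code: `model_key` and `model_check_free_space_key()` are hypothetical names, the two key type values follow the btrfs on-disk format, and the block group range passed in is assumed to be valid (its own end does not overflow).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#ifndef EUCLEAN
#define EUCLEAN 117	/* Linux value, defined only for portability */
#endif

/* Key type values as used by the btrfs on-disk format. */
#define MODEL_FREE_SPACE_EXTENT_KEY 199
#define MODEL_FREE_SPACE_BITMAP_KEY 200

struct model_key {
	uint64_t objectid;	/* start of the range */
	uint8_t type;
	uint64_t offset;	/* length of the range */
};

/*
 * Hypothetical check run on every key that follows a FREE_SPACE_INFO item.
 * @bitmap_mode is the mode selected by that info item.
 */
static int model_check_free_space_key(const struct model_key *key,
				      bool bitmap_mode,
				      uint64_t bg_start, uint64_t bg_len)
{
	uint8_t expected = bitmap_mode ? MODEL_FREE_SPACE_BITMAP_KEY :
					 MODEL_FREE_SPACE_EXTENT_KEY;
	uint64_t end;

	/* Wrong record type for the selected mode. */
	if (key->type != expected)
		return -EUCLEAN;
	/* Range wraps around or extends past the block group. */
	if (__builtin_add_overflow(key->objectid, key->offset, &end))
		return -EUCLEAN;
	if (end > bg_start + bg_len)
		return -EUCLEAN;
	return 0;
}
```

A caller would stop decoding and propagate the error as soon as a key fails, so neither decoder ever sees a record of the other type.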
8 days | btrfs: use on stack backref iterator in build_backref_tree() | David Sterba | 3 | -24/+15
The iterator is used only once and within build_backref_tree() so we can avoid one allocation and place it on stack. Signed-off-by: David Sterba <dsterba@suse.com>
8 days | btrfs: remove fs_info from struct btrfs_backref_iter | David Sterba | 3 | -16/+12
The fs_info is available everywhere and we don't need to store it inside a structure that is used within one function only, which is build_backref_tree(). The size of btrfs_backref_iter is now 48 bytes. Signed-off-by: David Sterba <dsterba@suse.com>
8 days | btrfs: simplify the btree folio wait during invalidation | Qu Wenruo | 1 | -22/+15
The btree inode is very different from regular data inodes, as the btree inode is never exposed to user space operations.

All operations are either initiated by btrfs metadata operations, or MM layer like memory pressure to release folios.

This means we never need to handle partial folio invalidation inside btree_invalidate_folio().

With that said, we can slightly simplify the btree folio invalidation by:

- Add ASSERT()s to make sure the range covers the whole folio

- Remove "if (start > end)" check
  As the range always covers the full folio, that check is always false and can be removed.

- Open code extent_invalidate_folio()

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days | btrfs: unexport and move extent_invalidate_folio() | Qu Wenruo | 3 | -34/+31
The function extent_invalidate_folio() has only a single caller inside btree_invalidate_folio(). There is no need to export such a function just for a single caller inside another file. Unexport extent_invalidate_folio() and move it to disk-io.c. And since we're moving the code, update the comment to match the current style, and remove the seemingly stale comment on the extent state removal, it's better explained by the comment just before btrfs_unlock_extent(). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 days | btrfs: optimize fill_holes() to merge a new hole with both adjacent items | Dave Chen | 1 | -17/+32
fill_holes() currently merges a punched hole with either the previous or the next file extent item, but never both in the same call. When holes are punched in a non-sequential order this leaves consecutive hole items in the inode's subvolume tree that should have been collapsed into a single one.

This is a minor metadata optimization that reduces the number of file extent items when holes are punched in non-sequential order. While having extra file extent items is harmless and has no functional impact, reducing metadata overhead can benefit workloads with heavily fragmented hole patterns.

For example:

  fallocate -p -o 4K -l 4K ${FILE}
  fallocate -p -o 12K -l 4K ${FILE}
  fallocate -p -o 8K -l 4K ${FILE}

After the third punch the [4K, 8K) and [12K, 16K) holes become adjacent to the new [8K, 12K) hole, but fill_holes() merges only one side and leaves two separate hole items ([4K, 12K) and [12K, 16K)) instead of the expected single [4K, 16K) hole item.

Fix this by checking both path->slots[0] - 1 and path->slots[0] in one pass:

- If only the previous slot is mergeable, extend it forward as before.

- If only the next slot is mergeable, extend it backward and update its key offset as before.

- If both are mergeable, extend the previous item to cover the new hole plus the next item, and remove the redundant next item with btrfs_del_items().

Because the merge path may now delete an item, switch the initial btrfs_search_slot() call from a plain lookup (ins_len = 0) to a search-for-deletion (ins_len = -1), so the leaf is prepared for a possible item removal.

Note: This optimization only applies to filesystems without the NO_HOLES feature enabled. Since NO_HOLES is now the default, this primarily benefits older filesystems or those explicitly created with NO_HOLES disabled.

Signed-off-by: Dave Chen <davechen@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
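The three merge cases can be illustrated with a simplified userspace model in which hole items are plain (start, length) pairs kept in a sorted array. This is a sketch under stated assumptions, not fill_holes() itself: `struct hole` and `add_hole()` are hypothetical, the array stands in for the leaf, and the caller must provide room for one more element.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a hole file extent item. */
struct hole {
	uint64_t start;
	uint64_t len;
};

/*
 * Add the hole [start, start + len) to the sorted, non-overlapping array @h
 * holding @nr items. Returns the new number of items.
 */
static size_t add_hole(struct hole *h, size_t nr, uint64_t start, uint64_t len)
{
	size_t slot = 0;
	bool prev_ok, next_ok;

	/* Find the first item that starts at or after the new hole. */
	while (slot < nr && h[slot].start < start)
		slot++;

	prev_ok = slot > 0 && h[slot - 1].start + h[slot - 1].len == start;
	next_ok = slot < nr && h[slot].start == start + len;

	if (prev_ok && next_ok) {
		/* Extend the previous item over the new hole and the next item. */
		h[slot - 1].len += len + h[slot].len;
		memmove(&h[slot], &h[slot + 1], (nr - slot - 1) * sizeof(*h));
		return nr - 1;
	}
	if (prev_ok) {
		h[slot - 1].len += len;		/* extend forward */
		return nr;
	}
	if (next_ok) {
		h[slot].start = start;		/* extend backward */
		h[slot].len += len;
		return nr;
	}
	/* No neighbour to merge with: insert a new item. */
	memmove(&h[slot + 1], &h[slot], (nr - slot) * sizeof(*h));
	h[slot] = (struct hole){ .start = start, .len = len };
	return nr + 1;
}
```

Replaying the three punches from the example (4K, 12K, then 8K, each 4K long) leaves a single item covering [4K, 16K).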
8 days | btrfs: warn about extent buffer that can not be released | Qu Wenruo | 3 | -8/+57
When we unmount the fs or during mount failures, btrfs will call invalidate_inode_pages2() to release all btree inode folios.

However that function can return -EBUSY if any folios can not be invalidated. This can be caused by:

- Some extent buffers are still held by btrfs
  This is a logic error, as we should release all tree root nodes during unmount and mount failure handling.

- Some extent buffers are under readahead and haven't yet finished
  These are much rarer but valid cases. In that case we should wait for those extent buffers.

Introduce a new helper invalidate_and_check_btree_folios() which will:

- Call invalidate_inode_pages2() and catch its return value
  If it returned 0 as expected, that's great and we can call it a day.

- Otherwise go through each extent buffer in buffer_tree
  Increase the ref by one first for the eb we're checking. This is to ensure the eb won't be freed after the readahead is finished.

  For ebs that still have EXTENT_BUFFER_READING flag, wait for them to finish first.

  After waiting for the readahead, check the refs of the eb and if it's still dirty.

  If the eb ref count is greater than 2 (one for the buffer tree, one held by us), it means we are still holding the extent buffer somewhere else, which is a code bug.

  If the eb is still dirty, it means a bug in transaction handling, e.g. the bug fixed by patch "btrfs: only release the dirty pages io tree after successful writes".

  For either case, show a warning message about the eb, including its bytenr, owner, refs and flags. And if it's a debug build, also trigger WARN_ON_ONCE() so that fstests can properly catch such situation.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=221270
Reported-by: AHN SEOK-YOUNG <iamsyahn@gmail.com>
CC: Teng Liu <27rabbitlt@gmail.com>
Tested-by: Teng Liu <27rabbitlt@gmail.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
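The two conditions checked per extent buffer can be written down as a tiny classification function. This is only a sketch: `model_eb`, the `MODEL_EB_*` flags and `model_check_eb()` are hypothetical, and the model assumes any pending read has already completed and that the checker has taken its own extra reference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the few fields the check looks at. */
struct model_eb {
	uint64_t bytenr;
	int refs;	/* includes the buffer tree ref and the checker's ref */
	bool dirty;
};

#define MODEL_EB_STILL_HELD	(1 << 0)	/* someone else holds a ref */
#define MODEL_EB_STILL_DIRTY	(1 << 1)	/* never written back */

/* Returns a bitmask of the problems found, or 0 if the eb is releasable. */
static int model_check_eb(const struct model_eb *eb)
{
	int problems = 0;

	/* One ref for the buffer tree, one held by the checker itself. */
	if (eb->refs > 2)
		problems |= MODEL_EB_STILL_HELD;
	if (eb->dirty)
		problems |= MODEL_EB_STILL_DIRTY;
	return problems;
}
```

A non-zero result corresponds to the cases where the commit message says a warning should be shown.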
8 days | btrfs: make sure report_eb_range() is not inlined | Filipe Manana | 1 | -6/+7
If report_eb_range() is inlined into its single caller (check_eb_range()), we end up with a larger module size, which is undesirable and does not provide any advantage since this code is for a cold path which we don't expect to ever hit.

Add the noinline attribute to report_eb_range() and while at it also make it return void as it always returns true.

Before this change (with gcc 14.2.0-19 from Debian):

  $ size fs/btrfs/btrfs.ko
     text    data     bss     dec     hex filename
  2018267  176232   15592 2210091  21b92b fs/btrfs/btrfs.ko

After this change:

  $ size fs/btrfs/btrfs.ko
     text    data     bss     dec     hex filename
  2017835  176048   15592 2209475  21b6c3 fs/btrfs/btrfs.ko

Also, replacing the noinline with __cold yields slightly worse results:

  $ size fs/btrfs/btrfs.ko
     text    data     bss     dec     hex filename
  2017889  176048   15592 2209529  21b6f9 fs/btrfs/btrfs.ko

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
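The pattern of keeping a cold reporting helper out of line can be shown in userspace with the GCC/Clang attribute that the kernel's noinline macro is based on. This is a sketch only: `report_range()` and `check_range()` are hypothetical names, and the message text is made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Cold-path reporter. The attribute asks the compiler not to inline it into
 * the hot checking function below. It returns void because the caller
 * already knows the outcome.
 */
static __attribute__((noinline)) void report_range(unsigned long start,
						   unsigned long len,
						   unsigned long size)
{
	fprintf(stderr, "access out of range: start %lu len %lu size %lu\n",
		start, len, size);
}

/* Returns true if [start, start + len) falls outside a buffer of @size bytes. */
static bool check_range(unsigned long start, unsigned long len,
			unsigned long size)
{
	unsigned long end;

	if (__builtin_add_overflow(start, len, &end) || end > size) {
		report_range(start, len, size);
		return true;
	}
	return false;
}
```

Whether this reduces code size depends on the compiler and configuration, which is why the commit message quotes measured `size` output rather than assuming a result.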
8 days | btrfs: move transaction abort message to __btrfs_abort_transaction() | David Sterba | 2 | -12/+7
The btrfs_abort_transaction() is called at the location where we want to report the abort. It must be a macro so we get the correct line and stack trace. This inlines the necessary code and the rest is pushed to __btrfs_abort_transaction().

There's a possibility to reduce the inlined code if we move the message to the helper function as well, without loss of information. The difference is only that the WARN will not print it inside the stack report but after:

  --[ cut here ]--
  WARNING: fs/btrfs/transaction.c:2045 at btrfs_commit_transaction+0xa21/0xd30 [btrfs], CPU#11: bonnie++/3377975
  ...
  --[ end trace ] --
  BTRFS error (device dm-0 state A): Transaction aborted (error -28)

While previously there would be one more line like:

  --[ cut here ]--
  BTRFS: Transaction aborted (error -28)
  WARNING: fs/btrfs/transaction.c:2045 at btrfs_commit_transaction+0xa21/0xd30 [btrfs], CPU#11: bonnie++/3377975
  ...
  --[ end trace ] --

This removes about 20KiB of btrfs.ko on a release config.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days | btrfs: don't force DIO writes to be serialized | Mark Harmstone | 1 | -0/+1
Before btrfs switched to the new mount API in 2023, we were setting SB_NOSEC in btrfs_mount_root(). This flag tells the VFS that the filesystem may have files which don't have security xattrs, enabling it to do some optimizations.

Unfortunately this was missed in the transition, meaning that IS_NOSEC will always return false for a btrfs inode. This means that btrfs_direct_write() calls will always get the inode lock exclusively, meaning that DIO writes to the same file will be serialized.

On my machine, this one-line change results in a ~59% improvement in DIO throughput:

Before patch:

  test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
  ...
  fio-3.39
  Starting 32 processes
  test: Laying out IO file (1 file / 1024MiB)
  Jobs: 32 (f=32): [w(32)][100.0%][w=764MiB/s][w=195k IOPS][eta 00m:00s]
  test: (groupid=0, jobs=32): err= 0: pid=586: Wed Apr 22 13:03:04 2026
    write: IOPS=202k, BW=787MiB/s (826MB/s)(46.1GiB/60012msec); 0 zone resets
     bw ( KiB/s): min=498714, max=1199892, per=100.00%, avg=806659.03, stdev=4229.94, samples=3808
     iops : min=124677, max=299971, avg=201661.82, stdev=1057.49, samples=3808
    cpu : usr=0.32%, sys=1.27%, ctx=8329204, majf=0, minf=1163
    IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=0,12094328,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency : target=0, window=0, percentile=100.00%, depth=64

  Run status group 0 (all jobs):
    WRITE: bw=787MiB/s (826MB/s), 787MiB/s-787MiB/s (826MB/s-826MB/s), io=46.1GiB (49.5GB), run=60012-60012msec

After patch:

  test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
  ...
  fio-3.39
  Starting 32 processes
  test: Laying out IO file (1 file / 1024MiB)
  Jobs: 32 (f=32): [w(32)][100.0%][w=1255MiB/s][w=321k IOPS][eta 00m:00s]
  test: (groupid=0, jobs=32): err= 0: pid=572: Wed Apr 22 13:13:46 2026
    write: IOPS=320k, BW=1250MiB/s (1311MB/s)(73.3GiB/60003msec); 0 zone resets
     bw ( MiB/s): min= 619, max= 2289, per=100.00%, avg=1251.28, stdev= 9.64, samples=3808
     iops : min=158538, max=586025, avg=320320.80, stdev=2468.97, samples=3808
    cpu : usr=0.35%, sys=11.50%, ctx=1584847, majf=0, minf=1160
    IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=0,19203309,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency : target=0, window=0, percentile=100.00%, depth=64

  Run status group 0 (all jobs):
    WRITE: bw=1250MiB/s (1311MB/s), 1250MiB/s-1250MiB/s (1311MB/s-1311MB/s), io=73.3GiB (78.7GB), run=60003-60003msec

The script to reproduce that:

  #!/bin/bash
  mkfs.btrfs -f /dev/nvme0n1
  mount /dev/nvme0n1 /mnt/test
  mkdir /mnt/test/nocow
  chattr +C /mnt/test/nocow
  fio /root/test.fio

  # cat /root/test.fio
  [global]
  rw=randwrite
  ioengine=io_uring
  iodepth=64
  size=1g
  direct=1
  startdelay=20
  force_async=4
  ramp_time=5
  runtime=60
  group_reporting=1
  numjobs=32
  time_based
  disk_util=0
  clat_percentiles=0
  disable_lat=1
  disable_clat=1
  disable_slat=1
  filename=/mnt/test/nocow/fiofile

  [test]
  name=test
  bs=4k
  stonewall

This was on a VM with 8 cores and 8GB of RAM, with a real NVMe exposed through PCI passthrough. The figures for XFS and ext4 in comparison are both about 3GB/s.

Fixes: ad21f15b0f79 ("btrfs: switch to the new mount API")
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days | btrfs: move large data folios out of experimental features | Qu Wenruo | 3 | -21/+1
This feature was introduced in v6.17 under experimental, and we had several small bugs related to or exposed by that:

  e9e3b22ddfa7 ("btrfs: fix beyond-EOF write handling")
  18de34daa7c6 ("btrfs: truncate ordered extent when skipping writeback past i_size")

Otherwise, the feature has been frequently tested by btrfs developers.

The latest fix only arrived in v6.19. After three releases, I think it's time to move this feature out of experimental.

And since we're here, also remove the comment about the bitmap size limit, which is no longer relevant in the context. It will soon be outdated for the incoming huge folio support.

Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8 days | btrfs: refresh add_ra_bio_pages() to indicate it's using folios | Qu Wenruo | 1 | -19/+17
The function add_ra_bio_pages() has been utilizing folio interfaces since c808c1dcb1b2 ("btrfs: convert add_ra_bio_pages() to use only folios"), but we are still referring to "pages" inside the function name and all comments.

Furthermore, such folio/page mixing can even be confusing, e.g. the variable @page_end is very confusing as we're not really referring to the end of the page, but the end of the folio, especially when we already have large folio support.

Enhance that function by:

- Rename "page" to "folio" to avoid confusion

- Skip to the folio end if there is already a folio in the page cache
  The existing skip is:

    cur += folio_size(folio);

  This is incorrect if @cur is not folio size aligned, which can be common with large folio support. Thankfully this is not going to cause any real bugs, but at most will skip some blocks that could be added to readahead.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
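The difference between the two ways of skipping can be checked with plain arithmetic. This is a userspace sketch with hypothetical helper names, and it assumes that folio sizes are powers of two and that folios are naturally aligned to their size.

```c
#include <assert.h>
#include <stdint.h>

/* Old behaviour: advance by a full folio size regardless of alignment. */
static uint64_t skip_by_size(uint64_t cur, uint64_t folio_size)
{
	return cur + folio_size;
}

/*
 * Skip to the end of the folio that contains @cur. Assumes @folio_size is a
 * power of two and that the folio starts at a multiple of its size.
 */
static uint64_t skip_to_folio_end(uint64_t cur, uint64_t folio_size)
{
	uint64_t folio_start = cur & ~(folio_size - 1);

	return folio_start + folio_size;
}
```

With a 16K folio covering [16K, 32K) and @cur at 20K, skipping by size lands at 36K and passes over the 4K of blocks in [32K, 36K), while skipping to the folio end lands exactly at 32K. When @cur is aligned, both give the same result.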
8 daysbtrfs: enable cross-folio readahead for bs < ps and large folio casesQu Wenruo1-27/+8
[BACKGROUND]
When bs < ps support was initially introduced, the compressed data readahead was disabled as at that time the target page size was 64K. This means a compressed data extent can span at most 3 64K pages (the head and tail parts are not aligned to 64K), meaning the benefit is pretty minimal.

[UNEXPECTED WORKING SITUATION]
But with the already merged large folio support, we're already enabling readahead with subpage routine unintentionally, e.g.:

    0          4K         8K         12K        16K
    |       Folio 0       |      Folio 8K       |
    |<----- Compressed data ------->|

We have 2 8K sized folios, all backed by a single compressed data. In that case add_ra_bio_pages() will continue to add folio 8K into the read bio, as the condition to skip is only (bs < ps), not taking the newer large folio support into consideration at all.

So for folio 8K, it is added to the read bio, but without subpage lock bitmap populated. Then at end_bbio_data_read(), folio 0 has proper locked bitmap set, but folio 8K does not.

This inconsistency is handled by the extra safety net at btrfs_subpage_end_and_test_lock() where if a folio has no @nr_locked, it will just be unlocked without touching the locked bitmap.

[ENHANCEMENT]
Make add_ra_bio_pages() support bs < ps and large folio cases, by removing the check and calling btrfs_folio_set_lock() unconditionally.

This won't make any difference on 4K page sized systems with large folios, as the readahead is already working, although unexpectedly. But this will enable true compressed data readahead for bs < ps cases properly.

Please note that such readahead will only work if the compressed extent is crossing folio boundaries, which is also the existing limitation.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: remove 32bit compat code for VFS inode numberDavid Sterba2-33/+2
Commit 0b2600f81cefcd ("treewide: change inode->i_ino from unsigned long to u64") sets the inode number type to u64 unconditionally, so we can use it directly as there's no difference between 32bit and 64bit platforms. We used to have a copy of the number in our btrfs_inode. The size of btrfs_inode on a 32bit platform is about 688 bytes (after the change). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: limit size of bios submitted from writebackJan Kara5-0/+50
Currently btrfs_writepages() just accumulates as large a bio as possible (within writeback_control constraints) and then submits it. This can however lead to significant latency in writeback IO submission (I have observed tens of milliseconds) because the submitted bio can easily exceed a hundred megabytes. Consequently this leads to IO pipeline stalls and reduced throughput. At the same time, beyond a certain size submitting such a large bio provides diminishing returns because the bio is split by the block layer immediately anyway. So compute an estimate of the bio size beyond which we are unlikely to improve performance, and submit the bio for writeback once we accumulate that much, to keep the IO pipeline busy. This improves writeback throughput for sequential writes by about 15% on the test machine I was using. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Jan Kara <jack@suse.cz> [ Fix the handling of missing device to avoid NULL pointer dereference. ] Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
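The idea of submitting once a size cap is reached, rather than growing the bio as far as possible, can be sketched in userspace as follows. Everything here is hypothetical: struct demo_bio, demo_submit() and demo_add_block() are illustration names, and max_bytes is an assumed, caller-provided cap, since the commit message does not spell out how the real estimate is computed.

```c
#include <stdint.h>

/* Hypothetical accumulator standing in for a bio being built by writeback. */
struct demo_bio {
	uint64_t bytes;         /* bytes accumulated in the current bio */
	uint64_t max_bytes;     /* assumed size cap, supplied by the caller */
	unsigned int submitted; /* number of bios submitted so far */
};

/* Submit whatever has been accumulated; an empty bio is not submitted. */
static void demo_submit(struct demo_bio *bio)
{
	if (bio->bytes == 0)
		return;
	bio->submitted++;
	bio->bytes = 0;
}

/* Add one block and submit as soon as the cap is reached instead of growing further. */
static void demo_add_block(struct demo_bio *bio, uint64_t block_size)
{
	bio->bytes += block_size;
	if (bio->bytes >= bio->max_bytes)
		demo_submit(bio);
}
```

With a 1MiB cap and 600 blocks of 4K, two full bios go out during accumulation and the remaining 88 blocks are sent by a final demo_submit() call, so IO starts flowing long before the whole range has been collected.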
8 daysbtrfs: remove 2K block size supportQu Wenruo1-11/+1
Originally 2K block size support was introduced to test subpage (block size < page size) on x86_64, where the page size is exactly the original minimal block size.

However that 2K block size support has some problems:

- No 2K nodesize support
  This is critical, as there is still no way to exercise the subpage metadata routine.

- Very easy to test subpage data path now
  With the currently experimental large folio support, it's very easy to test the subpage data folio path already, as when a folio larger than 4K is encountered on x86_64, we will need all the subpage folio states and bitmaps. So there is no need to use 2K block size just to verify subpage data path even on x86_64.

And with the incoming huge folio (2M on x86_64) support, the 2K block size will easily double the bitmap size. Considering the burden to maintain and the limited extra coverage, I believe it's time to remove it for the incoming huge folio support.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: change return type from int to bool in check_eb_range()Filipe Manana1-2/+2
The function always returns true or false but its return type is defined as int, which makes no sense. Change it to bool. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: add missing unlikely to if branches leading to a DEBUG_WARN()Filipe Manana7-17/+17
If statement branches that lead to a DEBUG_WARN() are not expected to be taken, and in most places we surround their expressions with the unlikely tag; however, a few places are missing it. Add the unlikely tag to those places to make it explicit to a reader that the branch is not expected and to hint the compiler to generate better code. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
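For readers unfamiliar with the annotation, a rough userspace equivalent looks like the sketch below. It assumes GCC or Clang, where __builtin_expect() is available; demo_unlikely() and demo_check_range() are hypothetical names, and fprintf() merely stands in for a warning such as DEBUG_WARN().

```c
#include <stdio.h>

/* Userspace approximation of the kernel's unlikely() (GCC/Clang builtin). */
#define demo_unlikely(x) __builtin_expect(!!(x), 0)

/* Returns 0 when [start, start + len) fits inside total, -1 otherwise. */
static int demo_check_range(unsigned long start, unsigned long len,
			    unsigned long total)
{
	/* The error branch is marked as the cold path, for the reader and the compiler. */
	if (demo_unlikely(start > total || len > total - start)) {
		fprintf(stderr, "unexpected range %lu+%lu (total %lu)\n",
			start, len, total);
		return -1;
	}
	return 0;
}
```

The annotation does not change the result of the condition. It only tells the compiler which way the branch is expected to go, so it can lay out the common path as the fall-through.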
8 daysbtrfs: use QSTR() in __btrfs_ioctl_snap_create()Thorsten Blum1-1/+1
Drop the length argument and use the simpler QSTR(). Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: use the enums instead of int type in struct btrfs_block_group fieldsFilipe Manana1-2/+2
The 'disk_cache_state' and 'cached' fields are defined with an int type but all the values we assign to them come from the enums btrfs_disk_cache_state and btrfs_caching_type. So change the type in the btrfs_block_group structure from int to these enums - in practice an enum is an int, so this is more for readability and clarity. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Sun YangKai <sunk67188@gmail.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: use min_size variable to setup block rsv in btrfs_replace_file_extents()Filipe Manana1-2/+2
There's no need to calculate the size for the temporary block reserve again in btrfs_replace_file_extents() - we have already calculated it and stored it in the 'min_size' variable. So use the variable to make it clearer and also make the variable const since it's not supposed to change during the whole function. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
8 daysbtrfs: balance: fix potential bg lookup failure in btrfs_may_alloc_data_chunk()ZhengYuan Huang1-1/+5
[BUG]
Running btrfs balance can trigger a null-ptr-deref before relocating a data chunk when metadata corruption leaves a chunk in the chunk tree without a corresponding block group in the in-memory cache:

  KASAN: null-ptr-deref in range [0x0000000000000088-0x000000000000008f]
  RIP: 0010:btrfs_may_alloc_data_chunk+0x40/0x1c0 fs/btrfs/volumes.c:3601
  Call Trace:
   __btrfs_balance fs/btrfs/volumes.c:4217 [inline]
   btrfs_balance+0x2516/0x42b0 fs/btrfs/volumes.c:4604
   btrfs_ioctl_balance fs/btrfs/ioctl.c:3577 [inline]
   btrfs_ioctl+0x25cf/0x5b90 fs/btrfs/ioctl.c:5313
   ...

[CAUSE]
__btrfs_balance() iterates the on-disk chunk tree and passes the chunk logical bytenr to btrfs_may_alloc_data_chunk() before relocating a data chunk. That helper then queries the in-memory block group cache:

  cache = btrfs_lookup_block_group(fs_info, chunk_offset);
  chunk_type = cache->flags; /* cache may be NULL */

A corrupt image can contain a chunk item whose matching block group item is missing, so no block group is ever inserted into the cache. In that case btrfs_lookup_block_group() returns NULL. The code only guards this with ASSERT(cache), which becomes a no-op when CONFIG_BTRFS_ASSERT is disabled. The subsequent dereference of cache->flags therefore crashes the kernel.

[FIX]
Add a NULL check after btrfs_lookup_block_group() in btrfs_may_alloc_data_chunk() and print an error message for clarity.

Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
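The shape of the fix, checking the lookup result before dereferencing it, can be sketched in userspace as below. This is not the kernel code: struct demo_block_group, demo_lookup_block_group() and demo_chunk_flags() are hypothetical stand-ins, the cache is a plain array, and the -1 return value is an arbitrary choice for the demo because the commit message does not state which error the real function returns.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for an in-memory block group. */
struct demo_block_group {
	uint64_t start;
	uint64_t length;
	uint64_t flags;
};

/* Returns the block group covering bytenr, or NULL when none is cached. */
static const struct demo_block_group *
demo_lookup_block_group(const struct demo_block_group *cache, size_t count,
			uint64_t bytenr)
{
	for (size_t i = 0; i < count; i++) {
		if (bytenr >= cache[i].start &&
		    bytenr - cache[i].start < cache[i].length)
			return &cache[i];
	}
	return NULL;
}

/*
 * Fetches the flags of the block group behind chunk_offset.
 * Returns 0 on success, -1 (arbitrary demo error code) when the lookup fails.
 */
static int demo_chunk_flags(const struct demo_block_group *cache, size_t count,
			    uint64_t chunk_offset, uint64_t *flags_ret)
{
	const struct demo_block_group *bg;

	bg = demo_lookup_block_group(cache, count, chunk_offset);
	if (!bg) {
		/* Report the missing block group instead of dereferencing NULL. */
		fprintf(stderr, "no block group found for chunk at %llu\n",
			(unsigned long long)chunk_offset);
		return -1;
	}
	*flags_ret = bg->flags;
	return 0;
}
```

Unlike an assertion that can be compiled out, the explicit check stays in every build, so a chunk without a cached block group produces an error instead of a crash.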
8 daysbtrfs: balance: fix potential bg lookup failure in chunk_usage_range_filter()ZhengYuan Huang1-7/+17
[BUG]
Running btrfs balance with a usage range filter (-dusage=min..max) can trigger a null-ptr-deref when metadata corruption causes a chunk to have no corresponding block group in the in-memory cache:

  KASAN: null-ptr-deref in range [0x0000000000000070-0x0000000000000077]
  RIP: 0010:chunk_usage_range_filter fs/btrfs/volumes.c:3845 [inline]
  RIP: 0010:should_balance_chunk fs/btrfs/volumes.c:4031 [inline]
  RIP: 0010:__btrfs_balance fs/btrfs/volumes.c:4182 [inline]
  RIP: 0010:btrfs_balance+0x249e/0x4320 fs/btrfs/volumes.c:4618
  ...
  Call Trace:
   btrfs_ioctl_balance fs/btrfs/ioctl.c:3577 [inline]
   btrfs_ioctl+0x25cf/0x5b90 fs/btrfs/ioctl.c:5313
   vfs_ioctl fs/ioctl.c:51 [inline]
   ...

The bug is reproducible on a recent development branch.

[CAUSE]
Two separate data structures are involved:

1. The on-disk chunk tree, which records every chunk (logical address space region) and is iterated by __btrfs_balance().

2. The in-memory block group cache (fs_info->block_group_cache_tree), which is built at mount time by btrfs_read_block_groups() and holds a struct btrfs_block_group for each chunk. This cache is what the usage range filter queries.

On a well-formed filesystem, these two are kept in 1:1 correspondence. However, btrfs_read_block_groups() builds the cache from block group items in the extent tree, not directly from the chunk tree. A corrupted image can therefore contain a chunk item in the chunk tree whose corresponding block group item is absent from the extent tree; that chunk's block group is then never inserted into the in-memory cache.

When balance iterates the chunk tree and reaches such an orphaned chunk, should_balance_chunk() calls chunk_usage_range_filter(), which queries the block group cache:

  cache = btrfs_lookup_block_group(fs_info, chunk_offset);
  chunk_used = cache->used; /* cache may be NULL */

btrfs_lookup_block_group() returns NULL silently when no cached entry covers chunk_offset. chunk_usage_range_filter() does not check the return value, so the immediately following dereference of cache->used triggers the crash.

[FIX]
Add a NULL check after btrfs_lookup_block_group() in chunk_usage_range_filter(). When the lookup fails, emit a btrfs_err() message identifying the affected bytenr and return -EUCLEAN to indicate filesystem corruption.

Since chunk_usage_range_filter() now has an error path, change its return type from bool to int, return 0 if the chunk matches the usage range, and 1 if it should be filtered out.

Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
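The three-way return convention described in the fix (negative errno on error, 0 for a match, 1 for filtered out) can be sketched in userspace as follows. All names are hypothetical, the percentage comparison is an assumption made for the demo rather than the kernel's exact threshold arithmetic, and EUCLEAN is defined locally only for platforms whose errno.h lacks it.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#ifndef EUCLEAN
#define EUCLEAN 117 /* Linux value, defined here only for portability of the demo */
#endif

/* Hypothetical stand-in for a cached block group. */
struct demo_bg {
	uint64_t length;
	uint64_t used;
};

/*
 * Returns a negative errno on error, 0 if the usage is within
 * [min_pct, max_pct), and 1 if the chunk should be filtered out.
 * A NULL bg models a failed block group lookup.
 */
static int demo_usage_range_filter(const struct demo_bg *bg,
				   unsigned int min_pct, unsigned int max_pct)
{
	if (!bg)
		return -EUCLEAN;
	if (bg->used * 100 >= (uint64_t)min_pct * bg->length &&
	    bg->used * 100 < (uint64_t)max_pct * bg->length)
		return 0;
	return 1;
}

/* Caller side: propagate errors first, then interpret 0 and 1. */
static int demo_should_balance(const struct demo_bg *bg, unsigned int min_pct,
			       unsigned int max_pct, bool *balance)
{
	int ret = demo_usage_range_filter(bg, min_pct, max_pct);

	if (ret < 0)
		return ret;
	*balance = (ret == 0);
	return 0;
}
```

A plain bool cannot distinguish "filtered out" from "lookup failed", which is why callers of the widened function have to test for a negative value before treating the result as a yes/no answer.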
8 daysbtrfs: balance: fix potential bg lookup failure in chunk_usage_filter()ZhengYuan Huang1-9/