aboutsummaryrefslogtreecommitdiff
path: root/fs/btrfs/backref.c
AgeCommit message (Collapse)AuthorFilesLines
2026-04-07btrfs: remove redundant extent_buffer_uptodate() checks after read_tree_block()Filipe Manana1-10/+0
We have several places that call extent_buffer_uptodate() after reading a tree block with read_tree_block(), but that is redundant since we already call extent_buffer_uptodate() in the call chain of read_tree_block(): read_tree_block() btrfs_read_extent_buffer() read_extent_buffer_pages() returns -EIO if extent_buffer_uptodate() returns false So remove those redundant checks. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-03-21Merge tag 'for-7.0-rc4-tag' of ↵Linus Torvalds1-0/+28
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Another batch of fixes for problems that have been identified by tools analyzing code or by fuzzing. Most of them are short, two patches fix the same thing in many places so the diffs are bigger. - handle potential NULL pointer errors after attempting to read extent and checksum trees - prevent ENOSPC when creating many qgroups by ioctls in the same transaction - encoded write ioctl fixes (with 64K page and 4K block size): - fix unexpected bio length - do not let compressed bios and pages interfere with page cache - compression fixes on setups with 64K page and 4K block size: fix folio length assertions (zstd and lzo) - remap tree fixes: - make sure to hold block group reference while moving it - handle early exit when moving block group to unused list - handle deleted subvolumes with inconsistent state of deletion progress" * tag 'for-7.0-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: reject root items with drop_progress and zero drop_level btrfs: check block group before marking it unused in balance_remap_chunks() btrfs: hold block group reference during entire move_existing_remap() btrfs: fix an incorrect ASSERT() condition inside lzo_decompress_bio() btrfs: fix an incorrect ASSERT() condition inside zstd_decompress_bio() btrfs: do not touch page cache for encoded writes btrfs: fix a bug that makes encoded write bio larger than expected btrfs: reserve enough transaction items for qgroup ioctls btrfs: check for NULL root after calls to btrfs_csum_root() btrfs: check for NULL root after calls to btrfs_extent_root()
2026-03-17btrfs: check for NULL root after calls to btrfs_extent_root()Filipe Manana1-0/+28
btrfs_extent_root() can return a NULL pointer in case the root we are looking for is not in the rb tree that tracks roots. So add checks to every caller that is missing such check to log a message and return an error. The same applies to callers of btrfs_block_group_root(), since it calls btrfs_extent_root(). Reported-by: Chris Mason <clm@meta.com> Link: https://lore.kernel.org/linux-btrfs/20260208161657.3972997-1-clm@meta.com/ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-21Convert 'alloc_obj' family to use the new default GFP_KERNEL argumentLinus Torvalds1-2/+2
This was done entirely with mindless brute force, using git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' | xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21treewide: Replace kmalloc with kmalloc_obj for non-scalar typesKees Cook1-6/+6
This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...) (where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-03btrfs: remove unreachable return after btrfs_backref_panic() in ↵Zhen Ni1-3/+1
btrfs_backref_finish_upper_links() The return statement after btrfs_backref_panic() is unreachable since btrfs_backref_panic() calls BUG() which never returns. Remove the return to unify it with the other calls to btrfs_backref_panic(). Signed-off-by: Zhen Ni <zhen.ni@easystack.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: use bool type for btrfs_path members used as booleansFilipe Manana1-8/+8
Many fields of struct btrfs_path are used as booleans but their type is an unsigned int (of one 1 bit width to save space). Change the type to bool keeping the :1 suffix so that they combine with the previous u8 fields in order to save space. This makes the code more clear by using explicit true/false and more in line with the preferred style, preserving the size of the structure. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: declare free_ipath() via DEFINE_FREE()Miquel Sabaté Solà1-9/+1
The free_ipath() function was being used as a cleanup function everywhere. Declare it via DEFINE_FREE() so we can use this function with the __free() helper. The name has also been adjusted so it's closer to the type's name. Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: use the key format macros when printing keysFilipe Manana1-6/+5
Change all locations that print a key to use the new macros to print them in order to ensure a consistent style and avoid repetitive code. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23btrfs: add unlikely annotations to branches leading to EIODavid Sterba1-2/+2
The unlikely() annotation is a static prediction hint that compiler may use to reorder code out of hot path. We use it elsewhere (namely tree-checker.c) for error branches that almost never happen, where EIO is one of them. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23btrfs: add unlikely annotations to branches leading to EUCLEANDavid Sterba1-10/+10
The unlikely() annotation is a static prediction hint that compiler may use to reorder code out of hot path. We use it elsewhere (namely tree-checker.c) for error branches that almost never happen, where EUCLEAN (a corruption) is one of them. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23btrfs: fix typos in comments and stringsDavid Sterba1-1/+1
Annual typo fixing pass. Strangely codespell found only about 30% of what is in this patch, the rest was done manually using text spellchecker with a custom dictionary of acceptable terms. Reviewed-by: Neal Gompa <neal@gompa.dev> Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22btrfs: set search_commit_root to false in iterate_inodes_from_logical()Filipe Manana1-5/+7
There's no point in checking at iterate_inodes_from_logical() if the path has search_commit_root set, the only caller never sets search_commit_root to true and it doesn't make sense for it ever to be true for the current use case (logical_to_ino ioctl). So stop checking for that and since the only caller allocates the path just for it to be used by iterate_inodes_from_logical(), move the path allocation into that function. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21btrfs: simplify debug print helpers without enabled printkDavid Sterba1-3/+1
The btrfs_debug() helpers depend on various config options. In case of "no_printk" we don't need to define a special helper that in the end does nothing but validates the parameters. As the default build config is to validate the parameters we can simplify it to let the debug helpers expand to nothing and remove btrfs_no_printk_in_rcu(). To avoid warnings use fs_info and inline one variable in extent_from_logical(). Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21btrfs: rename err to ret2 in resolve_indirect_refs()David Sterba1-5/+5
Unify naming of return value to the preferred way, move the variable to the closest scope. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21btrfs: pass struct rb_simple_node pointer directly in rb_simple_insert()Pan Chuang1-3/+2
Replace struct embedding with union to enable safe type conversion in btrfs_backref_node, tree_block and mapping_node. Adjust function calls to use the new unified API, eliminating redundant parameters. Signed-off-by: Pan Chuang <panchuang@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21btrfs: relocation: simplify unused logic related to LINK_LOWERDaniel Vacek1-10/+6
btrfs_backref_link_edge() is always called with the LINK_LOWER argument. We can simplify it and remove the LINK_LOWER and LINK_UPPER macros completely. The last call with LINK_UPPER was removed with commit 0097422c0dfe0a ("btrfs: remove clone_backref_node() from relocation"). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: use list_first_entry() everywhereDavid Sterba1-4/+4
Using the helper makes it a bit more clear that we're accessing the first list entry. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: convert ASSERT(0) with handled errors to DEBUG_WARN()David Sterba1-1/+1
The use of ASSERT(0) is maybe useful for some cases but more like a notice for developers. Assertions can be compiled in independently so convert it to a debugging helper. The difference is that it's just a warning and will not end up in BUG(). The converted cases are in connection with proper error handling. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: convert WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG)) to DEBUG_WARNDavid Sterba1-1/+1
Use the conditional warning instead of typing the whole condition. Optional message is printed where it seems clear what could be the problem. Conversion is left out in btree_csum_one_bio() because of the additional condition. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: unify ordering of btrfs_key initializationsDavid Sterba1-2/+2
The btrfs_key is defined as objectid/type/offset and the keys are also printed like that. For better readability, update all key initializations to match this order. Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: update prelim_ref_insert() to use rb helpersRoger L. Beckermeyer III1-40/+39
Update prelim_ref_insert() to use rb_find_add_cached(). There is a special change that the existing prelim_ref_compare() is called with the first parameter as the existing ref in the rbtree. But the newer rb_find_add_cached() expects the cmp() function to have the first parameter as the to-be-added node, thus the new helper prelim_ref_rb_add_cmp() need to adapt this new order. Signed-off-by: Roger L. Beckermeyer III <beckerlee3@gmail.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: remove detached list from struct btrfs_backref_cacheJosef Bacik1-2/+0
We don't ever look at this list, remove it. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: remove the ->lowest and ->leaves members from struct btrfs_backref_nodeJosef Bacik1-19/+0
Before we were keeping all of our nodes on various lists in order to make sure everything got cleaned up correctly. We used node->lowest to indicate that node->lower was linked into the cache->leaves list. Now that we do cleanup based on the rb-tree both the list and the flag are useless, so delete them both. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: simplify btrfs_backref_release_cache()Josef Bacik1-20/+2
We rely on finding all our nodes on the various lists in the backref cache, when they are all also in the rbtree. Instead just search through the rbtree and free everything. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: do not handle non-shareable roots in backref cacheJosef Bacik1-27/+23
Now that we handle relocation for non-shareable roots without using the backref cache, remove the ->cowonly field from the backref nodes and update the handling to throw an error. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: remove the changed list for backref cacheJosef Bacik1-2/+0
Now that we're not updating the backref cache when we switch transids we can remove the changed list. We're going to keep the new_bytenr field because it serves as a good sanity check for the backref cache and relocation, and can prevent us from making extent tree corruption worse. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: pass fs_info to functions that search for delayed ref headsFilipe Manana1-1/+2
One of the following patches in the series will need to access fs_info in the function find_ref_head(), so pass a fs_info argument to it as well as to the functions btrfs_select_ref_head() and btrfs_find_delayed_ref_head() which call find_ref_head(). Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-01btrfs: drop the backref cache during relocation if we commitJosef Bacik1-4/+8
Since the inception of relocation we have maintained the backref cache across transaction commits, updating the backref cache with the new bytenr whenever we COWed blocks that were in the cache, and then updating their bytenr once we detected a transaction id change. This works as long as we're only ever modifying blocks, not changing the structure of the tree. However relocation does in fact change the structure of the tree. For example, if we are relocating a data extent, we will look up all the leaves that point to this data extent. We will then call do_relocation() on each of these leaves, which will COW down to the leaf and then update the file extent location. But, a key feature of do_relocation() is the pending list. This is all the pending nodes that we modified when we updated the file extent item. We will then process all of these blocks via finish_pending_nodes, which calls do_relocation() on all of the nodes that led up to that leaf. The purpose of this is to make sure we don't break sharing unless we absolutely have to. Consider the case that we have 3 snapshots that all point to this leaf through the same nodes, the initial COW would have created a whole new path. If we did this for all 3 snapshots we would end up with 3x the number of nodes we had originally. To avoid this we will cycle through each of the snapshots that point to each of these nodes and update their pointers to point at the new nodes. Once we update the pointer to the new node we will drop the node we removed the link for and all of its children via btrfs_drop_subtree(). This is essentially just btrfs_drop_snapshot(), but for an arbitrary point in the snapshot. The problem with this is that we will never reflect this in the backref cache. If we do this btrfs_drop_snapshot() for a node that is in the backref tree, we will leave the node in the backref tree. This becomes a problem when we change the transid, as now the backref cache has entire subtrees that no longer exist, but exist as if they still are pointed to by the same roots. In the best case scenario you end up with "adding refs to an existing tree ref" errors from insert_inline_extent_backref(), where we attempt to link in nodes on roots that are no longer valid. Worst case you will double free some random block and re-use it when there's still references to the block. This is extremely subtle, and the consequences are quite bad. There isn't a way to make sure our backref cache is consistent between transid's. In order to fix this we need to simply evict the entire backref cache anytime we cross transid's. This reduces performance in that we have to rebuild this backref cache every time we change transid's, but fixes the bug. This has existed since relocation was added, and is a pretty critical bug. There's a lot more cleanup that can be done now that this functionality is going away, but this patch is as small as possible in order to fix the problem and make it easy for us to backport it to all the kernels it needs to be backported to. Followup series will dismantle more of this code and simplify relocation drastically to remove this functionality. We have a reproducer that reproduced the corruption within a few minutes of running. With this patch it survives several iterations/hours of running the reproducer. Fixes: 3fd0a5585eb9 ("Btrfs: Metadata ENOSPC handling for balance") CC: stable@vger.kernel.org Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10btrfs: constify more pointer parametersDavid Sterba1-3/+3
Continue adding const to parameters. This is for clarity and minor addition to safety. There are some minor effects, in the assembly code and .ko measured on release config. Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: change root->root_key.objectid to btrfs_root_id()Josef Bacik1-4/+4
A comment from Filipe on one of my previous cleanups brought my attention to a new helper we have for getting the root id of a root, which makes it easier to read in the code. The changes where made with the following Coccinelle semantic patch: // <smpl> @@ expression E,E1; @@ ( E->root_key.objectid = E1 | - E->root_key.objectid + btrfs_root_id(E) ) // </smpl> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: stop referencing btrfs_delayed_tree_ref directlyJosef Bacik1-10/+11
We only ever need to use this to get the level of the tree block ref, so use the btrfs_delayed_ref_owner() helper, which returns the level for the given reference. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: stop referencing btrfs_delayed_data_ref directlyJosef Bacik1-5/+2
Now that most of our elements are inside of btrfs_delayed_ref_node directly and we have helpers for the delayed_data_ref bits, go ahead and remove all direct usage of btrfs_delayed_data_ref and use the helpers where needed. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: move ->parent and ->ref_root into btrfs_delayed_ref_nodeJosef Bacik1-8/+4
These two members are shared by both the tree refs and data refs, so move them into btrfs_delayed_ref_node proper. This allows us to greatly simplify the comparison code, as the shared refs always only sort on parent, and the non shared refs always sort first on ref_root, and then only data refs sort on their specific fields. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-18btrfs: fix information leak in btrfs_ioctl_logical_to_ino()Johannes Thumshirn1-9/+3
Syzbot reported the following information leak for in btrfs_ioctl_logical_to_ino(): BUG: KMSAN: kernel-infoleak in instrument_copy_to_user include/linux/instrumented.h:114 [inline] BUG: KMSAN: kernel-infoleak in _copy_to_user+0xbc/0x110 lib/usercopy.c:40 instrument_copy_to_user include/linux/instrumented.h:114 [inline] _copy_to_user+0xbc/0x110 lib/usercopy.c:40 copy_to_user include/linux/uaccess.h:191 [inline] btrfs_ioctl_logical_to_ino+0x440/0x750 fs/btrfs/ioctl.c:3499 btrfs_ioctl+0x714/0x1260 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:904 [inline] __se_sys_ioctl+0x261/0x450 fs/ioctl.c:890 __x64_sys_ioctl+0x96/0xe0 fs/ioctl.c:890 x64_sys_call+0x1883/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:17 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Uninit was created at: __kmalloc_large_node+0x231/0x370 mm/slub.c:3921 __do_kmalloc_node mm/slub.c:3954 [inline] __kmalloc_node+0xb07/0x1060 mm/slub.c:3973 kmalloc_node include/linux/slab.h:648 [inline] kvmalloc_node+0xc0/0x2d0 mm/util.c:634 kvmalloc include/linux/slab.h:766 [inline] init_data_container+0x49/0x1e0 fs/btrfs/backref.c:2779 btrfs_ioctl_logical_to_ino+0x17c/0x750 fs/btrfs/ioctl.c:3480 btrfs_ioctl+0x714/0x1260 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:904 [inline] __se_sys_ioctl+0x261/0x450 fs/ioctl.c:890 __x64_sys_ioctl+0x96/0xe0 fs/ioctl.c:890 x64_sys_call+0x1883/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:17 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Bytes 40-65535 of 65536 are uninitialized Memory access of size 65536 starts at ffff888045a40000 This happens, because we're copying a 'struct btrfs_data_container' back to user-space. This btrfs_data_container is allocated in 'init_data_container()' via kvmalloc(), which does not zero-fill the memory. Fix this by using kvzalloc() which zeroes out the memory on allocation. CC: stable@vger.kernel.org # 4.14+ Reported-by: <syzbot+510a1abbb8116eeb341d@syzkaller.appspotmail.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <Johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-05btrfs: remove SLAB_MEM_SPREAD flag useChengming Zhou1-4/+1
The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was removed as of v6.8-rc1, so it became a dead flag since the commit 16a1d968358a ("mm/slab: remove mm/slab.c and slab_def.h"). And the series[1] went on to mark it obsolete to avoid confusion for users. Here we can just remove all its users, which has no functional change. [1] https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: uninline some static inline helpers from backref.hDavid Sterba1-0/+90
There are many helpers doing simple things but not simple enough to justify the static inline. None of them seems to be on a hot path so move them to .c. Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: open code btrfs_backref_get_eb()David Sterba1-2/+2
The helper is trivial, we can inline it. It's safe to remove the 'if' as the iterator is always valid when used, the potential NULL was never checked anyway. Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: delete pointless BUG_ONs on extent item sizeDavid Sterba1-3/+0
Checking extent item size in add_inline_refs() is redundant, we do that already in tree-checker after reading the extent buffer and it won't change under normal circumstances. It was added long ago in 8da6d5815c592b ("Btrfs: added btrfs_find_all_roots()") and does not seem to have a clear purpose. Similar case in extent_from_logical(), added in a542ad1bafc7df ("btrfs: added helper functions to iterate backrefs"). Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: handle invalid extent item reference found in extent_from_logical()David Sterba1-0/+11
The extent_from_logical() helper looks up an extent item by a key, allowing to do an inexact search when key->offset is -1. It's never expected to find such item, as it would break the allowed range of a extent item offset. The same error is already handled in btrfs_backref_iter_start() so add a comment for consistency. Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: update comment and drop assertion in extent item lookup in ↵David Sterba1-2/+4
find_parent_nodes() Same comment was added to this type of error, unify that and drop the assertion as we'd find out quickly that something is wrong after returning -EUCLEAN. Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-30Merge tag 'for-6.7-tag' of ↵Linus Torvalds1-1/+4
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "New features: - raid-stripe-tree New tree for logical file extent mapping where the physical mapping may not match on multiple devices. This is now used in zoned mode to implement RAID0/RAID1* profiles, but can be used in non-zoned mode as well. The support for RAID56 is in development and will eventually fix the problems with the current implementation. This is a backward incompatible feature and has to be enabled at mkfs time. - simple quota accounting (squota) A simplified mode of qgroup that accounts all space on the initial extent owners (a subvolume), the snapshots are then cheap to create and delete. The deletion of snapshots in fully accounting qgroups is a known CPU/IO performance bottleneck. The squota is not suitable for the general use case but works well for containers where the original subvolume exists for the whole time. This is a backward incompatible feature as it needs extending some structures, but can be enabled on an existing filesystem. - temporary filesystem fsid (temp_fsid) The fsid identifies a filesystem and is hard coded in the structures, which disallows mounting the same fsid found on different devices. For a single device filesystem this is not strictly necessary, a new temporary fsid can be generated on mount e.g. after a device is cloned. This will be used by Steam Deck for root partition A/B testing, or can be used for VM root images. Other user visible changes: - filesystems with partially finished metadata_uuid conversion cannot be mounted anymore and the uuid fixup has to be done by btrfs-progs (btrfstune). Performance improvements: - reduce reservations for checksum deletions (with enabled free space tree by factor of 4), on a sample workload on file with many extents the deletion time decreased by 12% - make extent state merges more efficient during insertions, reduce rb-tree iterations (run time of critical functions reduced by 5%) Core changes: - the integrity check functionality has been removed, this was a debugging feature and removal does not affect other integrity checks like checksums or tree-checker - space reservation changes: - more efficient delayed ref reservations, this avoids building up too much work or overusing or exhausting the global block reserve in some situations - move delayed refs reservation to the transaction start time, this prevents some ENOSPC corner cases related to exhaustion of global reserve - improvements in reducing excessive reservations for block group items - adjust overcommit logic in near full situations, account for one more chunk to eventually allocate metadata chunk, this is mostly relevant for small filesystems (<10GiB) - single device filesystems are scanned but not registered (except seed devices), this allows temp_fsid to work - qgroup iterations do not need GFP_ATOMIC allocations anymore - cleanups, refactoring, reduced data structure size, function parameter simplifications, error handling fixes" * tag 'for-6.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (156 commits) btrfs: open code timespec64 in struct btrfs_inode btrfs: remove redundant log root tree index assignment during log sync btrfs: remove redundant initialization of variable dirty in btrfs_update_time() btrfs: sysfs: show temp_fsid feature btrfs: disable the device add feature for temp-fsid btrfs: disable the seed feature for temp-fsid btrfs: update comment for temp-fsid, fsid, and metadata_uuid btrfs: remove pointless empty log context list check when syncing log btrfs: update comment for struct btrfs_inode::lock btrfs: remove pointless barrier from btrfs_sync_file() btrfs: add and use helpers for reading and writing last_trans_committed btrfs: add and use helpers for reading and writing fs_info->generation btrfs: add and use helpers for reading and writing log_transid btrfs: add and use helpers for reading and writing last_log_commit btrfs: support cloned-device mount capability btrfs: add helper function find_fsid_by_disk btrfs: stop reserving excessive space for block group item insertions btrfs: stop reserving excessive space for block group item updates btrfs: reorder btrfs_inode to fill gaps btrfs: open code btrfs_ordered_inode_tree in btrfs_inode ...
2023-10-23btrfs: fix unwritten extent buffer after snapshotting a new subvolumeFilipe Manana1-5/+9
When creating a snapshot of a subvolume that was created in the current transaction, we can end up not persisting a dirty extent buffer that is referenced by the snapshot, resulting in IO errors due to checksum failures when trying to read the extent buffer later from disk. A sequence of steps that leads to this is the following: 1) At ioctl.c:create_subvol() we allocate an extent buffer, with logical address 36007936, for the leaf/root of a new subvolume that has an ID of 291. We mark the extent buffer as dirty, and at this point the subvolume tree has a single node/leaf which is also its root (level 0); 2) We no longer commit the transaction used to create the subvolume at create_subvol(). We used to, but that was recently removed in commit 1b53e51a4a8f ("btrfs: don't commit transaction for every subvol create"); 3) The transaction used to create the subvolume has an ID of 33, so the extent buffer 36007936 has a generation of 33; 4) Several updates happen to subvolume 291 during transaction 33, several files created and its tree height changes from 0 to 1, so we end up with a new root at level 1 and the extent buffer 36007936 is now a leaf of that new root node, which is extent buffer 36048896. The commit root remains as 36007936, since we are still at transaction 33; 5) Creation of a snapshot of subvolume 291, with an ID of 292, starts at ioctl.c:create_snapshot(). This triggers a commit of transaction 33 and we end up at transaction.c:create_pending_snapshot(), in the critical section of a transaction commit. There we COW the root of subvolume 291, which is extent buffer 36048896. The COW operation returns extent buffer 36048896, since there's no need to COW because the extent buffer was created in this transaction and it was not written yet. The we call btrfs_copy_root() against the root node 36048896. During this operation we allocate a new extent buffer to turn into the root node of the snapshot, copy the contents of the root node 36048896 into this snapshot root extent buffer, set the owner to 292 (the ID of the snapshot), etc, and then we call btrfs_inc_ref(). This will create a delayed reference for each leaf pointed by the root node with a reference root of 292 - this includes a reference for the leaf 36007936. After that we set the bit BTRFS_ROOT_FORCE_COW in the root's state. Then we call btrfs_insert_dir_item(), to create the directory entry in in the tree of subvolume 291 that points to the snapshot. This ends up needing to modify leaf 36007936 to insert the respective directory items. Because the bit BTRFS_ROOT_FORCE_COW is set for the root's state, we need to COW the leaf. We end up at btrfs_force_cow_block() and then at update_ref_for_cow(). At update_ref_for_cow() we call btrfs_block_can_be_shared() which returns false, despite the fact the leaf 36007936 is shared - the subvolume's root and the snapshot's root point to that leaf. The reason that it incorrectly returns false is because the commit root of the subvolume is extent buffer 36007936 - it was the initial root of the subvolume when we created it. So btrfs_block_can_be_shared() which has the following logic: int btrfs_block_can_be_shared(struct btrfs_root *root, struct extent_buffer *buf) { if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state) && buf != root->node && buf != root->commit_root && (btrfs_header_generation(buf) <= btrfs_root_last_snapshot(&root->root_item) || btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC))) return 1; return 0; } Returns false (0) since 'buf' (extent buffer 36007936) matches the root's commit root. As a result, at update_ref_for_cow(), we don't check for the number of references for extent buffer 36007936, we just assume it's not shared and therefore that it has only 1 reference, so we set the local variable 'refs' to 1. Later on, in the final if-else statement at update_ref_for_cow(): static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct extent_buffer *buf, struct extent_buffer *cow, int *last_ref) { (...) if (refs > 1) { (...) } else { (...) btrfs_clear_buffer_dirty(trans, buf); *last_ref = 1; } } So we mark the extent buffer 36007936 as not dirty, and as a result we don't write it to disk later in the transaction commit, despite the fact that the snapshot's root points to it. Attempting to access the leaf or dumping the tree for example shows that the extent buffer was not written: $ btrfs inspect-internal dump-tree -t 292 /dev/sdb btrfs-progs v6.2.2 file tree key (292 ROOT_ITEM 33) node 36110336 level 1 items 2 free space 119 generation 33 owner 292 node 36110336 flags 0x1(WRITTEN) backref revision 1 checksum stored a8103e3e checksum calced a8103e3e fs uuid 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79 chunk uuid e8c9c885-78f4-4d31-85fe-89e5f5fd4a07 key (256 INODE_ITEM 0) block 36007936 gen 33 key (257 EXTENT_DATA 0) block 36052992 gen 33 checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 total bytes 107374182400 bytes used 38572032 uuid 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79 The respective on disk region is full of zeroes as the device was trimmed at mkfs time. Obviously 'btrfs check' also detects and complains about this: $ btrfs check /dev/sdb Opening filesystem to check... Checking filesystem on /dev/sdb UUID: 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79 generation: 33 (33) [1/7] checking root items [2/7] checking extents checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 bad tree block 36007936, bytenr mismatch, want=36007936, have=0 owner ref check failed [36007936 4096] ERROR: errors found in extent allocation tree or chunk allocation [3/7] checking free space tree [4/7] checking fs roots checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 bad tree block 36007936, bytenr mismatch, want=36007936, have=0 The following tree block(s) is corrupted in tree 292: tree block bytenr: 36110336, level: 1, node key: (256, 1, 0) root 292 root dir 256 not found ERROR: errors found in fs roots found 38572032 bytes used, error(s) found total csum bytes: 16048 total tree bytes: 1265664 total fs tree bytes: 1118208 total extent tree bytes: 65536 btree space waste bytes: 562598 file data blocks allocated: 65978368 referenced 36569088 Fix this by updating btrfs_block_can_be_shared() to consider that an extent buffer may be shared if it matches the commit root and if its generation matches the current transaction's generation. This can be reproduced with the following script: $ cat test.sh #!/bin/bash MNT=/mnt/sdi DEV=/dev/sdi # Use a filesystem with a 64K node size so that we have the same node # size on every machine regardless of its page size (on x86_64 default # node size is 16K due to the 4K page size, while on PPC it's 64K by # default). This way we can make sure we are able to create a btree for # the subvolume with a height of 2. mkfs.btrfs -f -n 64K $DEV mount $DEV $MNT btrfs subvolume create $MNT/subvol # Create a few empty files on the subvolume, this bumps its btree # height to 2 (root node at level 1 and 2 leaves). for ((i = 1; i <= 300; i++)); do echo -n > $MNT/subvol/file_$i done btrfs subvolume snapshot -r $MNT/subvol $MNT/subvol/snap umount $DEV btrfs check $DEV Running it on a 6.5 kernel (or any 6.6-rc kernel at the moment): $ ./test.sh Create subvolume '/mnt/sdi/subvol' Create a readonly snapshot of '/mnt/sdi/subvol' in '/mnt/sdi/subvol/snap' Opening filesystem to check... Checking filesystem on /dev/sdi UUID: bbdde2ff-7d02-45ca-8a73-3c36f23755a1 [1/7] checking root items [2/7] checking extents parent transid verify failed on 30539776 wanted 7 found 5 parent transid verify failed on 30539776 wanted 7 found 5 parent transid verify failed on 30539776 wanted 7 found 5 Ignoring transid failure owner ref check failed [30539776 65536] ERROR: errors found in extent allocation tree or chunk allocation [3/7] checking free space tree [4/7] checking fs roots parent transid verify failed on 30539776 wanted 7 found 5 Ignoring transid failure Wrong key of child node/leaf, wanted: (256, 1, 0), have: (2, 132, 0) Wrong generation of child node/leaf, wanted: 5, have: 7 root 257 root dir 256 not found ERROR: errors found in fs roots found 917504 bytes used, error(s) found total csum bytes: 0 total tree bytes: 851968 total fs tree bytes: 393216 total extent tree bytes: 65536 btree space waste bytes: 736550 file data blocks allocated: 0 referenced 0 A test case for fstests will follow soon. Fixes: 1b53e51a4a8f ("btrfs: don't commit transaction for every subvol create") CC: stable@vger.kernel.org # 6.5+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: switch btrfs_backref_cache::is_reloc to boolDavid Sterba1-1/+1
The btrfs_backref_cache::is_reloc is an indicator variable and should use a bool type. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: new inline ref storing owning subvol of data extentsBoris Burkov1-0/+3
In order to implement simple quota groups, we need to be able to associate a data extent with the subvolume that created it. Once you account for reflink, this information cannot be recovered without explicitly storing it. Options for storing it are: - a new key/item - a new extent inline ref item The former is backwards compatible, but wastes space, the latter is incompat, but is efficient in space and reuses the existing inline ref machinery, while only abusing it a tiny amount -- specifically, the new item is not a ref, per-se. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: remove v0 extent handlingQu Wenruo1-17/+12
The v0 extent item has been deprecated for a long time, and we don't have any report from the community either. So it's time to remove the v0 extent specific error handling, and just treat them as regular extent tree corruption. This patch would remove the btrfs_print_v0_err() helper, and enhance the involved error handling to treat them just as any extent tree corrupt