path: root/fs/btrfs/tree-checker.c
Age  Commit message  Author  Files  Lines
2026-04-07btrfs: tree-checker: add remap-tree checks to check_block_group_item()Mark Harmstone1-0/+41
Add some write-time checks for block group items relating to the remap tree. Here we're checking: * That the REMAPPED or METADATA_REMAP flags aren't set unless the REMAP_TREE incompat flag is also set * That `remap_bytes` isn't more than the size of the block group * That `identity_remap_count` isn't more than the number of sectors in the block group Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
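The three block group checks listed above can be sketched in user-space C roughly as follows. This is a minimal illustration, not the kernel code: the struct, the flag bit values, and every `sketch_`/`SKETCH_` name are hypothetical stand-ins, and the width of `identity_remap_count` is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits, for illustration only; not the on-disk values. */
#define SKETCH_BG_REMAPPED       (1ULL << 0)
#define SKETCH_BG_METADATA_REMAP (1ULL << 1)

/* Hypothetical, simplified view of a block group item. */
struct sketch_block_group {
	uint64_t length;               /* size of the block group in bytes */
	uint64_t flags;
	uint64_t remap_bytes;
	uint64_t identity_remap_count; /* field width is an assumption */
};

/* Returns true if the item passes the three remap-related checks. */
static bool sketch_check_block_group(const struct sketch_block_group *bg,
				     bool has_remap_tree, uint32_t sectorsize)
{
	const uint64_t remap_flags = SKETCH_BG_REMAPPED | SKETCH_BG_METADATA_REMAP;

	assert(sectorsize != 0);

	/* Remap flags are only legal when the incompat feature is enabled. */
	if (!has_remap_tree && (bg->flags & remap_flags))
		return false;
	/* Remapped data cannot exceed the size of the block group. */
	if (bg->remap_bytes > bg->length)
		return false;
	/* At most one identity remap per sector in the block group. */
	if (bg->identity_remap_count > bg->length / sectorsize)
		return false;
	return true;
}
```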
2026-04-07btrfs: tree-checker: check remap-tree flags in btrfs_check_chunk_valid()Mark Harmstone1-0/+14
Add a check to btrfs_check_chunk_valid() that the METADATA_REMAP and REMAPPED flags are only set if the REMAP_TREE incompat flag is also set. Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: tree-checker: add checker for items in remap treeMark Harmstone1-0/+70
Add write-time checking of items in the remap tree, to catch errors before they are written to disk. We're checking: * That remap items, remap backrefs, and identity remaps aren't written unless the REMAP_TREE incompat flag is set * That identity remaps have a size of 0 * That remap items and remap backrefs have a size of sizeof(struct btrfs_remap_item) * That the objectid for these items is aligned to the sector size * That the offset for these items (i.e. the size of the remapping) isn't 0 and is aligned to the sector size * That objectid + offset doesn't overflow Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
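As a rough user-space sketch of the remap tree item rules above: the enum, the function name, and the `remap_item_size` parameter (standing in for `sizeof(struct btrfs_remap_item)`, whose layout is not given here) are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical item kinds; the real key types are defined by btrfs. */
enum sketch_remap_kind {
	SKETCH_IDENTITY_REMAP,
	SKETCH_REMAP,
	SKETCH_REMAP_BACKREF,
};

/*
 * objectid/offset are the key fields: the start and the size of the
 * remapping. remap_item_size is a stand-in for the real structure size.
 */
static bool sketch_check_remap_item(enum sketch_remap_kind kind,
				    uint64_t objectid, uint64_t offset,
				    uint32_t item_size, uint32_t remap_item_size,
				    uint32_t sectorsize, bool has_remap_tree)
{
	/* Identity remaps carry no payload; the other kinds carry one struct. */
	uint32_t expected = (kind == SKETCH_IDENTITY_REMAP) ? 0 : remap_item_size;

	assert(sectorsize != 0);

	if (!has_remap_tree)
		return false;
	if (item_size != expected)
		return false;
	if (objectid % sectorsize)
		return false;
	if (offset == 0 || offset % sectorsize)
		return false;
	/* objectid + offset must not wrap around. */
	if (objectid > UINT64_MAX - offset)
		return false;
	return true;
}
```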
2026-04-07btrfs: tree-checker: introduce checks for FREE_SPACE_BITMAPZhengYuan Huang1-0/+43
Introduce checks for FREE_SPACE_BITMAP item, which include: - Key alignment check Same as FREE_SPACE_EXTENT, the objectid is the logical bytenr of the free space, and offset is the length of the free space, so both should be aligned to the fs block size. - Non-zero range check A zero key->offset would describe an empty bitmap, which is invalid. - Item size check The item must hold exactly DIV_ROUND_UP(key->offset >> sectorsize_bits, BITS_PER_BYTE) bytes. A mismatch indicates a truncated or otherwise corrupt bitmap item; without this check, the bitmap loading path would walk past the end of the leaf and trigger a NULL dereference in assert_eb_folio_uptodate(). Signed-off-by: ZhengYuan Huang <gality369@gmail.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
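The item size rule is the least obvious of the three, so here is a small user-space sketch of the arithmetic: one bit per fs block, rounded up to whole bytes. The function names are hypothetical and the code only mirrors the formula quoted in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* DIV_ROUND_UP(key_offset >> sectorsize_bits, 8): one bit per fs block. */
static uint64_t sketch_bitmap_bytes(uint64_t key_offset, unsigned int sectorsize_bits)
{
	uint64_t blocks = key_offset >> sectorsize_bits;

	return (blocks + 7) / 8;
}

/* Alignment, non-zero range and exact item size, in that order. */
static bool sketch_check_free_space_bitmap(uint64_t objectid, uint64_t offset,
					   uint32_t item_size,
					   unsigned int sectorsize_bits)
{
	uint64_t mask;

	assert(sectorsize_bits < 64);
	mask = (1ULL << sectorsize_bits) - 1;

	if ((objectid & mask) || (offset & mask))
		return false;
	if (offset == 0)
		return false;
	return item_size == sketch_bitmap_bytes(offset, sectorsize_bits);
}
```

For example, with 4 KiB blocks (`sectorsize_bits` of 12) a range of 9 blocks needs 2 bytes of bitmap, so an item of any other size would be rejected.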
2026-04-07btrfs: tree-checker: introduce checks for FREE_SPACE_EXTENTQu Wenruo1-0/+29
Introduce FREE_SPACE_EXTENT checks, which include: - The key alignment check The objectid is the logical bytenr of the free space, and offset is the length of the free space, thus they should all be aligned to the fs block size. - The item size check The FREE_SPACE_EXTENT item should have a size of zero. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: tree-checker: introduce checks for FREE_SPACE_INFOQu Wenruo1-0/+50
Introduce checks for FREE_SPACE_INFO item, which include: - Key alignment check The objectid is the logical bytenr of the chunk/bg, and offset is the length of the chunk/bg, thus they should all be aligned to the fs block size. - Item size check The FREE_SPACE_INFO item should have a fixed size. - Flags check The flags member should have no other flags than BTRFS_FREE_SPACE_USING_BITMAPS. For future expansion, introduce a new macro BTRFS_FREE_SPACE_FLAGS_MASK for such checks. And since we're here, the BTRFS_FREE_SPACE_USING_BITMAPS should not use unsigned long long, as the flags field is only 32 bits wide. So fix that to use unsigned long. - Extent count check That member shows how many free space bitmap/extent items there are inside the chunk/bg. We know the chunk size (from key->offset), thus there should be at most (key->offset >> sectorsize_bits) blocks inside the chunk. Use that value as the upper limit; if the counter is larger than that, there is a high chance it's a bitflip in the high bits. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
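A minimal sketch of the flags and extent count checks, assuming a 32-bit flags field as the message describes. The `SKETCH_` macros and their bit values are hypothetical placeholders for BTRFS_FREE_SPACE_USING_BITMAPS and BTRFS_FREE_SPACE_FLAGS_MASK, not the kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real flag and mask macros. */
#define SKETCH_FREE_SPACE_USING_BITMAPS (1U << 0)
#define SKETCH_FREE_SPACE_FLAGS_MASK    (SKETCH_FREE_SPACE_USING_BITMAPS)

static bool sketch_check_free_space_info(uint64_t key_offset,
					 uint32_t extent_count, uint32_t flags,
					 unsigned int sectorsize_bits)
{
	assert(sectorsize_bits < 64);

	/* Any bit outside the mask is unknown and therefore rejected. */
	if (flags & ~SKETCH_FREE_SPACE_FLAGS_MASK)
		return false;
	/* There cannot be more extents/bitmaps than blocks in the chunk. */
	if (extent_count > (key_offset >> sectorsize_bits))
		return false;
	return true;
}
```

With a 1 GiB chunk and 4 KiB blocks the limit is 262144, so a counter whose high bit has flipped (for example 0x80000003) is far above the bound and gets caught.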
2026-03-17btrfs: reject root items with drop_progress and zero drop_levelZhengYuan Huang1-0/+17
[BUG] When recovering relocation at mount time, merge_reloc_root() and btrfs_drop_snapshot() both use BUG_ON(level == 0) to guard against an impossible state: a non-zero drop_progress combined with a zero drop_level in a root_item, which can be triggered: ------------[ cut here ]------------ kernel BUG at fs/btrfs/relocation.c:1545! Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI CPU: 1 UID: 0 PID: 283 ... Tainted: 6.18.0+ #16 PREEMPT(voluntary) Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE Hardware name: QEMU Ubuntu 24.04 PC v2, BIOS 1.16.3-debian-1.16.3-2 RIP: 0010:merge_reloc_root+0x1266/0x1650 fs/btrfs/relocation.c:1545 Code: ffff0000 00004589 d7e9acfa ffffe8a1 79bafebe 02000000 Call Trace: merge_reloc_roots+0x295/0x890 fs/btrfs/relocation.c:1861 btrfs_recover_relocation+0xd6e/0x11d0 fs/btrfs/relocation.c:4195 btrfs_start_pre_rw_mount+0xa4d/0x1810 fs/btrfs/disk-io.c:3130 open_ctree+0x5824/0x5fe0 fs/btrfs/disk-io.c:3640 btrfs_fill_super fs/btrfs/super.c:987 [inline] btrfs_get_tree_super fs/btrfs/super.c:1951 [inline] btrfs_get_tree_subvol fs/btrfs/super.c:2094 [inline] btrfs_get_tree+0x111c/0x2190 fs/btrfs/super.c:2128 vfs_get_tree+0x9a/0x370 fs/super.c:1758 fc_mount fs/namespace.c:1199 [inline] do_new_mount_fc fs/namespace.c:3642 [inline] do_new_mount fs/namespace.c:3718 [inline] path_mount+0x5b8/0x1ea0 fs/namespace.c:4028 do_mount fs/namespace.c:4041 [inline] __do_sys_mount fs/namespace.c:4229 [inline] __se_sys_mount fs/namespace.c:4206 [inline] __x64_sys_mount+0x282/0x320 fs/namespace.c:4206 ... RIP: 0033:0x7f969c9a8fde Code: 0f1f4000 48c7c2b0 fffffff7 d8648902 b8ffffff ffc3660f ---[ end trace 0000000000000000 ]--- The bug is reproducible on 7.0.0-rc2-next-20260310 with our dynamic metadata fuzzing tool that corrupts btrfs metadata at runtime. 
[CAUSE] A non-zero drop_progress.objectid means an interrupted btrfs_drop_snapshot() left a resume point on disk, and in that case drop_level must be greater than 0 because the checkpoint is only saved at internal node levels. Although this invariant is enforced when the kernel writes the root item, it is not validated when the root item is read back from disk. That allows on-disk corruption to provide an invalid state with drop_progress.objectid != 0 and drop_level == 0. When relocation recovery later processes such a root item, merge_reloc_root() reads drop_level and hits BUG_ON(level == 0). The same invalid metadata can also trigger the corresponding BUG_ON() in btrfs_drop_snapshot(). [FIX] Fix this by validating the root_item invariant in tree-checker when reading root items from disk: if drop_progress.objectid is non-zero, drop_level must also be non-zero. Reject such malformed metadata with -EUCLEAN before it reaches merge_reloc_root() or btrfs_drop_snapshot() and triggers the BUG_ON. After the fix, the same corruption is correctly rejected by tree-checker and the BUG_ON is no longer triggered. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: ZhengYuan Huang <gality369@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
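The invariant itself is small enough to sketch in a few lines of user-space C. The struct below is a hypothetical reduction of a root item to the two fields involved, and the `EUCLEAN` fallback value is only there for platforms whose `<errno.h>` lacks it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#ifndef EUCLEAN
#define EUCLEAN 117 /* fallback for platforms that do not define it */
#endif

/* Hypothetical reduction of a root item to the two relevant fields. */
struct sketch_root_item {
	uint64_t drop_progress_objectid;
	uint8_t drop_level;
};

/* A saved resume point implies an internal node level, i.e. level > 0. */
static int sketch_check_root_drop_state(const struct sketch_root_item *ri)
{
	assert(ri != NULL);

	if (ri->drop_progress_objectid != 0 && ri->drop_level == 0)
		return -EUCLEAN;
	return 0;
}
```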
2026-03-13btrfs: tree-checker: fix misleading root drop_level error messageZhengYuan Huang1-1/+1
Fix tree-checker error message to report "invalid root drop_level" instead of the misleading "invalid root level". Fixes: 259ee7754b67 ("btrfs: tree-checker: Add ROOT_ITEM check") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: ZhengYuan Huang <gality369@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-26btrfs: fix objectid value in error message in check_extent_data_ref()Mark Harmstone1-1/+1
Fix a copy-paste error in check_extent_data_ref(): we're printing root as in the message above, we should be printing objectid. Fixes: f333a3c7e832 ("btrfs: tree-checker: validate dref root and objectid") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-26btrfs: fix incorrect key offset in error message in check_dev_extent_item()Mark Harmstone1-1/+1
Fix the error message in check_dev_extent_item(), when an overlapping stripe is encountered. For dev extents, objectid is the disk number and offset the physical address, so prev_key->objectid should actually be prev_key->offset. (I can't take any credit for this one - this was discovered by Chris and his friend Claude.) Reported-by: Chris Mason <clm@fb.com> Fixes: 008e2512dc56 ("btrfs: tree-checker: add dev extent item checks") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: add extended version of struct block_group_itemMark Harmstone1-2/+8
Add a struct btrfs_block_group_item_v2, which is used in the block group tree if the remap-tree incompat flag is set. This adds two new fields to the block group item: `remap_bytes` and `identity_remap_count`. `remap_bytes` records the amount of data that's physically within this block group, but nominally in another, remapped block group. This is necessary because this data will need to be moved first if this block group is itself relocated. If `remap_bytes` > 0, this is an indicator to the relocation thread that it will need to search the remap-tree for backrefs. A block group must also have `remap_bytes` == 0 before it can be dropped. `identity_remap_count` records how many identity remap items are located in the remap tree for this block group. When relocation is begun for this block group, this is set to the number of holes in the free-space tree for this range. As identity remaps are converted into actual remaps by the relocation process, this number is decreased. Once it reaches 0, either because of relocation or because extents have been deleted, the block group has been fully remapped and its chunk's device extents are removed. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: allow remapped chunks to have zero stripesMark Harmstone1-20/+35
When a chunk has been fully remapped, we are going to set its num_stripes to 0, as it will no longer represent a physical location on disk. Change tree-checker to allow for this, and fix read_one_chunk() to avoid a divide by zero. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: add METADATA_REMAP chunk typeMark Harmstone1-2/+11
Add a new METADATA_REMAP chunk type, which is a metadata chunk that holds the remap tree. This is needed for bootstrapping purposes: the remap tree can't itself be remapped, and must be relocated the existing way, by COWing every leaf. The remap tree can't go in the SYSTEM chunk as space there is limited, because a copy of the chunk item gets placed in the superblock. The changes in fs/btrfs/volumes.h are because we're adding a new block group type bit after the profile bits, and so can no longer rely on the const_ilog2 trick. The sizing to 32MB per chunk, matching the SYSTEM chunk, is an estimate here, we can adjust it later if it proves to be too big or too small. This works out to be ~500,000 remap items, which for a 4KB block size covers ~2GB of remapped data in the worst case and ~500TB in the best case. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: add definitions and constants for remap-treeMark Harmstone1-4/+2
Add an incompat flag for the new remap-tree feature, and the constants and definitions needed to support it. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: make a few more ASSERTs verboseDavid Sterba1-1/+1
We have support for optional string to be printed in ASSERT() (added in 19468a623a9109 ("btrfs: enhance ASSERT() to take optional format string")), it's not yet everywhere it could be so add a few more files. Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: use the key format macros when printing keysFilipe Manana1-12/+9
Change all locations that print a key to use the new macros to print them in order to ensure a consistent style and avoid repetitive code. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-10-13btrfs: tree-checker: fix bounds check in check_inode_extref()Dan Carpenter1-1/+1
The parentheses for the unlikely() annotation were put in the wrong place so it means that the condition is basically never true and the bounds checking is skipped. Fixes: aab9458b9f00 ("btrfs: tree-checker: add inode extref checks") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
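To see why the misplaced parenthesis disables the check, here is a small user-space illustration of this class of bug. It assumes a GCC or Clang compiler for `__builtin_expect`; `sketch_unlikely` and both function names are hypothetical, and the expressions are simplified from whatever the real code compares.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same shape as the usual branch hint: the result is always 0 or 1. */
#define sketch_unlikely(x) __builtin_expect(!!(x), 0)

/* Buggy: the hint wraps only the sum, so this compares 0 or 1 against end. */
static bool sketch_out_of_bounds_buggy(uint64_t ptr, uint64_t size, uint64_t end)
{
	assert(ptr <= end);
	return sketch_unlikely(ptr + size) > end;
}

/* Fixed: the hint wraps the whole comparison. */
static bool sketch_out_of_bounds_fixed(uint64_t ptr, uint64_t size, uint64_t end)
{
	assert(ptr <= end);
	return sketch_unlikely(ptr + size > end);
}
```

With `ptr = 100`, `size = 50` and `end = 120`, the buggy form evaluates `1 > 120` and reports no problem, while the fixed form correctly flags the overrun.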
2025-09-23btrfs: tree-checker: add inode extref checksQu Wenruo1-0/+37
Like inode refs, inode extrefs have a variable length name, which means we have to do a proper check to make sure no header nor name can exceed the item limits. The check itself is very similar to check_inode_ref(), just a different structure (btrfs_inode_extref vs btrfs_inode_ref). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23btrfs: fix typos in comments and stringsDavid Sterba1-1/+1
Annual typo fixing pass. Strangely codespell found only about 30% of what is in this patch, the rest was done manually using text spellchecker with a custom dictionary of acceptable terms. Reviewed-by: Neal Gompa <neal@gompa.dev> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-18btrfs: tree-checker: fix the incorrect inode ref size checkQu Wenruo1-2/+2
[BUG] Inside check_inode_ref(), we need to make sure every structure, including the btrfs_inode_extref header, is covered by the item. But our code is incorrectly using "sizeof(iref)", where @iref is just a pointer. This means "sizeof(iref)" will always be "sizeof(void *)", which is much smaller than "sizeof(struct btrfs_inode_extref)". This will allow some bad inode extrefs to sneak in, defeating tree-checker. [FIX] Fix the typo by calling "sizeof(*iref)", which is the same as "sizeof(struct btrfs_inode_extref)", and will be the correct behavior we want. Fixes: 71bf92a9b877 ("btrfs: tree-checker: Add check for INODE_REF") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
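The difference between the two `sizeof` forms can be shown with a short user-space sketch. The header struct here is a hypothetical stand-in, not the on-disk layout; the only property that matters is that it is larger than a pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header followed by a variable-length name. */
struct sketch_ref_header {
	uint64_t index;
	uint16_t name_len;
};

/* Buggy: sizeof(iref) is the size of a pointer, not of the header. */
static bool sketch_header_fits_buggy(const struct sketch_ref_header *iref,
				     size_t bytes_left)
{
	assert(iref != NULL);
	return sizeof(iref) <= bytes_left;
}

/* Fixed: sizeof(*iref) is the size of the structure being pointed to. */
static bool sketch_header_fits_fixed(const struct sketch_ref_header *iref,
				     size_t bytes_left)
{
	assert(iref != NULL);
	return sizeof(*iref) <= bytes_left;
}
```

With 10 bytes left in the item, the buggy variant accepts a header that cannot actually fit, which is how truncated refs could slip past the checker.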
2025-07-21btrfs: add btrfs prefix to is_fstree() and make it return boolFilipe Manana1-6/+6
This is an exported function and therefore it should have a 'btrfs_' prefix, to make it clear it's btrfs specific, avoid future name collisions with code outside btrfs, and make its naming consistent with most other btrfs exported functions. So add a 'btrfs_' prefix to it and make it return bool instead of int, since all we need is to return true or false. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: convert WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG)) to DEBUG_WARNDavid Sterba1-5/+3
Use the conditional warning instead of typing the whole condition. Optional message is printed where it seems clear what could be the problem. Conversion is left out in btree_csum_one_bio() because of the additional condition. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: tree-checker: more unlikely annotationsDavid Sterba1-7/+7
Add more unlikely annotations to branches that lead to EUCLEAN, overall in the tree checker this helps to reorder instructions for the no-error case. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-04-17btrfs: tree-checker: adjust error code for header level checkDavid Sterba1-1/+1
The whole tree checker returns EUCLEAN, except the one check in btrfs_verify_level_key(). This was inherited from the function that was moved from disk-io.c in 2cac5af16537 ("btrfs: move btrfs_verify_level_key into tree-checker.c") but this should be unified with the rest. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: validate system chunk array at btrfs_validate_super()Qu Wenruo1-42/+54
Currently btrfs_validate_super() only does a very basic check on the array chunk size (not too large than the available space, but not too small to contain no chunk). The more comprehensive checks (the regular chunk checks and size check inside the system chunk array) are all done inside btrfs_read_sys_array(). It's not a big deal, but it also means we do not do any validation on the system chunk array at super block writeback time either. Do the following modification to centralize the system chunk array checks into btrfs_validate_super(): - Make chunk_err() helper accept stack chunk pointer If @leaf parameter is NULL, then the @chunk pointer will be a pointer to the chunk item, other than the offset inside the leaf. And since @leaf can be NULL, add a new @fs_info parameter for that case. - Make btrfs_check_chunk_valid() handle stack chunk pointer The same as chunk_err(), a new @fs_info parameter, and if @leaf is NULL, then @chunk will be a pointer to a stack chunk. If @chunk is NULL, then all needed btrfs_chunk members will be read using the stack helper instead of the leaf helper. This means we need to read out all the needed member at the beginning of the function. Furthermore, at super block read time, fs_info->sectorsize is not yet initialized, we need one extra @sectorsize parameter to grab the correct sectorsize. - Introduce a helper validate_sys_chunk_array() * Validate the disk key. * Validate the size before we access the full chunk items. * Do the full chunk item validation. - Call validate_sys_chunk_array() at btrfs_validate_super() - Simplify the checks inside btrfs_read_sys_array() Now the checks will be converted to an ASSERT(). - Simplify the checks inside read_one_chunk() Now that all chunk items inside system chunk array and chunk tree are verified, there is no need to verify them again inside read_one_chunk(). 
This change has the following advantages: - More comprehensive checks at write time And unlike the sys_chunk_array read routine, this time we do not need to allocate a dummy extent buffer to do the check. All the checks done here require no new memory allocation. - Slightly improved readability when iterating the system chunk array Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-17btrfs: tree-checker: reject inline extent items with 0 ref countQu Wenruo1-1/+26
[BUG] There is a bug report in the mailing list where btrfs_run_delayed_refs() failed to drop the ref count for logical 25870311358464 num_bytes 2113536. The involved leaf dump looks like this: item 166 key (25870311358464 168 2113536) itemoff 10091 itemsize 50 extent refs 1 gen 84178 flags 1 ref#0: shared data backref parent 32399126528000 count 0 <<< ref#1: shared data backref parent 31808973717504 count 1 Notice the count number is 0. [CAUSE] There is no concrete evidence yet, but considering 0 -> 1 is also a single bit flipped, it's possible that hardware memory bitflip is involved, causing the on-disk extent tree to be corrupted. [FIX] To prevent us reading such corrupted extent item, or writing such damaged extent item back to disk, enhance the handling of BTRFS_EXTENT_DATA_REF_KEY and BTRFS_SHARED_DATA_REF_KEY keys for both inlined and key items, to detect such 0 ref count and reject them. CC: stable@vger.kernel.org # 5.4+ Link: https://lore.kernel.org/linux-btrfs/7c69dd49-c346-4806-86e7-e6f863a66f48@app.fastmail.com/ Reported-by: Frankie Fisher <frankie@terrorise.me.uk> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
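In outline, the added rule is that no data backref may carry a count of zero. A minimal sketch, assuming the counts have already been extracted into a plain array (the function name and that representation are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * counts[] holds the "count" member of each data backref found in one
 * extent item. A zero count can never be valid for an existing backref.
 */
static bool sketch_check_data_ref_counts(const uint32_t *counts, size_t n)
{
	assert(counts != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (counts[i] == 0)
			return false;
	}
	return true;
}
```

Applied to the leaf dump above, the counts `{0, 1}` for ref#0 and ref#1 would be rejected because of the first entry.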
2024-11-11btrfs: simplify arguments for btrfs_verify_level_key()Filipe Manana1-8/+8
The only caller of btrfs_verify_level_key() is read_block_for_search() and it's passing 3 arguments to it that can be extracted from its on stack variable of type struct btrfs_tree_parent_check. So change btrfs_verify_level_key() to accept an argument of type struct btrfs_tree_parent_check instead of level, first key and parent transid arguments. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-17btrfs: tree-checker: fix the wrong output of data backref objectidQu Wenruo1-1/+1
[BUG] There are some reports about invalid data backref objectids, the reports look like this: BTRFS critical (device sda): corrupt leaf: block=333654787489792 slot=110 extent bytenr=333413935558656 len=65536 invalid data ref objectid value 2543 The data ref objectid is the inode number inside the subvolume. But in the above case, the value is completely sane, not really showing the problem. [CAUSE] The root cause of the problem is the deprecated feature, inode cache. This feature results in a special inode number, -12ULL, and it's no longer recognized by tree-checker, triggering the error. The direct problem here is the output of data ref objectid. The value shown is in fact the dref_root (subvolume id), not the dref_objectid (inode number). [FIX] Fix the output to use dref_objectid instead. Reported-by: Neil Parton <njparton@gmail.com> Reported-by: Archange <archange@archlinux.org> Link: https://lore.kernel.org/linux-btrfs/CAAYHqBbrrgmh6UmW3ANbysJX9qG9Pbg3ZwnKsV=5mOpv_qix_Q@mail.gmail.com/ Link: https://lore.kernel.org/linux-btrfs/9541deea-9056-406e-be16-a996b549614d@archlinux.org/ Fixes: f333a3c7e832 ("btrfs: tree-checker: validate dref root and objectid") CC: stable@vger.kernel.org # 6.11 Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-15btrfs: tree-checker: add dev extent item checksQu Wenruo1-0/+69
[REPORT] There is a corruption report that btrfs refused to mount a fs that has overlapping dev extents: BTRFS error (device sdc): dev extent devid 4 physical offset 14263979671552 overlap with previous dev extent end 14263980982272 BTRFS error (device sdc): failed to verify dev extents against chunks: -117 BTRFS error (device sdc): open_ctree failed [CAUSE] The direct cause is very obvious, there is a bad dev extent item with incorrect length. With btrfs check reporting two overlapping extents, the second one shows some clue on the cause: ERROR: dev extent devid 4 offset 14263979671552 len 6488064 overlap with previous dev extent end 14263980982272 ERROR: dev extent devid 13 offset 2257707008000 len 6488064 overlap with previous dev extent end 2257707270144 ERROR: errors found in extent allocation tree or chunk allocation The second one looks like a bitflip happened during new chunk allocation: hex(2257707008000) = 0x20da9d30000 hex(2257707270144) = 0x20da9d70000 diff = 0x00000040000 So it looks like a bitflip happened during new dev extent allocation, resulting the second overlap. Currently we only do the dev-extent verification at mount time, but if the corruption is caused by memory bitflip, we really want to catch it before writing the corruption to the storage. Furthermore the dev extent items has the following key definition: (<device id> DEV_EXTENT <physical offset>) Thus we can not just rely on the generic key order check to make sure there is no overlapping. 
[ENHANCEMENT] Introduce dedicated dev extent checks, including: - Fixed member checks * chunk_tree should always be BTRFS_CHUNK_TREE_OBJECTID (3) * chunk_objectid should always be BTRFS_FIRST_CHUNK_CHUNK_TREE_OBJECTID (256) - Alignment checks * chunk_offset should be aligned to sectorsize * length should be aligned to sectorsize * key.offset should be aligned to sectorsize - Overlap checks If the previous key is also a dev-extent item, with the same device id, make sure we do not overlap with the previous dev extent. Reported: Stefan N <stefannnau@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CA+W5K0rSO3koYTo=nzxxTm1-Pdu1HYgVxEpgJ=aGc7d=E8mGEg@mail.gmail.com/ CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
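The overlap rule and the bitflip observation from the report can both be sketched in user-space C. The struct and function names are hypothetical, and the sketch assumes the two extents arrive in key order, as they would within a leaf.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flattened dev extent: key (devid, physical) plus length. */
struct sketch_dev_extent {
	uint64_t devid;
	uint64_t physical;
	uint64_t length;
};

/* prev must sort before cur; only extents on the same device can overlap. */
static bool sketch_dev_extents_overlap(const struct sketch_dev_extent *prev,
				       const struct sketch_dev_extent *cur)
{
	assert(prev != NULL && cur != NULL);

	if (prev->devid != cur->devid)
		return false;
	/* Treat a wrapping end offset as an overlap, since it is invalid anyway. */
	if (prev->length > UINT64_MAX - prev->physical)
		return true;
	return prev->physical + prev->length > cur->physical;
}

/* True if a and b differ in exactly one bit, the signature of a bitflip. */
static bool sketch_is_single_bit_flip(uint64_t a, uint64_t b)
{
	uint64_t diff = a ^ b;

	return diff != 0 && (diff & (diff - 1)) == 0;
}
```

Feeding the two offsets from the report, 2257707008000 and 2257707270144, to `sketch_is_single_bit_flip()` confirms they differ only in bit 0x40000, which matches the hex diff shown in the message.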
2024-08-13btrfs: tree-checker: reject BTRFS_FT_UNKNOWN dir typeQu Wenruo1-2/+3
[REPORT] There is a bug report that kernel is rejecting a mismatching inode mode and its dir item: [ 1881.553937] BTRFS critical (device dm-0): inode mode mismatch with dir: inode mode=040700 btrfs type=2 dir type=0 [CAUSE] It looks like the inode mode is correct, while the dir item type 0 is BTRFS_FT_UNKNOWN, which should not be generated by btrfs at all. This may be caused by a memory bit flip. [ENHANCEMENT] Although tree-checker is not able to do any cross-leaf verification, for this particular case we can at least reject any dir type with BTRFS_FT_UNKNOWN. So here we enhance the dir type check from [0, BTRFS_FT_MAX), to (0, BTRFS_FT_MAX). Although the existing corruption can not be fixed just by such enhanced checking, it should prevent the same 0x2->0x0 bitflip for dir type to reach disk in the future. Reported-by: Kota <nospam@kota.moe> Link: https://lore.kernel.org/linux-btrfs/CACsxjPYnQF9ZF-0OhH16dAx50=BXXOcP74MxBc3BG+xae4vTTw@mail.gmail.com/ CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
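The change from a half-open range `[0, BTRFS_FT_MAX)` to an open range `(0, BTRFS_FT_MAX)` amounts to one extra comparison. A sketch with hypothetical names, where `ft_max` stands in for BTRFS_FT_MAX and is passed as a parameter rather than assumed:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_FT_UNKNOWN 0 /* the dir type that should never be on disk */

/* Old range: [0, ft_max), which lets the unknown type through. */
static bool sketch_dir_type_valid_old(uint8_t type, uint8_t ft_max)
{
	assert(ft_max > 0);
	return type < ft_max;
}

/* New range: (0, ft_max), which rejects the unknown type as well. */
static bool sketch_dir_type_valid_new(uint8_t type, uint8_t ft_max)
{
	assert(ft_max > 0);
	return type > SKETCH_FT_UNKNOWN && type < ft_max;
}
```

A dir type of 2 that loses its only set bit becomes 0, which the old range accepted and the new one rejects.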
2024-07-30Merge tag 'for-6.11-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linuxLinus Torvalds1-0/+47
Pull btrfs fixes from David Sterba: - fix regression in extent map rework when handling insertion of overlapping compressed extent - fix unexpected file length when appending to a file using direct io and buffer not faulted in - in zoned mode, fix accounting of unusable space when flipping read-only block group back to read-write - fix page locking when COWing an inline range, assertion failure found by syzbot - fix calculation of space info in debugging print - tree-checker, add validation of data reference item - fix a few -Wmaybe-uninitialized build warnings * tag 'for-6.11-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: initialize location to fix -Wmaybe-uninitialized in btrfs_lookup_dentry() btrfs: fix corruption after buffer fault in during direct IO append write btrfs: zoned: fix zone_unusable accounting on making block group read-write again btrfs: do not subtract delalloc from avail bytes btrfs: make cow_file_range_inline() honor locked_page on error btrfs: fix corrupt read due to bad offset of a compressed extent map btrfs: tree-checker: validate dref root and objectid
2024-07-28minmax: don't use max() in situations that want a C constant expressionLinus Torvalds1-1/+1
We only had a couple of array[] declarations, and changing them to just use 'MAX()' instead of 'max()' fixes the issue. This will allow us to simplify our min/max macros enormously, since they can now unconditionally use temporary variables to avoid using the argument values multiple times. Cc: David Laight <David.Laight@aculab.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-07-25btrfs: tree-checker: validate dref root and objectidQu Wenruo1-0/+47
[CORRUPTION] There is a bug report that btrfs flips RO due to a corruption in the extent tree, the involved dumps looks like this: item 188 key (402811572224 168 4096) itemoff 14598 itemsize 79 extent refs 3 gen 3678544 flags 1 ref#0: extent data backref root 13835058055282163977 objectid 281473384125923 offset 81432576 count 1 ref#1: shared data backref parent 1947073626112 count 1 ref#2: shared data backref parent 1156030103552 count 1 BTRFS critical (device vdc1: state EA): unable to find ref byte nr 402811572224 parent 0 root 265 owner 28703026 offset 81432576 slot 189 BTRFS error (device vdc1: state EA): failed to run delayed ref for logical 402811572224 num_bytes 4096 type 178 action 2 ref_mod 1: -2 [CAUSE] The corrupted entry is ref#0 of item 188. The root number 13835058055282163977 is beyond the upper limit for root items (the current limit is 1 << 48), and the objectid also looks suspicious. Only the offset and count is correct. [ENHANCEMENT] Although it's still unknown why we have such many bytes corrupted randomly, we can still enhance the tree-checker for data backrefs by: - Validate the root value For now there should only be 3 types of roots can have data backref: * subvolume trees * data reloc trees * root tree Only for v1 space cache - validate the objectid value The objectid should be a valid inode number. Hopefully we can catch such problem in the future with the new checkers. Reported-by: Kai Krakow <hurikhan77@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CAMthOuPjg5RDT-G_LXeBBUUtzt3cq=JywF+D1_h+JYxe=WKp-Q@mail.gmail.com/#t Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
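A deliberately simplified sketch of the range part of the root check follows. The real validation distinguishes several kinds of roots; this version only models the "beyond the upper limit" test, and the function name, the `limit_bits` parameter, and the strictness of the comparison are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified: a root id is considered plausible only if it lies below
 * 1 << limit_bits. The message above quotes a limit of 1 << 48.
 */
static bool sketch_dref_root_in_range(uint64_t root, unsigned int limit_bits)
{
	assert(limit_bits < 64);
	return root < (1ULL << limit_bits);
}

/* Bits that differ between the value on disk and the value expected. */
static uint64_t sketch_flipped_bits(uint64_t observed, uint64_t expected)
{
	return observed ^ expected;
}
```

For the dump above, the on-disk root 13835058055282163977 is 0xC000000000000109, so it fails the range test, and XORing it with 265 (the root named in the "unable to find ref" line) leaves only the two highest bits set.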
2024-07-11btrfs: tree-checker: add extra ram_bytes and disk_num_bytes checkQu Wenruo1-0/+18
Ensure that non-compressed file extents (both regular and prealloc) have matching ram_bytes and disk_num_bytes. This only applies to the CONFIG_BTRFS_DEBUG and CONFIG_BTRFS_ASSERT case; furthermore, it does not return an error, it only emits a kernel warning to inform developers. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: remove raid-stripe-tree encoding field from stripe_extentJohannes Thumshirn1-19/+0
Remove the encoding field from 'struct btrfs_stripe_extent'. It was originally intended to encode the RAID type as well as whether we're a data or a parity stripe. But the RAID type can be inferred from the block-group, and the data vs. parity differentiation is more easily done by adding a new key type for parity stripes in the RAID stripe tree. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: use btrfs_is_testing() everywhereDavid Sterba1-1/+1
There are open coded tests of BTRFS_FS_STATE_DUMMY_FS_INFO and we have a wrapper for that that's a compile-time constant when self-tests are not built in. As this is only for development we can save some bytes and conditions on release configs by using the helper in the remaining cases. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-02 btrfs: make sure that WRITTEN is set on all metadata blocks Josef Bacik 1 -15/+15
We previously would call btrfs_check_leaf() if we had the check integrity code enabled, which meant that we could only run the extended leaf checks if we had WRITTEN set on the header flags. This leaves a gap in our checking, because we could end up with corruption on disk where WRITTEN isn't set on the leaf, and then the extended leaf checks don't get run which we rely on to validate all of the item pointers to make sure we don't access memory outside of the extent buffer. However, since 732fab95abe2 ("btrfs: check-integrity: remove CONFIG_BTRFS_FS_CHECK_INTEGRITY option") we no longer call btrfs_check_leaf() from btrfs_mark_buffer_dirty(), which means we only ever call it on blocks that are being written out, and thus have WRITTEN set, or that are being read in, which should have WRITTEN set. Add checks to make sure we have WRITTEN set appropriately, and then make sure __btrfs_check_leaf() always does the item checking. This will protect us from file systems that have been corrupted and no longer have WRITTEN set on some of the blocks. This was hit on a crafted image tweaking the WRITTEN bit and reported by KASAN as out-of-bound access in the eb accessors. The example is a dir item at the end of an eb. 
[2.042] BTRFS warning (device loop1): bad eb member start: ptr 0x3fff start 30572544 member offset 16410 size 2 [2.040] general protection fault, probably for non-canonical address 0xe0009d1000000003: 0000 [#1] PREEMPT SMP KASAN NOPTI [2.537] KASAN: maybe wild-memory-access in range [0x0005088000000018-0x000508800000001f] [2.729] CPU: 0 PID: 2587 Comm: mount Not tainted 6.8.2 #1 [2.729] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 [2.621] RIP: 0010:btrfs_get_16+0x34b/0x6d0 [2.621] RSP: 0018:ffff88810871fab8 EFLAGS: 00000206 [2.621] RAX: 0000a11000000003 RBX: ffff888104ff8720 RCX: ffff88811b2288c0 [2.621] RDX: dffffc0000000000 RSI: ffffffff81dd8aca RDI: ffff88810871f748 [2.621] RBP: 000000000000401a R08: 0000000000000001 R09: ffffed10210e3ee9 [2.621] R10: ffff88810871f74f R11: 205d323430333737 R12: 000000000000001a [2.621] R13: 000508800000001a R14: 1ffff110210e3f5d R15: ffffffff850011e8 [2.621] FS: 00007f56ea275840(0000) GS:ffff88811b200000(0000) knlGS:0000000000000000 [2.621] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [2.621] CR2: 00007febd13b75c0 CR3: 000000010bb50000 CR4: 00000000000006f0 [2.621] Call Trace: [2.621] <TASK> [2.621] ? show_regs+0x74/0x80 [2.621] ? die_addr+0x46/0xc0 [2.621] ? exc_general_protection+0x161/0x2a0 [2.621] ? asm_exc_general_protection+0x26/0x30 [2.621] ? btrfs_get_16+0x33a/0x6d0 [2.621] ? btrfs_get_16+0x34b/0x6d0 [2.621] ? btrfs_get_16+0x33a/0x6d0 [2.621] ? __pfx_btrfs_get_16+0x10/0x10 [2.621] ? __pfx_mutex_unlock+0x10/0x10 [2.621] btrfs_match_dir_item_name+0x101/0x1a0 [2.621] btrfs_lookup_dir_item+0x1f3/0x280 [2.621] ? __pfx_btrfs_lookup_dir_item+0x10/0x10 [2.621] btrfs_get_tree+0xd25/0x1910 Reported-by: lei lu <llfamsec@gmail.com> CC: stable@vger.kernel.org # 6.7+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ copy more details from report ] Signed-off-by: David Sterba <dsterba@suse.com>
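The idea in this commit, rejecting a leaf without WRITTEN and always validating item bounds, can be sketched as below. This is a userspace illustration with hypothetical names; the flag bit (1 << 0) and the 16384-byte extent buffer length are assumptions, and the kernel's real checks cover far more than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed header flag bit for WRITTEN. */
#define HEADER_FLAG_WRITTEN (1ULL << 0)

/* A block read from or written to disk is expected to carry WRITTEN. */
static inline bool header_has_written(uint64_t header_flags)
{
	return (header_flags & HEADER_FLAG_WRITTEN) != 0;
}

/*
 * Bounds check for a member access inside an extent buffer of eb_len
 * bytes. Written to avoid overflow in offset + size.
 */
static inline bool member_in_bounds(uint64_t eb_len, uint64_t offset,
				    uint64_t size)
{
	return size <= eb_len && offset <= eb_len - size;
}

/* Hypothetical combined check: flags first, then the item bounds. */
static inline bool leaf_access_ok(uint64_t header_flags, uint64_t eb_len,
				  uint64_t offset, uint64_t size)
{
	if (!header_has_written(header_flags))
		return false;
	return member_in_bounds(eb_len, offset, size);
}
```

Assuming a 16384-byte extent buffer, the access from the warning above (member offset 16410, size 2) falls outside the buffer and is rejected by `member_in_bounds()`.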
2024-03-05 btrfs: tree-checker: dump the page status if hit something wrong Qu Wenruo 1 -0/+6
[BUG] There is a bug report about a very suspicious tree-checker error being triggered: BTRFS critical (device dm-0): corrupted node, root=256 block=8550954455682405139 owner mismatch, have 11858205567642294356 expect [256, 18446744073709551360] BTRFS critical (device dm-0): corrupted node, root=256 block=8550954455682405139 owner mismatch, have 11858205567642294356 expect [256, 18446744073709551360] BTRFS critical (device dm-0): corrupted node, root=256 block=8550954455682405139 owner mismatch, have 11858205567642294356 expect [256, 18446744073709551360] SELinux: inode_doinit_use_xattr: getxattr returned 117 for dev=dm-0 ino=5737268 [ANALYZE] The root cause is still unclear, but there are some clues already: - Unaligned eb bytenr The block bytenr is 8550954455682405139, which is not even aligned to 2. This bytenr is fetched from the extent buffer header, not from eb->start. This means that at the initial time of the read, the eb header bytenr was still correct (the very basic check needed to continue the read), but later something went wrong and at least the first page got corrupted. Thus we got such an obviously incorrect value. - Invalid extent buffer header owner The read itself is triggered for subvolume 256, but the eb header owner is 11858205567642294356, which is not really possible. The problem here is that the subvolume id is limited to ((1 << 48) - 1), and this one definitely goes beyond that limit. So this value is also garbage. We already have two garbage values from an extent buffer which passed the initial bytenr and csum checks, but whose contents became garbage at some point later. This looks like a page lifespan problem (e.g. we didn't properly hold the page). [ENHANCEMENT] The current tree-checker only outputs things from the extent buffer, nothing about the page status. 
So this patch enhances the tree-checker output by also dumping the first page, which would look like this: page:00000000aa9f3ce8 refcount:4 mapcount:0 mapping:00000000169aa6b6 index:0x1d0c pfn:0x1022e5 memcg:ffff888103456000 aops:btree_aops [btrfs] ino:1 flags: 0x2ffff0000008000(private|node=0|zone=2|lastcpupid=0xffff) page_type: 0xffffffff() raw: 02ffff0000008000 0000000000000000 dead000000000122 ffff88811e06e220 raw: 0000000000001d0c ffff888102fdb1d8 00000004ffffffff ffff888103456000 page dumped because: eb page dump BTRFS critical (device dm-3): corrupt leaf: root=5 block=30457856 slot=6 ino=257 file_offset=0, invalid disk_bytenr for file extent, have 10617606235235216665, should be aligned to 4096 BTRFS error (device dm-3): read time tree block corruption detected on logical 30457856 mirror 1 From the dump we can see some extra info that can help us do extra cross-checks: - Page refcount If it's too low, it definitely means something bad. - Page aops Any mapped eb page should have btree_aops with inode number 1. - Page index Since a mapped eb page should have its bytenr matching the page position, (index << PAGE_SHIFT) should match the bytenr from the critical line. - Page Private flags A mapped eb page should have the Private flag set to indicate it's managed by btrfs. Link: https://lore.kernel.org/linux-btrfs/CAHk-=whNdMaN9ntZ47XRKP6DBes2E5w7fi-0U3H2+PS18p+Pzw@mail.gmail.com/ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
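The page index cross-check can be worked through with the numbers from the dump above. The sketch below assumes a PAGE_SHIFT of 12 (4 KiB pages) and uses a hypothetical helper name; it only applies to the first page of an extent buffer, which is the page being dumped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, i.e. PAGE_SHIFT == 12. */
#define SKETCH_PAGE_SHIFT 12

/*
 * For the first page of a mapped extent buffer, the byte position of the
 * page (index << PAGE_SHIFT) should equal the block bytenr reported in
 * the "BTRFS critical" line.
 */
static inline bool page_index_matches_bytenr(uint64_t page_index,
					      uint64_t bytenr)
{
	return (page_index << SKETCH_PAGE_SHIFT) == bytenr;
}
```

For the dump shown, index 0x1d0c is 7436, and 7436 * 4096 = 30457856, which matches block=30457856, so this particular page passes the cross-check.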
2024-03-04 btrfs: remove unused included headers David Sterba 1 -2/+0
With help of neovim, LSP and clangd we can identify header files that are not actually needed to be included in the .c files. This is focused only on removal (with minor fixups), further cleanups are possible but will require doing the header files properly with forward declarations, minimized includes and include-what-you-use care. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-01-18 btrfs: tree-checker: fix inline ref size in error messages Chung-Chiang Cheng 1 -1/+1
The error message should accurately reflect the size rather than the type. Fixes: f82d1c7ca8ae ("btrfs: tree-checker: Add EXTENT_ITEM and METADATA_ITEM check") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-15 btrfs: tree-checker: add type and sequence check for inline backrefs Qu Wenruo 1 -0/+39
[BUG] There is a bug report that ntfs2btrfs had a bug that can lead to a transaction abort, after which the filesystem flips to read-only. [CAUSE] For inline backref items, the kernel has a strict requirement on their order; they must follow these rules: - All btrfs_extent_inline_ref::type values should be in ascending order - Within the same type, the items should follow a descending order by their sequence number For the EXTENT_DATA_REF type, the sequence number is the result of hash_extent_data_ref(). For other types, the sequence number is btrfs_extent_inline_ref::offset. Thus if any code does not follow the above rules, the resulting inline backrefs can prevent the kernel from locating the needed inline backref and lead to a transaction abort. [FIX] ntfs2btrfs has already fixed the problem, and btrfs-progs has added the ability to detect such problems. For the kernel, let's be more noisy and more specific about the order, so that the next time the kernel hits such a problem we reject it in the first place, without leading to a transaction abort. Link: https://github.com/kdave/btrfs-progs/pull/622 Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
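The two ordering rules can be sketched as a single pass over the inline refs. This is a userspace illustration with hypothetical names, not the kernel check: each ref is reduced to its type and an already computed sequence number, and treating equal sequence numbers as acceptable is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reduced view of one inline backref. */
struct inline_ref_sketch {
	uint8_t  type; /* btrfs_extent_inline_ref::type */
	uint64_t seq;  /* hash for EXTENT_DATA_REF, offset otherwise */
};

/*
 * Returns true if types are ascending and, within one type, sequence
 * numbers are descending. Equal neighbours are tolerated here, which is
 * an assumption of this sketch.
 */
static inline bool inline_refs_ordered(const struct inline_ref_sketch *refs,
				       size_t count)
{
	for (size_t i = 1; i < count; i++) {
		const struct inline_ref_sketch *prev = &refs[i - 1];
		const struct inline_ref_sketch *cur = &refs[i];

		if (cur->type < prev->type)
			return false;
		if (cur->type == prev->type && cur->seq > prev->seq)
			return false;
	}
	return true;
}
```

As a usage example, a sequence of one ref of type 178 followed by two refs of type 184 with offsets 1947073626112 and 1156030103552 satisfies both rules; swapping the last two entries violates the descending rule.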
2023-10-12 btrfs: add and use helpers for reading and writing last_trans_committed Filipe Manana 1 -1/+1
Currently the last_trans_committed field of struct btrfs_fs_info is modified and read without any locking or other protection. For example early in the fsync path, skip_inode_logging() is called which reads fs_info->last_trans_committed, but at the same time we can have a transaction commit completing and updating that field. In the case of an fsync this is harmless and any data race should be rare and at most cause an unnecessary logging of an inode. To avoid data race warnings from tools like KCSAN and other issues such as load and store tearing (amongst others, see [1]), create helpers to access the last_trans_committed field of struct btrfs_fs_info using READ_ONCE() and WRITE_ONCE(), and use these helpers everywhere. [1] https://lwn.net/Articles/793253/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
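The accessor pattern described above can be approximated in userspace as follows. The macros are simplified stand-ins for the kernel's READ_ONCE() and WRITE_ONCE() built on volatile accesses, and the struct and helper names are hypothetical; this shows the shape of the helpers, not the kernel's definitions or guarantees.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for READ_ONCE()/WRITE_ONCE(), u64 only. */
#define READ_ONCE_U64(x)      (*(const volatile uint64_t *)&(x))
#define WRITE_ONCE_U64(x, v)  (*(volatile uint64_t *)&(x) = (v))

/* Hypothetical cut-down fs_info with just the field of interest. */
struct fs_info_sketch {
	uint64_t last_trans_committed;
};

/* Every reader goes through this helper instead of touching the field. */
static inline uint64_t get_last_trans_committed(const struct fs_info_sketch *fs)
{
	return READ_ONCE_U64(fs->last_trans_committed);
}

/* Every writer goes through this helper as well. */
static inline void set_last_trans_committed(struct fs_info_sketch *fs,
					    uint64_t gen)
{
	WRITE_ONCE_U64(fs->last_trans_committed, gen);
}
```

Funnelling all accesses through one pair of helpers keeps the annotation in a single place, so a later change to the access rules only touches the helpers.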
2023-10-12 btrfs: new inline ref storing owning subvol of data extents Boris Burkov 1 -0/+3
In order to implement simple quota groups, we need to be able to associate a data extent with the subvolume that created it. Once you account for reflink, this information cannot be recovered without explicitly storing it. Options for storing it are: - a new key/item - a new extent inline ref item The former is backwards compatible, but wastes space, the latter is incompat, but is efficient in space and reuses the existing inline ref machinery, while only abusing it a tiny amount -- specifically, the new item is not a ref, per-se. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: tree-checker: add support for raid stripe treeJohannes Thumshirn1-0