aboutsummaryrefslogtreecommitdiff
path: root/fs/btrfs/fs.h
AgeCommit message (Collapse)AuthorFilesLines
2026-05-16btrfs: fix squota accounting during enable generationBoris Burkov1-0/+1
The first transaction that enables squotas is special and a bit tricky. We have to set BTRFS_FS_QUOTA_ENABLED after the transaction to avoid a deadlock, so any delayed refs that run before we set the bit are not squota accounted. For data this is fine, we don't get an owner_ref, so there is no real harm, it's as if the extent predated squotas. However for metadata, the tree block will have gen == enable_gen so when we free it later, we will decrement the squota accounting, which can result in an underflow. Before it is freed, btrfs check shows errors, as we have mismatched usage between the node generations/owners and the squota values. There are two angles to this fix: 1. For extents that come in delayed_refs that run during the enable_gen transaction, we must actually set enable_gen to the *next* transaction. That is the first transaction that we can really properly account in any way. 2. For extents that come in between the end of our transaction handle and the time we set the BTRFS_FS_QUOTA_ENABLED bit, we need an additional bit, BTRFS_FS_SQUOTA_ENABLING which only affects recording squota deltas, so we do pick up those extents. Otherwise, we would miss them, even for enable_gen + 1. Fixes: bd7c1ea3a302 ("btrfs: qgroup: check generation when recording simple quota delta") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: report filesystem shutdown via fserrorMiquel Sabaté Solà1-1/+4
Commit 347b7042fb26 ("Merge patch series "fs: generic file IO error reporting"") has introduced a common framework for reporting errors to fsnotify in a standard way. One of the functions being introduced is fserror_report_shutdown() that, when combined with the experimental support for shutdown in btrfs, it means that user-space can also easily detect whenever a btrfs filesystem has been marked as shutdown. Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-04-07btrfs: constify arguments of some functionsFilipe Manana1-2/+2
There are several functions that take pointer arguments but don't need to modify the objects they point to, so add the const qualifiers. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: fix transaction commit blocking during trim of unallocated spacejinbaohong1-0/+6
When trimming unallocated space, btrfs_trim_fs() holds the device_list_mutex for the entire duration while iterating through all devices. On large filesystems with significant unallocated space, this operation can take minutes to hours on large storage systems. This causes a problem because btrfs_run_dev_stats(), which is called during transaction commit, also requires device_list_mutex: btrfs_trim_fs() mutex_lock(&fs_devices->device_list_mutex) list_for_each_entry(device, ...) btrfs_trim_free_extents(device) mutex_unlock(&fs_devices->device_list_mutex) commit_transaction() btrfs_run_dev_stats() mutex_lock(&fs_devices->device_list_mutex) // blocked! ... While trim is running, all transaction commits are blocked waiting for the mutex. Fix this by refactoring btrfs_trim_free_extents() to process devices in bounded chunks (up to 2GB per iteration) and release device_list_mutex between chunks. Signed-off-by: robbieko <robbieko@synology.com> Signed-off-by: jinbaohong <jinbaohong@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: reorder members in btrfs_delayed_root for better packingDavid Sterba1-1/+1
There are two unnecessary 4B holes in btrfs_delayed_root; struct btrfs_delayed_root { spinlock_t lock; /* 0 4 */ /* XXX 4 bytes hole, try to pack */ struct list_head node_list; /* 8 16 */ struct list_head prepare_list; /* 24 16 */ atomic_t items; /* 40 4 */ atomic_t items_seq; /* 44 4 */ int nodes; /* 48 4 */ /* XXX 4 bytes hole, try to pack */ wait_queue_head_t wait; /* 56 24 */ /* size: 80, cachelines: 2, members: 7 */ /* sum members: 72, holes: 2, sum holes: 8 */ /* last cacheline: 16 bytes */ }; Reordering 'nodes' after 'lock' reduces size by 8B, to 72 on release config. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: embed delayed root to struct btrfs_fs_infoDavid Sterba1-2/+16
The fs_info::delayed_root is allocated dynamically but there's only one instance per filesystem so we can embed it into the fs_info itself. The two object have the same lifetime and delayed roots are always present so we don't need to allocate it on demand from slab. There's still some space left in fs_info until the 4K so there won't be an spill over to next page on release config (size grows from 3880 to 3952). In case we want to shrink fs_info there are still holes to fill or we can separate other non-core or optional structures if needed. Link: https://lore.kernel.org/all/cover.1767979013.git.dsterba@suse.com/ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: handle deletions from remapped block groupMark Harmstone1-1/+3
Handle the case where we free an extent from a block group that has the REMAPPED flag set. Because the remap tree is orthogonal to the extent tree, for data this may be within any number of identity remaps or actual remaps. If we're freeing a metadata node, this will be wholly inside one or the other. btrfs_remove_extent_from_remap_tree() searches the remap tree for the remaps that cover the range in question, then calls remove_range_from_remap_tree() for each one, to punch a hole in the remap and adjust the free-space tree. For an identity remap, remove_range_from_remap_tree() will adjust the block group's `identity_remap_count` if this changes. If it reaches zero we mark the block group as fully remapped. For an identity remap, remove_range_from_remap_tree() will adjust the block group's `identity_remap_count` if this changes. If it reaches zero we mark the block group as fully remapped. Fully remapped block groups have their chunk stripes removed and their device extents freed, which makes the disk space available again to the chunk allocator. This happens asynchronously: in the cleaner thread for sync discard and nodiscard, and (in a later patch) in the discard worker for async discard. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: allow mounting filesystems with remap-tree incompat flagMark Harmstone1-1/+3
If we encounter a filesystem with the remap-tree incompat flag set, validate its compatibility with the other flags, and load the remap tree using the values that have been added to the superblock. The remap-tree feature depends on the free-space-tree, but no-holes and block-group-tree have been made dependencies to reduce the testing matrix. Similarly I'm not aware of any reason why mixed-bg and zoned would be incompatible with remap-tree, but this is blocked for the time being until it can be fully tested. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: add METADATA_REMAP chunk typeMark Harmstone1-0/+2
Add a new METADATA_REMAP chunk type, which is a metadata chunk that holds the remap tree. This is needed for bootstrapping purposes: the remap tree can't itself be remapped, and must be relocated the existing way, by COWing every leaf. The remap tree can't go in the SYSTEM chunk as space there is limited, because a copy of the chunk item gets placed in the superblock. The changes in fs/btrfs/volumes.h are because we're adding a new block group type bit after the profile bits, and so can no longer rely on the const_ilog2 trick. The sizing to 32MB per chunk, matching the SYSTEM chunk, is an estimate here, we can adjust it later if it proves to be too big or too small. This works out to be ~500,000 remap items, which for a 4KB block size covers ~2GB of remapped data in the worst case and ~500TB in the best case. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: split btrfs_fs_closing() and change return type to boolDavid Sterba1-8/+10
There are two tests in btrfs_fs_closing() but checking the BTRFS_FS_CLOSING_DONE bit is done only in one place load_extent_tree_free(). As this is an inline we can reduce size of the generated code. The types can be also changed to bool as this becomes a simple condition. text data bss dec hex filename 1674006 146704 15560 1836270 1c04ee pre/btrfs.ko 1673772 146704 15560 1836036 1c0404 post/btrfs.ko DELTA: -234 Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: move unlikely checks around btrfs_is_shutdown() into the helperFilipe Manana1-2/+2
Instead of surrounding every caller of btrfs_is_shutdown() with unlikely, move the unlikely into the helper itself, like we do in other places in btrfs and is common in the kernel outside btrfs too. Also make the fs_info argument of btrfs_is_shutdown() const. On a x86_84 box using gcc 14.2.0-19 from Debian, this resulted in a slight reduction of the module's text size. Before: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1939044 172568 15592 2127204 207564 fs/btrfs/btrfs.ko After: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1938876 172568 15592 2127036 2074bc fs/btrfs/btrfs.ko Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: switch to library APIs for checksumsEric Biggers1-5/+18
Make btrfs use the library APIs instead of crypto_shash, for all checksum computations. This has many benefits: - Allows future checksum types, e.g. XXH3 or CRC64, to be more easily supported. Only a library API will be needed, not crypto_shash too. - Eliminates the overhead of the generic crypto layer, including an indirect call for every function call and other API overhead. A microbenchmark of btrfs_check_read_bio() with crc32c checksums shows a speedup from 658 cycles to 608 cycles per 4096-byte block. - Decreases the stack usage of btrfs by reducing the size of checksum contexts from 384 bytes to 240 bytes, and by eliminating the need for some functions to declare a checksum context at all. - Increases reliability. The library functions always succeed and return void. In contrast, crypto_shash can fail and return errors. Also, the library functions are guaranteed to be available when btrfs is loaded; there's no longer any need to use module softdeps to try to work around the crypto modules sometimes not being loaded. - Fixes a bug where blake2b checksums didn't work on kernels booted with fips=1. Since btrfs checksums are for integrity only, it's fine for them to use non-FIPS-approved algorithms. Note that with having to handle 4 algorithms instead of just 1-2, this commit does result in a slightly positive diffstat. That being said, this wouldn't have been the case if btrfs had actually checked for errors from crypto_shash, which technically it should have been doing. Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Neal Gompa <neal@gompa.dev> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-01-20btrfs: reject new transactions if the fs is fully read-onlyQu Wenruo1-0/+8
[BUG] There is a bug report where a heavily fuzzed fs is mounted with all rescue mount options, which leads to the following warnings during unmount: BTRFS: Transaction aborted (error -22) Modules linked in: CPU: 0 UID: 0 PID: 9758 Comm: repro.out Not tainted 6.19.0-rc5-00002-gb71e635feefc #7 PREEMPT(full) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 RIP: 0010:find_free_extent_update_loop fs/btrfs/extent-tree.c:4208 [inline] RIP: 0010:find_free_extent+0x52f0/0x5d20 fs/btrfs/extent-tree.c:4611 Call Trace: <TASK> btrfs_reserve_extent+0x2cd/0x790 fs/btrfs/extent-tree.c:4705 btrfs_alloc_tree_block+0x1e1/0x10e0 fs/btrfs/extent-tree.c:5157 btrfs_force_cow_block+0x578/0x2410 fs/btrfs/ctree.c:517 btrfs_cow_block+0x3c4/0xa80 fs/btrfs/ctree.c:708 btrfs_search_slot+0xcad/0x2b50 fs/btrfs/ctree.c:2130 btrfs_truncate_inode_items+0x45d/0x2350 fs/btrfs/inode-item.c:499 btrfs_evict_inode+0x923/0xe70 fs/btrfs/inode.c:5628 evict+0x5f4/0xae0 fs/inode.c:837 __dentry_kill+0x209/0x660 fs/dcache.c:670 finish_dput+0xc9/0x480 fs/dcache.c:879 shrink_dcache_for_umount+0xa0/0x170 fs/dcache.c:1661 generic_shutdown_super+0x67/0x2c0 fs/super.c:621 kill_anon_super+0x3b/0x70 fs/super.c:1289 btrfs_kill_super+0x41/0x50 fs/btrfs/super.c:2127 deactivate_locked_super+0xbc/0x130 fs/super.c:474 cleanup_mnt+0x425/0x4c0 fs/namespace.c:1318 task_work_run+0x1d4/0x260 kernel/task_work.c:233 exit_task_work include/linux/task_work.h:40 [inline] do_exit+0x694/0x22f0 kernel/exit.c:971 do_group_exit+0x21c/0x2d0 kernel/exit.c:1112 __do_sys_exit_group kernel/exit.c:1123 [inline] __se_sys_exit_group kernel/exit.c:1121 [inline] __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1121 x64_sys_call+0x2210/0x2210 arch/x86/include/generated/asm/syscalls_64.h:232 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xe8/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x44f639 Code: Unable to access opcode bytes at 0x44f60f. RSP: 002b:00007ffc15c4e088 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7 RAX: ffffffffffffffda RBX: 00000000004c32f0 RCX: 000000000044f639 RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000001 RBP: 0000000000000001 R08: ffffffffffffffc0 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004c32f0 R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000001 </TASK> Since rescue mount options will mark the full fs read-only, there should be no new transaction triggered. But during unmount we will evict all inodes, which can trigger a new transaction, and triggers warnings on a heavily corrupted fs. [CAUSE] Btrfs allows new transaction even on a read-only fs, this is to allow log replay happen even on read-only mounts, just like what ext4/xfs do. However with rescue mount options, the fs is fully read-only and cannot be remounted read-write, thus in that case we should also reject any new transactions. [FIX] If we find the fs has rescue mount options, we should treat the fs as error, so that no new transaction can be started. Reported-by: Jiaming Zhang <r772577952@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CANypQFYw8Nt8stgbhoycFojOoUmt+BoZ-z8WJOZVxcogDdwm=Q@mail.gmail.com/ Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: move and rename CSUM_FMT definitionDavid Sterba1-0/+4
Move the CSUM_FMT* definitions to fs.h where is be the BTRFS_KEY_FMT and add the prefix for consistency. Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: remove btrfs_fs_info::compressed_write_workersQu Wenruo1-1/+0
The reason why end_bbio_compressed_write() queues a work into compressed_write_workers wq is for end_compressed_writeback() call, as it will grab all the involved folios and clear the writeback flags, which may sleep. However now we always run btrfs_bio::end_io() in task context, there is no need to queue the work anymore. Just remove btrfs_fs_info::compressed_write_workers and compressed_bio::write_end_work. There is a comment about the works queued into compressed_write_workers, now change to flush endio wq instead, which is responsible to handle all data endio functions. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: add macros to facilitate printing of keysFilipe Manana1-0/+3
There's a lot of places where we need to print a key, and it's tiresome to type the format specifier, typically "(%llu %u %llu)", as well as passing 3 arguments to a prink family function (key->objectid, key->type, key->offset). So add a couple macros for this just like we have for csum values in btrfs_inode.h (CSUM_FMT and CSUM_FMT_VALUE). This also ensures that we consistently print a key in the same format, always as "(%llu %llu %llu)", which is the most common format we use, but we have a few variations such as "[%llu %llu %llu]" for no good reason. This patch introduces the macros while the next one makes use of it. This is to ease backports of future patches, since then we can backport this patch which is simple and short and then backport those future patches, as the next patch in the series that makes use of these new macros is quite large and may have some dependencies. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: introduce a new shutdown stateQu Wenruo1-0/+28
A new fs state EMERGENCY_SHUTDOWN is introduced, which is btrfs' equivalent of XFS_IOC_GOINGDOWN or EXT4_IOC_SHUTDOWN, after entering emergency shutdown state, all operations will return errors (-EIO), and can not be bring back to normal state until unmouont. The new state will reject the following file operations: - read_iter() - write_iter() - mmap() - open() - remap_file_range() - uring_cmd() - splice_read() This requires a small wrapper to do the extra shutdown check, then call the regular filemap_splice_read() function This should reject most of the file operations on a shutdown btrfs. And for the existing dirty folios, extra shutdown checks are introduced to the following functions: - run_delalloc_nocow() - run_delalloc_compressed() - cow_file_range() So that dirty ranges will still be properly cleaned without being submitted. Finally the shutdown state will also set the fs error, so that no new transaction will be committed, protecting the metadata from any possible further corruption. And when the fs entered shutdown mode for the first time, a critical level kernel message will show up to indicate the incident. That message will be important for end users as rejected delalloc ranges will output error messages, hopefully that shutdown message and the fact that all fs operations are returning error will prevent end users from getting too confused about the delalloc error messages. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <asj@kernel.org> Tested-by: Anand Jain <asj@kernel.org> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23btrfs: prepare compression folio alloc/free for bs > ps casesQu Wenruo1-0/+6
This includes the following preparation for bs > ps cases: - Always alloc/free the folio directly if bs > ps This adds a new @fs_info parameter for btrfs_alloc_compr_folio(), thus affecting all compression algorithms. For btrfs_free_compr_folio() it needs no parameter for now, as we can use the folio size to skip the caching part. For now the change is just to passing a @fs_info into the function, all the folio size assumption is still based on page size. - Properly zero the last folio in compress_file_range() Since the compressed folios can be larger than a page, we need to properly zero the whole folio. - Use correct folio size for btrfs_add_compressed_bio_folios() Instead of page size, use the correct folio size. - Use correct folio size/shift for btrfs_compress_filemap_get_folio() As we are not only using simple page sized folios anymore. - Use correct folio size for btrfs_decompress() There is an ASSERT() making sure the decompressed range is no larger than a page, which will be triggered for bs > ps cases. - Skip readahead for compressed pages Similar to subpage cases. - Make btrfs_alloc_folio_array() to accept a new @order parameter - Add a helper to calculate the minimal folio size All those changes should not affect the existing bs <= ps handling. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23btrfs: annotate btrfs_is_testing() as unlikely and make it return boolFilipe Manana1-4/+4
We can annotate btrfs_is_testing() as unlikely since that's the most expected scenario and it's desirable for the compiler to optimize for the case we are not running the self tests. So add the annotation to btrfs_is_testing() and while at it also make it return bool instead of int. Also make two of the existing callers use btrfs_is_testing() directly instead of storing its result in a local variable. On x86_64 with Debian's gcc 14.2.0-19 this resulted in a very tiny object code reduction. Before this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1913263 161567 15592 2090422 1fe5b6 fs/btrfs/btrfs.ko After this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1913257 161567 15592 2090416 1fe5b0 fs/btrfs/btrfs.ko Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23btrfs: dump detailed info and specific messages on log replay failuresFilipe Manana1-0/+2
Currently debugging log replay failures can be harder than needed, since all we do now is abort a transaction, which gives us a line number, a stack trace and an error code. But that is most of the times not enough to give some clue about what went wrong. So add a new helper to abort log replay and provide contextual information: 1) Dump the current leaf of the log tree being processed and print the slot we are currently at and the key at that slot; 2) Dump the current subvolume tree leaf if we have any; 3) Print the current stage of log replay; 4) Print the id of the subvolume root associated with the log tree we are currently processing (as we can have multiple); 5) Print some error message to mention what we were trying to do when we got an error. Replace all transaction abort calls (btrfs_abort_transaction()) with the new helper btrfs_abort_log_replay(), which besides dumping all that extra information, it also aborts the current transaction. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23btrfs: cache max and min order inside btrfs_fs_infoQu Wenruo1-0/+2
Inside btrfs_fs_info we cache several bits shift like sectorsize_bits. Apply this to max and min folio orders so that every time mapping order needs to be applied we can skip the calculation. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23btrfs: fix typos in comments and stringsDavid Sterba1-1/+1
Annual typo fixing pass. Strangely codespell found only about 30% of what is in this patch, the rest was done manually using text spellchecker with a custom dictionary of acceptable terms. Reviewed-by: Neal Gompa <neal@gompa.dev> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23btrfs: add workspace manager initialization for zstdQu Wenruo1-0/+13
This involves: - Add zstd_alloc_workspace_manager() and zstd_free_workspace_manager() Those two functions will accept an fs_info pointer, and alloc/free fs_info->compr_wsm[BTRFS_COMPRESS_ZSTD] pointer. - Add btrfs_alloc_compress_wsm() and btrfs_free_compress_wsm() Those are helpers allocating the workspace managers for all algorithms. For now only zstd is supported, and the timing is a little unusual, the btrfs_alloc_compress_wsm() should only be called after the sectorsize being initialized. Meanwhile btrfs_free_fs_info_compress() is called in btrfs_free_fs_info(). - Move the definition of btrfs_compression_type to "fs.h" The reason is that "compression.h" has already included "fs.h", thus we can not just include "compression.h" to get the definition of BTRFS_NR_COMPRESS_TYPES to define fs_info::compr_wsm[]. For now the per-fs zstd workspace manager won't really have any effect, and all compression is still going through the global workspace manager. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-22btrfs: add mount option for ref_trackerLeo Martins1-0/+1
The ref_tracker infrastructure aids debugging but is not enabled by default as it has a performance impact. Add mount option 'ref_tracker' so it can be selectively enabled on a filesystem. Currently it track references of 'delayed inodes'. Signed-off-by: Leo Martins <loemra.dev@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-22btrfs: move ref-verify under CONFIG_BTRFS_DEBUGLeo Martins1-3/+1
Remove CONFIG_BTRFS_FS_REF_VERIFY Kconfig and add it as part of CONFIG_BTRFS_DEBUG. This should not be impactful to the performance of debug. The struct btrfs_ref takes an additional u64, btrfs_fs_info takes an additional spinlock_t and rb_root. All of the ref_verify logic is still protected by a mount option. Signed-off-by: Leo Martins <loemra.dev@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-22btrfs: simplify support block size checkQu Wenruo1-0/+3
Currently we manually check the block size against 3 different values: - 4K - PAGE_SIZE - MIN_BLOCKSIZE Those 3 values can match or differ from each other. This makes it pretty complex to output the supported block sizes. Considering we're going to add block size > page size support soon, this can make the support block size sysfs attribute much harder to implement. To make it easier, factor out a helper, btrfs_supported_blocksize() to do a simple check for the block size. Then utilize it in the two locations: - btrfs_validate_super() This is very straightforward - supported_sectorsizes_show() Iterate through all valid block sizes, and only output supported ones. This is to make future full range block sizes support much easier. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22btrfs: index buffer_tree using node sizeDaniel Vacek1-1/+2
So far we've been deriving the buffer tree index using the sector size. But each extent buffer covers multiple sectors. This makes the buffer tree rather sparse. For example the typical and quite common configuration uses sector size of 4KiB and node size of 16KiB. In this case it means the buffer tree is using up to the maximum of 25% of it's slots. Or in other words at least 75% of the tree slots are wasted as never used. We can score significant memory savings on the required tree nodes by indexing the tree using the node size instead. As a result far less slots are wasted and the tree can now use up to all 100% of it's slots this way. Note: This works even with unaligned tree blocks as we can still get unique index by doing eb->start >> nodesize_shift. Getting some stats from running fio write test, there is a bit of variance. The values presented in the table below are medians from 5 test runs. The numbers are: - # of allocated ebs in the tree - # of leaf tree nodes - highest index in the tree (radix tree width)): ebs / leaves / Index | bare for-next | with fix ---------------------+--------------------+------------------- post mount | 16 / 11 / 10e5c | 16 / 10 / 4240 post test | 5810 / 891 / 11cfc | 4420 / 252 / 473a post rm | 574 / 300 / 10ef0 | 540 / 163 / 46e9 In this case (10GiB filesystem) the height of the tree is still 3 levels but the 4x width reduction is clearly visible as expected. But since the tree is more dense we can see the 54-72% reduction of leaf nodes. That's very close to ideal with this test. It means the tree is getting really dense with this kind of workload. Also, the fio results show no performance change. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22btrfs: use the super_block as holder when mounting file systemsChristoph Hellwig1-2/+0
The file system type is not a very useful holder as it doesn't allow us to go back to the actual file system instance. Pass the super_block instead which is useful when passed back to the file system driver. This matches what is done for all other block device based file systems, and allows us to remove btrfs_fs_info::bdev_holder completely. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22btrfs: qgroup: remove no longer used fs_info->qgroup_ulistFilipe Manana1-6/+0
It's not used anymore after commit 091344508249 ("btrfs: qgroup: use qgroup_iterator in qgroup_convert_meta()"), so remove it. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21btrfs: sysfs: track current commit duration in commit_statsBoris Burkov1-0/+2
When debugging/detecting outlier commit stalls, having an indicator that we are currently in a long commit critical section can be very useful. Extend the commit_stats sysfs file to also include the current commit critical section duration. Since this requires storing the last commit start time, use that rather than a separate stack variable for storing the finished commit durations as well. This also requires slightly moving up the timing of the stats updating to *inside* the critical section to avoid the transaction T+1 setting the critical_section_start_time to 0 before transaction T can update its stats, which would trigger the new ASSERT. This is an improvement in and of itself, as it makes the stats more accurately represent the true critical section time. It may be yet better to pull the stats up to where start_transaction gets unblocked, rather than the next commit, but this seems like a good enough place as well. Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: add block reserve for treelogNaohiro Aota1-0/+2
We need to add a dedicated block_rsv for tree-log, because the block_rsv serves for a tree node allocation in btrfs_alloc_tree_block(). Currently, tree-log tree uses fs_info->empty_block_rsv, which is shared across trees and points to the normal metadata space_info. Instead, we add a dedicated block_rsv and that block_rsv can use the dedicated sub-space_info. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: convert the buffer_radix to an xarrayJosef Bacik1-3/+1
In order to fully utilize xarray tagging to improve writeback we need to convert the buffer_radix to a proper xarray. This conversion is relatively straightforward as the radix code uses the xarray underneath. Using xarray directly allows for quite a lot less code. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-12btrfs: add back warning for mount option commit values exceeding 300Kyoji Ogasawara1-0/+1
The Btrfs documentation states that if the commit value is greater than 300 a warning should be issued. The warning was accidentally lost in the new mount API update. Fixes: 6941823cc878 ("btrfs: remove old mount API code") CC: stable@vger.kernel.org # 6.12+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Kyoji Ogasawara <sawara04.o@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: add extra warning if delayed iput is added when it's not allowedQu Wenruo1-0/+3
Since I have triggered the ASSERT() on the delayed iput too many times, now is the time to add some extra debug warnings for delayed iput. All delayed iputs should be queued after all ordered extents finish their IO and all involved workqueues are flushed. Thus after the btrfs_run_delayed_iputs() inside close_ctree(), there should be no more delayed puts added. So introduce a new BTRFS_FS_STATE_NO_DELAYED_IPUT, set after the above mentioned timing. And all btrfs_add_delayed_iput() will check that flag and give a WARN_ON_ONCE(). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: defrag: extend ioctl to accept compression levelsDaniel Vacek1-1/+1
The zstd and zlib compression types support setting compression levels. Enhance the defrag interface to specify the levels as well. For zstd the negative (realtime) levels are also accepted. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: allow debug builds to accept 2K block sizeQu Wenruo1-0/+12
Currently we only support two block sizes, 4K and PAGE_SIZE. This means on the most common architecture x86_64, we have no way to test subpage block size. And that's exactly I have an aarch64 machine dedicated for subpage tests. But this is still a hurdle for a lot of btrfs developers, and to improve the test coverage mostly on x86_64, here we enable debug builds to accept 2K block size. This involves: - Introduce a dedicated minimal block size macro BTRFS_MIN_BLOCKSIZE, which depends on if CONFIG_BTRFS_DEBUG is set. If so it's 2K, otherwise it's 4K as usual. - Allow 4K, PAGE_SIZE and BTRFS_MIN_BLOCKSIZE as block size - Update subpage block size checks to be based on BTRFS_MIN_BLOCKSIZE - Export the new supported blocksize through sysfs interfaces As most of the subpage support is already pretty mature, there is no extra work needed to support the extra 2K block size. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: remove btrfs_fs_info::sectors_per_pageQu Wenruo1-1/+6
For the future large folio support, our filemap can have folios with different sizes, thus we can no longer rely on a fixed blocks_per_page value. To prepare for that future, here we do: - Remove btrfs_fs_info::sectors_per_page - Introduce a helper, btrfs_blocks_per_folio() Which uses the folio size to calculate the number of blocks for each folio. - Migrate the existing btrfs_fs_info::sectors_per_page to use that helper There are some exceptions: * Metadata nodesize < page size support In the future, even if we support large folios, we will only allocate a folio that matches our nodesize. Thus we won't have a folio covering multiple metadata unless nodesize < page size. * Existing subpage bitmap dump We use a single unsigned long to store the bitmap. That means until we change the bitmap dumping code, our upper limit for folio size will only be 256K (4K block size, 64 bit unsigned long). * btrfs_is_subpage() check This will be migrated into a future patch. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: zstd: enable negative compression levels mount optionDaniel Vacek1-1/+1
Allow using the fast modes (negative compression levels) of zstd as a mount option. As per the results, the compression ratio is (expectedly) lower: for level in {-15..-1} 1 2 3; \ do printf "level %3d\n" $level; \ mount -o compress=zstd:$level /dev/sdb /mnt/test/; \ grep sdb /proc/mounts; \ cp -r /usr/bin /mnt/test/; sync; compsize /mnt/test/bin; \ cp -r /usr/share/doc /mnt/test/; sync; compsize /mnt/test/doc; \ cp enwik9 /mnt/test/; sync; compsize /mnt/test/enwik9; \ cp linux-6.13.tar /mnt/test/; sync; compsize /mnt/test/linux-6.13.tar; \ rm -r /mnt/test/{bin,doc,enwik9,linux-6.13.tar}; \ umount /mnt/test/; \ done |& tee results | \ awk '/^level/{print}/^TOTAL/{print$3"\t"$2" |"}' | paste - - - - - 266M bin | 45M doc | 953M wiki | 1.4G source =============================+===============+===============+===============+ level -15 180M 67% | 30M 68% | 694M 72% | 598M 40% | level -14 180M 67% | 30M 67% | 683M 71% | 581M 39% | level -13 177M 66% | 29M 66% | 671M 70% | 566M 38% | level -12 174M 65% | 29M 65% | 658M 69% | 548M 37% | level -11 174M 65% | 28M 64% | 645M 67% | 530M 35% | level -10 171M 64% | 28M 62% | 631M 66% | 512M 34% | level -9 165M 62% | 27M 61% | 615M 64% | 493M 33% | level -8 161M 60% | 27M 59% | 598M 62% | 475M 32% | level -7 155M 58% | 26M 58% | 582M 61% | 457M 30% | level -6 151M 56% | 25M 56% | 565M 59% | 437M 29% | level -5 145M 54% | 24M 55% | 545M 57% | 417M 28% | level -4 139M 52% | 23M 52% | 520M 54% | 391M 26% | level -3 135M 50% | 22M 50% | 495M 51% | 369M 24% | level -2 127M 47% | 22M 48% | 470M 49% | 349M 23% | level -1 120M 45% | 21M 47% | 452M 47% | 332M 22% | level 1 110M 41% | 17M 39% | 362M 38% | 290M 19% | level 2 106M 40% | 17M 38% | 349M 36% | 288M 19% | level 3 104M 39% | 16M 37% | 340M 35% | 276M 18% | The samples represent some data sets that can be commonly found and show approximate compressibility. The fast levels trade off speed for ratio and are best suitable for highly compressible data. As can be seen above, comparing the results to the current default zstd level 3, the negative levels are roughly 2x worse at -15 and the ratio increases almost linearly with each level. Signed-off-by: Daniel Vacek <neelx@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: add tracking of read blocks for read policyAnand Jain1-0/+3
Track number of read blocks in the whole filesystem. The counter is initialized when devices are opened. The counter is increased at btrfs_submit_dev_bio() if the stats tracking is enabled (depends on the read policy). Stats tracking is disabled by default and is enabled through fs_devices::collect_fs_stats when required. The code is not under the EXPERIMENTAL define, as stats can be expanded to include write counts and other performance counters, with the user interface independent of its internal use. This is an in-memory-only feature, not related to the dev error stats. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: don't include linux/rwlock_types.h directlyWolfram Sang1-1/+0
The header clearly states that it does not want to be included directly, only via linux/spinlock_types.h. Drop this as we can simply use the spinlock.h which is already included. Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: use uuid_is_null() to verify if an uuid is emptyFilipe Manana1-1/+4
At btrfs_is_empty_uuid() we have our custom code to check if an uuid is empty, however there a kernel uuid library that has a function named uuid_is_null() which does the same and probably more efficient. So change btrfs_is_empty_uuid() to use uuid_is_null(), which is almost a directly replacement, it just wraps the necessary casting since our uuid types are u8 arrays while the uuid kernel library uses the uuid_t type, which is just a typedef of an u8 array of 16 elements as well. Also since the function is now to trivial, make it a static inline function in fs.h. Suggested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: move btrfs_alloc_write_mask() into fs.hFilipe Manana1-0/+6
Currently btrfs_alloc_write_mask() is defined in ctree.h but it's not related at all to the btree data structure, so move it into fs.h. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: move BTRFS_BYTES_TO_BLKS() into fs.hFilipe Manana1-0/+2
Currently BTRFS_BYTES_TO_BLKS() is defined in ctree.h but it's not related at all to the btree data structure, so move it into fs.h. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: move the folio ordered helpers from ctree.h into fs.hFilipe Manana1-0/+8
The folio ordered helper macros are defined at ctree.h but this is not the best place since ctree.{h,c} is all about the btree data structure implementation and not a generic module. So move these macros into the fs.h header. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: move btrfs_is_empty_uuid() from ioctl.c into fs.cFilipe Manana1-0/+2