aboutsummaryrefslogtreecommitdiff
path: root/fs/btrfs/sysfs.c
AgeCommit message (Collapse)AuthorFilesLines
2026-02-21Convert 'alloc_obj' family to use the new default GFP_KERNEL argumentLinus Torvalds1-1/+1
This was done entirely with mindless brute force, using git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' | xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21treewide: Replace kmalloc with kmalloc_obj for non-scalar typesKees Cook1-2/+2
This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...) (where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-03btrfs: add METADATA_REMAP chunk typeMark Harmstone1-0/+2
Add a new METADATA_REMAP chunk type, which is a metadata chunk that holds the remap tree. This is needed for bootstrapping purposes: the remap tree can't itself be remapped, and must be relocated the existing way, by COWing every leaf. The remap tree can't go in the SYSTEM chunk as space there is limited, because a copy of the chunk item gets placed in the superblock. The changes in fs/btrfs/volumes.h are because we're adding a new block group type bit after the profile bits, and so can no longer rely on the const_ilog2 trick. The sizing to 32MB per chunk, matching the SYSTEM chunk, is an estimate here, we can adjust it later if it proves to be too big or too small. This works out to be ~500,000 remap items, which for a 4KB block size covers ~2GB of remapped data in the worst case and ~500TB in the best case. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: add definitions and constants for remap-treeMark Harmstone1-0/+3
Add an incompat flag for the new remap-tree feature, and the constants and definitions needed to support it. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: remove experimental offload csum modeQu Wenruo1-44/+0
The offload csum mode was introduced to allow developers to compare the performance of generating checksum for data writes at different timings: - During btrfs_submit_chunk() This is the most common one, if any of the following condition is met we go this path: * The csum is fast For now it's CRC32C and xxhash. * It's a synchronous write * Zoned - Delay the checksum generation to a workqueue However since commit dd57c78aec39 ("btrfs: introduce btrfs_bio::async_csum") we no longer need to bother any of them. As if it's an experimental build, async checksum generation at the background will be faster anyway. And if not an experimental build, we won't even have the offload csum mode support. Considering the async csum will be the new default, let's remove the offload csum mode code. There will be no impact to end users, and offload csum mode is still under experimental features. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03btrfs: switch to library APIs for checksumsEric Biggers1-4/+2
Make btrfs use the library APIs instead of crypto_shash, for all checksum computations. This has many benefits: - Allows future checksum types, e.g. XXH3 or CRC64, to be more easily supported. Only a library API will be needed, not crypto_shash too. - Eliminates the overhead of the generic crypto layer, including an indirect call for every function call and other API overhead. A microbenchmark of btrfs_check_read_bio() with crc32c checksums shows a speedup from 658 cycles to 608 cycles per 4096-byte block. - Decreases the stack usage of btrfs by reducing the size of checksum contexts from 384 bytes to 240 bytes, and by eliminating the need for some functions to declare a checksum context at all. - Increases reliability. The library functions always succeed and return void. In contrast, crypto_shash can fail and return errors. Also, the library functions are guaranteed to be available when btrfs is loaded; there's no longer any need to use module softdeps to try to work around the crypto modules sometimes not being loaded. - Fixes a bug where blake2b checksums didn't work on kernels booted with fips=1. Since btrfs checksums are for integrity only, it's fine for them to use non-FIPS-approved algorithms. Note that with having to handle 4 algorithms instead of just 1-2, this commit does result in a slightly positive diffstat. That being said, this wouldn't have been the case if btrfs had actually checked for errors from crypto_shash, which technically it should have been doing. Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Neal Gompa <neal@gompa.dev> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2026-01-14btrfs: remove zoned statistics from sysfsJohannes Thumshirn1-52/+0
Remove the newly introduced zoned statistics from sysfs, as sysfs can only show a single page this will truncate the output on a busy filesystem. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: zoned: show statistics for zoned filesystemsJohannes Thumshirn1-0/+53
Provide statistics for zoned filesystems. These statistics include, the number of active block-groups, how many of them are reclaimable or unused, if the filesystem needs to be reclaimed, the currently assigned relocation and treelog block-groups if they're present and a list of active zones. Example: active block-groups: 4   reclaimable: 0   unused: 2   need reclaim: false data relocation block-group: 4294967296 active zones:   start: 1610612736, wp: 344064 used: 16384, reserved: 0, unusable: 327680   start: 1879048192, wp: 34963456 used: 131072, reserved: 0, unusable: 34832384   start: 4026531840, wp: 0 used: 0, reserved: 0, unusable: 0   start: 4294967296, wp: 0 used: 0, reserved: 0, unusable: 0 Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24btrfs: remove fs_info argument from btrfs_sysfs_add_space_info_type()Filipe Manana1-3/+2
We don't need it since we can grab fs_info from the given space_info. So remove the fs_info argument. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-22btrfs: simplify support block size checkQu Wenruo1-6/+10
Currently we manually check the block size against 3 different values: - 4K - PAGE_SIZE - MIN_BLOCKSIZE Those 3 values can match or differ from each other. This makes it pretty complex to output the supported block sizes. Considering we're going to add block size > page size support soon, this can make the support block size sysfs attribute much harder to implement. To make it easier, factor out a helper, btrfs_supported_blocksize() to do a simple check for the block size. Then utilize it in the two locations: - btrfs_validate_super() This is very straightforward - supported_sectorsizes_show() Iterate through all valid block sizes, and only output supported ones. This is to make future full range block sizes support much easier. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21btrfs: use our message helpers instead of pr_err/pr_warn/pr_infoDavid Sterba1-3/+2
Our message helpers accept NULL for the fs_info in the context that does not provide and print the common header of the message. The use of pr_* helpers is only for special reasons, like module loading, device scanning or multi-line output (print-tree). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21btrfs: rename error to ret in btrfs_sysfs_add_mounted()David Sterba1-24/+23
Unify naming of return value to the preferred way. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21btrfs: rename error to ret in btrfs_sysfs_add_fsid()David Sterba1-5/+5
Unify naming of return value to the preferred way. Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21btrfs: rename err to ret in quota_override_store()David Sterba1-4/+4
Unify naming of return value to the preferred way. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21btrfs: sysfs: track current commit duration in commit_statsBoris Burkov1-0/+8
When debugging/detecting outlier commit stalls, having an indicator that we are currently in a long commit critical section can be very useful. Extend the commit_stats sysfs file to also include the current commit critical section duration. Since this requires storing the last commit start time, use that rather than a separate stack variable for storing the finished commit durations as well. This also requires slightly moving up the timing of the stats updating to *inside* the critical section to avoid the transaction T+1 setting the critical_section_start_time to 0 before transaction T can update its stats, which would trigger the new ASSERT. This is an improvement in and of itself, as it makes the stats more accurately represent the true critical section time. It may be yet better to pull the stats up to where start_transaction gets unblocked, rather than the next commit, but this seems like a good enough place as well. Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: introduce tree-log sub-space_infoNaohiro Aota1-2/+9
Introduce the tree-log sub-space_info, which is sub-space of metadata space_info and dedicated for tree-log node allocation. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: introduce btrfs_space_info sub-groupNaohiro Aota1-3/+15
Current code assumes we have only one space_info for each block group type (DATA, METADATA, and SYSTEM). We sometime need multiple space infos to manage special block groups. One example is handling the data relocation block group for the zoned mode. That block group is dedicated for writing relocated data and we cannot allocate any regular extent from that block group, which is implemented in the zoned extent allocator. This block group still belongs to the normal data space_info. So, when all the normal data block groups are full and there is some free space in the dedicated block group, the space_info looks to have some free space, while it cannot allocate normal extent anymore. That results in a strange ENOSPC error. We need to have a space_info for the relocation data block group to represent the situation properly. Adds a basic infrastructure for having a "sub-group" of a space_info: creation and removing. A sub-group space_info belongs to one of the primary space_infos and has the same flags as its parent. This commit first introduces the relocation data sub-space_info, and the next commit will introduce tree-log sub-space_info. In the future, it could be useful to implement tiered storage for btrfs e.g. by implementing a sub-group space_info for block groups resides on a fast storage. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: allow debug builds to accept 2K block sizeQu Wenruo1-1/+2
Currently we only support two block sizes, 4K and PAGE_SIZE. This means on the most common architecture x86_64, we have no way to test subpage block size. And that's exactly I have an aarch64 machine dedicated for subpage tests. But this is still a hurdle for a lot of btrfs developers, and to improve the test coverage mostly on x86_64, here we enable debug builds to accept 2K block size. This involves: - Introduce a dedicated minimal block size macro BTRFS_MIN_BLOCKSIZE, which depends on if CONFIG_BTRFS_DEBUG is set. If so it's 2K, otherwise it's 4K as usual. - Allow 4K, PAGE_SIZE and BTRFS_MIN_BLOCKSIZE as block size - Update subpage block size checks to be based on BTRFS_MIN_BLOCKSIZE - Export the new supported blocksize through sysfs interfaces As most of the subpage support is already pretty mature, there is no extra work needed to support the extra 2K block size. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: sysfs: accept size suffixes for read policy valuesAnand Jain1-5/+6
We now parse human-friendly size values (e.g. '1G', '2M') when setting read policies. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-02-26btrfs: replace deprecated strncpy() with strscpy()Thorsten Blum1-2/+2
strncpy() is deprecated for NUL-terminated destination buffers. Use strscpy() instead and don't zero-initialize the param array. Link: https://github.com/KSPP/linux/issues/90 Cc: linux-hardening@vger.kernel.org Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: configure read policy via module parameterAnand Jain1-1/+30
For testing purposes allow to configure the read policy via module parameter from the beginning. Available only with CONFIG_BTRFS_EXPERIMENTAL Examples: - Set the RAID1 balancing method to round-robin with a custom min_contig_read of 4k: $ modprobe btrfs read_policy=round-robin:4096 - Set the round-robin balancing method with the default min_contiguous_read: $ modprobe btrfs read_policy=round-robin - Set the "devid" balancing method, defaulting to the latest device: $ modprobe btrfs read_policy=devid Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: add read policy to set a preferred deviceAnand Jain1-1/+30
Add read policy that will force all reads to go to the given device (specified by devid) on the RAID1 profiles. This will be used for testing, e.g. to read from stale device. Users may find other use cases. Can be set in sysfs, the value format is "devid:<devid>" to the file /sys/fs/btrfs/FSID/read_policy Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: introduce RAID1 round-robin read balancingAnand Jain1-1/+47
Add round-robin read policy that balances reads over available devices (all RAID1 block group profiles). Switch to the next devices is done after a number of blocks is read, which is 256K by default and is configurable in sysfs. The format is "round-robin:<min-contig-read>" and can be set in file /sys/fs/btrfs/FSID/read_policy Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: sysfs: handle value associated with read balancing policyAnand Jain1-2/+22
Enable specifying additional configuration values along the RAID1 balancing read policy in a single input string. Update btrfs_read_policy_to_enum() to parse and handle a value associated with the policy in the format "policy:value", the value part if present is converted to 64-bit integer. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: sysfs: add btrfs_read_policy_to_enum() helper and refactor read ↵Anand Jain1-12/+22
policy store Introduce btrfs_read_policy_to_enum() helper to simplify the conversion of a string read policy to its corresponding enum value. This reduces duplication and improves code clarity in btrfs_read_policy_store(). The parameter is copied locally to allow modification, enabling the separation of the method and its value. This prepares for the addition of more functionality in subsequent patches. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: sysfs: refactor output formatting in btrfs_read_policy_show()Anand Jain1-8/+10
Refactor the logic in btrfs_read_policy_show() for easier extension with more balancing methods. Streamline the space and bracket handling around the active policy without altering the functional output. This is in preparation to add more methods. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-23btrfs: sysfs: fix direct super block member readsQu Wenruo1-3/+3
The following sysfs entries are reading super block member directly, which can have a different endian and cause wrong values: - sys/fs/btrfs/<uuid>/nodesize - sys/fs/btrfs/<uuid>/sectorsize - sys/fs/btrfs/<uuid>/clone_alignment Thankfully those values (nodesize and sectorsize) are always aligned inside the btrfs_super_block, so it won't trigger unaligned read errors, just endian problems. Fix them by using the native cached members instead. Fixes: df93589a1737 ("btrfs: export more from FS_INFO to sysfs") CC: stable@vger.kernel.org Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-28btrfs: sysfs: advertise experimental features only if ↵Filipe Manana1-2/+2
CONFIG_BTRFS_EXPERIMENTAL=y We are advertising experimental features through sysfs if CONFIG_BTRFS_DEBUG is set, without looking if CONFIG_BTRFS_EXPERIMENTAL is set. This is wrong as it will result in reporting experimental features as supported when CONFIG_BTRFS_EXPERIMENTAL is not set but CONFIG_BTRFS_DEBUG is set. Fix this by checking for CONFIG_BTRFS_EXPERIMENTAL instead of CONFIG_BTRFS_DEBUG. Fixes: 67cd3f221769 ("btrfs: split out CONFIG_BTRFS_EXPERIMENTAL from CONFIG_BTRFS_DEBUG") Reviewed-by: Neal Gompa <neal@gompa.dev> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: split out CONFIG_BTRFS_EXPERIMENTAL from CONFIG_BTRFS_DEBUGQu Wenruo1-2/+2
Currently CONFIG_BTRFS_EXPERIMENTAL is not only for the extra debugging output, but also for experimental features. This is not ideal to distinguish planned but not yet stable features from those purely designed for debugging. This patch splits the following features into CONFIG_BTRFS_EXPERIMENTAL: - Extent map shrinker This seems to be the first one to exit experimental. - Extent tree v2 This seems to be the last one to graduate from experimental. - Raid stripe tree - Csum offload mode - Send protocol v3 Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: introduce new "rescue=ignoresuperflags" mount optionQu Wenruo1-0/+1
This new mount option allows the kernel to skip the super flags check, it's mostly to allow the kernel to do a rescue mount of an interrupted checksum conversion. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: introduce new "rescue=ignoremetacsums" mount optionQu Wenruo1-0/+1
Introduce "rescue=ignoremetacsums" to ignore metadata csums, all the other metadata sanity checks are still kept as is. This new mount option is mostly to allow the kernel to mount an interrupted checksum conversion (at the metadata csum overwrite stage). And since the main part of metadata sanity checks is inside tree-checker, we shouldn't lose much safety, and the new mount option is rescue mount option it requires full read-only mount. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: periodic block_group reclaimBoris Burkov1-0/+34
We currently employ a edge-triggered block group reclaim strategy which marks block groups for reclaim as they free down past a threshold. With a dynamic threshold, this is worse than doing it in a level-triggered fashion periodically. That is because the reclaim itself happens periodically, so the threshold at that point in time is what really matters, not the threshold at freeing time. If we mark the reclaim in a big pass, then sort by usage and do reclaim, we also benefit from a negative feedback loop preventing unnecessary reclaims as we crunch through the "best" candidates. Since this is quite a different model, it requires some additional support. The edge triggered reclaim has a good heuristic for not reclaiming fresh block groups, so we need to replace that with a typical GC sweep mark which skips block groups that have seen an allocation since the last sweep. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: dynamic block_group reclaim thresholdBoris Burkov1-1/+42
We can currently recover allocated block_groups by: - explicitly starting balance operations - "auto reclaim" via bg_reclaim_threshold The latter works by checking against a fixed threshold on frees. If we pass from above the threshold to below, relocation triggers and the block group will get reclaimed by the cleaner thread (assuming it is still eligible) Picking a threshold is challenging. Too high, and you end up trying to reclaim very full block_groups which is quite costly, and you don't do reclaim on block_groups that don't get quite THAT full, but could still be quite fragmented and stranding a lot of space. Too low, and you similarly miss out on reclaim even if you badly need it to avoid running out of unallocated space, if you have heavily fragmented block groups living above the threshold. No matter the threshold, it suffers from a workload that happens to bounce around that threshold, which can introduce arbitrary amounts of reclaim waste. To improve this situation, introduce a dynamic threshold. The basic idea behind this threshold is that it should be very lax when there is plenty of unallocated space, and increasingly aggressive as we approach zero unallocated space. To that end, it sets a target for unallocated space (10 chunks) and then linearly increases the threshold as the amount of space short of the target we are increases. The formula is: (target - unalloc) / target I tested this by running it on three interesting workloads: 1. bounce allocations around X% full. 2. fill up all the way and introduce full fragmentation. 3. write in a fragmented way until the filesystem is just about full. 1. and 2. attack the weaknesses of a fixed threshold; fixed either works perfectly or fully falls apart, depending on the threshold. Dynamic always handles these cases well. 3. attacks dynamic by checking whether it is too zealous to reclaim in conditions with low unallocated and low unused. It tends to claw back 1GiB of unallocated fairly aggressively, but not much more. Early versions of dynamic threshold struggled on this test. Additional work could be done to intelligently ratchet up the urgency of reclaim in very low unallocated conditions. Existing mechanisms are already useless in that case anyway. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: report reclaim stats in sysfsBoris Burkov1-0/+6
When evaluating various reclaim strategies/thresholds against each other, it is useful to collect data about the amount of reclaim happening. Expose a count, error count, and byte count via sysfs per space_info. Note that this is only for automatic reclaim, not manually invoked balances or other codepaths that use "relocate_block_group" Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: use btrfs_is_testing() everywhereDavid Sterba1-4/+4
There are open coded tests of BTRFS_FS_STATE_DUMMY_FS_INFO and we have a wrapper for that that's a compile-time constant when self-tests are not built in. As this is only for development we can save some bytes and conditions on release configs by using the helper in the remaining cases. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: introduce offload_csum_mode to tweak checksum offloading behaviorNaohiro Aota1-0/+44
We disable offloading checksum to workqueues and do it synchronously when the checksum algorithm is fast. However, as reported in the link below, RAID0 with multiple devices may suffer from the sync checksum, because "fast checksum" is still not fast enough to catch up with RAID0 writing. We don't have an effective way to determine whether to offload or not, for now add a sysfs knob so this can be debugged. This is intentionally under CONFIG_BTRFS_DEBUG so ti's not exposed to users as it may be removed in the future agin. Introduce fs_devices->offload_csum_mode, so that a btrfs developer can change the behavior by writing to /sys/fs/btrfs/<uuid>/offload_csum. The default is "auto" which is the same as the previous behavior. Or, you can set "on" or "off" (or "y" or "n" whatever kstrtobool() accepts) to always/never offload checksum. More benchmark need to be collected with this knob to implement a proper criteria to enable/disable checksum offloading. Link: https://lore.kernel.org/linux-btrfs/20230731152223.4EFB.409509F4@e16-tech.com/ Link: https://lore.kernel.org/linux-btrfs/p3vo3g7pqn664mhmdhlotu5dzcna6vjtcoc2hb2lsgo2fwct7k@xzaxclba5tae/ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: sysfs: drop unnecessary double logical negation in acl_show()Neal Gompa1-1/+1
The IS_ENABLED() macro already guarantees the result will be a suitable boolean return value ("1" for enabled, and "0" for disabled). Thus, it seems that the "!!" used right before is unnecessary to force the 0/1 values. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Neal Gompa <neal@gompa.dev> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: use READ/WRITE_ONCE for fs_devices->read_policyNaohiro Aota1-3/+4
Since we can read/modify the value from the sysfs interface concurrently, it would be better to protect it from compiler optimizations. Currently, there is only one read policy BTRFS_READ_POLICY_PID available, so no actual problem can happen now. This is a preparation for the future expansion. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: sysfs: validate scrub_speed_max valueDavid Disseldorp1-0/+4
The value set as scrub_speed_max accepts size with suffixes (k/m/g/t/p/e) but we should still validate it for trailing characters, similar to what we do with chunk_size_store. CC: stable@vger.kernel.org # 5.15+ Signed-off-by: David Disseldorp <ddiss@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: sysfs: show temp_fsid featureAnand Jain1-0/+18
This adds sysfs objects to indicate temp_fsid feature support and its status. /sys/fs/btrfs/features/temp_fsid /sys/fs/btrfs/<UUID>/temp_fsid For example: Consider two cloned and mounted devices. $ blkid /dev/sdc[1-2] /dev/sdc1: UUID="509ad44b-ad2a-4a8a-bc8d-fe69db7220d5" .. /dev/sdc2: UUID="509ad44b-ad2a-4a8a-bc8d-fe69db7220d5" .. One gets actual fsid, and the other gets the temp_fsid when mounted. $ btrfs filesystem show -m Label: none uuid: 509ad44b-ad2a-4a8a-bc8d-fe69db7220d5 Total devices 1 FS bytes used 54.14MiB devid 1 size 300.00MiB used 144.00MiB path /dev/sdc1 Label: none uuid: 33bad74e-c91b-43a5-aef8-b3cab97ae63a Total devices 1 FS bytes used 54.14MiB devid 1 size 300.00MiB used 144.00MiB path /dev/sdc2 Their sysfs as below. $ cat /sys/fs/btrfs/features/temp_fsid 0 $ cat /sys/fs/btrfs/509ad44b-ad2a-4a8a-bc8d-fe69db7220d5/temp_fsid 0 $ cat /sys/fs/btrfs/33bad74e-c91b-43a5-aef8-b3cab97ae63a/temp_fsid 1 Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: add and use helpers for reading and writing fs_info->generationFilipe Manana1-1/+1
Currently the generation field of struct btrfs_fs_info is always modified while holding fs_info->trans_lock locked. Most readers will access this field without taking that lock but while holding a transaction handle, which is safe to do due to the transaction life cycle. However there are other readers that are neither holding the lock nor holding a transaction handle open: 1) When reading an inode from disk, at btrfs_read_locked_inode(); 2) When reading the generation to expose it to sysfs, at btrfs_generation_show(); 3) Early in the fsync path, at skip_inode_logging(); 4) When creating a hole at btrfs_cont_expand(), during write paths, truncate and reflinking; 5) In the fs_info ioctl (btrfs_ioctl_fs_info()); 6) While mounting the filesystem, in the open_ctree() path. In these cases it's safe to directly read fs_info->generation as no one can concurrently start a transaction and update fs_info->generation. In case of the fsync path, races here should be harmless, and in the worst case they may cause a fsync to log an inode when it's not really needed, so nothing bad from a functional perspective. In the other cases it's not so clear if functional problems may arise, though in case 1 rare things like a load/store tearing [1] may cause the BTRFS_INODE_NEEDS_FULL_SYNC flag not being set on an inode and therefore result in incorrect logging later on in case a fsync call is made. To avoid data race warnings from tools like KCSAN and other issues such as load and store tearing (amongst others, see [1]), create helpers to access the generation field of struct btrfs_fs_info using READ_ONCE() and WRITE_ONCE(), and use these helpers where needed. [1] https://lwn.net/Articles/793253/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: sysfs: add simple_quota incompat feature entryBoris Burkov1-0/+2
Add an entry in the features directory for the new incompat flag Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: sysfs: expose quota mode via sysfsBoris Burkov1-0/+28
Add a new sysfs file /sys/fs/btrfs/<uuid>/qgroups/mode which prints out the mode qgroups is running in. The possible modes are qgroup, and squota. If quotas are not enabled, then the qgroups directory will not exist, so don't handle that mode. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: sysfs: announce presence of raid-stripe-treeJohannes Thumshirn1-0/+3
If a filesystem with a raid-stripe-tree is mounted, show the RST feature in sysfs, currently still under the CONFIG_BTRFS_DEBUG option. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: sysfs: show if ACL support has been compiled inAnand Jain1-0/+7
ACL support depends on the compile-time configuration option CONFIG_BTRFS_FS_POSIX_ACL. Prior to mounting a btrfs filesystem, it is not possible to determine whether ACL support has been compiled in. To address this, add a sysfs interface, /sys/fs/btrfs/features/acl, and check for ACL support in the system's btrfs. To determine ACL support: Return 0 indicates ACL is not supported: $ cat /sys/fs/btrfs/features/acl 0 Return 1 indicates ACL is supported: $ cat /sys/fs/btrfs/features/acl 1 IMO, this is a better approach, so that we also know if kernel is older. On an older kernel $ ls /sys/fs/btrfs/features/acl ls: cannot access '/sys/fs/btrfs/features/acl': No such file or directory mount a btrfs filesystem $ cat /proc/self/mounts | grep btrfs | grep -q noacl $ echo $? 0 Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17btrfs: sysfs: relax bg_reclaim_threshold for debugging purposesNaohiro Aota1-0/+5
Currently, /sys/fs/btrfs/<UUID>/bg_reclaim_threshold is limited to 0 (disable) or [50 .. 100]%, so we need to fill 50% of a device to start the auto reclaim process. It is cumbersome to do so when we want to shake out possible race issues of normal write vs reclaim. Relax the threshold check under the BTRFS_DEBUG option. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-01btrfs: sysfs: add size class statsBoris Burkov1-0/+42
Make it possible to see the distribution of size classes for block groups. Helpful for testing and debugging the allocator w.r.t. to size classes. The new stats can be found at the path: /sys/fs/btrfs/<FSID>/allocation/<bg-type>/size_class but they will only be non-zero for bg-type = data. Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: make kobj_type structures constantThomas Weißschuh1-6/+6
Since commit ee6d3dd4ed48 ("driver core: make kobj_type constant.") the driver core allows the usage of const struct kobj_type. Take advantage of this to constify the structure definitions to prevent modification at runtime. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13btrfs: sysfs: update fs features directory asynchronouslyQu Wenruo1-21/+8
[BUG] Since the introduction of per-fs feature sysfs interface (/sys/fs/btrfs/<UUID>/features/), the content of that directory is never updated. Thus for the following case, that directory will not show the new features like RAID56: # mkfs.btrfs -f $dev1 $dev2 $dev3 # mount $dev1 $mnt # btrfs balance start -f -mconvert=raid5 $mnt # ls /sys/fs/btrfs/$uuid/features/ extended_iref free_space_tree no_holes skinny_metadata While after unmount and mount, we got the correct features: # umount $mnt # mount $dev1 $mnt # ls /sys/fs/btrfs/$uuid/features/ extended_iref free_space_tree no_holes raid56 skinny_metadata [CAUSE] Because we never really try to update the content of per-fs features/ directory. We had an attempt to update the features directory dynamically in commit 14e46e04958d ("btrfs: synchronize incompat feature bits with sysfs files"), but unfortunately it get reverted in commit e410e34fad91 ("Revert "btrfs: synchronize incompat feature bits with sysfs files""). The problem in the original patch is, in the context of btrfs_create_chunk(), we can not afford to update the sysfs group. The exported but never utilized function, btrfs_sysfs_feature_update() is the leftover of such attempt. As even if we go sysfs_update_group(), new files will need extra memory allocation, and we have no way to specify the sysfs update to go GFP_NOFS. [FIX] This patch will address the old problem by doing asynchronous sysfs update in the cleaner thread. This involves the following changes: - Make __btrfs_(set|clear)_fs_(incompat|compat_ro) helpers to set BTRFS_FS_FEATURE_CHANGED flag when needed - Update btrfs_sysfs_feature_update() to use sysfs_update_group() And drop unnecessary arguments. - Call btrfs_sysfs_feature_update() in cleaner_kthread If we have the BTRFS_FS_FEATURE_CHANGED flag set. - Wake up cleaner_kthread in btrfs_commit_transaction if we have BTRFS_FS_FEATURE_CHANGED flag By this, all the previously dangerous call sites like btrfs_create_chunk() need no new changes, as above helpers would have already set the BTRFS_FS_FEATURE_CHANGED flag. The real work happens at cleaner_kthread, thus we pay the cost of delaying the update to sysfs directory, but the delayed time should be small enough that end user can not distinguish though it might get delayed if the cleaner thread is busy with removing subvolumes or defrag. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: simplify percent calculation helpers, rename div_factorDavid Sterba1-1/+1
The div_factor* helpers calculate fraction or percentage fraction. The name is a bit confusing, we use it only for percentage calculations and there are two helpers. There's a helper mult_frac that's for general fractions, that tries to be accurate but we multiply and divide by small numbers so we can use the div_u64 helper. Rename the div_factor* helpers and use 1..100 percentage range, also drop the case checking for percentage == 100, it's never hit. The conversions: * div_factor calculates tenths and the numbers need to be adjusted * div_factor_fine is direct replacement Signed-off-by: David Sterba <dsterba@suse.com>