| Age | Commit message | Author | Files | Lines |
|
As far as I can tell, we never intentionally constrained ourselves to
these status codes, and it is misleading and surprising to lack the
bdev error logging when we get a different error code from the block
layer. This can lead to jumping to a wrong conclusion like "this
system didn't see any bio failures but aborted with EIO".
For example on nvme devices, I observe many failures coming back as
BLK_STS_MEDIUM. It is apparent that the nvme driver returns a variety of
BLK_STS_* status values in nvme_error_status().
So handle the known expected errors and make some noise on the rest
which we expect won't really happen.
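The idea of "handle the known codes, make noise on the rest" can be sketched in user space as follows. This is only an illustration: the status enum and the counters are hypothetical stand-ins, not the block layer's BLK_STS_* values or the btrfs device stats.

```c
#include <stdio.h>

/* Hypothetical status codes; stand-ins for the block layer's BLK_STS_* values. */
enum io_status { IO_OK, IO_IOERR, IO_TARGET, IO_MEDIUM, IO_OTHER };

/* Hypothetical per-device error counters. */
struct dev_errs {
	unsigned int rd;
	unsigned int wr;
	unsigned int unexpected;
};

/* Count every failure per direction; additionally warn about unexpected codes. */
static void log_dev_io_error(struct dev_errs *errs, enum io_status status,
			     int is_write)
{
	switch (status) {
	case IO_OK:
		return;
	case IO_IOERR:
	case IO_TARGET:
	case IO_MEDIUM:
		/* Known, expected failure codes. */
		break;
	default:
		/* Not expected to happen, but never drop it silently. */
		errs->unexpected++;
		fprintf(stderr, "unexpected I/O status %d\n", (int)status);
		break;
	}
	if (is_write)
		errs->wr++;
	else
		errs->rd++;
}
```

With this shape, a failure that comes back with an unfamiliar code still bumps the per-device counter, so the log no longer suggests that no bio ever failed.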
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with: kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
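A rough user-space analogue of the three patterns, to show why a typed result is convenient. The alloc_obj()/alloc_objs()/alloc_flex() macros below are hypothetical: they are built on calloc() and the GNU __typeof__ extension, take no GFP flags, and do none of the overflow checking a real kernel helper would need.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical user-space analogues of typed allocation helpers.
 * The argument may be a type name or an expression such as *ptr;
 * the result is already a pointer to that type, not "void *".
 */
#define alloc_obj(T)		((__typeof__(T) *)calloc(1, sizeof(T)))
#define alloc_objs(T, n)	((__typeof__(T) *)calloc((n), sizeof(T)))
#define alloc_flex(T, fam, n)						\
	((__typeof__(T) *)calloc(1, offsetof(__typeof__(T), fam) +	\
				 (n) * sizeof(((__typeof__(T) *)0)->fam[0])))

struct item {
	int id;
};

struct msg {
	size_t len;
	char data[];	/* flexible array member */
};
```

Typical use would be "struct msg *m = alloc_flex(*m, data, 8);", where a mismatch between the variable type and the allocated type becomes a compiler diagnostic instead of a silent conversion from "void *".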
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
[BUG]
There is a bug report that when btrfs hits ENOSPC error in a critical
path, btrfs flips RO (this part is expected, although the ENOSPC bug
still needs to be addressed).
The problem is after the RO flip, if there is a read repair pending, we
can hit the ASSERT() inside btrfs_repair_io_failure() like the following:
BTRFS info (device vdc): relocating block group 30408704 flags metadata|raid1
------------[ cut here ]------------
BTRFS: Transaction aborted (error -28)
WARNING: fs/btrfs/extent-tree.c:3235 at __btrfs_free_extent.isra.0+0x453/0xfd0, CPU#1: btrfs/383844
Modules linked in: kvm_intel kvm irqbypass
[...]
---[ end trace 0000000000000000 ]---
BTRFS info (device vdc state EA): 2 enospc errors during balance
BTRFS info (device vdc state EA): balance: ended with status: -30
BTRFS error (device vdc state EA): parent transid verify failed on logical 30556160 mirror 2 wanted 8 found 6
BTRFS error (device vdc state EA): bdev /dev/nvme0n1 errs: wr 0, rd 0, flush 0, corrupt 10, gen 0
[...]
assertion failed: !(fs_info->sb->s_flags & SB_RDONLY) :: 0, in fs/btrfs/bio.c:938
------------[ cut here ]------------
assertion failed: !(fs_info->sb->s_flags & SB_RDONLY) :: 0, in fs/btrfs/bio.c:938
kernel BUG at fs/btrfs/bio.c:938!
Oops: invalid opcode: 0000 [#1] SMP NOPTI
CPU: 0 UID: 0 PID: 868 Comm: kworker/u8:13 Tainted: G W N 6.19.0-rc6+ #4788 PREEMPT(full)
Tainted: [W]=WARN, [N]=TEST
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
Workqueue: btrfs-endio simple_end_io_work
RIP: 0010:btrfs_repair_io_failure.cold+0xb2/0x120
RSP: 0000:ffffc90001d2bcf0 EFLAGS: 00010246
RAX: 0000000000000051 RBX: 0000000000001000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff8305cf42 RDI: 00000000ffffffff
RBP: 0000000000000002 R08: 00000000fffeffff R09: ffffffff837fa988
R10: ffffffff8327a9e0 R11: 6f69747265737361 R12: ffff88813018d310
R13: ffff888168b8a000 R14: ffffc90001d2bd90 R15: ffff88810a169000
FS: 0000000000000000(0000) GS:ffff8885e752c000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
------------[ cut here ]------------
[CAUSE]
The cause of the -ENOSPC error during the test case btrfs/124 is still
unknown, although it's known that we still have cases where metadata can
be over-committed but can not be fulfilled correctly. Thus if we hit
such an ENOSPC error inside a critical path, we have no choice but to
abort the current transaction.
This will mark the fs read-only.
The problem is that the btrfs_repair_io_failure() path requires the fs
not to be mounted read-only. This is normally fine, but if we are doing
a read-repair while the fs flips RO due to a critical error, we can
enter btrfs_repair_io_failure() with the super block set to read-only,
thus triggering the above crash.
[FIX]
Just replace the ASSERT() with a proper return if the fs is already
read-only.
Reported-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/linux-btrfs/20260126045555.GB31641@lst.de/
Tested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When a btrfs_bio gets split, only 'bbio->csum_search_commit_root' gets
copied to the new btrfs_bio, all the other flags don't.
When a bio is split in btrfs_submit_chunk(), btrfs_split_bio() creates
the new split bio via btrfs_bio_init() which zeroes the struct with
memset. Looking at btrfs_split_bio(), it copies csum_search_commit_root
from the original but does not copy can_use_append.
After the split, the code does:
bbio = split;
bio = &bbio->bio;
This means the split bio (with can_use_append = false) gets submitted,
not the original. In btrfs_submit_dev_bio(), the condition:
if (btrfs_bio(bio)->can_use_append && btrfs_dev_is_sequential(...))
Will be false for the split bio even when writing to a sequential zone.
Does the split bio need to inherit can_use_append from the original? The
old code used a local variable use_append which persisted across the
split.
Copy the rest of the flags as well.
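A reduced user-space model of the problem, as a sketch only: struct mini_bbio and split_bbio() are hypothetical and carry just the two flags named above. The new bio is zeroed first, so any flag that is not copied explicitly ends up false.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical, heavily reduced stand-in for struct btrfs_bio. */
struct mini_bbio {
	unsigned long long logical;
	unsigned int len;
	bool csum_search_commit_root;
	bool can_use_append;
};

/*
 * Split the first split_len bytes off orig into split. The memset()
 * mirrors an init helper that zeroes the whole struct, which is why
 * every flag has to be inherited by hand.
 */
static void split_bbio(struct mini_bbio *orig, struct mini_bbio *split,
		       unsigned int split_len)
{
	memset(split, 0, sizeof(*split));
	split->logical = orig->logical;
	split->len = split_len;
	split->csum_search_commit_root = orig->csum_search_commit_root;
	split->can_use_append = orig->can_use_append;

	/* The original now covers only the remainder. */
	orig->logical += split_len;
	orig->len -= split_len;
}
```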
Link: https://lore.kernel.org/linux-btrfs/20260125132120.2525146-1-clm@meta.com/
Reported-by: Chris Mason <clm@meta.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If when relocating a block group we find that `remap_bytes` > 0 in its
block group item, that means that it has been the destination block
group for another that has been remapped.
We need to search the remap tree for any remap backrefs within this
range, and move the data to a third block group. This is because
otherwise btrfs_translate_remap() could end up following an unbounded
chain of remaps, which would only get worse over time.
We only relocate one block group at a time, so `remap_bytes` will only
ever go down while we are doing this. Once we're finished we set the
REMAPPED flag on the block group, which will permanently prevent any
other data from being moved to within it.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The offload csum mode was introduced to allow developers to compare the
performance of generating checksums for data writes at different times:
- During btrfs_submit_chunk()
  This is the most common one. If any of the following conditions is
  met we go this path:
  * The csum is fast
    For now that means CRC32C and xxhash.
  * It's a synchronous write
  * Zoned
- Delay the checksum generation to a workqueue
However since commit dd57c78aec39 ("btrfs: introduce
btrfs_bio::async_csum") we no longer need to bother with either of them.
On an experimental build, async checksum generation in the background
will be faster anyway.
On a non-experimental build, the offload csum mode is not supported at
all.
Considering the async csum will be the new default, let's remove the
offload csum mode code.
There will be no impact on end users, as offload csum mode is still
an experimental feature.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This function already dereferences 'inode' multiple times earlier,
making the additional NULL check at line 840 redundant since the
function would have crashed already if inode were NULL.
After commit 81cea6cd7041 ("btrfs: remove btrfs_bio::fs_info by
extracting it from btrfs_bio::inode"), the btrfs_bio::inode field is
mandatory for all btrfs_bio allocations and is guaranteed to be
non-NULL.
Simplify the condition for allocating dummy checksums for zoned
NODATASUM data by removing the unnecessary 'inode &&' check.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Zhen Ni <zhen.ni@easystack.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In case of a zoned RAID, it can happen that a data write is targeting a
sequential write required zone and a conventional zone. In this case the
bio will be marked as REQ_OP_ZONE_APPEND but for the conventional zone,
this needs to be REQ_OP_WRITE.
The setting of REQ_OP_ZONE_APPEND is deferred to the last possible time in
btrfs_submit_dev_bio(), but the decision if we can use zone append is
cached in btrfs_bio.
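The per-device decision can be sketched like this. The enum and pick_write_op() are hypothetical stand-ins for REQ_OP_WRITE / REQ_OP_ZONE_APPEND and the check done at submission time; the point is that the cached "may use append" decision is combined with the type of the zone actually being written on each device.

```c
#include <stdbool.h>

/* Hypothetical op codes standing in for REQ_OP_WRITE / REQ_OP_ZONE_APPEND. */
enum write_op { OP_WRITE, OP_ZONE_APPEND };

/*
 * can_use_append is the decision cached in the bio; target_is_sequential
 * is looked up per device, as late as possible, because the copies of one
 * write may land on different zone types.
 */
static enum write_op pick_write_op(bool can_use_append,
				   bool target_is_sequential)
{
	if (can_use_append && target_is_sequential)
		return OP_ZONE_APPEND;
	return OP_WRITE;
}
```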
CC: Naohiro Aota <naohiro.aota@wdc.com>
Fixes: e9b9b911e03c ("btrfs: add raid stripe tree to features enabled with debug config")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When checksumming the encrypted bio on writes we need to know which
logical address this checksum is for. At the point where we get the
encrypted bio the bi_sector is the physical location on the target disk,
so we need to save the original logical offset in the btrfs_bio. Then
we can use this when checksumming the bio instead of the
bio->iter.bi_sector.
Note: The patch was taken from v5 of fscrypt patchset
(https://lore.kernel.org/linux-btrfs/cover.1706116485.git.josef@toxicpanda.com/)
which was handled over time by various people: Omar Sandoval, Sweet Tea
Dorminy, Josef Bacik.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since the read verification and read repair are all supporting bs > ps
without large folios now, we can enable encoded read/write/send.
Now we can relax the alignment in assert_bbio_alignment() to
min(blocksize, PAGE_SIZE).
But also add the extra blocksize based alignment check for the logical
and length of the bbio.
There is a pitfall in btrfs_add_compress_bio_folios(), which relies on
the folios passed in to meet the minimal folio order.
But now we can pass regular page sized folios in, update it to check
each folio's size instead of using the minimal folio size.
This allows btrfs_add_compress_bio_folios() to even handle folios array
with different sizes, thankfully we don't yet need to handle such crazy
situation.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The current read verification also relies on large folios to support
bs > ps cases, but that introduces quite some limits.
To enhance read-repair to support bs > ps without large folios:
- Make btrfs_data_csum_ok() accept an array of paddrs
  It can then pass the paddrs[] directly into
  btrfs_calculate_block_csum_pages().
- Make repair_one_sector() accept an array of paddrs
  So that it can submit a repair bio backed by regular pages, not only
  large folios.
  This requires us to allocate more slots at bio allocation time though.
  Also since the caller may have only partially advanced the saved_iter
  for bs > ps cases, we can not directly trust the logical bytenr from
  saved_iter (it can be unaligned), thus a manual round down is
  necessary for the logical bytenr.
- Make btrfs_check_read_bio() build an array of paddrs
  The tricky part is that we can only call btrfs_data_csum_ok() after
  all involved pages are assembled.
  This means at the call time of btrfs_check_read_bio(), our offset
  inside the bio is already at the end of the fs block.
  Thus we must re-calculate @bio_offset for btrfs_data_csum_ok() and
  repair_one_sector().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently btrfs_repair_io_failure() only accepts a single @paddr
parameter, and for bs > ps cases it's required that @paddr is backed by
a large folio.
That assumption has quite some limitations, preventing us from utilizing
true zero-copy direct-io and encoded read/writes.
To address the problem, enhance btrfs_repair_io_failure() by:
- Accept an array of paddrs, up to 64K / PAGE_SIZE entries
  This kind of acts like a bio_vec, but with very limited entries, as the
  function is only utilized to repair one fs data block, or a tree block.
  Both have an upper size limit (BTRFS_MAX_BLOCK_SIZE, i.e. 64K), so we
  don't need the full bio_vec thing to handle it.
- Allocate a bio with multiple slots
  Previously even for bs > ps cases, we only passed in a contiguous
  physical address range, thus a single slot was enough.
  But not anymore, so we have to allocate a bio structure rather than
  using the on-stack one.
- Use on-stack memory for the @paddrs array
  It's at most 16 pages (4K page size, 64K block size), taking up at
  most 128 bytes.
  I think the on-stack cost is still acceptable.
- Add one extra check to make sure the repair bio is exactly one block
- Utilize btrfs_repair_io_failure() to submit a single bio for metadata
  This should improve the read-repair performance for metadata, as now
  we submit a node sized bio then wait, rather than submitting each
  block of the metadata and waiting for each submitted block.
- Add one extra parameter indicating the step
  This is due to the fact that the metadata step can be as large as
  nodesize, instead of sectorsize.
  So we need a way to distinguish metadata and data repair.
- Reduce the width of @length parameter of btrfs_repair_io_failure()
  Since we only call btrfs_repair_io_failure() on a single data or
  metadata block, u64 is overkill.
  Use u32 instead and add one extra ASSERT() to make sure the length
  never exceeds BTRFS_MAX_BLOCK_SIZE.
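The on-stack cost quoted above can be checked with a few lines of arithmetic. This is a sketch, assuming 8-byte physical addresses (a 64-bit phys_addr_t); MAX_BLOCK_SIZE and the helper names are local to the example.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_BLOCK_SIZE	(64u * 1024)	/* mirrors the 64K upper limit */

/* Page-sized physical addresses needed to cover one max-size block. */
static size_t max_paddr_slots(unsigned int page_size)
{
	return MAX_BLOCK_SIZE / page_size;
}

/* Bytes such an array takes, assuming 8-byte physical addresses. */
static size_t paddr_array_bytes(unsigned int page_size)
{
	return max_paddr_slots(page_size) * sizeof(uint64_t);
}
```

For 4K pages this gives 16 slots and 128 bytes; larger page sizes only shrink the array.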
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
There is a report that memory allocation failed for btrfs_bio::csum
during a large read:
b2sum: page allocation failure: order:4, mode:0x40c40(GFP_NOFS|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
CPU: 0 UID: 0 PID: 416120 Comm: b2sum Tainted: G W 6.17.0 #1 NONE
Tainted: [W]=WARN
Hardware name: Raspberry Pi 4 Model B Rev 1.5 (DT)
Call trace:
show_stack+0x18/0x30 (C)
dump_stack_lvl+0x5c/0x7c
dump_stack+0x18/0x24
warn_alloc+0xec/0x184
__alloc_pages_slowpath.constprop.0+0x21c/0x730
__alloc_frozen_pages_noprof+0x230/0x260
___kmalloc_large_node+0xd4/0xf0
__kmalloc_noprof+0x1c8/0x260
btrfs_lookup_bio_sums+0x214/0x278
btrfs_submit_chunk+0xf0/0x3c0
btrfs_submit_bbio+0x2c/0x4c
submit_one_bio+0x50/0xac
submit_extent_folio+0x13c/0x340
btrfs_do_readpage+0x4b0/0x7a0
btrfs_readahead+0x184/0x254
read_pages+0x58/0x260
page_cache_ra_unbounded+0x170/0x24c
page_cache_ra_order+0x360/0x3bc
page_cache_async_ra+0x1a4/0x1d4
filemap_readahead.isra.0+0x44/0x74
filemap_get_pages+0x2b4/0x3b4
filemap_read+0xc4/0x3bc
btrfs_file_read_iter+0x70/0x7c
vfs_read+0x1ec/0x2c0
ksys_read+0x4c/0xe0
__arm64_sys_read+0x18/0x24
el0_svc_common.constprop.0+0x5c/0x130
do_el0_svc+0x1c/0x30
el0_svc+0x30/0xa0
el0t_64_sync_handler+0xa0/0xe4
el0t_64_sync+0x198/0x19c
[CAUSE]
Btrfs needs to allocate memory for btrfs_bio::csum for large reads, so
that we can later verify the contents of the read.
However nowadays a read bio can easily go beyond BIO_MAX_VECS *
PAGE_SIZE (which is 1M for 4K page sizes), because with multi-page
bvecs one bvec can cover more than one page, as long as the pages are
physically adjacent.
This will become more common when the large folio support is moved out
of experimental features.
In the above case, a read larger than 4MiB with SHA256 checksum (32
bytes for each 4K block) will be able to trigger an order 4 allocation.
The order 4 is larger than PAGE_ALLOC_COSTLY_ORDER (3), thus without
extra flags such allocation will not retry.
And if the system has a very small amount of memory (e.g. RPI4 with low
memory spec, or VMs with small vRAM), or the memory is heavily
fragmented, such allocation will fail and cause the above warning.
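The order arithmetic can be reproduced with a small sketch. It assumes 4K pages, 4K blocks and 32-byte SHA256 checksums as stated above; alloc_order() is a simplified model of "smallest power-of-two number of pages that fits", not the allocator's actual code.

```c
#include <stddef.h>

#define PAGE_SIZE_4K		4096u
#define BLOCK_SIZE_4K		4096u
#define SHA256_CSUM_SIZE	32u

/* Bytes of checksum needed to verify a read of read_bytes. */
static size_t csum_bytes(size_t read_bytes)
{
	return (read_bytes / BLOCK_SIZE_4K) * SHA256_CSUM_SIZE;
}

/* Smallest order such that (PAGE_SIZE << order) covers size bytes. */
static unsigned int alloc_order(size_t size)
{
	unsigned int order = 0;

	while (((size_t)PAGE_SIZE_4K << order) < size)
		order++;
	return order;
}
```

A read of exactly 4MiB needs 32KiB of checksums, which still fits order 3; one more block pushes the buffer to order 4.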
[FIX]
Although btrfs is handling the memory allocation failure correctly, we
do not really need the physically contiguous memory just to restore
our checksum.
In fact btrfs_csum_one_bio() is already using kvzalloc() to reduce the
memory pressure.
So follow the step to use kvcalloc() for btrfs_bio::csum.
Reported-by: Calvin Owens <calvin@wbinvd.org>
Link: https://lore.kernel.org/linux-btrfs/20251105180054.511528-1-calvin@wbinvd.org/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[ENHANCEMENT]
Btrfs currently calculates data checksums then submits the bio.
But after commit 968f19c5b1b7 ("btrfs: always fallback to buffered write
if the inode requires checksum"), any writes with data checksum will
fallback to buffered IO, meaning the content will not change during
writeback.
This means we're safe to calculate the data checksum and submit the bio
in parallel, and only need the following new behavior:
- Wait for the csum generation to finish before calling
  btrfs_bio::end_io()
  Otherwise this can lead to use-after-free for the csum generation
  worker.
- Save the current bi_iter for csum_one_bio()
  As the submission part can advance btrfs_bio::bio.bi_iter, if it is
  not saved csum_one_bio() may get an empty bi_iter and not generate
  any checksum.
Unfortunately this means we have to increase the size of btrfs_bio by
16 bytes, but this is still acceptable.
As usual, such a new feature is hidden behind the experimental flag.
[THEORETICAL ANALYSIS]
Consider the following theoretical hardware performance, which should be
more or less close to modern mainstream hardware:
Memory bandwidth: 50GiB/s
CRC32C bandwidth: 45GiB/s
SSD bandwidth: 8GiB/s
Then write bandwidth with data checksum before the patch is:
1 / ( 1 / 50 + 1 / 45 + 1 / 8) = 5.98 GiB/s
After the patch, the bandwidth is:
1 / ( 1 / 50 + max( 1 / 45, 1 / 8)) = 6.90 GiB/s
The difference is 15.32% improvement.
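The two estimates can be recomputed with a short sketch, using the bandwidth figures listed above. The model is the simple one from this analysis: stages that run back to back add their per-byte times, while checksumming that overlaps with the device write only costs the slower of the two. The function names are local to the example.

```c
/* Throughput when all stages run one after another: the times add up. */
static double serial_bw(double mem, double csum, double ssd)
{
	return 1.0 / (1.0 / mem + 1.0 / csum + 1.0 / ssd);
}

/* Throughput when csum overlaps with the device write: the slower one wins. */
static double overlapped_bw(double mem, double csum, double ssd)
{
	double t_csum = 1.0 / csum;
	double t_ssd = 1.0 / ssd;

	return 1.0 / (1.0 / mem + (t_csum > t_ssd ? t_csum : t_ssd));
}
```

With 50, 45 and 8 GiB/s this gives roughly 5.98 and 6.90 GiB/s, i.e. a gain of a bit over 15%.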
[REAL WORLD BENCHMARK]
I'm using a Zen5 (HX 370) as the host, the VM has 4GiB memory, 10 vCPUs, the
storage is backed by a PCIe gen3 x4 NVMe.
The test is a direct IO write, with 1MiB block size, write 7GiB data
into a btrfs mount with data checksum. Thus the direct write will
fallback to buffered one:
Vanilla Datasum: 1619.97 MiB/s
Patched Datasum: 1792.26 MiB/s
Diff +10.6 %
In my case, the bottleneck is the storage, thus the improvement does not
reach the theoretical one, but it is still observable.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BACKGROUND]
Btrfs has a lot of different bi_end_io functions, to handle different
raid profiles. But they introduced a lot of different contexts for
btrfs_bio::end_io() calls:
- Simple read bios
  Run in task context, backed by either endio_meta_workers or
  endio_workers.
- Simple write bios
  Run in IRQ context.
- RAID56 write or rebuild bios
  Run in task context, backed by rmw_workers.
- Mirrored write bios
  Run in IRQ context.
This is inconsistent, and contributes to the number of workqueues used
in btrfs.
[ENHANCEMENT]
Make all the above bios call their btrfs_bio::end_io() in task context,
backed by either endio_meta_workers for metadata, or endio_workers for
data.
For simple write bios, merge the handling into simple_end_io_work().
For mirrored write bios, it will be a little more complex, since both
the original or the cloned bios can run the final btrfs_bio::end_io().
Here we make sure the cloned bios are using btrfs_bioset, to reuse the
end_io_work, and run both original and cloned work inside the workqueue.
Add extra ASSERT()s to make sure btrfs_bio_end_io() is running in task
context.
This not only unifies the context for btrfs_bio::end_io() functions, but
also opens a new door for further btrfs_bio::end_io() related cleanups.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently there is only one caller which doesn't populate
btrfs_bio::inode, and that's scrub.
The idea is scrub doesn't want any automatic csum verification nor
read-repair, as everything will be handled by scrub itself.
However that behavior is really no different than metadata inode, thus
we can reuse btree_inode as btrfs_bio::inode for scrub.
The only exception is in btrfs_submit_chunk() where if a bbio is from
scrub or data reloc inode, we set rst_search_commit_root to true.
This means we still need a way to distinguish scrub from metadata, but
that can be done by a new flag inside btrfs_bio.
Now that btrfs_bio::inode is a mandatory parameter, we can extract
fs_info from that inode and thus remove btrfs_bio::fs_info, saving 8
bytes from the btrfs_bio structure.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The unlikely() annotation is a static prediction hint that the compiler
may use to move code out of the hot path. We use it elsewhere (namely
tree-checker.c) for error branches that almost never happen, where
EIO is one of them.
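For illustration, the annotation on an EIO branch looks like this in a user-space sketch. The macro has the same shape as the kernel's definition and needs GCC or Clang; check_block() is a hypothetical function made up for the example.

```c
#include <errno.h>
#include <stddef.h>

/* Tell the compiler the condition is expected to be false. */
#define unlikely(x)	__builtin_expect(!!(x), 0)

/* Hypothetical checker: the error branch is marked as the cold path. */
static int check_block(const unsigned char *buf, size_t len)
{
	if (unlikely(buf == NULL || len == 0))
		return -EIO;
	return 0;
}
```

The hint does not change behavior, only the layout the compiler may pick for the two branches.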
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Btrfs uses btrfs_bio to handle read/write of logical address, for the
incoming bs > ps support, btrfs has extra requirements:
- One folio must contain at least one fs block
- No fs block can cross folio boundaries
This requirement is not hard to maintain, thanks to the address space's
minimal folio order.
But not all btrfs bios are generated through address space, e.g.
compression and scrub.
To catch possible unaligned bios, introduce a helper,
assert_bbio_alignment(), for each btrfs_bio in btrfs_submit_bbio().
This will check the following things:
- bv_offset is aligned to block size
- bv_len is aligned to block size
With a btrfs bio passing above checks, unless it's empty it will ensure
the requirements for bs > ps support.
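The two checks amount to the following, shown as a user-space sketch. struct seg is a hypothetical reduction of a bio segment to its offset and length; the real helper walks the bio's bvecs and asserts instead of returning a value.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reduced bio segment: offset and length inside one folio. */
struct seg {
	unsigned int offset;
	unsigned int len;
};

/*
 * True if every segment starts and ends on a block boundary.
 * blocksize must be a power of two.
 */
static bool segs_block_aligned(const struct seg *segs, size_t nr,
			       unsigned int blocksize)
{
	for (size_t i = 0; i < nr; i++) {
		if ((segs[i].offset | segs[i].len) & (blocksize - 1))
			return false;
	}
	return true;
}
```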
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently if we want to iterate a bio in block unit, we do something
like this:
while (iter->bi_size) {
	struct bio_vec bv = bio_iter_iovec(&bbio->bio, *iter);

	/* Do something using the bv */
	bio_advance_iter_single(&bbio->bio, iter, sectorsize);
}
That's fine for now, but it will not handle future bs > ps, as
bio_iter_iovec() returns a single-page bvec, meaning the bv_len will not
exceed page size.
This means the code using that bv can only handle a block if bs <= ps.
To address this problem and handle future bs > ps cases better:
- Introduce a helper btrfs_bio_for_each_block()
  Instead of bio_vec, which has single-page and multi-page versions
  (and the multi-page version has quite some limits), use my favorite
  way to represent a block, phys_addr_t.
  For bs <= ps cases, nothing is changed, except we add a very small
  overhead to convert phys_addr_t to a folio, then use the proper
  folio helpers to handle the possible highmem cases.
  For bs > ps cases, all blocks will be backed by large folios, meaning
  every folio will cover at least one block. And still use proper folio
  helpers to handle highmem cases.
  With phys_addr_t, we will handle both large folio and highmem
  properly. So there is no better single variable to represent a btrfs
  block than phys_addr_t.
- Extract the data block csum calculation into a helper
The new helper, btrfs_calculate_block_csum() will be utilized by
btrfs_csum_one_bio().
- Use btrfs_bio_for_each_block() to replace existing call sites
Including:
* index_one_bio() from raid56.c
Very straight-forward.
* btrfs_check_read_bio()
Also update repair_one_sector() to grab the folio using phys_addr_t,
and do extra checks to make sure the folio covers at least one
block.
We do not need to bother bv_len at all now.
* btrfs_csum_one_bio()
Now we can move the highmem handling into a dedicated helper,
calculate_block_csum(), and use btrfs_bio_for_each_block() helper.
There is one exception in btrfs_decompress_buf2page(), which is copying
decompressed data into the original bio, which is not iterating using
block size thus we don't need to bother.
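The block-by-block walk keyed on a physical address can be modelled in user space as below. This is a sketch under stated assumptions, not the helper itself: struct phys_range and collect_block_paddrs() are hypothetical, each range stands in for one physically contiguous multi-page bvec, and every range length is assumed to be a multiple of the block size so that no block straddles two ranges.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical physically contiguous range (one multi-page bvec). */
struct phys_range {
	uint64_t paddr;
	uint32_t len;
};

/*
 * Store the physical address of every block into out[] and return how
 * many were stored (at most out_cap).
 */
static size_t collect_block_paddrs(const struct phys_range *ranges, size_t nr,
				   uint32_t blocksize, uint64_t *out,
				   size_t out_cap)
{
	size_t n = 0;

	for (size_t i = 0; i < nr; i++) {
		for (uint32_t off = 0; off < ranges[i].len; off += blocksize) {
			if (n == out_cap)
				return n;
			out[n++] = ranges[i].paddr + off;
		}
	}
	return n;
}
```

Each address produced this way identifies one block regardless of how many pages the underlying range spans, which is the property the single-page bv_len could not provide.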
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently for btrfs checksum verification, we do it in the following
pattern:
kaddr = kmap_local_*();
ret = btrfs_check_sector_csum(kaddr);
kunmap_local(kaddr);
It's OK for now, but it's still not following the patterns of helpers
inside linux/highmem.h, which never requires a virt memory address.
In those highmem helpers, they mostly accept a folio, some offset/length
inside the folio, and in the implementation they check if the folio
needs partial kmap, and do the handling.
Inspired by those formal highmem helpers, enhance the highmem handling
of data checksum verification by:
- Rename btrfs_check_sector_csum() to btrfs_check_block_csum()
To follow the more common term "block" used in all other major
filesystems.
- Pass a physical address into btrfs_check_block_csum() and
btrfs_data_csum_ok()
The physical address is always available even for a highmem page,
since it's (page frame number << PAGE_SHIFT) + offset in page.
And with that physical address, we can grab the folio covering the
page, and do extra checks to ensure it covers at least one block.
This also allows us to do the kmap inside btrfs_check_block_csum().
This means all the extra HIGHMEM handling will be concentrated into
btrfs_check_block_csum(), and no callers will need to bother highmem
by themselves.
- Properly zero out the block if csum mismatch
Since btrfs_data_csum_ok() only got a paddr, we can not and should not
use memzero_bvec(), which only accepts single page bvec.
Instead use paddr to grab the folio and call folio_zero_range()
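The relation between a physical address, the page frame number and the in-page offset is plain arithmetic, sketched here assuming 4K pages. The macro and function names are local to the example.

```c
#include <stdint.h>

#define PAGE_SHIFT_4K	12	/* assumes 4K pages */
#define PAGE_MASK_4K	((UINT64_C(1) << PAGE_SHIFT_4K) - 1)

/* Compose a physical address from a page frame number and an offset. */
static uint64_t make_paddr(uint64_t pfn, uint32_t offset_in_page)
{
	return (pfn << PAGE_SHIFT_4K) + offset_in_page;
}

/* Recover the page frame number, which is enough to find the page/folio. */
static uint64_t paddr_to_pfn(uint64_t paddr)
{
	return paddr >> PAGE_SHIFT_4K;
}

/* Recover the offset inside the page. */
static uint32_t paddr_to_offset(uint64_t paddr)
{
	return (uint32_t)(paddr & PAGE_MASK_4K);
}
```

Because the mapping is reversible without any kernel virtual address, a single paddr is enough for the callee to find the folio and do the kmap itself.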
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If you run a workload with:
- a cgroup that does tons of parallel data reading, with a working set
much larger than its memory limit
- a second cgroup that writes relatively fewer files, with overwrites,
with no memory limit
(see full code listing at the bottom for a reproducer)
Then what quickly occurs is:
- we have a large number of threads trying to read the csum tree
- we have a decent number of threads deleting csums running delayed refs
- we have a large number of threads in direct reclaim and thus high
memory pressure
The result of this is that we writeback the csum tree repeatedly mid
transaction, to get back the extent_buffer folios for reclaim. As a
result, we repeatedly COW the csum tree for the delayed refs that are
deleting csums. This means repeatedly write locking the higher levels of
the tree.
As a result of this, we achieve an unpleasant priority inversion. We
have:
- a high degree of contention on the csum root node (and other upper
nodes) eb rwsem
- a memory starved cgroup doing tons of reclaim on CPU.
- many reader threads in the memory starved cgroup "holding" the sem
as readers, but not scheduling promptly. i.e., task __state == 0, but
not running on a cpu.
- btrfs_commit_transaction stuck trying to acquire the sem as a writer.
(running delayed_refs, deleting csums for unreferenced data extents)
This results in arbitrarily long transactions. This then results in
seriously degraded performance for any cgroup using the filesystem (the
victim cgroup in the script).
It isn't an academic problem, as we see this exact problem in production
at Meta with one cgroup over its memory limit ruining btrfs performance
for the whole system, stalling critical system services that depend on
btrfs syncs.
The underlying scheduling "problem" with global rwsems is sort of thorny
and apparently well known and was discussed at LPC 2024, for example.
As a result, our main lever in the short term is just trying to reduce
contention on our various rwsems with an eye to reducing the frequency
of write locking, to avoid disabling the read lock fast acquisition path.
Luckily, it seems likely that many reads are for old extents written
many transactions ago, and that for those we *can* in fact search the
commit root. The commit_root_sem only gets taken for write once, near
the end of transaction commit, no matter how much memory pressure there
is, so we have much less contention between readers and writers.
This change detects when we are trying to read an old extent (according
to extent map generation) and then wires that through bio_ctrl to the
btrfs_bio, which unfortunately isn't allocated yet when we have this
information. When we go to lookup the csums in lookup_bio_sums we can
check this condition on the btrfs_bio and do the commit root lookup
accordingly.
Note that a single bio_ctrl might collect a few extent_maps into a single
bio, so it is important to track a maximum generation across all the
extent_maps used for each bio to make an accurate decision on whether it
is valid to look in the commit root. If any extent_map is updated in the
current generation, we can't use the commit root.
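The generation tracking can be sketched as follows. struct mini_bio_ctrl and both helpers are hypothetical reductions made up for the example, and the exact comparison used in the kernel is an assumption here; the sketch only encodes the rule stated above, that one extent map from the current generation disqualifies the whole bio.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reduced bio_ctrl: newest extent map generation in the bio. */
struct mini_bio_ctrl {
	uint64_t max_generation;
};

/* Called for every extent map that contributes pages to the bio. */
static void track_extent_map(struct mini_bio_ctrl *ctrl, uint64_t em_generation)
{
	if (em_generation > ctrl->max_generation)
		ctrl->max_generation = em_generation;
}

/* The commit root is only valid if nothing was updated in the running transaction. */
static bool can_search_commit_root(const struct mini_bio_ctrl *ctrl,
				   uint64_t running_generation)
{
	return ctrl->max_generation < running_generation;
}
```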
To test and reproduce this issue, I used the following script and
accompanying C program (to avoid bottlenecks in constantly forking
thousands of dd processes):
====== big-read.c ======
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <errno.h>

#define BUF_SZ (128 * (1 << 10UL))

int read_once(int fd, size_t sz) {
	char buf[BUF_SZ];
	size_t rd = 0;
	int ret = 0;

	while (rd < sz) {
		ret = read(fd, buf, BUF_SZ);
		if (ret < 0) {
			if (errno == EINTR)
				continue;
			fprintf(stderr, "read failed: %d\n", errno);
			return -errno;
		} else if (ret == 0) {
			break;
		} else {
			rd += ret;
		}
	}
	return rd;
}

int read_loop(char *fname) {
	int fd;
	struct stat st;
	size_t sz = 0;
	int ret;

	while (1) {
		fd = open(fname, O_RDONLY);
		if (fd == -1) {
			perror("open");
			return 1;
		}
		if (!sz) {
			if (!fstat(fd, &st)) {
				sz = st.st_size;
			} else {
				perror("stat");
				return 1;
			}
		}
		ret = read_once(fd, sz);
		close(fd);
	}
}

int main(int argc, char *argv[]) {
	int fd;
	struct stat st;
	off_t sz;
	char *buf;
	int ret;

	if (argc != 2) {
		fprintf(stderr, "Usage: %s <filename>\n", argv[0]);
		return 1;
	}
	return read_loop(argv[1]);
}
====== repro.sh ======
#!/usr/bin/env bash

SCRIPT=$(readlink -f "$0")
DIR=$(dirname "$SCRIPT")

dev=$1
mnt=$2
shift
shift

CG_ROOT=/sys/fs/cgroup
BAD_CG=$CG_ROOT/bad-nbr
GOOD_CG=$CG_ROOT/good-nbr
NR_BIGGOS=1
NR_LITTLE=10
NR_VICTIMS=32
NR_VILLAINS=512

START_SEC=$(date +%s)

_elapsed() {
	echo "elapsed: $(($(date +%s) - $START_SEC))"
}

_stats() {
	local sysfs=/sys/fs/btrfs/$(findmnt -no UUID $dev)

	echo "================"
	date
	_elapsed
	cat $sysfs/commit_stats
	cat $BAD_CG/memory.pressure
}

_setup_cgs() {
	echo "+memory +cpuset" > $CG_ROOT/cgroup.subtree_control
	mkdir -p $GOOD_CG
	mkdir -p $BAD_CG
	echo max > $BAD_CG/memory.max
	# memory.high much less than the working set will cause heavy reclaim
	echo $((1 << 30)) > $BAD_CG/memory.high

	# victims get a subset of villain CPUs
	echo 0 > $GOOD_CG/cpuset.cpus
	echo 0,1,2,3 > $BAD_CG/cpuset.cpus
}
_kill_cg() {
	local cg=$1
	local attempts=0
	echo "kill cgroup $cg"
	[ -f $cg/cgroup.procs ] || return
	while true; do
		attempts=$((attempts + 1))
		echo 1 > $cg/cgroup.kill
		sleep 1
		procs=$(wc -l $cg/cgroup.procs | cut -d' ' -f1)
		[ $procs -eq 0 ] && break
	done
	rmdir $cg
	echo "killed cgroup $cg in $attempts attempts"
}

_biggo_vol() {
	echo $mnt/biggo_vol.$1
}

_biggo_file() {
	echo $(_biggo_vol $1)/biggo
}

_subvoled_biggos() {
	total_sz=$((10 << 30))
	per_sz=$((total_sz / $NR_VILLAINS))
	dd_count=$((per_sz >> 20))
	echo "create $NR_VILLAINS subvols with a file of size $per_sz bytes for a total of $total_sz bytes."
	for i in $(seq $NR_VILLAINS)
	do
		btrfs subvol create $(_biggo_vol $i) &>/dev/null
		dd if=/dev/zero of=$(_biggo_file $i) bs=1M count=$dd_count &>/dev/null
	done
	echo "done creating subvols."
}

_setup() {
	[ -f .done ] && rm .done
	findmnt -n $dev && exit 1
	if [ -f .re-mkfs ]; then
		mkfs.btrfs -f -m single -d single $dev >/dev/null || exit 2
	else
		echo "touch .re-mkfs to populate the test fs"
	fi

	mount -o noatime $dev $mnt || exit 3
	[ -f .re-mkfs ] && _subvoled_biggos
	_setup_cgs
}
_my_cleanup() {
echo "CLEANUP!"
_kill_cg $BAD_CG
_kill_cg $GOOD_CG
sleep 1
umount $mnt
}
_bad_exit() {
# save the status first; _stats would otherwise overwrite $?
local rc=$?
echo "Unexpected Exit! $rc" >&2
_stats
exit $rc
}
trap _my_cleanup EXIT
trap _bad_exit INT TERM
_setup
# Use a lot of page cache reading the big file
_villain() {
local i=$1
echo $BASHPID > $BAD_CG/cgroup.procs
$DIR/big-read $(_biggo_file $i)
}
# Hit del_csum a lot by overwriting lots of small new files
_victim() {
echo $BASHPID > $GOOD_CG/cgroup.procs
i=0;
while (true)
do
local tmp=$mnt/tmp.$i
dd if=/dev/zero of=$tmp bs=4k count=2 >/dev/null 2>&1
i=$((i+1))
[ $i -eq $NR_LITTLE ] && i=0
done
}
_one_sync() {
echo "sync..."
before=$(date +%s)
sync
after=$(date +%s)
echo "sync done in $((after - before))s"
_stats
}
# sync in a loop
_sync() {
echo "start sync loop"
syncs=0
echo $BASHPID > $GOOD_CG/cgroup.procs
while true
do
[ -f .done ] && break
_one_sync
syncs=$((syncs + 1))
[ -f .done ] && break
sleep 10
done
if [ $syncs -eq 0 ]; then
echo "do at least one sync!"
_one_sync
fi
echo "sync loop done."
}
_sleep() {
local time=${1-60}
local now=$(date +%s)
local end=$((now + time))
while [ $now -lt $end ];
do
echo "SLEEP: $((end - now))s left. Sleep 10."
sleep 10
now=$(date +%s)
done
}
echo "start $NR_VILLAINS villains"
for i in $(seq $NR_VILLAINS)
do
_villain $i &
disown # get rid of annoying log on kill (done via cgroup anyway)
done
echo "start $NR_VICTIMS victims"
for i in $(seq $NR_VICTIMS)
do
_victim &
disown
done
_sync &
SYNC_PID=$!
_sleep $1
_elapsed
touch .done
wait $SYNC_PID
echo "OK"
exit 0
Without this patch, that reproducer:
- Ran for 6+ minutes instead of 60s
- Hung hundreds of threads in D state on the csum reader lock
- Got a commit stuck for 3 minutes
sync done in 388s
================
Wed Jul 9 09:52:31 PM UTC 2025
elapsed: 420
commits 2
cur_commit_ms 0
last_commit_ms 159446
max_commit_ms 159446
total_commit_ms 160058
some avg10=99.03 avg60=98.97 avg300=75.43 total=418033386
full avg10=82.79 avg60=80.52 avg300=59.45 total=324995274
419 hits state R, D comms big-read
btrfs_tree_read_lock_nested
btrfs_read_lock_root_node
btrfs_search_slot
btrfs_lookup_csum
btrfs_lookup_bio_sums
btrfs_submit_bbio
1 hits state D comms btrfs-transacti
btrfs_tree_lock_nested
btrfs_lock_root_node
btrfs_search_slot
btrfs_del_csums
__btrfs_run_delayed_refs
btrfs_run_delayed_refs
With the patch, the reproducer exits naturally, in 65s, completing a
pretty decent 4 commits, despite heavy memory pressure. Occasionally you
can still trigger a rather long commit (a couple of seconds) but never one
that is minutes long.
sync done in 3s
================
elapsed: 65
commits 4
cur_commit_ms 0
last_commit_ms 485
max_commit_ms 689
total_commit_ms 2453
some avg10=98.28 avg60=64.54 avg300=19.39 total=64849893
full avg10=74.43 avg60=48.50 avg300=14.53 total=48665168
Some random rwalker samples showed the most common stack in reclaim,
rather than the csum tree:
145 hits state R comms bash, sleep, dd, shuf
shrink_folio_list
shrink_lruvec
shrink_node
do_try_to_free_pages
try_to_free_mem_cgroup_pages
reclaim_high
Link: https://lpc.events/event/18/contributions/1883/
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The RCU protection is now done in the plain helpers, we can remove the
"_in_rcu" and "_rl_in_rcu".
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The RCU protection is now done in the plain helpers, we can remove the
"_in_rcu" and "_rl_in_rcu".
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
With all the preparation patches already merged, it's pretty easy to
enable large data folios:
- Remove the ASSERT() on folio size in btrfs_end_repair_bio()
- Add a helper to properly set the max folio order
Currently due to several call sites that are fetching the bitmap
content directly into an unsigned long, we can only support
BITS_PER_LONG blocks for each bitmap.
- Call the helper when reading/creating an inode
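As a rough illustration of the BITS_PER_LONG limit above, the following userspace sketch computes the largest folio order for which one bit per block still fits in a single unsigned long. The names sketch_ilog2() and sketch_max_folio_order() and their parameters are hypothetical and are not the helper added by this patch; only the arithmetic is meant to carry over.

```c
#include <assert.h>

/* Integer log2, rounded down. The argument must be non-zero. */
unsigned int sketch_ilog2(unsigned long v)
{
    unsigned int r = 0;

    assert(v != 0);
    while (v >>= 1)
        r++;
    return r;
}

/*
 * Hypothetical helper: largest folio order whose blocks can all be
 * tracked by a bitmap of bits_per_long bits (one bit per block).
 *
 * bits_per_long:  number of bits in an unsigned long (32 or 64)
 * blocksize_bits: log2 of the filesystem block size
 * page_shift:     log2 of the page size
 */
unsigned int sketch_max_folio_order(unsigned int bits_per_long,
                                    unsigned int blocksize_bits,
                                    unsigned int page_shift)
{
    /* Bytes covered when every bit of the bitmap maps one block. */
    unsigned long max_bytes = (unsigned long)bits_per_long << blocksize_bits;
    /* The same range expressed in pages. */
    unsigned long max_pages = max_bytes >> page_shift;

    /* Less than one page is covered: fall back to order 0. */
    if (max_pages == 0)
        return 0;
    return sketch_ilog2(max_pages);
}
```

Assuming 64-bit longs with 4K blocks and 4K pages, the bitmap covers 64 * 4K = 256K, which is 64 pages, so the sketch yields order 6.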
The support has the following limitations:
- No large folios for data reloc inode
The relocation code still requires page-sized folios.
But relocation is neither as hot nor as common as regular buffered IO.
This will be improved in the future.
- Requires CONFIG_BTRFS_EXPERIMENTAL
- Will require all folio related operations to check if it needs the
extra btrfs_subpage structure
Now any folio larger than block size will need btrfs_subpage structure
handling.
Unfortunately I do not have a physical machine for performance testing, but
if this behaves like XFS/EXT4, it should mostly bring a single-digit
percentage performance improvement in the real world.
Although I believe there are still quite a few optimizations to be done,
let's focus on testing the current large data folio support first.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Another batch of pointer parameter constifications. This is for clarity
and a minor addition to type safety. There are no observable effects in the
assembly code and .ko measured on release config.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The type blk_status_t is from the block layer and not related to checksums
in our context. Use int internally and do the conversions to blk_status_t
as needed in btrfs_submit_chunk().
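A minimal userspace sketch of that split, assuming stand-in types: internal helpers return 0 or a negative errno, and only the outermost submit function converts to a status code. All sketch_* names and the SKETCH_BLK_STS_* values are placeholders for illustration and do not match the kernel's definitions; in the kernel the conversion is done by errno_to_blk_status().

```c
#include <assert.h>
#include <errno.h>

/* Userspace stand-in for the block layer status type (assumption). */
typedef unsigned char sketch_blk_status_t;

/* Placeholder values, not the kernel's BLK_STS_* numbers. */
#define SKETCH_BLK_STS_OK       0
#define SKETCH_BLK_STS_RESOURCE 1
#define SKETCH_BLK_STS_IOERR    2

/* Stand-in for errno_to_blk_status(): 0 or negative errno in, status out. */
sketch_blk_status_t sketch_errno_to_status(int err)
{
    assert(err <= 0);
    switch (err) {
    case 0:
        return SKETCH_BLK_STS_OK;
    case -ENOMEM:
        return SKETCH_BLK_STS_RESOURCE;
    default:
        return SKETCH_BLK_STS_IOERR;
    }
}

/* Hypothetical internal step: plain int, 0 on success or negative errno. */
int sketch_csum_step(int simulate_enomem)
{
    return simulate_enomem ? -ENOMEM : 0;
}

/*
 * Boundary function: the only place where the internal int error is
 * converted to the block layer status type.
 */
sketch_blk_status_t sketch_submit(int simulate_enomem)
{
    int ret = sketch_csum_step(simulate_enomem);

    return sketch_errno_to_status(ret);
}
```

Keeping the conversion at a single boundary means the internal helpers can share the usual 'ret' error handling with the rest of the filesystem code.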
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We can now rename 'error' to 'ret' and use it for generic errors.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We're using 'status' for the blk_status_t variables. Rename 'ret' so we
can use it for the proper return type.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The type blk_status_t is from the block layer and not related to checksums
in our context. Use int internally and do the conversions to blk_status_t
as needed.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The type blk_status_t is from the block layer and not related to checksums
in our context. Use int internally and do the conversions to blk_status_t
as needed in btrfs_bio_csum().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The type blk_status_t is from the block layer and not related to checksums
in our context. Use int internally and do the conversions to blk_status_t
as needed in btrfs_bio_csum().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The type blk_status_t is from the block layer and not related to checksums
in our context. Use int internally and do the conversions to blk_status_t
as needed in btrfs_submit_chunk().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Using physical address has the following advantages:
- All involved callers only need a single pointer
Instead of the old @folio + @offset pair.
- No complex poking into the bio_vec structure
As a bio_vec can be single or multiple paged, grabbing the real page
can be quite complex if the bio_vec is a multi-page one.
Instead bvec_phys() will always give a single physical address, and it
can be easily converted to a page.
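The conversion from a physical address back to a page is plain arithmetic. The sketch below shows it in userspace, assuming 4K pages; struct sketch_page_ref and the sketch_* functions are hypothetical stand-ins, not kernel APIs (the kernel has its own helpers such as PHYS_PFN() and offset_in_page() for the same split).

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page geometry for the sketch: 4K pages. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (UINT64_C(1) << SKETCH_PAGE_SHIFT)

/* Hypothetical result type: page frame number plus offset inside the page. */
struct sketch_page_ref {
    uint64_t pfn;
    uint32_t offset;
};

/* Split one physical address into (page frame, offset in page). */
struct sketch_page_ref sketch_phys_to_page_ref(uint64_t paddr)
{
    struct sketch_page_ref ref;

    ref.pfn = paddr >> SKETCH_PAGE_SHIFT;
    ref.offset = (uint32_t)(paddr & (SKETCH_PAGE_SIZE - 1));
    return ref;
}

/* Returns 1 if [paddr, paddr + len) stays inside a single page, else 0. */
int sketch_block_in_one_page(uint64_t paddr, uint32_t len)
{
    assert(len > 0);
    return (paddr >> SKETCH_PAGE_SHIFT) ==
           ((paddr + len - 1) >> SKETCH_PAGE_SHIFT);
}
```

A caller that used to pass a folio and an offset can pass the single address instead and recover both parts on demand.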
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Do not duplicate the cleanup after failed initialization
in btrfs_bioset_init() and reuse the exit function btrfs_bioset_exit().
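The general shape of this cleanup pattern can be sketched in userspace as below. It assumes the exit path tolerates partially initialized state, which the sketch gets from free(NULL) being a no-op. The sketch_* names and the two pools are hypothetical and only stand in for the real biosets.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical resource standing in for a bioset. */
struct sketch_pool {
    void *mem;
};

struct sketch_pool sketch_pool_a;
struct sketch_pool sketch_pool_b;

/* Returns 0 on success or -ENOMEM; 'fail' forces a failure for testing. */
int sketch_pool_init(struct sketch_pool *p, size_t size, int fail)
{
    assert(p != NULL);
    if (fail)
        return -ENOMEM;
    p->mem = malloc(size);
    return p->mem ? 0 : -ENOMEM;
}

/* Safe on a pool that was never initialized: free(NULL) does nothing. */
void sketch_pool_exit(struct sketch_pool *p)
{
    free(p->mem);
    p->mem = NULL;
}

/* Single teardown function, used on module exit and on failed init. */
void sketch_subsys_exit(void)
{
    sketch_pool_exit(&sketch_pool_b);
    sketch_pool_exit(&sketch_pool_a);
}

/* Init reuses the exit function instead of duplicating the cleanup. */
int sketch_subsys_init(int fail_second)
{
    if (sketch_pool_init(&sketch_pool_a, 64, 0) < 0)
        goto error;
    if (sketch_pool_init(&sketch_pool_b, 64, fail_second) < 0)
        goto error;
    return 0;
error:
    sketch_subsys_exit();
    return -ENOMEM;
}
```

With one teardown function there is a single list of resources to keep in sync when a new pool is added.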
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We have dereferenced the async_submit_bio structure and extracted the bio
pointer into a local variable, so there's no need to dereference it again
when calling btrfs_bio_end_io(). Just use "bio->bi_status" instead of the
longer expression "async->bbio->bio.bi_status".
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The btrfs_cleanup_bio() helper is trivial and has a single caller, there's
no point in having a dedicated helper function. So get rid of it and move
its code into the caller (btrfs_bio_end_io()).
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The __btrfs_bio_end_io() helper is trivial and has a single caller, so
there's no point in having a dedicated helper function. Further the double
underscore prefix in the name is discouraged. So get rid of it and move
its code into the caller (btrfs_bio_end_io()).
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
| |