path: root/io_uring
Age | Commit message | Author | Files | Lines
2026-04-03 | Merge tag 'io_uring-7.0-20260403' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux | Linus Torvalds | 7 | -29/+87
Pull io_uring fixes from Jens Axboe:
- A previous fix in this release covered the case of the rings being RCU protected during resize, but it missed a few spots. This covers the rest
- Fix the cBPF filters when COW'ed, introduced in this merge window
- Fix for an attempt to import a zero sized buffer
- Fix for a missing clamp in importing bundle buffers
* tag 'io_uring-7.0-20260403' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring/bpf_filters: retain COW'ed settings on parse failures
  io_uring: protect remaining lockless ctx->rings accesses with RCU
  io_uring/rsrc: reject zero-length fixed buffer import
  io_uring/net: fix slab-out-of-bounds read in io_bundle_nbufs()
2026-04-02 | io_uring/timeout: use 'ctx' consistently | Yang Xiuwei | 1 | -2/+2
There's already a local ctx variable, yet cq_timeouts accounting uses req->ctx. Use ctx consistently. Signed-off-by: Yang Xiuwei <yangxiuwei@kylinos.cn> Link: https://patch.msgid.link/20260402014952.260414-1-yangxiuwei@kylinos.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-02 | io_uring/rw: clean up __io_read() obsolete comment and early returns | Joanne Koong | 1 | -6/+5
After commit a9165b83c193 ("io_uring/rw: always setup io_async_rw for read/write requests") which moved the iovec allocation into the prep path and stores it in req->async_data where it now gets freed as part of the request lifecycle, this comment is now outdated. Remove it and clean up the goto as well. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://patch.msgid.link/20260401173511.4052303-1-joannelkoong@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-02 | io_uring/zcrx: use correct mmap off constants | Pavel Begunkov | 1 | -1/+1
zcrx was using IORING_OFF_PBUF_SHIFT during first iterations, but there is now a separate constant it should use. Both are 16 so it doesn't change anything, but improve it for the future. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/fe16ebe9ba4048a7e12f9b3b50880bd175b1ce03.1774780198.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-02 | io_uring/zcrx: use dma_len for chunk size calculation | Pavel Begunkov | 1 | -1/+1
Buffers are now dma-mapped earlier and we can sg_dma_len(), otherwise, since it's walking with for_each_sgtable_dma_sg(), it might wrongfully reject some configurations. As a bonus, it'd now be able to use larger chunks if dma addresses are coalesced e.g by iommu. Fixes: 8c0cab0b7bf7 ("io_uring/zcrx: always dma map in advance") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/03b219af3f6cfdd1cf64679b8bab7461e47cc123.1774780198.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-02 | io_uring/zcrx: don't clear not allocated niovs | Pavel Begunkov | 1 | -2/+4
Now that area->is_mapped is set earlier before niovs array is allocated, io_zcrx_free_area -> io_zcrx_unmap_area in an error path can try to clear dma addresses for unallocated niovs, fix it. Fixes: 8c0cab0b7bf7 ("io_uring/zcrx: always dma map in advance") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/cbcb7749b5a001ecd4d1c303515ce9403215640c.1774780198.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 | io_uring/zcrx: don't use mark0 for allocating xarray | Pavel Begunkov | 1 | -2/+2
XA_MARK_0 is not compatible with xarray allocating entries, use XA_MARK_1. Fixes: fda90d43f4fac ("io_uring/zcrx: return back two step unregistration") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/f232cfd3c466047d333b474dd2bddd246b6ebb82.1774780198.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 | io_uring: cast id to u64 before shifting in io_allocate_rbuf_ring() | Anas Iqbal | 1 | -1/+1
Smatch warns: io_uring/zcrx.c:393 io_allocate_rbuf_ring() warn: should 'id << 16' be a 64 bit type? The expression 'id << IORING_OFF_PBUF_SHIFT' is evaluated using 32-bit arithmetic because id is a u32. This may overflow before being promoted to the 64-bit mmap_offset. Cast id to u64 before shifting to ensure the shift is performed in 64-bit arithmetic. Signed-off-by: Anas Iqbal <mohd.abd.6602@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/52400e1b343691416bef3ed3ae287fb1a88d407f.1774780198.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
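The overflow described above can be illustrated with a small userspace sketch. It is not the kernel code: `EXAMPLE_OFF_SHIFT` and both helper names are hypothetical, and the shift value 16 is taken from the neighbouring zcrx commit message in this log.

```c
#include <stdint.h>

/* Hypothetical stand-in for the mmap offset shift; this log states it is 16. */
#define EXAMPLE_OFF_SHIFT 16

/* The shift is evaluated in 32-bit arithmetic and only then widened,
 * so any bits shifted past bit 31 are already lost. */
static uint64_t offset_narrow(uint32_t id)
{
	return id << EXAMPLE_OFF_SHIFT;
}

/* Widen to 64 bits first, then shift: no bits are lost. */
static uint64_t offset_wide(uint32_t id)
{
	return (uint64_t)id << EXAMPLE_OFF_SHIFT;
}
```

With an id of 0x10000 the narrow variant wraps to 0, while the widened variant yields 0x100000000. Small ids give the same result either way, which is presumably why the issue only showed up as a static-analysis warning.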
2026-04-01 | io_uring/zcrx: reject REG_NODEV with large rx_buf_size | Pavel Begunkov | 1 | -1/+3
The copy fallback path doesn't care about the actual niov size and only uses first PAGE_SIZE bytes, and any additional space will be wasted. Since ZCRX_REG_NODEV solely relies on the copy path, it doesn't make sense to support non-standard rx_buf_len. Reject it for now, and re-enable once improved. Fixes: c11728021d5cd ("io_uring/zcrx: implement device-less mode for zcrx") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/3e7652d9c27f8ac5d2b141e3af47971f2771fb05.1774780198.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 | io_uring/cancel: validate opcode for IORING_ASYNC_CANCEL_OP | Amir Mohammad Jahangirzad | 1 | -1/+8
io_async_cancel_prep() reads the opcode selector from sqe->len and stores it in cancel->opcode, which is an 8-bit field. Since sqe->len is a 32-bit value, values larger than U8_MAX are implicitly truncated. This can cause unintended opcode matches when the truncated value corresponds to a valid io_uring opcode. For example, submitting a value such as 0x10b will be truncated to 0x0b (IORING_OP_TIMEOUT), allowing a cancel request to match operations it did not intend to target. Validate the opcode value before assigning it to the 8-bit field and reject values outside the valid io_uring opcode range. Signed-off-by: Amir Mohammad Jahangirzad <a.jahangirzad@gmail.com> Link: https://patch.msgid.link/20260331232113.615972-1-a.jahangirzad@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
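A minimal userspace sketch of the truncation and of a range check of the kind the commit describes. The helper names and `EXAMPLE_OP_LAST` are hypothetical; the real upper bound is whatever the kernel defines as the end of the opcode range.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical upper bound on valid opcodes, for illustration only. */
#define EXAMPLE_OP_LAST 64u

/* Storing a 32-bit value into an 8-bit field keeps only the low byte. */
static uint8_t store_truncated(uint32_t sqe_len)
{
	return (uint8_t)sqe_len;
}

/* Validate the full 32-bit value before it is narrowed. */
static bool opcode_in_range(uint32_t sqe_len)
{
	return sqe_len < EXAMPLE_OP_LAST;
}
```

Here `store_truncated(0x10b)` gives 0x0b, the value the commit message identifies as IORING_OP_TIMEOUT, whereas `opcode_in_range(0x10b)` is false, so the request would be rejected before the narrowing store happens.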
2026-04-01 | io_uring/rsrc: use io_cache_free() to free node | Jackie Liu | 1 | -1/+1
Replace kfree(node) with io_cache_free() in io_buffer_register_bvec() to match all other error paths that free nodes allocated via io_rsrc_node_alloc(). The node is allocated through io_cache_alloc() internally, so it should be returned to the cache via io_cache_free() for proper object reuse. Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Link: https://patch.msgid.link/20260331104509.7055-1-liu.yun@linux.dev [axboe: remove fixes tag, it's not a fix, it's a cleanup] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 | io_uring/zcrx: rename zcrx [un]register functions | Pavel Begunkov | 4 | -10/+10
Drop "ifqs" from function names, as it refers to an interface queue and there might be none once a device-less mode is introduced. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/657874acd117ec30fa6f45d9d844471c753b5a0f.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 | io_uring/zcrx: check ctrl op payload struct sizes | Pavel Begunkov | 1 | -0/+2
Add a build check that ctrl payloads are of the same size and don't grow struct zcrx_ctrl. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/af66caf9776d18e9ff880ab828eb159a6a03caf5.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 | io_uring/zcrx: cache fallback availability in zcrx ctx | Pavel Begunkov | 2 | -1/+9
Store a flag in struct io_zcrx_ifq telling if the backing memory is normal page or dmabuf based. It was looking it up from the area, however it logically allocates from the zcrx ctx and not a particular area, and once we add more than one area it'll become a mess. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/65e75408a7758fe7e60fae89b7a8d5ae4857f515.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 | io_uring/zcrx: warn on a repeated area append | Pavel Begunkov | 1 | -1/+1
We only support a single area, no path should be able to call io_zcrx_append_area() twice. Warn if that happens instead of just returning an error. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/28eb67fb8c48445584d7c247a36e1ad8800f0c8b.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 | io_uring/zcrx: consolidate dma syncing | Pavel Begunkov | 1 | -11/+12
Split refilling into two steps, first allocate niovs, and then do DMA sync for them. This way dma synchronisation code can be better optimised. E.g. we don't need to call dma_dev_need_sync() for every niov, and maybe we can coalesce sync for adjacent netmems in the future as well. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/19f2d50baa62ff2e0c6cd56dd7c394cab728c567.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 | io_uring/zcrx: netmem array as refiling format | Pavel Begunkov | 1 | -15/+25
Instead of peeking into page pool allocation cache directly or via net_mp_netmem_place_in_cache(), pass a netmem array around. It's a better intermediate format, e.g. you can have it on stack and reuse the refilling code and decouples it from page pools a bit more. It still points into the page pool directly, there will be no additional copies. As the next step, we can change the callback prototype to take the netmem array from page pool. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/9d8549adb7ef6672daf2d8a52858ce5926279a82.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 | io_uring/zcrx: warn on alloc with non-empty pp cache | Pavel Begunkov | 1 | -2/+2
Page pool ensures the cache is empty before asking to refill it. Warn if the assumption is violated. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/9c9792d6e65f3780d57ff83b6334d341ed9a5f29.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 | io_uring/zcrx: move count check into zcrx_get_free_niov | Pavel Begunkov | 1 | -17/+21
Instead of relying on the caller of __io_zcrx_get_free_niov() to check that there are free niovs available (i.e. free_count > 0), move the check into the function and return NULL if can't allocate. It consolidates the free count checks, and it'll be easier to extend the niov free list allocator in the future. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/6df04a6b3a6170f86d4345da9864f238311163f9.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 | io_uring/zcrx: use guards for locking | Pavel Begunkov | 1 | -8/+7
Convert last several places using manual locking to guards to simplify the code. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/eb4667cfaf88c559700f6399da9e434889f5b04a.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 | io_uring/zcrx: add a struct for refill queue | Pavel Begunkov | 2 | -31/+37
Add a new structure that keeps the refill queue state. It's cleaner and will be useful once we introduce multiple refill queues. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/4ce200da1ff0309c377293b949200f95f80be9ae.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 | io_uring/zcrx: use better name for RQ region | Pavel Begunkov | 2 | -5/+5
Rename "region" to "rq_region" to highlight that it's a refill queue region. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/ac815790d2477a15826aecaa3d94f2a94ef507e6.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 | io_uring/zcrx: implement device-less mode for zcrx | Pavel Begunkov | 2 | -15/+28
Allow creating a zcrx instance without attaching it to a net device. All data will be copied through the fallback path. The user is also expected to use ZCRX_CTRL_FLUSH_RQ to handle overflows as it normally should even with a netdev, but it becomes even more relevant as there will likely be no one to automatically pick up buffers. Apart from that, it follows the zcrx uapi for the I/O path, and is useful for testing, experimentation, and potentially for the copy receive path in the future if improved. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/674f8ad679c5a0bc79d538352b3042cf0999596e.1774261953.git.asml.silence@gmail.com [axboe: fix spelling error in uapi header and commit message] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 | io_uring/zcrx: extract netdev+area init into a helper | Pavel Begunkov | 1 | -29/+43
In preparation for the following patches, add a function that is responsible for looking up a netdev, creating an area, DMA mapping it and opening a queue. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/88cb6f746ecb496a9030756125419df273d0b003.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 | io_uring/zcrx: always dma map in advance | Pavel Begunkov | 1 | -29/+15
zcrx was originally establishing dma mappings at a late stage when it was being bound to a page pool. Dma-buf couldn't work this way, so it's initialised during area creation. It's messy having them do it at different spots, just move everything to the area creation time. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/334092a2cbdd4aabd7c025050aa99f05ace89bb5.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 | io_uring/zcrx: fully clean area on error in io_import_umem() | Pavel Begunkov | 1 | -6/+10
When accounting fails, io_import_umem() sets the page array, etc. and returns an error expecting that the error handling code will take care of the rest. To make the next patch simpler, only return fully initialised areas from the function. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/3a602b7fb347dbd4da6797ac49b52ea5dedb856d.1774261953.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 | io_uring/zcrx: return back two step unregistration | Pavel Begunkov | 3 | -3/+51
There are reports where io_uring instance removal takes too long and an ifq reallocation by another zcrx instance fails. Split zcrx destruction into two steps similarly how it was before, first close the queue early but maintain zcrx alive, and then when all inflight requests are completed, drop the main zcrx reference. For extra protection, mark terminated zcrx instances in xarray and warn if we double put them. Cc: stable@vger.kernel.org # 6.19+ Link: https://github.com/axboe/liburing/issues/1550 Reported-by: Youngmin Choi <youngminchoi94@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/0ce21f0565ab4358668922a28a8a36922dfebf76.1774261953.git.asml.silence@gmail.com [axboe: NULL ifq before break inside scoped guard] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 | io_uring/bpf_filters: retain COW'ed settings on parse failures | Jens Axboe | 1 | -1/+9
If io_parse_restrictions() fails, it ends up clearing any restrictions currently set. The intent is only to clear whatever it already applied, but it ends up clearing everything, including whatever settings may have been applied in a copy-on-write fashion already. Ensure that those are retained. Link: https://lore.kernel.org/io-uring/CAK8a0jzF-zaO5ZmdOrmfuxrhXuKg5m5+RDuO7tNvtj=kUYbW7Q@mail.gmail.com/ Reported-by: antonius <bluedragonsec2023@gmail.com> Fixes: ed82f35b926b ("io_uring: allow registration of per-task restrictions") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-04-01 | io_uring: protect remaining lockless ctx->rings accesses with RCU | Jens Axboe | 4 | -28/+70
Commit 96189080265e addressed one case of ctx->rings being potentially accessed while a resize is happening on the ring, but there are still a few others that need handling. Add a helper for retrieving the rings associated with an io_uring context, and add some sanity checking to that to catch bad uses. ->rings_rcu is always valid, as long as it's used within RCU read lock. Any use of ->rings_rcu or ->rings inside either ->uring_lock or ->completion_lock is sane as well. Do the minimum fix for the current kernel, but set it up such that this basic infra can be extended for later kernels to make this harder to mess up in the future. Thanks to Junxi Qian for finding and debugging this issue. Cc: stable@vger.kernel.org Fixes: 79cfe9e59c2a ("io_uring/register: add IORING_REGISTER_RESIZE_RINGS") Reviewed-by: Junxi Qian <qjx1298677004@gmail.com> Tested-by: Junxi Qian <qjx1298677004@gmail.com> Link: https://lore.kernel.org/io-uring/20260330172348.89416-1-qjx1298677004@gmail.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-29 | io_uring/rsrc: reject zero-length fixed buffer import | Qi Tang | 1 | -0/+4
validate_fixed_range() admits buf_addr at the exact end of the registered region when len is zero, because the check uses strict greater-than (buf_end > imu->ubuf + imu->len). io_import_fixed() then computes offset == imu->len, which causes the bvec skip logic to advance past the last bio_vec entry and read bv_offset from out-of-bounds slab memory. Return early from io_import_fixed() when len is zero. A zero-length import has no data to transfer and should not walk the bvec array at all. BUG: KASAN: slab-out-of-bounds in io_import_reg_buf+0x697/0x7f0 Read of size 4 at addr ffff888002bcc254 by task poc/103 Call Trace: io_import_reg_buf+0x697/0x7f0 io_write_fixed+0xd9/0x250 __io_issue_sqe+0xad/0x710 io_issue_sqe+0x7d/0x1100 io_submit_sqes+0x86a/0x23c0 __do_sys_io_uring_enter+0xa98/0x1590 Allocated by task 103: The buggy address is located 12 bytes to the right of allocated 584-byte region [ffff888002bcc000, ffff888002bcc248) Fixes: 8622b20f23ed ("io_uring: add validate_fixed_range() for validate fixed buffer") Signed-off-by: Qi Tang <tpluszz77@gmail.com> Link: https://patch.msgid.link/20260329164936.240871-1-tpluszz77@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
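The boundary condition can be reproduced with a simplified model of the range check. This is a sketch and not the kernel's validate_fixed_range(): the function names are hypothetical and overflow handling is deliberately left out.

```c
#include <stdint.h>
#include <stdbool.h>

/* Simplified model of a range check that uses strict greater-than for the
 * end of the region. Overflow of buf_addr + len is ignored here. */
static bool example_range_ok(uint64_t buf_addr, uint64_t len,
			     uint64_t ubuf, uint64_t ulen)
{
	uint64_t buf_end = buf_addr + len;

	if (buf_addr < ubuf || buf_end > ubuf + ulen)
		return false;
	return true;
}

/* Offset into the registered region that an import would start from. */
static uint64_t example_import_offset(uint64_t buf_addr, uint64_t ubuf)
{
	return buf_addr - ubuf;
}
```

For a 4096-byte region registered at 0x1000, a zero-length request at address 0x2000 passes the check because `buf_end` equals the region end exactly. The resulting offset is 4096, equal to the region length, which is the value that lets the skip logic run past the last segment unless zero-length imports return early.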
2026-03-29 | io_uring/net: fix slab-out-of-bounds read in io_bundle_nbufs() | Junxi Qian | 1 | -0/+4
sqe->len is __u32 but gets stored into sr->len which is int. When userspace passes sqe->len values exceeding INT_MAX (e.g. 0xFFFFFFFF), sr->len overflows to a negative value. This negative value propagates through the bundle recv/send path: 1. io_recv(): sel.val = sr->len (ssize_t gets -1) 2. io_recv_buf_select(): arg.max_len = sel->val (size_t gets 0xFFFFFFFFFFFFFFFF) 3. io_ring_buffers_peek(): buf->len is not clamped because max_len is astronomically large 4. iov[].iov_len = 0xFFFFFFFF flows into io_bundle_nbufs() 5. io_bundle_nbufs(): min_t(int, 0xFFFFFFFF, ret) yields -1, causing ret to increase instead of decrease, creating an infinite loop that reads past the allocated iov[] array This results in a slab-out-of-bounds read in io_bundle_nbufs() from the kmalloc-64 slab, as nbufs increments past the allocated iovec entries. BUG: KASAN: slab-out-of-bounds in io_bundle_nbufs+0x128/0x160 Read of size 8 at addr ffff888100ae05c8 by task exp/145 Call Trace: io_bundle_nbufs+0x128/0x160 io_recv_finish+0x117/0xe20 io_recv+0x2db/0x1160 Fix this by rejecting negative sr->len values early in both io_sendmsg_prep() and io_recvmsg_prep(). Since sqe->len is __u32, any value > INT_MAX indicates overflow and is not a valid length. Fixes: a05d1f625c7a ("io_uring/net: support bundles for send") Cc: stable@vger.kernel.org Signed-off-by: Junxi Qian <qjx1298677004@gmail.com> Link: https://patch.msgid.link/20260329153909.279046-1-qjx1298677004@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
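Step 5 of the chain above can be modelled in userspace. This is a sketch under assumptions: `example_min_int()` stands in for the kernel's `min_t(int, ...)`, the other names are hypothetical, and the unsigned-to-int conversion is implementation-defined in C (it wraps to -1 on the usual two's complement compilers such as GCC and Clang).

```c
#include <stdint.h>
#include <stdbool.h>
#include <limits.h>

/* Stand-in for min_t(int, a, b): both operands are compared as int. */
static int example_min_int(int a, int b)
{
	return a < b ? a : b;
}

/* One accounting step: subtract the bytes attributed to this iovec from
 * the remaining byte count and return what is left. */
static int example_account_one(int remaining, uint32_t iov_len)
{
	/* Implementation-defined conversion; 0xFFFFFFFF becomes -1 on
	 * common two's complement targets. */
	int this_len = example_min_int((int)iov_len, remaining);

	return remaining - this_len;
}

/* A check of the kind the fix describes: reject lengths above INT_MAX. */
static bool example_len_is_valid(uint32_t sqe_len)
{
	return sqe_len <= (uint32_t)INT_MAX;
}
```

With a sane length, `example_account_one(100, 40)` leaves 60 bytes. With an iovec length of 0xFFFFFFFF it returns 101, so the remaining count grows instead of shrinking and a loop that runs until it reaches zero never terminates on its own.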
2026-03-27 | Merge tag 'io_uring-7.0-20260327' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux | Linus Torvalds | 1 | -1/+3
Pull io_uring fixes from Jens Axboe:
"Just two small fixes, both fixing regressions added in the fdinfo code in 6.19 with the SQE mixed size support"
* tag 'io_uring-7.0-20260327' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring/fdinfo: fix OOB read in SQE_MIXED wrap check
  io_uring/fdinfo: fix SQE_MIXED SQE displaying
2026-03-26 | io_uring/fdinfo: fix OOB read in SQE_MIXED wrap check | Nicholas Carlini | 1 | -1/+2
__io_uring_show_fdinfo() iterates over pending SQEs and, for 128-byte SQEs on an IORING_SETUP_SQE_MIXED ring, needs to detect when the second half of the SQE would be past the end of the sq_sqes array. The current check tests (++sq_head & sq_mask) == 0, but sq_head is only incremented when a 128-byte SQE is encountered, not on every iteration. The actual array index is sq_idx = (i + sq_head) & sq_mask, which can be sq_mask (the last slot) while the wrap check passes. Fix by checking sq_idx directly. Keep the sq_head increment so the loop still skips the second half of the 128-byte SQE on the next iteration. Fixes: 1cba30bf9fdd ("io_uring: add support for IORING_SETUP_SQE_MIXED") Signed-off-by: Nicholas Carlini <nicholas@carlini.com> Link: https://patch.msgid.link/20260327021823.3138396-1-nicholas@carlini.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
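The difference between the two checks can be shown with a small model. The first helper mirrors the expression quoted in the commit message. The second is an assumption about what "checking sq_idx directly" means (comparing the index against the last slot) and is not copied from the patch.

```c
#include <stdbool.h>

/* The check quoted above: increments sq_head and tests whether the
 * incremented value wraps to slot 0. */
static bool example_old_wrap_check(unsigned int *sq_head, unsigned int sq_mask)
{
	return (++*sq_head & sq_mask) == 0;
}

/* Assumed shape of the fixed check: is the SQE in the last slot, so that
 * its second half would fall outside the array? */
static bool example_idx_wrap_check(unsigned int sq_idx, unsigned int sq_mask)
{
	return sq_idx == sq_mask;
}
```

Take an 8-entry ring (sq_mask = 7) with sq_head = 0 and loop counter i = 7. The index (i + sq_head) & sq_mask is 7, the last slot, yet the old check computes (1 & 7) and reports no wrap. The index-based check flags the same case.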
2026-03-26 | io_uring/fdinfo: fix SQE_MIXED SQE displaying | Jens Axboe | 1 | -0/+1
When displaying pending SQEs for a MIXED ring, each 128-byte SQE increments sq_head to skip the second slot, but the loop counter is not adjusted. This can cause the loop to read past sq_tail by one entry for each 128-byte SQE encountered, displaying SQEs that haven't been made consumable yet by the application. Match the kernel's own consumption logic in io_init_req() which decrements what's left when consuming the extra slot. Fixes: 1cba30bf9fdd ("io_uring: add support for IORING_SETUP_SQE_MIXED") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-23 | fs: pass on FTRUNCATE_* flags to do_truncate | Christoph Hellwig | 1 | -1/+1
Pass the flags one level down to replace the somewhat confusing small argument, and clean up do_truncate as a result. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260323070205.2939118-3-hch@lst.de Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-03-20 | Merge tag 'io_uring-7.0-20260320' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux | Linus Torvalds | 2 | -5/+18
Pull io_uring fixes from Jens Axboe:
- A bit of a work-around for AF_UNIX recv multishot, as the in-kernel implementation doesn't properly signal EOF. We'll likely rework this one going forward, but the fix is sufficient for now
- Two fixes for incrementally consumed buffers, for non-pollable files and for 0 byte reads
* tag 'io_uring-7.0-20260320' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring/kbuf: propagate BUF_MORE through early buffer commit path
  io_uring/kbuf: fix missing BUF_MORE for incremental buffers at EOF
  io_uring/poll: fix multishot recv missing EOF on wakeup race
2026-03-19 | io_uring/kbuf: propagate BUF_MORE through early buffer commit path | Jens Axboe | 1 | -3/+7
When io_should_commit() returns true (eg for non-pollable files), buffer commit happens at buffer selection time and sel->buf_list is set to NULL. When __io_put_kbufs() generates CQE flags at completion time, it calls __io_put_kbuf_ring() which finds a NULL buffer_list and hence cannot determine whether the buffer was consumed or not. This means that IORING_CQE_F_BUF_MORE is never set for non-pollable input with incrementally consumed buffers. Likewise for io_buffers_select(), which always commits upfront and discards the return value of io_kbuf_commit(). Add REQ_F_BUF_MORE to store the result of io_kbuf_commit() during early commit. Then __io_put_kbuf_ring() can check this flag and set IORING_CQE_F_BUF_MORE accordingly. Reported-by: Martin Michaelis <code@mgjm.de> Cc: stable@vger.kernel.org Fixes: ae98dbf43d75 ("io_uring/kbuf: add support for incremental buffer consumption") Link: https://github.com/axboe/liburing/issues/1553 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-19 | io_uring/kbuf: fix missing BUF_MORE for incremental buffers at EOF | Jens Axboe | 1 | -0/+4
For a zero length transfer, io_kbuf_inc_commit() is called with !len. Since we never enter the while loop to consume the buffers, io_kbuf_inc_commit() ends up returning true, consuming the buffer. But if no data was consumed, by definition it cannot have consumed the buffer. Return false for that case. Reported-by: Martin Michaelis <code@mgjm.de> Cc: stable@vger.kernel.org Fixes: ae98dbf43d75 ("io_uring/kbuf: add support for incremental buffer consumption") Link: https://github.com/axboe/liburing/issues/1553 Signed-off-by: Jens Axboe <axboe@kernel.dk>
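A reduced single-buffer model of the behaviour described above. It is not io_kbuf_inc_commit() itself: the struct, the function names and the one-buffer simplification are all assumptions made for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical one-buffer model: len is the space still left in it. */
struct example_buf {
	uint32_t len;
};

/* Returns true when the buffer is considered fully consumed. With a zero
 * transfer length the consuming branch is skipped and true is returned. */
static bool example_commit_before(struct example_buf *buf, uint32_t len)
{
	if (len) {
		uint32_t this_len = len < buf->len ? len : buf->len;

		buf->len -= this_len;
		if (buf->len)
			return false;	/* space remains, more data can land here */
	}
	return true;
}

/* Same logic with the early return the commit describes: a transfer that
 * moved no data cannot have consumed the buffer. */
static bool example_commit_after(struct example_buf *buf, uint32_t len)
{
	if (!len)
		return false;
	return example_commit_before(buf, len);
}
```

For a 100-byte buffer, a zero-length transfer is reported as "consumed" by the first variant and as "not consumed" by the second. Non-zero transfers behave identically in both.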
2026-03-17 | io_uring: avoid req->ctx reload in io_req_put_rsrc_nodes() | Jens Axboe | 1 | -2/+4
Cache 'ctx' to avoid it needing to get potentially reloaded. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-17 | io_uring/rw: use cached file rather than req->file | Jens Axboe | 1 | -1/+1
In io_rw_init_file(), req->file is cached in file, yet the former is still being used when checking for O_DIRECT. As this is post setting the kiocb flags, the compiler has to reload req->file. Just use the locally cached file instead. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-17 | io_uring/net: use 'ctx' consistently | Jens Axboe | 1 | -1/+1
There's already a local ctx variable, use it for the io_is_compat() check as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-17 | io_uring/poll: cache req->apoll_events | Jens Axboe | 1 | -3/+5
Avoid a potential reload of ->apoll_events post vfs_poll() by caching it in a local variable. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-17 | io_uring/kbuf: use 'ctx' consistently | Jens Axboe | 1 | -2/+2
There's already a local ctx variable, yet the ring lock and unlock helpers use req->ctx. Use ctx consistently. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-17 | io_uring/poll: fix multishot recv missing EOF on wakeup race | Jens Axboe | 1 | -2/+7
When a socket send and shutdown() happen back-to-back, both fire wake-ups before the receiver's task_work has a chance to run. The first wake gets poll ownership (poll_refs=1), and the second bumps it to 2. When io_poll_check_events() runs, it calls io_poll_issue() which does a recv that reads the data and returns IOU_RETRY. The loop then drains all accumulated refs (atomic_sub_return(2) -> 0) and exits, even though only the first event was consumed. Since the shutdown is a persistent state change, no further wakeups will happen, and the multishot recv can hang forever. Check specifically for HUP in the poll loop, and ensure that another loop is done to check for status if more than a single poll activation is pending. This ensures we don't lose the shutdown event. Cc: stable@vger.kernel.org Fixes: dbc2564cfe0f ("io_uring: let fast poll support multishot") Reported-by: Francis Brosseau <francis@malagauche.com> Link: https://github.com/axboe/liburing/issues/1549 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-16 | io_uring/bpf-ops: implement bpf ops registration | Pavel Begunkov | 3 | -2/+99
Implement BPF struct ops registration. It's registered off the BPF path, and can be removed by BPF as well as io_uring. To protect it, introduce a global lock synchronising registration. ctx->uring_lock can be nested under it. ctx->bpf_ops is write protected by both locks and so it's safe to read it under either of them. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/1f46bffd76008de49cbafa2ad77d348810a4f69e.1772109579.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-16 | io_uring/bpf-ops: add kfunc helpers | Pavel Begunkov | 2 | -0/+61
Add two kfuncs that should cover most of the needs: 1. bpf_io_uring_submit_sqes(), which allows to submit io_uring requests. It mirrors the normal user space submission path and follows all related io_uring_enter(2) rules. i.e. SQEs are taken from the SQ according to head/tail values. In case of IORING_SETUP_SQ_REWIND, it'll submit first N entries. 2. bpf_io_uring_get_region() returns a pointer to the specified region, where io_uring regions are kernel-userspace shared chunks of memory. It takes the size as an argument, which should be a load time constant. There are 3 types of regions: - IOU_REGION_SQ returns the submission queue. - IOU_REGION_CQ stores the CQ, SQ/CQ headers and the sqarray. In other words, it gives same memory that would normally be mmap'ed with IORING_FEAT_SINGLE_MMAP enabled IORING_OFF_SQ_RING. - IOU_REGION_MEM represents the memory / parameter region. It can be used to store request indirect parameters and for kernel - user communication. It intentionally provides a thin but flexible API and expects BPF programs to implement CQ/SQ header parsing, CQ walking, etc. That mirrors how the normal user space works with rings and should help to minimise kernel / kfunc helpers changes while introducing new generic io_uring features. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/967bcc10e94c796eb273998621551b2a21848cde.1772109579.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-16 | io_uring/bpf-ops: implement loop_step with BPF struct_ops | Pavel Begunkov | 5 | -0/+148
Introduce io_uring BPF struct ops implementing the loop_step callback, which will allow BPF to overwrite the default io_uring event loop logic. The callback takes an io_uring context, the main role of which is to be passed to io_uring kfuncs. The other argument is a struct iou_loop_params, which BPF can use to request CQ waiting and communicate other parameters. See the event loop description in the previous patch for more details. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/98db437651ce64e9cbeb611c60bf5887259db09f.1772109579.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-16 | io_uring: introduce callback driven main loop | Pavel Begunkov | 5 | -1/+131
The io_uring_enter() has a fixed order of execution: it submits requests, waits for completions, and returns to the user. Allow to optionally replace it with a custom loop driven by a callback called loop_step. The basic requirement for the callback is that it should be able to submit requests, wait for completions, parse them and repeat. Most of the communication including parameter passing can be implemented via shared memory. The callback should return IOU_LOOP_CONTINUE to continue execution or IOU_LOOP_STOP to return to the user space. Note that the kernel may decide to prematurely terminate it as well, e.g. in case the process was signalled or killed. The hook takes a structure with parameters. It can be used to ask the kernel to wait for CQEs by setting cq_wait_idx to the CQE index it wants to wait for. Spurious wake ups are possible and even likely, the callback is expected to handle it. There will be more parameters in the future like timeout. It can be used with kernel callbacks, for example, as a slow path deprecation mechanism overwriting SQEs and emulating the wanted behaviour, however it's more useful together with BPF programs implemented in following patches. Note that keeping it separately from the normal io_uring wait loop makes things much simpler and cleaner. It keeps it in one place instead of spreading a bunch of checks in different places including disabling the submission path. It holds the lock by default, which is a better fit for BPF synchronisation and the loop execution model. It nicely avoids existing quirks like forced wake ups on timeout request completion. And it should be easier to implement new features. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://patch.msgid.link/a2d369aa1c9dd23ad7edac9220cffc563abcaed6.1772109579.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
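The control flow of such a callback-driven loop might be pictured roughly as below. This is a userspace sketch only: every type and name is hypothetical (prefixed `example_`), the wait for completions is reduced to a comment, and the `max_steps` cap merely stands in for the kernel's ability to terminate the loop early.

```c
/* Hypothetical counterparts of the continue/stop return values. */
enum example_loop_ret {
	EXAMPLE_LOOP_CONTINUE,
	EXAMPLE_LOOP_STOP,
};

/* Hypothetical parameter block the callback fills in on each step. */
struct example_loop_params {
	unsigned int cq_wait_idx;	/* CQE index the callback wants to wait for */
};

typedef enum example_loop_ret (*example_step_fn)(void *ctx,
						 struct example_loop_params *p);

/* Drive the callback until it asks to stop or the step budget runs out.
 * Returns the number of times the callback was invoked. */
static unsigned int example_run_loop(example_step_fn step, void *ctx,
				     unsigned int max_steps)
{
	struct example_loop_params p = { 0 };
	unsigned int steps = 0;

	while (steps < max_steps) {
		steps++;
		if (step(ctx, &p) == EXAMPLE_LOOP_STOP)
			break;
		/* A real implementation would wait here until the CQ
		 * reaches p.cq_wait_idx; wakeups may be spurious. */
	}
	return steps;
}

/* Example callback: counts its invocations and stops after the third. */
static enum example_loop_ret example_stop_after_three(void *ctx,
					struct example_loop_params *p)
{
	unsigned int *calls = ctx;

	p->cq_wait_idx = *calls;
	(*calls)++;
	return *calls >= 3 ? EXAMPLE_LOOP_STOP : EXAMPLE_LOOP_CONTINUE;
}
```

Running `example_run_loop(example_stop_after_three, &calls, 10)` invokes the callback three times before it requests a stop. With a budget of 2 the loop ends after two calls, which models the case where the caller, rather than the callback, ends the loop.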
2026-03-16 | io_uring/uring_cmd: allow non-iopoll cmds with IORING_SETUP_IOPOLL | Caleb Sander Mateos | 1 | -3/+1
Currently, creating an io_uring with IORING_SETUP_IOPOLL requires all requests issued to it to support iopoll. This prevents, for example, using ublk zero-copy together with IORING_SETUP_IOPOLL, as ublk zero-copy buffer registrations are performed using a uring_cmd. There's no technical reason why these non-iopoll uring_cmds can't be supported. They will either complete synchronously or via an external mechanism that calls io_uring_cmd_done(), io_uring_cmd_post_mshot_cqe32(), or io_uring_mshot_cmd_post_cqe(), so they don't need to be polled. Allow uring_cmd requests to be issued to IORING_SETUP_IOPOLL io_urings even if their files don't implement ->uring_cmd_iopoll(). For these uring_cmd requests, skip initializing struct io_kiocb's iopoll fields, don't set REQ_F_IOPOLL, and don't set IO_URING_F_IOPOLL in issue_flags. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://patch.msgid.link/20260302172914.2488599-5-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-03-16 | io_uring: count CQEs in io_iopoll_check() | Caleb Sander Mateos | 1 | -7/+2
A subsequent commit will allow uring_cmds that don't use iopoll on IORING_SETUP_IOPOLL io_urings. As a result, CQEs can be posted without setting the iopoll_completed flag for a request in iopoll_list or going through task work. For example, a UBLK_U_IO_FETCH_IO_CMDS command could call io_uring_mshot_cmd_post_cqe() to directly post a CQE. The io_iopoll_check() loop currently only counts completions posted in io_do_iopoll() when determining whether the min_events threshold has been met. It also exits early if there are any existing CQEs before polling, or if any CQEs are posted while running task work. CQEs posted via io_uring_mshot_cmd_post_cqe() or other mechanisms won't be counted against min_events. Explicitly check the available CQEs in each io_iopoll_check() loop iteration to account for CQEs posted in any fashion. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Link: https://patch.msgid.link/20260302172914.2488599-4-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>