path: root/fs/fuse
Age  Commit message  Author  Files  Lines
7 days  fuse: split off fch->lock from fc->lock  Miklos Szeredi  5  -29/+44
And document which members they protect. end_polls() is called with both, outer fch->lock is probably unnecessary, but doesn't hurt for now. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
7 days  fuse: move interrupt related members to fuse_chan  Miklos Szeredi  3  -5/+5
Move: - no_interrupt Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
7 days  fuse: move io_uring related members to fuse_chan  Miklos Szeredi  6  -26/+26
Move: - io_uring - ring Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
7 days  fuse: move request blocking related members to fuse_chan  Miklos Szeredi  8  -81/+81
Move: - initialized - blocked - blocked_waitq - connected - num_waiting Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
7 days  fuse: move background queuing related members to fuse_chan  Miklos Szeredi  10  -72/+73
Move: - max_background - num_background - active_background - bg_queue - bg_lock Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
7 days  fuse: move 'devices' member from fuse_conn to fuse_chan  Miklos Szeredi  5  -12/+17
This belongs in the transport layer. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
7 days  fuse: move fuse_dev and fuse_pqueue to dev.c  Miklos Szeredi  5  -140/+144
Move function definitions to dev.c, struct definitions to fuse_dev_i.h. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
7 days  fuse: move fuse_iqueue to fuse_chan  Miklos Szeredi  10  -137/+146
Move the 'fiq' member from fuse_conn to fuse_chan. Move iqueue related structure definitions and function declarations from "fuse_i.h" to "fuse_dev_i.h". Add a fuse_dev_chan_new() helper, that returns a fuse_chan initialized with the fuse_dev_fiq_ops. Add a fuse_chan_release() function, that calls fiq->ops->release(). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
7 days  fuse: add struct fuse_chan  Miklos Szeredi  7  -5/+50
The goal is to separate transport layer stuff out from struct fuse_conn, leaving just the filesystem related members. Add a new object referenced from fuse_conn. This patch just implements the allocation and freeing of this object. Following patches will move transport related members from fuse_conn to fuse_chan. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
7 days  fuse: move request timeout code to a new source file  Miklos Szeredi  8  -145/+173
This marks the first step in cleanly separating the transport layer from the filesystem layer. Add "dev.h", which will contain the interface definition for the transport layer. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
7 days  fuse: fix io-uring background queue dispatch on request completion  Joanne Koong  3  -17/+26
When a background request completes via the io_uring path, the background queue gets flushed to dispatch pending background requests, but this is done before the connection-level background counters (fc->num_background, fc->active_background) are properly accounted, which may reduce effective queue depth to one. The connection-level counters are decremented in fuse_request_end(), but flush_bg_queue() flushes the /dev/fuse path queue (fc->bg_queue), not the io_uring per-queue bg one, which means pending uring background requests on the queue are never dispatched in this path. Fix this by accounting the connection-level background counters first before flushing the queue's background queue. Since fuse_request_bg_finish() clears FR_BACKGROUND, fuse_request_end() will skip the background cleanup branch entirely, which avoids any double-decrements; it will call the wake_up(&req->waitq) branch but this is effectively a no-op as background requests have no waiters on req->waitq. Reviewed-by: Bernd Schubert <bernd@bsbernd.com> Fixes: 857b0263f30e ("fuse: Allow to queue bg requests through io-uring") Cc: stable@vger.kernel.org Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
7 days  fuse: fix device node leak in cuse_process_init_reply()  Alberto Ruiz  1  -1/+3
If device_add() succeeds during CUSE initialization but a subsequent step (cdev_alloc() or cdev_add()) fails, the error path calls put_device() without first calling device_del(). This leaks the devtmpfs entry created by device_add(), leaving a stale /dev/<name> node that persists until reboot. Since the cuse_conn is never linked into cuse_conntbl on the failure path, cuse_channel_release() sees cc->dev == NULL and skips device_unregister(), so no other code path cleans up the node.

This has several consequences:
- The device name is permanently poisoned: any subsequent attempt to create a CUSE device with the same name hits the stale sysfs entry, device_add() fails, and the new device is aborted.
- The collision manifests as ENODEV returned to userspace with no dmesg diagnostic, making it very difficult to debug.
- The failure is self-perpetuating: once a name is leaked, all future attempts with that name fail identically.

Fix this by introducing an err_dev label that calls device_del() to undo device_add() before falling through to err_unlock. The existing err_unlock path from a device_add() failure correctly skips device_del() since the device was never added.

Testing instructions can be found at the lore link below.

Link: https://lore.kernel.org/all/20260408-wip-cuse-leak-fix-v1-0-1c028d575e97@redhat.com/
Signed-off-by: Alberto Ruiz <aruiz@redhat.com>
Fixes: 151060ac1314 ("CUSE: implement CUSE - Character device in Userspace")
Cc: stable@vger.kernel.org
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
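The err_dev/err_unlock unwinding described in this commit message can be sketched in plain userspace C. This is only an illustrative model, not the kernel code: `struct dev_model`, the `model_*` helpers and the `fail_add`/`fail_cdev` switches are hypothetical stand-ins for the real device, device_add()/device_del()/put_device() and the cdev setup step.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model: 'added' stands for the devtmpfs/sysfs entry,
 * 'refs' for the device reference dropped by put_device(). */
struct dev_model {
	bool added;
	int refs;
};

/* Stand-in for device_add(): creates the visible node unless told to fail. */
static int model_device_add(struct dev_model *d, bool fail)
{
	if (fail)
		return -EEXIST;
	d->added = true;
	return 0;
}

/* Stand-in for device_del(): removes the visible node again. */
static void model_device_del(struct dev_model *d)
{
	d->added = false;
}

/* Stand-in for put_device(): drops the reference only. */
static void model_put_device(struct dev_model *d)
{
	d->refs--;
}

/* Models the fixed error handling: a failure after a successful add goes
 * through err_dev first, a failed add goes straight to err_unlock. */
static int model_init_reply(struct dev_model *d, bool fail_add, bool fail_cdev)
{
	int rc;

	d->refs = 1;
	d->added = false;

	rc = model_device_add(d, fail_add);
	if (rc)
		goto err_unlock;	/* nothing was added, nothing to delete */

	rc = fail_cdev ? -ENOMEM : 0;	/* models cdev_alloc()/cdev_add() failing */
	if (rc)
		goto err_dev;

	return 0;

err_dev:
	model_device_del(d);		/* undo the successful add */
err_unlock:
	model_put_device(d);
	return rc;
}
```

In this model, a failure in the later step leaves `added` false, which corresponds to no stale node being left behind.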
7 days  fuse: do not use start_removing_noperm()  Miklos Szeredi  1  -8/+11
Revert the fuse part of commit c9ba789dad15 ("VFS: introduce start_creating_noperm() and start_removing_noperm()"). Commit c9ba789dad15 ("VFS: introduce start_creating_noperm() and start_removing_noperm()") caused a regression in FUSE_NOTIFY_INVAL_ENTRY, which failed to invalidate negative dentries. This manifests in the filesystem returning -ENOENT for operations on an existing file. Fixing it properly while still keeping the start_removing* infrastructure would add much additional complexity. Instead revert to the original simple implementation. The start_removing* infrastructure is needed in VFS to abstract the filesystem locking. However, filesystem code can still safely use the raw locking primitives without affecting other filesystems. This is part two of the revert. Reported-by: Артем Лабазов <123321artyom@gmail.com> Closes: https://lore.kernel.org/all/CAFbF8N7++zopZuEcsKRxBV_sgOGCbzCY0hOyMw1SiGAtuzGhyQ@mail.gmail.com/ Fixes: c9ba789dad15 ("VFS: introduce start_creating_noperm() and start_removing_noperm()") Cc: stable@vger.kernel.org # 6.19 Cc: NeilBrown <neilb@ownmail.net> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
7 days  Revert "fuse: fix conversion of fuse_reverse_inval_entry() to start_removing()"  Miklos Szeredi  1  -16/+7
This reverts commit cab012375122304a6343c1ed09404e5143b9dc01. Commit c9ba789dad15 ("VFS: introduce start_creating_noperm() and start_removing_noperm()") caused a regression in FUSE_NOTIFY_INVAL_ENTRY, which failed to invalidate negative dentries. This manifests in the filesystem returning -ENOENT for operations on an existing file. Fixing it properly while still keeping the start_removing* infrastructure would add much additional complexity. Instead revert to the original simple implementation. The start_removing* infrastructure is needed in VFS to abstract the filesystem locking. However, filesystem code can still safely use the raw locking primitives without affecting other filesystems. This is part one of the revert. Reported-by: Артем Лабазов <123321artyom@gmail.com> Closes: https://lore.kernel.org/all/CAFbF8N7++zopZuEcsKRxBV_sgOGCbzCY0hOyMw1SiGAtuzGhyQ@mail.gmail.com/ Fixes: c9ba789dad15 ("VFS: introduce start_creating_noperm() and start_removing_noperm()") Cc: stable@vger.kernel.org # 6.19 Cc: NeilBrown <neilb@ownmail.net> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
7 days  fuse: avoid 32-bit prune notification count wrap  Samuel Moelius  2  -1/+435
FUSE_NOTIFY_PRUNE validates the nodeid payload length with: size - sizeof(outarg) != outarg.count * sizeof(u64) On 32-bit kernels, size_t is also 32 bits, so the daemon-controlled count multiplication can wrap. A prune notification with count 0x20000000 and no nodeid payload passes the check, enters the copy loop, and asks the device copy path to read nodeids that are not present in the userspace write buffer. In QEMU this reaches the fuse_copy_fill() BUG_ON(!err) path. Validate the payload length with array_size() instead. That accepts exactly the same valid messages, but avoids wrapping arithmetic before the copy loop consumes the count. Assisted-by: Codex:gpt-5.5-cyber-preview Fixes: 3f29d59e92a9 ("fuse: add prune notification") Cc: stable@vger.kernel.org Signed-off-by: Samuel Moelius <sam.moelius@trailofbits.com> Reviewed-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
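The wrap described above can be reproduced on any host with a small model. This is a hedged sketch rather than the kernel code: `size32_t` is a hypothetical stand-in for a 32-bit size_t, and `array_size32()` is a userspace imitation of the saturating behaviour of the kernel's array_size(), which returns SIZE_MAX on overflow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for size_t on a 32-bit kernel. */
typedef uint32_t size32_t;

/* The wrapping form of the check: count * sizeof(u64) is computed in
 * 32 bits, so a large daemon-controlled count can wrap to a small value. */
static bool prune_len_ok_wrapping(size32_t payload, uint32_t count)
{
	return payload == (size32_t)(count * (uint32_t)sizeof(uint64_t));
}

/* Userspace imitation of array_size(): saturate instead of wrapping. */
static size32_t array_size32(size32_t a, size32_t b)
{
	if (b != 0 && a > UINT32_MAX / b)
		return UINT32_MAX;
	return a * b;
}

/* The checked form: an overflowing count can no longer match a short payload. */
static bool prune_len_ok_checked(size32_t payload, uint32_t count)
{
	return payload == array_size32(count, (size32_t)sizeof(uint64_t));
}
```

With count 0x20000000 and an empty payload, the product wraps to 0 in the first helper and the message is accepted, while the saturating helper rejects it. Messages whose product does not overflow are treated identically by both.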
7 days  fuse-uring: remove request-less entries from ent_w_req_queue to fix NULL deref  Joanne Koong  1  -2/+12
If a copy into the userspace ring buffer fails, a request will be terminated and fuse_uring_req_end() will set ent->fuse_req to NULL but it will leave the entry on ent_w_req_queue in FRRS_FUSE_REQ state. This can lead to a NULL deref if the request expiration logic scans ent_w_req_queue in the window before the entry is moved off it. Fix this by taking the entry off ent_w_req_queue and changing its state from FRRS_FUSE_REQ to FRRS_INVALID before terminating the request. Fixes: 4fea593e625c ("fuse: optimize over-io-uring request expiration check") Cc: stable@kernel.org Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
7 days  fuse: clear intr_entry in fuse_resend and fuse_remove_pending_req  Ji'an Zhou  1  -0/+9
When fuse_resend() moves a request from fpq->processing back to fiq->pending, it sets FR_PENDING and clears FR_SENT but does not remove the request's intr_entry from fiq->interrupts. If the request had FR_INTERRUPTED set from a prior signal, intr_entry remains dangling on fiq->interrupts. When the requesting task then receives a fatal signal, fuse_remove_pending_req() sees FR_PENDING=1, removes the request from fiq->pending and frees it via the refcount path, also without cleaning intr_entry.

The stale intr_entry causes use-after-free when fuse_read_interrupt() iterates fiq->interrupts:
- list_del_init(&req->intr_entry) -> UAF write on freed slab
- req->in.h.unique -> UAF read, data leaked to userspace

Remove intr_entry from fiq->interrupts in fuse_resend() for interrupted requests before they are placed back on fiq->pending. Add a WARN_ON if the intr_entry is not empty on request destruction.

Fixes: 760eac73f9f6 ("fuse: Introduce a new notification type for resend pending requests")
Cc: stable@vger.kernel.org # 6.9
Signed-off-by: Ji'an Zhou <eilaimemedsnaimel@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
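The effect of unlinking intr_entry before a request is requeued can be illustrated with a tiny intrusive list. Everything here is a userspace model written for this example: `struct list_head` and the `*_model` helpers only imitate the kernel list primitives, and `struct req_model` and `resend_model()` are hypothetical stand-ins for the request and for the relevant part of fuse_resend().

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal circular doubly linked list, imitating the kernel's list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static bool list_is_empty(const struct list_head *h)
{
	return h->next == h;
}

static void list_add_tail_model(struct list_head *n, struct list_head *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Imitates list_del_init(): unlink the entry and make it point to itself. */
static void list_del_init_model(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
	init_list_head(e);
}

/* Hypothetical request model: only the fields needed for the illustration. */
struct req_model {
	struct list_head intr_entry;
	bool interrupted;
};

/* Models the fix: an interrupted request is taken off the interrupts list
 * before it would be placed back on the pending list. */
static void resend_model(struct req_model *req)
{
	if (req->interrupted)
		list_del_init_model(&req->intr_entry);
	/* the real code would now requeue the request as pending */
}
```

After `resend_model()` the interrupts list no longer references the request, so freeing the request later cannot leave a dangling list node behind. An empty `intr_entry` at destruction time is also the condition the added WARN_ON checks for.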
7 days  fuse-uring: make a fuse_req on SQE commit only findable after memcpy  Bernd Schubert  1  -14/+19
Misbehaving userspace might try to trick the kernel by sending commit SQEs with the request unique / commit-id of requests that have not yet been sent to the fuse server (io_uring_cmd_done() not called yet). fuse_uring_commit_fetch() ends the fuse request when the ring entry has a wrong state, but that could have caused a use-after-free with the memcpy operations in fuse_uring_send_in_task(). In order to avoid such races, the call of fuse_uring_add_to_pq() is moved after the copy operations and just before completing the io-uring request - malicious userspace cannot find the request anymore until all preparation work in fuse-client/kernel is completed. This also moves fuse_uring_add_to_pq() a bit up in the code to avoid a forward declaration. This is deliberately not done in a separate preparation commit, to make it easier to backport to older kernels. Reported-by: xlabai <xlabai@tencent.com> Reported-by: Berkant Koc <me@berkoc.com> Fixes: c090c8abae4b6b ("fuse: Add io-uring sqe commit and fetch support") Cc: stable@kernel.org # 6.14 Signed-off-by: Bernd Schubert <bernd@bsbernd.com> Reviewed-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
7 days  fuse-uring: Avoid queue->stopped races and set/read that value under lock  Bernd Schubert  1  -5/+9
There are several readers of queue->stopped that check the value under lock, but fuse_uring_commit_fetch() did not and actually the value was not set under the lock in fuse_uring_abort_end_requests() either. Especially in fuse_uring_commit_fetch it is important to check under a lock, because due to races 'struct fuse_req' might be freed with fuse_request_end, but another thread/cpu might already do teardown work. Cc: stable@kernel.org # 6.14 Fixes: 4a9bfb9b6850fec ("fuse: {io-uring} Handle teardown of ring entries") Reported-by: Berkant Koc <me@berkoc.com> Reported-by: xlabai <xlabai@tencent.com> Signed-off-by: Bernd Schubert <bernd@bsbernd.com> Reviewed-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
7 days  fuse-uring: Avoid use-after-free in fuse_uring_async_stop_queues  Bernd Schubert  1  -0/+2
fuse_uring_async_stop_queues() might run when the last reference on ring->queue_refs was already dropped. In order to avoid an early destruction a reference on struct fuse_conn is now taken before starting fuse_uring_async_stop_queues() and that reference is only released when that delayed work queue terminates. Fixes: 4a9bfb9b6850 ("fuse: {io-uring} Handle teardown of ring entries") Cc: stable@kernel.org # 6.14 Reported-by: Berkant Koc <me@berkoc.com> Signed-off-by: Bernd Schubert <bernd@bsbernd.com> Reviewed-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
7 days  fuse-uring: end fuse_req on io-uring cancel task work  Chris Mason  1  -2/+12
When io_uring delivers task work with tw.cancel set (PF_EXITING, PF_KTHREAD fallback, or percpu_ref_is_dying on the ring context), fuse_uring_send_in_task() takes the cancel branch, assigns -ECANCELED, and falls through to fuse_uring_send(). That path only flips the entry to FRRS_USERSPACE and completes the io_uring cmd; it never discharges the ring entry's owning reference to the fuse_req that fuse_uring_add_req_to_ring_ent() handed it at dispatch time.

    fuse_uring_send_in_task()
      tw.cancel == true
        err = -ECANCELED
      fuse_uring_send(ent, cmd, err, issue_flags)
        ent->state = FRRS_USERSPACE
        list_move(&ent->list, &queue->ent_in_userspace)
        ent->cmd = NULL
        io_uring_cmd_done(-ECANCELED)
        /* ent->fuse_req still set, req still hashed */

The fuse_req stays linked on fpq->processing[hash] and fuse_request_end() is never invoked. The originating syscall thread blocks in D-state in request_wait_answer() until fuse_abort_conn() runs, which can be the entire connection lifetime. For FR_BACKGROUND requests fc->num_background is never decremented either, so repeated cancels inflate the counter until max_background is hit and all later background ops stall.

tw.cancel does not imply a connection abort (e.g. a single io_uring worker thread exits while the fuse connection stays up), so this cannot be left for fuse_abort_conn() to clean up.

Ending the req but still routing the entry through fuse_uring_send() is not enough: that leaves a req-less entry on ent_in_userspace, and ent_list_request_expired() dereferences ent->fuse_req unconditionally on the head of that list, which would then NULL-deref.

Fix the cancel branch to release the entry directly. Remove it from the queue, complete the io_uring cmd, end the fuse_req, free the entry, and drop its queue_refs (waking the teardown waiter if it was the last).
Fixes: c2c9af9a0b13 ("fuse: Allow to queue fg requests through io-uring") Cc: stable@vger.kernel.org Reviewed-by: Joanne Koong <joannelkoong@gmail.com> Assisted-by: kres (claude-opus-4-7) Signed-off-by: Chris Mason <clm@meta.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
7 days  fuse-uring: fix moving cancelled entry to ent_in_userspace list  Joanne Koong  2  -5/+7
fuse_uring_cancel() moves entries that are available (these have no reqs attached) to the ent_in_userspace list. ent_list_request_expired() checks the first entry on ent_in_userspace and dereferences ent->fuse_req unconditionally, which will crash on a cancelled entry that was moved to this list. Fix this by freeing the entry and dropping queue_refs directly in fuse_uring_cancel(). This is safe because cancel is the cancel handler itself - after io_uring_cmd_done(), no more cancels will be dispatched for this command, and teardown serializes with cancel via queue->lock. Since cancel now decrements queue_refs, fuse_uring_abort() must no longer gate fuse_uring_abort_end_requests() on queue_refs > 0, as cancelled entries may have already dropped queue_refs while requests are still queued. Remove the gate so abort always flushes requests and stops queues. Reported-by: Heechan Kang <gganji11@naver.com> Tested-by: Heechan Kang <gganji11@naver.com> Reviewed-by: Bernd Schubert <bernd@bsbernd.com> Fixes: 4fea593e625c ("fuse: optimize over-io-uring request expiration check") Cc: stable@vger.kernel.org Suggested-by: Jian Huang Li <ali@ddn.com> Suggested-by: Horst Birthelmer <horst@birthelmer.de> Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
7 days  fuse-uring: check connection abort during ring creation  Joanne Koong  1  -4/+8
Check fch->connected under fch->lock in fuse_uring_create() before attaching a new ring. Without this, a race between fuse_uring_create() and fuse_chan_abort() can result in the ring, queue, and fpq.processing table being created after fuse_uring_abort() has already run, leading to unnecessary allocation and teardown. These are eventually cleaned up by fuse_uring_destruct() but will linger until the process exits, even with the connection aborted. Reviewed-by: Bernd Schubert <bernd@bsbernd.com> Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
7 days  fuse-uring: fix race between registration and connection abortion  Joanne Koong  1  -6/+16
This fixes this race:
- thread a: io_uring_enter -> register sqe -> fuse_uring_create_ring_ent -> allocate ent but doesn't grab queue_ref yet
- thread b: fuse_conn_destroy() -> fuse_chan_abort() -> fuse_uring_abort() is a no-op due to queue ref being 0
- thread a: grabs the queue_ref, queue_ref is now 1, rest of fuse_uring_do_register() logic executes
- thread b: fuse_chan_abort() returns, fuse_chan_wait_aborted() now runs and calls "wait_event(ring->stop_waitq, atomic_read(&ring->queue_refs) == 0);"

The abort/unmount thread will hang indefinitely in unkillable state as nothing will decrement queue_refs or wake stop_waitq, and the ring, queue, and ent are leaked.

Fix this by checking fch->connected under fch->lock after the created ent has grabbed a ref count on the queue. This ensures that in the scenario above, it is guaranteed that we either release the queue ref and wake up stop_waitq (in case fuse_chan_wait_aborted() is already waiting) in fuse_uring_do_register() when we detect !fch->connected, or if the connection is aborted after the check, it is guaranteed that the async teardown worker will be running in the background cleaning up ents and decrementing the ent's ref on the queue, which will unblock the eventual queue and ring teardown.

Fixes: 24fe962c86f5 ("fuse: {io-uring} Handle SQEs - register commands")
Cc: stable@vger.kernel.org
Reviewed-by: Bernd Schubert <bernd@bsbernd.com>
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
7 days  fuse-uring: fix data races on ring->ready  Chris Mason  2  -3/+5
On weakly-ordered architectures, the store to fiq->ops can be reordered past the store to ring->ready, allowing a CPU that sees ring->ready == true via fuse_uring_ready() to dispatch requests through a stale fiq->ops pointer. Upgrade the store to smp_store_release() and the load in fuse_uring_ready() to smp_load_acquire() so that the preceding WRITE_ONCE(fiq->ops, ...) is visible to any CPU that observes ring->ready == true. Additionally, fuse_uring_do_register() publishes ring->ready with WRITE_ONCE() but the fast-path check reads it with a plain load. This is a marked-vs-unmarked access that KCSAN will flag. Wrap it in READ_ONCE() to mark it without adding unnecessary ordering. Also wrap the fc->ring load in fuse_uring_ready() in READ_ONCE() to prevent the compiler from reloading it between the NULL check and the dereference. Fixes: c2c9af9a0b13 ("fuse: Allow to queue fg requests through io-uring") Cc: stable@vger.kernel.org Reviewed-by: Joanne Koong <joannelkoong@gmail.com> Assisted-by: kres (claude-opus-4-7) Signed-off-by: Chris Mason <clm@meta.com> Reviewed-by: Bernd Schubert <bernd@bsbernd.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
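The publish/observe pairing described in this commit message has a close analogue in C11 atomics, sketched below for illustration. This is not the kernel code: `struct ops_model`, `struct ring_model`, `ring_publish()` and `ring_ops_if_ready()` are hypothetical names, and the release store and acquire load stand in for smp_store_release() and smp_load_acquire().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the ops table that must be visible first. */
struct ops_model {
	int id;
};

/* Hypothetical ring model: 'ops' plays the role of fiq->ops,
 * 'ready' the role of ring->ready. */
struct ring_model {
	const struct ops_model *ops;
	atomic_bool ready;
};

/* Writer: store the payload, then publish the flag with release semantics,
 * so the ops store cannot be reordered after the flag store. */
static void ring_publish(struct ring_model *r, const struct ops_model *ops)
{
	r->ops = ops;
	atomic_store_explicit(&r->ready, true, memory_order_release);
}

/* Reader: an acquire load of the flag guarantees that a reader seeing
 * ready == true also sees the ops pointer written before the publish. */
static const struct ops_model *ring_ops_if_ready(struct ring_model *r)
{
	if (!atomic_load_explicit(&r->ready, memory_order_acquire))
		return NULL;
	return r->ops;
}
```

The pairing matters only across threads; a reader that observes the flag as set is then guaranteed to see the pointer stored before it, which is the property the commit relies on for fiq->ops.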
7 days  fuse-uring: fix EFAULT clobber in fuse_uring_commit  Chris Mason  1  -6/+3
copy_from_user() returns the number of bytes not copied as an unsigned residual on failure (1..sizeof(struct fuse_out_header)). fuse_uring_commit stores that residual in ssize_t err, sets req->out.h.error to -EFAULT, then jumps to out: with err still holding the positive residual.

    err = copy_from_user(&req->out.h, &ent->headers->in_out,
                         sizeof(req->out.h));
    if (err) {
            req->out.h.error = -EFAULT;
            goto out;   /* err is the positive residual */
    }
    ...
    out:
            fuse_uring_req_end(ent, req, err);

fuse_uring_req_end() then runs

    if (error)
            req->out.h.error = error;

which overwrites the just-assigned -EFAULT with the positive residual. FUSE callers such as fuse_simple_request() test err < 0 to detect failure, so the positive value is interpreted as success and the caller proceeds with an uninitialised or partial req->out.args.

Fix by assigning err = -EFAULT in the failure branch before jumping to out, so fuse_uring_req_end() receives a negative errno and sets req->out.h.error to -EFAULT.

Fixes: c090c8abae4b ("fuse: Add io-uring sqe commit and fetch support")
Cc: stable@vger.kernel.org
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Assisted-by: kres (claude-opus-4-7)
Signed-off-by: Chris Mason <clm@meta.com>
Reviewed-by: Bernd Schubert <bernd@bsbernd.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
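The clobber is easy to reproduce in a userspace model. The sketch below is illustrative only: `model_copy_from_user()` merely imitates the "bytes not copied" return convention of copy_from_user(), and `struct req_model`, `HDR_LEN`, `req_end()`, `commit_buggy()` and `commit_fixed()` are hypothetical names standing in for the kernel structures and functions named above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define HDR_LEN 16	/* hypothetical header size used by the model */

/* Hypothetical request model: only the reply error field. */
struct req_model {
	int error;
};

/* Imitates copy_from_user(): returns the number of bytes NOT copied. */
static unsigned long model_copy_from_user(size_t len, size_t copied)
{
	return copied < len ? len - copied : 0;
}

/* Imitates the end helper: a non-zero error overwrites the stored one. */
static void req_end(struct req_model *req, long error)
{
	if (error)
		req->error = (int)error;
}

/* Buggy flow: err still holds the positive residual at 'out'. */
static void commit_buggy(struct req_model *req, size_t copied)
{
	long err = (long)model_copy_from_user(HDR_LEN, copied);

	if (err) {
		req->error = -EFAULT;
		goto out;
	}
out:
	req_end(req, err);
}

/* Fixed flow: convert the residual to a negative errno before 'out'. */
static void commit_fixed(struct req_model *req, size_t copied)
{
	long err = (long)model_copy_from_user(HDR_LEN, copied);

	if (err) {
		err = -EFAULT;
		goto out;
	}
out:
	req_end(req, err);
}
```

In the buggy variant a partial copy ends with a positive value in `error`, which a caller testing `err < 0` would treat as success. The fixed variant reports -EFAULT.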
7 days  fuse: back uncached readdir buffers with pages  Matthew R. Ochs  1  -18/+67
Commit dabb90391028 ("fuse: increase readdir buffer size") changed fuse_readdir_uncached() to size its temporary buffer from ctx->count. This is useful for overlayfs and other in-kernel callers that use INT_MAX to indicate an unlimited directory read. The larger buffer is currently supplied as a kvec output argument. For virtiofs, kvec arguments are copied through req->argbuf, which is allocated with kmalloc(..., GFP_ATOMIC). A large uncached readdir buffer can therefore require a multi-megabyte contiguous atomic allocation before the request is queued. Avoid the large bounce-buffer allocation by backing uncached readdir output with pages and setting out_pages. Transports such as virtiofs can then pass the pages as scatter-gather entries instead of copying the output through argbuf. Map the pages with vm_map_ram() only while parsing the returned dirents. The existing parser can then continue to use a linear kernel mapping. [SzM: separate allocation of pages into a helper function] Fixes: dabb90391028 ("fuse: increase readdir buffer size") Cc: stable@vger.kernel.org Signed-off-by: Matthew R. Ochs <mochs@nvidia.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
7 days  virtiofs: fix UAF on submount umount  Miklos Szeredi  1  -1/+7
iput() called from fuse_release_end() can Oops if the super block has already been destroyed. Normally this is prevented by waiting for num_waiting to go down to zero before commencing with super block shutdown. This only works, however, for the last submount instance, as the wait counter is per connection, not per superblock. Revert to using synchronous release requests for the auto_submounts case, which is virtiofs only at this time. Reported-by: Aurélien Bombo <abombo@microsoft.com> Reported-by: Zhihao Cheng <chengzhihao1@huawei.com> Cc: Greg Kurz <gkurz@redhat.com> Closes: https://github.com/kata-containers/kata-containers/issues/12589 Fixes: 26e5c67deb2e ("fuse: fix livelock in synchronous file put from fuseblk workers") Cc: stable@vger.kernel.org Reviewed-by: Greg Kurz <gkurz@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
7 days  Merge tag 'pull-dcache' of ↵  Linus Torvalds  2  -3/+2
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull dcache updates from Al Viro: - d_alloc_parallel() API change (Neil's with my changes) - NORCU fixes - Reorganization and simplification of dentry eviction logic - Simplifying rcu_read_lock() scopes in fs/dcache.c - Secondary roots work - getting rid of NFS fake root dentries and dealing with remaining shrink_dcache_for_umount() and shrink_dentry_list() races - making cursors NORCU (surprisingly easy) * tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (22 commits) make cursors NORCU nfs: get rid of fake root dentries wind ->s_roots via ->d_sib instead of ->d_hash shrink_dentry_tree(): unify the calls of shrink_dentry_list() shrinking rcu_read_lock() scope in d_alloc_parallel() d_walk(): shrink rcu_read_lock() scope document dentry_kill() adjust calling conventions of lock_for_kill(), fold __dentry_kill() into dentry_kill() Document rcu_read_lock() use in select_collect2() Shift rcu_read_{,un}lock() inside fast_dput() simplify safety for lock_for_kill() slowpath fold lock_for_kill() and __dentry_kill() into common helper fold lock_for_kill() into shrink_kill() shrink_dentry_list(): start with removing from shrink list d_prune_aliases(): make sure to skip NORCU aliases kill d_dispose_if_unused() make to_shrink_list() return whether it has moved dentry to list select_collect(): ignore dentries on shrink lists if they have positive refcounts find_acceptable_alias(): skip NORCU aliases with zero refcount fix a race between d_find_any_alias() and final dput() of NORCU dentries ...
7 days  Merge tag 'vfs-7.2-rc1.misc' of ↵  Linus Torvalds  1  -2/+3
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Features: - Reduce pipe->mutex contention by pre-allocating pages outside the lock in anon_pipe_write(). anon_pipe_write() called alloc_page() once per page while holding pipe->mutex. The allocation can sleep doing direct reclaim and runs memcg charging, which extends the critical section and stalls any concurrent reader on the same mutex. Now up to 8 pages are pre-allocated before the mutex is taken, leftovers are recycled into the per-pipe tmp_page[] cache before unlock, and any remainder is released after unlock, keeping the allocator out of the critical section on both sides. On a writers x readers sweep with 64KB writes against a 1 MB pipe throughput improves 6-28% and average write latency drops 5-22%; under memory pressure - when the cost of holding the mutex across reclaim is highest - throughput improves 21-48% and latency drops 17-33%. The microbenchmark is added to selftests. - uaccess/sockptr: fix the ignored_trailing logic in copy_struct_to_user() to behave as documented and the usize check in copy_struct_from_sockptr() for user pointers, and add copy_struct_{from,to}_bounce_buffer() and copy_struct_to_sockptr() helpers for upcoming users (IPPROTO_SMBDIRECT, IPPROTO_QUIC). - bpf: add a sleepable bpf_real_inode() kfunc that resolves the real inode backing a dentry via d_real_inode(). On overlayfs the inode attached to the dentry doesn't carry the underlying device information; this is used by the filesystem restriction BPF program that was merged into systemd. - docs: add guidelines for submitting new filesystems, motivated by the maintenance burden abandoned and untestable filesystems impose on VFS developers, blocking infrastructure work like folio conversions and iomap migration. Fixes: - libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo() and drop the now-redundant assignments in callers. 
This began as a one-line dma-buf fix for a path_noexec() warning; a pseudo filesystem has no reason not to set SB_I_NOEXEC. All init_pseudo() callers were audited: the only visible effect is on dma-buf where SB_I_NOEXEC silences the warning. - Handle set_blocksize() failures in legacy filesystems (bfs, hpfs, qnx4, jfs, befs, affs, isofs, minix, ntfs3, omfs). Mounting a device with a sector size > PAGE_SIZE crashed roughly half of them; the rest had the same missing error handling pattern. Plus a follow-up releasing the superblock buffer_head when setting the minix v3 block size fails. - mount: honour SB_NOUSER in the new mount API. - fs/fcntl: fix a SOFTIRQ-unsafe lock order in fasync signaling by switching the process-group paths of send_sigio() and send_sigurg() from read_lock(&tasklist_lock) to RCU, matching the single-PID path. - vfs: add an FS_USERNS_DELEGATABLE flag and set it for NFS, fixing delegated NFS mounts (fsopen() in a container with the mount performed by a privileged daemon) that broke when non-init s_user_ns was tied to FS_USERNS_MOUNT. - selftests/namespaces: fix a hang in nsid_test where an unreaped grandchild kept the TAP pipe write-end open, a waitpid(-1) race in listns_efault_test, and a false FAIL on kernels without listns() where the tests should SKIP. - filelock: fix the break_lease() stub signature for CONFIG_FILE_LOCKING=n. - init/initramfs_test: wait for the async initramfs unpacking before running; the test and do_populate_rootfs() share the parser state. - fs/coredump: reduce redundant log noise in validate_coredump_safety(). - iomap: pass the correct length to fserror_report_io() in __iomap_write_begin(). - backing-file: fix the backing_file_open() kerneldoc. Cleanups: - initramfs: refactor the cpio hex header parsing to use hex2bin() instead of the hand-rolled simple_strntoul() which is reverted, and extend the initramfs KUnit tests to cover header fields with 0x prefixes. 
- Replace __get_free_pages() and friends with kmalloc()/kzalloc() across quota, proc, ocfs2/dlm, nilfs2, nfs, nfsd, libfs, jfs, jbd2, isofs, fuse, select, namespace, configfs, binfmt_misc, bfs, and the do_mounts init code - part of the larger work of replacing page allocator calls with kmalloc(). - Use clear_and_wake_up_bit() in unlock_buffer() and journal_end_buffer_io_sync() instead of open-coding the sequence. - Drop unused VFS exports: unexport drop_super_exclusive(), remove start_removing_user_path_at(), and fold __start_removing_path() into start_removing_path(). - fs/read_write: narrow the __kernel_write() export with EXPORT_SYMBOL_FOR_MODULES(). - vfs: uapi: retire octal and hex constants in favor of (1 << n) for the O_ flags. Finding a free bit for a new flag across the architectures was needlessly hard with the mixed bases. - dcache: add extra sanity checks of dead dentries in dentry_free() via a new DENTRY_WARN_ONCE() that also prints d_flags. - iov_iter: use kmemdup_array() in dup_iter() to harden the allocation against multiplication overflow. - fs/pipe: write to ->poll_usage only once. - vfs: remove an always-taken if-branch in find_next_fd(). - dcache: use kmalloc_flex() for struct external_name in __d_alloc(). - namei: use QSTR() instead of QSTR_INIT() in path_pts(). - sync_file_range: delete dead S_ISLNK code. 
- Comment fixes: retire a stale comment in fget_task_next() and fix assorted spelling mistakes" * tag 'vfs-7.2-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (73 commits) backing-file: fix backing_file_open() kerneldoc parameter iomap: pass the correct len to fserror_report_io in __iomap_write_begin vfs: add FS_USERNS_DELEGATABLE flag and set it for NFS filelock: fix break_lease() stub signature for CONFIG_FILE_LOCKING=n vfs: uapi: retire octal and hex numbers in favor of (1 << n) for O_ flags bpf: add bpf_real_inode() kfunc fs/read_write: Do not export __kernel_write() to the entire world libfs: drop redundant SB_I_NOEXEC/SB_I_NODEV in init_pseudo() callers libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo() mount: honour SB_NOUSER in the new mount API fs/fcntl: fix SOFTIRQ-unsafe lock order in fasync signaling selftests/pipe: add pipe_bench microbenchmark fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write fs: retire stale comment in fget_task_next() fs: fix spelling mistakes in comment bfs: replace get_zeroed_page() with kzalloc() binfmt_misc: replace __get_free_page() with kmalloc() configfs: replace __get_free_pages() with kzalloc() fs/namespace: use __getname() to allocate mntpath buffer fs/select: replace __get_free_page() with kmalloc() ...
13 daysfuse: re-lock request before returning from fuse_ref_folio()Joanne Koong1-1/+1
fuse_ref_folio() unlocks the request but does not re-lock it before returning. fuse_chan_abort() can end the request and the async end callback (eg fuse_writepage_free()) can free the args while the subsequent copy chain logic after fuse_ref_folio() accesses them, leading to use-after-free issues. Fix this by locking the request in fuse_ref_folio() before returning. Fixes: c3021629a0d8 ("fuse: support splice() reading from fuse device") Cc: stable@vger.kernel.org Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
13 daysfuse: re-lock request before replacing page cache folioJoanne Koong1-14/+5
fuse_try_move_folio() unlocks the request on entry but does not re-lock it on the success path. This means fuse_chan_abort() can end the request and free the fuse_io_args (eg fuse_readpages_end()) while the subsequent copy chain logic after fuse_try_move_folio() accesses the fuse_io_args, leading to use-after-free issues. Fix this by calling lock_request() before replace_page_cache_folio(). This ensures the request is locked on the success path which will prevent the fuse_io_args from being freed while the later copying logic runs, and also ensures that the ap->folios[i]->mapping is never null since ap->folios[i] will always point to the newfolio after replace_page_cache_folio(). Fixes: ce534fb05292 ("fuse: allow splice to move pages") Cc: stable@vger.kernel.org Reported-by: Lei Lu <llfamsec@gmail.com> Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2026-06-05kill d_dispose_if_unused()Al Viro1-1/+1
Rename to_shrink_list() into __move_to_shrink_list(), document and export it. Switch d_dispose_if_unused() users to that and kill d_dispose_if_unused() itself. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-06-05VFS: use wait_var_event for waiting in d_alloc_parallel()NeilBrown1-2/+1
Parallel lookup starts with a call of d_alloc_parallel(). That primitive either returns a matching hashed dentry or allocates a new one in the in-lookup state and returns it to the caller. Once the caller is done with lookup, it indicates so either by call of d_{splice_alias,add}() or by call of d_lookup_done(); at that point dentry leaves the in-lookup state. If d_alloc_parallel() finds a matching in-lookup dentry, it must wait for that dentry to leave the in-lookup state, one way or another. Currently that is done by supplying wait_queue_head to d_alloc_parallel(). If d_alloc_parallel() creates a new in-lookup dentry, the address of that wait_queue_head is stored in ->d_wait of new dentry and stays there while it's in the in-lookup state; subsequent d_alloc_parallel() will wait on the queue found in the matching in-lookup dentry. Transition out of in-lookup state wakes waiters on that queue (if any). That works, but the calling conventions are inconvenient - the caller must supply wait_queue_head and make sure that it survives at least until the new in-lookup dentry leaves the in-lookup state. That amounts to boilerplate in the d_alloc_parallel() callers that are followed by a call of d_lookup_done() in the same function; in cases like nfs asynchronous unlink it gets worse than that. This patch changes d_alloc_parallel() to use wake_up_var_locked() to wake up waiters, and wait_var_event_spinlock() to wait. dentry->d_lock is used for synchronisation as it is already held at the relevant times. That eliminates the need of caller-supplied wait_queue_head, simplifying the calling conventions. Better yet, we only need one bit of information stored in dentry itself: whether there are any waiters to be woken up, and that can be easily stored in ->d_flags; ->d_wait goes away. 
The reason we need that bit (DCACHE_LOOKUP_WAITERS) is that with wait_var machinery the queues are shared with all kinds of stuff and there's no way to tell if any of the waiters have anything to do with our dentry; most of the time none of them will be relevant, so we need to avoid the pointless wakeups. Another benefit of the new scheme comes from the fact that wakeups have to be done outside of write-side critical areas of ->i_dir_seq; with the old scheme we need to carry the value picked from ->d_wait from __d_lookup_unhash() to the place where we actually wake the waiters up. Now we can just leave DCACHE_LOOKUP_WAITERS in ->d_flags until we get to doing wakeups - that's done within the same ->d_lock scope, so we are fine; new bit is accessed only under ->d_lock and it's seen only on dentries with DCACHE_PAR_LOOKUP in ->d_flags. __d_lookup_unhash() no longer needs to re-init ->d_lru. That was previously shared (in a union) with ->d_wait but ->d_wait is now gone so it no longer corrupts ->d_lru. Co-developed-by: Al Viro <viro@zeniv.linux.org.uk> # saner handling of flags Signed-off-by: NeilBrown <neil@brown.name> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
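The role of the waiters bit described above can be illustrated with a small single-threaded model. This is only a sketch, not the kernel code: the flag values and the helpers lookup_note_waiter() and lookup_finish() are hypothetical, and the locking that ->d_lock provides in the kernel is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values; the real d_flags constants differ. */
#define MODEL_PAR_LOOKUP      0x1u /* dentry is in the in-lookup state */
#define MODEL_LOOKUP_WAITERS  0x2u /* somebody is waiting for it to leave */

/* A second lookup found a matching in-lookup dentry and is about to wait. */
void lookup_note_waiter(unsigned int *flags)
{
	if (*flags & MODEL_PAR_LOOKUP)
		*flags |= MODEL_LOOKUP_WAITERS;
}

/*
 * The dentry leaves the in-lookup state. Both bits are cleared and the
 * return value says whether a wakeup on the shared queue is worth issuing.
 */
bool lookup_finish(unsigned int *flags)
{
	bool need_wake = (*flags & MODEL_LOOKUP_WAITERS) != 0;

	*flags &= ~(MODEL_PAR_LOOKUP | MODEL_LOOKUP_WAITERS);
	return need_wake;
}
```

In this model a dentry that nobody waited on finishes its lookup without asking for a wakeup, which mirrors the stated goal of avoiding pointless wakeups on queues shared with unrelated users.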
2026-05-27fuse: replace __get_free_page() with kmalloc()Mike Rapoport (Microsoft)1-2/+3
fuse_do_ioctl allocates memory for struct iov array using __get_free_page(). kmalloc() is a better API for such use and it also provides better scalability and more debugging possibilities. Replace use of __get_free_page() with kmalloc(). Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Link: https://patch.msgid.link/20260523-b4-fs-v1-12-275e36a83f0e@kernel.org Acked-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
2026-05-22fuse: reject fuse_notify() pagecache ops on directoriesJann Horn1-1/+8
The operations FUSE_NOTIFY_STORE and FUSE_NOTIFY_RETRIEVE allow the FUSE daemon to actively write/read pagecache contents. For directories with FOPEN_CACHE_DIR, the pagecache is used as kernel-internal cache storage, and userspace is not supposed to have direct access to this cache - in particular, fuse_parse_cache() will hit WARN_ON() if the cache contains bogus data. Reject FUSE_NOTIFY_STORE and FUSE_NOTIFY_RETRIEVE on anything other than regular files with -EINVAL. Fixes: 5d7bc7e8680c ("fuse: allow using readdir cache") Cc: stable@vger.kernel.org Signed-off-by: Jann Horn <jannh@google.com> Link: https://patch.msgid.link/20260519-fuse-dir-pagecache-v2-1-5428fa48e175@google.com Acked-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
2026-05-22fuse: limit FUSE_NOTIFY_RETRIEVE to uptodate foliosJann Horn1-0/+4
FUSE_NOTIFY_RETRIEVE must be limited to uptodate folios; !uptodate folios can contain uninitialized data. Since FUSE_NOTIFY_RETRIEVE is intended to only return data that is already in the page cache and not wait for data from the FUSE daemon, treat !uptodate folios as if they weren't present. This only has security impact on systems that don't enable automatic zero-initialization of all page allocations via CONFIG_INIT_ON_ALLOC_DEFAULT_ON or init_on_alloc=1. Cc: stable@kernel.org Fixes: 2d45ba381a74 ("fuse: add retrieve request") Signed-off-by: Jann Horn <jannh@google.com> Link: https://patch.msgid.link/20260519-fuse-retrieve-uptodate-v1-1-a7a1912a37f9@google.com Acked-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
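A rough userspace model of "treat !uptodate folios as if they weren't present" is shown below. It assumes the retrieve loop stops at the first folio it cannot use; struct cached_folio and retrieve_cached() are hypothetical names invented for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a page cache folio. */
struct cached_folio {
	bool present;     /* folio exists in the cache */
	bool uptodate;    /* contents are valid */
	const char *data;
	size_t len;
};

/* Copy cached data into out; returns the number of bytes copied. */
size_t retrieve_cached(const struct cached_folio *folios, size_t nr,
		       char *out, size_t out_len)
{
	size_t copied = 0;

	for (size_t i = 0; i < nr; i++) {
		const struct cached_folio *f = &folios[i];

		/* A folio that is not uptodate is handled like a missing one. */
		if (!f->present || !f->uptodate)
			break;
		if (f->len > out_len - copied)
			break;
		memcpy(out + copied, f->data, f->len);
		copied += f->len;
	}
	return copied;
}
```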
2026-05-11fuse: fix writeback array overflow when max_pages is oneJunxi Qian1-1/+4
fuse_iomap_writeback_range() appends one folio pointer and one fuse_folio_desc for every dirty range that is merged into the current writeback request. The merge decision checks the byte budget against fc->max_pages and fc->max_write, but it does not check whether the folio and descriptor arrays still have another free slot. This is not sufficient for fuseblk, where the filesystem block size can be smaller than PAGE_SIZE. With writeback cache enabled and max_pages negotiated as one, contiguous sub-page dirty ranges can fit within the byte budget while spanning more than one folio. The next append can then write past the one-slot folios and descs arrays. Split the request when the number of already attached folios has reached fc->max_pages. This keeps the folio/descriptor slot accounting in sync with the send decision. Fixes: ef7e7cbb323f ("fuse: use iomap for writeback") Cc: stable@vger.kernel.org Reviewed-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Junxi Qian <qjx1298677004@gmail.com> Link: https://patch.msgid.link/20260506122415.205340-1-qjx1298677004@gmail.com Acked-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
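The difference between the byte budget and the slot budget can be made concrete with a small model. This is a sketch under assumptions, not the patched function: struct wb_request, struct wb_limits and wb_can_append() are hypothetical, with max_pages and max_write standing in for fc->max_pages and fc->max_write.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the request currently being built. */
struct wb_request {
	size_t nr_folios; /* folio/descriptor slots already used */
	size_t nr_bytes;  /* bytes already queued */
};

/* Hypothetical negotiated limits. */
struct wb_limits {
	size_t max_pages;
	size_t max_write;
	size_t page_size;
};

/* May a dirty range of len bytes from a new folio join this request? */
bool wb_can_append(const struct wb_request *req,
		   const struct wb_limits *lim, size_t len)
{
	/* Byte budget: on its own this is the insufficient check. */
	if (req->nr_bytes + len > lim->max_write)
		return false;
	if (req->nr_bytes + len > lim->max_pages * lim->page_size)
		return false;
	/* Slot budget: each append needs one free folio and descriptor slot. */
	if (req->nr_folios >= lim->max_pages)
		return false;
	return true;
}
```

With max_pages set to one and a 4096-byte page, a request already holding one folio with 512 bytes passes both byte checks for another 512-byte range, and only the slot check forces the split.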
2026-04-27fuse: don't block in fuse_get_dev() for non-sync_init caseJoanne Koong1-3/+9
Commit a8dd5f1b73bc ("fuse: create fuse_dev on /dev/fuse open instead of mount") changed behavior so that fuse_get_dev() now unconditionally blocks waiting for a connection, even in the case where sync_init was not set. Previously, non-sync_init opens returned -EPERM immediately. Restore the previous behavior of returning -EPERM. Fixes: a8dd5f1b73bc ("fuse: create fuse_dev on /dev/fuse open instead of mount") Reported-by: Mark Brown <broonie@kernel.org> Closes: https://lore.kernel.org/all/3c9f8396-41f4-4c88-b883-34bede72b427@sirena.org.uk/ Cc: <stable@vger.kernel.org> Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Tested-by: Mark Brown <broonie@kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2026-04-23Merge tag 'vfs-7.1-rc1.fixes' of ↵Linus Torvalds1-0/+4
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - eventpoll: fix ep_remove() UAF and follow-up cleanup - fs: aio: set VMA_DONTCOPY_BIT in mmap to fix NULL-pointer-dereference error - writeback: Fix use after free in inode_switch_wbs_work_fn() - fuse: reject oversized dirents in page cache - fs: aio: reject partial mremap to avoid Null-pointer-dereference error - nstree: fix func. parameter kernel-doc warnings - fs: Handle multiply claimed blocks more gracefully with mmb * tag 'vfs-7.1-rc1.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: eventpoll: drop vestigial epi->dying flag eventpoll: drop dead bool return from ep_remove_epi() eventpoll: refresh eventpoll_release() fast-path comment eventpoll: move f_lock acquisition into ep_remove_file() eventpoll: fix ep_remove struct eventpoll / struct file UAF eventpoll: move epi_fget() up eventpoll: rename ep_remove_safe() back to ep_remove() eventpoll: drop vestigial __ prefix from ep_remove_{file,epi}() eventpoll: kill __ep_remove() eventpoll: split __ep_remove() eventpoll: use hlist_is_singular_node() in __ep_remove() fs: Handle multiply claimed blocks more gracefully with mmb nstree: fix func. parameter kernel-doc warnings fs: aio: reject partial mremap to avoid Null-pointer-dereference error fuse: reject oversized dirents in page cache writeback: Fix use after free in inode_switch_wbs_work_fn() fs: aio: set VMA_DONTCOPY_BIT in mmap to fix NULL-pointer-dereference error
2026-04-24fuse: reject oversized dirents in page cacheSamuel Page1-0/+4
fuse_add_dirent_to_cache() computes a serialized dirent size from the server-controlled namelen field and copies the dirent into a single page-cache page. The existing logic only checks whether the dirent fits in the remaining space of the current page and advances to a fresh page if not. It never checks whether the dirent itself exceeds PAGE_SIZE. As a result, a malicious