| Age | Commit message (Collapse) | Author | Files | Lines |
|
And document which members they protect.
end_polls() is called with both, outer fch->lock is probably unnecessary,
but doesn't hurt for now.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Move:
- no_interrupt
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Move:
- io_uring
- ring
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Move:
- initialized
- blocked
- blocked_waitq
- connected
- num_waiting
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Move:
- max_background
- num_background
- active_background
- bg_queue
- bg_lock
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
This belongs in the transport layer.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Move function definitions to dev.c, struct definitions to fuse_dev_i.h.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Move the 'fiq' member from fuse_conn to fuse_chan.
Move iqueue related structure definitions and function declarations from
"fuse_i.h" to "fuse_dev_i.h".
Add a fuse_dev_chan_new() helper, that returns a fuse_chan initialized with
the fuse_dev_fiq_ops.
Add a fuse_chan_release() function, that calls fiq->ops->release().
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
The goal is to separate transport layer stuff out from struct fuse_conn,
leaving just the filesystem related members.
Add a new object referenced from fuse_conn. This patch just implements the
allocation and freeing of this object.
Following patches will move transport related members from fuse_conn to
fuse_chan.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
This marks the first step in cleanly separating the transport layer from
the filesystem layer.
Add "dev.h", which will contain the interface definition for the transport
layer.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
When a background request completes via the io_uring path, the
background queue gets flushed to dispatch pending background requests,
but this is done before the connection-level background counters
(fc->num_background, fc->active_background) are properly accounted,
which may reduce effective queue depth to one.
The connection-level counters are decremented in fuse_request_end(), but
flush_bg_queue() flushes the /dev/fuse path queue (fc->bg_queue), not
the io_uring per-queue bg one, which means pending uring background
requests on the queue are never dispatched in this path.
Fix this by accounting the connection-level background counters first
before flushing the queue's background queue. Since
fuse_request_bg_finish() clears FR_BACKGROUND, fuse_request_end() will
skip the background cleanup branch entirely, which avoids any
double-decrements; it will call the wake_up(&req->waitq) branch but this
is effectively a no-op as background requests have no waiters on
req->waitq.
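The ordering argument above can be sketched as a small userspace model. This is only an illustration under stated assumptions: model_conn, model_req and the model_* functions are hypothetical names invented here, and the real kernel functions do considerably more work.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model types; the names are illustrative, not kernel structs. */
struct model_conn {
	unsigned int num_background;	/* models fc->num_background */
};

struct model_req {
	bool background;		/* models the FR_BACKGROUND flag */
};

/* Models fuse_request_bg_finish(): account once, then clear the flag. */
static void model_bg_finish(struct model_conn *fc, struct model_req *req)
{
	if (!req->background)
		return;
	assert(fc->num_background > 0);
	fc->num_background--;
	req->background = false;
}

/* Models the background cleanup branch of fuse_request_end(). */
static void model_request_end(struct model_conn *fc, struct model_req *req)
{
	if (req->background) {
		assert(fc->num_background > 0);
		fc->num_background--;
		req->background = false;
	}
	/* The wake-up on req->waitq would go here; bg requests have no waiters. */
}

/* Fixed ordering: account first, flush the per-queue list, then end. */
static unsigned int model_complete_bg(struct model_conn *fc,
				      struct model_req *req)
{
	model_bg_finish(fc, req);
	/* A flush of the ring queue's own background list would go here. */
	model_request_end(fc, req);
	return fc->num_background;
}
```

Because the flag is already clear when model_request_end() runs, the counter is decremented exactly once per completed request.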
Reviewed-by: Bernd Schubert <bernd@bsbernd.com>
Fixes: 857b0263f30e ("fuse: Allow to queue bg requests through io-uring")
Cc: stable@vger.kernel.org
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
If device_add() succeeds during CUSE initialization but a subsequent
step (cdev_alloc() or cdev_add()) fails, the error path calls
put_device() without first calling device_del(). This leaks the
devtmpfs entry created by device_add(), leaving a stale /dev/<name>
node that persists until reboot.
Since the cuse_conn is never linked into cuse_conntbl on the failure
path, cuse_channel_release() sees cc->dev == NULL and skips
device_unregister(), so no other code path cleans up the node.
This has several consequences:
- The device name is permanently poisoned: any subsequent attempt to
create a CUSE device with the same name hits the stale sysfs entry,
device_add() fails, and the new device is aborted.
- The collision manifests as ENODEV returned to userspace with no
dmesg diagnostic, making it very difficult to debug.
- The failure is self-perpetuating: once a name is leaked, all future
attempts with that name fail identically.
Fix this by introducing an err_dev label that calls device_del() to
undo device_add() before falling through to err_unlock. The existing
err_unlock path from a device_add() failure correctly skips device_del()
since the device was never added.
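The label ordering described above follows the usual goto-unwind pattern. Below is a minimal userspace sketch of that pattern, not the CUSE code itself: model_dev, model_dev_create() and the chosen error codes are hypothetical, and the boolean fields only stand in for the effects of device_add(), device_del() and put_device().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the device state touched during initialization. */
struct model_dev {
	bool added;	/* true between device_add() and device_del() */
	bool has_cdev;	/* true once the cdev step succeeded */
	int refs;	/* models the reference dropped by put_device() */
};

/* Each failure jumps to the label that undoes exactly what succeeded so far. */
static int model_dev_create(struct model_dev *d, bool add_fails,
			    bool cdev_fails)
{
	int rc;

	d->added = false;
	d->has_cdev = false;
	d->refs = 1;

	if (add_fails) {
		rc = -ENODEV;		/* illustrative error code */
		goto err_unlock;	/* never added: skip the del step */
	}
	d->added = true;

	if (cdev_fails) {
		rc = -ENOMEM;		/* illustrative error code */
		goto err_dev;
	}
	d->has_cdev = true;
	return 0;

err_dev:
	d->added = false;	/* models device_del() undoing device_add() */
err_unlock:
	d->refs--;		/* models put_device() */
	return rc;
}
```

The later label falls through into the earlier one, so a failure after the add step undoes both, while a failure of the add step itself only drops the reference.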
Testing instructions can be found at the lore link below.
Link: https://lore.kernel.org/all/20260408-wip-cuse-leak-fix-v1-0-1c028d575e97@redhat.com/
Signed-off-by: Alberto Ruiz <aruiz@redhat.com>
Fixes: 151060ac1314 ("CUSE: implement CUSE - Character device in Userspace")
Cc: stable@vger.kernel.org
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Revert the fuse part of commit c9ba789dad15 ("VFS: introduce
start_creating_noperm() and start_removing_noperm()").
Commit c9ba789dad15 ("VFS: introduce start_creating_noperm() and
start_removing_noperm()") caused a regression in FUSE_NOTIFY_INVAL_ENTRY,
which failed to invalidate negative dentries.
This manifests in the filesystem returning -ENOENT for operations on an
existing file.
Fixing it properly while still keeping the start_removing* infrastructure
would add much additional complexity.
Instead revert to the original simple implementation.
The start_removing* infrastructure is needed in VFS to abstract the
filesystem locking. However filesystem code can still safely use the raw
locking primitives without affecting other filesystems.
This is part two of the revert.
Reported-by: Артем Лабазов <123321artyom@gmail.com>
Closes: https://lore.kernel.org/all/CAFbF8N7++zopZuEcsKRxBV_sgOGCbzCY0hOyMw1SiGAtuzGhyQ@mail.gmail.com/
Fixes: c9ba789dad15 ("VFS: introduce start_creating_noperm() and start_removing_noperm()")
Cc: stable@vger.kernel.org # 6.19
Cc: NeilBrown <neilb@ownmail.net>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
This reverts commit cab012375122304a6343c1ed09404e5143b9dc01.
Commit c9ba789dad15 ("VFS: introduce start_creating_noperm() and
start_removing_noperm()") caused a regression in FUSE_NOTIFY_INVAL_ENTRY,
which failed to invalidate negative dentries.
This manifests in the filesystem returning -ENOENT for operations on an
existing file.
Fixing it properly while still keeping the start_removing* infrastructure
would add much additional complexity.
Instead revert to the original simple implementation.
The start_removing* infrastructure is needed in VFS to abstract the
filesystem locking. However filesystem code can still safely use the raw
locking primitives without affecting other filesystems.
This is part one of the revert.
Reported-by: Артем Лабазов <123321artyom@gmail.com>
Closes: https://lore.kernel.org/all/CAFbF8N7++zopZuEcsKRxBV_sgOGCbzCY0hOyMw1SiGAtuzGhyQ@mail.gmail.com/
Fixes: c9ba789dad15 ("VFS: introduce start_creating_noperm() and start_removing_noperm()")
Cc: stable@vger.kernel.org # 6.19
Cc: NeilBrown <neilb@ownmail.net>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
FUSE_NOTIFY_PRUNE validates the nodeid payload length with:
    size - sizeof(outarg) != outarg.count * sizeof(u64)
On 32-bit kernels, size_t is also 32 bits, so the daemon-controlled
count multiplication can wrap. A prune notification with count
0x20000000 and no nodeid payload passes the check, enters the copy
loop, and asks the device copy path to read nodeids that are not
present in the userspace write buffer. In QEMU this reaches the
fuse_copy_fill() BUG_ON(!err) path.
Validate the payload length with array_size() instead. That accepts
exactly the same valid messages, but avoids wrapping arithmetic before
the copy loop consumes the count.
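The wrap can be reproduced in userspace with a 32-bit stand-in for size_t. This is a sketch under stated assumptions: size32_t and both helpers are hypothetical names, and __builtin_mul_overflow() is a GCC/Clang builtin used here to imitate the saturating behaviour of array_size(), not the kernel helper itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a 32-bit size_t; not a kernel type. */
typedef uint32_t size32_t;

static_assert(sizeof(uint64_t) == 8, "nodeids are 64-bit");

/* Wrapping check, mirroring the shape of the old comparison. */
static bool payload_ok_wrapping(size32_t payload_len, size32_t count)
{
	return payload_len == (size32_t)(count * (size32_t)sizeof(uint64_t));
}

/* Saturating check: an overflowing product becomes the maximum value. */
static bool payload_ok_saturating(size32_t payload_len, size32_t count)
{
	size32_t bytes;

	if (__builtin_mul_overflow(count, (size32_t)sizeof(uint64_t), &bytes))
		bytes = UINT32_MAX;	/* no realistic payload is this long */
	return payload_len == bytes;
}
```

With count 0x20000000 and an empty payload, the wrapping variant accepts the message because the product truncates to zero, while the saturating variant rejects it. Both variants agree on well-formed messages such as two nodeids in a 16-byte payload.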
Assisted-by: Codex:gpt-5.5-cyber-preview
Fixes: 3f29d59e92a9 ("fuse: add prune notification")
Cc: stable@vger.kernel.org
Signed-off-by: Samuel Moelius <sam.moelius@trailofbits.com>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
If a copy into the userspace ring buffer fails, a request will be
terminated and fuse_uring_req_end() will set ent->fuse_req to NULL but
it will leave the entry on ent_w_req_queue in FRRS_FUSE_REQ state. This
can lead to a NULL deref if the request expiration logic scans
ent_w_req_queue in the window before the entry is moved off it.
Fix this by taking the entry off ent_w_req_queue and changing its state
from FRRS_FUSE_REQ to FRRS_INVALID before terminating the request.
Fixes: 4fea593e625c ("fuse: optimize over-io-uring request expiration check")
Cc: stable@kernel.org
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
When fuse_resend() moves a request from fpq->processing back to
fiq->pending, it sets FR_PENDING and clears FR_SENT but does not
remove the request's intr_entry from fiq->interrupts. If the
request had FR_INTERRUPTED set from a prior signal, intr_entry
remains dangling on fiq->interrupts. When the requesting task
then receives a fatal signal, fuse_remove_pending_req() sees
FR_PENDING=1, removes the request from fiq->pending and frees it
via the refcount path, also without cleaning intr_entry. The
stale intr_entry causes use-after-free when fuse_read_interrupt()
iterates fiq->interrupts:
- list_del_init(&req->intr_entry) -> UAF write on freed slab
- req->in.h.unique -> UAF read, data leaked to userspace
Remove intr_entry from fiq->interrupts in fuse_resend() for
interrupted requests before they are placed back on fiq->pending.
Add a WARN_ON if the intr_entry is not empty on request destruction.
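The invariant being restored is that a request is never destroyed while its intr_entry is still linked. A minimal userspace sketch with an intrusive list follows; list_node, model_req and the model_* helpers are hypothetical names, and assert() stands in for the WARN_ON.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal circular doubly linked list, similar in spirit to list_head. */
struct list_node {
	struct list_node *prev, *next;
};

static void list_init(struct list_node *n)
{
	n->prev = n->next = n;
}

static bool list_is_empty(const struct list_node *n)
{
	return n->next == n;
}

static void list_add_tail_node(struct list_node *n, struct list_node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Unlink and re-initialize, so a repeated removal is harmless. */
static void list_del_init_node(struct list_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);
}

/* Hypothetical request model: only the fields relevant to the fix. */
struct model_req {
	struct list_node intr_entry;
	bool interrupted;	/* models FR_INTERRUPTED */
};

/* Models the added step in the resend path. */
static void model_resend(struct model_req *req)
{
	if (req->interrupted)
		list_del_init_node(&req->intr_entry);
	/* The request would now be placed back on the pending list. */
}

/* Models request destruction with the new sanity check. */
static void model_destroy(struct model_req *req)
{
	assert(list_is_empty(&req->intr_entry));
}
```

After model_resend(), the interrupts list no longer references the request, so a later free cannot leave a dangling node behind for the list walker to touch.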
Fixes: 760eac73f9f6 ("fuse: Introduce a new notification type for resend pending requests")
Cc: stable@vger.kernel.org # 6.9
Signed-off-by: Ji'an Zhou <eilaimemedsnaimel@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Bad userspace might try to trick us and send commit SQEs with the
unique / commit-id of requests that have not even been sent to the
fuse-server yet (io_uring_cmd_done() not called).
fuse_uring_commit_fetch() ends the fuse request when the ring entry
has a wrong state, but that could have caused a use-after-free
with the memcpy operations in fuse_uring_send_in_task().
In order to avoid such races the call of fuse_uring_add_to_pq()
is moved after the copy operations and just before completing
the io-uring request - malicious userspace cannot find the request
anymore until all preparation work in fuse-client/kernel is completed.
This also moves fuse_uring_add_to_pq() a bit up in the code to
avoid a forward declaration. This is deliberately not done in a separate
preparation commit, to make it easier to backport to older kernels.
Reported-by: xlabai <xlabai@tencent.com>
Reported-by: Berkant Koc <me@berkoc.com>
Fixes: c090c8abae4b6b ("fuse: Add io-uring sqe commit and fetch support")
Cc: stable@kernel.org # 6.14
Signed-off-by: Bernd Schubert <bernd@bsbernd.com>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
There are several readers of queue->stopped that check the value
under lock, but fuse_uring_commit_fetch() did not and actually
the value was not set under the lock in fuse_uring_abort_end_requests()
either. It is especially important to check under the lock in
fuse_uring_commit_fetch(), because due to races 'struct fuse_req'
might be freed with fuse_request_end() while another thread/CPU
is already doing teardown work.
Cc: stable@kernel.org # 6.14
Fixes: 4a9bfb9b6850fec ("fuse: {io-uring} Handle teardown of ring entries")
Reported-by: Berkant Koc <me@berkoc.com>
Reported-by: xlabai <xlabai@tencent.com>
Signed-off-by: Bernd Schubert <bernd@bsbernd.com>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
fuse_uring_async_stop_queues() might run when the last reference
on ring->queue_refs was already dropped.
In order to avoid an early destruction a reference on struct fuse_conn
is now taken before starting fuse_uring_async_stop_queues() and that
reference is only released when that delayed work queue terminates.
Fixes: 4a9bfb9b6850 ("fuse: {io-uring} Handle teardown of ring entries")
Cc: stable@kernel.org # 6.14
Reported-by: Berkant Koc <me@berkoc.com>
Signed-off-by: Bernd Schubert <bernd@bsbernd.com>
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
When io_uring delivers task work with tw.cancel set (PF_EXITING,
PF_KTHREAD fallback, or percpu_ref_is_dying on the ring context),
fuse_uring_send_in_task() takes the cancel branch, assigns
-ECANCELED, and falls through to fuse_uring_send(). That path only
flips the entry to FRRS_USERSPACE and completes the io_uring cmd;
it never discharges the ring entry's owning reference to the
fuse_req that fuse_uring_add_req_to_ring_ent() handed it at
dispatch time.
  fuse_uring_send_in_task()
    tw.cancel == true
    err = -ECANCELED
    fuse_uring_send(ent, cmd, err, issue_flags)
      ent->state = FRRS_USERSPACE
      list_move(&ent->list, &queue->ent_in_userspace)
      ent->cmd = NULL
      io_uring_cmd_done(-ECANCELED)
      /* ent->fuse_req still set, req still hashed */
The fuse_req stays linked on fpq->processing[hash] and
fuse_request_end() is never invoked. The originating syscall
thread blocks in D-state in request_wait_answer() until
fuse_abort_conn() runs, which can be the entire connection
lifetime. For FR_BACKGROUND requests fc->num_background is never
decremented either, so repeated cancels inflate the counter until
max_background is hit and all later background ops stall. tw.cancel does
not imply a connection abort (e.g. a single io_uring worker thread exits
while the fuse connection stays up), so this cannot be left for
fuse_abort_conn() to clean up.
Ending the req but still routing the entry through fuse_uring_send()
is not enough: that leaves a req-less entry on ent_in_userspace, and
ent_list_request_expired() dereferences ent->fuse_req unconditionally
on the head of that list, which would then NULL-deref.
Fix the cancel branch to release the entry directly. Remove it from the
queue, complete the io_uring cmd, end the fuse_req, free the entry, and
drop its queue_refs (waking the teardown waiter if it was the last).
Fixes: c2c9af9a0b13 ("fuse: Allow to queue fg requests through io-uring")
Cc: stable@vger.kernel.org
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Assisted-by: kres (claude-opus-4-7)
Signed-off-by: Chris Mason <clm@meta.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
fuse_uring_cancel() moves entries that are available (these have no reqs
attached) to the ent_in_userspace list. ent_list_request_expired()
checks the first entry on ent_in_userspace and dereferences
ent->fuse_req unconditionally, which will crash on a cancelled entry
that was moved to this list.
Fix this by freeing the entry and dropping queue_refs directly in
fuse_uring_cancel(). This is safe because cancel is the cancel handler
itself - after io_uring_cmd_done(), no more cancels will be dispatched
for this command, and teardown serializes with cancel via queue->lock.
Since cancel now decrements queue_refs, fuse_uring_abort() must no
longer gate fuse_uring_abort_end_requests() on queue_refs > 0, as
cancelled entries may have already dropped queue_refs while requests are
still queued. Remove the gate so abort always flushes requests and stops
queues.
Reported-by: Heechan Kang <gganji11@naver.com>
Tested-by: Heechan Kang <gganji11@naver.com>
Reviewed-by: Bernd Schubert <bernd@bsbernd.com>
Fixes: 4fea593e625c ("fuse: optimize over-io-uring request expiration check")
Cc: stable@vger.kernel.org
Suggested-by: Jian Huang Li <ali@ddn.com>
Suggested-by: Horst Birthelmer <horst@birthelmer.de>
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Check fch->connected under fch->lock in fuse_uring_create() before
attaching a new ring. Without this, a race between fuse_uring_create()
and fuse_chan_abort() can result in the ring, queue, and fpq.processing
table being created after fuse_uring_abort() has already run, leading
to unnecessary allocation and teardown. These are eventually cleaned up
by fuse_uring_destruct() but will linger until the process exits, even
with the connection aborted.
Reviewed-by: Bernd Schubert <bernd@bsbernd.com>
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
This fixes this race:
- thread a: io_uring_enter -> register sqe ->
fuse_uring_create_ring_ent -> allocate ent but doesn't grab queue_ref
yet
- thread b: fuse_conn_destroy() -> fuse_chan_abort() ->
fuse_uring_abort() is a no-op due to queue ref being 0
- thread a: grabs the queue_ref, queue_ref is now 1, rest of
fuse_uring_do_register() logic executes
- thread b: fuse_chan_abort() returns, fuse_chan_wait_aborted() now runs
and calls
"wait_event(ring->stop_waitq, atomic_read(&ring->queue_refs) == 0);"
The abort/unmount thread will hang indefinitely in unkillable state as
nothing will decrement queue_refs or wake stop_waitq, and the ring,
queue, and ent are leaked.
Fix this by checking fch->connected under fch->lock after the created
ent has grabbed a ref count on the queue. This ensures that in the
scenario above, it is guaranteed that we either release the queue ref
and wake up stop_waitq (in case fuse_chan_wait_aborted() is already
waiting) in fuse_uring_do_register() when we detect !fch->connected, or
if the connection is aborted after the check, it is guaranteed that the
async teardown worker will be running in the background cleaning up ents
and decrementing the ent's ref on the queue, which will unblock the
eventual queue and ring teardown.
Fixes: 24fe962c86f5 ("fuse: {io-uring} Handle SQEs - register commands")
Cc: stable@vger.kernel.org
Reviewed-by: Bernd Schubert <bernd@bsbernd.com>
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
On weakly-ordered architectures, the store to fiq->ops can be
reordered past the store to ring->ready, allowing a CPU that sees
ring->ready == true via fuse_uring_ready() to dispatch requests
through a stale fiq->ops pointer. Upgrade the store to
smp_store_release() and the load in fuse_uring_ready() to
smp_load_acquire() so that the preceding WRITE_ONCE(fiq->ops, ...)
is visible to any CPU that observes ring->ready == true.
Additionally, fuse_uring_do_register() publishes ring->ready with
WRITE_ONCE() but the fast-path check reads it with a plain load.
This is a marked-vs-unmarked access that KCSAN will flag. Wrap it in
READ_ONCE() to mark it without adding unnecessary ordering.
Also wrap the fc->ring load in fuse_uring_ready() in READ_ONCE() to
prevent the compiler from reloading it between the NULL check and the
dereference.
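The publish/observe pairing can be illustrated in userspace with C11 atomics, whose release and acquire orderings are roughly analogous to smp_store_release() and smp_load_acquire(). This is a sketch only: model_ops, model_ring and both functions are hypothetical names, not the fuse structures.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical payload that must be visible before the flag is seen. */
struct model_ops {
	int id;
};

struct model_ring {
	const struct model_ops *ops;	/* plain data, published via 'ready' */
	atomic_bool ready;
};

/* Writer: store the payload first, then publish with release ordering. */
static void ring_publish(struct model_ring *r, const struct model_ops *ops)
{
	assert(ops != NULL);
	r->ops = ops;
	atomic_store_explicit(&r->ready, true, memory_order_release);
}

/* Reader: an acquire load that sees 'ready' also sees the payload store. */
static const struct model_ops *ring_ops_if_ready(struct model_ring *r)
{
	if (!atomic_load_explicit(&r->ready, memory_order_acquire))
		return NULL;
	return r->ops;
}
```

With relaxed or plain accesses on the flag, a reader on a weakly-ordered machine could observe ready == true and still read the old ops pointer; the release/acquire pair rules that out.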
Fixes: c2c9af9a0b13 ("fuse: Allow to queue fg requests through io-uring")
Cc: stable@vger.kernel.org
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Assisted-by: kres (claude-opus-4-7)
Signed-off-by: Chris Mason <clm@meta.com>
Reviewed-by: Bernd Schubert <bernd@bsbernd.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
copy_from_user() returns the number of bytes not copied as an unsigned
residual on failure (1..sizeof(struct fuse_out_header)). fuse_uring_commit
stores that residual in ssize_t err, sets req->out.h.error to -EFAULT,
then jumps to out: with err still holding the positive residual.
    err = copy_from_user(&req->out.h, &ent->headers->in_out,
                         sizeof(req->out.h));
    if (err) {
            req->out.h.error = -EFAULT;
            goto out;  /* err is the positive residual */
    }
    ...
    out:
            fuse_uring_req_end(ent, req, err);
fuse_uring_req_end() then runs
    if (error)
            req->out.h.error = error;
which overwrites the just-assigned -EFAULT with the positive residual.
FUSE callers such as fuse_simple_request() test err < 0 to detect
failure, so the positive value is interpreted as success and the
caller proceeds with an uninitialised or partial req->out.args.
Fix by assigning err = -EFAULT in the failure branch before jumping
to out, so fuse_uring_req_end() receives a negative errno and sets
req->out.h.error to -EFAULT.
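The difference between the two variants can be modelled in userspace. This is a sketch under stated assumptions: model_req, model_copy() and the commit_* functions are hypothetical, and model_copy() only imitates the "bytes not copied" return convention of copy_from_user().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical request model; only the header error field matters here. */
struct model_req {
	int error;	/* models req->out.h.error */
};

/* Imitates copy_from_user(): returns the bytes NOT copied, 0 on success. */
static unsigned long model_copy(size_t len, size_t fault_after)
{
	return fault_after >= len ? 0 : len - fault_after;
}

/* Models fuse_uring_req_end(): a non-zero error overwrites the header. */
static void model_req_end(struct model_req *req, long error)
{
	assert(req != NULL);
	if (error)
		req->error = (int)error;
}

/* Old flow: the positive residual leaks through as the final error. */
static int commit_buggy(struct model_req *req, size_t len, size_t fault_after)
{
	long err = (long)model_copy(len, fault_after);

	if (err)
		req->error = -EFAULT;	/* overwritten by model_req_end() */
	model_req_end(req, err);
	return req->error;
}

/* Fixed flow: the residual is converted to -EFAULT before the end call. */
static int commit_fixed(struct model_req *req, size_t len, size_t fault_after)
{
	long err = 0;

	if (model_copy(len, fault_after)) {
		req->error = -EFAULT;
		err = -EFAULT;
	}
	model_req_end(req, err);
	return req->error;
}
```

A caller that only tests for a negative result treats the positive value returned by commit_buggy() as success, which is the failure mode described above.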
Fixes: c090c8abae4b ("fuse: Add io-uring sqe commit and fetch support")
Cc: stable@vger.kernel.org
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Assisted-by: kres (claude-opus-4-7)
Signed-off-by: Chris Mason <clm@meta.com>
Reviewed-by: Bernd Schubert <bernd@bsbernd.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Commit dabb90391028 ("fuse: increase readdir buffer size") changed
fuse_readdir_uncached() to size its temporary buffer from ctx->count.
This is useful for overlayfs and other in-kernel callers that use
INT_MAX to indicate an unlimited directory read.
The larger buffer is currently supplied as a kvec output argument. For
virtiofs, kvec arguments are copied through req->argbuf, which is
allocated with kmalloc(..., GFP_ATOMIC). A large uncached readdir buffer
can therefore require a multi-megabyte contiguous atomic allocation
before the request is queued.
Avoid the large bounce-buffer allocation by backing uncached readdir
output with pages and setting out_pages. Transports such as virtiofs can
then pass the pages as scatter-gather entries instead of copying the
output through argbuf.
Map the pages with vm_map_ram() only while parsing the returned dirents.
The existing parser can then continue to use a linear kernel mapping.
[SzM: separate allocation of pages into a helper function]
Fixes: dabb90391028 ("fuse: increase readdir buffer size")
Cc: stable@vger.kernel.org
Signed-off-by: Matthew R. Ochs <mochs@nvidia.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
iput() called from fuse_release_end() can Oops if the super block has
already been destroyed. Normally this is prevented by waiting for
num_waiting to go down to zero before commencing with super block shutdown.
This only works, however, for the last submount instance, as the wait
counter is per connection, not per superblock.
Revert to using synchronous release requests for the auto_submounts case,
which is virtiofs only at this time.
Reported-by: Aurélien Bombo <abombo@microsoft.com>
Reported-by: Zhihao Cheng <chengzhihao1@huawei.com>
Cc: Greg Kurz <gkurz@redhat.com>
Closes: https://github.com/kata-containers/kata-containers/issues/12589
Fixes: 26e5c67deb2e ("fuse: fix livelock in synchronous file put from fuseblk workers")
Cc: stable@vger.kernel.org
Reviewed-by: Greg Kurz <gkurz@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull dcache updates from Al Viro:
- d_alloc_parallel() API change (Neil's with my changes)
- NORCU fixes
- Reorganization and simplification of dentry eviction logic
- Simplifying rcu_read_lock() scopes in fs/dcache.c
- Secondary roots work - getting rid of NFS fake root dentries and
dealing with remaining shrink_dcache_for_umount() and
shrink_dentry_list() races
- making cursors NORCU (surprisingly easy)
* tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (22 commits)
make cursors NORCU
nfs: get rid of fake root dentries
wind ->s_roots via ->d_sib instead of ->d_hash
shrink_dentry_tree(): unify the calls of shrink_dentry_list()
shrinking rcu_read_lock() scope in d_alloc_parallel()
d_walk(): shrink rcu_read_lock() scope
document dentry_kill()
adjust calling conventions of lock_for_kill(), fold __dentry_kill() into dentry_kill()
Document rcu_read_lock() use in select_collect2()
Shift rcu_read_{,un}lock() inside fast_dput()
simplify safety for lock_for_kill() slowpath
fold lock_for_kill() and __dentry_kill() into common helper
fold lock_for_kill() into shrink_kill()
shrink_dentry_list(): start with removing from shrink list
d_prune_aliases(): make sure to skip NORCU aliases
kill d_dispose_if_unused()
make to_shrink_list() return whether it has moved dentry to list
select_collect(): ignore dentries on shrink lists if they have positive refcounts
find_acceptable_alias(): skip NORCU aliases with zero refcount
fix a race between d_find_any_alias() and final dput() of NORCU dentries
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"Features:
- Reduce pipe->mutex contention by pre-allocating pages outside the
lock in anon_pipe_write().
anon_pipe_write() called alloc_page() once per page while holding
pipe->mutex. The allocation can sleep doing direct reclaim and runs
memcg charging, which extends the critical section and stalls any
concurrent reader on the same mutex. Now up to 8 pages are
pre-allocated before the mutex is taken, leftovers are recycled
into the per-pipe tmp_page[] cache before unlock, and any remainder
is released after unlock, keeping the allocator out of the critical
section on both sides. On a writers x readers sweep with 64KB
writes against a 1 MB pipe throughput improves 6-28% and average
write latency drops 5-22%; under memory pressure - when the cost of
holding the mutex across reclaim is highest - throughput improves
21-48% and latency drops 17-33%. The microbenchmark is added to
selftests.
- uaccess/sockptr: fix the ignored_trailing logic in
copy_struct_to_user() to behave as documented and the usize check
in copy_struct_from_sockptr() for user pointers, and add
copy_struct_{from,to}_bounce_buffer() and copy_struct_to_sockptr()
helpers for upcoming users (IPPROTO_SMBDIRECT, IPPROTO_QUIC).
- bpf: add a sleepable bpf_real_inode() kfunc that resolves the real
inode backing a dentry via d_real_inode(). On overlayfs the inode
attached to the dentry doesn't carry the underlying device
information; this is used by the filesystem restriction BPF program
that was merged into systemd.
- docs: add guidelines for submitting new filesystems, motivated by
the maintenance burden abandoned and untestable filesystems impose
on VFS developers, blocking infrastructure work like folio
conversions and iomap migration.
Fixes:
- libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo()
and drop the now-redundant assignments in callers. This began as a
one-line dma-buf fix for a path_noexec() warning; a pseudo
filesystem has no reason not to set SB_I_NOEXEC. All init_pseudo()
callers were audited: the only visible effect is on dma-buf where
SB_I_NOEXEC silences the warning.
- Handle set_blocksize() failures in legacy filesystems (bfs, hpfs,
qnx4, jfs, befs, affs, isofs, minix, ntfs3, omfs). Mounting a
device with a sector size > PAGE_SIZE crashed roughly half of them;
the rest had the same missing error handling pattern. Plus a
follow-up releasing the superblock buffer_head when setting the
minix v3 block size fails.
- mount: honour SB_NOUSER in the new mount API.
- fs/fcntl: fix a SOFTIRQ-unsafe lock order in fasync signaling by
switching the process-group paths of send_sigio() and send_sigurg()
from read_lock(&tasklist_lock) to RCU, matching the single-PID
path.
- vfs: add an FS_USERNS_DELEGATABLE flag and set it for NFS, fixing
delegated NFS mounts (fsopen() in a container with the mount
performed by a privileged daemon) that broke when non-init
s_user_ns was tied to FS_USERNS_MOUNT.
- selftests/namespaces: fix a hang in nsid_test where an unreaped
grandchild kept the TAP pipe write-end open, a waitpid(-1) race in
listns_efault_test, and a false FAIL on kernels without listns()
where the tests should SKIP.
- filelock: fix the break_lease() stub signature for
CONFIG_FILE_LOCKING=n.
- init/initramfs_test: wait for the async initramfs unpacking before
running; the test and do_populate_rootfs() share the parser state.
- fs/coredump: reduce redundant log noise in
validate_coredump_safety().
- iomap: pass the correct length to fserror_report_io() in
__iomap_write_begin().
- backing-file: fix the backing_file_open() kerneldoc.
Cleanups:
- initramfs: refactor the cpio hex header parsing to use hex2bin()
instead of the hand-rolled simple_strntoul() which is reverted, and
extend the initramfs KUnit tests to cover header fields with 0x
prefixes.
- Replace __get_free_pages() and friends with kmalloc()/kzalloc()
across quota, proc, ocfs2/dlm, nilfs2, nfs, nfsd, libfs, jfs, jbd2,
isofs, fuse, select, namespace, configfs, binfmt_misc, bfs, and the
do_mounts init code - part of the larger work of replacing page
allocator calls with kmalloc().
- Use clear_and_wake_up_bit() in unlock_buffer() and
journal_end_buffer_io_sync() instead of open-coding the sequence.
- Drop unused VFS exports: unexport drop_super_exclusive(), remove
start_removing_user_path_at(), and fold __start_removing_path()
into start_removing_path().
- fs/read_write: narrow the __kernel_write() export with
EXPORT_SYMBOL_FOR_MODULES().
- vfs: uapi: retire octal and hex constants in favor of (1 << n) for
the O_ flags. Finding a free bit for a new flag across the
architectures was needlessly hard with the mixed bases.
- dcache: add extra sanity checks of dead dentries in dentry_free()
via a new DENTRY_WARN_ONCE() that also prints d_flags.
- iov_iter: use kmemdup_array() in dup_iter() to harden the
allocation against multiplication overflow.
- fs/pipe: write to ->poll_usage only once.
- vfs: remove an always-taken if-branch in find_next_fd().
- dcache: use kmalloc_flex() for struct external_name in __d_alloc().
- namei: use QSTR() instead of QSTR_INIT() in path_pts().
- sync_file_range: delete dead S_ISLNK code.
- Comment fixes: retire a stale comment in fget_task_next() and fix
assorted spelling mistakes"
* tag 'vfs-7.2-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (73 commits)
backing-file: fix backing_file_open() kerneldoc parameter
iomap: pass the correct len to fserror_report_io in __iomap_write_begin
vfs: add FS_USERNS_DELEGATABLE flag and set it for NFS
filelock: fix break_lease() stub signature for CONFIG_FILE_LOCKING=n
vfs: uapi: retire octal and hex numbers in favor of (1 << n) for O_ flags
bpf: add bpf_real_inode() kfunc
fs/read_write: Do not export __kernel_write() to the entire world
libfs: drop redundant SB_I_NOEXEC/SB_I_NODEV in init_pseudo() callers
libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo()
mount: honour SB_NOUSER in the new mount API
fs/fcntl: fix SOFTIRQ-unsafe lock order in fasync signaling
selftests/pipe: add pipe_bench microbenchmark
fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write
fs: retire stale comment in fget_task_next()
fs: fix spelling mistakes in comment
bfs: replace get_zeroed_page() with kzalloc()
binfmt_misc: replace __get_free_page() with kmalloc()
configfs: replace __get_free_pages() with kzalloc()
fs/namespace: use __getname() to allocate mntpath buffer
fs/select: replace __get_free_page() with kmalloc()
...
|
|
fuse_ref_folio() unlocks the request but does not re-lock it before
returning. fuse_chan_abort() can end the request and the async end
callback (eg fuse_writepage_free()) can free the args while the
subsequent copy chain logic after fuse_ref_folio() accesses them,
leading to use-after-free issues.
Fix this by locking the request in fuse_ref_folio() before returning.
Fixes: c3021629a0d8 ("fuse: support splice() reading from fuse device")
Cc: stable@vger.kernel.org
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
fuse_try_move_folio() unlocks the request on entry but does not
re-lock it on the success path. This means fuse_chan_abort() can end the
request and free the fuse_io_args (eg fuse_readpages_end()) while the
subsequent copy chain logic after fuse_try_move_folio() accesses the
fuse_io_args, leading to use-after-free issues.
Fix this by calling lock_request() before replace_page_cache_folio().
This ensures the request is locked on the success path which will
prevent the fuse_io_args from being freed while the later copying logic
runs, and also ensures that the ap->folios[i]->mapping is never null
since ap->folios[i] will always point to the newfolio after
replace_page_cache_folio().
Fixes: ce534fb05292 ("fuse: allow splice to move pages")
Cc: stable@vger.kernel.org
Reported-by: Lei Lu <llfamsec@gmail.com>
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
|
|
Rename to_shrink_list() into __move_to_shrink_list(), document and
export it. Switch d_dispose_if_unused() users to that and kill
d_dispose_if_unused() itself.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Parallel lookup starts with a call of d_alloc_parallel(). That primitive
either returns a matching hashed dentry or allocates a new one in the
in-lookup state and returns it to the caller. Once the caller is done
with lookup, it indicates so either by call of d_{splice_alias,add}()
or by call of d_lookup_done(); at that point dentry leaves the in-lookup
state.
If d_alloc_parallel() finds a matching in-lookup dentry, it must wait for
that dentry to leave the in-lookup state, one way or another. Currently
that is done by supplying wait_queue_head to d_alloc_parallel(). If d_alloc_parallel()
creates a new in-lookup dentry, the address of that wait_queue_head is stored
in ->d_wait of new dentry and stays there while it's in the in-lookup state;
subsequent d_alloc_parallel() will wait on the queue found in the matching
in-lookup dentry. Transition out of in-lookup state wakes waiters on that
queue (if any).
That works, but the calling conventions are inconvenient - the caller must
supply wait_queue_head and make sure that it survives at least until the new
in-lookup dentry leaves the in-lookup state. That amounts to boilerplate
in the d_alloc_parallel() callers that are followed by a call of d_lookup_done()
in the same function; in cases like nfs asynchronous unlink it gets worse than
that.
This patch changes d_alloc_parallel() to use wake_up_var_locked() to
wake up waiters, and wait_var_event_spinlock() to wait. dentry->d_lock
is used for synchronisation as it is already held at the relevant
times.
That eliminates the need for caller-supplied wait_queue_head, simplifying
the calling conventions. Better yet, we only need one bit of information
stored in dentry itself: whether there are any waiters to be woken up,
and that can be easily stored in ->d_flags; ->d_wait goes away.
The reason we need that bit (DCACHE_LOOKUP_WAITERS) is that with wait_var
machinery the queues are shared with all kinds of stuff and there's
no way to tell if any of the waiters have anything to do with our dentry;
most of the time none of them will be relevant, so we need to avoid the
pointless wakeups.
Another benefit of the new scheme comes from the fact that wakeups
have to be done outside of write-side critical areas of ->i_dir_seq;
with the old scheme we need to carry the value picked from ->d_wait from
__d_lookup_unhash() to the place where we actually wake the waiters up.
Now we can just leave DCACHE_LOOKUP_WAITERS in ->d_flags until we get
to doing wakeups - that's done within the same ->d_lock scope, so we
are fine; new bit is accessed only under ->d_lock and it's seen only
on dentries with DCACHE_PAR_LOOKUP in ->d_flags.
__d_lookup_unhash() no longer needs to re-init ->d_lru. That was
previously shared (in a union) with ->d_wait but ->d_wait is now gone
so it no longer corrupts ->d_lru.
Co-developed-by: Al Viro <viro@zeniv.linux.org.uk> # saner handling of flags
Signed-off-by: NeilBrown <neil@brown.name>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
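
The role of the waiters bit can be illustrated with a single-threaded model. This is a sketch, not the dcache code: the flag values and the `model_*` names are hypothetical, and the locking that ->d_lock provides in the kernel is left out.

```c
#include <stdbool.h>

/* Hypothetical flag values standing in for the real ->d_flags bits. */
#define MODEL_PAR_LOOKUP      0x1u  /* models DCACHE_PAR_LOOKUP */
#define MODEL_LOOKUP_WAITERS  0x2u  /* models DCACHE_LOOKUP_WAITERS */

struct model_dentry {
    unsigned int flags;
};

/*
 * A second lookup found this dentry. If it is still in-lookup, record
 * that somebody is waiting and tell the caller to sleep.
 */
static bool model_start_wait(struct model_dentry *d)
{
    if (!(d->flags & MODEL_PAR_LOOKUP))
        return false;               /* lookup already finished */
    d->flags |= MODEL_LOOKUP_WAITERS;
    return true;
}

/*
 * The dentry leaves the in-lookup state. Both bits are cleared; the
 * return value says whether a wakeup on the shared queue is needed.
 */
static bool model_end_lookup(struct model_dentry *d)
{
    bool wake = (d->flags & MODEL_LOOKUP_WAITERS) != 0;

    d->flags &= ~(MODEL_PAR_LOOKUP | MODEL_LOOKUP_WAITERS);
    return wake;
}
```

With no waiter recorded, `model_end_lookup()` returns false and the shared wait queue is never touched, which is the pointless wakeup the extra bit avoids.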
|
|
fuse_do_ioctl allocates memory for struct iov array using
__get_free_page().
kmalloc() is a better API for such use and it also provides better
scalability and more debugging possibilities.
Replace use of __get_free_page() with kmalloc().
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Link: https://patch.msgid.link/20260523-b4-fs-v1-12-275e36a83f0e@kernel.org
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
|
|
The operations FUSE_NOTIFY_STORE and FUSE_NOTIFY_RETRIEVE allow the
FUSE daemon to actively write/read pagecache contents.
For directories with FOPEN_CACHE_DIR, the pagecache is used as
kernel-internal cache storage, and userspace is not supposed to have
direct access to this cache - in particular, fuse_parse_cache() will hit
WARN_ON() if the cache contains bogus data.
Reject FUSE_NOTIFY_STORE and FUSE_NOTIFY_RETRIEVE on anything other than
regular files with -EINVAL.
Fixes: 5d7bc7e8680c ("fuse: allow using readdir cache")
Cc: stable@vger.kernel.org
Signed-off-by: Jann Horn <jannh@google.com>
Link: https://patch.msgid.link/20260519-fuse-dir-pagecache-v2-1-5428fa48e175@google.com
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
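
The check described above amounts to a file-type gate in front of both notifications. A minimal sketch, with a hypothetical `model_file_type` enum and `model_notify_check()` helper in place of the kernel's inode mode test:

```c
#include <errno.h>

/* Hypothetical file types; the kernel inspects the inode mode instead. */
enum model_file_type {
    MODEL_REG,  /* regular file */
    MODEL_DIR,  /* directory, possibly using the readdir cache */
    MODEL_LNK,  /* symlink */
};

/*
 * Return 0 if a store/retrieve notification may touch the page cache
 * of a file of this type, -EINVAL for anything but a regular file.
 */
static int model_notify_check(enum model_file_type type)
{
    return type == MODEL_REG ? 0 : -EINVAL;
}
```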
|
|
FUSE_NOTIFY_RETRIEVE must be limited to uptodate folios; !uptodate folios
can contain uninitialized data.
Since FUSE_NOTIFY_RETRIEVE is intended to only return data that is already
in the page cache and not wait for data from the FUSE daemon, treat
!uptodate folios as if they weren't present.
This only has security impact on systems that don't enable automatic
zero-initialization of all page allocations via
CONFIG_INIT_ON_ALLOC_DEFAULT_ON or init_on_alloc=1.
Cc: stable@kernel.org
Fixes: 2d45ba381a74 ("fuse: add retrieve request")
Signed-off-by: Jann Horn <jannh@google.com>
Link: https://patch.msgid.link/20260519-fuse-retrieve-uptodate-v1-1-a7a1912a37f9@google.com
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
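
The rule "treat !uptodate folios as if they weren't present" can be sketched as a copy loop over a modelled page cache. `model_folio` and `model_retrieve()` are hypothetical, and the sketch assumes that retrieval stops at the first absent folio.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of one page-cache slot; not a kernel structure. */
struct model_folio {
    bool present;      /* a folio exists at this index */
    bool uptodate;     /* its contents have been fully initialized */
    const char *data;
    size_t len;
};

/*
 * Copy cached data into out, stopping at the first folio that is
 * missing or not uptodate. Returns the number of bytes copied.
 */
static size_t model_retrieve(const struct model_folio *folios, size_t nr,
                             char *out, size_t out_len)
{
    size_t copied = 0;

    for (size_t i = 0; i < nr; i++) {
        const struct model_folio *f = &folios[i];

        /* A !uptodate folio is handled exactly like a missing one. */
        if (!f->present || !f->uptodate)
            break;
        if (f->len > out_len - copied)
            break;
        memcpy(out + copied, f->data, f->len);
        copied += f->len;
    }
    return copied;
}
```

Nothing from a !uptodate slot ever reaches the output buffer, so possibly uninitialized contents are not exposed to the caller.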
|
|
fuse_iomap_writeback_range() appends one folio pointer and one
fuse_folio_desc for every dirty range that is merged into the current
writeback request. The merge decision checks the byte budget against
fc->max_pages and fc->max_write, but it does not check whether the folio
and descriptor arrays still have another free slot.
This is not sufficient for fuseblk, where the filesystem block size can
be smaller than PAGE_SIZE. With writeback cache enabled and max_pages
negotiated as one, contiguous sub-page dirty ranges can fit within the
byte budget while spanning more than one folio. The next append can then
write past the one-slot folios and descs arrays.
Split the request when the number of already attached folios has reached
fc->max_pages. This keeps the folio/descriptor slot accounting in sync
with the send decision.
Fixes: ef7e7cbb323f ("fuse: use iomap for writeback")
Cc: stable@vger.kernel.org
Reviewed-by: Joanne Koong <joannelkoong@gmail.com>
Signed-off-by: Junxi Qian <qjx1298677004@gmail.com>
Link: https://patch.msgid.link/20260506122415.205340-1-qjx1298677004@gmail.com
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
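
The difference between the byte budget and the slot budget can be shown with a small model of the merge decision. This is a sketch under assumptions: `wb_request`, `wb_limits` and `wb_must_split()` are hypothetical names, and the byte checks are simplified compared to the real code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state of the writeback request being built. */
struct wb_request {
    size_t nr_folios;  /* folio/descriptor slots already used */
    size_t nr_bytes;   /* bytes already queued in the request */
};

/* Hypothetical limits; the commit takes them from fc->max_pages/max_write. */
struct wb_limits {
    size_t max_pages;
    size_t max_write;
    size_t page_size;
};

/*
 * Return true if the current request has to be sent before another
 * dirty range of len bytes (one more folio slot) can be appended.
 */
static bool wb_must_split(const struct wb_request *req,
                          const struct wb_limits *lim, size_t len)
{
    /* Byte budget: the kind of check that existed before the fix. */
    if (req->nr_bytes + len > lim->max_write)
        return true;
    if (req->nr_bytes + len > lim->max_pages * lim->page_size)
        return true;

    /* Slot budget: every appended range needs a free array slot. */
    if (req->nr_folios >= lim->max_pages)
        return true;

    return false;
}
```

With `max_pages` of one and two 512-byte ranges in different folios, the byte checks pass for the second range and only the slot check forces the split.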
|
|
Commit a8dd5f1b73bc ("fuse: create fuse_dev on /dev/fuse open instead of
mount") changed behavior so that fuse_get_dev() now unconditionally
blocks waiting for a connection, even in the case where sync_init was
not set. Previously, non-sync_init opens returned -EPERM immediately.
Restore the previous behavior of returning -EPERM.
Fixes: a8dd5f1b73bc ("fuse: create fuse_dev on /dev/fuse open instead of mount")
Reported-by: Mark Brown <broonie@kernel.org>
Closes: https://lore.kernel.org/all/3c9f8396-41f4-4c88-b883-34bede72b427@sirena.org.uk/
Cc: <stable@vger.kernel.org>
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Tested-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
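
The restored behaviour is a three-way decision. A sketch with hypothetical names (`model_get_dev()`, `MODEL_WOULD_WAIT`); the real function returns a device pointer or an error pointer rather than an int.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical marker for "the caller would block waiting for a connection". */
#define MODEL_WOULD_WAIT 1

/*
 * Decide what an operation on a /dev/fuse file does:
 *   0                - a connection exists, proceed
 *   -EPERM           - no connection and sync_init not set, fail at once
 *   MODEL_WOULD_WAIT - no connection yet, but sync_init asks to wait
 */
static int model_get_dev(bool connected, bool sync_init)
{
    if (connected)
        return 0;
    if (!sync_init)
        return -EPERM;
    return MODEL_WOULD_WAIT;
}
```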
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- eventpoll: fix ep_remove() UAF and follow-up cleanup
- fs: aio: set VMA_DONTCOPY_BIT in mmap to fix NULL-pointer-dereference
error
- writeback: Fix use after free in inode_switch_wbs_work_fn()
- fuse: reject oversized dirents in page cache
- fs: aio: reject partial mremap to avoid Null-pointer-dereference
error
- nstree: fix func. parameter kernel-doc warnings
- fs: Handle multiply claimed blocks more gracefully with mmb
* tag 'vfs-7.1-rc1.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
eventpoll: drop vestigial epi->dying flag
eventpoll: drop dead bool return from ep_remove_epi()
eventpoll: refresh eventpoll_release() fast-path comment
eventpoll: move f_lock acquisition into ep_remove_file()
eventpoll: fix ep_remove struct eventpoll / struct file UAF
eventpoll: move epi_fget() up
eventpoll: rename ep_remove_safe() back to ep_remove()
eventpoll: drop vestigial __ prefix from ep_remove_{file,epi}()
eventpoll: kill __ep_remove()
eventpoll: split __ep_remove()
eventpoll: use hlist_is_singular_node() in __ep_remove()
fs: Handle multiply claimed blocks more gracefully with mmb
nstree: fix func. parameter kernel-doc warnings
fs: aio: reject partial mremap to avoid Null-pointer-dereference error
fuse: reject oversized dirents in page cache
writeback: Fix use after free in inode_switch_wbs_work_fn()
fs: aio: set VMA_DONTCOPY_BIT in mmap to fix NULL-pointer-dereference error
|
|
fuse_add_dirent_to_cache() computes a serialized dirent size from the
server-controlled namelen field and copies the dirent into a single
page-cache page. The existing logic only checks whether the dirent fits
in the remaining space of the current page and advances to a fresh page
if not. It never checks whether the dirent itself exceeds PAGE_SIZE.
As a result, a malicious |