| Age | Commit message | Author | Files | Lines |
|
Pull 9p updates from Dominique Martinet:
"Aside from the avalanche of LLM-driven fixes, there are a couple of big
changes this cycle:
- negative dentry and symlink cache
- a way out of the unkillable "io_wait_event_killable" (because it
looped around waiting for the request flush to come back from
server; this has been bugging syzkaller folks since forever): I'm
still not 100% sure about this patch, but I think it's as good as
we'll ever get, and will keep testing a bit further in the coming
weeks
The rest is more noisy than usual, but shouldn't cause any trouble"
* tag '9p-for-7.2-rc1' of https://github.com/martinetd/linux:
9p: Add missing read barrier in virtio zero-copy path
net/9p: Replace strlen() strcpy() pair with strscpy()
9p: skip nlink update in cacheless mode to fix WARN_ON
net/9p: fix race condition on rdma->state in trans_rdma.c
9p: v9fs_file_do_lock: replace WARN_ONCE with p9_debug
9p: Enable symlink caching in page cache
9p: Set default negative dentry retention time for cache=loose
9p: Add mount option for negative dentry cache retention
9p: Cache negative dentries for lookup performance
9p: avoid returning ERR_PTR(0) from mkdir operations
9p: avoid putting oldfid in p9_client_walk() error path
net/9p: fix infinite loop in p9_client_rpc on fatal signal
docs/filesystems/9p: fix broken external links
9p: invalidate readdir buffer on seek
9p: use kvzalloc for readdir buffer
net/9p/usbg: Constify struct configfs_item_operations
|
|
v9fs_dec_count() unconditionally calls drop_nlink() on regular files,
even when the inode's nlink is already zero. In cacheless mode the
client refetches inode metadata from the server (the source of truth)
on every operation, so by the time v9fs_remove() returns, the locally
cached nlink may already reflect the post-unlink value:
1. Client initiates unlink, server processes it and sets nlink to 0
2. Client refetches inode metadata (nlink=0) before unlink returns
3. Client's v9fs_remove() completes successfully
4. Client calls v9fs_dec_count() which calls drop_nlink() on nlink=0
This race is easily triggered under heavy unlink workloads, such as
stress-ng's unlink stressor, producing the following warning:
WARNING: fs/inode.c:417 at drop_nlink+0x4c/0xc8
Call trace:
drop_nlink+0x4c/0xc8
v9fs_remove+0x1e0/0x250 [9p]
v9fs_vfs_unlink+0x20/0x38 [9p]
vfs_unlink+0x13c/0x258
...
In cacheless mode the server is authoritative and the inode is on its
way out, so locally adjusting nlink buys nothing. Skip v9fs_dec_count()
entirely when neither CACHE_META nor CACHE_LOOSE is set, which both
avoids the warning and removes a class of nlink races (two concurrent
unlinkers observing nlink > 0 and both calling drop_nlink()) that an
nlink == 0 guard alone would only narrow rather than close.
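The decision described above can be modelled in userspace roughly as follows. This is only a sketch, not the kernel patch: struct model_inode, model_dec_count() and the MODEL_CACHE_* names are hypothetical stand-ins, the flag values are illustrative, and the nlink == 0 check for cached modes is a defensive choice of the model rather than something the patch is known to contain.

```c
#include <stdbool.h>

/* Illustrative flag values; the real definitions live in fs/9p/v9fs.h. */
#define MODEL_CACHE_META  0x02u
#define MODEL_CACHE_LOOSE 0x08u

struct model_inode {
	unsigned int nlink;
};

/*
 * Hypothetical stand-in for the nlink adjustment done after a successful
 * remove. Returns true if the local nlink was adjusted.
 */
bool model_dec_count(struct model_inode *inode, unsigned int cache)
{
	/* Cacheless: the server is authoritative, leave nlink alone. */
	if (!(cache & (MODEL_CACHE_META | MODEL_CACHE_LOOSE)))
		return false;

	/* Model-only defensive check: never decrement below zero. */
	if (inode->nlink == 0)
		return false;

	inode->nlink--;
	return true;
}
```

With cache == 0 the helper never touches nlink, which is the property the commit relies on to avoid the drop_nlink() warning.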
Fixes: ac89b2ef9b55 ("9p: don't maintain dir i_nlink if the exported fs doesn't either")
Cc: stable@vger.kernel.org
Suggested-by: Dominique Martinet <asmadeus@codewreck.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
Message-ID: <20260421-9p-v2-1-48762d294fad@debian.org>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
This warning depends on server-provided data, so we should not use
WARN here.
Reported-by: Yifei Chu <yifeichu24@gmail.com>
Closes: https://lore.kernel.org/r/CAPJnbgJ7ZK7DCjCfG56hd_iKGePmAzudb4hOWd4=9r32nM+KcA@mail.gmail.com
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Message-ID: <20260529-lock-warn-v1-1-20c29580d61d@codewreck.org>
|
|
Currently, when cache=loose is enabled, file reads are cached in the
page cache, but symlink reads are not. This patch allows the results
of p9_client_readlink() to be stored in the page cache, eliminating
the need for repeated 9P transactions on subsequent symlink accesses.
This change improves performance for workloads that involve frequent
symlink resolution.
Signed-off-by: Remi Pommarel <repk@triplefau.lt>
Message-ID: <982462d17c0c0d2856763266a25eb04d080c1dbb.1779355927.git.repk@triplefau.lt>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
For cache=loose mounts, set the default negative dentry cache retention
time to 24 hours.
Signed-off-by: Remi Pommarel <repk@triplefau.lt>
Message-ID: <b5beca3e70890ab8a4f0b9e99bd69cb97f5cb9eb.1779355927.git.repk@triplefau.lt>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
Introduce a new mount option, negtimeout, for v9fs that allows users
to specify how long negative dentries are retained in the cache. The
retention time can be set in milliseconds (e.g. negtimeout=10000 for
a 10-second retention time) or to a negative value (e.g. negtimeout=-1) to
keep negative entries until the buffer cache management removes them.
For consistency reasons, this option should only be used in exclusive
or read-only mount scenarios, aligning with the cache=loose usage.
Signed-off-by: Remi Pommarel <repk@triplefau.lt>
Message-ID: <b2d66500aa5a2f6540347c4aa46a4be10dd01bc6.1779355927.git.repk@triplefau.lt>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
Not caching negative dentries can result in poor performance for
workloads that repeatedly look up non-existent paths. Each such
lookup triggers a full 9P transaction with the server, adding
unnecessary overhead.
A typical example is source compilation, where multiple cc1 processes
are spawned and repeatedly search for the same missing header files.
This change enables caching of negative dentries, so that lookups for
known non-existent paths do not require a full 9P transaction. The
cached negative dentries are retained for a configurable duration
(expressed in milliseconds), as specified by the ndentry_timeout
field in struct v9fs_session_info. If set to -1, negative dentries
are cached indefinitely.
This optimization reduces lookup overhead and improves performance for
workloads involving frequent access to non-existent paths.
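The retention rule described above can be sketched as a small userspace helper. model_neg_entry_valid() and its parameters are hypothetical names, not kernel API; treating every negative timeout as "indefinite" and using a strict "<" at the expiry boundary are assumptions of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: decide whether a cached negative entry may still be
 * trusted. timeout_ms follows the semantics described for ndentry_timeout:
 * a negative value keeps the entry indefinitely, otherwise the entry
 * expires timeout_ms milliseconds after it was cached.
 */
bool model_neg_entry_valid(int64_t cached_at_ms, int64_t now_ms,
			   int64_t timeout_ms)
{
	if (timeout_ms < 0)
		return true;
	return now_ms - cached_at_ms < timeout_ms;
}
```

A timeout of 0 never validates an entry in this model, so every lookup of a missing path would go back to the server.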
Signed-off-by: Remi Pommarel <repk@triplefau.lt>
Message-ID: <e542317dd03bbadb5249abd3ea6aecfdca692c19.1779355927.git.repk@triplefau.lt>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
When mkdir succeeds, v9fs_vfs_mkdir_dotl() and v9fs_vfs_mkdir() return
ERR_PTR(0) which is incorrect. They should return NULL instead for
success and ERR_PTR() only with negative error codes for failure.
Return NULL instead of passing err to ERR_PTR() when err is zero.
Fixes smatch warnings:
fs/9p/vfs_inode_dotl.c:420 v9fs_vfs_mkdir_dotl() warn: passing zero to 'ERR_PTR'
fs/9p/vfs_inode.c:695 v9fs_vfs_mkdir() warn: passing zero to 'ERR_PTR'
The v9fs_vfs_mkdir() code was further simplified because v9fs_create()
can never return NULL, so we do not need to check for fid being set
separately, and the error path can be a simple return immediately after
v9fs_create() failure.
There is no intended functional change.
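The return convention can be illustrated with userspace approximations of the error-pointer helpers. The model_* names are hypothetical; the helpers only mimic the usual ERR_PTR()/IS_ERR() behaviour and are not the kernel definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace approximations of the kernel's error-pointer helpers. */
#define MODEL_MAX_ERRNO 4095

void *model_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

int model_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MODEL_MAX_ERRNO;
}

/* Pattern after the fix: NULL on success, an error pointer on failure. */
void *model_mkdir_result(int err)
{
	if (err)
		return model_err_ptr(err);
	return NULL;
}
```

In this model an error pointer built from 0 compares equal to NULL, which is consistent with the commit's statement that no functional change is intended; the explicit NULL simply states the success case directly.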
Fixes: 88d5baf69082 ("Change inode_operations.mkdir to return struct dentry *")
Suggested-by: David Laight <david.laight.linux@gmail.com>
Acked-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Signed-off-by: Hongling Zeng <zenghongling@kylinos.cn>
Message-ID: <20260520022650.14217-1-zenghongling@kylinos.cn>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
The per-fid readdir buffer (fid->rdir) is populated lazily and only
refilled when fully drained (rdir->head == rdir->tail). A userspace
lseek() on a directory fd updates file->f_pos via generic_file_llseek()
but does not touch the cached buffer, so the next getdents() iterates
the stale cache and emits entries from the previous position instead
of the one the caller asked for.
Track the file position the cached data corresponds to in
struct p9_rdir, and drop the cache on entry to iterate_shared when it
no longer matches ctx->pos. The 9p protocol's Tread/Treaddir already
take an arbitrary offset on every request, so a refill at the new
position is always legal; no .llseek override or seek restriction is
needed.
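The invalidation check can be modelled in userspace along these lines. struct model_rdir and model_rdir_check() are hypothetical simplifications of the per-fid buffer, and the field layout is an assumption of this sketch rather than the actual struct p9_rdir.

```c
#include <stdint.h>

/* Hypothetical, simplified model of the per-fid readdir cache. */
struct model_rdir {
	int head;	/* next unread byte in the buffer */
	int tail;	/* end of valid data in the buffer */
	int64_t pos;	/* file position the data at head corresponds to */
};

/*
 * Called on entry to directory iteration. Returns 1 if the buffer must be
 * (re)filled at ctx_pos, 0 if the cached data can be reused.
 */
int model_rdir_check(struct model_rdir *rdir, int64_t ctx_pos)
{
	if (rdir->head < rdir->tail && rdir->pos != ctx_pos) {
		/* Position changed behind our back (e.g. lseek): drop it. */
		rdir->head = 0;
		rdir->tail = 0;
	}
	rdir->pos = ctx_pos;
	return rdir->head == rdir->tail;
}
```

Because a refill at any offset is legal in the protocol, dropping the buffer is the only action the model needs on a mismatch.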
Reported-by: Pierre Barre <pierre@barre.sh>
Link: https://lore.kernel.org/v9fs/496d10b9-40fe-4f81-8014-37497c37ff63@app.fastmail.com/
Signed-off-by: Pierre Barre <pierre@barre.sh>
Message-ID: <20260512132032.369281-2-pierre@barre.sh>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
The readdir buffer is sized to msize, so kzalloc() can fail under
fragmentation with a page allocation failure in v9fs_alloc_rdir_buf()
/ v9fs_dir_readdir_dotl().
The buffer is only a response sink and is never pack_sg_list()'d,
so kvzalloc() is safe for all transports, unlike the fcall buffers
fixed in e21d451a82f3 ("9p: Use kvmalloc for message buffers on
supported transports").
Signed-off-by: Pierre Barre <pierre@barre.sh>
Message-ID: <20260512132032.369281-1-pierre@barre.sh>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
Fix potential tearing in using ->remote_i_size and ->zero_point by copying
i_size_read() and i_size_write() and using the same seqcount as for i_size.
We need to make sure that netfslib and the filesystems that use it always
hold i_lock whilst updating any of the sizes to prevent i_size_seqcount
from getting corrupted.
Fixes: 4058f742105e ("netfs: Keep track of the actual remote file size")
Fixes: 100ccd18bb41 ("netfs: Optimise away reads above the point at which there can be no data")
Closes: https://sashiko.dev/#/patchset/20260414082004.3756080-1-dhowells%40redhat.com
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://patch.msgid.link/20260512123404.719402-6-dhowells@redhat.com
cc: Paulo Alcantara <pc@manguebit.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Pull 9p updates from Dominique Martinet:
- 9p access flag fix (the access flag could not be changed since the new mount API conversion)
- some minor cleanup
* tag '9p-for-7.1-rc1' of https://github.com/martinetd/linux:
9p/trans_xen: replace simple_strto* with kstrtouint
9p/trans_xen: make cleanup idempotent after dataring alloc errors
9p: document missing enum values in kernel-doc comments
9p: fix access mode flags being ORed instead of replaced
9p: fix memory leak in v9fs_init_fs_context error path
|
|
Since commit 1f3e4142c0eb ("9p: convert to the new mount API"),
v9fs_apply_options() applies parsed mount flags with |= onto flags
already set by v9fs_session_init(). For 9P2000.L, session_init sets
V9FS_ACCESS_CLIENT as the default, so when the user mounts with
"access=user", both bits end up set. Access mode checks compare
against exact values, so having both bits set matches neither mode.
This causes v9fs_fid_lookup() to fall through to the default switch
case, using INVALID_UID (nobody/65534) instead of current_fsuid()
for all fid lookups. Root is then unable to chown or perform other
privileged operations.
Fix by clearing the access mask before applying the user's choice.
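The difference between ORing and replacing the access mode can be shown with a small userspace model. The MODEL_* names and bit values are illustrative assumptions (the real flags are defined in fs/9p/v9fs.h), and both helpers are hypothetical.

```c
/* Illustrative bit values; the real flags are defined in fs/9p/v9fs.h. */
#define MODEL_ACCESS_SINGLE 0x04u
#define MODEL_ACCESS_USER   0x08u
#define MODEL_ACCESS_CLIENT 0x10u
#define MODEL_ACCESS_MASK \
	(MODEL_ACCESS_SINGLE | MODEL_ACCESS_USER | MODEL_ACCESS_CLIENT)

/* Buggy variant: ORs the parsed mode onto whatever default is already set. */
unsigned int model_apply_access_or(unsigned int flags, unsigned int parsed)
{
	return flags | parsed;
}

/* Fixed variant: clear the access bits first, then apply the user's choice. */
unsigned int model_apply_access_replace(unsigned int flags, unsigned int parsed)
{
	return (flags & ~MODEL_ACCESS_MASK) | (parsed & MODEL_ACCESS_MASK);
}
```

Starting from a default with the CLIENT bit set, the OR variant leaves both CLIENT and USER set, so an exact comparison against either mode fails; the replace variant leaves only USER set and preserves unrelated bits.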
Fixes: 1f3e4142c0eb ("9p: convert to the new mount API")
Signed-off-by: Pierre Barre <pierre@barre.sh>
Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Message-ID: <0ddc72da-d196-4f01-8755-0086f670e779@app.fastmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
Move the assignments of fc->ops and fc->fs_private to right after the
kzalloc, before any fallible operations. Previously these were assigned
at the end of the function, after the kstrdup calls for uname and aname.
If either kstrdup failed, the error path would set fc->need_free but
leave fc->ops NULL, so put_fs_context() would never call v9fs_free_fc()
to free the allocated context and any already-duplicated strings.
Fixes: 1f3e4142c0eb ("9p: convert to the new mount API")
Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Sasha Levin <sashal@kernel.org>
Message-ID: <20260225135745.351984-1-sashal@kernel.org>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
On 32-bit architectures, unsigned long is only 32 bits wide, which
causes 64-bit inode numbers to be silently truncated. Several
filesystems (NFS, XFS, BTRFS, etc.) can generate inode numbers that
exceed 32 bits, and this truncation can lead to inode number collisions
and other subtle bugs on 32-bit systems.
Change the type of inode->i_ino from unsigned long to u64 to ensure that
inode numbers are always represented as 64-bit values regardless of
architecture. Update all format specifiers treewide from %lu/%lx to
%llu/%llx to match the new type, along with corresponding local variable
types.
This is the bulk treewide conversion. Earlier patches in this series
handled trace events separately to allow trace field reordering for
better struct packing on 32-bit.
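The truncation problem can be reproduced in userspace with fixed-width types. model_truncate_ino() and model_format_ino() are hypothetical helpers; uint32_t stands in for unsigned long on a 32-bit architecture.

```c
#include <stdint.h>
#include <stdio.h>

/* What a 32-bit unsigned long would keep of a 64-bit inode number. */
uint32_t model_truncate_ino(uint64_t ino)
{
	return (uint32_t)ino;
}

/* Format a 64-bit inode number; returns the number of characters written. */
int model_format_ino(char *buf, size_t len, uint64_t ino)
{
	return snprintf(buf, len, "%llu", (unsigned long long)ino);
}
```

Two distinct inode numbers such as 0x100000001 and 0x200000001 share the same low 32 bits, so they collide once truncated, while the 64-bit representation printed with %llu keeps them apart.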
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20260304-iino-u64-v3-12-2257ad83d372@kernel.org
Acked-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
This was done entirely with mindless brute force, using
git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with: kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains a mix of VFS cleanups, performance improvements, API
fixes, documentation, and a deprecation notice.
Scalability and performance:
- Rework pid allocation to only take pidmap_lock once instead of
twice during alloc_pid(), improving thread creation/teardown
throughput by 10-16% depending on false-sharing luck. Pad the
namespace refcount to reduce false-sharing
- Track file lock presence via a flag in ->i_opflags instead of
reading ->i_flctx, avoiding false-sharing with ->i_readcount on
open/close hot paths. Measured 4-16% improvement on 24-core
open-in-a-loop benchmarks
- Use a consume fence in locks_inode_context() to match the
store-release/load-consume idiom, eliminating a hardware fence on
some architectures
- Annotate cdev_lock with __cacheline_aligned_in_smp to prevent
false-sharing
- Remove a redundant DCACHE_MANAGED_DENTRY check in
__follow_mount_rcu() that never fires since the caller already
verifies it, eliminating a 100% mispredicted branch
- Fix a 100% mispredicted likely() in devcgroup_inode_permission()
that became wrong after a prior code reorder
Bug fixes and correctness:
- Make insert_inode_locked() wait for inode destruction instead of
skipping, fixing a corner case where two matching inodes could
exist in the hash
- Move f_mode initialization before file_ref_init() in alloc_file()
to respect the SLAB_TYPESAFE_BY_RCU ordering contract
- Add a WARN_ON_ONCE guard in try_to_free_buffers() for folios with
no buffers attached, preventing a null pointer dereference when
AS_RELEASE_ALWAYS is set but no release_folio op exists
- Fix select restart_block to store end_time as timespec64, avoiding
truncation of tv_sec on 32-bit architectures
- Make dump_inode() use get_kernel_nofault() to safely access inode
and superblock fields, matching the dump_mapping() pattern
API modernization:
- Make posix_acl_to_xattr() allocate the buffer internally since
every single caller was doing it anyway. Reduces boilerplate and
unnecessary error checking across ~15 filesystems
- Replace deprecated simple_strtoul() with kstrtoul() for the
ihash_entries, dhash_entries, mhash_entries, and mphash_entries
boot parameters, adding proper error handling
- Convert chardev code to use guard(mutex) and __free(kfree) cleanup
patterns
- Replace min_t() with min() or umin() in VFS code to avoid silently
truncating unsigned long to unsigned int
- Gate LOOKUP_RCU assertions behind CONFIG_DEBUG_VFS since callers
already check the flag
Deprecation:
- Begin deprecating legacy BSD process accounting (acct(2)). The
interface has numerous footguns and better alternatives exist
(eBPF)
Documentation:
- Fix and complete kernel-doc for struct export_operations, removing
duplicated documentation between ReST and source
- Fix kernel-doc warnings for __start_dirop() and ilookup5_nowait()
Testing:
- Add a kunit test for initramfs cpio handling of entries with
filesize > PATH_MAX
Misc:
- Add missing <linux/init_task.h> include in fs_struct.c"
* tag 'vfs-7.0-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (28 commits)
posix_acl: make posix_acl_to_xattr() alloc the buffer
fs: make insert_inode_locked() wait for inode destruction
initramfs_test: kunit test for cpio.filesize > PATH_MAX
fs: improve dump_inode() to safely access inode fields
fs: add <linux/init_task.h> for 'init_fs'
docs: exportfs: Use source code struct documentation
fs: move initializing f_mode before file_ref_init()
exportfs: Complete kernel-doc for struct export_operations
exportfs: Mark struct export_operations functions at kernel-doc
exportfs: Fix kernel-doc output for get_name()
acct(2): begin the deprecation of legacy BSD process accounting
device_cgroup: remove branch hint after code refactor
VFS: fix __start_dirop() kernel-doc warnings
fs: Describe @isnew parameter in ilookup5_nowait()
fs/namei: Remove redundant DCACHE_MANAGED_DENTRY check in __follow_mount_rcu
fs: only assert on LOOKUP_RCU when built with CONFIG_DEBUG_VFS
select: store end_time as timespec64 in restart block
chardev: Switch to guard(mutex) and __free(kfree)
namespace: Replace simple_strtoul with kstrtoul to parse boot params
dcache: Replace simple_strtoul with kstrtoul in set_dhash_entries
...
|
|
Without exception, all callers do that, so move the allocation into the
helper.
This reduces boilerplate and removes unnecessary error checking.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://patch.msgid.link/20260115122341.556026-1-mszeredi@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Setting ->setlease() to a NULL pointer now has the same effect as
setting it to simple_nosetlease(). Remove all of the setlease
file_operations that are set to simple_nosetlease, and the function
itself.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20260108-setlease-6-20-v1-24-ea4dec9b67fa@kernel.org
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
With the advent of directory leases, it's necessary to set the
->setlease() handler in directory file_operations to properly deny them.
Fixes: e6d28ebc17eb ("filelock: push the S_ISREG check down to ->setlease handlers")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://patch.msgid.link/20260107-setlease-6-19-v1-3-85f034abcc57@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Pull 9p updates from Dominique Martinet:
- fix a bug with O_APPEND in cached mode causing data to be written
multiple times on server
- use kvmalloc for trans_fd to avoid problems with large msize and
fragmented memory. This should hopefully be used in more transports
when time allows
- convert to new mount API
- minor cleanups
* tag '9p-for-6.19-rc1' of https://github.com/martinetd/linux:
9p: fix new mount API cache option handling
9p: fix cache/debug options printing in v9fs_show_options
9p: convert to the new mount API
9p: create a v9fs_context structure to hold parsed options
net/9p: move structures and macros to header files
fs/fs_parse: add back fsparam_u32hex
fs/9p: delete unnnecessary condition
fs/9p: Don't open remote file with APPEND mode when writeback cache is used
net/9p: cleanup: change p9_trans_module->def to bool
9p: Use kvmalloc for message buffers on supported transports
|
|
After commit 4eb3117888a92, 9p needs to be able to accept numerical
cache= mount options as well as the string "shortcuts" because the option
is printed numerically in /proc/mounts rather than by string. This was
missed in the mount API conversion, which used an enum for the shortcuts
and therefore could not handle a numeric equivalent as an argument
to the cache option.
Fix this by removing the enum and reverting to the slightly more
open-coded option handling for Opt_cache, with the reinstated
get_cache_mode() helper.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Message-ID: <48cdeec9-5bb9-4c7a-a203-39bb8e0ef443@redhat.com>
Tested-by: Remi Pommarel <repk@triplefau.lt>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
Commit 4eb3117888a92 changed the cache= option to accept either string
shortcuts or bitfield values. It also changed /proc/mounts to emit the
option as the hexadecimal numeric value rather than the shortcut string.
However, by printing "cache=%x" without the leading 0x, shortcuts such
as "cache=loose" will emit "cache=f" and 'f' is not a string that is
parseable by kstrtoint(), so remounting may fail if a remount with
"cache=f" is attempted.
debug=%x has had the same problem since options were first displayed in
c4fac9100456 ("9p: Implement show_options").
Fix these by adding the 0x prefix to the hexadecimal value shown in
/proc/mounts.
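The round-trip problem can be reproduced in userspace. model_parse_int() is a hypothetical stand-in built on strtol(); parsing with base 0 (auto-detected from the prefix) is an assumption of this sketch, and model_show() simply formats a value with both "%x" and "%#x".

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Userspace stand-in for a base-0 integer parse (the kernel side uses
 * kstrtoint()). Returns 0 and stores the value on success, -1 on failure.
 */
int model_parse_int(const char *s, int *out)
{
	char *end;
	long val;

	errno = 0;
	val = strtol(s, &end, 0);
	if (errno || end == s || *end != '\0')
		return -1;
	*out = (int)val;
	return 0;
}

/* Emit a value the way "%x" and "%#x" would show it. */
void model_show(char *plain, char *prefixed, size_t len, unsigned int val)
{
	snprintf(plain, len, "%x", val);
	snprintf(prefixed, len, "%#x", val);
}
```

For the value 0xf, the plain form is "f", which a prefix-detecting parser rejects, while the "%#x" form is "0xf", which parses back to 15.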
Fixes: 4eb3117888a92 ("fs/9p: Rework cache modes and add new options to Documentation")
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Message-ID: <54b93378-dcf1-4b04-922d-c8b4393da299@redhat.com>
[Dominique: use %#x at Al Viro's suggestion, also handle debug]
Tested-by: Remi Pommarel <repk@triplefau.lt>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull fs header updates from Christian Brauner:
"This contains initial work to start splitting up fs.h.
Begin the long-overdue work of splitting up the monolithic fs.h
header. The header has grown to over 3000 lines and includes types and
functions for many different subsystems, making it difficult to
navigate and causing excessive compilation dependencies.
This series introduces new focused headers for superblock-related
code:
- Rename fs_types.h to fs_dirent.h to better reflect its actual
content (directory entry types)
- Add fs/super_types.h containing superblock type definitions
- Add fs/super.h containing superblock function declarations
This is the first step in a longer effort to modularize the VFS
headers.
Cleanups:
- Inode Field Layout Optimization (Mateusz Guzik)
Move inode fields used during fast path lookup closer together to
improve cache locality during path resolution.
- current_umask() Optimization (Mateusz Guzik)
Inline current_umask() and move it to fs_struct.h. This improves
performance by avoiding function call overhead for this
frequently-used function, and places it in a more appropriate
header since it operates on fs_struct"
* tag 'vfs-6.19-rc1.fs_header' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: move inode fields used during fast path lookup closer together
fs: inline current_umask() and move it to fs_struct.h
fs: add fs/super.h header
fs: add fs/super_types.h header
fs: rename fs_types.h to fs_dirent.h
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull writeback updates from Christian Brauner:
"Features:
- Allow file systems to increase the minimum writeback chunk size.
The relatively low minimal writeback size of 4MiB means that
written back inodes on rotational media are switched a lot. Besides
introducing additional seeks, this also can lead to extreme file
fragmentation on zoned devices when a lot of files are cached
relative to the available writeback bandwidth.
This adds a superblock field that allows the file system to
override the default size, and sets it to the zone size for zoned
XFS.
- Add logging for slow writeback when it exceeds
sysctl_hung_task_timeout_secs. This helps identify tasks waiting
for a long time and pinpoint potential issues. Recording the
starting jiffies is also useful when debugging a crashed vmcore.
- Wake up waiting tasks when finishing the writeback of a chunk
Cleanups:
- filemap_* writeback interface cleanups.
Adding filemap_fdatawrite_wbc ended up being a mistake, as all but
the original btrfs caller should be using better high level
interfaces instead.
This series removes all these low-level interfaces, switches btrfs
to a more specific interface, and cleans up other too low-level
interfaces. With this the writeback_control that is passed to the
writeback code is only initialized in three places.
- Remove __filemap_fdatawrite, __filemap_fdatawrite_range, and
filemap_fdatawrite_wbc
- Add filemap_flush_nr helper for btrfs
- Push struct writeback_control into start_delalloc_inodes in btrfs
- Rename filemap_fdatawrite_range_kick to filemap_flush_range
- Stop opencoding filemap_fdatawrite_range in 9p, ocfs2, and mm
- Make wbc_to_tag() inline and use it in fs"
* tag 'vfs-6.19-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: Make wbc_to_tag() inline and use it in fs.
xfs: set s_min_writeback_pages for zoned file systems
writeback: allow the file system to override MIN_WRITEBACK_PAGES
writeback: cleanup writeback_chunk_size
mm: rename filemap_fdatawrite_range_kick to filemap_flush_range
mm: remove __filemap_fdatawrite_range
mm: remove filemap_fdatawrite_wbc
mm: remove __filemap_fdatawrite
mm,btrfs: add a filemap_flush_nr helper
btrfs: push struct writeback_control into start_delalloc_inodes
btrfs: use the local tmp_inode variable in start_delalloc_inodes
ocfs2: don't opencode filemap_fdatawrite_range in ocfs2_journal_submit_inode_data_buffers
9p: don't opencode filemap_fdatawrite_range in v9fs_mmap_vm_close
mm: don't opencode filemap_fdatawrite_range in filemap_invalidate_inode
writeback: Add logging for slow writeback (exceeds sysctl_hung_task_timeout_secs)
writeback: Wake up waiting tasks when finishing the writeback of a chunk.
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs inode updates from Christian Brauner:
"Features:
- Hide inode->i_state behind accessors. Open-coded accesses prevent
asserting they are done correctly. One obvious aspect is locking,
but significantly more can be checked. For example it can be
detected when the code is clearing flags which are already missing,
or is setting flags when it is illegal (e.g., I_FREEING when
->i_count > 0)
- Provide accessors for ->i_state, converts all filesystems using
coccinelle and manual conversions (btrfs, ceph, smb, f2fs, gfs2,
overlayfs, nilfs2, xfs), and makes plain ->i_state access fail to
compile
- Rework I_NEW handling to operate without fences, simplifying the
code after the accessor infrastructure is in place
Cleanups:
- Move wait_on_inode() from writeback.h to fs.h
- Spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb
for clarity
- Cosmetic fixes to LRU handling
- Push list presence check into inode_io_list_del()
- Touch up predicts in __d_lookup_rcu()
- ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage
- Assert on ->i_count in iput_final()
- Assert ->i_lock held in __iget()
Fixes:
- Add missing fences to I_NEW handling"
* tag 'vfs-6.19-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (22 commits)
dcache: touch up predicts in __d_lookup_rcu()
fs: push list presence check into inode_io_list_del()
fs: cosmetic fixes to lru handling
fs: rework I_NEW handling to operate without fences
fs: make plain ->i_state access fail to compile
xfs: use the new ->i_state accessors
nilfs2: use the new ->i_state accessors
overlayfs: use the new ->i_state accessors
gfs2: use the new ->i_state accessors
f2fs: use the new ->i_state accessors
smb: use the new ->i_state accessors
ceph: use the new ->i_state accessors
btrfs: use the new ->i_state accessors
Manual conversion to use ->i_state accessors of all places not covered by coccinelle
Coccinelle-based conversion to use ->i_state accessors
fs: provide accessors for ->i_state
fs: spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb
fs: move wait_on_inode() from writeback.h to fs.h
fs: add missing fences to I_NEW handling
ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage
...
|
|
There is no good reason to have this as a func call, other than avoiding
the churn of adding fs_struct.h as needed.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251104170448.630414-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Convert 9p to the new mount API. This patch consolidates all parsing
into fs/9p/v9fs.c, which stores all results into a filesystem context
which can be passed to the various transports as needed.
Some of the parsing helper functions such as get_cache_mode() have been
eliminated in favor of using the new mount API's enum param type,
for simplicity.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Message-ID: <20251010214222.1347785-5-sandeen@redhat.com>
[ Dominique: handled source explicitly as per follow-up discussion ]
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
We already know that "retval" is negative, so there is no need to check
again. Also the statement is not indented far enough. Delete it.
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Fixes: 43c36a56ccf6 ("Revert "fs/9p: Refresh metadata in d_revalidate for uncached mode too"")
Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Message-ID: <aPtiSJl8EwSfVvqN@stanley.mountain>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
When page cache is used, writebacks are done on a page granularity, and it
is expected that the underlying filesystem (such as v9fs) should respect
the write position. However, v9fs currently passes O_APPEND through to
the server even in cached mode. This causes data corruption if a sync or
fstat gets between two writes to the same file.
This patch removes the APPEND flag from the open request we send to the
server when writeback caching is involved. I believe keeping server-side
APPEND is probably fine for uncached mode (even if two fds are opened, one
without O_APPEND and one with it, this should still be fine since they
would use separate fid for the writes).
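The flag handling described above amounts to masking O_APPEND out of the flags sent to the server. model_server_open_flags() is a hypothetical helper, and the boolean parameter is a simplification of however the client decides that writeback caching is in effect.

```c
#include <fcntl.h>
#include <stdbool.h>

/*
 * Hypothetical helper: compute the open flags to send to the server.
 * With writeback caching the client writes pages back at explicit
 * offsets, so O_APPEND is not forwarded.
 */
int model_server_open_flags(int flags, bool writeback_cache)
{
	if (writeback_cache)
		flags &= ~O_APPEND;
	return flags;
}
```

In uncached mode the flags are passed through unchanged, matching the commit's view that server-side APPEND is probably fine there.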
Signed-off-by: Tingmao Wang <m@maowtm.org>
Fixes: 4eb3117888a9 ("fs/9p: Rework cache modes and add new options to Documentation")
Message-ID: <20251102235631.8724-1-m@maowtm.org>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
Use filemap_fdatawrite_range instead of opencoding the logic using
filemap_fdatawrite_wbc.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251024080431.324236-3-hch@lst.de
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
This reverts commit 290434474c332a2ba9c8499fe699c7f2e1153280.
That commit broke cache=mmap, a mode that doesn't cache metadata,
but still has writeback cache.
In commit 290434474c33 ("fs/9p: Refresh metadata in d_revalidate
for uncached mode too") we treated only the metadata cache as a reason
not to look at the server, but with writeback cache too, looking at the
server-side size makes the vfs consider the file truncated before the
data has been flushed out, making the following repro fail (nothing is
ever read back, and the resulting file ends up with no data written):
```
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

char buf[4096];

int main(int argc, char *argv[])
{
	int ret, i;
	int fdw, fdr;

	if (argc < 2)
		return 1;

	/* Create the file and write one page through the writeback cache. */
	fdw = openat(AT_FDCWD, argv[1], O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC, 0600);
	if (fdw < 0) {
		fprintf(stderr, "cannot open fdw\n");
		return 1;
	}
	write(fdw, buf, sizeof(buf));

	/* Reopen read-only while the written data may still be unflushed. */
	fdr = openat(AT_FDCWD, argv[1], O_RDONLY|O_CLOEXEC);
	if (fdr < 0) {
		fprintf(stderr, "cannot open fdr\n");
		close(fdw);
		return 1;
	}

	/* With the bug, every read returns 0 instead of the data written. */
	for (i = 0; i < 10; i++) {
		ret = read(fdr, buf, sizeof(buf));
		fprintf(stderr, "i: %d, read returns %d\n", i, ret);
	}

	close(fdr);
	close(fdw);
	return 0;
}
```
There is a fix for this particular reproducer but it looks like there
are other problems around metadata refresh (e.g. around file rename), so
revert this to avoid d_revalidate in uncached mode for now.
Reported-by: Song Liu <song@kernel.org>
Link: https://lkml.kernel.org/r/CAHzjS_u_SYdt5=2gYO_dxzMKXzGMt-TfdE_ueowg-Hq5tRCAiw@mail.gmail.com
Reported-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Link: https://lore.kernel.org/bpf/CAEf4BzZbCE4tLoDZyUf_aASpgAGFj75QMfSXX4a4dLYixnOiLg@mail.gmail.com/
Fixes: 290434474c33 ("fs/9p: Refresh metadata in d_revalidate for uncached mode too")
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
All places were patched by coccinelle with the default expecting that
->i_lock is held, afterwards entries got fixed up by hand to use
unlocked variants as needed.
The script:
@@
expression inode, flags;
@@
- inode->i_state & flags
+ inode_state_read(inode) & flags
@@
expression inode, flags;
@@
- inode->i_state &= ~flags
+ inode_state_clear(inode, flags)
@@
expression inode, flag1, flag2;
@@
- inode->i_state &= ~flag1 & ~flag2
+ inode_state_clear(inode, flag1 | flag2)
@@
expression inode, flags;
@@
- inode->i_state |= flags
+ inode_state_set(inode, flags)
@@
expression inode, flags;
@@
- inode->i_state = flags
+ inode_state_assign(inode, flags)
@@
expression inode, flags;
@@
- flags = inode->i_state
+ flags = inode_state_read(inode)
@@
expression inode, flags;
@@
- READ_ONCE(inode->i_state) & flags
+ inode_state_read(inode) & flags
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Pull 9p updates from Dominique Martinet:
"A bunch of unrelated fixes:
- polling fix for trans fd that ought to have been fixed otherwise
back in March, but apparently came back somewhere else...
- USB transport buffer overflow fix
- Some dentry lifetime rework to handle metadata update for currently
opened files in uncached mode, or inode type change in cached mode
- a double-put on invalid flush found by syzbot
- and finally /sys/fs/9p/caches not advancing buffer and overwriting
itself for large contents
Thanks to everyone involved!"
* tag '9p-for-6.18-rc1' of https://github.com/martinetd/linux:
9p: sysfs_init: don't hardcode error to ENOMEM
9p: fix /sys/fs/9p/caches overwriting itself
9p: clean up comment typos
9p/trans_fd: p9_fd_request: kick rx thread if EPOLLIN
net/9p: fix double req put in p9_fd_cancelled
net/9p: Fix buffer overflow in USB transport layer
fs/9p: Add p9_debug(VFS) in d_revalidate
fs/9p: Invalidate dentry if inode type change detected in cached mode
fs/9p: Refresh metadata in d_revalidate for uncached mode too
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull finish_no_open updates from Al Viro:
"finish_no_open calling conventions change to simplify callers"
* tag 'pull-finish_no_open' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
slightly simplify nfs_atomic_open()
simplify gfs2_atomic_open()
simplify fuse_atomic_open()
simplify nfs_atomic_open_v23()
simplify vboxsf_dir_atomic_open()
simplify cifs_atomic_open()
9p: simplify v9fs_vfs_atomic_open_dotl()
9p: simplify v9fs_vfs_atomic_open()
allow finish_no_open(file, ERR_PTR(-E...))
|
|
v9fs_sysfs_init() always returned -ENOMEM on failure;
return the actual sysfs_create_group() error instead.
Signed-off-by: Randall P. Embry <rpembry@gmail.com>
Message-ID: <20250926-v9fs_misc-v1-3-a8b3907fc04d@codewreck.org>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
caches_show() overwrote its buffer on each iteration,
so only the last cache tag was visible in sysfs output.
Properly append with snprintf(buf + count, …).
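The append pattern can be sketched in plain userspace C as follows. This is
an illustration under assumptions, not the kernel's caches_show(): the
function sketch_join_tags() and its parameters are hypothetical, and the
truncation handling is one possible choice.
```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical userspace sketch: write each tag followed by a newline
 * into buf. Every snprintf() starts at buf + count rather than at buf,
 * so earlier entries are kept instead of being overwritten.
 * Returns the number of bytes stored, excluding the trailing NUL.
 */
static size_t sketch_join_tags(char *buf, size_t limit,
			       const char *const *tags, size_t ntags)
{
	size_t count = 0;
	size_t i;

	assert(buf != NULL && limit > 0);
	buf[0] = '\0';

	for (i = 0; i < ntags; i++) {
		int n;

		if (count >= limit - 1)
			break;	/* no room left for another entry */

		n = snprintf(buf + count, limit - count, "%s\n", tags[i]);
		if (n < 0)
			break;	/* output error */

		if ((size_t)n >= limit - count) {
			/* Entry was truncated: the buffer is now full. */
			count = limit - 1;
			break;
		}
		count += (size_t)n;
	}

	return count;
}
```
Passing buf instead of buf + count to snprintf() in the loop would reproduce
the behaviour described above, where only the last tag survives.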
Signed-off-by: Randall P. Embry <rpembry@gmail.com>
Message-ID: <20250926-v9fs_misc-v1-2-a8b3907fc04d@codewreck.org>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
Fix a few minor typos in comments (e.g. "trasnport" → "transport").
Signed-off-by: Randall P. Embry <rpembry@gmail.com>
Message-ID: <20250926-v9fs_misc-v1-1-a8b3907fc04d@codewreck.org>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
again, preexisting aliases will always be positive
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
if v9fs_vfs_lookup() returns a preexisting alias, it is guaranteed to be
positive. IOW, in that case we will immediately return finish_no_open(),
leaving only the case res == NULL past that point.
Reviewed-by: NeilBrown <neil@brown.name>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
generic_delete_inode() is rather misleading for what the routine is
doing. inode_just_drop() should be much clearer.
The new naming is inconsistent with generic_drop_inode(), so rename that
one as well with inode_ as the prefix.
No functional changes.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
This was a useful debugging / validation aid, and can explain why a
GETATTR request is made.
Signed-off-by: Tingmao Wang <m@maowtm.org>
Message-ID: <00829a99549e33d26139fa4d756c466629f13e00.1743956147.git.m@maowtm.org>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
This is an extension of the last commit to cached mode as well. While
server-side changes are not expected when using cached mode, when they do
happen we can get things like -EOPNOTSUPP. With this change, at least when
we realize the inode has been updated (for example after a `touch` on the
client), we can get a new one.
Signed-off-by: Tingmao Wang <m@maowtm.org>
Message-ID: <01afd3c77d5cda181780dc931baa8f3fc54562c8.1743956147.git.m@maowtm.org>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
|
|
Currently if another process keeps a file open, due to existing dentry in
the dcache, other processes will not see updated metadata of that file if
it is changed on the server, even in uncached mode.
This can also manifest as -ENODATA when reading a file that has shrunk on
the server (even if it's re-opened in another process), or -ENOTSUPP if
the file has changed type (e.g. regular file to directory) on the server.
We can end up in a situation where `readdir` or `read` fails until
the file is closed by all processes using it.
This commit fixes that, and invalidates the dentry altogether if the inode
type is changed (fo |