| Age | Commit message (Collapse) | Author | Files | Lines |
|
Reduce the scope of 'z_erofs_crypto[]' that is not used outside of
'decompressor_crypto.c'.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202512102025.4mWeBSsf-lkp@intel.com/
Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
%pe will print a symbolic error name (e.g,. -ENOMEM), opposed to the
raw errno (e.g,. -12) produced by PTR_ERR().
Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
Both nfs_local_do_read and nfs_local_do_write only return 0 at the
end, so switch them to returning void.
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
Handling -EAGAIN in nfs_local_doio() was introduced with commit
0978e5b85fc08 (nfs_do_local_{read,write} were made to have negative
checks for correspoding iter method) but commit e43e9a3a3d66
since eliminated the possibility for this -EAGAIN early return.
So remove nfs_local_doio()'s -EAGAIN handling that calls
nfs_localio_disable_client() -- while it should never happen from
nfs_do_local_{read,write} this particular -EAGAIN handling is now
"dead" and so it has become a liability.
Fixes: e43e9a3a3d66 ("nfs/localio: refactor iocb initialization")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
nfslocaliod_workqueue is a non-memreclaim workqueue (it isn't
initialized with WQ_MEM_RECLAIM), see commit b9f5dd57f4a5
("nfs/localio: use dedicated workqueues for filesystem read and
write").
Use nfslocaliod_workqueue for LOCALIO's SYNC work.
Also, set PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO in
nfs_local_fsync_work.
Fixes: b9f5dd57f4a5 ("nfs/localio: use dedicated workqueues for filesystem read and write")
Signed-off-by: Mike Snitzer <snitzer@hammerspace.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
LOCALIO is an NFS loopback mount optimization that avoids using the
network for READ, WRITE and COMMIT if the NFS client and server are
determined to be on the same system. But because LOCALIO is still
fundamentally "just NFS loopback mount" it is susceptible to recursion
deadlock via direct reclaim, e.g.: NFS LOCALIO down to XFS and then
back into NFS via nfs_writepages.
Fix LOCALIO's potential for direct reclaim deadlock by ensuring that
all its page cache allocations are done from GFP_NOFS context.
Thanks to Ben Coddington for pointing out commit ad22c7a043c2 ("xfs:
prevent stack overflows from page cache allocation").
Reported-by: John Cagle <john.cagle@hammerspace.com>
Tested-by: Allen Lu <allen.lu@hammerspace.com>
Suggested-by: Benjamin Coddington <bcodding@hammerspace.com>
Fixes: 70ba381e1a43 ("nfs: add LOCALIO support")
Signed-off-by: Mike Snitzer <snitzer@hammerspace.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
Remove the redundant 'force' parameter.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
The current code for handling short writes in localio just truncates the
I/O and then sets an error. While that is close to how the ordinary NFS
code behaves, it does mean there is a chance the data that got written
is lost because it isn't persisted.
To fix this, change localio so that the upper layers can direct the
behaviour to persist any unstable data by rewriting it, and then
continuing writing until an ENOSPC is hit.
Fixes: 70ba381e1a43 ("nfs: add LOCALIO support")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
|
|
[BUG]
After commit aa60fe12b4f4 ("btrfs: zlib: refactor S390x HW acceleration
buffer preparation"), we no longer release the folio of the page cache
of folio returned by btrfs_compress_filemap_get_folio() for S390
hardware acceleration path.
[CAUSE]
Before that commit, we call kumap_local() and folio_put() after handling
each folio.
Although the timing is not ideal (it release previous folio at the
beginning of the loop, and rely on some extra cleanup out of the loop),
it at least handles the folio release correctly.
Meanwhile the refactored code is easier to read, it lacks the call to
release the filemap folio.
[FIX]
Add the missing folio_put() for copy_data_into_buffer().
CC: linux-s390@vger.kernel.org # 6.18+
Fixes: aa60fe12b4f4 ("btrfs: zlib: refactor S390x HW acceleration buffer preparation")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
There is an internal report that over 1000 processes are
waiting at the io_schedule_timeout() of balance_dirty_pages(), causing
a system hang and trigger a kernel coredump.
The kernel is v6.4 kernel based, but the root problem still applies to
any upstream kernel before v6.18.
[CAUSE]
From Jan Kara for his wisdom on the dirty page balance behavior first.
This cgroup dirty limit was what was actually playing the role here
because the cgroup had only a small amount of memory and so the dirty
limit for it was something like 16MB.
Dirty throttling is responsible for enforcing that nobody can dirty
(significantly) more dirty memory than there's dirty limit. Thus when
a task is dirtying pages it periodically enters into balance_dirty_pages()
and we let it sleep there to slow down the dirtying.
When the system is over dirty limit already (either globally or within
a cgroup of the running task), we will not let the task exit from
balance_dirty_pages() until the number of dirty pages drops below the
limit.
So in this particular case, as I already mentioned, there was a cgroup
with relatively small amount of memory and as a result with dirty limit
set at 16MB. A task from that cgroup has dirtied about 28MB worth of
pages in btrfs btree inode and these were practically the only dirty
pages in that cgroup.
So that means the only way to reduce the dirty pages of that cgroup is
to writeback the dirty pages of btrfs btree inode, and only after that
those processes can exit balance_dirty_pages().
Now back to the btrfs part, btree_writepages() is responsible for
writing back dirty btree inode pages.
The problem here is, there is a btrfs internal threshold that if the
btree inode's dirty bytes are below the 32M threshold, it will not
do any writeback.
This behavior is to batch as much metadata as possible so we won't write
back those tree blocks and then later re-COW them again for another
modification.
This internal 32MiB is higher than the existing dirty page size (28MiB),
meaning no writeback will happen, causing a deadlock between btrfs and
cgroup:
- Btrfs doesn't want to write back btree inode until more dirty pages
- Cgroup/MM doesn't want more dirty pages for btrfs btree inode
Thus any process touching that btree inode is put into sleep until
the number of dirty pages is reduced.
Thanks Jan Kara a lot for the analysis of the root cause.
[ENHANCEMENT]
Since kernel commit b55102826d7d ("btrfs: set AS_KERNEL_FILE on the
btree_inode"), btrfs btree inode pages will only be charged to the root
cgroup which should have a much larger limit than btrfs' 32MiB
threshold.
So it should not affect newer kernels.
But for all current LTS kernels, they are all affected by this problem,
and backporting the whole AS_KERNEL_FILE may not be a good idea.
Even for newer kernels I still think it's a good idea to get
rid of the internal threshold at btree_writepages(), since for most cases
cgroup/MM has a better view of full system memory usage than btrfs' fixed
threshold.
For internal callers using btrfs_btree_balance_dirty() since that
function is already doing internal threshold check, we don't need to
bother them.
But for external callers of btree_writepages(), just respect their
requests and write back whatever they want, ignoring the internal
btrfs threshold to avoid such deadlock on btree inode dirty page
balancing.
CC: stable@vger.kernel.org
CC: Jan Kara <jack@suse.cz>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- protect reading super block vs setting block size externally (found
by syzbot)
- make sure no transaction is started in read-only mode even with some
rescue mount option combinations
- fix checksum calculation of backup super blocks when block-group-tree
is enabled
- more extensive mount-time checks of device items that could be left
after device replace and attempting degraded mount
- fix build warning with -Wmaybe-uninitialized on loongarch64-gcc 12
* tag 'for-6.19-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: add extra device item checks at mount
btrfs: fix missing fields in superblock backup with BLOCK_GROUP_TREE
btrfs: reject new transactions if the fs is fully read-only
btrfs: sync read disk super and set block size
btrfs: fix Wmaybe-uninitialized warning in replay_one_buffer()
|
|
The memalloc_nofs_save() and memalloc_nofs_restore() calls are
incorrectly paired in xfs_trans_roll.
Call path:
xfs_trans_alloc()
__xfs_trans_alloc()
// tp->t_pflags = memalloc_nofs_save();
xfs_trans_set_context()
...
xfs_defer_trans_roll()
xfs_trans_roll()
xfs_trans_dup()
// old_tp->t_pflags = 0;
xfs_trans_switch_context()
__xfs_trans_commit()
xfs_trans_free()
// memalloc_nofs_restore(tp->t_pflags);
xfs_trans_clear_context()
The code passes 0 to memalloc_nofs_restore() when committing the original
transaction, but memalloc_nofs_restore() should always receive the
flags returned from the paired memalloc_nofs_save() call.
Before commit 3f6d5e6a468d ("mm: introduce memalloc_flags_{save,restore}"),
calling memalloc_nofs_restore(0) would unset the PF_MEMALLOC_NOFS flag,
which could cause memory allocation deadlocks[1].
Fortunately, after that commit, memalloc_nofs_restore(0) does nothing,
so this issue is currently harmless.
Fixes: 756b1c343333 ("xfs: use current->journal_info for detecting transaction recursion")
Link: https://lore.kernel.org/linux-xfs/20251104131857.1587584-1-leo.lilong@huawei.com [1]
Signed-off-by: Wenwu Hou <hwenwur@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Zones in the beginning of the address space are typically mapped to
higer bandwidth tracks on HDDs than those at the end of the address
space. So, in stead of allocating zones "round robin" across the whole
address space, always allocate the zone with the lowest index.
This increases average write bandwidth for overwrite workloads
when less than the full capacity is being used. At ~50% utilization
this improves bandwidth for a random file overwrite benchmark
with 128MiB files and 256MiB zone capacity by 30%.
Running the same benchmark with small 2-8 MiB files at 67% capacity
shows no significant difference in performance. Due to heavy
fragmentation the whole zone range is in use, greatly limiting the
number of free zones with high bw.
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Large block support was merged upstream in 6.12 (Dec 2024) and metadata
directories was merged in 6.13 (Jan 2025). We've not received any
serious complaints about the ondisk formats of these two features in the
past year, so let's remove the experimental warnings.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Unwind the callback based programming model by querying the cached
zone information using blkdev_get_zone_info.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Any used block must have been written, this reject used blocks > write
pointer.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Currently xfs_zone_validate mixes validating the software zone state in
the XFS realtime group with validating the hardware state reported in
struct blk_zone and deriving the write pointer from that.
Move all code that works on the realtime group to xfs_init_zone, and only
keep the hardware state validation in xfs_zone_validate. This makes the
code more clear, and allows for better reuse in userspace.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Move the two methods to query the write pointer out of xfs_init_zone into
the callers, so that xfs_init_zone doesn't have to bother with the
blk_zone structure and instead operates purely at the XFS realtime group
level.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Add a helper to figure the on-disk size of a group, accounting for the
XFS_SB_FEAT_INCOMPAT_ZONE_GAPS feature if needed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Add the missing forward declaration for struct blk_zone in xfs_zones.h.
This avoids headaches with the order of header file inclusion to avoid
compilation errors.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
The calling convention of xfs_attr_leaf_hasname() is problematic, because
it returns a NULL buffer when xfs_attr3_leaf_read fails, a valid buffer
when xfs_attr3_leaf_lookup_int returns -ENOATTR or -EEXIST, and a
non-NULL buffer pointer for an already released buffer when
xfs_attr3_leaf_lookup_int fails with other error values.
Fix this by simply open coding xfs_attr_leaf_hasname in the callers, so
that the buffer release code is done by each caller of
xfs_attr3_leaf_read.
Cc: stable@vger.kernel.org # v5.19+
Fixes: 07120f1abdff ("xfs: Add xfs_has_attr and subroutines")
Reported-by: Mark Tinguely <mark.tinguely@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
I learned a few things this year: first, blk_status_to_errno can return
ENODATA for critical media errors; and second, the scrub code doesn't
mark data structures as corrupt on ENODATA or EIO.
Currently, scrub failing to capture these errors isn't all that
impactful -- the checking code will exit to userspace with EIO/ENODATA,
and xfs_scrub will log a complaint and exit with nonzero status. Most
people treat fsck tools failing as a sign that the fs is corrupt, but
online fsck should mark the metadata bad and keep moving.
Cc: stable@vger.kernel.org # v4.15
Fixes: 4700d22980d459 ("xfs: create helpers to record and deal with scrub problems")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
The double buffering where just one scratch area is used at a time does
not efficiently use the available memory. It was originally implemented
when GC I/O could happen out of order, but that was removed before
upstream submission to avoid fragmentation. Now that all GC I/Os are
processed in order, just use a number of buffers as a simple ring buffer.
For a synthetic benchmark that fills 256MiB HDD zones and punches out
holes to free half the space this leads to a decrease of GC time by
a little more than 25%.
Thanks to Hans Holmberg <hans.holmberg@wdc.com> for testing and
benchmarking.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Replace our somewhat fragile code to reuse the bio, which caused a
regression in the past with the block layer bio_reuse helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
The xfs.h header conflicts with the public xfs.h in xfsprogs, leading
to a spurious difference in all shared libxfs files that have to
include libxfs_priv.h in userspace. Directly include xfs_platform.h so
that we can add a header of the same name to xfsprogs and remove this
major annoyance for the shared code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Move the global defines from xfs.h to xfs_platform.h to prepare for
removing xfs.h.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Ensure we have all kernel headers included by the time we do our own
thing, just like the rest of the tree.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Rename xfs_linux.h to prepare for including including it directly
from source files including those shared with xfsprogs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Add a new xlog_write_space_advance that returns the current place in the
iclog that data is written to, and advances the various counters by the
amount taken from xlog_write_iovec, and also use it xlog_write_partial,
which open codes the counter adjustments, but misses the asserts.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
We need enough space for the length we copy into the iclog, not just
some space, so tighten up the check a bit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Various places check how much space is left in the current iclog,
add a helper for that.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
The xlog_write chain passes around the same seven variables that are
often passed by reference. Add a xlog_write_data structure to contain
them to improve code generation and readability.
This change increases the generated code size by about 140 bytes for my
x86_64 build, which is hopefully worth the much easier to follow code:
$ size fs/xfs/xfs_log.o*
text data bss dec hex filename
29300 1730 176 31206 79e6 fs/xfs/xfs_log.o
29160 1730 176 31066 795a fs/xfs/xfs_log.o.old
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
When xlog_write_partial splits a log region over multiple iclogs, it
has to include the continuation ophder in the length requested for the
new iclog. Currently is simply adds that to the request, which makes
the accounting of the used space below look slightly different from the
other users of iclog space that decrement it.
To prepare for more code sharing, add the ophdr size to the len variable
that tracks the number of bytes still are left in this xlog_write
operation before the calling xlog_write_get_more_iclog_space, and then
decrement it later when consuming that space.
This changes the value of len when xlog_write_get_more_iclog_space
returns an error, but as nothing looks at len in that case the
difference doesn't matter.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
The log_vec is a private type for the log/CIL code and should not be
exposed to anything else.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
This structure is now only used by the core logging and CIL code.
Also remove the unused typedef.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Export a higher level interface to format log items. The xlog_format_buf
structure is hidden inside xfs_log_cil.c and only accessed using two
helpers (and a wrapper build on top), hiding details of log iovecs from
the log items. This also allows simply using an index into lv_iovecp
instead of keeping a cursor vec.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
lv_bytes is mostly just use by the CIL code, but has crept into the
low-level log writing code to decide on a full or partial iclog
write. Ensure it is valid even for the special log writes that don't
go through the CIL by initializing it in xlog_write_one_vec.
Note that even without this fix, the checkpoint commits would never
trigger a partial iclog write, as they have no payload beyond the
opheader.
The unmount record on the other hand could in theory trigger a an
overflow of the iclog, but given that is has never been seen in
the wild this has probably been masked by the small size of it
and the fact that the unmount process does multiple log forces
before writing the unmount record and we thus usually operate on
an empty or almost empty iclog.
Fixes: 110dc24ad2ae ("xfs: log vector rounding leaks log space")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Add a wrapper for xlog_write for the two callers who need to build a
log_vec and add it to a single-entry chain instead of duplicating the
code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
Remove unused inode parameter from fat_cache_alloc().
Link: https://lkml.kernel.org/r/20251201214403.90604-2-lalitshankarch@gmail.com
Signed-off-by: Lalit Shankar Chowdhury <lalitshankarch@gmail.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Remove <linux/hex.h> from <linux/kernel.h> and update all users/callers of
hex.h interfaces to directly #include <linux/hex.h> as part of the process
of putting kernel.h on a diet.
Removing hex.h from kernel.h means that 36K C source files don't have to
pay the price of parsing hex.h for the roughly 120 C source files that
need it.
This change has been build-tested with allmodconfig on most ARCHes. Also,
all users/callers of <linux/hex.h> in the entire source tree have been
updated if needed (if not already #included).
Link: https://lkml.kernel.org/r/20251215005206.2362276-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Annotate flexible array members of 'struct ocfs2_local_alloc' and 'struct
ocfs2_inline_data' with '__counted_by_le()' attribute to improve array
bounds checking when CONFIG_UBSAN_BOUNDS is enabled, and prefer the
convenient 'memset()' over an explicit loop to simplify
'ocfs2_clear_local_alloc()'.
Link: https://lkml.kernel.org/r/20251021105518.119953-1-dmantipov@yandex.ru
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
syzbot constructed a corrupted image, which resulted in el->l_count from
the b-tree extent block being 0. Since the length of the l_recs array
depends on l_count, reading its member e_blkno triggered the out-of-bounds
access reported by syzbot in [1].
The loop terminates when l_count is 0, similar to when next_free is 0.
[1]
UBSAN: array-index-out-of-bounds in fs/ocfs2/alloc.c:1838:11
index 0 is out of range for type 'struct ocfs2_extent_rec[] __counted_by(l_count)' (aka 'struct ocfs2_extent_rec[]')
Call Trace:
__ocfs2_find_path+0x606/0xa40 fs/ocfs2/alloc.c:1838
ocfs2_find_leaf+0xab/0x1c0 fs/ocfs2/alloc.c:1946
ocfs2_get_clusters_nocache+0x172/0xc60 fs/ocfs2/extent_map.c:418
ocfs2_get_clusters+0x505/0xa70 fs/ocfs2/extent_map.c:631
ocfs2_extent_map_get_blocks+0x202/0x6a0 fs/ocfs2/extent_map.c:678
ocfs2_read_virt_blocks+0x286/0x930 fs/ocfs2/extent_map.c:1001
ocfs2_read_dir_block fs/ocfs2/dir.c:521 [inline]
ocfs2_find_entry_el fs/ocfs2/dir.c:728 [inline]
ocfs2_find_entry+0x3e4/0x2090 fs/ocfs2/dir.c:1120
ocfs2_find_files_on_disk+0xdf/0x310 fs/ocfs2/dir.c:2023
ocfs2_lookup_ino_from_name+0x52/0x100 fs/ocfs2/dir.c:2045
_ocfs2_get_system_file_inode fs/ocfs2/sysfile.c:136 [inline]
ocfs2_get_system_file_inode+0x326/0x770 fs/ocfs2/sysfile.c:112
ocfs2_init_global_system_inodes+0x319/0x660 fs/ocfs2/super.c:461
ocfs2_initialize_super fs/ocfs2/super.c:2196 [inline]
ocfs2_fill_super+0x4432/0x65b0 fs/ocfs2/super.c:993
get_tree_bdev_flags+0x40e/0x4d0 fs/super.c:1691
vfs_get_tree+0x92/0x2a0 fs/super.c:1751
fc_mount fs/namespace.c:1199 [inline]
Link: https://lkml.kernel.org/r/tencent_4D99464FA28D9225BE0DBA923F5DF6DD8C07@qq.com
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Reported-by: syzbot+151afab124dfbc5f15e6@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=151afab124dfbc5f15e6
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When the filesystem is being mounted, the kernel panics while the data
regarding slot map allocation to the local node, is being written to the
disk. This occurs because the value of slot map buffer head block number,
which should have been greater than or equal to `OCFS2_SUPER_BLOCK_BLKNO`
(evaluating to 2) is less than it, indicative of disk metadata corruption.
This triggers BUG_ON(bh->b_blocknr < OCFS2_SUPER_BLOCK_BLKNO) in
ocfs2_write_block(), causing the kernel to panic.
This is fixed by introducing function ocfs2_validate_slot_map_block() to
validate slot map blocks. It first checks if the buffer head passed to it
is up to date and valid, else it panics the kernel at that point itself.
Further, it contains an if condition block, which checks if
`bh->b_blocknr` is lesser than `OCFS2_SUPER_BLOCK_BLKNO`; if yes, then
ocfs2_error is called, which prints the error log, for debugging purposes,
and the return value of ocfs2_error() is returned. If the if condition is
false, value 0 is returned by ocfs2_validate_slot_map_block().
This function is used as validate function in calls to ocfs2_read_blocks()
in ocfs2_refresh_slot_info() and ocfs2_map_slot_buffers().
Link: https://lkml.kernel.org/r/20251215184600.13147-1-activprithvi@gmail.com
Signed-off-by: Prithvi Tambewagh <activprithvi@gmail.com>
Reported-by: syzbot+c818e5c4559444f88aa0@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c818e5c4559444f88aa0
Tested-by: <syzbot+c818e5c4559444f88aa0@syzkaller.appspotmail.com>
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
After introducing 2f26f58df041 ("ocfs2: annotate flexible array members
with __counted_by_le()"), syzbot has reported the following issue:
UBSAN: array-index-out-of-bounds in fs/ocfs2/xattr.c:1955:3
index 2 is out of range for type 'struct ocfs2_xattr_entry[]
__counted_by(xh_count)' (aka 'struct ocfs2_xattr_entry[]')
...
Call Trace:
<TASK>
dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120
ubsan_epilogue+0xa/0x40 lib/ubsan.c:233
__ubsan_handle_out_of_bounds+0xe9/0xf0 lib/ubsan.c:455
ocfs2_xa_remove_entry+0x36d/0x3e0 fs/ocfs2/xattr.c:1955
...
To address this issue, 'xh_entries[]' member removal should be performed
before actually changing 'xh_count', thus making sure that all array
accesses matches the boundary checks performed by UBSAN.
Link: https://lkml.kernel.org/r/20251211155949.774485-1-dmantipov@yandex.ru
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Reported-by: syzbot+cf96bc82a588a27346a8@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=cf96bc82a588a27346a8
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Deepanshu Kartikey <kartikey406@gmail.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When reading an inode from disk, ocfs2_validate_inode_block() performs
various sanity checks but does not validate the size of inline data. If
the filesystem is corrupted, an inode's i_size can exceed the actual
inline data capacity (id_count).
This causes ocfs2_dir_foreach_blk_id() to iterate beyond the inline data
buffer, triggering a use-after-free when accessing directory entries from
freed memory.
In the syzbot report:
- i_size was 1099511627576 bytes (~1TB)
- Actual inline data capacity (id_count) is typically <256 bytes
- A garbage rec_len (54648) caused ctx->pos to jump out of bounds
- This triggered a UAF in ocfs2_check_dir_entry()
Fix by adding a validation check in ocfs2_validate_inode_block() to ensure
inodes with inline data have i_size <= id_count. This catches the
corruption early during inode read and prevents all downstream code from
operating on invalid data.
Link: https://lkml.kernel.org/r/20251212052132.16750-1-kartikey406@gmail.com
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
Reported-by: syzbot+c897823f699449cc3eb4@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c897823f699449cc3eb4
Tested-by: syzbot+c897823f699449cc3eb4@syzkaller.appspotmail.com
Link: https://lore.kernel.org/all/20251211115231.3560028-1-kartikey406@gmail.com/T/ [v1]
Link: https://lore.kernel.org/all/20251212040400.6377-1-kartikey406@gmail.com/T/ [v2]
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Add validation in ocfs2_validate_inode_block() to check that if an inode
has OCFS2_HAS_REFCOUNT_FL set, it must also have a valid i_refcount_loc.
A corrupted filesystem image can have this inconsistent state, which later
triggers a BUG_ON in ocfs2_remove_refcount_tree() when the inode is being
wiped during unlink.
Catch this corruption early during inode validation to fail gracefully
instead of crashing the kernel.
Link: https://lkml.kernel.org/r/20251212055826.20929-1-kartikey406@gmail.com
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
Reported-by: syzbot+6d832e79d3efe1c46743@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=6d832e79d3efe1c46743
Tested-by: syzbot+6d832e79d3efe1c46743@syzkaller.appspotmail.com
Link: https://lore.kernel.org/all/20251208084407.3021466-1-kartikey406@gmail.com/T/ [v1]
Link: https://lore.kernel.org/all/20251212045646.9988-1-kartikey406@gmail.com/T/ [v2]
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
'struct configfs_item_operations' and 'configfs_group_operations' are not
modified in this driver.
Constifying these structures moves some data to a read-only section, so
increases overall security, especially when the structure holds some
function pointers.
On a x86_64, with allmodconfig, as an example:
Before:
======
text data bss dec hex filename
74011 19312 5280 98603 1812b fs/ocfs2/cluster/heartbeat.o
After:
=====
text data bss dec hex filename
74171 19152 5280 98603 1812b fs/ocfs2/cluster/heartbeat.o
Link: https://lkml.kernel.org/r/7c7c00ba328e5e514d8debee698154039e9640dd.1765708880.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
After ocfs2 gained the ability to reclaim suballocator free block group
(BGs), a suballocator block group may be released. This change causes the
xfstest case generic/426 to fail.
generic/426 expects return value -ENOENT or -ESTALE, but the current code
triggers -EROFS.
Call stack before ocfs2 gained the ability to reclaim bg:
ocfs2_fh_to_dentry //or ocfs2_fh_to_parent
ocfs2_get_dentry
+ ocfs2_test_inode_bit
| ocfs2_test_suballoc_bit
| + ocfs2_read_group_descriptor //Since ocfs2 never releases the bg,
| | //the bg block was always found.
| + *res = ocfs2_test_bit //unlink was called, and the bit is zero
|
+ if (!set) //because the above *res is 0
status = -ESTALE //the generic/426 expected return value
Current call stack that triggers -EROFS:
ocfs2_get_dentry
ocfs2_test_inode_bit
ocfs2_test_suballoc_bit
ocfs2_read_group_descriptor
+ if reading a released bg, validation fails and triggers -EROFS
How to fix:
Since the read BG is already released, we must avoid triggering -EROFS.
With this commit, we use ocfs2_read_hint_group_descriptor() to detect the
released BG block. This approach quietly handles this type of error and
returns -EINVAL, which triggers the caller's existing conversion path to
-ESTALE.
[dan.carpenter@linaro.org: fix uninitialized variable]
Link: https://lkml.kernel.org/r/dc37519fd2470909f8c65e26c5131b8b6dde2a5c.1766043917.git.dan.carpenter@linaro.org
Link: https://lkml.kernel.org/r/20251212074505.25962-3-heming.zhao@suse.com
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Su Yue <glass.su@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "ocfs2: give ocfs2 the ability to reclaim suballocator free
bg", v6.
This patch (of 2):
The current ocfs2 code can't reclaim suballocator block group space. In
some cases, this causes ocfs2 to hold onto a lot of space. For example,
when creating lots of small files, the space is held/managed by the
'//inode_alloc'. After the user deletes all the small files, the space
never returns to the '//global_bitmap'. This issue prevents ocfs2 from
providing the needed space even when there is enough free space in a small
ocfs2 volume.
This patch gives ocfs2 the ability to reclaim suballocator free space when
the block group is freed. For performance reasons, this patch keeps the
first suballocator block group active.
Link: https://lkml.kernel.org/r/20251212074505.25962-2-heming.zhao@suse.com
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Su Yue <glass.su@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|