aboutsummaryrefslogtreecommitdiff
path: root/fs/proc
AgeCommit message (Collapse)AuthorFilesLines
5 hoursMerge tag 'mm-nonmm-stable-2026-06-21-10-22' of ↵Linus Torvalds2-26/+41
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - "taskstats: fix TGID dead-thread stat retention" (Yiyang Chen) Fix a taskstats TGID aggregation bug where fields added in the TGID query path were not preserved after thread exit, and adds a kselftest covering the regression. - "lib/tests: string_helpers: Slight improvements" (Andy Shevchenko) Improve lib/tests/string_helpers_kunit.c a little - "lib/base64: decode fixes" (Josh Law) Address minor issues in lib/base64.c - "selftests/filelock: Make output more kselftestish" (Mark Brown) Make the output from the ofdlocks test a bit easier for tooling to work with. Also ignore the generated file - "uaccess: unify inline vs outline copy_{from,to}_user() selection" (Yury Norov) Simplify the usercopy code by removing the selectability of inlining copy_{from,to}_user(). - "ocfs2: validate inline xattr header consumers" (ZhengYuan Huang) Fix a number of possible issues in the ocfs2 xattr code - "lib and lib/cmdline enhancements" (Dmitry Antipov) Provide additional robustness checking in the cmdline handling code and its in-kernel testing and selftests - "cleanup the RAID6 P/Q library" (Christoph Hellwig) Clean up the RAID6 P/Q library to match the recent updates to the RAID 5 XOR library and other CRC/crypto libraries - "ocfs2: harden inode validators against forged metadata" (Michael Bommarito) Add three structural checks to OCFS2 dinode validation so malformed on-disk fields are rejected before ocfs2_populate_inode() copies them into the in-core inode - "lib/raid: replace __get_free_pages() call with kmalloc()" (Mike Rapoport) Clean up the lib/raid code by using kmalloc() in more places * tag 'mm-nonmm-stable-2026-06-21-10-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (108 commits) ocfs2: fix circular locking dependency in ocfs2_dio_end_io_write ocfs2: fix NULL h_transaction deref in ocfs2_assure_trans_credits lib: interval_tree_test: validate benchmark parameters ocfs2: avoid moving extents to occupied clusters treewide: fix transposed "sign" typos and update spelling.txt ocfs2: fix UBSAN array-index-out-of-bounds in ocfs2_sum_rightmost_rec fat: reject BPB volumes whose data area starts beyond total sectors selftests/uevent: increase __UEVENT_BUFFER_SIZE to avoid ENOBUFS on busy systems lib/test_firmware: allocate the configured into_buf size fs: efs: remove unneeded debug prints checkpatch: cuppress warnings when Reported-by: is followed by Link: MAINTAINERS: add Alexander as a kcov reviewer mailmap: update Alexander Sverdlin's Email addresses fs: fat: inode: replace sprintf() with scnprintf() ocfs2: fix out-of-bounds write in ocfs2_remove_refcount_extent ocfs2: fix race between ocfs2_control_install_private() and ocfs2_control_release() ocfs2/dlm: require a ref for locking_state debugfs open ocfs2: reject FITRIM ranges shorter than a cluster ocfs2: validate fast symlink target during inode read ocfs2: add journal NULL check in ocfs2_checkpoint_inode() ...
2 daysMerge tag 'mm-stable-2026-06-18-09-26' of ↵Linus Torvalds1-46/+223
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "selftests/mm: clean up build output and verbosity" (Li Wang) Remove some noise from the MM selftests build - "mm: Free contiguous order-0 pages efficiently" (Ryan Roberts) Speed up the freeing of a batch of 0-order pages by first scanning them for coalescing opportunities. This is applicable to vfree() and to the releasing of frozen pages - "mm/damon: introduce DAMOS failed region quota charge ratio" (SeongJae Park) Address a DAMOS usability issue: The DAMOS quota often exhausts prematurely because it charges for all memory attempted, causing slow and inconsistent performance when actions fail on unreclaimable memory. To fix this, a new feature lets users set a smaller, flexible quota charge ratio (via a numerator and denominator) for failed regions. Since failed actions cause less overhead, reducing their quota cost ensures more predictable and efficient DAMOS processing - "selftests/cgroup: improve zswap tests robustness and support large page sizes" (Li Wang) Fix various spurious failures and improves the overall robustness of the cgroup zswap selftests - "fix MAP_DROPPABLE not supported errno" (Anthony Yznaga) Fix an issue in the mlock selftests on arm32 - "mm: huge_memory: clean up defrag sysfs with shared" (Breno Leitao) Some maintenance work in the huge_memory code - "treewide: fixup gfp_t printks" (Brendan Jackman) Use the special vprintf() gfp_t conversion in various places - "mm: Fix vmemmap optimization accounting and initialization" (Muchun Song) Fix several bugs in the vmemmap optimization, mainly around incorrect page accounting and memmap initialization in the DAX and memory hotplug paths. It also fixes pageblock migratetype initialization and struct page initialization for ZONE_DEVICE compound pages - "mm/damon: repost non-hotfix reviewed patches in damon/next tree" A sprinkle of unrelated minor bugfixes for DAMON - "mm: remove page_mapped()" (David Hildenbrand) Remove this function from the tree, replacing it with folio_mapped() - "mm/damon: let DAMON be paused and resumed" (SeongJae Park) Allow DAMON to be paused and resumed without losing its current state - "kasan: hw_tags: Disable tagging for stack and page-tables" (Muhammad Usama Anjum) Simplify and speed up kasan by removing its ineffective tagging of stacks and page tables - "mm/damon/reclaim,lru_sort: monitor all system rams by default" (SeongJae Park) Simplify deployment on diverse hardware like NUMA systems by updating DAMON_RECLAIM and DAMON_LRU_SORT to automatically monitor the physical address range covering all System RAM areas by default, replacing the overly restrictive behavior that only targeted the single largest memory block to save on negligible overhead - "mm/damon/sysfs: document filters/ directory as deprecated" (SeongJae Park) Update some DAMON docs - "mm: use spinlock guards for zone lock" (Dmitry Ilvokhin) Switch zone->lock handling over to using the guard() mechanisms - "mm/filemap: tighten mmap_miss hit accounting" (fujunjie) Fix a flaw where the mmap_miss counter over-credited page cache hits during fault-arounds and page-fault retries. This results in significant reduction of redundant synchronous mmap readahead I/O, drastically cutting down execution time and gigabytes read for sparse random or strided memory access workloads - "selftests/cgroup: Fix false positive failures in test_percpu_basic" (Li Wang) Fix a couple of false-positives in the cgroup kmem selftests - "mm/damon/reclaim: support monitoring intervals auto-tuning" (SeongJae Park) Add a new parameter to DAMON permitting DAMON_RECLAIM to automatically tune DAMON's sampling and aggregation intervals - "mm/damon/stat: add kdamond_pid parameter" (SeongJae Park) Change DAMON_STAT to provide the pid of its kdamond - "mm/kmemleak: dedupe verbose scan output" (Breno Leitao) Remove large amounts of duplicated backtraces from the verbose-mode kmemleak output - "mm: remove CONFIG_HAVE_BOOTMEM_INFO_NODE (Part 1)" (David Hildenbrand) Reduce our use of CONFIG_HAVE_BOOTMEM_INFO_NODE, with a view to removing it entirely in a later series - "mm/damon: validate min_region_size to be power of 2" (Liew Rui Yan) Prevent users from passing a non-power-of-2 value of `addr_unit', as this later results in undesirable behavior - "mm: document read_pages and simplify usage" (Frederick Mayle) - "tools/mm/page-types: Fix misc bugs" (Ye Liu) Fix three issues in tools/mm/page-types.c - "mm: misc cleanups from __GFP_UNMAPPED series" (Brendan Jackman) Implement several cleanups in the page allocator and related code - "mm, swap: swap table phase IV: unify allocation" (Kairui Song) Unify the allocation and charging of anon and shmem swap in folios, provides better synchronization, consolidates the metadata management, hence dropping the static array and map, and improves performance - "mm/damon: introduce data attributes monitoring" (SeongJae Park( Extend DAMON to monitor general data attributes other than accesses - "mm/vmalloc: free unused pages on vrealloc() shrink" (Shivam Kalra) Implement the TODO in vrealloc() to unmap and free unused pages when shrinking across a page boundary - "mm/damon: documentation and comment fixes" (niecheng) - "remove mmap_action success, error hooks" (Lorenzo Stoakes) Eliminate custom hooks from mmap_action by removing the problematic success_hook which allowed drivers to improperly access uninitialized VMAs. It replaces the error_hook with a simple error-code field and updates the memory char driver accordingly - "mm/damon: minor improvements for code readability and tests" (SeongJae Park) - "mm/damon: fix macro arguments and clarify quota goals doc" (Maksym Shcherba) - "userfaultfd: merge fs/userfaultfd.c into mm/userfaultfd.c" (Mike Rapoport) - "mm/mglru: improve reclaim loop and dirty folio" (Kairui Song and others) Clean up and slightly improves MGLRU's reclaim loop and dirty writeback handling. Large performance improvements are measured - "use vma locks for proc/pid/{smaps|numa_maps} reads" (Suren Baghdasaryan) Use per-vma locks when reading /proc/pid/smaps and numa_maps similar to reduce contention on central mmap_lock - "refactors thpsize_shmem_enabled_store() and thpsize_shmem_enabled_show()" (Ran Xiaokai) Some cleanup work in the THP code - "selftests/memfd: fix compilation warnings" (Konstantin Khorenko) Fix a few build glitches in the memfd selftest code. - "memcg: shrink obj_stock_pcp and cache multiple objcgs" (Shakeel Butt) Resolve a 68% performance regression caused by NUMA-node cache thrashing around struct obj_stock_pcp by shrinking its existing fields and expanding it into a multi-slot array that caches up to five obj_cgroup pointers per CPU, allowing per-node variants of the same memcg to coexist within a single 64-byte cache line. - "zram: writeback fixes" (Sergey Senozhatsky) address a couple of unrelated zram writeback issues - "mm: switch THP shrinker to list_lru" (Johannes Weiner) Resolve NUMA-awareness issues and streamlines callsite interaction by refactoring and extending the list_lru API to completely replace the complex, open-coded deferred split queue for Transparent Huge Pages - "mm: improve large folio readahead for exec memory" (Usama Arif) Improve large-folio readahead on systems like 64K-page arm64 by preventing the mmap_miss check from permanently disabling target-oriented VM_EXEC readahead, and by generalizing the force_thp_readahead gate to support mappings with any usefully large maximum folio order under the cache cap. - "userfaultfd/pagemap: pre-existing fixes" (Kiryl Shutsemau) Fix a bunch of minor issues in the userfaultfd/pagemap, all of which were flagged by Sashiko review of proposed new material - "mm/sparse-vmemmap: Provide generic vmemmap_set_pmd() and vmemmap_check_pmd()" (Muchun Song) Provide generic versions of these two functions so the four arch-specific implementations can be removed. - "mm/swap, PM: hibernate: fix swapoff race in uswsusp by pinning swap device" (Youngjun Park) Address a uswsusp-vs-swapoff race and reduces the swap device reference taking/releasing frequency. - "mm/hmm: A fix and a selftest" (Dev Jain) * tag 'mm-stable-2026-06-18-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits) selftests/mm/hmm-tests: test pagemap reads of PMD device-private entries fs/proc/task_mmu: do not warn on seeing non-migration pmd entry lib/test_hmm: check alloc_page_vma() return value and handle OOM mm/compaction: cap compact_gap() at COMPACT_CLUSTER_MAX mm/swap: remove redundant swap device reference in alloc/free mm/swap, PM: hibernate: fix swapoff race in uswsusp by pinning swap device mm/filemap: use folio_next_index() for start vmalloc: fix NULL pointer dereference in is_vm_area_hugepages() sparc/mm: drop vmemmap_check_pmd helper and use generic code loongarch/mm: drop vmemmap_check_pmd helper and use generic code riscv/mm: drop vmemmap_pmd helpers and use generic code arm64/mm: drop vmemmap_pmd helpers and use generic code mm/sparse-vmemmap: provide generic vmemmap_set_pmd() and vmemmap_check_pmd() rust: page: mark Page::nid as inline userfaultfd: build __VMA_UFFD_FLAGS from config-gated masks userfaultfd: gate must_wait writability check on pte_present() mm/huge_memory: preserve pmd_swp_uffd_wp on device-private PMD downgrade fs/proc/task_mmu: fix hugetlb self-deadlock in pagemap_scan_pte_hole() fs/proc/task_mmu: use huge_page_size() in pagemap_scan_hugetlb_entry() fs/proc/task_mmu: fix make_uffd_wp_huge_pte() prot-update race ...
7 daysMerge tag 'timers-nohz-2026-06-13' of ↵Linus Torvalds2-42/+6
gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip Pull NOHZ updates from Thomas Gleixner: - Fix a long standing TOCTOU in get_cpu_sleep_time_us() - Make the CPU offline NOHZ handling more robust by disabling NOHZ on the outgoing CPU early instead of creating unneeded state which needs to be undone. - Unify idle CPU time accounting instead of having two different accounting mechanisms. These two different mechanisms are not really independent, but the different properties can in the worst case cause that gloabl idle time can be observed going backwards. - Consolidate the idle/iowait time retrieval interfaces instead of converting back and forth between them. - Make idle interrupt time accounting more robust. The original code assumes that interrupt time accouting is enabled and therefore stops elapsing idle time while an interrupt is handled in NOHZ dyntick state. That assumption is not correct as interrupt time accounting can be disabled at compile and runtime. - Fix an accounting error between dyntick idle time and dyntick idle steal time. The stolen time is not accounted and therefore idle time becomes inaccurate. The stolen time is now accounted after the fact as there is no way to predict the steal time upfront. * tag 'timers-nohz-2026-06-13' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: sched/cputime: Handle dyntick-idle steal time correctly sched/cputime: Handle idle irqtime gracefully sched/cputime: Provide get_cpu_[idle|iowait]_time_us() off-case tick/sched: Consolidate idle time fetching APIs tick/sched: Account tickless idle cputime only when tick is stopped tick/sched: Remove unused fields tick/sched: Move dyntick-idle cputime accounting to cputime code tick/sched: Remove nohz disabled special case in cputime fetch tick/sched: Unify idle cputime accounting s390/time: Prepare to stop elapsing in dynticks-idle powerpc/time: Prepare to stop elapsing in dynticks-idle sched/cputime: Correctly support generic vtime idle time sched/cputime: Remove superfluous and error prone kcpustat_field() parameter sched/idle: Handle offlining first in idle loop tick/sched: Fix TOCTOU in nohz idle time fetch
7 daysMerge tag 'irq-core-2026-06-13' of ↵Linus Torvalds2-5/+3
gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip Pull interrupt core updates from Thomas Gleixner: - Rework of /proc/interrupt handling: /proc/interrupts was subject to micro optimizations for a long time, but most of the low hanging fruit was left on the table. This rework addresses the major time consuming issues: - Printing a long series of zeros one by one via a format string instead of counting subsequent zeros and emitting a string constant. - Simplify and cache the conditions whether interrupts should be printed - Use a proper iteration over the interrupt descriptor xarray instead of walking and testing one by one. - Provide helper functions for the architecture code to emit the architecture specific counters - Convert the counter structure in x86 to an array, which simplifies the output and add mechanisms to suppress unused architecture interrupts, which just occupy space for nothing. Adopt the new core mechanisms. This adjusts the gdb scripts related to interrupt counter statistics to work with the new mechanisms. - Prevent a string overflow in the /proc/irq/$N/ directory name creation code. * tag 'irq-core-2026-06-13' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: x86/irq: Add missing 's' back to thermal event printout genirq/proc: Speed up /proc/interrupts iteration genirq/proc: Runtime size the chip name genirq: Expose irq_find_desc_at_or_after() in core code genirq: Add rcuref count to struct irq_desc genirq/proc: Increase default interrupt number precision to four genirq: Calculate precision only when required genirq: Cache the condition for /proc/interrupts exposure genirq/manage: Make NMI cleanup RT safe genirq: Expose nr_irqs in core code scripts/gdb: Update x86 interrupts to the array based storage x86/irq: Move IOAPIC misrouted and PIC/APIC error counts into irq_stats x86/irq: Suppress unlikely interrupt stats by default x86/irq: Make irqstats array based genirq/proc: Utilize irq_desc::tot_count to avoid evaluation genirq/proc: Avoid formatting zero counts in /proc/interrupts x86/irq: Optimize interrupts decimals printing genirq/proc: Size interrupt directory names for 10-digit interrupt numbers
7 daysMerge tag 'pull-dcache' of ↵Linus Torvalds2-4/+2
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull dcache updates from Al Viro: - d_alloc_parallel() API change (Neil's with my changes) - NORCU fixes - Reorganization and simplification of dentry eviction logic - Simplifying rcu_read_lock() scopes in fs/dcache.c - Secondary roots work - getting rid of NFS fake root dentries and dealing with remaining shrink_dcache_for_umount() and shrink_dentry_list() races - making cursors NORCU (surprisingly easy) * tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (22 commits) make cursors NORCU nfs: get rid of fake root dentries wind ->s_roots via ->d_sib instead of ->d_hash shrink_dentry_tree(): unify the calls of shrink_dentry_list() shrinking rcu_read_lock() scope in d_alloc_parallel() d_walk(): shrink rcu_read_lock() scope document dentry_kill() adjust calling conventions of lock_for_kill(), fold __dentry_kill() into dentry_kill() Document rcu_read_lock() use in select_collect2() Shift rcu_read_{,un}lock() inside fast_dput() simplify safety for lock_for_kill() slowpath fold lock_for_kill() and __dentry_kill() into common helper fold lock_for_kill() into shrink_kill() shrink_dentry_list(): start with removing from shrink list d_prune_aliases(): make sure to skip NORCU aliases kill d_dispose_if_unused() make to_shrink_list() return whether it has moved dentry to list select_collect(): ignore dentries on shrink lists if they have positive refcounts find_acceptable_alias(): skip NORCU aliases with zero refcount fix a race between d_find_any_alias() and final dput() of NORCU dentries ...
7 daysMerge tag 'vfs-7.2-rc1.procfs' of ↵Linus Torvalds8-114/+138
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull procfs updates from Christian Brauner: - Revamp fs/filesystems.c The file was a mess with a hand-rolled linked list in desperate need of a cleanup. The filesystems list is now RCU-ified, /proc files can be marked permanent from outside fs/proc/, and the string emitted when reading /proc/filesystems is pre-generated and cached instead of pointer-chasing and printfing entry by entry on every read. The file is read frequently because libselinux reads it and is linked into numerous frequently used programs (even ones you would not suspect, like sed!). Scalability also improves since reference maintenance on open/close is bypassed. open+read+close cycle single-threaded (ops/s): before: 442732 after: 1063462 (+140%) open+read+close cycle with 20 processes (ops/s): before: 606177 after: 3300576 (+444%) A follow-up patch adds missing unlocks in some corner cases and tidies things up. - Relax the mount visibility check for subset=pid mounts When procfs is mounted with subset=pid, all static files become unavailable and only the dynamic pid information is accessible. In that case there is no point in imposing the full mount visibility restrictions on the mounter - everything that can be hidden in procfs is already inaccessible. These restrictions prevented procfs from being mounted inside rootless containers since almost all container implementations overmount parts of procfs to hide certain directories. As part of this /proc/self/net is only shown in subset=pid mounts for CAP_NET_ADMIN, reconfiguring subset=pid is rejected, the SB_I_USERNS_VISIBLE superblock flag is replaced with an FS_USERNS_MOUNT_RESTRICTED filesystem flag, fully visible mounts are recorded in a list, and the mount restrictions are finally documented. - Protect ptrace_may_access() with exec_update_lock in procfs Most uses of ptrace_may_access() in procfs should hold exec_update_lock to avoid TOCTOU issues with concurrent privileged execve() (like setuid binary execution). This fixes the easy cases - the owner and visibility checks and the FD link permission checks - with the gnarlier ones to follow later. * tag 'vfs-7.2-rc1.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: fix ups and tidy ups to /proc/filesystems caching proc: protect ptrace_may_access() with exec_update_lock (FD links) proc: protect ptrace_may_access() with exec_update_lock (part 1) docs: proc: add documentation about mount restrictions proc: handle subset=pid separately in userns visibility checks proc: prevent reconfiguring subset=pid proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN fs: cache the string generated by reading /proc/filesystems sysfs: remove trivial sysfs_get_tree() wrapper fs: RCU-ify filesystems list fs: move SB_I_USERNS_VISIBLE to FS_USERNS_MOUNT_RESTRICTED proc: allow to mark /proc files permanent outside of fs/proc/ namespace: record fully visible mounts in list
7 daysMerge tag 'vfs-7.2-rc1.misc' of ↵Linus Torvalds1-8/+8
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Features: - Reduce pipe->mutex contention by pre-allocating pages outside the lock in anon_pipe_write(). anon_pipe_write() called alloc_page() once per page while holding pipe->mutex. The allocation can sleep doing direct reclaim and runs memcg charging, which extends the critical section and stalls any concurrent reader on the same mutex. Now up to 8 pages are pre-allocated before the mutex is taken, leftovers are recycled into the per-pipe tmp_page[] cache before unlock, and any remainder is released after unlock, keeping the allocator out of the critical section on both sides. On a writers x readers sweep with 64KB writes against a 1 MB pipe throughput improves 6-28% and average write latency drops 5-22%; under memory pressure - when the cost of holding the mutex across reclaim is highest - throughput improves 21-48% and latency drops 17-33%. The microbenchmark is added to selftests. - uaccess/sockptr: fix the ignored_trailing logic in copy_struct_to_user() to behave as documented and the usize check in copy_struct_from_sockptr() for user pointers, and add copy_struct_{from,to}_bounce_buffer() and copy_struct_to_sockptr() helpers for upcoming users (IPPROTO_SMBDIRECT, IPPROTO_QUIC). - bpf: add a sleepable bpf_real_inode() kfunc that resolves the real inode backing a dentry via d_real_inode(). On overlayfs the inode attached to the dentry doesn't carry the underlying device information; this is used by the filesystem restriction BPF program that was merged into systemd. - docs: add guidelines for submitting new filesystems, motivated by the maintenance burden abandoned and untestable filesystems impose on VFS developers, blocking infrastructure work like folio conversions and iomap migration. Fixes: - libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo() and drop the now-redundant assignments in callers. This began as a one-line dma-buf fix for a path_noexec() warning; a pseudo filesystem has no reason not to set SB_I_NOEXEC. All init_pseudo() callers were audited: the only visible effect is on dma-buf where SB_I_NOEXEC silences the warning. - Handle set_blocksize() failures in legacy filesystems (bfs, hpfs, qnx4, jfs, befs, affs, isofs, minix, ntfs3, omfs). Mounting a device with a sector size > PAGE_SIZE crashed roughly half of them; the rest had the same missing error handling pattern. Plus a follow-up releasing the superblock buffer_head when setting the minix v3 block size fails. - mount: honour SB_NOUSER in the new mount API. - fs/fcntl: fix a SOFTIRQ-unsafe lock order in fasync signaling by switching the process-group paths of send_sigio() and send_sigurg() from read_lock(&tasklist_lock) to RCU, matching the single-PID path. - vfs: add an FS_USERNS_DELEGATABLE flag and set it for NFS, fixing delegated NFS mounts (fsopen() in a container with the mount performed by a privileged daemon) that broke when non-init s_user_ns was tied to FS_USERNS_MOUNT. - selftests/namespaces: fix a hang in nsid_test where an unreaped grandchild kept the TAP pipe write-end open, a waitpid(-1) race in listns_efault_test, and a false FAIL on kernels without listns() where the tests should SKIP. - filelock: fix the break_lease() stub signature for CONFIG_FILE_LOCKING=n. - init/initramfs_test: wait for the async initramfs unpacking before running; the test and do_populate_rootfs() share the parser state. - fs/coredump: reduce redundant log noise in validate_coredump_safety(). - iomap: pass the correct length to fserror_report_io() in __iomap_write_begin(). - backing-file: fix the backing_file_open() kerneldoc. Cleanups: - initramfs: refactor the cpio hex header parsing to use hex2bin() instead of the hand-rolled simple_strntoul() which is reverted, and extend the initramfs KUnit tests to cover header fields with 0x prefixes. - Replace __get_free_pages() and friends with kmalloc()/kzalloc() across quota, proc, ocfs2/dlm, nilfs2, nfs, nfsd, libfs, jfs, jbd2, isofs, fuse, select, namespace, configfs, binfmt_misc, bfs, and the do_mounts init code - part of the larger work of replacing page allocator calls with kmalloc(). - Use clear_and_wake_up_bit() in unlock_buffer() and journal_end_buffer_io_sync() instead of open-coding the sequence. - Drop unused VFS exports: unexport drop_super_exclusive(), remove start_removing_user_path_at(), and fold __start_removing_path() into start_removing_path(). - fs/read_write: narrow the __kernel_write() export with EXPORT_SYMBOL_FOR_MODULES(). - vfs: uapi: retire octal and hex constants in favor of (1 << n) for the O_ flags. Finding a free bit for a new flag across the architectures was needlessly hard with the mixed bases. - dcache: add extra sanity checks of dead dentries in dentry_free() via a new DENTRY_WARN_ONCE() that also prints d_flags. - iov_iter: use kmemdup_array() in dup_iter() to harden the allocation against multiplication overflow. - fs/pipe: write to ->poll_usage only once. - vfs: remove an always-taken if-branch in find_next_fd(). - dcache: use kmalloc_flex() for struct external_name in __d_alloc(). - namei: use QSTR() instead of QSTR_INIT() in path_pts(). - sync_file_range: delete dead S_ISLNK code. - Comment fixes: retire a stale comment in fget_task_next() and fix assorted spelling mistakes" * tag 'vfs-7.2-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (73 commits) backing-file: fix backing_file_open() kerneldoc parameter iomap: pass the correct len to fserror_report_io in __iomap_write_begin vfs: add FS_USERNS_DELEGATABLE flag and set it for NFS filelock: fix break_lease() stub signature for CONFIG_FILE_LOCKING=n vfs: uapi: retire octal and hex numbers in favor of (1 << n) for O_ flags bpf: add bpf_real_inode() kfunc fs/read_write: Do not export __kernel_write() to the entire world libfs: drop redundant SB_I_NOEXEC/SB_I_NODEV in init_pseudo() callers libfs: set SB_I_NOEXEC and SB_I_NODEV by default in init_pseudo() mount: honour SB_NOUSER in the new mount API fs/fcntl: fix SOFTIRQ-unsafe lock order in fasync signaling selftests/pipe: add pipe_bench microbenchmark fs/pipe: pre-allocate pages outside pipe->mutex in anon_pipe_write fs: retire stale comment in fget_task_next() fs: fix spelling mistakes in comment bfs: replace get_zeroed_page() with kzalloc() binfmt_misc: replace __get_free_page() with kmalloc() configfs: replace __get_free_pages() with kzalloc() fs/namespace: use __getname() to allocate mntpath buffer fs/select: replace __get_free_page() with kmalloc() ...
13 daysfs/proc/task_mmu: do not warn on seeing non-migration pmd entryDev Jain1-1/+0
Patch series "mm/hmm: A fix and a selftest", v3. Patch 1 fixes a stale warning present from the time when only migration softleaf entries were supported at the PMD level. Patch 2 adds some code into hmm-tests.c which exercises the pagemap path for PMD device-private entries. This patch (of 2): pagemap_pmd_range_thp() warns if a non-present PMD is not a migration entry. This became false once device-private entries at the PMD level were added. Therefore, remove the stale migration-only assertion. Link: https://lore.kernel.org/20260604055308.1947679-1-dev.jain@arm.com Link: https://lore.kernel.org/20260604055308.1947679-2-dev.jain@arm.com Fixes: a30b48bf1b24 ("mm/migrate_device: implement THP migration of zone device pages") Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Balbir Singh <balbirs@nvidia.com> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Tested-by: Lorenzo Stoakes <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Oscar Salvador (SUSE) <osalvador@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
13 daysfs/proc/task_mmu: fix hugetlb self-deadlock in pagemap_scan_pte_hole()Kiryl Shutsemau (Meta)1-1/+58
A PAGEMAP_SCAN ioctl requesting PM_SCAN_WP_MATCHING on a hugetlb VMA hangs the calling thread, unkillably, as soon as the scan reaches an unpopulated part of the range: do_pagemap_scan() walk_page_range() walk_hugetlb_range() hugetlb_vma_lock_read() # take the vma lock for read ... pagemap_scan_pte_hole() # ... ->pte_hole() for a hole uffd_wp_range() change_protection() hugetlb_change_protection() hugetlb_vma_lock_write() # ... and block taking it for write walk_hugetlb_range() holds the hugetlb vma lock for read across the whole walk. A present entry goes to ->hugetlb_entry(); an unpopulated one goes to ->pte_hole(), i.e. pagemap_scan_pte_hole(). To write-protect the hole that handler calls uffd_wp_range(), which on a hugetlb VMA reaches hugetlb_change_protection() and takes the same vma lock for write. The thread then blocks in down_write() waiting for the read lock it is itself holding. The populated path avoids this: pagemap_scan_hugetlb_entry() write-protects the entry inline under the page-table lock and never enters hugetlb_change_protection(). Do the same for holes. Fault in the page table and install the uffd-wp marker directly with make_uffd_wp_huge_pte() under the page-table lock, rather than routing through uffd_wp_range(). That is the same sequence hugetlb_change_protection() runs for an unpopulated entry, minus the vma write lock -- which is safe to skip because PMD sharing is disabled on uffd-wp VMAs (hugetlb_unshare_all_pmds() runs at registration), leaving nothing for that lock to serialise against. Link: https://lore.kernel.org/20260529172331.356655-4-kas@kernel.org Fixes: 52526ca7fdb9 ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs") Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reported-by: Sashiko AI review <sashiko-bot@kernel.org> Assisted-by: Claude:claude-opus-4-8 Cc: David Hildenbrand <david@kernel.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Balbir Singh <balbirs@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
13 daysfs/proc/task_mmu: use huge_page_size() in pagemap_scan_hugetlb_entry()Kiryl Shutsemau (Meta)1-1/+1
The partial-page check compares against HPAGE_SIZE (PMD_SIZE), which is wrong for gigantic hugetlb hstates (e.g. 1G). The walker hands the callback a huge_page_size()-sized range, never start + HPAGE_SIZE, so the comparison always declares it partial and aborts the WP. Compare against the actual hstate's page size. Link: https://lore.kernel.org/20260529172331.356655-3-kas@kernel.org Fixes: 52526ca7fdb9 ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs") Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reported-by: Sashiko AI review <sashiko-bot@kernel.org> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: David Hildenbrand <david@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Balbir Singh <balbirs@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
13 daysfs/proc/task_mmu: fix make_uffd_wp_huge_pte() prot-update raceKiryl Shutsemau (Meta)1-4/+8
Patch series "userfaultfd/pagemap: pre-existing fixes". These are pre-existing bug fixes that were carried at the front of the userfaultfd RWP working-set-tracking series up to v5 [1]. Per review feedback that fixes should not sit in the middle of a feature series, they are split out and sent on their own; the RWP series is reposted rebased on top of this. All six were flagged by the Sashiko AI review of the RWP series and carry Reported-by: Sashiko AI review <sashiko-bot@kernel.org>. They are independent of RWP, apply to mm-new directly, and carry Cc: stable@. 1: fs/proc/task_mmu: a missing huge_ptep_modify_prot_start() in make_uffd_wp_huge_pte() can lose hardware Dirty/Accessed updates when PAGEMAP_SCAN write-protects a hugetlb PTE. 2: fs/proc/task_mmu: pagemap_scan_hugetlb_entry() compares the range against HPAGE_SIZE rather than the hstate page size, so it never write-protects gigantic hugetlb pages. 3: fs/proc/task_mmu: PAGEMAP_SCAN with PM_SCAN_WP_MATCHING over an unpopulated hugetlb range self-deadlocks -- pagemap_scan_pte_hole() calls uffd_wp_range() while walk_hugetlb_range() holds the hugetlb vma lock for read, and hugetlb_change_protection() then takes it for write. Install the marker inline instead. 4: mm/huge_memory: change_non_present_huge_pmd() drops pmd_swp_uffd_wp on a device-private PMD permission downgrade, silently losing the uffd-wp marker. 5: userfaultfd: must_wait() applies pte_write() to a locklessly read PTE without checking pte_present(), so swap/migration entries decode random offset bits and a thread can stay parked on a stale fault. 6: userfaultfd: __VMA_UFFD_FLAGS feeds VMA_UFFD_MINOR_BIT (41) to mk_vma_flags() unconditionally, an out-of-bounds write into the single-word vma_flags_t on 32-bit. Build the mask from config-gated per-mode masks so an unavailable bit is never materialised. This patch (of 6): make_uffd_wp_huge_pte() arms the UFFD_WP bit on a present HugeTLB PTE by calling huge_ptep_modify_prot_commit() with a ptent snapshot that was fetched without the corresponding huge_ptep_modify_prot_start(). The start helper is what atomically clears the entry so the kernel-owned snapshot stays consistent until the commit; without it, the hardware may set Dirty or Accessed in the live PTE between the original read and the commit, and huge_ptep_modify_prot_commit() (whose generic implementation just calls set_huge_pte_at()) then writes the stale snapshot back over the live hardware bits, losing the update. The non-hugetlb sibling make_uffd_wp_pte() does this correctly via ptep_modify_prot_start() / ptep_modify_prot_commit(). Mirror that pattern for the present-PTE branch. The migration case stays as-is -- migration entries are non-present, so there's no hardware update to race against. Link: https://lore.kernel.org/20260529172331.356655-1-kas@kernel.org Link: https://lore.kernel.org/20260529172331.356655-2-kas@kernel.org Link: https://lore.kernel.org/all/20260526130509.2748441-1-kirill@shutemov.name/ [1] Fixes: 52526ca7fdb9 ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs") Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reported-by: Sashiko AI review <sashiko-bot@kernel.org> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: David Hildenbrand <david@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Balbir Singh <balbirs@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-05proc: protect ptrace_may_access() with exec_update_lock (FD links)Jann Horn3-89/+59
proc_pid_get_link() and proc_pid_readlink() currently look up the task from the pid once, then do the ptrace access check on that task, then look up the task from the pid a second time to do the actual access. That's racy in several ways. To fix it, pass the task to the ->proc_get_link() handler, and instead of proc_fd_access_allowed(), introduce a new helper call_proc_get_link() that looks up and locks the task, does the access check, and calls ->proc_get_link(). Fixes: 778c1144771f ("[PATCH] proc: Use sane permission checks on the /proc/<pid>/fd/ symlinks") Cc: stable@vger.kernel.org Signed-off-by: Jann Horn <jannh@google.com> Link: https://patch.msgid.link/20260518-procfs-lockfix-part1-v1-2-5c3d20e0ac33@google.com Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
2026-06-05proc: protect ptrace_may_access() with exec_update_lock (part 1)Jann Horn3-19/+40
Fix the easy cases where procfs currently calls ptrace_may_access() without exec_update_lock protection, where the fix is to simply add the extra lock or use mm_access(): - do_task_stat(): grab exec_update_lock - proc_pid_wchan(): grab exec_update_lock - proc_map_files_lookup(): use mm_access() instead of get_task_mm() - proc_map_files_readdir(): use mm_access() instead of get_task_mm() - proc_ns_get_link(): grab exec_update_lock - proc_ns_readlink(): grab exec_update_lock Fixes: f83ce3e6b02d ("proc: avoid information leaks to non-privileged processes") Cc: stable@vger.kernel.org Signed-off-by: Jann Horn <jannh@google.com> Link: https://patch.msgid.link/20260518-procfs-lockfix-part1-v1-1-5c3d20e0ac33@google.com Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
2026-06-05VFS: use wait_var_event for waiting in d_alloc_parallel()NeilBrown2-4/+2
Parallel lookup starts with a call of d_alloc_parallel(). That primitive either returns a matching hashed dentry or allocates a new one in the in-lookup state and returns it to the caller. Once the caller is done with lookup, it indicates so either by call of d_{splice_alias,add}() or by call of d_done_lookup(); at that point dentry leaves the in-lookup state. If d_alloc_parallel() finds a matching in-lookup dentry, it must wait for that dentry to leave the in-lookup state, one way or another. Currently by supplying wait_queue_head to d_alloc_parallel(). If d_alloc_parallel() creates a new in-lookup dentry, the address of that wait_queue_head is stored in ->d_wait of new dentry and stays there while it's in the in-lookup; subsequent d_alloc_parallel() will wait on the queue found in the matching in-lookup dentry. Transition out of in-lookup state wakes waiters on that queue (if any). That works, but the calling conventions are inconvenient - the caller must supply wait_queue_head and make sure that it survives at least until the new in-lookup dentry leaves the in-lookup state. That amounts to boilerplate in the d_alloc_parallel() callers that are followed by a call of d_lookup_done() in the same function; in cases like nfs asynchronous unlink it gets worse than that. This patch changes d_alloc_parallel() to use wake_up_var_locked() to wake up waiters, and wait_var_event_spinlock() to wait. dentry->d_lock is used for synchronisation as it is already held and the relevant times. That eliminates the need of caller-supplied wait_queue_head, simplifying the calling conventions. Better yet, we only need one bit of information stored in dentry itself: whether there are any waiters to be woken up, and that can be easily stored in ->d_flags; ->d_wait goes away. The reason we need that bit (DCACHE_LOOKUP_WAITERS) is that with wait_var machinery the queues are shared with all kinds of stuff and there's no way tell if any of the waiters have anything to do with our dentry; most of the time none of them will be relevant, so we need to avoid the pointless wakeups. Another benefit of the new scheme comes from the fact that wakeups have to be done outside of write-side critical areas of ->i_dir_seq; with the old scheme we need to carry the value picked from ->d_wait from __d_lookup_unhash() to the place where we actually wake the waiters up. Now we can just leave DCACHE_LOOKUP_WAITERS in ->d_flags until we get to doing wakeups - that's done within the same ->d_lock scope, so we are fine; new bit is accessed only under ->d_lock and it's seen only on dentries with DCACHE_PAR_LOOKUP in ->d_flags. __d_lookup_unhash() no longer needs to re-init ->d_lru. That was previously shared (in a union) with ->d_wait but ->d_wait is now gone so it no longer corrupts ->d_lru. Co-developed-by: Al Viro <viro@zeniv.linux.org.uk> # saner handling of flags Signed-off-by: NeilBrown <neil@brown.name> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2026-06-04fs/proc/task_mmu: read proc/pid/{smaps|numa_maps} under per-vma lockSuren Baghdasaryan1-39/+156
Patch series "use vma locks for proc/pid/{smaps|numa_maps} reads", v2. Use per-vma locks when reading /proc/pid/smaps and /proc/pid/numa_maps similar to /proc/pid/maps to reduce contention on central mmap_lock. One major difference between maps and smaps/numa_maps reading is that the latter executes page table walk which can't be done under RCU due to a possibility of sleeping. Therefore we drop RCU read lock before this walk while keeping the VMA locked. After the walk we retake RCU read lock, reset VMA iterator and proceed with the next VMA. The last two patches extend /proc/pid/maps test to cover /proc/pid/smaps reading during concurrent address space modification. This patch (of 3): proc/pid/{smaps|numa_maps} can be read using the combination of RCU and VMA read locks, similar to proc/pid/maps. RCU is required to safely traverse the VMA tree and VMA lock stabilizes the VMA being processed and the pagetable walk. Link: https://lore.kernel.org/20260426062718.1238437-1-surenb@google.com Link: https://lore.kernel.org/20260426062718.1238437-2-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Liam R. Howlett <liam@infradead.org> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Shuah Khan <shuah@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-06-02tick/sched: Consolidate idle time fetching APIsFrederic Weisbecker2-42/+6
Fetching the idle cputime is available through a variety of accessors all over the place depending on the different accounting flavours and needs: - idle vtime generic accounting can be accessed by kcpustat_field(), kcpustat_cpu_fetch(), get_idle/iowait_time() and get_cpu_idle/iowait_time_us() - dynticks-idle accounting can only be accessed by get_idle/iowait_time() or get_cpu_idle/iowait_time_us() - CONFIG_NO_HZ_COMMON=n idle accounting can be accessed by kcpustat_field() kcpustat_cpu_fetch(), or get_idle/iowait_time() but not by get_cpu_idle/iowait_time_us() Moreover get_idle/iowait_time() relies on get_cpu_idle/iowait_time_us() with a non-sensical conversion to microseconds and back to nanoseconds on the way. Start consolidating the APIs with removing get_idle/iowait_time() and make kcpustat_field() and kcpustat_cpu_fetch() work for all cases. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Link: https://patch.msgid.link/20260508131647.43868-13-frederic@kernel.org
2026-05-28proc: use strnlen() for name validation in __proc_createThorsten Blum1-3/+7
Replace strlen(fn) with strnlen(fn, NAME_MAX + 1) when validating the final path component in __proc_create(). This preserves the existing name limit while bounding the length scan to one byte past the maximum name length. Handle empty names separately, and treat names longer than NAME_MAX as too long. Link: https://lore.kernel.org/20260421122648.56723-2-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Thorsten Blum <thorsten.blum@linux.dev> Cc: wangzijie <wangzijie1@honor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28proc: rewrite next_tgid()Alexey Dobriyan1-24/+32
* deduplicate "iter.tgid += 1" line, Right now it is done once inside next_tgid() itself and second time inside "for" loop. * deduplicate next_tgid() call itself with different loop style: auto it = make_tgid_iter(); while (next_tgid(&it)) { ... } gcc seems to inline it twice: $ ./scripts/bloat-o-meter ../vmlinux-000 ../obj/vmlinux add/remove: 0/1 grow/shrink: 1/0 up/down: 100/-245 (-145) Function old new delta proc_pid_readdir 531 631 +100 next_tgid 245 - -245 * make tgid_iter.pid_ns const it never changes during readdir anyway [akpm@linux-foundation.org: remove newline] Link: https://lore.kernel.org/20260422191745.435556-2-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-28proc: add tgid_iter.pid_ns memberAlexey Dobriyan1-6/+9
next_tgid() accepts pid namespace as an argument, but it never changes during readdir (which would be unthinkable thing to do anyway). Move it inside iterator type and hide from direct usage. Link: https://lore.kernel.org/20260422191745.435556-1-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-27proc: replace __get_free_page() with kmalloc()Mike Rapoport (Microsoft)1-8/+8
A few functions in fs/proc/base.c use __get_free_page() to allocate a temporary buffer. kmalloc() is a better API for such use and it also provides better scalability and more debugging possibilities. Replace use of __get_free_page() with kmalloc(). Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Link: https://patch.msgid.link/20260523-b4-fs-v1-2-275e36a83f0e@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
2026-05-26genirq/proc: Speed up /proc/interrupts iterationThomas Gleixner1-1/+3
Reading /proc/interrupts iterates over the interrupt number space one by one and looks up the descriptors one by one. That's just a waste of time. When CONFIG_GENERIC_IRQ_SHOW is enabled this can utilize the maple tree and cache the descriptor pointer efficiently for the sequence file operations. Implement a CONFIG_GENERIC_IRQ_SHOW specific version in the core code and leave the fs/proc/ variant for the legacy architectures which ignore generic code. This reduces the time wasted for looking up the next record significantly. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Dmitry Ilvokhin <d@ilvokhin.com> Link: https://patch.msgid.link/20260517194932.165280601@kernel.org
2026-05-26x86/irq: Move IOAPIC misrouted and PIC/APIC error counts into irq_statsThomas Gleixner1-4/+0
The special treatment of these counts is just adding extra code for no real value. The irq_stats mechanism allows to suppress output of counters, which should never happen by default and provides a mechanism to enable them for the rare case that they occur. Move the IOAPIC misrouted and the PIC/APIC error counts into irq_stats, mark them suppressed by default and update the sites which increment them. This changes the output format of 'ERR' and 'MIS' in case there are events to the regular per CPU display format and otherwise suppresses them completely. As a side effect this removes the arch_cpu_stat() mechanism from proc/stat which was only there to account for the error interrupts on x86 and missed to take the misrouted ones into account. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Radu Rendec <radu@rendec.net> Link: https://patch.msgid.link/20260517194931.361942103@kernel.org
2026-05-26exec_state: relocate dumpable informationChristian Brauner (Amutable)1-23/+16
The dumpable flag captured at execve() is consulted by __ptrace_may_access() and several /proc owner / visibility checks. It lives on mm_struct today, which exit_mm() clears from the task long before the task itself is reaped. exec_state is anchored to the execve() that established the current privilege domain. CLONE_VM siblings refcount-share the parent's exec_state via copy_exec_state(); non-CLONE_VM clones allocate a fresh exec_state inheriting the parent's dumpable mode and user_ns reference via task_exec_state_copy(). execve() allocates a fresh instance (via alloc_task_exec_state() in begin_new_exec()) and installs it under task_lock + exec_update_lock with task_exec_state_replace(). init_task uses a static instance. The dumpable mode now lives on task->exec_state->dumpable. task->mm->flags no longer carries dumpability; MMF_DUMPABLE_MASK is removed, but MMF_DUMPABLE_BITS is reserved so MMF_DUMP_FILTER_* bit positions remain stable for the /proc/<pid>/coredump_filter ABI. The task->user_dumpable cache bit and its assignment in exit_mm() are removed; readers go through get_dumpable