| Age | Commit message (Collapse) | Author | Files | Lines |
|
When affinity changes cause an increase of the number of CPUs allowed for
tasks which are related to a MM, that might results in a situation where
the ownership mode can go back from per CPU mode to per task mode.
As affinity changes happen with runqueue lock held there is no way to do
the actual mode change and required fixup right there.
Add the infrastructure to defer it to a workqueue. The scheduled work can
race with a fork() or exit(). Whatever happens first takes care of it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172550.216484739@linutronix.de
|
|
... to avoid header recursion hell.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172550.152813625@linutronix.de
|
|
CIDs are either owned by tasks or by CPUs. The ownership mode depends on
the number of tasks related to a MM and the number of CPUs on which these
tasks are theoretically allowed to run on. Theoretically because that
number is the superset of CPU affinities of all tasks which only grows and
never shrinks.
Switching to per CPU mode happens when the user count becomes greater than
the maximum number of CIDs, which is calculated by:
opt_cids = min(mm_cid::nr_cpus_allowed, mm_cid::users);
max_cids = min(1.25 * opt_cids, nr_cpu_ids);
The +25% allowance is useful for tight CPU masks in scenarios where only a
few threads are created and destroyed to avoid frequent mode
switches. Though this allowance shrinks, the closer opt_cids becomes to
nr_cpu_ids, which is the (unfortunate) hard ABI limit.
At the point of switching to per CPU mode the new user is not yet visible
in the system, so the task which initiated the fork() runs the fixup
function: mm_cid_fixup_tasks_to_cpu() walks the thread list and either
transfers each tasks owned CID to the CPU the task runs on or drops it into
the CID pool if a task is not on a CPU at that point in time. Tasks which
schedule in before the task walk reaches them do the handover in
mm_cid_schedin(). When mm_cid_fixup_tasks_to_cpus() completes it's
guaranteed that no task related to that MM owns a CID anymore.
Switching back to task mode happens when the user count goes below the
threshold which was recorded on the per CPU mode switch:
pcpu_thrs = min(opt_cids - (opt_cids / 4), nr_cpu_ids / 2);
This threshold is updated when a affinity change increases the number of
allowed CPUs for the MM, which might cause a switch back to per task mode.
If the switch back was initiated by a exiting task, then that task runs the
fixup function. If it was initiated by a affinity change, then it's run
either in the deferred update function in context of a workqueue or by a
task which forks a new one or by a task which exits. Whatever happens
first. mm_cid_fixup_cpus_to_task() walks through the possible CPUs and
either transfers the CPU owned CIDs to a related task which runs on the CPU
or drops it into the pool. Tasks which schedule in on a CPU which the walk
did not cover yet do the handover themselves.
This transition from CPU to per task ownership happens in two phases:
1) mm:mm_cid.transit contains MM_CID_TRANSIT. This is OR'ed on the task
CID and denotes that the CID is only temporarily owned by the
task. When it schedules out the task drops the CID back into the
pool if this bit is set.
2) The initiating context walks the per CPU space and after completion
clears mm:mm_cid.transit. After that point the CIDs are strictly
task owned again.
This two phase transition is required to prevent CID space exhaustion
during the transition as a direct transfer of ownership would fail if
two tasks are scheduled in on the same CPU before the fixup freed per
CPU CIDs.
When mm_cid_fixup_cpus_to_tasks() completes it's guaranteed that no CID
related to that MM is owned by a CPU anymore.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172550.088189028@linutronix.de
|
|
The MM CID management has two fundamental requirements:
1) It has to guarantee that at no given point in time the same CID is
used by concurrent tasks in userspace.
2) The CID space must not exceed the number of possible CPUs in a
system. While most allocators (glibc, tcmalloc, jemalloc) do not
care about that, there seems to be at least some LTTng library
depending on it.
The CID space compaction itself is not a functional correctness
requirement, it is only a useful optimization mechanism to reduce the
memory foot print in unused user space pools.
The optimal CID space is:
min(nr_tasks, nr_cpus_allowed);
Where @nr_tasks is the number of actual user space threads associated to
the mm and @nr_cpus_allowed is the superset of all task affinities. It is
growth only as it would be insane to take a racy snapshot of all task
affinities when the affinity of one task changes just do redo it 2
milliseconds later when the next task changes it's affinity.
That means that as long as the number of tasks is lower or equal than the
number of CPUs allowed, each task owns a CID. If the number of tasks
exceeds the number of CPUs allowed it switches to per CPU mode, where the
CPUs own the CIDs and the tasks borrow them as long as they are scheduled
in.
For transition periods CIDs can go beyond the optimal space as long as they
don't go beyond the number of possible CPUs.
The current upstream implementation adds overhead into task migration to
keep the CID with the task. It also has to do the CID space consolidation
work from a task work in the exit to user space path. As that work is
assigned to a random task related to a MM this can inflict unwanted exit
latencies.
Implement the context switch parts of a strict ownership mechanism to
address this.
This removes most of the work from the task which schedules out. Only
during transitioning from per CPU to per task ownership it is required to
drop the CID when leaving the CPU to prevent CID space exhaustion. Other
than that scheduling out is just a single check and branch.
The task which schedules in has to check whether:
1) The ownership mode changed
2) The CID is within the optimal CID space
In stable situations this results in zero work. The only short disruption
is when ownership mode changes or when the associated CID is not in the
optimal CID space. The latter only happens when tasks exit and therefore
the optimal CID space shrinks.
That mechanism is strictly optimized for the common case where no change
happens. The only case where it actually causes a temporary one time spike
is on mode changes when and only when a lot of tasks related to a MM
schedule exactly at the same time and have eventually to compete on
allocating a CID from the bitmap.
In the sysbench test case which triggered the spinlock contention in the
initial CID code, __schedule() drops significantly in perf top on a 128
Core (256 threads) machine when running sysbench with 255 threads, which
fits into the task mode limit of 256 together with the parent thread:
Upstream rseq/perf branch +CID rework
0.42% 0.37% 0.32% [k] __schedule
Increasing the number of threads to 256, which puts the test process into
per CPU mode looks about the same.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172550.023984859@linutronix.de
|
|
The MM CID management has two fundamental requirements:
1) It has to guarantee that at no given point in time the same CID is
used by concurrent tasks in userspace.
2) The CID space must not exceed the number of possible CPUs in a
system. While most allocators (glibc, tcmalloc, jemalloc) do not care
about that, there seems to be at least librseq depending on it.
The CID space compaction itself is not a functional correctness
requirement, it is only a useful optimization mechanism to reduce the
memory foot print in unused user space pools.
The optimal CID space is:
min(nr_tasks, nr_cpus_allowed);
Where @nr_tasks is the number of actual user space threads associated to
the mm and @nr_cpus_allowed is the superset of all task affinities. It is
growth only as it would be insane to take a racy snapshot of all task
affinities when the affinity of one task changes just do redo it 2
milliseconds later when the next task changes its affinity.
That means that as long as the number of tasks is lower or equal than the
number of CPUs allowed, each task owns a CID. If the number of tasks
exceeds the number of CPUs allowed it switches to per CPU mode, where the
CPUs own the CIDs and the tasks borrow them as long as they are scheduled
in.
For transition periods CIDs can go beyond the optimal space as long as they
don't go beyond the number of possible CPUs.
The current upstream implementation adds overhead into task migration to
keep the CID with the task. It also has to do the CID space consolidation
work from a task work in the exit to user space path. As that work is
assigned to a random task related to a MM this can inflict unwanted exit
latencies.
This can be done differently by implementing a strict CID ownership
mechanism. Either the CIDs are owned by the tasks or by the CPUs. The
latter provides less locality when tasks are heavily migrating, but there
is no justification to optimize for overcommit scenarios and thereby
penalizing everyone else.
Provide the basic infrastructure to implement this:
- Change the UNSET marker to BIT(31) from ~0U
- Add the ONCPU marker as BIT(30)
- Add the TRANSIT marker as BIT(29)
That allows to check for ownership trivially and provides a simple check for
UNSET as well. The TRANSIT marker is required to prevent CID space
exhaustion when switching from per CPU to per task mode.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251119172549.960252358@linutronix.de
|
|
Prepare for the new CID management scheme which puts the CID ownership
transition into the fork() and exit() slow path by serializing
sched_mm_cid_fork()/exit() with it, so task list and cpu mask walks can be
done in interruptible and preemptible code.
The contention on it is not worse than on other concurrency controls in the
fork()/exit() machinery.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172549.895826703@linutronix.de
|
|
Reading mm::mm_users and mm:::mm_cid::nr_cpus_allowed every time to compute
the maximal CID value is just wasteful as that value is only changing on
fork(), exit() and eventually when the affinity changes.
So it can be easily precomputed at those points and provided in mm::mm_cid
for consumption in the hot path.
But there is an issue with using mm::mm_users for accounting because that
does not necessarily reflect the number of user space tasks as other kernel
code can take temporary references on the MM which skew the picture.
Solve that by adding a users counter to struct mm_mm_cid, which is modified
by fork() and exit() and used for precomputing under mm_mm_cid::lock.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172549.832764634@linutronix.de
|
|
It's getting bigger soon, so just move it out of line to the rest of the
code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172549.769636491@linutronix.de
|
|
There is no need anymore to keep this under sighand lock as the current
code and the upcoming replacement are not depending on the exit state of a
task anymore.
That allows to use a mutex in the exit path.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251119172549.706439391@linutronix.de
|
|
This is truly a bitmap and just conveniently uses a cpumask because the
maximum size of the bitmap is nr_cpu_ids.
But that prevents to do searches for a zero bit in a limited range, which
is helpful to provide an efficient mechanism to consolidate the CID space
when the number of users decreases.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
Link: https://patch.msgid.link/20251119172549.642866767@linutronix.de
|
|
Reevaluating num_possible_cpus() over and over does not make sense. That
becomes a constant after init as cpu_possible_mask is marked ro_after_init.
Cache the value during initialization and provide that for consumption.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://patch.msgid.link/20251119172549.578653738@linutronix.de
|
|
A CPU system wakeup QoS limit may have been requested by user space. To
avoid breaking this constraint when entering a low power state during
s2idle, let's start to take into account the QoS limit.
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dhruva Gole <d-gole@ti.com>
Reviewed-by: Kevin Hilman (TI) <khilman@baylibre.com>
Tested-by: Kevin Hilman (TI) <khilman@baylibre.com>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://patch.msgid.link/20251125112650.329269-5-ulf.hansson@linaro.org
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
A CPU system wakeup QoS limit may have been requested by user space. To
avoid breaking this constraint when entering a low power state during
s2idle through genpd, let's extend the corresponding genpd governor for
CPUs. More precisely, during s2idle let the genpd governor select a
suitable domain idle state, by taking into account the QoS limit.
Reviewed-by: Dhruva Gole <d-gole@ti.com>
Reviewed-by: Kevin Hilman (TI) <khilman@baylibre.com>
Tested-by: Kevin Hilman (TI) <khilman@baylibre.com>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://patch.msgid.link/20251125112650.329269-3-ulf.hansson@linaro.org
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Some platforms supports multiple low power states for CPUs that can be used
when entering system-wide suspend. Currently we are always selecting the
deepest possible state for the CPUs, which can break the system wakeup
latency constraint that may be required for a use case.
Let's take the first step towards addressing this problem, by introducing
an interface for user space, that allows us to specify the CPU system
wakeup QoS limit. Subsequent changes will start taking into account the new
QoS limit.
Reviewed-by: Dhruva Gole <d-gole@ti.com>
Reviewed-by: Kevin Hilman (TI) <khilman@baylibre.com>
Tested-by: Kevin Hilman (TI) <khilman@baylibre.com>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://patch.msgid.link/20251125112650.329269-2-ulf.hansson@linaro.org
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
This function is used to establish the "private interconnect" between the
VFIO DMABUF exporter and the iommufd DMABUF importer. This is intended to
be a temporary API until the core DMABUF interface is improved to natively
support a private interconnect and revocable negotiation.
This function should only be called by iommufd when trying to map a
DMABUF. For now iommufd will only support VFIO DMABUFs.
The following improvements are needed in the DMABUF API to generically
support more exporters with iommufd/kvm type importers that cannot use the
DMA API:
1) Revoke semantics. VFIO needs to be able to prevent access to the MMIO
during FLR, and so it will use dma_buf_move_notify() to prevent
access. iommmufd does not support fault handling so it cannot
implement the full move_notify. Instead if revoke is negotiated the
exporter promises not to use move_notify() unless the importer can
experiance failures. iommufd will unmap the dmabuf from the iommu page
tables while it is revoked.
2) Private interconnect negotiation. iommufd will only be able to map
a "private interconnect" that provides a phys_addr_t and a
struct p2pdma_provider * to describe the memory. It cannot use a DMA
mapped scatterlist since it is directly calling iommu_map().
3) NULL device during dma_buf_dynamic_attach(). Since iommufd doesn't use
the DMA API it doesn't have a DMAable struct device to pass here.
Link: https://patch.msgid.link/r/1-v2-b2c110338e3f+5c2-iommufd_dmabuf_jgg@nvidia.com
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shuai Xue <xueshuai@linux.alibaba.com>
Acked-by: Alex Williamson <alex@shazbot.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
Merges series "mempool_alloc_bulk and various mempool improvements v3"
from Christoph Hellwig.
From the cover letter [1]:
This series adds a bulk version of mempool_alloc that makes allocating
multiple objects deadlock safe.
The initial users is the blk-crypto-fallback code:
https://lore.kernel.org/linux-block/20251031093517.1603379-1-hch@lst.de/
with which v1 was posted, but I also have a few other users in mind.
Link: https://lore.kernel.org/all/20251113084022.1255121-1-hch@lst.de/ [1]
|
|
Merge series "Prepare slab for memdescs" by Matthew Wilcox.
From the cover letter [1]:
When we separate struct folio, struct page and struct slab from each
other, converting to folios then to slabs will be nonsense. It made
sense under the 'folio is just a head page' interpretation, but with
full separation, page_folio() will return NULL for a page which belongs
to a slab.
This patch series removes almost all mentions of folio from slab.
There are a few folio_test_slab() invocations left around the tree that
I haven't decided how to handle yet. We're not yet quite at the point
of separately allocating struct slab, but that's what I'll be working
on next.
Link: https://lore.kernel.org/all/20251113000932.1589073-1-willy@infradead.org/ [1]
|
|
Merge series "slab: preparatory cleanups before adding sheaves to all
caches" [1]
Cleanups that were written as part of the full sheaves conversion, which
is not fully ready yet, but they are useful on their own.
Link: https://lore.kernel.org/all/20251105-sheaves-cleanups-v1-0-b8218e1ac7ef@suse.cz/ [1]
|
|
soc/drivers
Reset controller updates for v6.19
* Add support for LAN969x, eic770 and RZ/G3S reset controllers,
for the RZ/G3S USB-PHY reset controller, and for the remaining
TH1520 reset controllers.
* Drop legacy reset control lookup code.
* Include linux/bits.h from linux/reset.h to make it self-contained.
* tag 'reset-for-v6.19' of https://git.pengutronix.de/git/pza/linux:
Documentation: reset: Remove reset_controller_add_lookup()
reset: fix BIT macro reference
reset: rzg2l-usbphy-ctrl: Fix a NULL vs IS_ERR() bug in probe
reset: th1520: Support reset controllers in more subsystems
reset: th1520: Prepare for supporting multiple controllers
dt-bindings: reset: thead,th1520-reset: Add controllers for more subsys
dt-bindings: reset: thead,th1520-reset: Remove non-VO-subsystem resets
reset: remove legacy reset lookup code
clk: davinci: psc: drop unused reset lookup
reset: rzg2l-usbphy-ctrl: Add support for RZ/G3S SoC
reset: rzg2l-usbphy-ctrl: Add support for USB PWRRDY
dt-bindings: reset: renesas,rzg2l-usbphy-ctrl: Document RZ/G3S support
reset: eswin: Add eic7700 reset driver
dt-bindings: reset: eswin: Documentation for eic7700 SoC
reset: sparx5: add LAN969x support
dt-bindings: reset: microchip: Add LAN969x support
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux into soc/drivers
Qualcomm driver updates for v6.19
Support for hardware-keymanager v1 support for wrapped keys is introduce
in the ICE driver.
Support for the new Kaanapali mobile platform is added to last-level
cache controller, pd-mapper, and UBWC drivers.
UBWC driver gains support for the Monaco and Glymur platforms.
The PMIC GLINK driver is extended to handle the differences found in
targets where the related firmware runs on the SoCCP.
Support for running on targets without initialized SMEM is provided, by
reworking the SMEM driver to differentiate between "not yet probed" and
"probed but there was no SMEM". An unwanted WARN_ON() that triggered if
clients asked for a SMEM item beyond the currently running system's
limit, was removed, to allow new use cases to gracefully fail on old
targets.
The Qualcomm socinfo driver is extended with support for version 20
through 23 and support for providing version information about more than
32 remote processors. Identifiers for QCS6490 and SM8850 are also added.
Additionally, a number of smaller bug fixes and cleanups in PBS, OCMEM,
GSBI, TZMEM, and MDT-loader are included.
* tag 'qcom-drivers-for-6.19' of https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux: (31 commits)
soc: qcom: mdt_loader: rename 'firmware' parameter of qcom_mdt_load()
soc: qcom: mdt_loader: merge __qcom_mdt_load() and qcom_mdt_load_no_init()
soc: qcom: socinfo: Add reserve field to support future extension
soc: qcom: socinfo: Add support for new fields in revision 20
dt-bindings: firmware: qcom,scm: Document SCM on Kaanapali SOC
soc: qcom: socinfo: add support to extract more than 32 image versions
soc: qcom: smem: drop the WARN_ON() on SMEM item validation
soc: qcom: ubwc: Add config for Kaanapali
soc: qcom: socinfo: Add SoC ID for QCS6490
dt-bindings: arm: qcom,ids: Add SoC ID for QCS6490
soc: qcom: ice: Add HWKM v1 support for wrapped keys
soc: qcom: smem: better track SMEM uninitialized state
err.h: add INIT_ERR_PTR() macro
soc: qcom: smem: fix hwspinlock resource leak in probe error paths
dt-bindings: soc: qcom,aoss-qmp: Document the Glymur AOSS side channel
dt-bindings: soc: qcom,aoss-qmp: Document the Kaanapali AOSS channel
soc: qcom: ubwc: Add QCS8300 UBWC cfg
dt-bindings: firmware: qcom,scm: Document Glymur scm
soc: qcom: socinfo: Add SM8850 SoC ID
dt-bindings: arm: qcom,ids: Add SoC ID for SM8850
...
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
Add a missing struct short description and a missing leading " *" to
lp855x.h to avoid kernel-doc warnings:
Warning: include/linux/platform_data/lp855x.h:126 missing initial short
description on line:
* struct lp855x_platform_data
Warning: include/linux/platform_data/lp855x.h:131 bad line:
Only valid when mode is PWM_BASED.
Fixes: 7be865ab8634 ("backlight: new backlight driver for LP855x devices")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Daniel Thompson (RISCstar) <danielt@kernel.org>
Link: https://patch.msgid.link/20251111060916.1995920-1-rdunlap@infradead.org
Signed-off-by: Lee Jones <lee@kernel.org>
|
|
Since 0f67b56d84b4c ("clocksource/drivers/arm_arch_timer_mmio: Switch
over to standalone driver"), acpi_arch_timer_mem_init() is unused.
Remove it.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mark Rutland <mark.rutland@arm.com>
Acked-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
The arguments of strends() must not be NULL so annotate the function
with the nonnull attribute.
Suggested-by: Kees Cook <kees@kernel.org>
Link: https://lore.kernel.org/r/20251118-strends-follow-up-v1-2-d3f8ef750f59@linaro.org
Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
|
|
1. inode_bit_waitqueue() was somehow placed between __inode_add_lru() and
inode_add_lru(). move it up
2. assert ->i_lock is held in __inode_add_lru instead of just claiming it is
needed
3. s/__inode_add_lru/__inode_lru_list_add/ for consistency with itself
(inode_lru_list_del()) and similar routines for sb and io list
management
4. push list presence check into inode_lru_list_del(), just like sb and
io list
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251029131428.654761-2-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
In the inode hash code grab the state while ->i_lock is held. If found
to be set, synchronize the sleep once more with the lock held.
In the real world the flag is not set most of the time.
Apart from being simpler to reason about, it comes with a minor speed up
as now clearing the flag does not require the smp_mb() fence.
While here rename wait_on_inode() to wait_on_new_inode() to line it up
with __wait_on_freeing_inode().
Christian Brauner <brauner@kernel.org> says:
As per the discussion in [1] I folded in the diff sent in [2].
Link: https://lore.kernel.org/69238e4d.a70a0220.d98e3.006e.GAE@google.com [1]
Link: https://lore.kernel.org/c2kpawomkbvtahjm7y5mposbhckb7wxthi3iqy5yr22ggpucrm@ufvxwy233qxo [2]
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251010221737.1403539-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
The FILS status codes are set to 108/109, but the IEEE 802.11-2020
spec defines them as 112/113. Update the enum so it matches the
specification and keeps the kernel consistent with standard values.
Fixes: a3caf7440ded ("cfg80211: Add support for FILS shared key authentication offload")
Signed-off-by: Ria Thomas <ria.thomas@morsemicro.com>
Reviewed-by: Jeff Johnson <jeff.johnson@oss.qualcomm.com>
Link: https://patch.msgid.link/20251124125637.3936154-1-ria.thomas@morsemicro.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux into soc/drivers
syscore: Changes for v6.19-rc1
Add a parameter to syscore operations to allow passing contextual data,
which in turn enables refactoring of drivers to make them independent of
global data. This initially only contains the API changes along with the
updates for existing drivers. Subsequent work will make use of this to
improve drivers.
* tag 'tegra-for-6.19-syscore' of git://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux:
syscore: Pass context data to callbacks
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
This was added by commit 099ada2c8726 ("io_uring/rw: add write support
for IOCB_DIO_CALLER_COMP") and disabled a little later by commit
838b35bb6a89 ("io_uring/rw: disable IOCB_DIO_CALLER_COMP") because it
didn't work. Remove all the related code that sat unused for 2 years.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251113170633.1453259-2-hch@lst.de
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux into soc/drivers
Samsung SoC drivers for v6.19
1. ChipID driver: Add support for identifying Exynos8890 and Exynos9610.
2. PMU driver: Allow specifying list of valid registers for the custom
regmap used on Google GS101 SoC. The PMU (Power Management Unit) on
that SoC uses more complex access to registers than simple MMIO and
invalid registers trigger aborts halting the system.
3. Few minor cleanups.
4. Several new bindings for compatible devices.
* tag 'samsung-drivers-6.19' of https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux:
dt-bindings: soc: samsung: exynos-pmu: allow mipi-phy subnode for Exynos7870 PMU
soc: samsung: exynos-chipid: use a local dev variable
dt-bindings: soc: samsung: exynos-sysreg: add gs101 hsi0 and misc compatibles
dt-bindings: soc: samsung: exynos-sysreg: add power-domains
soc: samsung: gs101-pmu: implement access tables for read and write
soc: samsung: exynos-pmu: move some gs101 related code into new file
soc: samsung: exynos-pmu: allow specifying read & write access tables for secure regmap
dt-bindings: samsung: exynos-sysreg: add exynos7870 sysregs
soc: samsung: exynos-chipid: add exynos8890 SoC support
dt-bindings: hwinfo: samsung,exynos-chipid: add exynos8890-chipid compatible
dt-bindings: soc: samsung: exynos-pmu: add exynos8890 compatible
soc: samsung: exynos-pmu: Annotate online/offline functions with __must_hold
soc: samsung: exynos-chipid: Add exynos9610 SoC support
dt-bindings: hwinfo: samsung,exynos-chipid: add exynos9610 compatible
dt-bindings: soc: samsung: exynos-sysreg: Add Exynos990 PERIC0/1 compatibles
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
Trivial fix.
Signed-off-by: Askar Safin <safinaskar@gmail.com>
Link: https://patch.msgid.link/20251120195140.571608-1-safinaskar@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
In a recent commit, I inadvertently changed a comparison from being an
unsigned comparison (on 64-bit systems) to being a signed comparison
(which it had always been on 32-bit systems). This led to a sporadic
fstests failure.
To make sure this comparison is always unsigned, introduce a new type,
uoff_t which is the unsigned version of loff_t. Generally file sizes
are restricted to being a signed integer, but in these two places it is
convenient to pass -1 to indicate "up to the end of the file".
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251123220518.1447261-1-willy@infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Make the enum name in kernel-doc match the code to prevent kernel-doc warnings:
Warning: include/linux/cc_platform.h:106 Enum value
'CC_ATTR_GUEST_SEV_SNP' not described in enum 'cc_attr'
Warning: include/linux/cc_platform.h:106 Excess enum value
'%CC_ATTR_SEV_SNP' description in 'cc_attr'
Fixes: f742b90e61bb ("x86/mm: Extend cc_attr to include AMD SEV-SNP")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://patch.msgid.link/20251125022730.3163679-1-rdunlap@infradead.org
|
|
According to Dan Carpenter, smatch detects issue with size parameter given
to pci_rebar_size_supported():
drivers/pci/rebar.c:142 pci_rebar_size_supported()
error: undefined (user controlled) shift '(((1))) << size'
The problem is this call tree, which uses the 'size' from the user to shift
in BIT() without validating it:
__resource_resize_store # takes 'buf' from user sysfs write
kstrtoul(buf, 0, &size) # converts to unsigned long
pci_resize_resource # truncates to int
pci_rebar_size_supported # BIT(size) without validation
There could be similar problems also with pci_resize_resource() parameter
values coming from drivers.
Add 'size' validation to pci_rebar_size_supported().
There seems to be no SZ_128T prior to this so add one to be able to specify
the largest size supported by the kernel (PCIe r7.0 spec already defines
sizes even beyond 128TB but kernel does not yet support them).
The issue looks older than the introduction of pci_rebar_size_supported()
by bb1fabd0d94e ("PCI: Add pci_rebar_size_supported() helper").
It would be also nice to convert 'size' unsigned too everywhere, maybe even
u8 but that is left as further work.
Fixes: 8bb705e3e79d ("PCI: Add pci_resize_resource() for resizing BARs")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/aSA1WiRG3RuhqZMY@stanley.mountain/
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
[bhelgaas: commit log, add report URL]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://patch.msgid.link/20251124153740.2995-1-ilpo.jarvinen@linux.intel.com
|
|
sysctl bits are mostly-read values.
Link: https://lkml.kernel.org/r/20251121194859.265259-2-gourry@gourry.net
Signed-off-by: Gregory Price <gourry@gourry.net>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Some platforms can customize the PTE/PMD entry uffd-wp bit making it
unavailable even if the architecture provides the resource. This patch
adds a macro API pgtable_supports_uffd_wp() that allows architectures to
define their specific implementations to check if the uffd-wp bit is
available on which device the kernel is running.
Also this patch is removing "ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP" and
"ifdef CONFIG_PTE_MARKER_UFFD_WP" in favor of pgtable_supports_uffd_wp()
and uffd_supports_wp_marker() checks respectively that default to
IS_ENABLED(CONFIG_HAVE_ARCH_USERFAULTFD_WP) and
"IS_ENABLED(CONFIG_HAVE_ARCH_USERFAULTFD_WP) &&
IS_ENABLED(CONFIG_PTE_MARKER_UFFD_WP)" if not overridden by the
architecture, no change in behavior is expected.
Link: https://lkml.kernel.org/r/20251113072806.795029-3-zhangchunyan@iscas.ac.cn
Signed-off-by: Chunyan Zhang <zhangchunyan@iscas.ac.cn>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Jones <ajones@ventanamicro.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Conor Dooley <conor.dooley@microchip.com>
Cc: Conor Dooley <conor@kernel.org>
Cc: Deepak Gupta <debug@rivosinc.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yuanchu Xie <yuanchu@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm: Add soft-dirty and uffd-wp support for RISC-V", v15.
This patchset adds support for Svrsw60t59b [1] extension which is ratified
now, also add soft dirty and userfaultfd write protect tracking for
RISC-V.
The patches 1 and 2 add macros to allow architectures to define their own
checks if the soft-dirty / uffd_wp PTE bits are available, in other words
for RISC-V, the Svrsw60t59b extension is supported on which device the
kernel is running. Also patch1-2 are removing "ifdef
CONFIG_MEM_SOFT_DIRTY" "ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP" and "ifdef
CONFIG_PTE_MARKER_UFFD_WP" in favor of checks which if not overridden by
the architecture, no change in behavior is expected.
This patchset has been tested with kselftest mm suite in which soft-dirty,
madv_populate, test_unmerge_uffd_wp, and uffd-unit-tests run and pass, and
no regressions are observed in any of the other tests.
This patch (of 6):
Some platforms can customize the PTE PMD entry soft-dirty bit making it
unavailable even if the architecture provides the resource.
Add an API which architectures can define their specific implementations
to detect if soft-dirty bit is available on which device the kernel is
running.
This patch is removing "ifdef CONFIG_MEM_SOFT_DIRTY" in favor of
pgtable_supports_soft_dirty() checks that defaults to
IS_ENABLED(CONFIG_MEM_SOFT_DIRTY), if not overridden by the architecture,
no change in behavior is expected.
We make sure to never set VM_SOFTDIRTY if !pgtable_supports_soft_dirty(),
so we will never run into VM_SOFTDIRTY checks.
[lorenzo.stoakes@oracle.com: fix VMA selftests]
Link: https://lkml.kernel.org/r/dac6ddfe-773a-43d5-8f69-021b9ca4d24b@lucifer.local
Link: https://lkml.kernel.org/r/20251113072806.795029-1-zhangchunyan@iscas.ac.cn
Link: https://lkml.kernel.org/r/20251113072806.795029-2-zhangchunyan@iscas.ac.cn
Link: https://github.com/riscv-non-isa/riscv-iommu/pull/543 [1]
Signed-off-by: Chunyan Zhang <zhangchunyan@iscas.ac.cn>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Conor Dooley <conor@kernel.org>
Cc: Deepak Gupta <debug@rivosinc.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: Andrew Jones <ajones@ventanamicro.com>
Cc: Conor Dooley <conor.dooley@microchip.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
__lruvec_stat_mod_folio() is already safe against irqs, so there is no
need to have a separate interface (i.e. lruvec_stat_mod_folio) which
wraps calls to it with irq disabling and reenabling. Let's rename
__lruvec_stat_mod_folio() to lruvec_stat_mod_folio().
Link: https://lkml.kernel.org/r/20251110232008.1352063-5-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
__mod_lruvec_state() is already safe against irqs, so there is no need to
have a separate interface (i.e. mod_lruvec_state) which wraps calls to it
with irq disabling and reenabling. Let's rename __mod_lruvec_state() to
mod_lruvec_state().
Link: https://lkml.kernel.org/r/20251110232008.1352063-4-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
__mod_lruvec_kmem_state() is already safe against irqs, so there is no
need to have a separate interface (i.e. mod_lruvec_kmem_state) which
wraps calls to it with irq disabling and reenabling. Let's rename
__mod_lruvec_kmem_state() to mod_lruvec_kmem_state().
Link: https://lkml.kernel.org/r/20251110232008.1352063-3-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "memcg: cleanup the memcg stats interfaces".
The memcg stats are safe against irq (and nmi) context and thus does not
require disabling irqs. However for some stats which are also maintained
at node level, it is using irq unsafe interface and thus requiring the
users to still disables irqs or use interfaces which explicitly disables
irqs. Let's move memcg code to use irq safe node level stats function
which is already optimized for architectures with HAVE_CMPXCHG_LOCAL (all
major ones), so there will not be any performance penalty for its usage.
This patch (of 4):
The memcg stats are safe against irq (and nmi) context and thus does not
require disabling irqs. However some code paths for memcg stats also
update the node level stats and use irq unsafe interface and thus require
the users to disable irqs. However node level stats, on architectures
with HAVE_CMPXCHG_LOCAL (all major ones), has interface which does not
require irq disabling. Let's move memcg stats code to start using that
interface for node level stats.
Link: https://lkml.kernel.org/r/20251110232008.1352063-1-shakeel.butt@linux.dev
Link: https://lkml.kernel.org/r/20251110232008.1352063-2-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Wei |