diff options
59 files changed, 6769 insertions, 266 deletions
diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl index 99bb3faf7a0e..6b4e8c7a963d 100644 --- a/Documentation/ABI/testing/sysfs-bus-cxl +++ b/Documentation/ABI/testing/sysfs-bus-cxl @@ -242,7 +242,7 @@ Description: decoding a Host Physical Address range. Note that this number may be elevated without any regionX objects active or even enumerated, as this may be due to decoders established by - platform firwmare or a previous kernel (kexec). + platform firmware or a previous kernel (kexec). What: /sys/bus/cxl/devices/decoderX.Y @@ -572,7 +572,7 @@ Description: What: /sys/bus/cxl/devices/regionZ/accessY/read_bandwidth - /sys/bus/cxl/devices/regionZ/accessY/write_banwidth + /sys/bus/cxl/devices/regionZ/accessY/write_bandwidth Date: Jan, 2024 KernelVersion: v6.9 Contact: linux-cxl@vger.kernel.org diff --git a/Documentation/driver-api/cxl/access-coordinates.rst b/Documentation/driver-api/cxl/access-coordinates.rst deleted file mode 100644 index b07950ea30c9..000000000000 --- a/Documentation/driver-api/cxl/access-coordinates.rst +++ /dev/null @@ -1,91 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 -.. include:: <isonum.txt> - -================================== -CXL Access Coordinates Computation -================================== - -Shared Upstream Link Calculation -================================ -For certain CXL region construction with endpoints behind CXL switches (SW) or -Root Ports (RP), there is the possibility of the total bandwidth for all -the endpoints behind a switch being more than the switch upstream link. -A similar situation can occur within the host, upstream of the root ports. -The CXL driver performs an additional pass after all the targets have -arrived for a region in order to recalculate the bandwidths with possible -upstream link being a limiting factor in mind. - -The algorithm assumes the configuration is a symmetric topology as that -maximizes performance. When asymmetric topology is detected, the calculation -is aborted. An asymmetric topology is detected during topology walk where the -number of RPs detected as a grandparent is not equal to the number of devices -iterated in the same iteration loop. The assumption is made that subtle -asymmetry in properties does not happen and all paths to EPs are equal. - -There can be multiple switches under an RP. There can be multiple RPs under -a CXL Host Bridge (HB). There can be multiple HBs under a CXL Fixed Memory -Window Structure (CFMWS). - -An example hierarchy: - -> CFMWS 0 -> | -> _________|_________ -> | | -> ACPI0017-0 ACPI0017-1 -> GP0/HB0/ACPI0016-0 GP1/HB1/ACPI0016-1 -> | | | | -> RP0 RP1 RP2 RP3 -> | | | | -> SW 0 SW 1 SW 2 SW 3 -> | | | | | | | | -> EP0 EP1 EP2 EP3 EP4 EP5 EP6 EP7 - -Computation for the example hierarchy: - -Min (GP0 to CPU BW, - Min(SW 0 Upstream Link to RP0 BW, - Min(SW0SSLBIS for SW0DSP0 (EP0), EP0 DSLBIS, EP0 Upstream Link) + - Min(SW0SSLBIS for SW0DSP1 (EP1), EP1 DSLBIS, EP1 Upstream link)) + - Min(SW 1 Upstream Link to RP1 BW, - Min(SW1SSLBIS for SW1DSP0 (EP2), EP2 DSLBIS, EP2 Upstream Link) + - Min(SW1SSLBIS for SW1DSP1 (EP3), EP3 DSLBIS, EP3 Upstream link))) + -Min (GP1 to CPU BW, - Min(SW 2 Upstream Link to RP2 BW, - Min(SW2SSLBIS for SW2DSP0 (EP4), EP4 DSLBIS, EP4 Upstream Link) + - Min(SW2SSLBIS for SW2DSP1 (EP5), EP5 DSLBIS, EP5 Upstream link)) + - Min(SW 3 Upstream Link to RP3 BW, - Min(SW3SSLBIS for SW3DSP0 (EP6), EP6 DSLBIS, EP6 Upstream Link) + - Min(SW3SSLBIS for SW3DSP1 (EP7), EP7 DSLBIS, EP7 Upstream link)))) - -The calculation starts at cxl_region_shared_upstream_perf_update(). A xarray -is created to collect all the endpoint bandwidths via the -cxl_endpoint_gather_bandwidth() function. The min() of bandwidth from the -endpoint CDAT and the upstream link bandwidth is calculated. If the endpoint -has a CXL switch as a parent, then min() of calculated bandwidth and the -bandwidth from the SSLBIS for the switch downstream port that is associated -with the endpoint is calculated. The final bandwidth is stored in a -'struct cxl_perf_ctx' in the xarray indexed by a device pointer. If the -endpoint is direct attached to a root port (RP), the device pointer would be an -RP device. If the endpoint is behind a switch, the device pointer would be the -upstream device of the parent switch. - -At the next stage, the code walks through one or more switches if they exist -in the topology. For endpoints directly attached to RPs, this step is skipped. -If there is another switch upstream, the code takes the min() of the current -gathered bandwidth and the upstream link bandwidth. If there's a switch -upstream, then the SSLBIS of the upstream switch. - -Once the topology walk reaches the RP, whether it's direct attached endpoints -or walking through the switch(es), cxl_rp_gather_bandwidth() is called. At -this point all the bandwidths are aggregated per each host bridge, which is -also the index for the resulting xarray. - -The next step is to take the min() of the per host bridge bandwidth and the -bandwidth from the Generic Port (GP). The bandwidths for the GP is retrieved -via ACPI tables SRAT/HMAT. The min bandwidth are aggregated under the same -ACPI0017 device to form a new xarray. - -Finally, the cxl_region_update_bandwidth() is called and the aggregated -bandwidth from all the members of the last xarray is updated for the -access coordinates residing in the cxl region (cxlr) context. diff --git a/Documentation/driver-api/cxl/allocation/dax.rst b/Documentation/driver-api/cxl/allocation/dax.rst new file mode 100644 index 000000000000..c6f7a5da832f --- /dev/null +++ b/Documentation/driver-api/cxl/allocation/dax.rst @@ -0,0 +1,60 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========== +DAX Devices +=========== +CXL capacity exposed as a DAX device can be accessed directly via mmap. +Users may wish to use this interface mechanism to write their own userland +CXL allocator, or to managed shared or persistent memory regions across multiple +hosts. + +If the capacity is shared across hosts or persistent, appropriate flushing +mechanisms must be employed unless the region supports Snoop Back-Invalidate. + +Note that mappings must be aligned (size and base) to the dax device's base +alignment, which is typically 2MB - but maybe be configured larger. + +:: + + #include <stdio.h> + #include <stdlib.h> + #include <stdint.h> + #include <sys/mman.h> + #include <fcntl.h> + #include <unistd.h> + + #define DEVICE_PATH "/dev/dax0.0" // Replace DAX device path + #define DEVICE_SIZE (4ULL * 1024 * 1024 * 1024) // 4GB + + int main() { + int fd; + void* mapped_addr; + + /* Open the DAX device */ + fd = open(DEVICE_PATH, O_RDWR); + if (fd < 0) { + perror("open"); + return -1; + } + + /* Map the device into memory */ + mapped_addr = mmap(NULL, DEVICE_SIZE, PROT_READ | PROT_WRITE, + MAP_SHARED, fd, 0); + if (mapped_addr == MAP_FAILED) { + perror("mmap"); + close(fd); + return -1; + } + + printf("Mapped address: %p\n", mapped_addr); + + /* You can now access the device through the mapped address */ + uint64_t* ptr = (uint64_t*)mapped_addr; + *ptr = 0x1234567890abcdef; // Write a value to the device + printf("Value at address %p: 0x%016llx\n", ptr, *ptr); + + /* Clean up */ + munmap(mapped_addr, DEVICE_SIZE); + close(fd); + return 0; + } diff --git a/Documentation/driver-api/cxl/allocation/hugepages.rst b/Documentation/driver-api/cxl/allocation/hugepages.rst new file mode 100644 index 000000000000..1023c6922829 --- /dev/null +++ b/Documentation/driver-api/cxl/allocation/hugepages.rst @@ -0,0 +1,32 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========== +Huge Pages +========== + +Contiguous Memory Allocator +=========================== +CXL Memory onlined as SystemRAM during early boot is eligible for use by CMA, +as the NUMA node hosting that capacity will be `Online` at the time CMA +carves out contiguous capacity. + +CXL Memory deferred to the CXL Driver for configuration cannot have its +capacity allocated by CMA - as the NUMA node hosting the capacity is `Offline` +at :code:`__init` time - when CMA carves out contiguous capacity. + +HugeTLB +======= +Different huge page sizes allow different memory configurations. + +2MB Huge Pages +-------------- +All CXL capacity regardless of configuration time or memory zone is eligible +for use as 2MB huge pages. + +1GB Huge Pages +-------------- +CXL capacity onlined in :code:`ZONE_NORMAL` is eligible for 1GB Gigantic Page +allocation. + +CXL capacity onlined in :code:`ZONE_MOVABLE` is not eligible for 1GB Gigantic +Page allocation. diff --git a/Documentation/driver-api/cxl/allocation/page-allocator.rst b/Documentation/driver-api/cxl/allocation/page-allocator.rst new file mode 100644 index 000000000000..7b8fe1b8d5bb --- /dev/null +++ b/Documentation/driver-api/cxl/allocation/page-allocator.rst @@ -0,0 +1,85 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================== +The Page Allocator +================== + +The kernel page allocator services all general page allocation requests, such +as :code:`kmalloc`. CXL configuration steps affect the behavior of the page +allocator based on the selected `Memory Zone` and `NUMA node` the capacity is +placed in. + +This section mostly focuses on how these configurations affect the page +allocator (as of Linux v6.15) rather than the overall page allocator behavior. + +NUMA nodes and mempolicy +======================== +Unless a task explicitly registers a mempolicy, the default memory policy +of the linux kernel is to allocate memory from the `local NUMA node` first, +and fall back to other nodes only if the local node is pressured. + +Generally, we expect to see local DRAM and CXL memory on separate NUMA nodes, +with the CXL memory being non-local. Technically, however, it is possible +for a compute node to have no local DRAM, and for CXL memory to be the +`local` capacity for that compute node. + + +Memory Zones +============ +CXL capacity may be onlined in :code:`ZONE_NORMAL` or :code:`ZONE_MOVABLE`. + +As of v6.15, the page allocator attempts to allocate from the highest +available and compatible ZONE for an allocation from the local node first. + +An example of a `zone incompatibility` is attempting to service an allocation +marked :code:`GFP_KERNEL` from :code:`ZONE_MOVABLE`. Kernel allocations are +typically not migratable, and as a result can only be serviced from +:code:`ZONE_NORMAL` or lower. + +To simplify this, the page allocator will prefer :code:`ZONE_MOVABLE` over +:code:`ZONE_NORMAL` by default, but if :code:`ZONE_MOVABLE` is depleted, it +will fallback to allocate from :code:`ZONE_NORMAL`. + + +Zone and Node Quirks +==================== +Let's consider a configuration where the local DRAM capacity is largely onlined +into :code:`ZONE_NORMAL`, with no :code:`ZONE_MOVABLE` capacity present. The +CXL capacity has the opposite configuration - all onlined in +:code:`ZONE_MOVABLE`. + +Under the default allocation policy, the page allocator will completely skip +:code:`ZONE_MOVABLE` as a valid allocation target. This is because, as of +Linux v6.15, the page allocator does (approximately) the following: :: + + for (each zone in local_node): + + for (each node in fallback_order): + + attempt_allocation(gfp_flags); + +Because the local node does not have :code:`ZONE_MOVABLE`, the CXL node is +functionally unreachable for direct allocation. As a result, the only way +for CXL capacity to be used is via `demotion` in the reclaim path. + +This configuration also means that if the DRAM ndoe has :code:`ZONE_MOVABLE` +capacity - when that capacity is depleted, the page allocator will actually +prefer CXL :code:`ZONE_MOVABLE` pages over DRAM :code:`ZONE_NORMAL` pages. + +We may wish to invert this priority in future Linux versions. + +If `demotion` and `swap` are disabled, Linux will begin to cause OOM crashes +when the DRAM nodes are depleted. See the reclaim section for more details. + + +CGroups and CPUSets +=================== +Finally, assuming CXL memory is reachable via the page allocation (i.e. onlined +in :code:`ZONE_NORMAL`), the :code:`cpusets.mems_allowed` may be used by +containers to limit the accessibility of certain NUMA nodes for tasks in that +container. Users may wish to utilize this in multi-tenant systems where some +tasks prefer not to use slower memory. + +In the reclaim section we'll discuss some limitations of this interface to +prevent demotions of shared data to CXL memory (if demotions are enabled). + diff --git a/Documentation/driver-api/cxl/allocation/reclaim.rst b/Documentation/driver-api/cxl/allocation/reclaim.rst new file mode 100644 index 000000000000..f40f1cae391a --- /dev/null +++ b/Documentation/driver-api/cxl/allocation/reclaim.rst @@ -0,0 +1,51 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======= +Reclaim +======= +Another way CXL memory can be utilized *indirectly* is via the reclaim system +in :code:`mm/vmscan.c`. Reclaim is engaged when memory capacity on the system +becomes pressured based on global and cgroup-local `watermark` settings. + +In this section we won't discuss the `watermark` configurations, just how CXL +memory can be consumed by various pieces of reclaim system. + +Demotion +======== +By default, the reclaim system will prefer swap (or zswap) when reclaiming +memory. Enabling :code:`kernel/mm/numa/demotion_enabled` will cause vmscan +to opportunistically prefer distant NUMA nodes to swap or zswap, if capacity +is available. + +Demotion engages the :code:`mm/memory_tier.c` component to determine the +next demotion node. The next demotion node is based on the :code:`HMAT` +or :code:`CDAT` performance data. + +cpusets.mems_allowed quirk +-------------------------- +In Linux v6.15 and below, demotion does not respect :code:`cpusets.mems_allowed` +when migrating pages. As a result, if demotion is enabled, vmscan cannot +guarantee isolation of a container's memory from nodes not set in mems_allowed. + +In Linux v6.XX and up, demotion does attempt to respect +:code:`cpusets.mems_allowed`; however, certain classes of shared memory +originally instantiated by another cgroup (such as common libraries - e.g. +libc) may still be demoted. As a result, the mems_allowed interface still +cannot provide perfect isolation from the remote nodes. + +ZSwap and Node Preference +========================= +In Linux v6.15 and below, ZSwap allocates memory from the local node of the +processor for the new pages being compressed. Since pages being compressed +are typically cold, the result is a cold page becomes promoted - only to +be later demoted as it ages off the LRU. + +In Linux v6.XX, ZSwap tries to prefer the node of the page being compressed +as the allocation target for the compression page. This helps prevent +thrashing. + +Demotion with ZSwap +=================== +When enabling both Demotion and ZSwap, you create a situation where ZSwap +will prefer the slowest form of CXL memory by default until that tier of +memory is exhausted. diff --git a/Documentation/driver-api/cxl/devices/device-types.rst b/Documentation/driver-api/cxl/devices/device-types.rst new file mode 100644 index 000000000000..f5e4330c1cfe --- /dev/null +++ b/Documentation/driver-api/cxl/devices/device-types.rst @@ -0,0 +1,165 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===================== +Devices and Protocols +===================== + +The type of CXL device (Memory, Accelerator, etc) dictates many configuration steps. This section +covers some basic background on device types and on-device resources used by the platform and OS +which impact configuration. + +Protocols +========= + +There are three core protocols to CXL. For the purpose of this documentation, +we will only discuss very high level definitions as the specific hardware +details are largely abstracted away from Linux. See the CXL specification +for more details. + +CXL.io +------ +The basic interaction protocol, similar to PCIe configuration mechanisms. +Typically used for initialization, configuration, and I/O access for anything +other than memory (CXL.mem) or cache (CXL.cache) operations. + +The Linux CXL driver exposes access to .io functionalty via the various sysfs +interfaces and /dev/cxl/ devices (which exposes direct access to device +mailboxes). + +CXL.cache +--------- +The mechanism by which a device may coherently access and cache host memory. + +Largely transparent to Linux once configured. + +CXL.mem +--------- +The mechanism by which the CPU may coherently access and cache device memory. + +Largely transparent to Linux once configured. + + +Device Types +============ + +Type-1 +------ + +A Type-1 CXL device: + +* Supports cxl.io and cxl.cache protocols +* Implements a fully coherent cache +* Allows Device-to-Host coherence and Host-to-Device snoops. +* Does NOT have host-managed device memory (HDM) + +Typical examples of type-1 devices is a Smart NIC - which may want to +directly operate on host-memory (DMA) to store incoming packets. These +devices largely rely on CPU-attached memory. + +Type-2 +------ + +A Type-2 CXL Device: + +* Supports cxl.io, cxl.cache, and cxl.mem protocols +* Optionally implements coherent cache and Host-Managed Device Memory +* Is typically an accelerator device w/ high bandwidth memory. + +The primary difference between a type-1 and type-2 device is the presence +of host-managed device memory, which allows the device to operate on a +local memory bank - while the CPU sill has coherent DMA to the same memory. + +The allows things like GPUs to expose their memory via DAX devices or file +descriptors, allows drivers and programs direct access to device memory +rather than use block-transfer semantics. + +Type-3 +------ + +A Type-3 CXL Device + +* Supports cxl.io and cxl.mem +* Implements Host-Managed Device Memory +* May provide either Volatile or Persistent memory capacity (or both). + +A basic example of a type-3 device is a simple memory expander, whose +local memory capacity is exposed to the CPU for access directly via +basic coherent DMA. + +Switch +------ + +A CXL switch is a device capacity of routing any CXL (and by extension, PCIe) +protocol between an upstream, downstream, or peer devices. Many devices, such +as Multi-Logical Devices, imply the presence of switching in some manner. + +Logical Devices and Heads +------------------------- + +A CXL device may present one or more "Logical Devices" to one or more hosts +(via physical "Heads"). + +A Single-Logical Device (SLD) is a device which presents a single device to +one or more heads. + +A Multi-Logical Device (MLD) is a device which may present multiple devices +to one or more devices. + +A Single-Headed Device exposes only a single physical connection. + +A Multi-Headed Device exposes multiple physical connections. + +MHSLD +~~~~~ +A Multi-Headed Single-Logical Device (MHSLD) exposes a single logical +device to multiple heads which may be connected to one or more discrete +hosts. An example of this would be a simple memory-pool which may be +statically configured (prior to boot) to expose portions of its memory +to Linux via :doc:`CEDT <../platform/acpi/cedt>`. + +MHMLD +~~~~~ +A Multi-Headed Multi-Logical Device (MHMLD) exposes multiple logical +devices to multiple heads which may be connected to one or more discrete +hosts. An example of this would be a Dynamic Capacity Device or which +may be configured at runtime to expose portions of its memory to Linux. + +Example Devices +=============== + +Memory Expander +--------------- +The simplest form of Type-3 device is a memory expander. A memory expander +exposes Host-Managed Device Memory (HDM) to Linux. This memory may be +Volatile or Non-Volatile (Persistent). + +Memory Expanders will typically be considered a form of Single-Headed, +Single-Logical Device - as its form factor will typically be an add-in-card +(AIC) or some other similar form-factor. + +The Linux CXL driver provides support for static or dynamic configuration of +basic memory expanders. The platform may program decoders prior to OS init +(e.g. auto-decoders), or the user may program the fabric if the platform +defers these operations to the OS. + +Multiple Memory Expanders may be added to an external chassis and exposed to +a host via a head attached to a CXL switch. This is a "memory pool", and +would be considered an MHSLD or MHMLD depending on the management capabilities +provided by the switch platform. + +As of v6.14, Linux does not provide a formalized interface to manage non-DCD +MHSLD or MHMLD devices. + +Dynamic Capacity Device (DCD) +----------------------------- + +A Dynamic Capacity Device is a Type-3 device which provides dynamic management +of memory capacity. The basic premise of a DCD to provide an allocator-like |
