aboutsummaryrefslogtreecommitdiff
path: root/include/linux/memory_hotplug.h
AgeCommit message (Collapse)AuthorFilesLines
2025-10-14mm/memory_hotplug: Remove MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiersSumanth Korikkar1-17/+1
MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE memory notifiers were introduced to prepare the transition of memory to and from a physically accessible state. This enhancement was crucial for implementing the "memmap on memory" feature for s390. With introduction of dynamic (de)configuration of hotpluggable memory, memory can be brought to accessible state before add_memory(). Memory can be brought to inaccessible state before remove_memory(). Hence, there is no need of MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE memory notifiers anymore. This basically reverts commit c5f1e2d18909 ("mm/memory_hotplug: introduce MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers") Additionally, apply minor adjustments to the function parameters of move_pfn_range_to_zone() and mhp_supports_memmap_on_memory() to ensure compatibility with the latest branch. Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
2025-07-13mm/page_alloc: add support for initializing pageblock as isolatedZi Yan1-1/+2
MIGRATE_ISOLATE is a standalone bit, so a pageblock cannot be initialized to just MIGRATE_ISOLATE. Add init_pageblock_migratetype() to enable initialize a pageblock with a migratetype and isolated. Link: https://lkml.kernel.org/r/20250617021115.2331563-4-ziy@nvidia.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Richard Chang <richardycc@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm: add build-time option for hotplug memory default online typeGregory Price1-2/+3
Memory hotplug presently auto-onlines memory into a zone the kernel deems appropriate if CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y. The memhp_default_state boot param enables runtime config, but it's not possible to do this at build-time. Remove CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE, and replace it with CONFIG_MHP_DEFAULT_ONLINE_TYPE_* choices that sync with the boot param. Selections: CONFIG_MHP_DEFAULT_ONLINE_TYPE_OFFLINE => mhp_default_online_type = "offline" Memory will not be onlined automatically. CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_AUTO => mhp_default_online_type = "online" Memory will be onlined automatically in a zone deemed. appropriate by the kernel. CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_KERNEL => mhp_default_online_type = "online_kernel" Memory will be onlined automatically. The zone may allow kernel data (e.g. ZONE_NORMAL). CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_MOVABLE => mhp_default_online_type = "online_movable" Memory will be onlined automatically. The zone will be ZONE_MOVABLE. Default to CONFIG_MHP_DEFAULT_ONLINE_TYPE_OFFLINE to match the existing default CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n behavior. Existing users of CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y should use CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_AUTO. [gourry@gourry.net: update KConfig comments] Link: https://lkml.kernel.org/r/20241226182918.648799-1-gourry@gourry.net Link: https://lkml.kernel.org/r/20241220210709.300066-1-gourry@gourry.net Signed-off-by: Gregory Price <gourry@gourry.net> Acked-by: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-03mm: drop CONFIG_HAVE_ARCH_NODEDATA_EXTENSIONMike Rapoport (Microsoft)1-48/+0
There are no users of HAVE_ARCH_NODEDATA_EXTENSION left, so arch_alloc_nodedata() and arch_refresh_nodedata() are not needed anymore. Replace the call to arch_alloc_nodedata() in free_area_init() with a new helper alloc_offline_node_data(), remove arch_refresh_nodedata() and cleanup include/linux/memory_hotplug.h from the associated ifdefery. Link: https://lkml.kernel.org/r/20240807064110.1003856-9-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64 Acked-by: Dan Williams <dan.j.williams@intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David S. Miller <davem@davemloft.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Rob Herring (Arm) <robh@kernel.org> Cc: Samuel Holland <samuel.holland@sifive.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03mm/memory_hotplug: skip adjust_managed_page_count() for PageOffline() pages ↵David Hildenbrand1-2/+2
when offlining We currently have a hack for virtio-mem in place to handle memory offlining with PageOffline pages for which we already adjusted the managed page count. Let's enlighten memory offlining code so we can get rid of that hack, and document the situation. Link: https://lkml.kernel.org/r/20240607090939.89524-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: Alexander Potapenko <glider@google.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Eugenio Pérez <eperezma@redhat.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Marco Elver <elver@google.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Wei Liu <wei.liu@kernel.org> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-22mm/memory_hotplug: export mhp_supports_memmap_on_memory()Vishal Verma1-0/+6
In preparation for adding sysfs ABI to toggle memmap_on_memory semantics for drivers adding memory, export the mhp_supports_memmap_on_memory() helper. This allows drivers to check if memmap_on_memory support is available before trying to request it, and display an appropriate message if it isn't available. As part of this, remove the size argument to this - with recent updates to allow memmap_on_memory for larger ranges, and the internal splitting of altmaps into respective memory blocks, the size argument is meaningless. [akpm@linux-foundation.org: fix build] Link: https://lkml.kernel.org/r/20240124-vv-dax_abi-v7-4-20d16cb8d23d@intel.com Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Li Zhijian <lizhijian@fujitsu.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Huang Ying <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-02-21mm/memory_hotplug: introduce MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiersSumanth Korikkar1-1/+17
Patch series "implement "memmap on memory" feature on s390". This series provides "memmap on memory" support on s390 platform. "memmap on memory" allows struct pages array to be allocated from the hotplugged memory range instead of allocating it from main system memory. s390 currently preallocates struct pages array for all potentially possible memory, which ensures memory onlining always succeeds, but with the cost of significant memory consumption from the available system memory during boottime. In certain extreme configuration, this could lead to ipl failure. "memmap on memory" ensures struct pages array are populated from self contained hotplugged memory range instead of depleting the available system memory and this could eliminate ipl failure on s390 platform. On other platforms, system might go OOM when the physically hotplugged memory depletes the available memory before it is onlined. Hence, "memmap on memory" feature was introduced as described in commit a08a2ae34613 ("mm,memory_hotplug: allocate memmap from the added memory range"). Unlike other architectures, s390 memory blocks are not physically accessible until it is online. To make it physically accessible two new memory notifiers MEM_PREPARE_ONLINE / MEM_FINISH_OFFLINE are added and this notifier lets the hypervisor inform that the memory should be made physically accessible. This allows for "memmap on memory" initialization during memory hotplug onlining phase, which is performed before calling MEM_GOING_ONLINE notifier. Patch 1 introduces MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE memory notifiers to prepare the transition of memory to and from a physically accessible state. New mhp_flag MHP_OFFLINE_INACCESSIBLE is introduced to ensure altmap cannot be written when adding memory - before it is set online. This enhancement is crucial for implementing the "memmap on memory" feature for s390 in a subsequent patch. Patches 2 allocates vmemmap pages from self-contained memory range for s390. It allocates memory map (struct pages array) from the hotplugged memory range, rather than using system memory by passing altmap to vmemmap functions. Patch 3 removes unhandled memory notifier types on s390. Patch 4 implements MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE memory notifiers on s390. MEM_PREPARE_ONLINE memory notifier makes memory block physical accessible via sclp assign command. The notifier ensures self-contained memory maps are accessible and hence enabling the "memmap on memory" on s390. MEM_FINISH_OFFLINE memory notifier shifts the memory block to an inaccessible state via sclp unassign command. Patch 5 finally enables MHP_MEMMAP_ON_MEMORY on s390. This patch (of 5): Introduce MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE memory notifiers to prepare the transition of memory to and from a physically accessible state. This enhancement is crucial for implementing the "memmap on memory" feature for s390 in a subsequent patch. Platforms such as x86 can support physical memory hotplug via ACPI. When there is physical memory hotplug, ACPI event leads to the memory addition with the following callchain: acpi_memory_device_add() -> acpi_memory_enable_device() -> __add_memory() After this, the hotplugged memory is physically accessible, and altmap support prepared, before the "memmap on memory" initialization in memory_block_online() is called. On s390, memory hotplug works in a different way. The available hotplug memory has to be defined upfront in the hypervisor, but it is made physically accessible only when the user sets it online via sysfs, currently in the MEM_GOING_ONLINE notifier. This is too late and "memmap on memory" initialization is performed before calling MEM_GOING_ONLINE notifier. During the memory hotplug addition phase, altmap support is prepared and during the memory onlining phase s390 requires memory to be physically accessible and then subsequently initiate the "memmap on memory" initialization process. The memory provider will handle new MEM_PREPARE_ONLINE / MEM_FINISH_OFFLINE notifications and make the memory accessible. The mhp_flag MHP_OFFLINE_INACCESSIBLE is introduced and is relevant when used along with MHP_MEMMAP_ON_MEMORY, because the altmap cannot be written (e.g., poisoned) when adding memory -- before it is set online. This allows for adding memory with an altmap that is not currently made available by a hypervisor. When onlining that memory, the hypervisor can be instructed to make that memory accessible via the new notifiers and the onlining phase will not require any memory allocations, which is helpful in low-memory situations. All architectures ignore unknown memory notifiers. Therefore, the introduction of these new notifiers does not result in any functional modifications across architectures. Link: https://lkml.kernel.org/r/20240108132747.3238763-1-sumanthk@linux.ibm.com Link: https://lkml.kernel.org/r/20240108132747.3238763-2-sumanthk@linux.ibm.com Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Suggested-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm/memory_hotplug: allow memmap on memory hotplug request to fallbackAneesh Kumar K.V1-1/+2
If not supported, fallback to not using memap on memmory. This avoids the need for callers to do the fallback. Link: https://lkml.kernel.org/r/20230808091501.287660-3-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-19mm/sparse: remove unused parameters in sparse_remove_section()Yajun Deng1-3/+2
These parameters ms and map_offset are not used in sparse_remove_section(), so remove them. The __remove_section() is only called by __remove_pages(), remove it. And put the WARN_ON_ONCE() in sparse_remove_section(). Link: https://lkml.kernel.org/r/20230607023952.2247489-1-yajun.deng@linux.dev Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09mm: page_alloc: move set_zone_contiguous() into mm_init.cKefeng Wang1-3/+0
set_zone_contiguous() is only used in mm init/hotplug, and clear_zone_contiguous() only used in hotplug, move them from page_alloc.c to the more appropriate file. Link: https://lkml.kernel.org/r/20230516063821.121844-4-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Len Brown <len.brown@intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Machek <pavel@ucw.cz> Cc: Rafael J. Wysocki <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03mm, memory_hotplug: remove obsolete generic_free_nodedata()Miaohe Lin1-8/+0
Commit 390511e1476e ("mm, memory_hotplug: drop arch_free_nodedata") drops the last caller of generic_free_nodedata(). Remove it too. Link: https://lkml.kernel.org/r/20220916072257.9639-11-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11mm: fix null-ptr-deref in kswapd_is_running()Kefeng Wang1-0/+20
kswapd_run/stop() will set pgdat->kswapd to NULL, which could race with kswapd_is_running() in kcompactd(), kswapd_run/stop() kcompactd() kswapd_is_running() pgdat->kswapd // error or nomal ptr verify pgdat->kswapd // load non-NULL pgdat->kswapd pgdat->kswapd = NULL task_is_running(pgdat->kswapd) // Null pointer derefence KASAN reports the null-ptr-deref shown below, vmscan: Failed to start kswapd on node 0 ... BUG: KASAN: null-ptr-deref in kcompactd+0x440/0x504 Read of size 8 at addr 0000000000000024 by task kcompactd0/37 CPU: 0 PID: 37 Comm: kcompactd0 Kdump: loaded Tainted: G OE 5.10.60 #1 Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 Call trace: dump_backtrace+0x0/0x394 show_stack+0x34/0x4c dump_stack+0x158/0x1e4 __kasan_report+0x138/0x140 kasan_report+0x44/0xdc __asan_load8+0x94/0xd0 kcompactd+0x440/0x504 kthread+0x1a4/0x1f0 ret_from_fork+0x10/0x18 At present kswapd/kcompactd_run() and kswapd/kcompactd_stop() are protected by mem_hotplug_begin/done(), but without kcompactd(). There is no need to involve memory hotplug lock in kcompactd(), so let's add a new mutex to protect pgdat->kswapd accesses. Also, because the kcompactd task will check the state of kswapd task, it's better to call kcompactd_stop() before kswapd_stop() to reduce lock conflicts. [akpm@linux-foundation.org: add comments] Link: https://lkml.kernel.org/r/20220827111959.186838-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11mm: kill is_memblock_offlined()Kefeng Wang1-2/+0
Directly check state of struct memory_block, no need a single function. Link: https://lkml.kernel.org/r/20220827112043.187028-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-03mm: memory_hotplug: make hugetlb_optimize_vmemmap compatible with ↵Muchun Song1-9/+0
memmap_on_memory For now, the feature of hugetlb_free_vmemmap is not compatible with the feature of memory_hotplug.memmap_on_memory, and hugetlb_free_vmemmap takes precedence over memory_hotplug.memmap_on_memory. However, someone wants to make memory_hotplug.memmap_on_memory takes precedence over hugetlb_free_vmemmap since memmap_on_memory makes it more likely to succeed memory hotplug in close-to-OOM situations. So the decision of making hugetlb_free_vmemmap take precedence is not wise and elegant. The proper approach is to have hugetlb_vmemmap.c do the check whether the section which the HugeTLB pages belong to can be optimized. If the section's vmemmap pages are allocated from the added memory block itself, hugetlb_free_vmemmap should refuse to optimize the vmemmap, otherwise, do the optimization. Then both kernel parameters are compatible. So this patch introduces VmemmapSelfHosted to mask any non-optimizable vmemmap pages. The hugetlb_vmemmap can use this flag to detect if a vmemmap page can be optimized. [songmuchun@bytedance.com: walk vmemmap page tables to avoid false-positive] Link: https://lkml.kernel.org/r/20220620110616.12056-3-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20220617135650.74901-3-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Co-developed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13mm: hugetlb_vmemmap: add hugetlb_optimize_vmemmap sysctlMuchun Song1-0/+9
We must add hugetlb_free_vmemmap=on (or "off") to the boot cmdline and reboot the server to enable or disable the feature of optimizing vmemmap pages associated with HugeTLB pages. However, rebooting usually takes a long time. So add a sysctl to enable or disable the feature at runtime without rebooting. Why we need this? There are 3 use cases. 1) The feature of minimizing overhead of struct page associated with each HugeTLB is disabled by default without passing "hugetlb_free_vmemmap=on" to the boot cmdline. When we (ByteDance) deliver the servers to the users who want to enable this feature, they have to configure the grub (change boot cmdline) and reboot the servers, whereas rebooting usually takes a long time (we have thousands of servers). It's a very bad experience for the users. So we need a approach to enable this feature after rebooting. This is a use case in our practical environment. 2) Some use cases are that HugeTLB pages are allocated 'on the fly' instead of being pulled from the HugeTLB pool, those workloads would be affected with this feature enabled. Those workloads could be identified by the characteristics of they never explicitly allocating huge pages with 'nr_hugepages' but only set 'nr_overcommit_hugepages' and then let the pages be allocated from the buddy allocator at fault time. We can confirm it is a real use case from the commit 099730d67417. For those workloads, the page fault time could be ~2x slower than before. We suspect those users want to disable this feature if the system has enabled this before and they don't think the memory savings benefit is enough to make up for the performance drop. 3) If the workload which wants vmemmap pages to be optimized and the workload which wants to set 'nr_overcommit_hugepages' and does not want the extera overhead at fault time when the overcommitted pages be allocated from the buddy allocator are deployed in the same server. The user could enable this feature and set 'nr_hugepages' and 'nr_overcommit_hugepages', then disable the feature. In this case, the overcommited HugeTLB pages will not encounter the extra overhead at fault time. Link: https://lkml.kernel.org/r/20220512041142.39501-5-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-04-28mm/sparse-vmemmap: add a pgmap argument to section activationJoao Martins1-1/+4
Patch series "sparse-vmemmap: memory savings for compound devmaps (device-dax)", v9. This series minimizes 'struct page' overhead by pursuing a similar approach as Muchun Song series "Free some vmemmap pages of hugetlb page" (now merged since v5.14), but applied to devmap with @vmemmap_shift (device-dax). The vmemmap dedpulication original idea (already used in HugeTLB) is to reuse/deduplicate tail page vmemmap areas, particular the area which only describes tail pages. So a vmemmap page describes 64 struct pages, and the first page for a given ZONE_DEVICE vmemmap would contain the head page and 63 tail pages. The second vmemmap page would contain only tail pages, and that's what gets reused across the rest of the subsection/section. The bigger the page size, the bigger the savings (2M hpage -> save 6 vmemmap pages; 1G hpage -> save 4094 vmemmap pages). This is done for PMEM /specifically only/ on device-dax configured namespaces, not fsdax. In other words, a devmap with a @vmemmap_shift. In terms of savings, per 1Tb of memory, the struct page cost would go down with compound devmap: * with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of total memory) * with 1G pages we lose 40MB instead of 16G (0.0014% instead of 1.5% of total memory) The series is mostly summed up by patch 4, and to summarize what the series does: Patches 1 - 3: Minor cleanups in preparation for patch 4. Move the very nice docs of hugetlb_vmemmap.c into a Documentation/vm/ entry. Patch 4: Patch 4 is the one that takes care of the struct page savings (also referred to here as tail-page/vmemmap deduplication). Much like Muchun series, we reuse the second PTE tail page vmemmap areas across a given @vmemmap_shift On important difference though, is that contrary to the hugetlbfs series, there's no vmemmap for the area because we are late-populating it as opposed to remapping a system-ram range. IOW no freeing of pages of already initialized vmemmap like the case for hugetlbfs, which greatly simplifies the logic (besides not being arch-specific). altmap case unchanged and still goes via the vmemmap_populate(). Also adjust the newly added docs to the device-dax case. [Note that device-dax is still a little behind HugeTLB in terms of savings. I have an additional simple patch that reuses the head vmemmap page too, as a follow-up. That will double the savings and namespaces initialization.] Patch 5: Initialize fewer struct pages depending on the page size with DRAM backed struct pages -- because fewer pages are unique and most tail pages (with bigger vmemmap_shift). NVDIMM namespace bootstrap improves from ~268-358 ms to ~80-110/<1ms on 128G NVDIMMs with 2M and 1G respectivally. And struct page needed capacity will be 3.8x / 1071x smaller for 2M and 1G respectivelly. Tested on x86 with 1.5Tb of pmem (including pinning, and RDMA registration/deregistration scalability with 2M MRs) This patch (of 5): In support of using compound pages for devmap mappings, plumb the pgmap down to the vmemmap_populate implementation. Note that while altmap is retrievable from pgmap the memory hotplug code passes altmap without pgmap[*], so both need to be independently plumbed. So in addition to @altmap, pass @pgmap to sparse section populate functions namely: sparse_add_section section_activate populate_section_memmap __populate_section_memmap Passing @pgmap allows __populate_section_memmap() to both fetch the vmemmap_shift in which memmap metadata is created for and also to let sparse-vmemmap fetch pgmap ranges to co-relate to a given section and pick whether to just reuse tail pages from past onlined sections. While at it, fix the kdoc for @altmap for sparse_add_section(). [*] https://lore.kernel.org/linux-mm/20210319092635.6214-1-osalvador@suse.de/ Link: https://lkml.kernel.org/r/20220420155310.9712-1-joao.m.martins@oracle.com Link: https://lkml.kernel.org/r/20220420155310.9712-2-joao.m.martins@oracle.com Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jane Chu <jane.chu@oracle.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-03-22drivers/base/memory: determine and store zone for single-zone memory blocksDavid Hildenbrand1-4/+2
test_pages_in_a_zone() is just another nasty PFN walker that can easily stumble over ZONE_DEVICE memory ranges falling into the same memory block as ordinary system RAM: the memmap of parts of these ranges might possibly be uninitialized. In fact, we observed (on an older kernel) with UBSAN: UBSAN: Undefined behaviour in ./include/linux/mm.h:1133:50 index 7 is out of range for type 'zone [5]' CPU: 121 PID: 35603 Comm: read_all Kdump: loaded Tainted: [...] Hardware name: Dell Inc. PowerEdge R7425/08V001, BIOS 1.12.2 11/15/2019 Call Trace: dump_stack+0x9a/0xf0 ubsan_epilogue+0x9/0x7a __ubsan_handle_out_of_bounds+0x13a/0x181 test_pages_in_a_zone+0x3c4/0x500 show_valid_zones+0x1fa/0x380 dev_attr_show+0x43/0xb0 sysfs_kf_seq_show+0x1c5/0x440 seq_read+0x49d/0x1190 vfs_read+0xff/0x300 ksys_read+0xb8/0x170 do_syscall_64+0xa5/0x4b0 entry_SYSCALL_64_after_hwframe+0x6a/0xdf RIP: 0033:0x7f01f4439b52 We seem to stumble over a memmap that contains a garbage zone id. While we could try inserting pfn_to_online_page() calls, it will just make memory offlining slower, because we use test_pages_in_a_zone() to make sure we're offlining pages that all belong to the same zone. Let's just get rid of this PFN walker and determine the single zone of a memory block -- if any -- for early memory blocks during boot. For memory onlining, we know the single zone already. Let's avoid any additional memmap scanning and just rely on the zone information available during boot. For memory hot(un)plug, we only really care about memory blocks that: * span a single zone (and, thereby, a single node) * are completely System RAM (IOW, no holes, no ZONE_DEVICE) If one of these conditions is not met, we reject memory offlining. Hotplugged memory blocks (starting out offline), always meet both conditions. There are three scenarios to handle: (1) Memory hot(un)plug A memory block with zone == NULL cannot be offlined, corresponding to our previous test_pages_in_a_zone() check. After successful memory onlining/offlining, we simply set the zone accordingly. * Memory onlining: set the zone we just used for onlining * Memory offlining: set zone = NULL So a hotplugged memory block starts with zone = NULL. Once memory onlining is done, we set the proper zone. (2) Boot memory with !CONFIG_NUMA We know that there is just a single pgdat, so we simply scan all zones of that pgdat for an intersection with our memory block PFN range when adding the memory block. If more than one zone intersects (e.g., DMA and DMA32 on x86 for the first memory block) we set zone = NULL and consequently mimic what test_pages_in_a_zone() used to do. (3) Boot memory with CONFIG_NUMA At the point in time we create the memory block devices during boot, we don't know yet which nodes *actually* span a memory block. While we could scan all zones of all nodes for intersections, overlapping nodes complicate the situation and scanning all nodes is possibly expensive. But that problem has already been solved by the code that sets the node of a memory block and creates the link in the sysfs -- do_register_memory_block_under_node(). So, we hook into the code that sets the node id for a memory block. If we already have a different node id set for the memory block, we know that multiple nodes *actually* have PFNs falling into our memory block: we set zone = NULL and consequently mimic what test_pages_in_a_zone() used to do. If there is no node id set, we do the same as (2) for the given node. Note that the call order in driver_init() is: -> memory_dev_init(): create memory block devices -> node_dev_init(): link memory block devices to the node and set the node id So in summary, we detect if there is a single zone responsible for this memory block and we consequently store the zone in that case in the memory block, updating it during memory onlining/offlining. Link: https://lkml.kernel.org/r/20220210184359.235565-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Rafael Parra <rparrazo@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Rafael Parra <rparrazo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22mm, memory_hotplug: reorganize new pgdat initializationMichal Hocko1-1/+1
When a !node_online node is brought up it needs a hotplug specific initialization because the node could be either uninitialized yet or it could have been recycled after previous hotremove. hotadd_init_pgdat is responsible for that. Internal pgdat state is initialized at two places currently - hotadd_init_pgdat - free_area_init_core_hotplug There is no real clear cut what should go where but this patch's chosen to move the whole internal state initialization into free_area_init_core_hotplug. hotadd_init_pgdat is still responsible to pull all the parts together - most notably to initialize zonelists because those depend on the overall topology. This patch doesn't introduce any functional change. Link: https://lkml.kernel.org/r/20220127085305.20890-5-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Rafael Aquini <raquini@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Alexey Makhalov <amakhalov@vmware.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Nico Pache <npache@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22mm, memory_hotplug: drop arch_free_nodedataMichal Hocko1-3/+0
Prior to "mm: handle uninitialized numa nodes gracefully" memory hotplug used to allocate pgdat when memory has been added to a node (hotadd_init_pgdat) arch_free_nodedata has been only used in the failure path because once the pgdat is exported (to be visible by NODA_DATA(nid)) it cannot really be freed because there is no synchronization available for that. pgdat is allocated for each possible nodes now so the memory hotplug doesn't need to do the ever use arch_free_nodedata so drop it. This patch doesn't introduce any functional change. Link: https://lkml.kernel.org/r/20220127085305.20890-4-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Rafael Aquini <raquini@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Alexey Makhalov <amakhalov@vmware.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Nico Pache <npache@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22mm: handle uninitialized numa nodes gracefullyMichal Hocko1-1/+1
We have had several reports [1][2][3] that page allocator blows up when an allocation from a possible node is requested. The underlying reason is that NODE_DATA for the specific node is not allocated. NUMA specific initialization is arch specific and it can vary a lot. E.g. x86 tries to initialize all nodes that have some cpu affinity (see init_cpu_to_node) but this can be insufficient because the node might be cpuless for example. One way to address this problem would be to check for !node_online nodes when trying to get a zonelist and silently fall back to another node. That is unfortunately adding a branch into allocator hot path and it doesn't handle any other potential NODE_DATA users. This patch takes a different approach (following a lead of [3]) and it pre allocates pgdat for all possible nodes in an arch indipendent code - free_area_init. All uninitialized nodes are treated as memoryless nodes. node_state of the node is not changed because that would lead to other side effects - e.g. sysfs representation of such a node and from past discussions [4] it is known that some tools might have problems digesting that. Newly allocated pgdat only gets a minimal initialization and the rest of the work is expected to be done by the memory hotplug - hotadd_new_pgdat (renamed to hotadd_init_pgdat). generic_alloc_nodedata is changed to use the memblock allocator because neither page nor slab allocators are available at the stage when all pgdats are allocated. Hotplug doesn't allocate pgdat anymore so we can use the early boot allocator. The only arch specific implementation is ia64 and that is changed to use the early allocator as well. [1] http://lkml.kernel.org/r/20211101201312.11589-1-amakhalov@vmware.com [2] http://lkml.kernel.org/r/20211207224013.880775-1-npache@redhat.com [3] http://lkml.kernel.org/r/20190114082416.30939-1-mhocko@kernel.org [4] http://lkml.kernel.org/r/20200428093836.27190-1-srikar@linux.vnet.ibm.com [akpm@linux-foundation.org: replace comment, per Mike] Link: https://lkml.kernel.org/r/Yfe7RBeLCijnWBON@dhcp22.suse.cz Reported-by: Alexey Makhalov <amakhalov@vmware.com> Tested-by: Alexey Makhalov <amakhalov@vmware.com> Reported-by: Nico Pache <npache@redhat.com> Acked-by: Rafael Aquini <raquini@redhat.com> Tested-by: Rafael Aquini <raquini@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22mm, memory_hotplug: make arch_alloc_nodedata independent on ↵Michal Hocko1-60/+59
CONFIG_MEMORY_HOTPLUG Patch series "mm, memory_hotplug: handle unitialized numa node gracefully". The core of the fix is patch 2 which also links existing bug reports. The high level goal is to have all possible numa nodes have their pgdat allocated and initialized so for_each_possible_node(nid) NODE_DATA(nid) will never return garbage. This has proven to be problem in several places when an offline numa node is used for an allocation just to realize that node_data and therefore allocation fallback zonelists are not initialized and such an allocation request blows up. There were attempts to address that by checking node_online in several places including the page allocator. This patchset approaches the problem from a different perspective and instead of special casing, which just adds a runtime overhead, it allocates pglist_data for each possible node. This can add some memory overhead for platforms with high number of possible nodes if they do not contain any memory. This should be a rather rare configuration though. How to test this? David has provided and excellent howto: http://lkml.kernel.org/r/6e5ebc19-890c-b6dd-1924-9f25c441010d@redhat.com Patches 1 and 3-6 are mostly cleanups. The patchset has been reviewed by Rafael (thanks!) and the core fix tested by Rafael and Alexey (thanks to both). David has tested as per instructions above and hasn't found any fallouts in the memory hotplug scenarios. This patch (of 6): This is a preparatory patch and it doesn't introduce any functional change. It merely pulls out arch_alloc_nodedata (and co) outside of CONFIG_MEMORY_HOTPLUG because the following patch will need to call this from the generic MM code. Link: https://lkml.kernel.org/r/20220127085305.20890-1-mhocko@kernel.org Link: https://lkml.kernel.org/r/20220127085305.20890-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Rafael Aquini <raquini@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: Alexey Makhalov <amakhalov@vmware.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Nico Pache <npache@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-06mm/memory_hotplug: remove stale function declarationsDavid Hildenbrand1-3/+0
These functions no longer exist. Link: https://lkml.kernel.org/r/20210929143600.49379-6-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Alex Shi <alexs@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm/memory_hotplug: memory group aware "auto-movable" online policyDavid Hildenbrand1-1/+2
Use memory groups to improve our "auto-movable" onlining policy: 1. For static memory groups (e.g., a DIMM), online a memory block MOVABLE only if all other memory blocks in the group are either MOVABLE or could be onlined MOVABLE. A DIMM will either be MOVABLE or not, not a mixture. 2. For dynamic memory groups (e.g., a virtio-mem device), online a memory block MOVABLE only if all other memory blocks inside the current unit are either MOVABLE or could be onlined MOVABLE. For a virtio-mem device with a device block size with 512 MiB, all 128 MiB memory blocks wihin a 512 MiB unit will either be MOVABLE or not, not a mixture. We have to pass the memory group to zone_for_pfn_range() to take the memory group into account. Note: for now, there seems to be no compelling reason to make this behavior configurable. Link: https://lkml.kernel.org/r/20210806124715.17090-9-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hui Zhu <teawater@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Marek Kedzierski <mkedzier@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm/memory_hotplug: track present pages in memory groupsDavid Hildenbrand1-4/+9
Let's track all present pages in each memory group. Especially, track memory present in ZONE_MOVABLE and memory present in one of the kernel zones (which really only is ZONE_NORMAL right now as memory groups only apply to hotplugged memory) separately within a memory group, to prepare for making smart auto-online decision for individual memory blocks within a memory group based on group statistics. Link: https://lkml.kernel.org/r/20210806124715.17090-5-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hui Zhu <teawater@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Marek Kedzierski <mkedzier@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08drivers/base/memory: introduce "memory groups" to logically group memory blocksDavid Hildenbrand1-0/+5
In our "auto-movable" memory onlining policy, we want to make decisions across memory blocks of a single memory device. Examples of memory devices include ACPI memory devices (in the simplest case a single DIMM) and virtio-mem. For now, we don't have a connection between a single memory block device and the real memory device. Each memory device consists of 1..X memory block devices. Let's logically group memory blocks belonging to the same memory device in "memory groups". Memory groups can span multiple physical ranges and a memory group itself does not contain any information regarding physical ranges, only properties (e.g., "max_pages") necessary for improved memory onlining. Introduce two memory group types: 1) Static memory group: E.g., a single ACPI memory device, consisting of 1..X memory resources. A memory group consists of 1..Y memory blocks. The whole group is added/removed in one go. If any part cannot get offlined, the whole group cannot be removed. 2) Dynamic memory group: E.g., a single virtio-mem device. Memory is dynamically added/removed in a fixed granularity, called a "unit", consisting of 1..X memory blocks. A unit is added/removed in one go. If any part of a unit cannot get offlined, the whole unit cannot be removed. In case of 1) we usually want either all memory managed by ZONE_MOVABLE or none. In case of 2) we usually want to have as many units as possible managed by ZONE_MOVABLE. We want a single unit to be of the same type. For now, memory groups are an internal concept that is not exposed to user space; we might want to change that in the future, though. add_memory() users can specify a mgid instead of a nid when passing the MHP_NID_IS_MGID flag. Link: https://lkml.kernel.org/r/20210806124715.17090-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hui Zhu <teawater@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Marek Kedzierski <mkedzier@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-09-08mm: track present early pages per zoneDavid Hildenbrand1-1/+1
Patch series "mm/memory_hotplug: