aboutsummaryrefslogtreecommitdiff
path: root/fs/proc
AgeCommit message (Collapse)AuthorFilesLines
2019-12-11sched/cputime, proc/stat: Fix incorrect guest nice cpustat valueFlavio Leitner1-2/+2
The value being used for guest_nice should be CPUTIME_GUEST_NICE and not CPUTIME_USER. Fixes: 26dae145a76c ("procfs: Use all-in-one vtime aware kcpustat accessor") Signed-off-by: Flavio Leitner <fbl@sysclose.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20191205020344.14940-1-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-04fs/proc/Kconfig: fix indentationKrzysztof Kozlowski1-4/+4
Adjust indentation from spaces to tab (+optional two spaces) as in coding style with command like: $ sed -e 's/^ / /' -i */Kconfig [adobriyan@gmail.com: add two spaces where necessary] Link: http://lkml.kernel.org/r/20191124133936.GA5655@avx2 Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04fs/proc/internal.h: shuffle "struct pde_opener"Alexey Dobriyan1-1/+1
List iteration takes more code than anything else which means embedded list_head should be the first element of the structure. Space savings: add/remove: 0/0 grow/shrink: 0/4 up/down: 0/-18 (-18) Function old new delta close_pdeo 228 227 -1 proc_reg_release 86 82 -4 proc_entry_rundown 143 139 -4 proc_reg_open 298 289 -9 Link: http://lkml.kernel.org/r/20191004234753.GB30246@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04fs/proc/generic.c: delete useless "len" variableAlexey Dobriyan1-4/+2
Pointer to next '/' encodes length of path element and next start position. Subtraction and increment are redundant. Link: http://lkml.kernel.org/r/20191004234521.GA30246@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-04proc: change ->nlink under proc_subdir_lockAlexey Dobriyan1-16/+15
Currently gluing PDE into global /proc tree is done under lock, but changing ->nlink is not. Additionally struct proc_dir_entry::nlink is not atomic so updates can be lost. Link: http://lkml.kernel.org/r/20190925202436.GA17388@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-03Merge branch 'timers-core-for-linus' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Ingo Molnar: "The main changes in the timer code in this cycle were: - Clockevent updates: - timer-of framework cleanups. (Geert Uytterhoeven) - Use timer-of for the renesas-ostm and the device name to prevent name collision in case of multiple timers. (Geert Uytterhoeven) - Check if there is an error after calling of_clk_get in asm9260 (Chuhong Yuan) - ABI fix: Zero out high order bits of nanoseconds on compat syscalls. This got broken a year ago, with apparently no side effects so far. Since the kernel would use random data otherwise I don't think we'd have other options but to fix the bug, even if there was a side effect to applications (Dmitry Safonov) - Optimize ns_to_timespec64() on 32-bit systems: move away from div_s64_rem() which can be slow, to div_u64_rem() which is faster (Arnd Bergmann) - Annotate KCSAN-reported false positive data races in hrtimer_is_queued() users by moving timer->state handling over to the READ_ONCE()/WRITE_ONCE() APIs. This documents these accesses (Eric Dumazet) - Misc cleanups and small fixes" [ I undid the "ABI fix" and updated the comments instead. The reason there were apparently no side effects is that the fix was a no-op. The updated comment is to say _why_ it was a no-op. - Linus ] * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: time: Zero the upper 32-bits in __kernel_timespec on 32-bit time: Rename tsk->real_start_time to ->start_boottime hrtimer: Remove the comment about not used HRTIMER_SOFTIRQ time: Fix spelling mistake in comment time: Optimize ns_to_timespec64() hrtimer: Annotate lockless access to timer->state clocksource/drivers/asm9260: Add a check for of_clk_get clocksource/drivers/renesas-ostm: Use unique device name instead of ostm clocksource/drivers/renesas-ostm: Convert to timer_of clocksource/drivers/timer-of: Use unique device name instead of timer clocksource/drivers/timer-of: Convert last full_name to %pOF
2019-11-21procfs: Use all-in-one vtime aware kcpustat accessorFrederic Weisbecker1-25/+31
Now that we can read also user and guest time safely under vtime, use the relevant accessor to fix frozen kcpustat values on nohz_full CPUs. Reported-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Wanpeng Li <wanpengli@tencent.com> Link: https://lkml.kernel.org/r/20191121024430.19938-4-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-13time: Rename tsk->real_start_time to ->start_boottimePeter Zijlstra1-1/+1
Since it stores CLOCK_BOOTTIME, not, as the name suggests, CLOCK_REALTIME, let's rename ->real_start_time to ->start_bootime. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-11Merge tag 'v5.4-rc7' into sched/core, to pick up fixesIngo Molnar2-14/+18
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-29procfs: Use vtime aware kcpustat accessor to fetch CPUTIME_SYSTEMFrederic Weisbecker1-2/+2
Now that we have a vtime safe kcpustat accessor for CPUTIME_SYSTEM, use it to start fixing frozen kcpustat values on nohz_full CPUs. Reported-by: Yauheni Kaliuta <yauheni.kaliuta@redhat.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Jacek Anaszewski <jacek.anaszewski@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rjw@rjwysocki.net> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Wanpeng Li <wanpengli@tencent.com> Link: https://lkml.kernel.org/r/20191016025700.31277-13-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-19proc/meminfo: fix output alignmentKirill A. Shutemov1-2/+2
Patch series "Fixes for THP in page cache", v2. This patch (of 5): Add extra space for FileHugePages and FilePmdMapped, so the output is aligned with other rows. Link: http://lkml.kernel.org/r/20191017164223.2762148-2-songliubraving@fb.com Fixes: 60fbf0ab5da1 ("mm,thp: stats for file backed THP") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Song Liu <songliubraving@fb.com> Tested-by: Song Liu <songliubraving@fb.com> Acked-by: Yang Shi <yang.shi@linux.alibaba.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19fs/proc/page.c: don't access uninitialized memmaps in fs/proc/page.cDavid Hildenbrand1-12/+16
There are three places where we access uninitialized memmaps, namely: - /proc/kpagecount - /proc/kpageflags - /proc/kpagecgroup We have initialized memmaps either when the section is online or when the page was initialized to the ZONE_DEVICE. Uninitialized memmaps contain garbage and in the worst case trigger kernel BUGs, especially with CONFIG_PAGE_POISONING. For example, not onlining a DIMM during boot and calling /proc/kpagecount with CONFIG_PAGE_POISONING: :/# cat /proc/kpagecount > tmp.test BUG: unable to handle page fault for address: fffffffffffffffe #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 114616067 P4D 114616067 PUD 114618067 PMD 0 Oops: 0000 [#1] SMP NOPTI CPU: 0 PID: 469 Comm: cat Not tainted 5.4.0-rc1-next-20191004+ #11 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4 RIP: 0010:kpagecount_read+0xce/0x1e0 Code: e8 09 83 e0 3f 48 0f a3 02 73 2d 4c 89 e7 48 c1 e7 06 48 03 3d ab 51 01 01 74 1d 48 8b 57 08 480 RSP: 0018:ffffa14e409b7e78 EFLAGS: 00010202 RAX: fffffffffffffffe RBX: 0000000000020000 RCX: 0000000000000000 RDX: 0000000000000001 RSI: 00007f76b5595000 RDI: fffff35645000000 RBP: 00007f76b5595000 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000 R13: 0000000000020000 R14: 00007f76b5595000 R15: ffffa14e409b7f08 FS: 00007f76b577d580(0000) GS:ffff8f41bd400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: fffffffffffffffe CR3: 0000000078960000 CR4: 00000000000006f0 Call Trace: proc_reg_read+0x3c/0x60 vfs_read+0xc5/0x180 ksys_read+0x68/0xe0 do_syscall_64+0x5c/0xa0 entry_SYSCALL_64_after_hwframe+0x49/0xbe For now, let's drop support for ZONE_DEVICE from the three pseudo files in order to fix this. To distinguish offline memory (with garbage memmap) from ZONE_DEVICE memory with properly initialized memmaps, we would have to check get_dev_pagemap() and pfn_zone_device_reserved() right now. The usage of both (especially, special casing devmem) is frowned upon and needs to be reworked. The fundamental issue we have is: if (pfn_to_online_page(pfn)) { /* memmap initialized */ } else if (pfn_valid(pfn)) { /* * ??? * a) offline memory. memmap garbage. * b) devmem: memmap initialized to ZONE_DEVICE. * c) devmem: reserved for driver. memmap garbage. * (d) devmem: memmap currently initializing - garbage) */ } We'll leave the pfn_zone_device_reserved() check in stable_page_flags() in place as that function is also used from memory failure. We now no longer dump information about pages that are not in use anymore - offline. Link: http://lkml.kernel.org/r/20191009142435.3975-2-david@redhat.com Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319] Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Qian Cai <cai@lca.pw> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Toshiki Fukasawa <t-fukasawa@vx.jp.nec.com> Cc: Pankaj gupta <pagupta@redhat.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: <stable@vger.kernel.org> [4.13+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-28Merge branch 'next-lockdown' of ↵Linus Torvalds1-0/+6
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull kernel lockdown mode from James Morris: "This is the latest iteration of the kernel lockdown patchset, from Matthew Garrett, David Howells and others. From the original description: This patchset introduces an optional kernel lockdown feature, intended to strengthen the boundary between UID 0 and the kernel. When enabled, various pieces of kernel functionality are restricted. Applications that rely on low-level access to either hardware or the kernel may cease working as a result - therefore this should not be enabled without appropriate evaluation beforehand. The majority of mainstream distributions have been carrying variants of this patchset for many years now, so there's value in providing a doesn't meet every distribution requirement, but gets us much closer to not requiring external patches. There are two major changes since this was last proposed for mainline: - Separating lockdown from EFI secure boot. Background discussion is covered here: https://lwn.net/Articles/751061/ - Implementation as an LSM, with a default stackable lockdown LSM module. This allows the lockdown feature to be policy-driven, rather than encoding an implicit policy within the mechanism. The new locked_down LSM hook is provided to allow LSMs to make a policy decision around whether kernel functionality that would allow tampering with or examining the runtime state of the kernel should be permitted. The included lockdown LSM provides an implementation with a simple policy intended for general purpose use. This policy provides a coarse level of granularity, controllable via the kernel command line: lockdown={integrity|confidentiality} Enable the kernel lockdown feature. If set to integrity, kernel features that allow userland to modify the running kernel are disabled. If set to confidentiality, kernel features that allow userland to extract confidential information from the kernel are also disabled. This may also be controlled via /sys/kernel/security/lockdown and overriden by kernel configuration. New or existing LSMs may implement finer-grained controls of the lockdown features. Refer to the lockdown_reason documentation in include/linux/security.h for details. The lockdown feature has had signficant design feedback and review across many subsystems. This code has been in linux-next for some weeks, with a few fixes applied along the way. Stephen Rothwell noted that commit 9d1f8be5cf42 ("bpf: Restrict bpf when kernel lockdown is in confidentiality mode") is missing a Signed-off-by from its author. Matthew responded that he is providing this under category (c) of the DCO" * 'next-lockdown' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (31 commits) kexec: Fix file verification on S390 security: constify some arrays in lockdown LSM lockdown: Print current->comm in restriction messages efi: Restrict efivar_ssdt_load when the kernel is locked down tracefs: Restrict tracefs when the kernel is locked down debugfs: Restrict debugfs when the kernel is locked down kexec: Allow kexec_file() with appropriate IMA policy when locked down lockdown: Lock down perf when in confidentiality mode bpf: Restrict bpf when kernel lockdown is in confidentiality mode lockdown: Lock down tracing and perf kprobes when in confidentiality mode lockdown: Lock down /proc/kcore x86/mmiotrace: Lock down the testmmiotrace module lockdown: Lock down module params that specify hardware parameters (eg. ioport) lockdown: Lock down TIOCSSERIAL lockdown: Prohibit PCMCIA CIS storage when the kernel is locked down acpi: Disable ACPI table override if the kernel is locked down acpi: Ignore acpi_rsdp kernel param when the kernel has been locked down ACPI: Limit access to custom_method when the kernel is locked down x86/msr: Restrict MSR access when the kernel is locked down x86: Lock down IO port access when the kernel is locked down ...
2019-09-24mm,thp: stats for file backed THPSong Liu2-1/+7
In preparation for non-shmem THP, this patch adds a few stats and exposes them in /proc/meminfo, /sys/bus/node/devices/<node>/meminfo, and /proc/<pid>/task/<tid>/smaps. This patch is mostly a rewrite of Kirill A. Shutemov's earlier version: https://lkml.kernel.org/r/20170126115819.58875-5-kirill.shutemov@linux.intel.com/ Link: http://lkml.kernel.org/r/20190801184244.3169074-5-songliubraving@fb.com Signed-off-by: Song Liu <songliubraving@fb.com> Acked-by: Rik van Riel <riel@surriel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Hugh Dickins <hughd@google.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24mm: remove quicklist page table cachesNicholas Piggin1-4/+0
Patch series "mm: remove quicklist page table caches". A while ago Nicholas proposed to remove quicklist page table caches [1]. I've rebased his patch on the curren upstream and switched ia64 and sh to use generic versions of PTE allocation. [1] https://lore.kernel.org/linux-mm/20190711030339.20892-1-npiggin@gmail.com This patch (of 3): Remove page table allocator "quicklists". These have been around for a long time, but have not got much traction in the last decade and are only used on ia64 and sh architectures. The numbers in the initial commit look interesting but probably don't apply anymore. If anybody wants to resurrect this it's in the git history, but it's unhelpful to have this code and divergent allocator behaviour for minor archs. Also it might be better to instead make more general improvements to page allocator if this is still so slow. Link: http://lkml.kernel.org/r/1565250728-21721-2-git-send-email-rppt@linux.ibm.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24mm: introduce compound_nr()Matthew Wilcox (Oracle)1-1/+1
Replace 1 << compound_order(page) with compound_nr(page). Minor improvements in readability. Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-21Merge tag 'for-linus-hmm' of ↵Linus Torvalds1-38/+42
git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma Pull hmm updates from Jason Gunthorpe: "This is more cleanup and consolidation of the hmm APIs and the very strongly related mmu_notifier interfaces. Many places across the tree using these interfaces are touched in the process. Beyond that a cleanup to the page walker API and a few memremap related changes round out the series: - General improvement of hmm_range_fault() and related APIs, more documentation, bug fixes from testing, API simplification & consolidation, and unused API removal - Simplify the hmm related kconfigs to HMM_MIRROR and DEVICE_PRIVATE, and make them internal kconfig selects - Hoist a lot of code related to mmu notifier attachment out of drivers by using a refcount get/put attachment idiom and remove the convoluted mmu_notifier_unregister_no_release() and related APIs. - General API improvement for the migrate_vma API and revision of its only user in nouveau - Annotate mmu_notifiers with lockdep and sleeping region debugging Two series unrelated to HMM or mmu_notifiers came along due to dependencies: - Allow pagemap's memremap_pages family of APIs to work without providing a struct device - Make walk_page_range() and related use a constant structure for function pointers" * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (75 commits) libnvdimm: Enable unit test infrastructure compile checks mm, notifier: Catch sleeping/blocking for !blockable kernel.h: Add non_block_start/end() drm/radeon: guard against calling an unpaired radeon_mn_unregister() csky: add missing brackets in a macro for tlb.h pagewalk: use lockdep_assert_held for locking validation pagewalk: separate function pointers from iterator data mm: split out a new pagewalk.h header from mm.h mm/mmu_notifiers: annotate with might_sleep() mm/mmu_notifiers: prime lockdep mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end mm/mmu_notifiers: remove the __mmu_notifier_invalidate_range_start/end exports mm/hmm: hmm_range_fault() infinite loop mm/hmm: hmm_range_fault() NULL pointer bug mm/hmm: fix hmm_range_fault()'s handling of swapped out pages mm/mmu_notifiers: remove unregister_no_release RDMA/odp: remove ib_ucontext from ib_umem RDMA/odp: use mmu_notifier_get/put for 'struct ib_ucontext_per_mm' RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr RDMA/mlx5: Use ib_umem_start instead of umem.address ...
2019-09-20Merge tag 'powerpc-5.4-1' of ↵Linus Torvalds1-4/+4
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "This is a bit late, partly due to me travelling, and partly due to a power outage knocking out some of my test systems *while* I was travelling. - Initial support for running on a system with an Ultravisor, which is software that runs below the hypervisor and protects guests against some attacks by the hypervisor. - Support for building the kernel to run as a "Secure Virtual Machine", ie. as a guest capable of running on a system with an Ultravisor. - Some changes to our DMA code on bare metal, to allow devices with medium sized DMA masks (> 32 && < 59 bits) to use more than 2GB of DMA space. - Support for firmware assisted crash dumps on bare metal (powernv). - Two series fixing bugs in and refactoring our PCI EEH code. - A large series refactoring our exception entry code to use gas macros, both to make it more readable and also enable some future optimisations. As well as many cleanups and other minor features & fixups. Thanks to: Adam Zerella, Alexey Kardashevskiy, Alistair Popple, Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Anshuman Khandual, Balbir Singh, Benjamin Herrenschmidt, Cédric Le Goater, Christophe JAILLET, Christophe Leroy, Christopher M. Riedl, Christoph Hellwig, Claudio Carvalho, Daniel Axtens, David Gibson, David Hildenbrand, Desnes A. Nunes do Rosario, Ganesh Goudar, Gautham R. Shenoy, Greg Kurz, Guerney Hunt, Gustavo Romero, Halil Pasic, Hari Bathini, Joakim Tjernlund, Jonathan Neuschafer, Jordan Niethe, Leonardo Bras, Lianbo Jiang, Madhavan Srinivasan, Mahesh Salgaonkar, Mahesh Salgaonkar, Masahiro Yamada, Maxiwell S. Garcia, Michael Anderson, Nathan Chancellor, Nathan Lynch, Naveen N. Rao, Nicholas Piggin, Oliver O'Halloran, Qian Cai, Ram Pai, Ravi Bangoria, Reza Arbab, Ryan Grimm, Sam Bobroff, Santosh Sivaraj, Segher Boessenkool, Sukadev Bhattiprolu, Thiago Bauermann, Thiago Jung Bauermann, Thomas Gleixner, Tom Lendacky, Vasant Hegde" * tag 'powerpc-5.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (264 commits) powerpc/mm/mce: Keep irqs disabled during lockless page table walk powerpc: Use ftrace_graph_ret_addr() when unwinding powerpc/ftrace: Enable HAVE_FUNCTION_GRAPH_RET_ADDR_PTR ftrace: Look up the address of return_to_handler() using helpers powerpc: dump kernel log before carrying out fadump or kdump docs: powerpc: Add missing documentation reference powerpc/xmon: Fix output of XIVE IPI powerpc/xmon: Improve output of XIVE interrupts powerpc/mm/radix: remove useless kernel messages powerpc/fadump: support holes in kernel boot memory area powerpc/fadump: remove RMA_START and RMA_END macros powerpc/fadump: update documentation about option to release opalcore powerpc/fadump: consider f/w load area powerpc/opalcore: provide an option to invalidate /sys/firmware/opal/core file powerpc/opalcore: export /sys/firmware/opal/core for analysing opal crashes powerpc/fadump: update documentation about CONFIG_PRESERVE_FA_DUMP powerpc/fadump: add support to preserve crash data on FADUMP disabled kernel powerpc/fadump: improve how crashed kernel's memory is reserved powerpc/fadump: consider reserved ranges while releasing memory powerpc/fadump: make crash memory ranges array allocation generic ...
2019-09-07pagewalk: separate function pointers from iterator dataChristoph Hellwig1-37/+41
The mm_walk structure currently mixed data and code. Split out the operations vectors into a new mm_walk_ops structure, and while we are changing the API also declare the mm_walk structure inside the walk_page_range and walk_page_vma functions. Based on patch from Linus Torvalds. Link: https://lore.kernel.org/r/20190828141955.22210-3-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Thomas Hellstrom <thellstrom@vmware.com> Reviewed-by: Steven Price <steven.price@arm.com> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-09-07mm: split out a new pagewalk.h header from mm.hChristoph Hellwig1-1/+1
Add a new header for the two handful of users of the walk_page_range / walk_page_vma interface instead of polluting all users of mm.h with it. Link: https://lore.kernel.org/r/20190828141955.22210-2-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Thomas Hellstrom <thellstrom@vmware.com> Reviewed-by: Steven Price <steven.price@arm.com> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-09-05new helper: get_tree_keyed()Al Viro1-2/+1
For vfs_get_keyed_super users. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-08-19lockdown: Print current->comm in restriction messagesMatthew Garrett1-2/+3
Print the content of current->comm in messages generated by lockdown to indicate a restriction that was hit. This makes it a bit easier to find out what caused the message. The message now patterned something like: Lockdown: <comm>: <what> is restricted; see man kernel_lockdown.7 Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Matthew Garrett <mjg59@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: James Morris <jmorris@namei.org>
2019-08-19lockdown: Lock down /proc/kcoreDavid Howells1-0/+5
Disallow access to /proc/kcore when the kernel is locked down to prevent access to cryptographic data. This is limited to lockdown confidentiality mode and is still permitted in integrity mode. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Matthew Garrett <mjg59@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: James Morris <jmorris@namei.org>
2019-08-09fs/core/vmcore: Move sev_active() reference to x86 arch codeThiago Jung Bauermann1-4/+4
Secure Encrypted Virtualization is an x86-specific feature, so it shouldn't appear in generic kernel code because it forces non-x86 architectures to define the sev_active() function, which doesn't make a lot of sense. To solve this problem, add an x86 elfcorehdr_read() function to override the generic weak implementation. To do that, it's necessary to make read_from_oldmem() public so that it can be used outside of vmcore.c. Also, remove the export for sev_active() since it's only used in files that won't be built as modules. Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Lianbo Jiang <lijiang@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190806044919.10622-6-bauerman@linux.ibm.com
2019-07-19Merge branch 'work.mount0' of ↵Linus Torvalds1-4/+3
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs mount updates from Al Viro: "The first part of mount updates. Convert filesystems to use the new mount API" * 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) mnt_init(): call shmem_init() unconditionally constify ksys_mount() string arguments don't bother with registering rootfs init_rootfs(): don't bother with init_ramfs_fs() vfs: Convert smackfs to use the new mount API vfs: Convert selinuxfs to use the new mount API vfs: Convert securityfs to use the new mount API vfs: Convert apparmorfs to use the new mount API vfs: Convert openpromfs to use the new mount API vfs: Convert xenfs to use the new mount API vfs: Convert gadgetfs to use the new mount API vfs: Convert oprofilefs to use the new mount API vfs: Convert ibmasmfs to use the new mount API vfs: Convert qib_fs/ipathfs to use the new mount API vfs: Convert efivarfs to use the new mount API vfs: Convert configfs to use the new mount API vfs: Convert binfmt_misc to use the new mount API convenience helper: get_tree_single() convenience helper get_tree_nodev() vfs: Kill sget_userns() ...
2019-07-18proc/sysctl: add shared variables for range checkMatteo Croce1-0/+4
In the sysctl code the proc_dointvec_minmax() function is often used to validate the user supplied value between an allowed range. This function uses the extra1 and extra2 members from struct ctl_table as minimum and maximum allowed value. On sysctl handler declaration, in every source file there are some readonly variables containing just an integer which address is assigned to the extra1 and extra2 members, so the sysctl range is enforced. The special values 0, 1 and INT_MAX are very often used as range boundary, leading duplication of variables like zero=0, one=1, int_max=INT_MAX in different source files: $ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l 248 Add a const int array containing the most commonly used values, some macros to refer more easily to the correct array member, and use them instead of creating a local one for every object file. This is the bloat-o-meter output comparing the old and new binary compiled with the default Fedora config: # scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164) Data old new delta sysctl_vals - 12 +12 __kstrtab_sysctl_vals - 12 +12 max 14 10 -4 int_max 16 - -16 one 68 - -68 zero 128 28 -100 Total: Before=20583249, After=20583085, chg -0.00% [mcroce@redhat.com: tipc: remove two unused variables] Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com [akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c] [arnd@arndb.de: proc/sysctl: make firmware loader table conditional] Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de [akpm@linux-foundation.org: fix fs/eventpoll.c] Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com Signed-off-by: Matteo Croce <mcroce@redhat.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Kees Cook <keescook@chromium.org> Reviewed-by: Aaron Tomlin <atomlin@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-18mm: thp: fix false negative of shmem vma's THP eligibilityYang Shi1-1/+2
Commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each vma") introduced THPeligible bit for processes' smaps. But, when checking the eligibility for shmem vma, __transparent_hugepage_enabled() is called to override the result from shmem_huge_enabled(). It may result in the anonymous vma's THP flag override shmem's. For example, running a simple test which create THP for shmem, but with anonymous THP disabled, when reading the process's smaps, it may show: 7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test Size: 4096 kB ... [snip] ... ShmemPmdMapped: 4096 kB ... [snip] ... THPeligible: 0 And, /proc/meminfo does show THP allocated and PMD mapped too: ShmemHugePages: 4096 kB ShmemPmdMapped: 4096 kB This doesn't make too much sense. The shmem objects should be treated separately from anonymous THP. Calling shmem_huge_enabled() with checking MMF_DISABLE_THP sounds good enough. And, we could skip stack and dax vma check since we already checked if the vma is shmem already. Also check if vma is suitable for THP by calling transhuge_vma_suitable(). And minor fix to smaps output format and documentation. Link: http://lkml.kernel.org/r/1560401041-32207-3-git-send-email-yang.shi@linux.alibaba.com Fixes: 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each vma") Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-17Merge branch 'akpm' (patches from Andrew)Linus Torvalds4-11/+32
Merge more updates from Andrew Morton: "VM: - z3fold fixes and enhancements by Henry Burns and Vitaly Wool - more accurate reclaimed slab caches calculations by Yafang Shao - fix MAP_UNINITIALIZED UAPI symbol to not depend on config, by Christoph Hellwig - !CONFIG_MMU fixes by Christoph Hellwig - new novmcoredd parameter to omit device dumps from vmcore, by Kairui Song - new test_meminit module for testing heap and pagealloc initialization, by Alexander Potapenko - ioremap improvements for huge mappings, by Anshuman Khandual - generalize kprobe page fault handling, by Anshuman Khandual - device-dax hotplug fixes and improvements, by Pavel Tatashin - enable synchronous DAX fault on powerpc, by Aneesh Kumar K.V - add pte_devmap() support for arm64, by Robin Murphy - unify locked_vm accounting with a helper, by Daniel Jordan - several misc fixes core/lib: - new typeof_member() macro including some users, by Alexey Dobriyan - make BIT() and GENMASK() available in asm, by Masahiro Yamada - changed LIST_POISON2 on x86_64 to 0xdead000000000122 for better code generation, by Alexey Dobriyan - rbtree code size optimizations, by Michel Lespinasse - convert struct pid count to refcount_t, by Joel Fernandes get_maintainer.pl: - add --no-moderated switch to skip moderated ML's, by Joe Perches misc: - ptrace PTRACE_GET_SYSCALL_INFO interface - coda updates - gdb scripts, various" [ Using merge message suggestion from Vlastimil Babka, with some editing - Linus ] * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (100 commits) fs/select.c: use struct_size() in kmalloc() mm: add account_locked_vm utility function arm64: mm: implement pte_devmap support mm: introduce ARCH_HAS_PTE_DEVMAP mm: clean up is_device_*_page() definitions mm/mmap: move common defines to mman-common.h mm: move MAP_SYNC to asm-generic/mman-common.h device-dax: "Hotremove" persistent memory that is used like normal RAM mm/hotplug: make remove_memory() interface usable device-dax: fix memory and resource leak if hotplug fails include/linux/lz4.h: fix spelling and copy-paste errors in documentation ipc/mqueue.c: only perform resource calculation if user valid include/asm-generic/bug.h: fix "cut here" for WARN_ON for __WARN_TAINT architectures scripts/gdb: add helpers to find and list devices scripts/gdb: add lx-genpd-summary command drivers/pps/pps.c: clear offset flags in PPS_SETPARAMS ioctl kernel/pid.c: convert struct pid count to refcount_t drivers/rapidio/devices/rio_mport_cdev.c: NUL terminate some strings select: shift restore_saved_sigmask_unless() into poll_select_copy_remaining() select: change do_poll() to return -ERESTARTNOHAND rather than -EINTR ...
2019-07-16fs/proc/proc_sysctl.c: fix the default values of i_uid/i_gid on /proc/sys ↵Radoslaw Burny1-0/+4
inodes. Normally, the inode's i_uid/i_gid are translated relative to s_user_ns, but this is not a correct behavior for proc. Since sysctl permission check in test_perm is done against GLOBAL_ROOT_[UG]ID, it makes more sense to use these values in u_[ug]id of proc inodes. In other words: although uid/gid in the inode is not read during test_perm, the inode logically belongs to the root of the namespace. I have confirmed this with Eric Biederman at LPC and in this thread: https://lore.kernel.org/lkml/87k1kzjdff.fsf@xmission.com Consequences ============ Since the i_[ug]id values of proc nodes are not used for permissions checks, this change usually makes no functional difference. However, it causes an issue in a setup where: * a namespace container is created without root user in container - hence the i_[ug]id of proc nodes are set to INVALID_[UG]ID * container creator tries to configure it by writing /proc/sys files, e.g. writing /proc/sys/kernel/shmmax to configure shared memory limit Kernel does not allow to open an inode for writing if its i_[ug]id are invalid, making it impossible to write shmmax and thus - configure the container. Using a container with no root mapping is apparently rare, but we do use this configuration at Google. Also, we use a generic tool to configure the container limits, and the inability to write any of them causes a failure. History ======= The invalid uids/gids in inodes first appeared due to 81754357770e (fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns). However, AFAIK, this did not immediately cause any issues. The inability to write to these "invalid" inodes was only caused by a later commit 0bd23d09b874 (vfs: Don't modify inodes with a uid or gid unknown to the vfs). Tested: Used a repro program that creates a user namespace without any mapping and stat'ed /proc/$PID/root/proc/sys/kernel/shmmax from outside. Before the change, it shows the overflow uid, with the change it's 0. The overflow uid indicates that the uid in the inode is not correct and thus it is not possible to open the file for writing. Link: http://lkml.kernel.org/r/20190708115130.250149-1-rburny@google.com Fixes: 0bd23d09b874 ("vfs: Don't modify inodes with a uid or gid unknown to the vfs") Signed-off-by: Radoslaw Burny <rburny@google.com> Acked-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Seth Forshee <seth.forshee@canonical.com> Cc: John Sperbeck <jsperbeck@google.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: <stable@vger.kernel.org> [4.8+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-16fs/proc/inode.c: use typeof_member() macroAlexey Dobriyan1-10/+17
Don't repeat function signatures twice. This is a kind-of-precursor for "struct proc_ops". Note: typeof(pde->proc_fops->...) ...; can't be used because ->proc_fops is "const struct file_operations *". "const" prevents assignment down the code and it can't be deleted in the type system. Link: http://lkml.kernel.org/r/20190529191110.GB5703@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-16vmcore: add a kernel parameter novmcoreddKairui Song2-1/+11
Since commit 2724273e8fd0 ("vmcore: add API to collect hardware dump in second kernel"), drivers are allowed to add device related dump data to vmcore as they want by using the device dump API. This has a potential issue, the data is stored in memory, drivers may append too much data and use too much memory. The vmcore is typically used in a kdump kernel which runs in a pre-reserved small chunk of memory. So as a result it will make kdump unusable at all due to OOM issues. So introduce new 'novmcoredd' command line option. User can disable device dump to reduce memory usage. This is helpful if device dump is using too much memory, disabling device dump could make sure a regular vmcore without device dump data is still available. [akpm@linux-foundation.org: tweak documentation] [akpm@linux-foundation.org: vmcore.c needs moduleparam.h] Link: http://lkml.kernel.org/r/20190528111856.7276-1-kasong@redhat.com Signed-off-by: Kairui Song <kasong@redhat.com> Acked-by: Dave Young <dyoung@redhat.com> Reviewed-by: Bhupesh Sharma <bhsharma@redhat.com> Cc: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Cc: "David S . Miller" <davem@davemloft.net> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-16Merge tag 'docs/v5.3-1' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media Pull rst conversion of docs from Mauro Carvalho Chehab: "As agreed with Jon, I'm sending this big series directly to you, c/c him, as this series required a special care, in order to avoid conflicts with other trees" * tag 'docs/v5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (77 commits) docs: kbuild: fix build with pdf and fix some minor issues docs: block: fix pdf output docs: arm: fix a breakage with pdf output docs: don't use nested tables docs: gpio: add sysfs interface to the admin-guide docs: locking: add it to the main index docs: add some directories to the main documentation index docs: add SPDX tags to new index files docs: add a memory-devices subdir to driver-api docs: phy: place documentation under driver-api docs: serial: move it to the driver-api docs: driver-api: add remaining converted dirs to it docs: driver-api: add xilinx driver API documentation docs: driver-api: add a series of orphaned documents docs: admin-guide: add a series of orphaned documents docs: cgroup-v1: add it to the admin-guide book docs: aoe: add it to the driver-api book docs: add some documentation dirs to the driver-api book docs: driver-model: move it to the driver-api book docs: lp855x-driver.rst: add it to the driver-api book ...
2019-07-16Merge branch 'proc-cmdline' (/proc/<pid>/cmdline fixes)Linus Torvalds1-57/+75
This fixes two problems reported with the cmdline simplification and cleanup last year: - the setproctitle() special cases didn't quite match the original semantics, and it can be noticeable: https://lore.kernel.org/lkml/alpine.LNX.2.21.1904052326230.3249@kich.toxcorp.com/ - it could leak an uninitialized byte from the temporary buffer under the right (wrong) circustances: https://lore.kernel.org/lkml/20190712160913.17727-1-izbyshev@ispras.ru/ It rewrites the logic entirely, splitting it into two separate commits (and two separate functions) for the two different cases ("unedited cmdline" vs "setproctitle() has been used to change the command line"). * proc-cmdline: /proc/<pid>/cmdline: add back the setproctitle() special case /proc/<pid>/cmdline: remove all the special cases
2019-07-16/proc/<pid>/cmdline: add back the setproctitle() special caseLinus Torvalds1-4/+77
This makes the setproctitle() special case very explicit indeed, and handles it with a separate helper function entirely. In the process, it re-instates the original semantics of simply stopping at the first NUL character when the original last NUL character is no longer there. [ The original semantics can still be seen in mm/util.c: get_cmdline() that is limited to a fixed-size buffer ] This makes the logic about when we use the string lengths etc much more obvious, and makes it easier to see what we do and what the two very different cases are. Note that even when we allow walking past the end of the argument array (because the setproctitle() might have overwritten and overflowed the original argv[] strings), we only allow it when it overflows into the environment region if it is immediately adjacent. [ Fixed for missing 'count' checks noted by Alexey Izbyshev ] Link: https://lore.kernel.org/lkml/alpine.LNX.2.21.1904052326230.3249@kich.toxcorp.com/ Fixes: 5ab827189965 ("fs/proc: simplify and clarify get_mm_cmdline() function") Cc: Jakub Jankowski <shasta@toxcorp.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Alexey Izbyshev <izbyshev@ispras.ru> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-16/proc/<pid>/cmdline: remove all the special casesLinus Torvalds1-63/+8
Start off with a clean slate that only reads exactly from arg_start to arg_end, without any oddities. This simplifies the code and in the process removes the case that caused us to potentially leak an uninitialized byte from the temporary kernel buffer. Note that in order to start from scratch with an understandable base, this simplifies things _too_ much, and removes all the legacy logic to handle setproctitle() having changed the argument strings. We'll add back those special cases very differently in the next commit. Link: https://lore.kernel.org/lkml/20190712160913.17727-1-izbyshev@ispras.ru/ Fixes: f5b65348fd77 ("proc: fix missing final NUL in get_mm_cmdline() rewrite") Cc: Alexey Izbyshev <izbyshev@ispras.ru> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-15docs: admin-guide: move sysctl directory to itMauro Carvalho Chehab1-1/+1
The stuff under sysctl describes /sys interface from userspace point of view. So, add it to the admin-guide