13 Aug, 2020
40 commits
-
Drop the repeated word "and".
Signed-off-by: Randy Dunlap
Signed-off-by: Andrew Morton
Reviewed-by: Andrew Morton
Reviewed-by: Zi Yan
Link: http://lkml.kernel.org/r/20200801173822.14973-12-rdunlap@infradead.org
Signed-off-by: Linus Torvalds -
Drop the repeated word "the".
Signed-off-by: Randy Dunlap
Signed-off-by: Andrew Morton
Reviewed-by: Andrew Morton
Reviewed-by: Zi Yan
Link: http://lkml.kernel.org/r/20200801173822.14973-11-rdunlap@infradead.org
Signed-off-by: Linus Torvalds -
Drop the repeated word "them" and "that".
Change "the the" to "to the".Signed-off-by: Randy Dunlap
Signed-off-by: Andrew Morton
Reviewed-by: Andrew Morton
Reviewed-by: Zi Yan
Link: http://lkml.kernel.org/r/20200801173822.14973-10-rdunlap@infradead.org
Signed-off-by: Linus Torvalds -
Drop the repeated word "that" in two places.
Signed-off-by: Randy Dunlap
Signed-off-by: Andrew Morton
Reviewed-by: Andrew Morton
Reviewed-by: Zi Yan
Link: http://lkml.kernel.org/r/20200801173822.14973-9-rdunlap@infradead.org
Signed-off-by: Linus Torvalds -
Drop the repeated word "and".
Signed-off-by: Randy Dunlap
Signed-off-by: Andrew Morton
Reviewed-by: Andrew Morton
Reviewed-by: Zi Yan
Link: http://lkml.kernel.org/r/20200801173822.14973-8-rdunlap@infradead.org
Signed-off-by: Linus Torvalds -
Drop the repeated word "to" in two places.
Signed-off-by: Randy Dunlap
Signed-off-by: Andrew Morton
Reviewed-by: Andrew Morton
Reviewed-by: Zi Yan
Link: http://lkml.kernel.org/r/20200801173822.14973-7-rdunlap@infradead.org
Signed-off-by: Linus Torvalds -
Drop the repeated word "down".
Signed-off-by: Randy Dunlap
Signed-off-by: Andrew Morton
Reviewed-by: Andrew Morton
Reviewed-by: Zi Yan
Link: http://lkml.kernel.org/r/20200801173822.14973-6-rdunlap@infradead.org
Signed-off-by: Linus Torvalds -
Drop the repeated word "the" in two places.
Signed-off-by: Randy Dunlap
Signed-off-by: Andrew Morton
Reviewed-by: Andrew Morton
Reviewed-by: Mike Kravetz
Reviewed-by: Zi Yan
Link: http://lkml.kernel.org/r/20200801173822.14973-5-rdunlap@infradead.org
Signed-off-by: Linus Torvalds -
Drop the repeated word "pages".
Signed-off-by: Randy Dunlap
Signed-off-by: Andrew Morton
Reviewed-by: Andrew Morton
Reviewed-by: Zi Yan
Link: http://lkml.kernel.org/r/20200801173822.14973-4-rdunlap@infradead.org
Signed-off-by: Linus Torvalds -
Drop the repeated word "the".
Signed-off-by: Randy Dunlap
Signed-off-by: Andrew Morton
Reviewed-by: Andrew Morton
Reviewed-by: Zi Yan
Link: http://lkml.kernel.org/r/20200801173822.14973-3-rdunlap@infradead.org
Signed-off-by: Linus Torvalds -
Drop the repeated word "a".
Signed-off-by: Randy Dunlap
Signed-off-by: Andrew Morton
Reviewed-by: Andrew Morton
Reviewed-by: Zi Yan
Link: http://lkml.kernel.org/r/20200801173822.14973-2-rdunlap@infradead.org
Signed-off-by: Linus Torvalds -
The macro is not used anywhere, so remove the definition.
Signed-off-by: Arvind Sankar
Signed-off-by: Andrew Morton
Reviewed-by: Andrew Morton
Reviewed-by: David Hildenbrand
Acked-by: Dave Hansen
Acked-by: David S. Miller
Acked-by: Mike Rapoport
Link: http://lkml.kernel.org/r/20200723231544.17274-4-nivedita@alum.mit.edu
Signed-off-by: Linus Torvalds -
The macro is not used anywhere, so remove the definition.
Signed-off-by: Arvind Sankar
Signed-off-by: Andrew Morton
Reviewed-by: Andrew Morton
Reviewed-by: David Hildenbrand
Acked-by: Dave Hansen
Acked-by: Mike Rapoport
Link: http://lkml.kernel.org/r/20200723231544.17274-3-nivedita@alum.mit.edu
Signed-off-by: Linus Torvalds -
Drop the doubled word "for" in a comment.
Fix spello of "incremented".Signed-off-by: Randy Dunlap
Signed-off-by: Andrew Morton
Acked-by: Chris Down
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Vladimir Davydov
Link: http://lkml.kernel.org/r/b04aa2e4-7c95-12f0-599d-43d07fb28134@infradead.org
Signed-off-by: Linus Torvalds -
Drop the doubled word "in" in a comment.
Signed-off-by: Randy Dunlap
Signed-off-by: Andrew Morton
Reviewed-by: Andrew Morton
Cc: Konrad Rzeszutek Wilk
Link: http://lkml.kernel.org/r/3af7ed91-ad62-8445-40a4-9e07a64b9523@infradead.org
Signed-off-by: Linus Torvalds -
Change the doubled word "is" in a comment to "it is".
Signed-off-by: Randy Dunlap
Signed-off-by: Andrew Morton
Reviewed-by: Andrew Morton
Link: http://lkml.kernel.org/r/ad605959-0083-4794-8d31-6b073300dd6f@infradead.org
Signed-off-by: Linus Torvalds -
Drop the doubled words "to" and "the".
Signed-off-by: Randy Dunlap
Signed-off-by: Andrew Morton
Reviewed-by: SeongJae Park
Link: http://lkml.kernel.org/r/d9fae8d6-0d60-4d52-9385-3199ee98de49@infradead.org
Signed-off-by: Linus Torvalds -
Drop the doubled words "used" and "by".
Drop the repeated acronym "TLB" and make several other fixes around it.
(capital letters, spellos)Signed-off-by: Randy Dunlap
Signed-off-by: Andrew Morton
Reviewed-by: SeongJae Park
Link: http://lkml.kernel.org/r/2bb6e13e-44df-4920-52d9-4d3539945f73@infradead.org
Signed-off-by: Linus Torvalds -
When onlining a first memory block in a zone, pcp lists are not updated
thus pcp struct will have the default setting of ->high = 0,->batch = 1.This means till the second memory block in a zone(if it have) is onlined
the pcp lists of this zone will not contain any pages because pcp's
->count is always greater than ->high thus free_pcppages_bulk() is called
to free batch size(=1) pages every time system wants to add a page to the
pcp list through free_unref_page().To put this in a word, system is not using benefits offered by the pcp
lists when there is a single onlineable memory block in a zone. Correct
this by always updating the pcp lists when memory block is onlined.Fixes: 1f522509c77a ("mem-hotplug: avoid multiple zones sharing same boot strapping boot_pageset")
Signed-off-by: Charan Teja Reddy
Signed-off-by: Andrew Morton
Reviewed-by: David Hildenbrand
Acked-by: Vlastimil Babka
Acked-by: Michal Hocko
Cc: Vinayak Menon
Link: http://lkml.kernel.org/r/1596372896-15336-1-git-send-email-charante@codeaurora.org
Signed-off-by: Linus Torvalds -
When check_memblock_offlined_cb() returns failed rc(e.g. the memblock is
online at that time), mem_hotplug_begin/done is unpaired in such case.Therefore a warning:
Call Trace:
percpu_up_write+0x33/0x40
try_remove_memory+0x66/0x120
? _cond_resched+0x19/0x30
remove_memory+0x2b/0x40
dev_dax_kmem_remove+0x36/0x72 [kmem]
device_release_driver_internal+0xf0/0x1c0
device_release_driver+0x12/0x20
bus_remove_device+0xe1/0x150
device_del+0x17b/0x3e0
unregister_dev_dax+0x29/0x60
devm_action_release+0x15/0x20
release_nodes+0x19a/0x1e0
devres_release_all+0x3f/0x50
device_release_driver_internal+0x100/0x1c0
driver_detach+0x4c/0x8f
bus_remove_driver+0x5c/0xd0
driver_unregister+0x31/0x50
dax_pmem_exit+0x10/0xfe0 [dax_pmem]Fixes: f1037ec0cc8a ("mm/memory_hotplug: fix remove_memory() lockdep splat")
Signed-off-by: Jia He
Signed-off-by: Andrew Morton
Reviewed-by: David Hildenbrand
Acked-by: Michal Hocko
Acked-by: Dan Williams
Cc: [5.6+]
Cc: Andy Lutomirski
Cc: Baoquan He
Cc: Borislav Petkov
Cc: Catalin Marinas
Cc: Chuhong Yuan
Cc: Dave Hansen
Cc: Dave Jiang
Cc: Fenghua Yu
Cc: "H. Peter Anvin"
Cc: Ingo Molnar
Cc: Jonathan Cameron
Cc: Kaly Xin
Cc: Logan Gunthorpe
Cc: Masahiro Yamada
Cc: Mike Rapoport
Cc: Peter Zijlstra
Cc: Rich Felker
Cc: Thomas Gleixner
Cc: Tony Luck
Cc: Vishal Verma
Cc: Will Deacon
Cc: Yoshinori Sato
Link: http://lkml.kernel.org/r/20200710031619.18762-3-justin.he@arm.com
Signed-off-by: Linus Torvalds -
This is to introduce a general dummy helper. memory_add_physaddr_to_nid()
is a fallback option to get the nid in case NUMA_NO_NID is detected.After this patch, arm64/sh/s390 can simply use the general dummy version.
PowerPC/x86/ia64 will still use their specific version.This is the preparation to set a fallback value for dev_dax->target_node.
Signed-off-by: Jia He
Signed-off-by: Andrew Morton
Reviewed-by: David Hildenbrand
Cc: Dan Williams
Cc: Michal Hocko
Cc: Catalin Marinas
Cc: Will Deacon
Cc: Tony Luck
Cc: Fenghua Yu
Cc: Yoshinori Sato
Cc: Rich Felker
Cc: Dave Hansen
Cc: Andy Lutomirski
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Borislav Petkov
Cc: "H. Peter Anvin"
Cc: Vishal Verma
Cc: Dave Jiang
Cc: Baoquan He
Cc: Chuhong Yuan
Cc: Mike Rapoport
Cc: Logan Gunthorpe
Cc: Masahiro Yamada
Cc: Jonathan Cameron
Cc: Kaly Xin
Link: http://lkml.kernel.org/r/20200710031619.18762-2-justin.he@arm.com
Signed-off-by: Linus Torvalds -
Some of our servers spend significant time at kernel boot initializing
memory block sysfs directories and then creating symlinks between them and
the corresponding nodes. The slowness happens because the machines get
stuck with the smallest supported memory block size on x86 (128M), which
results in 16,288 directories to cover the 2T of installed RAM. The
search for each memory block is noticeable even with commit 4fb6eabf1037
("drivers/base/memory.c: cache memory blocks in xarray to accelerate
lookup").Commit 078eb6aa50dc ("x86/mm/memory_hotplug: determine block size based on
the end of boot memory") chooses the block size based on alignment with
memory end. That addresses hotplug failures in qemu guests, but for bare
metal systems whose memory end isn't aligned to even the smallest size, it
leaves them at 128M.Make kernels that aren't running on a hypervisor use the largest supported
size (2G) to minimize overhead on big machines. Kernel boot goes 7%
faster on the aforementioned servers, shaving off half a second.[daniel.m.jordan@oracle.com: v3]
Link: http://lkml.kernel.org/r/20200714205450.945834-1-daniel.m.jordan@oracle.comSigned-off-by: Daniel Jordan
Signed-off-by: Andrew Morton
Acked-by: David Hildenbrand
Cc: Andy Lutomirski
Cc: Dave Hansen
Cc: David Hildenbrand
Cc: Michal Hocko
Cc: Pavel Tatashin
Cc: Peter Zijlstra
Cc: Steven Sistare
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/20200609225451.3542648-1-daniel.m.jordan@oracle.com
Signed-off-by: Linus Torvalds -
Fix W=1 compile warnings (invalid kerneldoc):
mm/mmu_notifier.c:187: warning: Function parameter or member 'interval_sub' not described in 'mmu_interval_read_bgin'
mm/mmu_notifier.c:708: warning: Function parameter or member 'subscription' not described in 'mmu_notifier_registr'
mm/mmu_notifier.c:708: warning: Excess function parameter 'mn' description in 'mmu_notifier_register'
mm/mmu_notifier.c:880: warning: Function parameter or member 'subscription' not described in 'mmu_notifier_put'
mm/mmu_notifier.c:880: warning: Excess function parameter 'mn' description in 'mmu_notifier_put'
mm/mmu_notifier.c:982: warning: Function parameter or member 'ops' not described in 'mmu_interval_notifier_insert'Signed-off-by: Krzysztof Kozlowski
Signed-off-by: Andrew Morton
Reviewed-by: Jason Gunthorpe
Link: http://lkml.kernel.org/r/20200728171109.28687-4-krzk@kernel.org
Signed-off-by: Linus Torvalds -
The current_gfp_context() converts a number of PF_MEMALLOC_* per-process
flags into the corresponding GFP_* flags for memory allocation. In that
function, current->flags is accessed 3 times. That may lead to duplicated
access of the same memory location.This is not usually a problem with minimal debug config options on as the
compiler can optimize away the duplicated memory accesses. With most of
the debug config options on, however, that may not be the case. For
example, the x86-64 object size of the __need_fs_reclaim() in a debug
kernel that calls current_gfp_context() was 309 bytes. With this patch
applied, the object size is reduced to 202 bytes. This is a saving of 107
bytes and will probably be slightly faster too.Use READ_ONCE() to access current->flags to prevent the compiler from
possibly accessing current->flags multiple times.Signed-off-by: Waiman Long
Signed-off-by: Andrew Morton
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Mathieu Desnoyers
Cc: Michel Lespinasse
Link: http://lkml.kernel.org/r/20200618212936.9776-1-longman@redhat.com
Signed-off-by: Linus Torvalds -
The routine cma_init_reserved_areas is designed to activate all
reserved cma areas. It quits when it first encounters an error.
This can leave some areas in a state where they are reserved but
not activated. There is no feedback to code which performed the
reservation. Attempting to allocate memory from areas in such a
state will result in a BUG.Modify cma_init_reserved_areas to always attempt to activate all
areas. The called routine, cma_activate_area is responsible for
leaving the area in a valid state. No one is making active use
of returned error codes, so change the routine to void.How to reproduce: This example uses kernelcore, hugetlb and cma
as an easy way to reproduce. However, this is a more general cma
issue.Two node x86 VM 16GB total, 8GB per node
Kernel command line parameters, kernelcore=4G hugetlb_cma=8G
Related boot time messages,
hugetlb_cma: reserve 8192 MiB, up to 4096 MiB per node
cma: Reserved 4096 MiB at 0x0000000100000000
hugetlb_cma: reserved 4096 MiB on node 0
cma: Reserved 4096 MiB at 0x0000000300000000
hugetlb_cma: reserved 4096 MiB on node 1
cma: CMA area hugetlb could not be activated# echo 8 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
...
Call Trace:
bitmap_find_next_zero_area_off+0x51/0x90
cma_alloc+0x1a5/0x310
alloc_fresh_huge_page+0x78/0x1a0
alloc_pool_huge_page+0x6f/0xf0
set_max_huge_pages+0x10c/0x250
nr_hugepages_store_common+0x92/0x120
? __kmalloc+0x171/0x270
kernfs_fop_write+0xc1/0x1a0
vfs_write+0xc7/0x1f0
ksys_write+0x5f/0xe0
do_syscall_64+0x4d/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xa9Fixes: c64be2bb1c6e ("drivers: add Contiguous Memory Allocator")
Signed-off-by: Mike Kravetz
Signed-off-by: Andrew Morton
Reviewed-by: Roman Gushchin
Acked-by: Barry Song
Cc: Marek Szyprowski
Cc: Michal Nazarewicz
Cc: Kyungmin Park
Cc: Joonsoo Kim
Cc:
Link: http://lkml.kernel.org/r/20200730163123.6451-1-mike.kravetz@oracle.com
Signed-off-by: Linus Torvalds -
Once we enable CMA_DEBUGFS, we will get the below errors: directory
'cma-hugetlb' with parent 'cma' already present.We should have different names for different CMA areas.
Signed-off-by: Barry Song
Signed-off-by: Andrew Morton
Reviewed-by: Mike Kravetz
Acked-by: Roman Gushchin
Link: http://lkml.kernel.org/r/20200616223131.33828-3-song.bao.hua@hisilicon.com
Signed-off-by: Linus Torvalds -
Patch series "mm: fix the names of general cma and hugetlb cma", v2.
The current code of CMA can only work when users pass a const string as
name parameter. we need to fix the way to handle names in CMA. On the
other hand, to avoid name conflicts after enabling CMA_DEBUGFS, each
hugetlb should get a different CMA name.This patch (of 2):
If users give a name saved in stack, the current code will generate magic
pointer. if users don't give a name(NULL), kasprintf() will always return
NULL as we are at the early stage. that means cma_init_reserved_mem()
will return -ENOMEM if users set name parameter as NULL.[natechancellor@gmail.com: return cma->name directly in cma_get_name]
Link: https://github.com/ClangBuiltLinux/linux/issues/1063
Link: http://lkml.kernel.org/r/20200623015840.621964-1-natechancellor@gmail.comSigned-off-by: Barry Song
Signed-off-by: Nathan Chancellor
Signed-off-by: Andrew Morton
Reviewed-by: Mike Kravetz
Acked-by: Roman Gushchin
Link: http://lkml.kernel.org/r/20200616223131.33828-2-song.bao.hua@hisilicon.com
Signed-off-by: Linus Torvalds -
In some case the cma area could not be activated, but the cma_alloc be
used under this case, then the kernel will crash caused by NULL pointer
dereference.Add bitmap valid check in cma_alloc to avoid this issue.
Signed-off-by: Jianqun Xu
Signed-off-by: Andrew Morton
Reviewed-by: David Hildenbrand
Link: http://lkml.kernel.org/r/20200615010123.15596-1-jay.xu@rock-chips.com
Signed-off-by: Linus Torvalds -
Add following new vmstat events which will help in validating THP
migration without split. Statistics reported through these new VM events
will help in performance debugging.1. THP_MIGRATION_SUCCESS
2. THP_MIGRATION_FAILURE
3. THP_MIGRATION_SPLITIn addition, these new events also update normal page migration statistics
appropriately via PGMIGRATE_SUCCESS and PGMIGRATE_FAILURE. While here,
this updates current trace event 'mm_migrate_pages' to accommodate now
available THP statistics.[akpm@linux-foundation.org: s/hpage_nr_pages/thp_nr_pages/]
[ziy@nvidia.com: v2]
Link: http://lkml.kernel.org/r/C5E3C65C-8253-4638-9D3C-71A61858BB8B@nvidia.com
[anshuman.khandual@arm.com: s/thp_nr_pages/hpage_nr_pages/]
Link: http://lkml.kernel.org/r/1594287583-16568-1-git-send-email-anshuman.khandual@arm.comSigned-off-by: Anshuman Khandual
Signed-off-by: Zi Yan
Signed-off-by: Andrew Morton
Reviewed-by: Daniel Jordan
Cc: Hugh Dickins
Cc: Matthew Wilcox
Cc: Zi Yan
Cc: John Hubbard
Cc: Naoya Horiguchi
Link: http://lkml.kernel.org/r/1594080415-27924-1-git-send-email-anshuman.khandual@arm.com
Signed-off-by: Linus Torvalds -
Since commit 3917c80280c93a7123f ("thp: change CoW semantics for
anon-THP"), the CoW page fault of THP has been rewritten, debug_cow is not
used anymore. So, just remove it.Signed-off-by: Yang Shi
Signed-off-by: Andrew Morton
Reviewed-by: Zi Yan
Cc: Kirill A. Shutemov
Link: http://lkml.kernel.org/r/1592270980-116062-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Linus Torvalds -
Add a migrate_vma_*() self test for mmap(MAP_SHARED) to verify that
!vma_anonymous() ranges won't be migrated.Signed-off-by: Ralph Campbell
Signed-off-by: Andrew Morton
Cc: Jerome Glisse
Cc: John Hubbard
Cc: Christoph Hellwig
Cc: Jason Gunthorpe
Cc: "Bharata B Rao"
Cc: Shuah Khan
Link: http://lkml.kernel.org/r/20200710194840.7602-3-rcampbell@nvidia.com
Link: http://lkml.kernel.org/r/20200709165711.26584-3-rcampbell@nvidia.com
Signed-off-by: Linus Torvalds -
Patch series "mm/migrate: optimize migrate_vma_setup() for holes".
A simple optimization for migrate_vma_*() when the source vma is not an
anonymous vma and a new test case to exercise it.This patch (of 2):
When migrating system memory to device private memory, if the source
address range is a valid VMA range and there is no memory or a zero page,
the source PFN array is marked as valid but with no PFN.This lets the device driver allocate private memory and clear it, then
insert the new device private struct page into the CPU's page tables when
migrate_vma_pages() is called. migrate_vma_pages() only inserts the new
page if the VMA is an anonymous range.There is no point in telling the device driver to allocate device private
memory and then not migrate the page. Instead, mark the source PFN array
entries as not migrating to avoid this overhead.[rcampbell@nvidia.com: v2]
Link: http://lkml.kernel.org/r/20200710194840.7602-2-rcampbell@nvidia.comSigned-off-by: Ralph Campbell
Signed-off-by: Andrew Morton
Cc: Jerome Glisse
Cc: John Hubbard
Cc: Christoph Hellwig
Cc: Jason Gunthorpe
Cc: "Bharata B Rao"
Cc: Shuah Khan
Link: http://lkml.kernel.org/r/20200710194840.7602-1-rcampbell@nvidia.com
Link: http://lkml.kernel.org/r/20200709165711.26584-1-rcampbell@nvidia.com
Link: http://lkml.kernel.org/r/20200709165711.26584-2-rcampbell@nvidia.com
Signed-off-by: Linus Torvalds -
Commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") requires callers of huge_pte_alloc to hold i_mmap_rwsem
in at least read mode. This is because the explicit locking in
huge_pmd_share (called by huge_pte_alloc) was removed. When restructuring
the code, the call to huge_pte_alloc in the else block at the beginning of
hugetlb_fault was missed.Unfortunately, that else clause is exercised when there is no page table
entry. This will likely lead to a call to huge_pmd_share. If
huge_pmd_share thinks pmd sharing is possible, it will traverse the
mapping tree (i_mmap) without holding i_mmap_rwsem. If someone else is
modifying the tree, bad things such as addressing exceptions or worse
could happen.Simply remove the else clause. It should have been removed previously.
The code following the else will call huge_pte_alloc with the appropriate
locking.To prevent this type of issue in the future, add routines to assert that
i_mmap_rwsem is held, and call these routines in huge pmd sharing
routines.Fixes: c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
Suggested-by: Matthew Wilcox
Signed-off-by: Mike Kravetz
Signed-off-by: Andrew Morton
Cc: Michal Hocko
Cc: Hugh Dickins
Cc: Naoya Horiguchi
Cc: "Aneesh Kumar K.V"
Cc: Andrea Arcangeli
Cc: "Kirill A.Shutemov"
Cc: Davidlohr Bueso
Cc: Prakash Sangappa
Cc:
Link: http://lkml.kernel.org/r/e670f327-5cf9-1959-96e4-6dc7cc30d3d5@oracle.com
Signed-off-by: Linus Torvalds -
syzbot found issues with having hugetlbfs on a union/overlay as reported
in [1]. Due to the limitations (no write) and special functionality of
hugetlbfs, it does not work well in filesystem stacking. There are no
know use cases for hugetlbfs stacking. Rather than making modifications
to get hugetlbfs working in such environments, simply prevent stacking.[1] https://lore.kernel.org/linux-mm/000000000000b4684e05a2968ca6@google.com/
Reported-by: syzbot+d6ec23007e951dadf3de@syzkaller.appspotmail.com
Suggested-by: Amir Goldstein
Signed-off-by: Mike Kravetz
Signed-off-by: Andrew Morton
Acked-by: Miklos Szeredi
Cc: Al Viro
Cc: Matthew Wilcox
Cc: Colin Walters
Link: http://lkml.kernel.org/r/80f869aa-810d-ef6c-8888-b46cee135907@oracle.com
Signed-off-by: Linus Torvalds -
When the OOM killer finds a victim and tryies to kill it, if the victim is
already exiting, the task mm will be NULL and no process will be killed.
But the dump_header() has been already executed, so it will be strange to
dump so much information without killing a process. We'd better show some
helpful information to indicate why this happens.Suggested-by: David Rientjes
Signed-off-by: Yafang Shao
Signed-off-by: Andrew Morton
Acked-by: Michal Hocko
Cc: Tetsuo Handa
Cc: Qian Cai
Link: http://lkml.kernel.org/r/20200721010127.17238-1-laoar.shao@gmail.com
Signed-off-by: Linus Torvalds -
The exported value includes oom_score_adj so the range is no [0, 1000] as
described in the previous section but rather [0, 2000]. Mention that fact
explicitly.Signed-off-by: Michal Hocko
Signed-off-by: Andrew Morton
Cc: Jonathan Corbet
Cc: David Rientjes
Cc: Yafang Shao
Link: http://lkml.kernel.org/r/20200709062603.18480-2-mhocko@kernel.org
Signed-off-by: Linus Torvalds -
There are at least two notes in the oom section. The 3% discount for root
processes is gone since d46078b28889 ("mm, oom: remove 3% bonus for
CAP_SYS_ADMIN processes").Likewise children of the selected oom victim are not sacrificed since
bbbe48029720 ("mm, oom: remove 'prefer children over parent' heuristic")Drop both of them.
Signed-off-by: Michal Hocko
Signed-off-by: Andrew Morton
Cc: Jonathan Corbet
Cc: David Rientjes
Cc: Yafang Shao
Link: http://lkml.kernel.org/r/20200709062603.18480-1-mhocko@kernel.org
Signed-off-by: Linus Torvalds -
Recently we found an issue on our production environment that when memcg
oom is triggered the oom killer doesn't chose the process with largest
resident memory but chose the first scanned process. Note that all
processes in this memcg have the same oom_score_adj, so the oom killer
should chose the process with largest resident memory.Bellow is part of the oom info, which is enough to analyze this issue.
[7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
[7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
[7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
[...]
[7516987.983293] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[7516987.983510] [ 5740] 0 5740 257 1 32768 0 -998 pause
[7516987.983574] [58804] 0 58804 4594 771 81920 0 -998 entry_point.bas
[7516987.983577] [58908] 0 58908 7089 689 98304 0 -998 cron
[7516987.983580] [58910] 0 58910 16235 5576 163840 0 -998 supervisord
[7516987.983590] [59620] 0 59620 18074 1395 188416 0 -998 sshd
[7516987.983594] [59622] 0 59622 18680 6679 188416 0 -998 python
[7516987.983598] [59624] 0 59624 1859266 5161 548864 0 -998 odin-agent
[7516987.983600] [59625] 0 59625 707223 9248 983040 0 -998 filebeat
[7516987.983604] [59627] 0 59627 416433 64239 774144 0 -998 odin-log-agent
[7516987.983607] [59631] 0 59631 180671 15012 385024 0 -998 python3
[7516987.983612] [61396] 0 61396 791287 3189 352256 0 -998 client
[7516987.983615] [61641] 0 61641 1844642 29089 946176 0 -998 client
[7516987.983765] [ 9236] 0 9236 2642 467 53248 0 -998 php_scanner
[7516987.983911] [42898] 0 42898 15543 838 167936 0 -998 su
[7516987.983915] [42900] 1000 42900 3673 867 77824 0 -998 exec_script_vr2
[7516987.983918] [42925] 1000 42925 36475 19033 335872 0 -998 python
[7516987.983921] [57146] 1000 57146 3673 848 73728 0 -998 exec_script_J2p
[7516987.983925] [57195] 1000 57195 186359 22958 491520 0 -998 python2
[7516987.983928] [58376] 1000 58376 275764 14402 290816 0 -998 rosmaster
[7516987.983931] [58395] 1000 58395 155166 4449 245760 0 -998 rosout
[7516987.983935] [58406] 1000 58406 18285584 3967322 37101568 0 -998 data_sim
[7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
[7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kBWe can find that the first scanned process 5740 (pause) was killed, but
its rss is only one page. That is because, when we calculate the oom
badness in oom_badness(), we always ignore the negtive point and convert
all of these negtive points to 1. Now as oom_score_adj of all the
processes in this targeted memcg have the same value -998, the points of
these processes are all negtive value. As a result, the first scanned
process will be killed.The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a
a Guaranteed pod, which has higher priority to prevent from being killed
by system oom.To fix this issue, we should make the calculation of oom point more
accurate. We can achieve it by convert the chosen_point from 'unsigned
long' to 'long'.[cai@lca.pw: reported a issue in the previous version]
[mhocko@suse.com: fixed the issue reported by Cai]
[mhocko@suse.com: add the comment in proc_oom_score()]
[laoar.shao@gmail.com: v3]
Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.comSigned-off-by: Yafang Shao
Signed-off-by: Andrew Morton
Tested-by: Naresh Kamboju
Acked-by: Michal Hocko
Cc: David Rientjes
Cc: Qian Cai
Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: Linus Torvalds -
Change "interlave" to "interleave".
Signed-off-by: Yanfei Xu
Signed-off-by: Andrew Morton
Reviewed-by: Andrew Morton
Link: http://lkml.kernel.org/r/20200810063454.9357-1-yanfei.xu@windriver.com
Signed-off-by: Linus Torvalds -
Previous implementatoin calls untagged_addr() before error check, while if
the error check failed and return EINVAL, the untagged_addr() call is just
useless work.Signed-off-by: Wenchao Hao
Signed-off-by: Andrew Morton
Reviewed-by: Andrew Morton
Link: http://lkml.kernel.org/r/20200801090825.5597-1-haowenchao22@gmail.com
Signed-off-by: Linus Torvalds