13 Aug, 2020

40 commits

  • Drop the repeated word "and".

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Zi Yan
    Link: http://lkml.kernel.org/r/20200801173822.14973-12-rdunlap@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Drop the repeated word "the".

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Zi Yan
    Link: http://lkml.kernel.org/r/20200801173822.14973-11-rdunlap@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Drop the repeated words "them" and "that".
    Change "the the" to "to the".

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Zi Yan
    Link: http://lkml.kernel.org/r/20200801173822.14973-10-rdunlap@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Drop the repeated word "that" in two places.

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Zi Yan
    Link: http://lkml.kernel.org/r/20200801173822.14973-9-rdunlap@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Drop the repeated word "and".

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Zi Yan
    Link: http://lkml.kernel.org/r/20200801173822.14973-8-rdunlap@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Drop the repeated word "to" in two places.

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Zi Yan
    Link: http://lkml.kernel.org/r/20200801173822.14973-7-rdunlap@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Drop the repeated word "down".

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Zi Yan
    Link: http://lkml.kernel.org/r/20200801173822.14973-6-rdunlap@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Drop the repeated word "the" in two places.

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Reviewed-by: Zi Yan
    Link: http://lkml.kernel.org/r/20200801173822.14973-5-rdunlap@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Drop the repeated word "pages".

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Zi Yan
    Link: http://lkml.kernel.org/r/20200801173822.14973-4-rdunlap@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Drop the repeated word "the".

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Zi Yan
    Link: http://lkml.kernel.org/r/20200801173822.14973-3-rdunlap@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Drop the repeated word "a".

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Zi Yan
    Link: http://lkml.kernel.org/r/20200801173822.14973-2-rdunlap@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • The macro is not used anywhere, so remove the definition.

    Signed-off-by: Arvind Sankar
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: Dave Hansen
    Acked-by: David S. Miller
    Acked-by: Mike Rapoport
    Link: http://lkml.kernel.org/r/20200723231544.17274-4-nivedita@alum.mit.edu
    Signed-off-by: Linus Torvalds

    Arvind Sankar
     
  • The macro is not used anywhere, so remove the definition.

    Signed-off-by: Arvind Sankar
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: Dave Hansen
    Acked-by: Mike Rapoport
    Link: http://lkml.kernel.org/r/20200723231544.17274-3-nivedita@alum.mit.edu
    Signed-off-by: Linus Torvalds

    Arvind Sankar
     
  • Drop the doubled word "for" in a comment.
    Fix spello of "incremented".

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Acked-by: Chris Down
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Link: http://lkml.kernel.org/r/b04aa2e4-7c95-12f0-599d-43d07fb28134@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Drop the doubled word "in" in a comment.

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Konrad Rzeszutek Wilk
    Link: http://lkml.kernel.org/r/3af7ed91-ad62-8445-40a4-9e07a64b9523@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Change the doubled word "is" in a comment to "it is".

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/ad605959-0083-4794-8d31-6b073300dd6f@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Drop the doubled words "to" and "the".

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: SeongJae Park
    Link: http://lkml.kernel.org/r/d9fae8d6-0d60-4d52-9385-3199ee98de49@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Drop the doubled words "used" and "by".

    Drop the repeated acronym "TLB" and make several other fixes around it.
    (capital letters, spellos)

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: SeongJae Park
    Link: http://lkml.kernel.org/r/2bb6e13e-44df-4920-52d9-4d3539945f73@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • When onlining the first memory block in a zone, the pcp lists are not
    updated, so the pcp struct keeps the default settings of ->high = 0 and
    ->batch = 1.

    This means that until a second memory block in the zone (if one exists)
    is onlined, the pcp lists of this zone will not hold any pages: pcp's
    ->count is always greater than ->high, so free_pcppages_bulk() is called
    to free batch-size (= 1) pages every time the system tries to add a page
    to the pcp list through free_unref_page().

    In short, the system gets none of the benefits of the pcp lists when
    there is a single onlineable memory block in a zone. Correct this by
    always updating the pcp lists when a memory block is onlined.
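    The effect of those defaults can be sketched with a tiny userspace model
    (illustrative only; the struct and function names here are invented, not
    the kernel's):

```c
#include <assert.h>

/* Hypothetical, simplified model of the per-cpu pageset (pcp) free path,
 * not kernel code: with the boot-time defaults ->high = 0 and ->batch = 1,
 * every page added to the list is immediately flushed back to the buddy
 * allocator, so the pcp list never holds any pages. */
struct pcp_model {
    int count;  /* pages currently on the pcp list */
    int high;   /* flush threshold */
    int batch;  /* pages to flush at a time */
};

/* Mirrors the decision in free_unref_page(): add one page, then flush
 * ->batch pages whenever ->count exceeds ->high. */
static void model_free_unref_page(struct pcp_model *pcp)
{
    pcp->count++;
    if (pcp->count > pcp->high)
        pcp->count -= pcp->batch;   /* models free_pcppages_bulk() */
}

static int pages_cached_after(struct pcp_model *pcp, int frees)
{
    for (int i = 0; i < frees; i++)
        model_free_unref_page(pcp);
    return pcp->count;
}
```

    With ->high = 0, every free immediately triggers a batch-size flush, so
    the list never accumulates pages; once the pcp lists are updated on
    onlining, the list caches pages as intended.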

    Fixes: 1f522509c77a ("mem-hotplug: avoid multiple zones sharing same boot strapping boot_pageset")
    Signed-off-by: Charan Teja Reddy
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Vinayak Menon
    Link: http://lkml.kernel.org/r/1596372896-15336-1-git-send-email-charante@codeaurora.org
    Signed-off-by: Linus Torvalds

    Charan Teja Reddy
     
  • When check_memblock_offlined_cb() returns a failure rc (e.g. because the
    memblock is still online at that time), mem_hotplug_begin/done are left
    unpaired.

    This produces a warning:
    Call Trace:
    percpu_up_write+0x33/0x40
    try_remove_memory+0x66/0x120
    ? _cond_resched+0x19/0x30
    remove_memory+0x2b/0x40
    dev_dax_kmem_remove+0x36/0x72 [kmem]
    device_release_driver_internal+0xf0/0x1c0
    device_release_driver+0x12/0x20
    bus_remove_device+0xe1/0x150
    device_del+0x17b/0x3e0
    unregister_dev_dax+0x29/0x60
    devm_action_release+0x15/0x20
    release_nodes+0x19a/0x1e0
    devres_release_all+0x3f/0x50
    device_release_driver_internal+0x100/0x1c0
    driver_detach+0x4c/0x8f
    bus_remove_driver+0x5c/0xd0
    driver_unregister+0x31/0x50
    dax_pmem_exit+0x10/0xfe0 [dax_pmem]

    Fixes: f1037ec0cc8a ("mm/memory_hotplug: fix remove_memory() lockdep splat")
    Signed-off-by: Jia He
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: Michal Hocko
    Acked-by: Dan Williams
    Cc: [5.6+]
    Cc: Andy Lutomirski
    Cc: Baoquan He
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Chuhong Yuan
    Cc: Dave Hansen
    Cc: Dave Jiang
    Cc: Fenghua Yu
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Jonathan Cameron
    Cc: Kaly Xin
    Cc: Logan Gunthorpe
    Cc: Masahiro Yamada
    Cc: Mike Rapoport
    Cc: Peter Zijlstra
    Cc: Rich Felker
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vishal Verma
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200710031619.18762-3-justin.he@arm.com
    Signed-off-by: Linus Torvalds

    Jia He
     
  • Introduce a general dummy helper. memory_add_physaddr_to_nid() is a
    fallback option to get the nid in case NUMA_NO_NODE is detected.

    After this patch, arm64/sh/s390 can simply use the general dummy version.
    PowerPC/x86/ia64 will still use their specific versions.

    This is preparation for setting a fallback value for dev_dax->target_node.

    Signed-off-by: Jia He
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Cc: Dan Williams
    Cc: Michal Hocko
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Cc: Dave Hansen
    Cc: Andy Lutomirski
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: "H. Peter Anvin"
    Cc: Vishal Verma
    Cc: Dave Jiang
    Cc: Baoquan He
    Cc: Chuhong Yuan
    Cc: Mike Rapoport
    Cc: Logan Gunthorpe
    Cc: Masahiro Yamada
    Cc: Jonathan Cameron
    Cc: Kaly Xin
    Link: http://lkml.kernel.org/r/20200710031619.18762-2-justin.he@arm.com
    Signed-off-by: Linus Torvalds

    Jia He
     
  • Some of our servers spend significant time at kernel boot initializing
    memory block sysfs directories and then creating symlinks between them and
    the corresponding nodes. The slowness happens because the machines get
    stuck with the smallest supported memory block size on x86 (128M), which
    results in 16,288 directories to cover the 2T of installed RAM. The
    search for each memory block is noticeable even with commit 4fb6eabf1037
    ("drivers/base/memory.c: cache memory blocks in xarray to accelerate
    lookup").

    Commit 078eb6aa50dc ("x86/mm/memory_hotplug: determine block size based on
    the end of boot memory") chooses the block size based on alignment with
    memory end. That addresses hotplug failures in qemu guests, but for bare
    metal systems whose memory end isn't aligned to even the smallest size, it
    leaves them at 128M.

    Make kernels that aren't running on a hypervisor use the largest supported
    size (2G) to minimize overhead on big machines. Kernel boot goes 7%
    faster on the aforementioned servers, shaving off half a second.
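    The selection logic can be sketched roughly as follows (a simplified
    userspace model; the helper names and the hypervisor check are stand-ins,
    not the actual x86 code):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model of the block-size choice on x86-64 (sizes real,
 * logic simplified): bare-metal kernels take the maximum (2G) to keep
 * sysfs directory counts down, while hypervisor guests keep the
 * end-of-memory alignment probe so hotplug of oddly sized regions works. */
#define MIN_MEMORY_BLOCK_SIZE (128UL << 20)   /* 128M */
#define MAX_BLOCK_SIZE        (2UL << 30)     /* 2G */

/* Largest power-of-two block size that divides the end of boot memory. */
static unsigned long probe_block_size(unsigned long boot_mem_end)
{
    for (unsigned long bz = MAX_BLOCK_SIZE; bz > MIN_MEMORY_BLOCK_SIZE; bz >>= 1)
        if (boot_mem_end % bz == 0)
            return bz;
    return MIN_MEMORY_BLOCK_SIZE;
}

static unsigned long choose_block_size(bool on_hypervisor,
                                       unsigned long boot_mem_end)
{
    if (!on_hypervisor)
        return MAX_BLOCK_SIZE;      /* the patch's new bare-metal path */
    return probe_block_size(boot_mem_end);
}
```

    A 2T bare-metal machine thus gets 1,024 memory blocks instead of 16,288,
    which is where the boot-time saving comes from.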

    [daniel.m.jordan@oracle.com: v3]
    Link: http://lkml.kernel.org/r/20200714205450.945834-1-daniel.m.jordan@oracle.com

    Signed-off-by: Daniel Jordan
    Signed-off-by: Andrew Morton
    Acked-by: David Hildenbrand
    Cc: Andy Lutomirski
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: Michal Hocko
    Cc: Pavel Tatashin
    Cc: Peter Zijlstra
    Cc: Steven Sistare
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20200609225451.3542648-1-daniel.m.jordan@oracle.com
    Signed-off-by: Linus Torvalds

    Daniel Jordan
     
  • Fix W=1 compile warnings (invalid kerneldoc):

    mm/mmu_notifier.c:187: warning: Function parameter or member 'interval_sub' not described in 'mmu_interval_read_bgin'
    mm/mmu_notifier.c:708: warning: Function parameter or member 'subscription' not described in 'mmu_notifier_registr'
    mm/mmu_notifier.c:708: warning: Excess function parameter 'mn' description in 'mmu_notifier_register'
    mm/mmu_notifier.c:880: warning: Function parameter or member 'subscription' not described in 'mmu_notifier_put'
    mm/mmu_notifier.c:880: warning: Excess function parameter 'mn' description in 'mmu_notifier_put'
    mm/mmu_notifier.c:982: warning: Function parameter or member 'ops' not described in 'mmu_interval_notifier_insert'

    Signed-off-by: Krzysztof Kozlowski
    Signed-off-by: Andrew Morton
    Reviewed-by: Jason Gunthorpe
    Link: http://lkml.kernel.org/r/20200728171109.28687-4-krzk@kernel.org
    Signed-off-by: Linus Torvalds

    Krzysztof Kozlowski
     
  • The current_gfp_context() converts a number of PF_MEMALLOC_* per-process
    flags into the corresponding GFP_* flags for memory allocation. In that
    function, current->flags is accessed 3 times. That may lead to duplicated
    access of the same memory location.

    This is not usually a problem with minimal debug config options on as the
    compiler can optimize away the duplicated memory accesses. With most of
    the debug config options on, however, that may not be the case. For
    example, the x86-64 object size of the __need_fs_reclaim() in a debug
    kernel that calls current_gfp_context() was 309 bytes. With this patch
    applied, the object size is reduced to 202 bytes. This is a saving of 107
    bytes and will probably be slightly faster too.

    Use READ_ONCE() to access current->flags to prevent the compiler from
    possibly accessing current->flags multiple times.
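    The shape of the fix can be illustrated in plain C (an illustrative
    userspace sketch with made-up flag values, not the kernel's
    implementation):

```c
#include <assert.h>

/* Sketch of the fix: read the flags word once through a volatile access
 * (the essence of READ_ONCE()) and derive everything from the local copy,
 * instead of dereferencing the shared location several times. All flag
 * values below are invented for the example. */
#define PF_MEMALLOC_NOIO  0x1u
#define PF_MEMALLOC_NOFS  0x2u
#define GFP_IO            0x10u
#define GFP_FS            0x20u

#define READ_ONCE_U(x) (*(const volatile unsigned int *)&(x))

static unsigned int gfp_from_pflags(const unsigned int *pflags)
{
    unsigned int flags = READ_ONCE_U(*pflags);  /* single load */
    unsigned int gfp = GFP_IO | GFP_FS;

    if (flags & PF_MEMALLOC_NOIO)
        gfp &= ~(GFP_IO | GFP_FS);   /* NOIO forbids both I/O and FS */
    else if (flags & PF_MEMALLOC_NOFS)
        gfp &= ~GFP_FS;              /* NOFS forbids FS reclaim only */
    return gfp;
}
```

    Because every test operates on the local `flags`, the compiler cannot be
    forced into re-reading the shared word, regardless of debug options.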

    Signed-off-by: Waiman Long
    Signed-off-by: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Mathieu Desnoyers
    Cc: Michel Lespinasse
    Link: http://lkml.kernel.org/r/20200618212936.9776-1-longman@redhat.com
    Signed-off-by: Linus Torvalds

    Waiman Long
     
  • The routine cma_init_reserved_areas is designed to activate all
    reserved cma areas. It quits when it first encounters an error.
    This can leave some areas in a state where they are reserved but
    not activated. There is no feedback to code which performed the
    reservation. Attempting to allocate memory from areas in such a
    state will result in a BUG.

    Modify cma_init_reserved_areas to always attempt to activate all areas.
    The called routine, cma_activate_area, is responsible for leaving the
    area in a valid state. Since no one makes active use of the returned
    error codes, change the routine to void.
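    The new control flow can be sketched as follows (hypothetical types and
    names, not the kernel source):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of the change: iterate over every reserved area and
 * let the activation helper leave a failed area in a safe, deactivated
 * state, instead of aborting the loop and leaving later areas
 * reserved-but-unactivated. */
struct fake_cma {
    bool reserved;
    bool activated;
    bool activate_will_fail;   /* test knob standing in for a real failure */
};

static void fake_cma_activate_area(struct fake_cma *cma)
{
    if (cma->activate_will_fail) {
        cma->reserved = false;   /* leave the area in a valid, inert state */
        cma->activated = false;
        return;
    }
    cma->activated = true;
}

static void fake_cma_init_reserved_areas(struct fake_cma *areas, int n)
{
    for (int i = 0; i < n; i++)          /* no early return on failure */
        fake_cma_activate_area(&areas[i]);
}
```

    The key property is that a failure in one area no longer prevents the
    areas after it from being activated.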

    How to reproduce: This example uses kernelcore, hugetlb and cma
    as an easy way to reproduce. However, this is a more general cma
    issue.

    Two node x86 VM 16GB total, 8GB per node
    Kernel command line parameters, kernelcore=4G hugetlb_cma=8G
    Related boot time messages,
    hugetlb_cma: reserve 8192 MiB, up to 4096 MiB per node
    cma: Reserved 4096 MiB at 0x0000000100000000
    hugetlb_cma: reserved 4096 MiB on node 0
    cma: Reserved 4096 MiB at 0x0000000300000000
    hugetlb_cma: reserved 4096 MiB on node 1
    cma: CMA area hugetlb could not be activated

    # echo 8 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages

    BUG: kernel NULL pointer dereference, address: 0000000000000000
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 0 P4D 0
    Oops: 0000 [#1] SMP PTI
    ...
    Call Trace:
    bitmap_find_next_zero_area_off+0x51/0x90
    cma_alloc+0x1a5/0x310
    alloc_fresh_huge_page+0x78/0x1a0
    alloc_pool_huge_page+0x6f/0xf0
    set_max_huge_pages+0x10c/0x250
    nr_hugepages_store_common+0x92/0x120
    ? __kmalloc+0x171/0x270
    kernfs_fop_write+0xc1/0x1a0
    vfs_write+0xc7/0x1f0
    ksys_write+0x5f/0xe0
    do_syscall_64+0x4d/0x90
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Fixes: c64be2bb1c6e ("drivers: add Contiguous Memory Allocator")
    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Reviewed-by: Roman Gushchin
    Acked-by: Barry Song
    Cc: Marek Szyprowski
    Cc: Michal Nazarewicz
    Cc: Kyungmin Park
    Cc: Joonsoo Kim
    Cc:
    Link: http://lkml.kernel.org/r/20200730163123.6451-1-mike.kravetz@oracle.com
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Once we enable CMA_DEBUGFS, we get the following errors: directory
    'cma-hugetlb' with parent 'cma' already present.

    We should have different names for different CMA areas.

    Signed-off-by: Barry Song
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Acked-by: Roman Gushchin
    Link: http://lkml.kernel.org/r/20200616223131.33828-3-song.bao.hua@hisilicon.com
    Signed-off-by: Linus Torvalds

    Barry Song
     
  • Patch series "mm: fix the names of general cma and hugetlb cma", v2.

    The current CMA code only works when users pass a const string as the
    name parameter; we need to fix the way names are handled in CMA. In
    addition, to avoid name conflicts after enabling CMA_DEBUGFS, each
    hugetlb CMA area should get a different name.

    This patch (of 2):

    If users pass a name stored on the stack, the current code ends up saving
    a dangling pointer. If users pass no name (NULL), kasprintf() will always
    return NULL because we are still at an early boot stage, which means
    cma_init_reserved_mem() will return -ENOMEM whenever the name parameter
    is NULL.
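    One way to sketch the fix (with invented types, sizes and names) is to
    give the cma struct its own name storage and synthesize a default when
    none is supplied, rather than saving the caller's pointer or relying on
    kasprintf() at early boot:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Illustrative sketch only: copy the caller's name into storage owned by
 * the cma struct, or synthesize "cma<N>" when no name is given. The buffer
 * size and struct layout are made up for the example. */
#define FAKE_CMA_NAME_LEN 64

struct fake_cma_area {
    char name[FAKE_CMA_NAME_LEN];
};

static void fake_cma_set_name(struct fake_cma_area *cma,
                              const char *name, int index)
{
    if (name)
        snprintf(cma->name, sizeof(cma->name), "%s", name);  /* own copy */
    else
        snprintf(cma->name, sizeof(cma->name), "cma%d", index);
}
```

    Because the struct owns the bytes, a name that lived on the caller's
    stack remains valid after the caller returns.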

    [natechancellor@gmail.com: return cma->name directly in cma_get_name]
    Link: https://github.com/ClangBuiltLinux/linux/issues/1063
    Link: http://lkml.kernel.org/r/20200623015840.621964-1-natechancellor@gmail.com

    Signed-off-by: Barry Song
    Signed-off-by: Nathan Chancellor
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Acked-by: Roman Gushchin
    Link: http://lkml.kernel.org/r/20200616223131.33828-2-song.bao.hua@hisilicon.com
    Signed-off-by: Linus Torvalds

    Barry Song
     
  • In some cases a CMA area cannot be activated, but cma_alloc() may still
    be called on it; the kernel then crashes with a NULL pointer dereference.

    Add a bitmap validity check in cma_alloc to avoid this issue.
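    The guard amounts to something like the following (a minimal sketch with
    made-up types, not the kernel source):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal sketch: if the area was never activated, its bitmap is NULL, so
 * bail out of the allocator early instead of dereferencing it. */
struct fake_cma2 {
    unsigned long *bitmap;   /* NULL when activation failed */
};

static void *fake_cma_alloc(struct fake_cma2 *cma, size_t count)
{
    if (!cma || !cma->bitmap || count == 0)
        return NULL;             /* area not usable: fail gracefully */
    /* ... a real allocator would scan the bitmap here ... */
    return cma->bitmap;          /* stand-in for an allocated page */
}
```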

    Signed-off-by: Jianqun Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Link: http://lkml.kernel.org/r/20200615010123.15596-1-jay.xu@rock-chips.com
    Signed-off-by: Linus Torvalds

    Jianqun Xu
     
  • Add the following new vmstat events, which will help in validating THP
    migration without split. Statistics reported through these new VM events
    will help in performance debugging.

    1. THP_MIGRATION_SUCCESS
    2. THP_MIGRATION_FAILURE
    3. THP_MIGRATION_SPLIT

    In addition, these new events also update the normal page migration
    statistics appropriately via PGMIGRATE_SUCCESS and PGMIGRATE_FAILURE.
    While here, also update the current trace event 'mm_migrate_pages' to
    accommodate the now-available THP statistics.

    [akpm@linux-foundation.org: s/hpage_nr_pages/thp_nr_pages/]
    [ziy@nvidia.com: v2]
    Link: http://lkml.kernel.org/r/C5E3C65C-8253-4638-9D3C-71A61858BB8B@nvidia.com
    [anshuman.khandual@arm.com: s/thp_nr_pages/hpage_nr_pages/]
    Link: http://lkml.kernel.org/r/1594287583-16568-1-git-send-email-anshuman.khandual@arm.com

    Signed-off-by: Anshuman Khandual
    Signed-off-by: Zi Yan
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Zi Yan
    Cc: John Hubbard
    Cc: Naoya Horiguchi
    Link: http://lkml.kernel.org/r/1594080415-27924-1-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
  • Since commit 3917c80280c93a7123f ("thp: change CoW semantics for
    anon-THP"), the CoW page fault path of THP has been rewritten and
    debug_cow is no longer used, so just remove it.

    Signed-off-by: Yang Shi
    Signed-off-by: Andrew Morton
    Reviewed-by: Zi Yan
    Cc: Kirill A. Shutemov
    Link: http://lkml.kernel.org/r/1592270980-116062-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • Add a migrate_vma_*() self test for mmap(MAP_SHARED) to verify that
    !vma_anonymous() ranges won't be migrated.

    Signed-off-by: Ralph Campbell
    Signed-off-by: Andrew Morton
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Christoph Hellwig
    Cc: Jason Gunthorpe
    Cc: "Bharata B Rao"
    Cc: Shuah Khan
    Link: http://lkml.kernel.org/r/20200710194840.7602-3-rcampbell@nvidia.com
    Link: http://lkml.kernel.org/r/20200709165711.26584-3-rcampbell@nvidia.com
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     
  • Patch series "mm/migrate: optimize migrate_vma_setup() for holes".

    A simple optimization for migrate_vma_*() when the source vma is not an
    anonymous vma and a new test case to exercise it.

    This patch (of 2):

    When migrating system memory to device private memory, if the source
    address range is a valid VMA range and there is no memory or a zero page,
    the source PFN array is marked as valid but with no PFN.

    This lets the device driver allocate private memory and clear it, then
    insert the new device private struct page into the CPU's page tables when
    migrate_vma_pages() is called. migrate_vma_pages() only inserts the new
    page if the VMA is an anonymous range.

    There is no point in telling the device driver to allocate device private
    memory and then not migrate the page. Instead, mark the source PFN array
    entries as not migrating to avoid this overhead.

    [rcampbell@nvidia.com: v2]
    Link: http://lkml.kernel.org/r/20200710194840.7602-2-rcampbell@nvidia.com

    Signed-off-by: Ralph Campbell
    Signed-off-by: Andrew Morton
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Christoph Hellwig
    Cc: Jason Gunthorpe
    Cc: "Bharata B Rao"
    Cc: Shuah Khan
    Link: http://lkml.kernel.org/r/20200710194840.7602-1-rcampbell@nvidia.com
    Link: http://lkml.kernel.org/r/20200709165711.26584-1-rcampbell@nvidia.com
    Link: http://lkml.kernel.org/r/20200709165711.26584-2-rcampbell@nvidia.com
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     
  • Commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
    synchronization") requires callers of huge_pte_alloc to hold i_mmap_rwsem
    in at least read mode. This is because the explicit locking in
    huge_pmd_share (called by huge_pte_alloc) was removed. When restructuring
    the code, the call to huge_pte_alloc in the else block at the beginning of
    hugetlb_fault was missed.

    Unfortunately, that else clause is exercised when there is no page table
    entry. This will likely lead to a call to huge_pmd_share. If
    huge_pmd_share thinks pmd sharing is possible, it will traverse the
    mapping tree (i_mmap) without holding i_mmap_rwsem. If someone else is
    modifying the tree, bad things such as addressing exceptions or worse
    could happen.

    Simply remove the else clause. It should have been removed previously.
    The code following the else will call huge_pte_alloc with the appropriate
    locking.

    To prevent this type of issue in the future, add routines to assert that
    i_mmap_rwsem is held, and call these routines in huge pmd sharing
    routines.
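    The idea behind the assertion helpers can be modeled in userspace like
    this (names and types invented; the kernel would warn or oops rather
    than set a flag):

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace sketch of the new assertion pattern: huge-pmd-sharing paths
 * call an assert that the mapping's rwsem is held, so a caller that missed
 * the lock fails loudly in testing instead of racing on the i_mmap tree. */
struct fake_mapping {
    bool rwsem_held;   /* stand-in for rwsem_is_locked(&i_mmap_rwsem) */
};

static bool lock_assert_tripped;

static void fake_i_mmap_assert_locked(struct fake_mapping *m)
{
    if (!m->rwsem_held)
        lock_assert_tripped = true;   /* kernel would WARN/BUG here */
}

static void fake_huge_pmd_share(struct fake_mapping *m)
{
    fake_i_mmap_assert_locked(m);
    /* ... walk the i_mmap interval tree ... */
}
```

    The value of the assertion is that a locking mistake becomes an
    immediate, reproducible failure instead of a rare addressing exception.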

    Fixes: c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
    Suggested-by: Matthew Wilcox
    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K.V"
    Cc: Andrea Arcangeli
    Cc: "Kirill A.Shutemov"
    Cc: Davidlohr Bueso
    Cc: Prakash Sangappa
    Cc:
    Link: http://lkml.kernel.org/r/e670f327-5cf9-1959-96e4-6dc7cc30d3d5@oracle.com
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • syzbot found issues with having hugetlbfs on a union/overlay as reported
    in [1]. Due to the limitations (no write) and special functionality of
    hugetlbfs, it does not work well in filesystem stacking. There are no
    know use cases for hugetlbfs stacking. Rather than making modifications
    to get hugetlbfs working in such environments, simply prevent stacking.

    [1] https://lore.kernel.org/linux-mm/000000000000b4684e05a2968ca6@google.com/

    Reported-by: syzbot+d6ec23007e951dadf3de@syzkaller.appspotmail.com
    Suggested-by: Amir Goldstein
    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Acked-by: Miklos Szeredi
    Cc: Al Viro
    Cc: Matthew Wilcox
    Cc: Colin Walters
    Link: http://lkml.kernel.org/r/80f869aa-810d-ef6c-8888-b46cee135907@oracle.com
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • When the OOM killer finds a victim and tries to kill it, if the victim
    is already exiting, the task's mm will be NULL and no process will be
    killed. But dump_header() has already been executed by then, so it looks
    strange to dump so much information without killing a process. We should
    show some helpful information indicating why this happens.

    Suggested-by: David Rientjes
    Signed-off-by: Yafang Shao
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Qian Cai
    Link: http://lkml.kernel.org/r/20200721010127.17238-1-laoar.shao@gmail.com
    Signed-off-by: Linus Torvalds

    Yafang Shao
     
  • The exported value includes oom_score_adj, so the range is not [0, 1000]
    as described in the previous section but rather [0, 2000]. Mention that
    fact explicitly.

    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Cc: Jonathan Corbet
    Cc: David Rientjes
    Cc: Yafang Shao
    Link: http://lkml.kernel.org/r/20200709062603.18480-2-mhocko@kernel.org
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • There are at least two notes in the oom section. The 3% discount for root
    processes is gone since d46078b28889 ("mm, oom: remove 3% bonus for
    CAP_SYS_ADMIN processes").

    Likewise, children of the selected oom victim are no longer sacrificed
    since bbbe48029720 ("mm, oom: remove 'prefer children over parent'
    heuristic").

    Drop both of them.

    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Cc: Jonathan Corbet
    Cc: David Rientjes
    Cc: Yafang Shao
    Link: http://lkml.kernel.org/r/20200709062603.18480-1-mhocko@kernel.org
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Recently we found an issue in our production environment: when memcg oom
    is triggered, the oom killer doesn't choose the process with the largest
    resident memory but instead chooses the first scanned process. Note that
    all processes in this memcg have the same oom_score_adj, so the oom
    killer should choose the process with the largest resident memory.

    Below is part of the oom info, which is enough to analyze this issue.
    [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
    [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
    [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
    [...]
    [7516987.983293] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
    [7516987.983510] [ 5740] 0 5740 257 1 32768 0 -998 pause
    [7516987.983574] [58804] 0 58804 4594 771 81920 0 -998 entry_point.bas
    [7516987.983577] [58908] 0 58908 7089 689 98304 0 -998 cron
    [7516987.983580] [58910] 0 58910 16235 5576 163840 0 -998 supervisord
    [7516987.983590] [59620] 0 59620 18074 1395 188416 0 -998 sshd
    [7516987.983594] [59622] 0 59622 18680 6679 188416 0 -998 python
    [7516987.983598] [59624] 0 59624 1859266 5161 548864 0 -998 odin-agent
    [7516987.983600] [59625] 0 59625 707223 9248 983040 0 -998 filebeat
    [7516987.983604] [59627] 0 59627 416433 64239 774144 0 -998 odin-log-agent
    [7516987.983607] [59631] 0 59631 180671 15012 385024 0 -998 python3
    [7516987.983612] [61396] 0 61396 791287 3189 352256 0 -998 client
    [7516987.983615] [61641] 0 61641 1844642 29089 946176 0 -998 client
    [7516987.983765] [ 9236] 0 9236 2642 467 53248 0 -998 php_scanner
    [7516987.983911] [42898] 0 42898 15543 838 167936 0 -998 su
    [7516987.983915] [42900] 1000 42900 3673 867 77824 0 -998 exec_script_vr2
    [7516987.983918] [42925] 1000 42925 36475 19033 335872 0 -998 python
    [7516987.983921] [57146] 1000 57146 3673 848 73728 0 -998 exec_script_J2p
    [7516987.983925] [57195] 1000 57195 186359 22958 491520 0 -998 python2
    [7516987.983928] [58376] 1000 58376 275764 14402 290816 0 -998 rosmaster
    [7516987.983931] [58395] 1000 58395 155166 4449 245760 0 -998 rosout
    [7516987.983935] [58406] 1000 58406 18285584 3967322 37101568 0 -998 data_sim
    [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
    [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
    [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

    We can see that the first scanned process, 5740 (pause), was killed even
    though its rss is only one page. That is because, when we calculate the
    oom badness in oom_badness(), we always ignore negative points and
    convert all of them to 1. Since the oom_score_adj of all the processes
    in this targeted memcg has the same value, -998, the points of these
    processes are all negative. As a result, the first scanned process is
    killed.

    The oom_score_adj (-998) in this memcg is set by kubelet, because it is
    a Guaranteed pod, which has higher priority and should be protected from
    being killed by a system oom.

    To fix this issue, we should make the calculation of oom points more
    accurate. We can achieve that by converting chosen_points from 'unsigned
    long' to 'long'.
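    The arithmetic problem is easy to demonstrate in isolation
    (self-contained example with made-up page counts, not the kernel's
    oom_badness()):

```c
#include <assert.h>

/* Illustration of the bug class: when the oom_score_adj discount exceeds a
 * task's rss, the old code clamped the negative result to 1, so every task
 * in the memcg ties at 1 point and the first scanned one "wins". Keeping
 * the points signed preserves the real ordering. */
static unsigned long badness_clamped(long rss_pages, long adj_pages)
{
    long p = rss_pages + adj_pages;
    return p <= 0 ? 1UL : (unsigned long)p;   /* old clamp-to-1 behaviour */
}

static long badness_signed(long rss_pages, long adj_pages)
{
    return rss_pages + adj_pages;   /* may legitimately be negative */
}
```

    With signed points the largest-rss task still scores highest even when
    every score is below zero, which is exactly the case hit by the -998
    Guaranteed pods above.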

    [cai@lca.pw: reported an issue in the previous version]
    [mhocko@suse.com: fixed the issue reported by Cai]
    [mhocko@suse.com: add the comment in proc_oom_score()]
    [laoar.shao@gmail.com: v3]
    Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.com

    Signed-off-by: Yafang Shao
    Signed-off-by: Andrew Morton
    Tested-by: Naresh Kamboju
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Qian Cai
    Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.com
    Signed-off-by: Linus Torvalds

    Yafang Shao
     
  • Change "interlave" to "interleave".

    Signed-off-by: Yanfei Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200810063454.9357-1-yanfei.xu@windriver.com
    Signed-off-by: Linus Torvalds

    Yanfei Xu
     
  • The previous implementation calls untagged_addr() before the error
    check; if the error check fails and returns EINVAL, the untagged_addr()
    call is just wasted work.
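    The reordering can be sketched as follows (simplified userspace model;
    the tag mask and function names are stand-ins for the real syscall
    path):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: validate the arguments first and only strip the
 * pointer tag once the call is known to proceed, so the failure path does
 * no wasted work. */
static int untag_calls;   /* counts how often the "expensive" step runs */

static uintptr_t fake_untagged_addr(uintptr_t addr)
{
    untag_calls++;
    return addr & ~(uintptr_t)(0xffULL << 56);  /* drop a top-byte tag */
}

static int fake_syscall(uintptr_t start, size_t len, uintptr_t *out)
{
    if (len == 0)               /* error check now comes first */
        return -EINVAL;
    *out = fake_untagged_addr(start);
    return 0;
}
```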

    Signed-off-by: Wenchao Hao
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200801090825.5597-1-haowenchao22@gmail.com
    Signed-off-by: Linus Torvalds

    Wenchao Hao