29 Jun, 2019

7 commits

  • 0-Day test system reported some OOM regressions for several THP
    (Transparent Huge Page) swap test cases. These regressions are bisected
    to 6861428921b5 ("block: always define BIO_MAX_PAGES as 256"). In the
    commit, BIO_MAX_PAGES is set to 256 even when THP swap is enabled. So the
    bio_alloc(gfp_flags, 512) in get_swap_bio() may fail when swapping out
    THP. That causes the OOM.

    As the patch description of 6861428921b5 ("block: always define
    BIO_MAX_PAGES as 256") says, THP swap should use a multi-page bvec to
    write a THP to swap space. So fix the issue by doing exactly that in
    get_swap_bio().
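
    As an illustration, below is a minimal sketch of that idea, written
    against the 5.2-era block and swap helpers (bio_alloc(), bio_add_page(),
    map_swap_page(), hpage_nr_pages()); the exact upstream change may differ
    in detail:

    struct bio *bio = bio_alloc(gfp_flags, 1);

    if (bio) {
        struct block_device *bdev;

        bio->bi_iter.bi_sector = map_swap_page(page, &bdev);
        bio_set_dev(bio, bdev);
        bio->bi_iter.bi_sector <<= PAGE_SHIFT - 9;
        bio->bi_end_io = end_io;

        /* one multi-page bvec covers the whole THP, so a single bio
         * vector is enough even with BIO_MAX_PAGES == 256 */
        bio_add_page(bio, page, PAGE_SIZE * hpage_nr_pages(page), 0);
    }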

    BTW: I remember checking the THP swap code when 6861428921b5 ("block:
    always define BIO_MAX_PAGES as 256") was merged and thinking that the
    THP swap code didn't need to be changed. But apparently, I was wrong; I
    should have done this at that time.

    Link: http://lkml.kernel.org/r/20190624075515.31040-1-ying.huang@intel.com
    Fixes: 6861428921b5 ("block: always define BIO_MAX_PAGES as 256")
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Ming Lei
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Daniel Jordan
    Cc: Jens Axboe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • gcc gets confused in pcpu_get_vm_areas() because there are too many
    branches that affect whether 'lva' was initialized before it gets used:

    mm/vmalloc.c: In function 'pcpu_get_vm_areas':
    mm/vmalloc.c:991:4: error: 'lva' may be used uninitialized in this function [-Werror=maybe-uninitialized]
    insert_vmap_area_augment(lva, &va->rb_node,
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    &free_vmap_area_root, &free_vmap_area_list);
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    mm/vmalloc.c:916:20: note: 'lva' was declared here
    struct vmap_area *lva;
    ^~~

    Add an initialization to NULL, and check whether it has changed before
    the first use.
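
    The shape of the fix, sketched with the identifiers from the warning
    above (the exact placement inside mm/vmalloc.c may differ):

    struct vmap_area *lva = NULL;

    /* ... the splitting logic assigns lva only for an NE_FIT_TYPE split ... */

    if (lva)    /* an NE_FIT_TYPE split actually happened */
        insert_vmap_area_augment(lva, &va->rb_node,
                                 &free_vmap_area_root, &free_vmap_area_list);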

    [akpm@linux-foundation.org: tweak comments]
    Link: http://lkml.kernel.org/r/20190618092650.2943749-1-arnd@arndb.de
    Fixes: 68ad4a330433 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
    Signed-off-by: Arnd Bergmann
    Reviewed-by: Uladzislau Rezki (Sony)
    Cc: Joel Fernandes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     
  • Currently the calculation of end_pfn can round the pfn number up to more
    than the actual maximum number of pfns, causing an Oops. Fix this by
    ensuring end_pfn is never more than max_pfn.

    This can easily be triggered on systems where end_pfn gets rounded up to
    more than max_pfn, using the idle-page stress-ng stress test:

    sudo stress-ng --idle-page 0

    BUG: unable to handle kernel paging request at 00000000000020d8
    #PF error: [normal kernel read fault]
    PGD 0 P4D 0
    Oops: 0000 [#1] SMP PTI
    CPU: 1 PID: 11039 Comm: stress-ng-idle- Not tainted 5.0.0-5-generic #6-Ubuntu
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
    RIP: 0010:page_idle_get_page+0xc8/0x1a0
    Code: 0f b1 0a 75 7d 48 8b 03 48 89 c2 48 c1 e8 33 83 e0 07 48 c1 ea 36 48 8d 0c 40 4c 8d 24 88 49 c1 e4 07 4c 03 24 d5 00 89 c3 be 8b 44 24 58 48 8d b8 80 a1 02 00 e8 07 d5 77 00 48 8b 53 08 48
    RSP: 0018:ffffafd7c672fde8 EFLAGS: 00010202
    RAX: 0000000000000005 RBX: ffffe36341fff700 RCX: 000000000000000f
    RDX: 0000000000000284 RSI: 0000000000000275 RDI: 0000000001fff700
    RBP: ffffafd7c672fe00 R08: ffffa0bc34056410 R09: 0000000000000276
    R10: ffffa0bc754e9b40 R11: ffffa0bc330f6400 R12: 0000000000002080
    R13: ffffe36341fff700 R14: 0000000000080000 R15: ffffa0bc330f6400
    FS: 00007f0ec1ea5740(0000) GS:ffffa0bc7db00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000000020d8 CR3: 0000000077d68000 CR4: 00000000000006e0
    Call Trace:
    page_idle_bitmap_write+0x8c/0x140
    sysfs_kf_bin_write+0x5c/0x70
    kernfs_fop_write+0x12e/0x1b0
    __vfs_write+0x1b/0x40
    vfs_write+0xab/0x1b0
    ksys_write+0x55/0xc0
    __x64_sys_write+0x1a/0x20
    do_syscall_64+0x5a/0x110
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
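
    A minimal sketch of the clamp, assuming the bitmap walk in
    mm/page_idle.c (variable names are illustrative):

    /* never walk pfns beyond the last valid one */
    end_pfn = pfn + count * BITS_PER_BYTE;
    if (end_pfn > max_pfn)
        end_pfn = max_pfn;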

    Link: http://lkml.kernel.org/r/20190618124352.28307-1-colin.king@canonical.com
    Fixes: 33c3fc71c8cf ("mm: introduce idle page tracking")
    Signed-off-by: Colin Ian King
    Reviewed-by: Andrew Morton
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Mike Rapoport
    Cc: Mel Gorman
    Cc: Stephen Rothwell
    Cc: Andrey Ryabinin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Colin Ian King
     
  • In dump_oom_summary(), oc->constraint is used to show oom_constraint_text,
    but it is never set beforehand, so it always holds the default value 0.
    Initialize it before dumping the summary.

    Below is the output when a memcg OOM occurs,

    before this patch:
    oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null), cpuset=/,mems_allowed=0,oom_memcg=/foo,task_memcg=/foo,task=bash,pid=7997,uid=0

    after this patch:
    oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null), cpuset=/,mems_allowed=0,oom_memcg=/foo,task_memcg=/foo,task=bash,pid=13681,uid=0
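
    A minimal sketch of the fix, assuming constrained_alloc() (the existing
    helper in mm/oom_kill.c) is used to fill in the field; the exact plumbing
    in the upstream patch may differ:

    /* populate oc->constraint before the OOM header is dumped */
    oc->constraint = constrained_alloc(oc);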

    Link: http://lkml.kernel.org/r/1560522038-15879-1-git-send-email-laoar.shao@gmail.com
    Fixes: ef8444ea01d7 ("mm, oom: reorganize the oom report in dump_header")
    Signed-off-by: Yafang Shao
    Acked-by: Michal Hocko
    Cc: Wind Yu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yafang Shao
     
  • madvise(MADV_SOFT_OFFLINE) often returns -EBUSY when calling soft offline
    for hugepages with overcommitting enabled. That is caused by suboptimal
    handling in the current soft-offline code. See the following part:

    ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
                        MIGRATE_SYNC, MR_MEMORY_FAILURE);
    if (ret) {
        ...
    } else {
        /*
         * We set PG_hwpoison only when the migration source hugepage
         * was successfully dissolved, because otherwise hwpoisoned
         * hugepage remains on free hugepage list, then userspace will
         * find it as SIGBUS by allocation failure. That's not expected
         * in soft-offlining.
         */
        ret = dissolve_free_huge_page(page);
        if (!ret) {
            if (set_hwpoison_free_buddy_page(page))
                num_poisoned_pages_inc();
        }
    }
    return ret;

    Here dissolve_free_huge_page() returns -EBUSY if the migration source page
    was freed into buddy in migrate_pages(), but even in that case we still
    have a chance that set_hwpoison_free_buddy_page() succeeds. So the current
    code gives up offlining too early.

    dissolve_free_huge_page() checks that a given hugepage is suitable for
    dissolving, where we should return success for !PageHuge() case because
    the given hugepage is considered as already dissolved.

    This change also affects other callers of dissolve_free_huge_page(), which
    are cleaned up together.
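
    A hedged sketch of the dissolve_free_huge_page() part of the change; the
    rest of the function is elided:

    int dissolve_free_huge_page(struct page *page)
    {
        /*
         * Already dissolved into buddy pages: report success so the
         * caller goes on to set_hwpoison_free_buddy_page().
         */
        if (!PageHuge(page))
            return 0;

        /* ... existing dissolve logic, unchanged ... */
    }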

    [n-horiguchi@ah.jp.nec.com: v3]
    Link: http://lkml.kernel.org/r/1560761476-4651-3-git-send-email-n-horiguchi@ah.jp.nec.com
    Link: http://lkml.kernel.org/r/1560154686-18497-3-git-send-email-n-horiguchi@ah.jp.nec.com
    Fixes: 6bc9b56433b76 ("mm: fix race on soft-offlining")
    Signed-off-by: Naoya Horiguchi
    Reported-by: Chen, Jerry T
    Tested-by: Chen, Jerry T
    Reviewed-by: Mike Kravetz
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: Xishi Qiu
    Cc: "Chen, Jerry T"
    Cc: "Zhuo, Qiuxu"
    Cc: [4.19+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • The pass/fail of soft offline should be judged by checking whether the
    raw error page was finally contained or not (i.e. the result of
    set_hwpoison_free_buddy_page()), but the current code does not work like
    that. This can lead to misjudging the result when
    set_hwpoison_free_buddy_page() fails.

    Without this fix, there are cases where madvise(MADV_SOFT_OFFLINE) may
    not offline the original page and will not return an error.
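
    A minimal sketch of the intended behaviour; the exact call site and error
    code in mm/memory-failure.c are assumptions:

    ret = dissolve_free_huge_page(page);
    if (!ret) {
        if (set_hwpoison_free_buddy_page(page))
            num_poisoned_pages_inc();
        else
            ret = -EBUSY;   /* the raw error page was not contained */
    }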

    Link: http://lkml.kernel.org/r/1560154686-18497-2-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Fixes: 6bc9b56433b76 ("mm: fix race on soft-offlining")
    Reviewed-by: Mike Kravetz
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: Xishi Qiu
    Cc: "Chen, Jerry T"
    Cc: "Zhuo, Qiuxu"
    Cc: [4.19+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • mpol_rebind_nodemask() is called for MPOL_BIND and MPOL_INTERLEAVE
    mempolicies when the task's cpuset's mems_allowed changes. For
    policies created without MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES,
    it works by remapping the policy's allowed nodes (stored in v.nodes)
    using the previous value of mems_allowed (stored in
    w.cpuset_mems_allowed) as the domain of the map and the new mems_allowed
    (passed as nodes) as the range of the map (see the comment of
    bitmap_remap() for details).

    The result of remapping is stored back as policy's nodemask in v.nodes,
    and the new value of mems_allowed should be stored in
    w.cpuset_mems_allowed to facilitate the next rebind, if it happens.

    However, 213980c0f23b ("mm, mempolicy: simplify rebinding mempolicies
    when updating cpusets") introduced a bug where the result of remapping
    is stored in w.cpuset_mems_allowed instead. Thus, a mempolicy's
    allowed nodes can evolve in an unexpected way after a series of
    rebinding due to cpuset mems_allowed changes, possibly binding to a
    wrong node or a smaller number of nodes which may e.g. overload them.
    This patch fixes the bug so rebinding again works as intended.
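
    A sketch of the corrected assignments in mpol_rebind_nodemask(), assuming
    the union field names used at the time (v.nodes, w.cpuset_mems_allowed):

    nodemask_t tmp;

    nodes_remap(tmp, pol->v.nodes, pol->w.cpuset_mems_allowed, *nodes);
    pol->v.nodes = tmp;                   /* the remapped policy nodes     */
    pol->w.cpuset_mems_allowed = *nodes;  /* remember mems for next rebind */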

    [vbabka@suse.cz: new changelog]
    Link: http://lkml.kernel.org/r/ef6a69c6-c052-b067-8f2c-9d615c619bb9@suse.cz
    Link: http://lkml.kernel.org/r/1558768043-23184-1-git-send-email-zhongjiang@huawei.com
    Fixes: 213980c0f23b ("mm, mempolicy: simplify rebinding mempolicies when updating cpusets")
    Signed-off-by: zhong jiang
    Reviewed-by: Vlastimil Babka
    Cc: Oscar Salvador
    Cc: Anshuman Khandual
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Ralph Campbell
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     

19 Jun, 2019

4 commits

  • Based on 2 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license version 2 as
    published by the free software foundation

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license version 2 as
    published by the free software foundation #

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 4122 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Enrico Weigelt
    Reviewed-by: Kate Stewart
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • Based on 1 normalized pattern(s):

    this work is licensed under the terms of the gnu gpl version 2 see
    the copying file in the top level directory

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 35 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Kate Stewart
    Reviewed-by: Enrico Weigelt
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190604081206.797835076@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • Based on 1 normalized pattern(s):

    this work is licensed under the terms of the gnu gpl version 2

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 48 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Reviewed-by: Enrico Weigelt
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190604081204.624030236@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • Based on 1 normalized pattern(s):

    this file is released under the gpl v2

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 3 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Alexios Zavras
    Reviewed-by: Allison Randal
    Reviewed-by: Armijn Hemel
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190602204655.103854853@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

17 Jun, 2019

1 commit

  • Pull x86 fixes from Thomas Gleixner:
    "The accumulated fixes from this and last week:

    - Fix vmalloc TLB flush and map range calculations which lead to
    stale TLBs, spurious faults and other hard to diagnose issues.

    - Use fault_in_pages_writeable() for prefaulting the user stack in the
    FPU code as it's less fragile than the current solution

    - Use the PF_KTHREAD flag when checking for a kernel thread instead
    of current->mm as the latter can give the wrong answer due to
    use_mm()

    - Compute the vmemmap size correctly for KASLR and 5-Level paging.
    Otherwise this can end up with a way too small vmemmap area.

    - Make KASAN and 5-level paging work again by making sure that all
    invalid bits are masked out when computing the P4D offset. This
    worked before but got broken recently when the LDT remap area was
    moved.

    - Prevent a NULL pointer dereference in the resource control code
    which can be triggered with certain mount options when the
    requested resource is not available.

    - Enforce ordering of microcode loading vs. perf initialization on
    secondary CPUs. Otherwise perf tries to access a non-existing MSR
    as the boot CPU marked it as available.

    - Don't stop the resource control group walk early otherwise the
    control bitmaps are not updated correctly and become inconsistent.

    - Unbreak kgdb by returning 0 on success from
    kgdb_arch_set_breakpoint() instead of an error code.

    - Add more Icelake CPU model defines so depending changes can be
    queued in other trees"

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/microcode, cpuhotplug: Add a microcode loader CPU hotplug callback
    x86/kasan: Fix boot with 5-level paging and KASAN
    x86/fpu: Don't use current->mm to check for a kthread
    x86/kgdb: Return 0 from kgdb_arch_set_breakpoint()
    x86/resctrl: Prevent NULL pointer dereference when local MBM is disabled
    x86/resctrl: Don't stop walking closids when a locksetup group is found
    x86/fpu: Update kernel's FPU state before using for the fsave header
    x86/mm/KASLR: Compute the size of the vmemmap section properly
    x86/fpu: Use fault_in_pages_writeable() for pre-faulting
    x86/CPU: Add more Icelake model numbers
    mm/vmalloc: Avoid rare case of flushing TLB with weird arguments
    mm/vmalloc: Fix calculation of direct map addr range

    Linus Torvalds
     

14 Jun, 2019

9 commits

  • Logan noticed that devm_memremap_pages_release() kills the percpu_ref,
    drops all the page references that were acquired at init, and then
    immediately proceeds to unplug (arch_remove_memory()) the backing pages
    for the pagemap. If for some reason device shutdown actually collides
    with a busy / elevated-ref-count page then arch_remove_memory() should
    be deferred until after that reference is dropped.

    As it stands the "wait for last page ref drop" happens *after*
    devm_memremap_pages_release() returns, which is obviously too late and
    can lead to crashes.

    Fix this situation by assigning the responsibility to wait for the
    percpu_ref to go idle to devm_memremap_pages() with a new ->cleanup()
    callback. Implement the new cleanup callback for all
    devm_memremap_pages() users: pmem, devdax, hmm, and p2pdma.
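
    A hedged sketch of the intended teardown ordering, using the 5.2-era
    kernel/memremap.c names (->kill(), the new ->cleanup(),
    for_each_device_pfn()); the actual unplug steps are elided:

    static void devm_memremap_pages_release(void *data)
    {
        struct dev_pagemap *pgmap = data;
        unsigned long pfn;

        /* 1. stop new references from being taken */
        pgmap->kill(pgmap->ref);

        /* 2. drop the page references acquired at init time */
        for_each_device_pfn(pfn, pgmap)
            put_page(pfn_to_page(pfn));

        /* 3. wait until the very last page reference is gone ... */
        pgmap->cleanup(pgmap->ref);

        /* 4. ... and only then arch_remove_memory() the backing pages */
    }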

    Link: http://lkml.kernel.org/r/155727339156.292046.5432007428235387859.stgit@dwillia2-desk3.amr.corp.intel.com
    Fixes: 41e94a851304 ("add devm_memremap_pages")
    Signed-off-by: Dan Williams
    Reported-by: Logan Gunthorpe
    Reviewed-by: Ira Weiny
    Reviewed-by: Logan Gunthorpe
    Cc: Bjorn Helgaas
    Cc: "Jérôme Glisse"
    Cc: Christoph Hellwig
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • There was the below bug report from Wu Fangsuo.

    On the CMA allocation path, isolate_migratepages_range() could isolate
    unevictable LRU pages and reclaim_clean_page_from_list() can try to
    reclaim them if they are clean file-backed pages.

    page:ffffffbf02f33b40 count:86 mapcount:84 mapping:ffffffc08fa7a810 index:0x24
    flags: 0x19040c(referenced|uptodate|arch_1|mappedtodisk|unevictable|mlocked)
    raw: 000000000019040c ffffffc08fa7a810 0000000000000024 0000005600000053
    raw: ffffffc009b05b20 ffffffc009b05b20 0000000000000000 ffffffc09bf3ee80
    page dumped because: VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page))
    page->mem_cgroup:ffffffc09bf3ee80
    ------------[ cut here ]------------
    kernel BUG at /home/build/farmland/adroid9.0/kernel/linux/mm/vmscan.c:1350!
    Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
    Modules linked in:
    CPU: 0 PID: 7125 Comm: syz-executor Tainted: G S 4.14.81 #3
    Hardware name: ASR AQUILAC EVB (DT)
    task: ffffffc00a54cd00 task.stack: ffffffc009b00000
    PC is at shrink_page_list+0x1998/0x3240
    LR is at shrink_page_list+0x1998/0x3240
    pc : [] lr : [] pstate: 60400045
    sp : ffffffc009b05940
    ..
    shrink_page_list+0x1998/0x3240
    reclaim_clean_pages_from_list+0x3c0/0x4f0
    alloc_contig_range+0x3bc/0x650
    cma_alloc+0x214/0x668
    ion_cma_allocate+0x98/0x1d8
    ion_alloc+0x200/0x7e0
    ion_ioctl+0x18c/0x378
    do_vfs_ioctl+0x17c/0x1780
    SyS_ioctl+0xac/0xc0

    Wu found it's due to commit ad6b67041a45 ("mm: remove SWAP_MLOCK in
    ttu"). Before that, unevictable pages went to cull_mlocked so we could
    not reach the VM_BUG_ON_PAGE line.

    To fix the issue, this patch filters out unevictable LRU pages from
    reclaim_clean_pages_from_list() in CMA.
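
    A sketch of the filter in reclaim_clean_pages_from_list() (mm/vmscan.c),
    with the new !PageUnevictable() test added to the existing conditions:

    list_for_each_entry_safe(page, next, page_list, lru) {
        if (page_is_file_cache(page) && !PageDirty(page) &&
            !__PageMovable(page) && !PageUnevictable(page)) {
            ClearPageActive(page);
            list_move(&page->lru, &clean_pages);
        }
    }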

    Link: http://lkml.kernel.org/r/20190524071114.74202-1-minchan@kernel.org
    Fixes: ad6b67041a45 ("mm: remove SWAP_MLOCK in ttu")
    Signed-off-by: Minchan Kim
    Reported-by: Wu Fangsuo
    Debugged-by: Wu Fangsuo
    Tested-by: Wu Fangsuo
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Pankaj Suryawanshi
    Cc: [4.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • When fixing the race conditions between the coredump and the mmap_sem
    holders outside the context of the process, we focused on
    mmget_not_zero()/get_task_mm() callers in 04f5866e41fb70 ("coredump: fix
    race condition between mmget_not_zero()/get_task_mm() and core
    dumping"), but those aren't the only cases where the mmap_sem can be
    taken outside of the context of the process as Michal Hocko noticed
    while backporting that commit to older -stable kernels.

    If mmgrab() is called in the context of the process, but then the
    mm_count reference is transferred outside the context of the process,
    that can also be a problem if the mmap_sem has to be taken for writing
    through that mm_count reference.

    khugepaged registration calls mmgrab() in the context of the process,
    but the mmap_sem for writing is taken later in the context of the
    khugepaged kernel thread.

    collapse_huge_page() after taking the mmap_sem for writing doesn't
    modify any vma, so it's not obvious that it could cause a problem to the
    coredump, but it happens to modify the pmd in a way that breaks an
    invariant that pmd_trans_huge_lock() relies upon. collapse_huge_page()
    needs the mmap_sem for writing just to block concurrent page faults that
    call pmd_trans_huge_lock().

    Specifically the invariant that "!pmd_trans_huge()" cannot become a
    "pmd_trans_huge()" doesn't hold while collapse_huge_page() runs.

    The coredump will call __get_user_pages() without mmap_sem for reading,
    which eventually can invoke a lockless page fault which will need a
    functional pmd_trans_huge_lock().

    So collapse_huge_page() needs to use mmget_still_valid() to check it's
    not running concurrently with the coredump... as long as the coredump
    can invoke page faults without holding the mmap_sem for reading.

    This has "Fixes: khugepaged" to facilitate backporting, but in my view
    it's more a bug in the coredump code that will eventually have to be
    rewritten to stop invoking page faults without the mmap_sem for reading.
    So the long term plan is still to drop all mmget_still_valid().
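
    A minimal sketch of the khugepaged side of the change; the exact failure
    code (SCAN_ANY_PROCESS here) and placement inside collapse_huge_page()
    are taken as assumptions:

    down_write(&mm->mmap_sem);

    result = SCAN_ANY_PROCESS;
    if (!mmget_still_valid(mm))
        goto out;       /* the coredump may be running concurrently */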

    Link: http://lkml.kernel.org/r/20190607161558.32104-1-aarcange@redhat.com
    Fixes: ba76149f47d8 ("thp: khugepaged")
    Signed-off-by: Andrea Arcangeli
    Reported-by: Michal Hocko
    Acked-by: Michal Hocko
    Acked-by: Kirill A. Shutemov
    Cc: Oleg Nesterov
    Cc: Jann Horn
    Cc: Hugh Dickins
    Cc: Mike Rapoport
    Cc: Mike Kravetz
    Cc: Peter Xu
    Cc: Jason Gunthorpe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • On a 64-bit machine the value of "vma->vm_end - vma->vm_start" may
    overflow a 32-bit int and become negative, so the result of
    "count >> PAGE_SHIFT" will be wrong. Change the local variable and the
    return value to unsigned long to fix the problem.

    Link: http://lkml.kernel.org/r/20190513023701.83056-1-swkhack@gmail.com
    Fixes: 0cf2f6f6dc60 ("mm: mlock: check against vma for actual mlock() size")
    Signed-off-by: swkhack
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    swkhack
     
  • A few new fields were added to mmu_gather to make TLB flush smarter for
    huge pages by telling what level of the page table has changed.

    __tlb_reset_range() is used to reset all of this page table state to
    unchanged; it is called by the TLB flush code for parallel mapping changes
    on the same range under a non-exclusive lock (i.e. read mmap_sem).

    Before commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in
    munmap"), the syscalls (e.g. MADV_DONTNEED, MADV_FREE) which may update
    PTEs in parallel don't remove page tables. But, the forementioned
    commit may do munmap() under read mmap_sem and free page tables. This
    may result in program hang on aarch64 reported by Jan Stancek. The
    problem could be reproduced by his test program with slightly modified
    below.

    ---8<---
    void *map_write_unmap(void *ptr)
    {
        char *map_address;
        int i, j;

        for (i = 0; i < num_iter; i++) {
            map_address = mmap(distant_area, (size_t) map_size,
                    PROT_WRITE | PROT_READ,
                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
            if (map_address == MAP_FAILED) {
                perror("mmap");
                exit(1);
            }

            for (j = 0; j < map_size; j++)
                map_address[j] = 'b';

            if (munmap(map_address, map_size) == -1) {
                perror("munmap");
                exit(1);
            }
        }

        return NULL;
    }

    void *dummy(void *ptr)
    {
        return NULL;
    }

    int main(void)
    {
        pthread_t thid[2];

        /* hint for mmap in map_write_unmap() */
        distant_area = mmap(0, DISTANT_MMAP_SIZE, PROT_WRITE | PROT_READ,
                MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
        munmap(distant_area, (size_t)DISTANT_MMAP_SIZE);
        distant_area += DISTANT_MMAP_SIZE / 2;

        while (1) {
            pthread_create(&thid[0], NULL, map_write_unmap, NULL);
            pthread_create(&thid[1], NULL, dummy, NULL);

            pthread_join(thid[0], NULL);
            pthread_join(thid[1], NULL);
        }
    }
    ---8<---

    The program may bring in parallel execution like below:

    t1                                         t2
    munmap(map_address)
      downgrade_write(&mm->mmap_sem);
      unmap_region()
      tlb_gather_mmu()
        inc_tlb_flush_pending(tlb->mm);
      free_pgtables()
        tlb->freed_tables = 1
        tlb->cleared_pmds = 1

                                               pthread_exit()
                                               madvise(thread_stack, 8M, MADV_DONTNEED)
                                                 zap_page_range()
                                                   tlb_gather_mmu()
                                                     inc_tlb_flush_pending(tlb->mm);

      tlb_finish_mmu()
        if (mm_tlb_flush_nested(tlb->mm))
          __tlb_reset_range()

    __tlb_reset_range() would reset the freed_tables and cleared_* bits, but
    this causes an inconsistency for munmap(), which does free page tables. As
    a result, some architectures, e.g. aarch64, may not flush the TLB as
    completely as expected, leaving stale TLB entries behind.

    Use a fullmm flush, since it yields much better performance on aarch64 and
    non-fullmm doesn't yield a significant difference on x86.
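
    A sketch of the change in tlb_finish_mmu(), following the description
    above; treat the exact flag handling as illustrative:

    if (mm_tlb_flush_nested(tlb->mm)) {
        /*
         * A concurrent unmap may have freed page tables; do a full-mm
         * flush instead of resetting the gathered range.
         */
        tlb->fullmm = 1;
        __tlb_reset_range(tlb);
        tlb->freed_tables = 1;
    }

    tlb_flush_mmu(tlb);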

    The original proposed fix came from Jan Stancek who mainly debugged this
    issue, I just wrapped up everything together.

    Jan's testing results:

    v5.2-rc2-24-gbec7550cca10
    --------------------------
             mean    stddev
    real   37.382     2.780
    user    1.420     0.078
    sys    54.658     1.855

    v5.2-rc2-24-gbec7550cca10 + "mm: mmu_gather: remove __tlb_reset_range() for force flush"
    -----------------------------------------------------------------------------------------
             mean    stddev
    real   37.119     2.105
    user    1.548     0.087
    sys    55.698     1.357

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/1558322252-113575-1-git-send-email-yang.shi@linux.alibaba.com
    Fixes: dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
    Signed-off-by: Yang Shi
    Signed-off-by: Jan Stancek
    Reported-by: Jan Stancek
    Tested-by: Jan Stancek
    Suggested-by: Will Deacon
    Tested-by: Will Deacon
    Acked-by: Will Deacon
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: "Aneesh Kumar K.V"
    Cc: Nadav Amit
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: [4.20+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • Johannes pointed out that after commit 886cf1901db9 ("mm: move
    recent_rotated pages calculation to shrink_inactive_list()") we lost all
    zone_reclaim_stat::recent_rotated history.

    This fixes it.
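
    The shape of the restore, assuming the 5.2-era shrink_inactive_list()
    locals (reclaim_stat and stat.nr_activate[]):

    /* after shrink_page_list() returns: */
    reclaim_stat->recent_rotated[0] += stat.nr_activate[0];
    reclaim_stat->recent_rotated[1] += stat.nr_activate[1];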

    Link: http://lkml.kernel.org/r/155905972210.26456.11178359431724024112.stgit@localhost.localdomain
    Fixes: 886cf1901db9 ("mm: move recent_rotated pages calculation to shrink_inactive_list()")
    Signed-off-by: Kirill Tkhai
    Reported-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Daniel Jordan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • If mlockall() is called with only MCL_ONFAULT as a flag, it removes any
    previously applied locking and does nothing else.

    This behavior is counter-intuitive and doesn't match the Linux man page.

    For mlockall():

    EINVAL Unknown flags were specified or MCL_ONFAULT was specified
    without either MCL_FUTURE or MCL_CURRENT.

    Consequently, return the error EINVAL, if only MCL_ONFAULT is passed.
    That way, applications will at least detect that they are calling
    mlockall() incorrectly.
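
    A sketch of the tightened flag validation in the mlockall() syscall; the
    added condition is the bare MCL_ONFAULT case:

    if (!flags || (flags & ~(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT)) ||
        flags == MCL_ONFAULT)
        return -EINVAL;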

    Link: http://lkml.kernel.org/r/20190527075333.GA6339@er01809n.ebgroup.elektrobit.com
    Fixes: b0f205c2a308 ("mm: mlock: add mlock flags to enable VM_LOCKONFAULT usage")
    Signed-off-by: Stefan Potyra
    Reviewed-by: Daniel Jordan
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Potyra, Stefan
     
  • Syzbot reported the following memory leak:

    ffffffffda RBX: 0000000000000003 RCX: 0000000000441f79
    BUG: memory leak
    unreferenced object 0xffff888114f26040 (size 32):
    comm "syz-executor626", pid 7056, jiffies 4294948701 (age 39.410s)
    hex dump (first 32 bytes):
    40 60 f2 14 81 88 ff ff 40 60 f2 14 81 88 ff ff @`......@`......
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    slab_post_alloc_hook mm/slab.h:439 [inline]
    slab_alloc mm/slab.c:3326 [inline]
    kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553
    kmalloc include/linux/slab.h:547 [inline]
    __memcg_init_list_lru_node+0x58/0xf0 mm/list_lru.c:352
    memcg_init_list_lru_node mm/list_lru.c:375 [inline]
    memcg_init_list_lru mm/list_lru.c:459 [inline]
    __list_lru_init+0x193/0x2a0 mm/list_lru.c:626
    alloc_super+0x2e0/0x310 fs/super.c:269
    sget_userns+0x94/0x2a0 fs/super.c:609
    sget+0x8d/0xb0 fs/super.c:660
    mount_nodev+0x31/0xb0 fs/super.c:1387
    fuse_mount+0x2d/0x40 fs/fuse/inode.c:1236
    legacy_get_tree+0x27/0x80 fs/fs_context.c:661
    vfs_get_tree+0x2e/0x120 fs/super.c:1476
    do_new_mount fs/namespace.c:2790 [inline]
    do_mount+0x932/0xc50 fs/namespace.c:3110
    ksys_mount+0xab/0x120 fs/namespace.c:3319
    __do_sys_mount fs/namespace.c:3333 [inline]
    __se_sys_mount fs/namespace.c:3330 [inline]
    __x64_sys_mount+0x26/0x30 fs/namespace.c:3330
    do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:301
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    This is a simple off by one bug on the error path.

    Link: http://lkml.kernel.org/r/20190528043202.99980-1-shakeelb@google.com
    Fixes: 60d3fd32a7a9 ("list_lru: introduce per-memcg lists")
    Reported-by: syzbot+f90a420dfe2b1b03cb2c@syzkaller.appspotmail.com
    Signed-off-by: Shakeel Butt
    Acked-by: Michal Hocko
    Reviewed-by: Kirill Tkhai
    Cc: [4.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     
  • The kernel test robot noticed a 26% will-it-scale pagefault regression
    from commit 42a300353577 ("mm: memcontrol: fix recursive statistics
    correctness & scalabilty"). This appears to be caused by bouncing the
    additional cachelines from the new hierarchical statistics counters.

    We can fix this by getting rid of the batched local counters instead.

    Originally, there were *only* group-local counters, and they were fully
    maintained per cpu. A reader of a stats file high up in the cgroup tree
    would have to walk the entire subtree and collect each level's per-cpu
    counters to get the recursive view. This was prohibitively expensive,
    and so we switched to per-cpu batched updates of the local counters
    during a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
    memory.stat reporting"), reducing the complexity from nr_subgroups *
    nr_cpus to nr_subgroups.

    With growing machines and cgroup trees, the tree walk itself became too
    expensive for monitoring top-level groups, and this is when the culprit
    patch added hierarchy counters on each cgroup level. When the per-cpu
    batch size would be reached, both the local and the hierarchy counters
    would get batch-updated from the per-cpu delta simultaneously.

    This makes local and hierarchical counter reads blazingly fast, but it
    unfortunately makes the write side too cacheline-intensive.

    Since local counter reads were never a problem - we only centralized
    them to accelerate the hierarchy walk - and use of the local counters
    is becoming rarer due to replacement with hierarchical views (ongoing
    rework in the page reclaim and workingset code), we can make those local
    counters unbatched per-cpu counters again.

    The scheme will then be as such:

    when a memcg statistic changes, the writer will:
    - update the local counter (per-cpu)
    - update the batch counter (per-cpu). If the batch is full:
    - spill the batch into the group's atomic_t
    - spill the batch into all ancestors' atomic_ts
    - empty out the batch counter (per-cpu)

    when a local memcg counter is read, the reader will:
    - collect the local counter from all cpus

    when a hierarchy memcg counter is read, the reader will:
    - read the atomic_t

    We might be able to simplify this further and make the recursive
    counters unbatched per-cpu counters as well (batch upward propagation,
    but leave per-cpu collection to the readers), but that will require a
    more in-depth analysis and testing of all the callsites. Deal with the
    immediate regression for now.
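
    A sketch of the writer path described above; the field names
    (vmstats_local, vmstats_percpu, vmstats) are illustrative rather than a
    faithful copy of the upstream patch:

    void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
    {
        struct mem_cgroup *mi;
        long x;

        /* unbatched, strictly per-cpu local counter */
        __this_cpu_add(memcg->vmstats_local->stat[idx], val);

        /* batched propagation into the hierarchical atomic counters */
        x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
        if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
            for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
                atomic_long_add(x, &mi->vmstats[idx]);
            x = 0;
        }
        __this_cpu_write(memcg->vmstats_percpu->stat[idx], x);
    }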

    Link: http://lkml.kernel.org/r/20190521151647.GB2870@cmpxchg.org
    Fixes: 42a300353577 ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
    Signed-off-by: Johannes Weiner
    Reported-by: kernel test robot
    Tested-by: kernel test robot
    Cc: Michal Hocko
    Cc: Shakeel Butt
    Cc: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

05 Jun, 2019

5 commits

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation version 2 of the license

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 315 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Reviewed-by: Armijn Hemel
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190531190115.503150771@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • Based on 1 normalized pattern(s):

    this file is released under the gplv2

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 68 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Armijn Hemel
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190531190114.292346262@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • Based on 1 normalized pattern(s):

    this software may be redistributed and or modified under the terms
    of the gnu general public license gpl version 2 as published by the
    free software foundation

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 1 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Reviewed-by: Armijn Hemel
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190531190112.039124428@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license version 2 as
    published by the free software foundation this program is
    distributed in the hope that it will be useful but without any
    warranty without even the implied warranty of merchantability or
    fitness for a particular purpose see the gnu general public license
    for more details you should have received a copy of the gnu general
    public license along with this program if not write to the free
    software foundation inc 59 temple place suite 330 boston ma 02111
    1307 usa

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 136 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Alexios Zavras
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190530000436.384967451@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • Based on 1 normalized pattern(s):

    this software may be redistributed and or modified under the terms
    of the gnu general public license gpl version 2 only as published by
    the free software foundation

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 1 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Richard Fontana
    Reviewed-by: Alexios Zavras
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190529141333.676969322@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

03 Jun, 2019

2 commits

  • In a rare case, flush_tlb_kernel_range() could be called with a start
    higher than the end.

    In vm_remove_mappings(), in case page_address() returns 0 for all pages
    (for example they were all in highmem), _vm_unmap_aliases() will be
    called with start = ULONG_MAX, end = 0 and flush = 1.

    If at the same time, the vmalloc purge operation is triggered by something
    else while the current operation is between remove_vm_area() and
    _vm_unmap_aliases(), then the vm mapping just removed will be already
    purged. In this case the call of vm_unmap_aliases() may not find any other
    mappings to flush and so end up flushing start = ULONG_MAX, end = 0. So
    only set flush = true if we find something in the direct mapping that we
    need to flush, and this way this can't happen.
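
    A sketch of the vm_remove_mappings() loop after the fix; treat it as
    illustrative of the flush_dmap logic rather than the exact upstream diff:

    unsigned long start = ULONG_MAX, end = 0;
    int flush_dmap = 0;
    int i;

    for (i = 0; i < area->nr_pages; i++) {
        unsigned long addr = (unsigned long)page_address(area->pages[i]);

        if (addr) {
            start = min(addr, start);
            end = max(addr + PAGE_SIZE, end);
            flush_dmap = 1;     /* found a direct-map alias to flush */
        }
    }

    /* only request a TLB flush if a direct-map alias was actually found */
    _vm_unmap_aliases(start, end, flush_dmap);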

    Signed-off-by: Rick Edgecombe
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Linus Torvalds
    Cc: Meelis Roos
    Cc: Nadav Amit
    Cc: Peter Zijlstra
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Fixes: 868b104d7379 ("mm/vmalloc: Add flag for freeing of special permsissions")
    Link: https://lkml.kernel.org/r/20190527211058.2729-3-rick.p.edgecombe@intel.com
    Signed-off-by: Ingo Molnar

    Rick Edgecombe
     
  • The calculation of the direct map address range to flush was wrong.
    This could cause the RO direct map alias to not get flushed. Today
    this shouldn't be a problem because this flush is only needed on x86
    right now and the spurious fault handler will fix cached RO->RW
    translations. In the future though, it could cause the permissions
    to remain RO in the TLB for the direct map alias, and then the page
    would return from the page allocator to some other component as RO
    and cause a crash.

    So fix the address range calculation so the flush will include the
    direct map range.

    Signed-off-by: Rick Edgecombe
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Linus Torvalds
    Cc: Meelis Roos
    Cc: Nadav Amit
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 868b104d7379 ("mm/vmalloc: Add flag for freeing of special permsissions")
    Link: https://lkml.kernel.org/r/20190527211058.2729-2-rick.p.edgecombe@intel.com
    Signed-off-by: Ingo Molnar

    Rick Edgecombe
     

02 Jun, 2019

7 commits

  • When we have holes in a normal memory zone, we can end up with
    cached_migrate_pfns which may not necessarily be valid, under heavy memory
    pressure with swapping enabled (via __reset_isolation_suitable(),
    triggered by kswapd).

    Later, if we fail to find a page via fast_isolate_freepages(), we may end
    up using the migrate_pfn we started the search with as a valid page. This
    can lead to a NULL pointer dereference like the one below, due to an
    invalid mem_section pointer.

    Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008 [47/1825]
    Mem abort info:
    ESR = 0x96000004
    Exception class = DABT (current EL), IL = 32 bits
    SET = 0, FnV = 0
    EA = 0, S1PTW = 0
    Data abort info:
    ISV = 0, ISS = 0x00000004
    CM = 0, WnR = 0
    user pgtable: 4k pages, 48-bit VAs, pgdp = 0000000082f94ae9
    [0000000000000008] pgd=0000000000000000
    Internal error: Oops: 96000004 [#1] SMP
    ...
    CPU: 10 PID: 6080 Comm: qemu-system-aar Not tainted 510-rc1+ #6
    Hardware name: AmpereComputing(R) OSPREY EV-883832-X3-0001/OSPREY, BIOS 4819 09/25/2018
    pstate: 60000005 (nZCv daif -PAN -UAO)
    pc : set_pfnblock_flags_mask+0x58/0xe8
    lr : compaction_alloc+0x300/0x950
    [...]
    Process qemu-system-aar (pid: 6080, stack limit = 0x0000000095070da5)
    Call trace:
    set_pfnblock_flags_mask+0x58/0xe8
    compaction_alloc+0x300/0x950
    migrate_pages+0x1a4/0xbb0
    compact_zone+0x750/0xde8
    compact_zone_order+0xd8/0x118
    try_to_compact_pages+0xb4/0x290
    __alloc_pages_direct_compact+0x84/0x1e0
    __alloc_pages_nodemask+0x5e0/0xe18
    alloc_pages_vma+0x1cc/0x210
    do_huge_pmd_anonymous_page+0x108/0x7c8
    __handle_mm_fault+0xdd4/0x1190
    handle_mm_fault+0x114/0x1c0
    __get_user_pages+0x198/0x3c0
    get_user_pages_unlocked+0xb4/0x1d8
    __gfn_to_pfn_memslot+0x12c/0x3b8
    gfn_to_pfn_prot+0x4c/0x60
    kvm_handle_guest_abort+0x4b0/0xcd8
    handle_exit+0x140/0x1b8
    kvm_arch_vcpu_ioctl_run+0x260/0x768
    kvm_vcpu_ioctl+0x490/0x898
    do_vfs_ioctl+0xc4/0x898
    ksys_ioctl+0x8c/0xa0
    __arm64_sys_ioctl+0x28/0x38
    el0_svc_common+0x74/0x118
    el0_svc_handler+0x38/0x78
    el0_svc+0x8/0xc
    Code: f8607840 f100001f 8b011401 9a801020 (f9400400)
    ---[ end trace af6a35219325a9b6 ]---

    The issue was reported on an arm64 server with 128GB with holes in the
    zone (e.g, [32GB@4GB, 96GB@544GB]), with a swap device enabled, while
    running 100 KVM guest instances.

    This patch fixes the issue by ensuring that the page belongs to a valid
    PFN when we fallback to using the lower limit of the scan range upon
    failure in fast_isolate_freepages().
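
    A sketch of the guarded fallback in fast_isolate_freepages(); min_pfn and
    cc are the function's locals, and the new pfn_valid() check is the point
    of the fix:

    if (pfn_valid(min_pfn)) {
        page = pfn_to_page(min_pfn);
        cc->free_pfn = min_pfn;
    }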

    Link: http://lkml.kernel.org/r/1558711908-15688-1-git-send-email-suzuki.poulose@arm.com
    Fixes: 5a811889de10f1eb ("mm, compaction: use free lists to quickly locate a migration target")
    Signed-off-by: Suzuki K Poulose
    Reported-by: Marc Zyngier
    Reviewed-by: Mel Gorman
    Reviewed-by: Anshuman Khandual
    Cc: Michal Hocko
    Cc: Qian Cai
    Cc: Marc Zyngier
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Suzuki K Poulose
     
  • When building with -Wuninitialized and CONFIG_KASAN_SW_TAGS unset, Clang
    warns:

    mm/kasan/common.c:484:40: warning: variable 'tag' is uninitialized when
    used here [-Wuninitialized]
    kasan_unpoison_shadow(set_tag(object, tag), size);
    ^~~

    set_tag ignores tag in this configuration but clang doesn't realize it at
    this point in its pipeline, as it points to arch_kasan_set_tag as being
    the point where it is used, which will later be expanded to (void
    *)(object) without a use of tag. Initialize tag to 0xff, as it removes
    this warning and doesn't change the meaning of the code.

    Link: https://github.com/ClangBuiltLinux/linux/issues/465
    Link: http://lkml.kernel.org/r/20190502163057.6603-1-natechancellor@gmail.com
    Fixes: 7f94ffbc4c6a ("kasan: add hooks implementation for tag-based mode")
    Signed-off-by: Nathan Chancellor
    Reviewed-by: Andrey Konovalov
    Reviewed-by: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: Nick Desaulniers
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nathan Chancellor
     
  • kmem_cache_alloc() may be called from z3fold_alloc() in atomic context, so
    we need to pass the correct gfp flags to avoid a "scheduling while atomic"
    bug.

    Link: http://lkml.kernel.org/r/20190523153245.119dfeed55927e8755250ddd@gmail.com
    Fixes: 7c2b8baa61fe5 ("mm/z3fold.c: add structure for buddy handles")
    Signed-off-by: Vitaly Wool
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     
  • When get_user_pages*() is called with pages = NULL, the processing of
    VM_FAULT_RETRY terminates early without actually retrying to fault-in all
    the pages.

    If the pages in the requested range belong to a VMA that has userfaultfd
    registered, handle_userfault() returns VM_FAULT_RETRY *after* user space
    has populated the page, but for the gup pre-fault case there's no actual
    retry and the caller will get no pages although they are present.

    This issue was uncovered when running post-copy memory restore in CRIU
    after d9c9ce34ed5c ("x86/fpu: Fault-in user stack if
    copy_fpstate_to_sigframe() fails").

    After this change, the copying of FPU state to the sigframe switched from
    copy_to_user() variants which caused a real page fault to get_user_pages()
    with pages parameter set to NULL.

    In post-copy mode of CRIU, the destination memory is managed with
    userfaultfd and lack of the retry for pre-fault case in get_user_pages()
    causes a crash of the restored process.

    Making the pre-fault behavior of get_user_pages() the same as the "normal"
    one fixes the issue.
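
    A minimal sketch of the idea in __get_user_pages_locked(): the early
    "prefault" return goes away, and only the pointer arithmetic is guarded.
    Treat the exact control flow as illustrative:

    /* keep processing VM_FAULT_RETRY even when pages == NULL (pre-fault) */
    if (likely(pages))
        pages += ret;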

    Link: http://lkml.kernel.org/r/1557844195-18882-1-git-send-email-rppt@linux.ibm.com
    Fixes: d9c9ce34ed5c ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails")
    Signed-off-by: Mike Rapoport
    Tested-by: Andrei Vagin [https://travis-ci.org/avagin/linux/builds/533184940]
    Tested-by: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Sebastian Andrzej Siewior
    Cc: Borislav Petkov
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • We have a single node system with node 0 disabled:
    Scanning NUMA topology in Northbridge 24
    Number of physical nodes 2
    Skipping disabled node 0
    Node 1 MemBase 0000000000000000 Limit 00000000fbff0000
    NODE_DATA(1) allocated [mem 0xfbfda000-0xfbfeffff]

    This causes crashes in memcg when system boots:
    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    #PF error: [normal kernel read fault]
    ...
    RIP: 0010:list_lru_add+0x94/0x170
    ...
    Call Trace:
    d_lru_add+0x44/0x50
    dput.part.34+0xfc/0x110
    __fput+0x108/0x230
    task_work_run+0x9f/0xc0
    exit_to_usermode_loop+0xf5/0x100

    It is reproducible as far back as 4.12; I did not try older kernels. You
    need a new enough systemd, e.g. 241 (the reason is unknown and was not
    investigated); it cannot be reproduced with systemd 234.

    The system crashes because the size of lru array is never updated in
    memcg_update_all_list_lrus and the reads are past the zero-sized array,
    causing dereferences of random memory.

    The root cause is the list_lru_memcg_aware() checks in the list_lru code.
    The test in list_lru_memcg_aware() is broken: it assumes node 0 is always
    present, but that is not true on some systems, as can be seen above.

    So fix this by avoiding the check on node 0. Remember the memcg-awareness
    with a bool flag in struct list_lru.
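
    A sketch of the flag-based check, assuming the 5.2-era struct list_lru
    layout (abridged):

    struct list_lru {
        struct list_lru_node    *node;
    #ifdef CONFIG_MEMCG_KMEM
        struct list_head        list;
        int                     shrinker_id;
        bool                    memcg_aware;    /* set once at init time */
    #endif
    };

    static inline bool list_lru_memcg_aware(struct list_lru *lru)
    {
        return lru->memcg_aware;
    }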

    Link: http://lkml.kernel.org/r/20190522091940.3615-1-jslaby@suse.cz
    Fixes: 60d3fd32a7a9 ("list_lru: introduce per-memcg lists")
    Signed-off-by: Jiri Slaby
    Acked-by: Michal Hocko
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Raghavendra K T
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     
  • The commit a3b609ef9f8b ("proc read mm's {arg,env}_{start,end} with mmap
    semaphore taken.") added synchronization of reading argument/environment
    boundaries under mmap_sem. Later commit 88aa7cc688d4 ("mm: introduce
    arg_lock to protect arg_start|end and env_start|end in mm_struct") avoided
    the coarse use of mmap_sem in similar situations. But there still
    remained two places that (mis)use mmap_sem.

    get_cmdline should also use arg_lock instead of mmap_sem when it reads the
    boundaries.

    The second place that should use arg_lock is in prctl_set_mm. By
    protecting the boundaries fields with the arg_lock, we can downgrade
    mmap_sem to reader lock (analogous to what we already do in
    prctl_set_mm_map).
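
    A sketch of the get_cmdline() change: snapshot the boundaries under
    arg_lock instead of taking mmap_sem:

    spin_lock(&mm->arg_lock);
    arg_start = mm->arg_start;
    arg_end = mm->arg_end;
    env_start = mm->env_start;
    env_end = mm->env_end;
    spin_unlock(&mm->arg_lock);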

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/20190502125203.24014-3-mkoutny@suse.com
    Fixes: 88aa7cc688d4 ("mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct")
    Signed-off-by: Michal Koutný
    Signed-off-by: Laurent Dufour
    Co-developed-by: Laurent Dufour
    Reviewed-by: Cyrill Gorcunov
    Acked-by: Michal Hocko
    Cc: Yang Shi
    Cc: Mateusz Guzik
    Cc: Kirill Tkhai
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Koutný
     
  • Reported-by: Nicholas Joll
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

31 May, 2019

3 commits

  • Based on 1 normalized pattern(s):

    subject to the gnu public license version 2

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 1 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Steve Winslow
    Reviewed-by: Alexios Zavras
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190528171440.319650492@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • Based on 3 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version this program is distributed in the
    hope that it will be useful but without any warranty without even
    the implied warranty of merchantability or fitness for a particular
    purpose see the gnu general public license for more details

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version [author] [kishon] [vijay] [abraham]
    [i] [kishon]@[ti] [com] this program is distributed in the hope that
    it will be useful but without any warranty without even the implied
    warranty of merchantability or fitness for a particular purpose see
    the gnu general public license for more details

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version [author] [graeme] [gregory]
    [gg]@[slimlogic] [co] [uk] [author] [kishon] [vijay] [abraham] [i]
    [kishon]@[ti] [com] [based] [on] [twl6030]_[usb] [c] [author] [hema]
    [hk] [hemahk]@[ti] [com] this program is distributed in the hope
    that it will be useful but without any warranty without even the
    implied warranty of merchantability or fitness for a particular
    purpose see the gnu general public license for more details

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 1105 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Reviewed-by: Richard Fontana
    Reviewed-by: Kate Stewart
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070033.202006027@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 3029 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

24 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your optional any later version of the license

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 3 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Richard Fontana
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190520075212.713472955@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

21 May, 2019

1 commit