15 Jun, 2019

1 commit


09 Jun, 2019

1 commit

  • Mostly due to x86 and acpi conversion, several documentation
    links are still pointing to the old file. Fix them.

    Signed-off-by: Mauro Carvalho Chehab
    Reviewed-by: Wolfram Sang
    Reviewed-by: Sven Van Asbroeck
    Reviewed-by: Bhupesh Sharma
    Acked-by: Mark Brown
    Signed-off-by: Jonathan Corbet

    Mauro Carvalho Chehab
     

05 Jun, 2019

5 commits

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation version 2 of the license

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 315 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Reviewed-by: Armijn Hemel
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190531190115.503150771@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • Based on 1 normalized pattern(s):

    this file is released under the gplv2

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 68 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Armijn Hemel
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190531190114.292346262@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • Based on 1 normalized pattern(s):

    this software may be redistributed and or modified under the terms
    of the gnu general public license gpl version 2 as published by the
    free software foundation

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 1 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Reviewed-by: Armijn Hemel
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190531190112.039124428@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license version 2 as
    published by the free software foundation this program is
    distributed in the hope that it will be useful but without any
    warranty without even the implied warranty of merchantability or
    fitness for a particular purpose see the gnu general public license
    for more details you should have received a copy of the gnu general
    public license along with this program if not write to the free
    software foundation inc 59 temple place suite 330 boston ma 02111
    1307 usa

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 136 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Alexios Zavras
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190530000436.384967451@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • Based on 1 normalized pattern(s):

    this software may be redistributed and or modified under the terms
    of the gnu general public license gpl version 2 only as published by
    the free software foundation

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 1 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Richard Fontana
    Reviewed-by: Alexios Zavras
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190529141333.676969322@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

02 Jun, 2019

7 commits

  • When we have holes in a normal memory zone, we could end up having
    cached_migrate_pfns which may not necessarily be valid, under heavy memory
    pressure with swapping enabled (via __reset_isolation_suitable(),
    triggered by kswapd).

    Later if we fail to find a page via fast_isolate_freepages(), we may end
    up using the migrate_pfn we started the search with, as a valid page. This
    could lead to NULL pointer dereferences like the one below, due to an
    invalid mem_section pointer.

    Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
    Mem abort info:
    ESR = 0x96000004
    Exception class = DABT (current EL), IL = 32 bits
    SET = 0, FnV = 0
    EA = 0, S1PTW = 0
    Data abort info:
    ISV = 0, ISS = 0x00000004
    CM = 0, WnR = 0
    user pgtable: 4k pages, 48-bit VAs, pgdp = 0000000082f94ae9
    [0000000000000008] pgd=0000000000000000
    Internal error: Oops: 96000004 [#1] SMP
    ...
    CPU: 10 PID: 6080 Comm: qemu-system-aar Not tainted 510-rc1+ #6
    Hardware name: AmpereComputing(R) OSPREY EV-883832-X3-0001/OSPREY, BIOS 4819 09/25/2018
    pstate: 60000005 (nZCv daif -PAN -UAO)
    pc : set_pfnblock_flags_mask+0x58/0xe8
    lr : compaction_alloc+0x300/0x950
    [...]
    Process qemu-system-aar (pid: 6080, stack limit = 0x0000000095070da5)
    Call trace:
    set_pfnblock_flags_mask+0x58/0xe8
    compaction_alloc+0x300/0x950
    migrate_pages+0x1a4/0xbb0
    compact_zone+0x750/0xde8
    compact_zone_order+0xd8/0x118
    try_to_compact_pages+0xb4/0x290
    __alloc_pages_direct_compact+0x84/0x1e0
    __alloc_pages_nodemask+0x5e0/0xe18
    alloc_pages_vma+0x1cc/0x210
    do_huge_pmd_anonymous_page+0x108/0x7c8
    __handle_mm_fault+0xdd4/0x1190
    handle_mm_fault+0x114/0x1c0
    __get_user_pages+0x198/0x3c0
    get_user_pages_unlocked+0xb4/0x1d8
    __gfn_to_pfn_memslot+0x12c/0x3b8
    gfn_to_pfn_prot+0x4c/0x60
    kvm_handle_guest_abort+0x4b0/0xcd8
    handle_exit+0x140/0x1b8
    kvm_arch_vcpu_ioctl_run+0x260/0x768
    kvm_vcpu_ioctl+0x490/0x898
    do_vfs_ioctl+0xc4/0x898
    ksys_ioctl+0x8c/0xa0
    __arm64_sys_ioctl+0x28/0x38
    el0_svc_common+0x74/0x118
    el0_svc_handler+0x38/0x78
    el0_svc+0x8/0xc
    Code: f8607840 f100001f 8b011401 9a801020 (f9400400)
    ---[ end trace af6a35219325a9b6 ]---

    The issue was reported on an arm64 server with 128GB of memory and holes
    in the zone (e.g., [32GB@4GB, 96GB@544GB]), with a swap device enabled,
    while running 100 KVM guest instances.

    This patch fixes the issue by ensuring that the page belongs to a valid
    PFN when we fall back to using the lower limit of the scan range upon
    failure in fast_isolate_freepages().
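
    The idea of validating the fallback PFN can be illustrated with a small
    userspace sketch. It is not the kernel patch: `demo_span`,
    `demo_pfn_populated` and `demo_fallback_pfn` are hypothetical names, and
    the populated ranges are modelled as a plain array rather than memory
    sections.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One populated PFN range, half-open: [start_pfn, end_pfn). Hypothetical. */
struct demo_span {
	unsigned long start_pfn;
	unsigned long end_pfn;
};

/* True if pfn falls inside any populated span, i.e. it is not in a hole. */
static bool demo_pfn_populated(const struct demo_span *spans, size_t n,
			       unsigned long pfn)
{
	for (size_t i = 0; i < n; i++)
		if (pfn >= spans[i].start_pfn && pfn < spans[i].end_pfn)
			return true;
	return false;
}

/*
 * Fallback used when the fast search found nothing: hand back the lower
 * scan limit only if it is backed by memory. Otherwise report failure and
 * leave *out untouched, so the caller can take a slower path.
 */
static bool demo_fallback_pfn(const struct demo_span *spans, size_t n,
			      unsigned long min_pfn, unsigned long *out)
{
	assert(out != NULL);
	if (!demo_pfn_populated(spans, n, min_pfn))
		return false;
	*out = min_pfn;
	return true;
}
```

    With spans [100, 200) and [1000, 1100), a fallback to PFN 150 is
    accepted while a fallback to PFN 500, which lies in the hole, is refused.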

    Link: http://lkml.kernel.org/r/1558711908-15688-1-git-send-email-suzuki.poulose@arm.com
    Fixes: 5a811889de10f1eb ("mm, compaction: use free lists to quickly locate a migration target")
    Signed-off-by: Suzuki K Poulose
    Reported-by: Marc Zyngier
    Reviewed-by: Mel Gorman
    Reviewed-by: Anshuman Khandual
    Cc: Michal Hocko
    Cc: Qian Cai
    Cc: Marc Zyngier
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Suzuki K Poulose
     
  • When building with -Wuninitialized and CONFIG_KASAN_SW_TAGS unset, Clang
    warns:

    mm/kasan/common.c:484:40: warning: variable 'tag' is uninitialized when
    used here [-Wuninitialized]
    kasan_unpoison_shadow(set_tag(object, tag), size);
    ^~~

    set_tag ignores tag in this configuration but clang doesn't realize it at
    this point in its pipeline, as it points to arch_kasan_set_tag as being
    the point where it is used, which will later be expanded to (void
    *)(object) without a use of tag. Initialize tag to 0xff, as it removes
    this warning and doesn't change the meaning of the code.
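
    A minimal userspace sketch of the pattern, assuming a configuration in
    which the tag argument is discarded. The names `demo_arch_set_tag`,
    `demo_set_tag` and `demo_unpoison`, and the value 0x2a, are made up for
    illustration; this is not the code in mm/kasan/common.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the "tags disabled" case: the tag argument is dropped. */
#define demo_arch_set_tag(addr, tag) ((void *)(addr))

static inline void *demo_set_tag(void *addr, uint8_t tag)
{
	return demo_arch_set_tag(addr, tag);
}

static void *demo_unpoison(void *object, int assign_tag)
{
	/*
	 * Initialized up front: the value is ignored in this configuration,
	 * but the compiler now sees a defined value on every path.
	 */
	uint8_t tag = 0xff;

	assert(object != NULL);
	if (assign_tag)
		tag = 0x2a;	/* arbitrary illustrative tag */

	return demo_set_tag(object, tag);
}
```

    In this sketch the returned pointer is the same whether or not a tag is
    assigned, which is why the initializer does not change behaviour.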

    Link: https://github.com/ClangBuiltLinux/linux/issues/465
    Link: http://lkml.kernel.org/r/20190502163057.6603-1-natechancellor@gmail.com
    Fixes: 7f94ffbc4c6a ("kasan: add hooks implementation for tag-based mode")
    Signed-off-by: Nathan Chancellor
    Reviewed-by: Andrey Konovalov
    Reviewed-by: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: Nick Desaulniers
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nathan Chancellor
     
  • kmem_cache_alloc() may be called from z3fold_alloc() in atomic context, so
    we need to pass correct gfp flags to avoid "scheduling while atomic" bug.

    Link: http://lkml.kernel.org/r/20190523153245.119dfeed55927e8755250ddd@gmail.com
    Fixes: 7c2b8baa61fe5 ("mm/z3fold.c: add structure for buddy handles")
    Signed-off-by: Vitaly Wool
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     
  • When get_user_pages*() is called with pages = NULL, the processing of
    VM_FAULT_RETRY terminates early without actually retrying to fault-in all
    the pages.

    If the pages in the requested range belong to a VMA that has userfaultfd
    registered, handle_userfault() returns VM_FAULT_RETRY *after* user space
    has populated the page, but for the gup pre-fault case there's no actual
    retry and the caller will get no pages although they are present.

    This issue was uncovered when running post-copy memory restore in CRIU
    after d9c9ce34ed5c ("x86/fpu: Fault-in user stack if
    copy_fpstate_to_sigframe() fails").

    After this change, the copying of FPU state to the sigframe switched from
    copy_to_user() variants which caused a real page fault to get_user_pages()
    with pages parameter set to NULL.

    In post-copy mode of CRIU, the destination memory is managed with
    userfaultfd and lack of the retry for pre-fault case in get_user_pages()
    causes a crash of the restored process.

    Making the pre-fault behavior of get_user_pages() the same as the "normal"
    one fixes the issue.

    Link: http://lkml.kernel.org/r/1557844195-18882-1-git-send-email-rppt@linux.ibm.com
    Fixes: d9c9ce34ed5c ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails")
    Signed-off-by: Mike Rapoport
    Tested-by: Andrei Vagin [https://travis-ci.org/avagin/linux/builds/533184940]
    Tested-by: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Sebastian Andrzej Siewior
    Cc: Borislav Petkov
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • We have a single node system with node 0 disabled:
    Scanning NUMA topology in Northbridge 24
    Number of physical nodes 2
    Skipping disabled node 0
    Node 1 MemBase 0000000000000000 Limit 00000000fbff0000
    NODE_DATA(1) allocated [mem 0xfbfda000-0xfbfeffff]

    This causes crashes in memcg when system boots:
    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    #PF error: [normal kernel read fault]
    ...
    RIP: 0010:list_lru_add+0x94/0x170
    ...
    Call Trace:
    d_lru_add+0x44/0x50
    dput.part.34+0xfc/0x110
    __fput+0x108/0x230
    task_work_run+0x9f/0xc0
    exit_to_usermode_loop+0xf5/0x100

    It is reproducible as far back as 4.12. I did not try older kernels. You
    have to have a new enough systemd, e.g. 241 (the reason is unknown -- it
    was not investigated). It cannot be reproduced with systemd 234.

    The system crashes because the size of lru array is never updated in
    memcg_update_all_list_lrus and the reads are past the zero-sized array,
    causing dereferences of random memory.

    The root cause is the list_lru_memcg_aware checks in the list_lru code.
    The test in list_lru_memcg_aware is broken: it assumes node 0 is always
    present, but that is not true on some systems, as can be seen above.

    So fix this by avoiding checks on node 0. Remember the memcg-awareness by
    a bool flag in struct list_lru.
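
    The difference between the two checks can be sketched in userspace as
    follows. The structures are simplified stand-ins with hypothetical names
    (`demo_lru`, `demo_node`), not the kernel's struct list_lru.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified per-node state: NULL when the node was never set up. */
struct demo_node {
	void *memcg_lrus;
};

struct demo_lru {
	struct demo_node *node;	/* indexed by node id */
	size_t nr_nodes;
	bool memcg_aware;	/* recorded once, at init time */
};

/* Fragile: infers awareness from node 0, which may be disabled. */
static bool demo_aware_by_node0(const struct demo_lru *lru)
{
	assert(lru->nr_nodes > 0);
	return lru->node[0].memcg_lrus != NULL;
}

/* Robust: consults the remembered flag, independent of any node. */
static bool demo_aware_by_flag(const struct demo_lru *lru)
{
	return lru->memcg_aware;
}
```

    On a layout like the one above, where node 0 is skipped and only node 1
    is populated, the node-0 test reports "not aware" while the flag gives
    the intended answer.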

    Link: http://lkml.kernel.org/r/20190522091940.3615-1-jslaby@suse.cz
    Fixes: 60d3fd32a7a9 ("list_lru: introduce per-memcg lists")
    Signed-off-by: Jiri Slaby
    Acked-by: Michal Hocko
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Raghavendra K T
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     
  • The commit a3b609ef9f8b ("proc read mm's {arg,env}_{start,end} with mmap
    semaphore taken.") added synchronization of reading argument/environment
    boundaries under mmap_sem. Later commit 88aa7cc688d4 ("mm: introduce
    arg_lock to protect arg_start|end and env_start|end in mm_struct") avoided
    the coarse use of mmap_sem in similar situations. But there still
    remained two places that (mis)use mmap_sem.

    get_cmdline should also use arg_lock instead of mmap_sem when it reads the
    boundaries.

    The second place that should use arg_lock is in prctl_set_mm. By
    protecting the boundaries fields with the arg_lock, we can downgrade
    mmap_sem to reader lock (analogous to what we already do in
    prctl_set_mm_map).

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/20190502125203.24014-3-mkoutny@suse.com
    Fixes: 88aa7cc688d4 ("mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct")
    Signed-off-by: Michal Koutný
    Signed-off-by: Laurent Dufour
    Co-developed-by: Laurent Dufour
    Reviewed-by: Cyrill Gorcunov
    Acked-by: Michal Hocko
    Cc: Yang Shi
    Cc: Mateusz Guzik
    Cc: Kirill Tkhai
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Koutný
     
  • Reported-by: Nicholas Joll
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

31 May, 2019

3 commits

  • Based on 1 normalized pattern(s):

    subject to the gnu public license version 2

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 1 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Steve Winslow
    Reviewed-by: Alexios Zavras
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190528171440.319650492@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • Based on 3 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version this program is distributed in the
    hope that it will be useful but without any warranty without even
    the implied warranty of merchantability or fitness for a particular
    purpose see the gnu general public license for more details

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version [author] [kishon] [vijay] [abraham]
    [i] [kishon]@[ti] [com] this program is distributed in the hope that
    it will be useful but without any warranty without even the implied
    warranty of merchantability or fitness for a particular purpose see
    the gnu general public license for more details

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version [author] [graeme] [gregory]
    [gg]@[slimlogic] [co] [uk] [author] [kishon] [vijay] [abraham] [i]
    [kishon]@[ti] [com] [based] [on] [twl6030]_[usb] [c] [author] [hema]
    [hk] [hemahk]@[ti] [com] this program is distributed in the hope
    that it will be useful but without any warranty without even the
    implied warranty of merchantability or fitness for a particular
    purpose see the gnu general public license for more details

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 1105 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Reviewed-by: Richard Fontana
    Reviewed-by: Kate Stewart
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070033.202006027@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 3029 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

24 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your optional any later version of the license

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 3 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Richard Fontana
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190520075212.713472955@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

21 May, 2019

3 commits

  • Add SPDX license identifiers to all Make/Kconfig files which:

    - Have no license information of any form

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have MODULE_LICENSE("GPL*") inside which was used in the initial
    scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

20 May, 2019

2 commits

  • Merge yet more updates from Andrew Morton:
    "A few final bits:

    - large changes to vmalloc, yielding large performance benefits

    - tweak the console-flush-on-panic code

    - a few fixes"

    * emailed patches from Andrew Morton:
    panic: add an option to replay all the printk message in buffer
    initramfs: don't free a non-existent initrd
    fs/writeback.c: use rcu_barrier() to wait for inflight wb switches going into workqueue when umount
    mm/compaction.c: correct zone boundary handling when isolating pages from a pageblock
    mm/vmap: add DEBUG_AUGMENT_LOWEST_MATCH_CHECK macro
    mm/vmap: add DEBUG_AUGMENT_PROPAGATE_CHECK macro
    mm/vmalloc.c: keep track of free blocks for vmap allocation

    Linus Torvalds
     
  • Pull core fixes from Ingo Molnar:
    "This fixes a particularly thorny munmap() bug with MPX, plus fixes a
    host build environment assumption in objtool"

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    objtool: Allow AR to be overridden with HOSTAR
    x86/mpx, mm/core: Fix recursive munmap() corruption

    Linus Torvalds
     

19 May, 2019

4 commits

  • syzbot reported the following error from a tree with a head commit of
    baf76f0c58ae ("slip: make slhc_free() silently accept an error pointer")

    BUG: unable to handle kernel paging request at ffffea0003348000
    #PF error: [normal kernel read fault]
    PGD 12c3f9067 P4D 12c3f9067 PUD 12c3f8067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP KASAN
    CPU: 1 PID: 28916 Comm: syz-executor.2 Not tainted 5.1.0-rc6+ #89
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:constant_test_bit arch/x86/include/asm/bitops.h:314 [inline]
    RIP: 0010:PageCompound include/linux/page-flags.h:186 [inline]
    RIP: 0010:isolate_freepages_block+0x1c0/0xd40 mm/compaction.c:579
    Code: 01 d8 ff 4d 85 ed 0f 84 ef 07 00 00 e8 29 00 d8 ff 4c 89 e0 83 85 38 ff
    ff ff 01 48 c1 e8 03 42 80 3c 38 00 0f 85 31 0a 00 00 8b 2c 24 31 ff 49
    c1 ed 10 41 83 e5 01 44 89 ee e8 3a 01 d8 ff
    RSP: 0018:ffff88802b31eab8 EFLAGS: 00010246
    RAX: 1ffffd4000669000 RBX: 00000000000cd200 RCX: ffffc9000a235000
    RDX: 000000000001ca5e RSI: ffffffff81988cc7 RDI: 0000000000000001
    RBP: ffff88802b31ebd8 R08: ffff88805af700c0 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffffea0003348000
    R13: 0000000000000000 R14: ffff88802b31f030 R15: dffffc0000000000
    FS: 00007f61648dc700(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffea0003348000 CR3: 0000000037c64000 CR4: 00000000001426e0
    Call Trace:
    fast_isolate_around mm/compaction.c:1243 [inline]
    fast_isolate_freepages mm/compaction.c:1418 [inline]
    isolate_freepages mm/compaction.c:1438 [inline]
    compaction_alloc+0x1aee/0x22e0 mm/compaction.c:1550

    There is no reproducer and it is difficult to hit -- 1 crash every few
    days. The issue is very similar to the fix in commit 6b0868c820ff
    ("mm/compaction.c: correct zone boundary handling when resetting pageblock
    skip hints"). When isolating free pages around a target pageblock, the
    boundary handling is off by one and can stray into the next pageblock.
    Triggering the syzbot error requires that the end of pageblock is section
    or zone aligned, and that the next section is unpopulated.
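
    One way to avoid this kind of off-by-one is to treat the scan range as
    half-open and clamp it to both the pageblock and the zone. The sketch
    below only illustrates that idea; it is not the actual change to
    mm/compaction.c, and `DEMO_PAGEBLOCK_NR_PAGES` (512 pages) and the
    `demo_*` helpers are assumptions.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_PAGEBLOCK_NR_PAGES 512UL	/* assumed power of two */

/* Half-open PFN range [start_pfn, end_pfn). */
struct demo_range {
	unsigned long start_pfn;
	unsigned long end_pfn;
};

static unsigned long demo_pageblock_start(unsigned long pfn)
{
	return pfn & ~(DEMO_PAGEBLOCK_NR_PAGES - 1);
}

/* Range around pfn, clamped to its pageblock and to the zone. */
static struct demo_range demo_scan_range(unsigned long pfn,
					 unsigned long zone_start,
					 unsigned long zone_end)
{
	unsigned long pb_start = demo_pageblock_start(pfn);
	unsigned long pb_end = pb_start + DEMO_PAGEBLOCK_NR_PAGES;
	struct demo_range r;

	assert(zone_start <= pfn && pfn < zone_end);
	r.start_pfn = pb_start > zone_start ? pb_start : zone_start;
	r.end_pfn = pb_end < zone_end ? pb_end : zone_end;
	return r;
}

/* end_pfn is exclusive, so the first PFN of the next block is rejected. */
static bool demo_may_scan(const struct demo_range *r, unsigned long pfn)
{
	return pfn >= r->start_pfn && pfn < r->end_pfn;
}
```

    For PFN 1023 in a zone covering [0, 2048), the range is [512, 1024), so
    PFN 1024, the first page of the next pageblock, is never touched.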

    A more subtle consequence of the bug is that pageblocks were being
    improperly used as migration targets which potentially hurts fragmentation
    avoidance in the long-term one page at a time.

    A debugging patch revealed that it's definitely possible to stray outside
    of a pageblock which is not intended. While syzbot cannot be used to
    verify this patch, it was confirmed that the debugging warning no longer
    triggers with this patch applied. It has also been confirmed that the THP
    allocation stress tests are not degraded by this patch.

    Link: http://lkml.kernel.org/r/20190510182124.GI18914@techsingularity.net
    Fixes: e332f741a8dd ("mm, compaction: be selective about what pageblocks to clear skip hints")
    Signed-off-by: Mel Gorman
    Reported-by: syzbot+d84c80f9fe26a0f7a734@syzkaller.appspotmail.com
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Qian Cai
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: # v5.1+
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This macro adds some debug code to check that vmap allocations
    happen in ascending order.

    By default this option is set to 0 and is not active. Activating it
    requires recompiling the kernel: set it to 1 and rebuild.

    [urezki@gmail.com: v4]
    Link: http://lkml.kernel.org/r/20190406183508.25273-4-urezki@gmail.com
    Link: http://lkml.kernel.org/r/20190402162531.10888-4-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Reviewed-by: Roman Gushchin
    Cc: Ingo Molnar
    Cc: Joel Fernandes
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Cc: Tejun Heo
    Cc: Thomas Garnier
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     
  • This macro adds some debug code to check that the augmented tree is
    maintained correctly, meaning that every node contains a valid
    subtree_max_size value.

    By default this option is set to 0 and is not active. Activating it
    requires recompiling the kernel: set it to 1 and rebuild.

    [urezki@gmail.com: v4]
    Link: http://lkml.kernel.org/r/20190406183508.25273-3-urezki@gmail.com
    Link: http://lkml.kernel.org/r/20190402162531.10888-3-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Reviewed-by: Roman Gushchin
    Cc: Ingo Molnar
    Cc: Joel Fernandes
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Cc: Tejun Heo
    Cc: Thomas Garnier
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     
  • Patch series "improve vmap allocation", v3.

    Objective
    ---------

    Please have a look at the description at:

    https://lkml.org/lkml/2018/10/19/786

    but let me also summarize it a bit here.

    The current implementation has O(N) complexity. Requests with different
    permissive parameters can lead to long allocation times. When I say
    "long" I mean milliseconds.

    Description
    -----------

    This approach organizes the KVA memory layout into free areas of the
    1-ULONG_MAX range, i.e. an allocation is done by looking up free areas
    instead of finding a hole between two busy blocks. Fewer objects are
    needed to represent the free space, which gives a less fragmented memory
    allocator, because free blocks are always as large as possible.

    It uses an augmented tree where all free areas are sorted in ascending
    order of va->va_start address, paired with a linked list that provides
    O(1) access to prev/next elements.

    Since the tree is augmented, we also maintain the "subtree_max_size" of
    each VA, which reflects the maximum available free block in its left or
    right sub-tree. Knowing that, we can easily traverse toward the lowest
    (left most path) free area.

    Allocation: ~O(log(N)) complexity. It is a sequential allocation method
    and therefore tends to maximize locality. The search continues until the
    first suitable block is found that is large enough to encompass the
    requested parameters. Bigger areas are split.
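
    The lookup described above can be sketched on a plain binary search tree
    keyed by start address. This is a simplified illustration, not the
    kernel's implementation: it ignores alignment and range limits, uses no
    rebalancing, and `demo_va`, `demo_refresh` and `demo_find_lowest` are
    hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct demo_va {
	unsigned long start, end;		/* free range [start, end) */
	unsigned long subtree_max_size;	/* largest free block in subtree */
	struct demo_va *left, *right;
};

static unsigned long demo_size(const struct demo_va *va)
{
	return va->end - va->start;
}

/* Recompute subtree_max_size for the whole tree; returns the root's value. */
static unsigned long demo_refresh(struct demo_va *va)
{
	unsigned long max, child;

	if (!va)
		return 0;
	max = demo_size(va);
	child = demo_refresh(va->left);
	if (child > max)
		max = child;
	child = demo_refresh(va->right);
	if (child > max)
		max = child;
	va->subtree_max_size = max;
	return max;
}

/* Lowest-address free block of at least size bytes, or NULL. */
static struct demo_va *demo_find_lowest(struct demo_va *root, unsigned long size)
{
	struct demo_va *node = root;

	assert(size > 0);
	while (node) {
		if (node->left && node->left->subtree_max_size >= size)
			node = node->left;	/* a lower block fits */
		else if (demo_size(node) >= size)
			return node;		/* this block is the lowest fit */
		else if (node->right && node->right->subtree_max_size >= size)
			node = node->right;	/* only higher blocks fit */
		else
			return NULL;
	}
	return NULL;
}
```

    Because each step descends one level, the cost is bounded by the tree
    height, which is where the ~O(log(N)) figure comes from when the tree is
    kept balanced.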

    I copy here the description of how the area is split, as I described it
    in https://lkml.org/lkml/2018/10/19/786

    A free block can be split in three different ways. Their names are
    FL_FIT_TYPE, LE_FIT_TYPE/RE_FIT_TYPE and NE_FIT_TYPE, i.e. they
    correspond to how the requested size and alignment fit into a free block.

    FL_FIT_TYPE - in this case a free block is just removed from the free
    list/tree because it fully fits. Compared with the current design there
    is extra work for updating the rb-tree.

    LE_FIT_TYPE/RE_FIT_TYPE - left/right edges fit. In this case we just
    cut the free block. It is as fast as the current design. Most vmalloc
    allocations end up in this case, because the edge is always aligned
    to 1.

    NE_FIT_TYPE - a much less common case. It happens when the requested
    size and alignment fit neither the left nor the right edge, i.e. the
    request lies between them. In this case, during splitting we have to
    build a remaining left free area and place it back into the free
    list/tree.

    Compared with the current design there are two extra steps. First, we
    have to allocate a new vmap_area structure. Second, we have to insert
    that remaining free block into the address-sorted list/tree.

    To optimize the first step there is a cache of free_vmap objects.
    Instead of allocating from slab we just take an object from the cache and
    reuse it.

    The second step is already well optimized. Since we know the start point
    in the tree we do not search from the top. Instead, the traversal begins
    from the rb-tree node we split.
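
    The split cases described above can be sketched as a classification of
    where a request lands inside a free block. The enum and function names
    below are hypothetical, and the request is assumed to be already aligned.

```c
#include <assert.h>
#include <stddef.h>

enum demo_fit {
	DEMO_NOTHING_FIT,	/* request does not lie inside the block */
	DEMO_FL_FIT,		/* full fit: block is consumed entirely */
	DEMO_LE_FIT,		/* left edge fit: cut from the start */
	DEMO_RE_FIT,		/* right edge fit: cut from the end */
	DEMO_NE_FIT		/* no edge fit: block splits in two */
};

/* Free block covering [start, end). */
struct demo_area {
	unsigned long start, end;
};

static enum demo_fit demo_classify(const struct demo_area *free_blk,
				   unsigned long req_start, unsigned long size)
{
	unsigned long req_end = req_start + size;

	assert(free_blk != NULL);
	/* Reject empty requests, overflow, and anything outside the block. */
	if (size == 0 || req_end < req_start)
		return DEMO_NOTHING_FIT;
	if (req_start < free_blk->start || req_end > free_blk->end)
		return DEMO_NOTHING_FIT;

	if (req_start == free_blk->start)
		return req_end == free_blk->end ? DEMO_FL_FIT : DEMO_LE_FIT;
	if (req_end == free_blk->end)
		return DEMO_RE_FIT;
	return DEMO_NE_FIT;
}
```

    Only the last case leaves two remainders, which is why it is the one
    that needs an extra vmap_area object for the left-over piece.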

    De-allocation: ~O(log(N)) complexity. An area is not inserted straight
    away into the tree/list. Instead we identify the spot first, checking
    whether it can be merged with its neighbors. The list provides O(1)
    access to prev/next, so this check is fast. In summary: if merged, large
    coalesced areas are created; if not, the area is just linked, making
    more fragments.
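
    A sketch of the merge step on an address-sorted doubly linked list. For
    brevity the insertion spot is found by a linear scan here, whereas the
    text above describes locating it through the tree; `demo_free` and
    `demo_release` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct demo_free {
	unsigned long start, end;	/* free range [start, end) */
	struct demo_free *prev, *next;
};

/*
 * Return va to the sorted free list at *head, coalescing with adjacent
 * neighbours. Returns the node that covers va's range afterwards.
 */
static struct demo_free *demo_release(struct demo_free **head,
				      struct demo_free *va)
{
	struct demo_free *prev = NULL, *next = *head;
	int merged = 0;

	assert(va->start < va->end);

	/* Find the neighbours (linear here, for simplicity only). */
	while (next && next->start < va->start) {
		prev = next;
		next = next->next;
	}

	/* Merge forward: the next block grows down to cover va. */
	if (next && va->end == next->start) {
		next->start = va->start;
		va = next;
		merged = 1;
	}

	/* Merge backward: the previous block grows up to cover va. */
	if (prev && prev->end == va->start) {
		prev->end = va->end;
		if (merged) {
			/* va is the old next node; it is now redundant. */
			prev->next = va->next;
			if (va->next)
				va->next->prev = prev;
		}
		return prev;
	}

	if (merged)
		return va;

	/* No neighbour touches va: link it as a new fragment. */
	va->prev = prev;
	va->next = next;
	if (prev)
		prev->next = va;
	else
		*head = va;
	if (next)
		next->prev = va;
	return va;
}
```

    Releasing [10, 20) between free blocks [0, 10) and [20, 30) collapses
    all three into a single [0, 30) block, while releasing an isolated range
    simply adds one more list entry.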

    There is one more thing that I should mention here. After modification
    of a VA node, its subtree_max_size is updated if it was/is the biggest
    area in its left or right sub-tree. Apart from that, the change can also
    be propagated back to upper levels to fix the tree. For more details
    please have a look at the __augment_tree_propagate_from() function and
    its description.
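
    The upward fix-up can be sketched as a walk from the modified node
    toward the root that stops as soon as a node's value is already correct.
    This is a simplified illustration with hypothetical names
    (`demo_node`, `demo_propagate_from`), assuming the tree was consistent
    before the single modification.

```c
#include <assert.h>
#include <stddef.h>

struct demo_node {
	unsigned long start, end;		/* free range [start, end) */
	unsigned long subtree_max_size;
	struct demo_node *parent, *left, *right;
};

/* Largest free block among the node itself and its two children. */
static unsigned long demo_compute_max(const struct demo_node *n)
{
	unsigned long max = n->end - n->start;

	if (n->left && n->left->subtree_max_size > max)
		max = n->left->subtree_max_size;
	if (n->right && n->right->subtree_max_size > max)
		max = n->right->subtree_max_size;
	return max;
}

/* Fix subtree_max_size from n upward; returns how many nodes changed. */
static int demo_propagate_from(struct demo_node *n)
{
	int updated = 0;

	assert(n != NULL);
	while (n) {
		unsigned long max = demo_compute_max(n);

		/* Unchanged here means every ancestor is unchanged too. */
		if (n->subtree_max_size == max)
			break;
		n->subtree_max_size = max;
		updated++;
		n = n->parent;
	}
	return updated;
}
```

    Shrinking the largest block in a subtree updates that node and each
    ancestor whose maximum depended on it; a second call finds nothing left
    to do and returns immediately.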

    Tests and stressing
    -------------------

    I use the "test_vmalloc.sh" test driver available under
    "tools/testing/selftests/vm/" since the 5.1-rc1 kernel. Just run "sudo
    ./test_vmalloc.sh" to find out how to use it.

    Tested on different platforms including x86_64/i686/ARM64/x86_64_NUMA.
    Regarding the last one, I do not have physical access to a NUMA system,
    therefore I emulated it. The stress testing ran for days.

    If you run the test driver in "stress mode", you also need the patch that
    is in Andrew's tree but not in Linux 5.1-rc1. So, please apply it:

    http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/commit/?id=e0cf7749bade6da318e98e934a24d8b62fab512c

    After massive testing, I have not identified any problems like memory
    leaks, crashes or kernel panics. I find it stable, but more testing would
    be good.

    Performance analysis
    --------------------

    I have used two systems to test. One is an i5-3320M CPU @ 2.60GHz and
    the other is a HiKey960 (arm64) board. The i5-3320M runs a 4.20 kernel,
    whereas the HiKey960 uses a 4.15 kernel. Both systems could run 5.1-rc1
    as well, but those results were not ready at the time I am writing this.

    Currently the test driver consists of 8 tests. Three of them correspond
    to the different types of splitting (to compare with default); see the
    three types above. The other 5 do allocations under different conditions.

    a) sudo ./test_vmalloc.sh performance

    When the test driver is run in "performance" mode, it runs all available
    tests pinned to first online CPU with sequential execution test order. We
    do it in order to get stable and repeatable results. Take a look at time
    difference in "long_busy_list_alloc_test". It is not surprising because
    the worst case is O(N).

    # i5-3320M
    How many cycles all tests took:
    CPU0=646919905370(default) cycles vs CPU0=193290498550(patched) cycles

    # See detailed table with results here:
    ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_default.txt
    ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_patched.txt

    # Hikey960 8x CPUs
    How many cycles all tests took:
    CPU0=3478683207 cycles vs CPU0=463767978 cycles

    # See detailed table with results here:
    ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_default.txt
    ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_patched.txt

    b) time sudo ./test_vmalloc.sh test_repeat_count=1

    With this configuration, all tests are run on all available online CPUs.
    Before running, each CPU shuffles the execution order of its tests,
    which gives random allocation behaviour. So it is a rough comparison,
    but it certainly gives the overall picture.

    # i5-3320M
    default              vs  patched
    real 101m22.813s         real 0m56.805s
    user 0m0.011s            user 0m0.015s
    sys  0m5.076s            sys  0m0.023s

    # See detailed table with results here:
    ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_default.txt
    ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_patched.txt

    # Hikey960 8x CPUs
    default              vs  patched
    real unknown             real 4m25.214s
    user unknown             user 0m0.011s
    sys  unknown             sys  0m0.670s

    I did not manage to complete this test on the default HiKey960 kernel.
    After 24 hours it was still running, so I had to cancel it. That is
    why real/user/sys are "unknown".

    This patch (of 3):

    Currently an allocation of a new vmap area is done by iterating over
    the busy list (complexity O(n)) until a suitable hole is found between
    two busy areas. Each new allocation therefore makes the list grow. With
    a heavily fragmented list and different permissive parameters, an
    allocation can take a long time; on embedded devices, for example, it
    can take milliseconds.

    This patch organizes the KVA memory layout into free areas of the
    1-ULONG_MAX range. It uses an augmented red-black tree that keeps blocks
    sorted by their offsets, paired with a linked list that keeps the free
    space in order of increasing addresses.

    Each node is augmented with the size of the maximum available free block
    in its left or right sub-tree. That makes it possible to decide, during
    traversal, which way to go to reach the block that will fit and has the
    lowest start address, i.e. allocation is sequential.

    Allocation: to allocate a new block, a search is done over the tree until
    the lowest (left-most) suitable block is found, i.e. one large enough to
    satisfy the requested size, alignment and vstart point. If the block is
    bigger than the requested size, it is split.

    De-allocation: when a busy vmap area is freed, it can either be merged
    or inserted into the tree. The red-black tree allows a spot to be found
    efficiently, whereas the linked list provides constant-time access to
    the previous and next blocks to check whether merging can be done. When
    a de-allocated memory chunk is merged, a larger coalesced area is created.

    Complexity: ~O(log(N))
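
    The lowest-address search described above can be sketched in userspace
    as follows. This is a minimal model, not the kernel code: it uses a
    plain (unbalanced) binary search tree instead of a red-black tree,
    ignores alignment and vstart, and leaves out splitting and merging. The
    names FreeBlock, insert and find_lowest_fit are hypothetical.

```python
class FreeBlock:
    """A free range [start, start + size) stored as a tree node."""

    def __init__(self, start, size):
        self.start = start
        self.size = size
        self.left = None
        self.right = None
        # Augmented value: largest block size in the subtree rooted here.
        self.subtree_max = size


def insert(node, block):
    """Insert block keyed by start address, updating subtree_max on the path."""
    if node is None:
        return block
    if block.start < node.start:
        node.left = insert(node.left, block)
    else:
        node.right = insert(node.right, block)
    node.subtree_max = max(node.subtree_max, block.subtree_max)
    return node


def find_lowest_fit(node, size):
    """Return the lowest-address block with block.size >= size, or None."""
    while node is not None:
        if node.left is not None and node.left.subtree_max >= size:
            node = node.left    # a fitting block exists at a lower address
        elif node.size >= size:
            return node         # nothing lower fits, this block does
        elif node.right is not None and node.right.subtree_max >= size:
            node = node.right   # only higher addresses can fit
        else:
            return None         # no block in the tree is large enough
    return None


root = None
for start, size in [(100, 8), (10, 4), (50, 16), (200, 32), (30, 2)]:
    root = insert(root, FreeBlock(start, size))

print(find_lowest_fit(root, 10).start)  # 50
```

    Because every step descends one level and the augmented value rules out
    subtrees that cannot fit, the search cost is bounded by the tree height,
    which is where the logarithmic complexity of a balanced tree comes from.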

    [urezki@gmail.com: v3]
    Link: http://lkml.kernel.org/r/20190402162531.10888-2-urezki@gmail.com
    [urezki@gmail.com: v4]
    Link: http://lkml.kernel.org/r/20190406183508.25273-2-urezki@gmail.com
    Link: http://lkml.kernel.org/r/20190321190327.11813-2-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Cc: Matthew Wilcox
    Cc: Thomas Garnier
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Cc: Joel Fernandes
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     

17 May, 2019

1 commit

  • It turned out that DEBUG_SLAB_LEAK is still broken even after recent
    rescue efforts: when there is a large number of objects like
    kmemleak_object, which is normal on a debug kernel,

    # grep kmemleak /proc/slabinfo
    kmemleak_object 2243606 3436210 ...

    reading /proc/slab_allocators could easily loop forever while processing
    the kmemleak_object cache, and any additional freeing or allocating of
    objects will trigger a reprocessing. To make the situation worse,
    soft-lockups could easily happen in this situation, which will call
    printk() and allocate more kmemleak objects, guaranteeing an infinite
    loop.

    Also, since it seems no one noticed when it was totally broken more
    than 2 years ago - see the commit fcf88917dd43 ("slab: fix a crash
    by reading /proc/slab_allocators") - probably nobody cares about it
    anymore due to the decline of SLAB. Just remove it entirely.

    Suggested-by: Vlastimil Babka
    Suggested-by: Linus Torvalds
    Signed-off-by: Qian Cai
    Signed-off-by: Linus Torvalds

    Qian Cai
     

15 May, 2019

12 commits

  • When a cgroup is reclaimed on behalf of a configured limit, reclaim
    needs to round-robin through all NUMA nodes that hold pages of the memcg
    in question. However, when assembling the mask of candidate NUMA nodes,
    the code only consults the *local* cgroup LRU counters, not the
    recursive counters for the entire subtree. Cgroup limits are frequently
    configured against intermediate cgroups that do not have memory on their
    own LRUs. In this case, the node mask will always come up empty and
    reclaim falls back to scanning only the current node.

    If a cgroup subtree has some memory on one node but the processes are
    bound to another node afterwards, the limit reclaim will never age or
    reclaim that memory anymore.

    To fix this, use the recursive LRU counts for a cgroup subtree to
    determine which nodes hold memory of that cgroup.
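
    The difference between the two masks can be illustrated with a small
    model. This is only a sketch: the Cgroup class and both helper names are
    hypothetical, and the recursive mask is computed here by walking the
    children for clarity, whereas the fix relies on recursive counters.

```python
class Cgroup:
    """Toy cgroup holding per-NUMA-node LRU page counts for its own LRUs."""

    def __init__(self, name, lru_pages_per_node, children=()):
        self.name = name
        self.lru_pages_per_node = lru_pages_per_node  # {node_id: pages}
        self.children = list(children)


def local_node_mask(cg):
    """Nodes holding pages on this cgroup's own LRUs only."""
    return {node for node, pages in cg.lru_pages_per_node.items() if pages > 0}


def recursive_node_mask(cg):
    """Nodes holding pages anywhere in the subtree rooted at cg."""
    mask = local_node_mask(cg)
    for child in cg.children:
        mask |= recursive_node_mask(child)
    return mask


# The limit is set on an intermediate cgroup with no memory of its own;
# all pages live in a leaf and sit on node 1.
leaf = Cgroup("workload", {0: 0, 1: 512})
limited = Cgroup("slice", {}, children=[leaf])

print(local_node_mask(limited))      # set()
print(recursive_node_mask(limited))  # {1}
```

    With the local mask empty, reclaim has no candidate nodes for the
    subtree; the recursive mask names node 1, where the memory actually is.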

    The code has been broken like this forever, so it doesn't seem to be a
    problem in practice. I just noticed it while reviewing the way the LRU
    counters are used in general.

    Link: http://lkml.kernel.org/r/20190412151507.2769-5-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Right now, when somebody needs to know the recursive memory statistics
    and events of a cgroup subtree, they need to walk the entire subtree and
    sum up the counters manually.

    There are two issues with this:

    1. When a cgroup gets deleted, its stats are lost. The state counters
    should all be 0 at that point, of course, but the events are not.
    When this happens, the event counters, which are supposed to be
    monotonic, can go backwards in the parent cgroups.

    2. During regular operation, we always have a certain number of lazily
    freed cgroups sitting around that have been deleted, have no tasks,
    but have a few cache pages remaining. These groups' statistics do not
    change until we eventually hit memory pressure, but somebody
    watching, say, memory.stat on an ancestor has to iterate those every
    time.

    This patch addresses both issues by introducing recursive counters at
    each level that are propagated from the write side when stats change.

    Upward propagation happens when the per-cpu caches spill over into the
    local atomic counter. This is the same thing we do during charge and
    uncharge, except that the latter uses atomic RMWs, which are more
    expensive; stat changes happen at around the same rate. In a sparse
    file test (page faults and reclaim at maximum CPU speed) with 5 cgroup
    nesting levels, perf shows __mod_memcg_page state at ~1%.
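
    The write-side propagation can be modelled roughly as below. This is a
    sketch under assumptions, not the kernel implementation: the Memcg class,
    the BATCH value and the dictionary standing in for per-cpu storage are
    all hypothetical, and atomics are not modelled.

```python
BATCH = 32  # hypothetical spill threshold for the per-cpu cache


class Memcg:
    """Toy memcg with one recursive counter and a per-cpu pending delta."""

    def __init__(self, parent=None):
        self.parent = parent
        self.recursive = 0  # counter covering this cgroup and its subtree
        self.percpu = {}    # cpu id -> delta not yet propagated

    def mod_state(self, cpu, delta):
        """Account delta on this cpu; spill upwards once the cache overflows."""
        pending = self.percpu.get(cpu, 0) + delta
        if abs(pending) > BATCH:
            cg = self
            while cg is not None:  # propagate to self and every ancestor
                cg.recursive += pending
                cg = cg.parent
            pending = 0
        self.percpu[cpu] = pending


root = Memcg()
child = Memcg(parent=root)
for _ in range(100):
    child.mod_state(cpu=0, delta=1)

# Reading the recursive state is a single lookup; it may lag the true
# value by whatever is still cached per cpu.
print(root.recursive, child.percpu[0])  # 99 1
```

    Readers of the recursive counter no longer walk the subtree, and a
    deleted child cannot take already-propagated counts away from its
    ancestors.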

    Link: http://lkml.kernel.org/r/20190412151507.2769-4-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • These are getting too big to be inlined in every callsite. They were
    stolen from vmstat.c, which already out-of-lines them, and they have
    only been growing since. The callsites aren't that hot, either.

    Move __mod_memcg_state(), __mod_lruvec_state() and
    __count_memcg_events() out of line and add kerneldoc comments.

    Link: http://lkml.kernel.org/r/20190412151507.2769-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Patch series "mm: memcontrol: memory.stat cost & correctness".

    The cgroup memory.stat file holds recursive statistics for the entire
    subtree. The current implementation does this tree walk on-demand
    whenever the file is read. This is giving us problems in production.

    1. The cost of aggregating the statistics on-demand is high. A lot of
    system service cgroups are mostly idle and their stats don't change
    between reads, yet we always have to check them. There are also always
    some lazily-dying cgroups sitting around that are pinned by a handful
    of remaining page cache; the same applies to them.

    In an application that periodically monitors memory.stat in our
    fleet, we have seen the aggregation consume up to 5% CPU time.

    2. When cgroups die and disappear from the cgroup tree, so do their
    accumulated vm events. The result is that the event counters at
    higher-level cgroups can go backwards and confuse some of our
    automation, let alone people looking at the graphs over time.

    To address both issues, this patch series changes the stat
    implementation to spill counts upwards when the counters change.

    The upward spilling is batched using the existing per-cpu cache. In a
    sparse file stress test with 5 level cgroup nesting, the additional cost
    of the flushing was negligible (a little under 1% of CPU at 100% CPU
    utilization, compared to the 5% of reading memory.stat during regular
    operation).

    This patch (of 4):

    memcg_page_state(), lruvec_page_state(), memcg_sum_events() are
    currently returning the state of the local memcg or lruvec, not the
    recursive state.

    In practice there is a demand for both versions, although the callers
    that want the recursive counts currently sum them up by hand.

    Per default, cgroups are considered recursive entities and generally we
    expect more users of the recursive counters, with the local counts being
    special cases. To reflect that in the name, add a _local suffix to the
    current implementations.

    The following patch will re-incarnate these functions with recursive
    semantics, but with an O(1) implementation.

    [hannes@cmpxchg.org: fix bisection hole]
    Link: http://lkml.kernel.org/r/20190417160347.GC23013@cmpxchg.org
    Link: http://lkml.kernel.org/r/20190412151507.2769-2-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • I spent literally an hour trying to work out why an earlier version of
    my memory.events aggregation code doesn't work properly, only to find
    out I was calling memcg->events instead of memcg->memory_events, which
    is fairly confusing.

    This naming seems in need of reworking, so make it harder to do the
    wrong thing by using vmevents instead of events, which makes it more
    clear that these are vm counters rather than memcg-specific counters.

    There are also a few other inconsistent names in both the percpu and
    aggregated structs, so these are all cleaned up to be more coherent and
    easy to understand.

    This commit contains code cleanup only: there are no logic changes.

    [akpm@linux-foundation.org: fix it for preceding changes]
    Link: http://lkml.kernel.org/r/20190208224319.GA23801@chrisdown.name
    Signed-off-by: Chris Down
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Roman Gushchin
    Cc: Dennis Zhou
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Down
     
  • The semantics of what mincore() considers to be resident is not
    completely clear, but Linux has always (since 2.3.52, which is when
    mincore() was initially done) treated it as "page is available in page
    cache".

    That's potentially a problem, as that [in]directly exposes
    meta-information about pagecache / memory mapping state even about
    memory not strictly belonging to the process executing the syscall,
    opening possibilities for sidechannel attacks.

    Change the semantics of mincore() so that it only reveals pagecache
    information for non-anonymous mappings that belong to files that the
    calling process could (if it tried to) successfully open for writing;
    otherwise we'd be including shared non-exclusive mappings, which

    - is the sidechannel

    - is not the usecase for mincore(), as that's primarily used for data,
    not (shared) text

    [jkosina@suse.cz: v2]
    Link: http://lkml.kernel.org/r/20190312141708.6652-2-vbabka@suse.cz
    [mhocko@suse.com: restructure can_do_mincore() conditions]
    Link: http://lkml.kernel.org/r/nycvar.YFH.7.76.1903062342020.19912@cbobk.fhfr.pm
    Signed-off-by: Jiri Kosina
    Signed-off-by: Vlastimil Babka
    Acked-by: Josh Snyder
    Acked-by: Michal Hocko
    Originally-by: Linus Torvalds
    Originally-by: Dominique Martinet
    Cc: Andy Lutomirski
    Cc: Dave Chinner
    Cc: Kevin Easton
    Cc: Matthew Wilcox
    Cc: Cyril Hrubis
    Cc: Tejun Heo
    Cc: Kirill A. Shutemov
    Cc: Daniel Gruss
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Kosina
     
  • When freeing a page with an order >= shuffle_page_order, randomly select
    the front or back of the list for insertion.

    While the mm tries to defragment physical pages into huge pages, this
    can tend to make the page allocator more predictable over time. Inject
    the front-back randomness to preserve the initial randomness established
    by shuffle_free_memory() when the kernel was booted.

    The overhead of this manipulation is constrained by only being applied
    for MAX_ORDER sized pages by default.
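
    A userspace sketch of the front-or-back decision follows. It is an
    illustration only: free_page and its parameters are hypothetical names,
    a deque stands in for a free list, and it assumes that pages below the
    shuffle order are always inserted at the front.

```python
import random
from collections import deque


def free_page(free_list, page, order, shuffle_order, rng=random):
    """Put page on free_list; large enough pages go to a random end."""
    if order >= shuffle_order and rng.getrandbits(1):
        free_list.append(page)      # back of the list
    else:
        free_list.appendleft(page)  # front of the list (assumed default)


small = deque()
for page in range(3):
    free_page(small, page, order=0, shuffle_order=10)
print(list(small))  # [2, 1, 0]

large = deque()
rng = random.Random(1)
for page in range(8):
    free_page(large, page, order=10, shuffle_order=10, rng=rng)
print(list(large))  # same pages, order depends on the random bits
```

    One random bit per free keeps the cost small while still preventing the
    list order from settling back into a predictable sequence.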

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/154899812788.3165233.9066631950746578517.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: Kees Cook
    Cc: Michal Hocko
    Cc: Dave Hansen
    Cc: Keith Busch
    Cc: Robert Elliott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • In preparation for runtime randomization of the zone lists, take all
    (well, most of) the list_*() functions in the buddy allocator and put
    them in helper functions. Provide a common control point for injecting
    additional behavior when freeing pages.

    [dan.j.williams@intel.com: fix buddy list helpers]
    Link: http://lkml.kernel.org/r/155033679702.1773410.13041474192173212653.stgit@dwillia2-desk3.amr.corp.intel.com
    [vbabka@suse.cz: remove del_page_from_free_area() migratetype parameter]
    Link: http://lkml.kernel.org/r/4672701b-6775-6efd-0797-b6242591419e@suse.cz
    Link: http://lkml.kernel.org/r/154899812264.3165233.5219320056406926223.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Signed-off-by: Vlastimil Babka
    Tested-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Dave Hansen
    Cc: Kees Cook
    Cc: Keith Busch
    Cc: Robert Elliott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Patch series "mm: Randomize free memory", v10.

    This patch (of 3):

    Randomization of the page allocator improves the average utilization of
    a direct-mapped memory-side-cache. Memory side caching is a platform
    capability that Linux has been previously exposed to in HPC
    (high-performance computing) environments on specialty platforms. In
    that instance it was a smaller pool of high-bandwidth-memory relative to
    higher-capacity / lower-bandwidth DRAM. Now, this capability is going
    to be found on general purpose server platforms where DRAM is a cache in
    front of higher latency persistent memory [1].

    Robert offered an explanation of the state of the art of Linux
    interactions with memory-side-caches [2], and I copy it here:

    It's been a problem in the HPC space:
    http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/

    A kernel module called zonesort is available to try to help:
    https://software.intel.com/en-us/articles/xeon-phi-software

    and this abandoned patch series proposed that for the kernel:
    https://lkml.kernel.org/r/20170823100205.17311-1-lukasz.daniluk@intel.com

    Dan's patch series doesn't attempt to ensure buffers won't conflict, but
    also reduces the chance that the buffers will. This will make performance
    more consistent, albeit slower than "optimal" (which is near impossible
    to attain in a general-purpose kernel). That's better than forcing
    users to deploy remedies like:
    "To eliminate this gradual degradation, we have added a Stream
    measurement to the Node Health Check that follows each job;
    nodes are rebooted whenever their measured memory bandwidth
    falls below 300 GB/s."

    A replacement for zonesort was merged upstream in commit cc9aec03e58f
    ("x86/numa_emulation: Introduce uniform split capability"). With this
    numa_emulation capability, memory can be split into cache sized
    ("near-memory" sized) numa nodes. A bind operation to such a node, and
    disabling workloads on other nodes, enables full cache performance.
    However, once the workload exceeds the cache size then cache conflicts
    are unavoidable. While HPC environments might be able to tolerate
    time-scheduling of cache sized workloads, for general purpose server
    platforms, the oversubscribed cache case will be the common case.

    The worst case scenario is that a server system owner benchmarks a
    workload at boot with an un-contended cache only to see that performance
    degrade over time, even below the average cache performance due to
    excessive conflicts. Randomization clips the peaks and fills in the
    valleys of cache utilization to yield steady average performance.

    Here are some performance impact details of the patches:

    1/ An Intel internal synthetic memory bandwidth measurement tool saw a
    3X speedup in a contrived case that tries to force cache conflicts.
    The contrived case used the numa_emulation capability to force an
    instance of the benchmark to be run in two of the near-memory sized
    numa nodes. If both instances were placed on the same emulated node
    they would fit and cause zero conflicts. While on separate emulated
    nodes without randomization they underutilized the cache and conflicted
    unnecessarily due to the in-order allocation per node.

    2/ A well known Java server application benchmark was run with a heap
    size that exceeded cache size by 3X. The cache conflict rate was 8%
    for the first run and degraded to 21% after page allocator aging. With
    randomization enabled the rate levelled out at 11%.

    3/ A MongoDB workload did not observe measurable difference in
    cache-conflict rates, but the overall throughput dropped by 7% with
    randomization in one case.

    4/ Mel Gorman ran his suite of performance workloads with randomization
    enabled on platforms without a memory-side-cache and saw a mix of some
    improvements and some losses [3].

    While there is potentially significant improvement for applications that
    depend on low latency access across a wide working-set, the performance
    may be negligible to negative for other workloads. For this reason the
    shuffle capability defaults to off unless a direct-mapped
    memory-side-cache is detected. Even then, the page_alloc.shuffle=0
    parameter can be specified to disable the randomization on those systems.

    Outside of memory-side-cache utilization concerns there is a potential
    security benefit from randomization. Some data exfiltration and
    return-oriented-programming attacks rely on the ability to infer the
    location of sensitive data objects. The kernel page allocator, especially
    early in system boot, has predictable first-in-first-out behavior for
    physical pages. Pages are freed in physical address order when first
    onlined.

    Quoting Kees:
    "While we already have a base-address randomization
    (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
    memory layouts would certainly be using the predictability of
    allocation ordering (i.e. for attacks where the base address isn't
    important: only the relative positions between allocated memory).
    This is common in lots of heap-style attacks. They try to gain
    control over ordering by spraying allocations, etc.

    I'd really like to see this because it gives us something similar
    to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."

    While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
    caches, it leaves the vast bulk of memory to be predictably allocated in
    order. However, it should be noted that the concrete security benefits
    are hard to quantify, and no known CVE is mitigated by this randomization.

    Introduce shuffle_free_memory(), and its helper shuffle_zone(), to perform
    a Fisher-Yates shuffle of the page allocator 'free_area' lists when they
    are initially populated with free memory at boot and at hotplug time. Do
    this based on either the presence of a page_alloc.shuffle=Y command line
    parameter, or autodetection of a memory-side-cache (to be added in a
    follow-on patch).

    The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
    pages, where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1, i.e.
    10 (4MB). This trades off randomization granularity for time spent
    shuffling. MAX_ORDER-1 was chosen to be minimally invasive to the page
    allocator while still showing memory-side cache behavior improvements,
    with the expectation that the security implications of finer granularity
    randomization are mitigated by CONFIG_SLAB_FREELIST_RANDOM. The
    performance impact of the shuffling appears to be in the noise compared to
    other memory initialization work.
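
    For reference, a textbook Fisher-Yates shuffle and the block size
    arithmetic look like this. It is a generic sketch on a Python list, not
    the shuffle_zone() implementation, and it assumes 4KB base pages;
    fisher_yates_shuffle is a hypothetical name.

```python
import random


def fisher_yates_shuffle(items, rng=random):
    """Shuffle items in place; every permutation is equally likely."""
    for i in range(len(items) - 1, 0, -1):
        j = rng.randint(0, i)  # pick from the not-yet-fixed prefix [0, i]
        items[i], items[j] = items[j], items[i]


# Stand-in for the order-10 free blocks of one zone, in address order.
free_blocks = list(range(16))
fisher_yates_shuffle(free_blocks, random.Random(42))
print(free_blocks)  # same blocks, randomized order

# Granularity of the shuffle: an order-10 block of 4KB pages.
PAGE_SIZE = 4096  # assumed base page size
SHUFFLE_ORDER = 10
block_bytes = PAGE_SIZE << SHUFFLE_ORDER
print(block_bytes // (1024 * 1024), "MB")  # 4 MB
```

    Shuffling at this granularity touches one list entry per 4MB of memory,
    which is consistent with the cost being small next to the rest of
    memory initialization.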

    This initial randomization can be undone over time, so a follow-on patch
    is introduced to inject entropy on page free decisions. It is reasonable
    to ask if the page free entropy is sufficient on its own, but it is not,
    due to the in-order initial freeing of pages. At the start of that process
    putting page1 in front or behind page0 still keeps them close together,
    page2 is still near page1 and has a high chance of being adjacent. As
    more pages are added ordering diversity improves, but there is still high
    page locality for the low address pages and this leads to no significant
    impact to the cache conflict rate.

    [1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
    [2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM
    [3]: https://lkml.org/lkml/2018/10/12/309

    [dan.j.williams@intel.com: fix shuffle enable]
    Link: http://lkml.kernel.org/r/154943713038.3858443.4125180191382062871.stgit@dwillia2-desk3.amr.corp.intel.com
    [cai@lca.pw: fix SHUFFLE_PAGE_ALLOCATOR help texts]
    Link: http://lkml.kernel.org/r/20190425201300.75650-1-cai@lca.pw
    Link: http://lkml.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Signed-off-by: Qian Cai
    Reviewed-by: Kees Cook
    Acked-by: Michal Hocko
    Cc: Dave Hansen
    Cc: Keith Busch
    Cc: Robert Elliott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • The vmap_lazy_nr variable has type atomic_t, which is a 4-byte integer
    on both 32-bit and 64-bit systems. lazy_max_pages() deals with
    "unsigned long", which is 8 bytes on 64-bit systems, thus vmap_lazy_nr
    should be 8 bytes on 64-bit as well.

    Link: http://lkml.kernel.org/r/20190131162452.25879-1-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Reviewed-by: Andrew Morton
    Reviewed-by: William Kucharski
    Cc: Michal Hocko
    Cc: Matthew Wilcox
    Cc: Thomas Garnier
    Cc: Oleksiy Avramchenko
    Cc: Joel Fernandes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     
  • Commit 763b218ddfaf ("mm: add preempt points into __purge_vmap_area_lazy()")
    introduced some preempt points, one of which prioritizes allocation
    over the lazy freeing of vmap areas.

    Prioritizing allocation over freeing does not work well all the time,
    i.e. it should rather be a compromise.

    1) The number of lazy pages directly influences the busy list length
    and thus operations like allocation, lookup, unmap, remove, etc.

    2) Under heavy stress of the vmalloc subsystem I ran into a situation
    where memory usage kept increasing until it hit the out_of_memory ->
    panic state, because the logic that frees vmap areas in the
    __purge_vmap_area_lazy() function was completely blocked.

    Establish a threshold beyond which freeing is prioritized back over
    allocation, creating a balance between the two.

    Using the vmalloc test driver in "stress mode", i.e. when all available
    test cases are run simultaneously on all online CPUs, applying
    pressure on the vmalloc subsystem, my HiKey 960 board runs out of
    memory because the __purge_vmap_area_lazy() logic simply is
    not able to free pages in time.

    How I run it:

    1) You should build your kernel with CONFIG_TEST_VMALLOC=m
    2) ./tools/testing/selftests/vm/test_vmalloc.sh stress

    During this test "vmap_lazy_nr" pages will go far beyond the acceptable
    lazy_max_pages() threshold, which will lead to an enormous busy list size
    and other problems, including longer allocation times.

    Link: http://lkml.kernel.org/r/20190124115648.9433-3-urezki@gmail.com
    Signed-off-by: Uladzislau Rezki (Sony)
    Reviewed-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Matthew Wilcox
    Cc: Thomas Garnier
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Cc: Joel Fernandes
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uladzislau Rezki (Sony)
     
  • Commit 0139aa7b7fa ("mm: rename _count, field of the struct page, to
    _refcount") left out a couple of references to the old field name. Fix
    that.

    Link: http://lkml.kernel.org/r/cedf87b02eb8a6b3eac57e8e91da53fb15c3c44c.1556537475.git.baruch@tkos.co.il
    Fixes: 0139aa7b7fa ("mm: rename _count, field of the struct page, to _refcount")
    Signed-off-by: Baruch Siach
    Reviewed-by: Andrew Morton
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Baruch Siach