23 Mar, 2016

2 commits

  • kcov provides code coverage collection for coverage-guided fuzzing
    (randomized testing). Coverage-guided fuzzing is a testing technique
    that uses coverage feedback to determine new interesting inputs to a
    system. A notable user-space example is AFL
    (http://lcamtuf.coredump.cx/afl/). However, this technique is not
    widely used for kernel testing due to missing compiler and kernel
    support.

    kcov does not aim to collect as much coverage as possible. It aims to
    collect more or less stable coverage that is a function of syscall
    inputs. To achieve this goal it does not collect coverage in soft/hard
    interrupts, and instrumentation of some inherently non-deterministic
    or uninteresting parts of the kernel (e.g. scheduler, locking) is
    disabled.

    Currently there is a single coverage collection mode (tracing), but the
    API anticipates additional collection modes. Initially I also
    implemented a second mode which exposes coverage in a fixed-size hash
    table of counters (what Quentin used in his original patch). I've
    dropped the second mode for simplicity.

    This patch adds the necessary support on the kernel side. The
    complementary compiler support was added in gcc revision 231296.

    We've used this support to build the syzkaller system call fuzzer,
    which has found 90 kernel bugs in just 2 months:

    https://github.com/google/syzkaller/wiki/Found-Bugs

    We've also found 30+ bugs in our internal systems with syzkaller.
    Another (yet unexplored) direction where kcov coverage would greatly
    help is more traditional "blob mutation", for example, mounting a
    random blob as a filesystem, or receiving a random blob over the wire.

    Why not gcov? A typical fuzzing loop looks as follows: (1) reset
    coverage, (2) execute a bit of code, (3) collect coverage, repeat. A
    typical coverage set can be just a dozen basic blocks (e.g. an invalid
    input). In such a context gcov becomes prohibitively expensive, as the
    reset/collect steps depend on the total number of basic blocks/edges
    in the program (about 2M for the kernel). The cost of kcov depends
    only on the number of executed basic blocks/edges. On top of that, the
    kernel requires per-thread coverage because there are always
    background threads and unrelated processes that also produce coverage.
    With inlined gcov instrumentation per-thread coverage is not possible.

    kcov exposes kernel PCs and control flow to user-space which is
    insecure. But debugfs should not be mapped as user accessible.
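
    A minimal user-space sketch of the tracing mode, for illustration only
    (the debugfs path and ioctl encodings below follow the kcov uapi
    header and are assumptions here, not part of this excerpt):

    /* Hypothetical kcov consumer: collect the kernel PCs covered by one
     * syscall issued from this thread. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define KCOV_INIT_TRACE _IOR('c', 1, unsigned long)
    #define KCOV_ENABLE     _IO('c', 100)
    #define KCOV_DISABLE    _IO('c', 101)
    #define COVER_SIZE      (64 << 10)      /* max number of recorded PCs */

    int main(void)
    {
            int fd = open("/sys/kernel/debug/kcov", O_RDWR);
            if (fd == -1)
                    exit(1);
            /* Select trace mode and the size of the coverage buffer. */
            if (ioctl(fd, KCOV_INIT_TRACE, COVER_SIZE))
                    exit(1);
            unsigned long *cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
                                        PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (cover == MAP_FAILED)
                    exit(1);
            /* Coverage is collected per task: enable it for this thread. */
            if (ioctl(fd, KCOV_ENABLE, 0))
                    exit(1);
            __atomic_store_n(&cover[0], 0, __ATOMIC_RELAXED);   /* reset */
            read(-1, NULL, 0);                  /* the syscall under test */
            unsigned long n = __atomic_load_n(&cover[0], __ATOMIC_RELAXED);
            for (unsigned long i = 0; i < n; i++)
                    printf("0x%lx\n", cover[i + 1]);    /* covered PCs */
            ioctl(fd, KCOV_DISABLE, 0);
            return 0;
    }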

    Based on a patch by Quentin Casasnovas.

    [akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
    [akpm@linux-foundation.org: unbreak allmodconfig]
    [akpm@linux-foundation.org: follow x86 Makefile layout standards]
    Signed-off-by: Dmitry Vyukov
    Reviewed-by: Kees Cook
    Cc: syzkaller
    Cc: Vegard Nossum
    Cc: Catalin Marinas
    Cc: Tavis Ormandy
    Cc: Will Deacon
    Cc: Quentin Casasnovas
    Cc: Kostya Serebryany
    Cc: Eric Dumazet
    Cc: Alexander Potapenko
    Cc: Kees Cook
    Cc: Bjorn Helgaas
    Cc: Sasha Levin
    Cc: David Drysdale
    Cc: Ard Biesheuvel
    Cc: Andrey Ryabinin
    Cc: Kirill A. Shutemov
    Cc: Jiri Slaby
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     
  • Commit b430e9d1c6d4 ("remove compressed copy from zram in-memory")
    added a swap_slot_free_notify call in *end_swap_bio_read* to remove
    the copy duplicated between zram and main memory.

    However, with the introduction of rw_page in zram by commit
    8c7f01025f7b ("zram: implement rw_page operation of zram"), it became
    a no-op because rw_page doesn't use a bio.

    Memory footprint is really important on embedded platforms with small
    memory (for example, 512M), because the LMK or similar memory
    management modules start killing processes once the footprint exceeds
    some threshold.

    This patch restores the function for rw_page, thereby eliminating this
    duplication.

    Signed-off-by: Minchan Kim
    Cc: Sergey Senozhatsky
    Cc: karam.lee
    Cc:
    Cc: Chan Jeong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

22 Mar, 2016

1 commit

  • Pull drm updates from Dave Airlie:
    "This is the main drm pull request for 4.6 kernel.

    Overall the coolest thing here for me is the nouveau maxwell signed
    firmware support from NVidia, it's taken a long while to extract this
    from them.

    I also wish the ARM vendors just designed one set of display IP, ARM
    display block proliferation is definitely increasing.

    Core:
    - drm_event cleanups
    - Internal API cleanup making mode_fixup optional.
    - Apple GMUX vga switcheroo support.
    - DP AUX testing interface

    Panel:
    - Refactoring of DSI core for use over more transports.

    New driver:
    - ARM hdlcd driver

    i915:
    - FBC/PSR (framebuffer compression, panel self refresh) enabled by default.
    - Ongoing atomic display support work
    - Ongoing runtime PM work
    - Pixel clock limit checks
    - VBT DSI description support
    - GEM fixes
    - GuC firmware scheduler enhancements

    amdkfd:
    - Deferred probing fixes to avoid Makefile or link ordering dependencies.

    amdgpu/radeon:
    - ACP support for i2s audio.
    - Command Submission/GPU scheduler/GPUVM optimisations
    - Initial GPU reset support for amdgpu

    vmwgfx:
    - Support for DX10 gen mipmaps
    - Pageflipping and other fixes.

    exynos:
    - Exynos5420 SoC support for FIMD
    - Exynos5422 SoC support for MIPI-DSI

    nouveau:
    - GM20x secure boot support - adds acceleration for Maxwell GPUs.
    - GM200 support
    - GM20B clock driver support
    - Power sensors work

    etnaviv:
    - Correctness fixes for GPU cache flushing
    - Better support for i.MX6 systems.

    imx-drm:
    - VBlank IRQ support
    - Fence support
    - OF endpoint support

    msm:
    - HDMI support for 8996 (snapdragon 820)
    - Adreno 430 support
    - Timestamp queries support

    virtio-gpu:
    - Fixes for Android support.

    rockchip:
    - Add support for Innosilicon HDMI

    rcar-du:
    - Support for 4 crtcs
    - R8A7795 support
    - RCar Gen 3 support

    omapdrm:
    - HDMI interlace output support
    - dma-buf import support
    - Refactoring to remove a lot of legacy code.

    tilcdc:
    - Rewrite of pageflipping code
    - dma-buf support
    - pinctrl support

    vc4:
    - HDMI modesetting bug fixes
    - Significant 3D performance improvement.

    fsl-dcu (FreeScale):
    - Lots of fixes

    tegra:
    - Two small fixes

    sti:
    - Atomic support for planes
    - Improved HDMI support"

    * 'drm-next' of git://people.freedesktop.org/~airlied/linux: (1063 commits)
    drm/amdgpu: release_pages requires linux/pagemap.h
    drm/sti: restore mode_fixup callback
    drm/amdgpu/gfx7: add MTYPE definition
    drm/amdgpu: removing BO_VAs shouldn't be interruptible
    drm/amd/powerplay: show uvd/vce power gate enablement for tonga.
    drm/amd/powerplay: show uvd/vce power gate info for fiji
    drm/amdgpu: use sched fence if possible
    drm/amdgpu: move ib.fence to job.fence
    drm/amdgpu: give a fence param to ib_free
    drm/amdgpu: include the right version of gmc header files for iceland
    drm/radeon: fix indentation.
    drm/amd/powerplay: add uvd/vce dpm enabling flag to fix the performance issue for CZ
    drm/amdgpu: switch back to 32bit hw fences v2
    drm/amdgpu: remove amdgpu_fence_is_signaled
    drm/amdgpu: drop the extra fence range check v2
    drm/amdgpu: signal fences directly in amdgpu_fence_process
    drm/amdgpu: cleanup amdgpu_fence_wait_empty v2
    drm/amdgpu: keep all fences in an RCU protected array v2
    drm/amdgpu: add number of hardware submissions to amdgpu_fence_driver_init_ring
    drm/amdgpu: RCU protected amd_sched_fence_release
    ...

    Linus Torvalds
     

21 Mar, 2016

1 commit

  • Pull x86 protection key support from Ingo Molnar:
    "This tree adds support for a new memory protection hardware feature
    that is available in upcoming Intel CPUs: 'protection keys' (pkeys).

    There's a background article at LWN.net:

    https://lwn.net/Articles/643797/

    The gist is that protection keys allow the encoding of
    user-controllable permission masks in the pte. So instead of having a
    fixed protection mask in the pte (which needs a system call to change
    and works on a per page basis), the user can map a (handful of)
    protection mask variants and can change the masks runtime relatively
    cheaply, without having to change every single page in the affected
    virtual memory range.

    This allows the dynamic switching of the protection bits of large
    amounts of virtual memory, via user-space instructions. It also
    allows more precise control of MMU permission bits: for example the
    executable bit is separate from the read bit (see more about that
    below).

    This tree adds the MM infrastructure and low level x86 glue needed for
    that, plus it adds a high level API to make use of protection keys -
    if a user-space application calls:

    mmap(..., PROT_EXEC);

    or

    mprotect(ptr, sz, PROT_EXEC);

    (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
    this special case, and will set a special protection key on this
    memory range. It also sets the appropriate bits in the Protection
    Keys User Rights (PKRU) register so that the memory becomes unreadable
    and unwritable.

    So using protection keys the kernel is able to implement 'true'
    PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
    PROT_READ as well. Unreadable executable mappings have security
    advantages: they cannot be read via information leaks to figure out
    ASLR details, nor can they be scanned for ROP gadgets - and they
    cannot be used by exploits for data purposes either.

    We know about no user-space code that relies on pure PROT_EXEC
    mappings today, but binary loaders could start making use of this new
    feature to map binaries and libraries in a more secure fashion.

    There is other pending pkeys work that offers more high level system
    call APIs to manage protection keys - but those are not part of this
    pull request.

    Right now there's a Kconfig that controls this feature
    (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
    (like most x86 CPU feature enablement code that has no runtime
    overhead), but it's not user-configurable at the moment. If there's
    any serious problem with this then we can make it configurable and/or
    flip the default"
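
    A minimal sketch of the PROT_EXEC-only case described above, for
    illustration only (whether the read actually faults depends on
    pkeys-capable hardware and this kernel support):

    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            long pagesz = sysconf(_SC_PAGESIZE);

            /* PROT_EXEC without PROT_READ/PROT_WRITE: the kernel assigns
             * an execute-only protection key to this range. */
            unsigned char *p = mmap(NULL, pagesz, PROT_EXEC,
                                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }

            /* With pkeys, this read is expected to SIGSEGV because PKRU
             * makes the range unreadable; on legacy x86, PROT_EXEC implied
             * PROT_READ and the read would succeed. */
            printf("read %d\n", p[0]);
            return 0;
    }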

    * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
    x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
    mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
    x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
    mm/core, x86/mm/pkeys: Add execute-only protection keys support
    x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
    x86/mm/pkeys: Allow kernel to modify user pkey rights register
    x86/fpu: Allow setting of XSAVE state
    x86/mm: Factor out LDT init from context init
    mm/core, x86/mm/pkeys: Add arch_validate_pkey()
    mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
    x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
    x86/mm/pkeys: Add Kconfig prompt to existing config option
    x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
    x86/mm/pkeys: Dump PKRU with other kernel registers
    mm/core, x86/mm/pkeys: Differentiate instruction fetches
    x86/mm/pkeys: Optimize fault handling in access_error()
    mm/core: Do not enforce PKEY permissions on remote mm access
    um, pkeys: Add UML arch_*_access_permitted() methods
    mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
    x86/mm/gup: Simplify get_user_pages() PTE bit handling
    ...

    Linus Torvalds
     

20 Mar, 2016

1 commit

  • Pull powerpc updates from Michael Ellerman:
    "This was delayed a day or two by some build-breakage on old toolchains
    which we've now fixed.

    There are two PCI commits, both acked by Bjorn.

    There's one commit to mm/hugepage.c which is (co)authored by Kirill.

    Highlights:
    - Restructure Linux PTE on Book3S/64 to Radix format from Paul
    Mackerras
    - Book3s 64 MMU cleanup in preparation for Radix MMU from Aneesh
    Kumar K.V
    - Add POWER9 cputable entry from Michael Neuling
    - FPU/Altivec/VSX save/restore optimisations from Cyril Bur
    - Add support for new ftrace ABI on ppc64le from Torsten Duwe

    Various cleanups & minor fixes from:
    - Adam Buchbinder, Andrew Donnellan, Balbir Singh, Christophe Leroy,
    Cyril Bur, Luis Henriques, Madhavan Srinivasan, Pan Xinhui, Russell
    Currey, Sukadev Bhattiprolu, Suraj Jitindar Singh.

    General:
    - atomics: Allow architectures to define their own __atomic_op_*
    helpers from Boqun Feng
    - Implement atomic{, 64}_*_return_* variants and acquire/release/
    relaxed variants for (cmp)xchg from Boqun Feng
    - Add powernv_defconfig from Jeremy Kerr
    - Fix BUG_ON() reporting in real mode from Balbir Singh
    - Add xmon command to dump OPAL msglog from Andrew Donnellan
    - Add xmon command to dump process/task similar to ps(1) from Douglas
    Miller
    - Clean up memory hotplug failure paths from David Gibson

    pci/eeh:
    - Redesign SR-IOV on PowerNV to give absolute isolation between VFs
    from Wei Yang.
    - EEH Support for SRIOV VFs from Wei Yang and Gavin Shan.
    - PCI/IOV: Rename and export virtfn_{add, remove} from Wei Yang
    - PCI: Add pcibios_bus_add_device() weak function from Wei Yang
    - MAINTAINERS: Update EEH details and maintainership from Russell
    Currey

    cxl:
    - Support added to the CXL driver for running on both bare-metal and
    hypervisor systems, from Christophe Lombard and Frederic Barrat.
    - Ignore probes for virtual afu pci devices from Vaibhav Jain

    perf:
    - Export Power8 generic and cache events to sysfs from Sukadev
    Bhattiprolu
    - hv-24x7: Fix usage with chip events, display change in counter
    values, display domain indices in sysfs, eliminate domain suffix in
    event names, from Sukadev Bhattiprolu

    Freescale:
    - Updates from Scott: "Highlights include 8xx optimizations, 32-bit
    checksum optimizations, 86xx consolidation, e5500/e6500 cpu
    hotplug, more fman and other dt bits, and minor fixes/cleanup"

    * tag 'powerpc-4.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (179 commits)
    powerpc: Fix unrecoverable SLB miss during restore_math()
    powerpc/8xx: Fix do_mtspr_cpu6() build on older compilers
    powerpc/rcpm: Fix build break when SMP=n
    powerpc/book3e-64: Use hardcoded mttmr opcode
    powerpc/fsl/dts: Add "jedec,spi-nor" flash compatible
    powerpc/T104xRDB: add tdm riser card node to device tree
    powerpc32: PAGE_EXEC required for inittext
    powerpc/mpc85xx: Add pcsphy nodes to FManV3 device tree
    powerpc/mpc85xx: Add MDIO bus muxing support to the board device tree(s)
    powerpc/86xx: Introduce and use common dtsi
    powerpc/86xx: Update device tree
    powerpc/86xx: Move dts files to fsl directory
    powerpc/86xx: Switch to kconfig fragments approach
    powerpc/86xx: Update defconfigs
    powerpc/86xx: Consolidate common platform code
    powerpc32: Remove one insn in mulhdu
    powerpc32: small optimisation in flush_icache_range()
    powerpc: Simplify test in __dma_sync()
    powerpc32: move xxxxx_dcache_range() functions inline
    powerpc32: Remove clear_pages() and define clear_page() inline
    ...

    Linus Torvalds
     

19 Mar, 2016

1 commit

  • Merge second patch-bomb from Andrew Morton:

    - a couple of hotfixes

    - the rest of MM

    - a new timer slack control in procfs

    - a couple of procfs fixes

    - a few misc things

    - some printk tweaks

    - lib/ updates, notably to radix-tree.

    - add my and Nick Piggin's old userspace radix-tree test harness to
    tools/testing/radix-tree/. Matthew said it was a godsend during the
    radix-tree work he did.

    - a few code-size improvements, switching to __always_inline where gcc
    screwed up.

    - partially implement character sets in sscanf

    * emailed patches from Andrew Morton : (118 commits)
    sscanf: implement basic character sets
    lib/bug.c: use common WARN helper
    param: convert some "on"/"off" users to strtobool
    lib: add "on"/"off" support to kstrtobool
    lib: update single-char callers of strtobool()
    lib: move strtobool() to kstrtobool()
    include/linux/unaligned: force inlining of byteswap operations
    include/uapi/linux/byteorder, swab: force inlining of some byteswap operations
    include/asm-generic/atomic-long.h: force inlining of some atomic_long operations
    usb: common: convert to use match_string() helper
    ide: hpt366: convert to use match_string() helper
    ata: hpt366: convert to use match_string() helper
    power: ab8500: convert to use match_string() helper
    power: charger_manager: convert to use match_string() helper
    drm/edid: convert to use match_string() helper
    pinctrl: convert to use match_string() helper
    device property: convert to use match_string() helper
    lib/string: introduce match_string() helper
    radix-tree tests: add test for radix_tree_iter_next
    radix-tree tests: add regression3 test
    ...

    Linus Torvalds
     

18 Mar, 2016

34 commits

  • Pull trivial tree updates from Jiri Kosina.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
    drivers/rtc: broken link fix
    drm/i915 Fix typos in i915_gem_fence.c
    Docs: fix missing word in REPORTING-BUGS
    lib+mm: fix few spelling mistakes
    MAINTAINERS: add git URL for APM driver
    treewide: Fix typo in printk

    Linus Torvalds
     
  • shmem likes to occasionally drop the lock, schedule, then reacquire
    the lock and continue with the iteration from the last place it left
    off. This is currently done with a pretty ugly goto. Introduce
    radix_tree_iter_next() and use it throughout shmem.c.
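
    A rough sketch of the resulting pattern (a hypothetical walker for
    illustration, not the actual shmem.c code):

    #include <linux/radix-tree.h>
    #include <linux/sched.h>

    static void walk_tree(struct radix_tree_root *root, unsigned long start)
    {
            struct radix_tree_iter iter;
            void **slot;

            rcu_read_lock();
            radix_tree_for_each_slot(slot, root, &iter, start) {
                    /* ... inspect radix_tree_deref_slot(slot) here ... */

                    if (need_resched()) {
                            cond_resched_rcu();  /* briefly drop the RCU lock */
                            /* resume the walk at the next index instead of
                             * jumping back to a restart label */
                            slot = radix_tree_iter_next(&iter);
                    }
            }
            rcu_read_unlock();
    }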

    [koct9i@gmail.com: fix bug in radix_tree_iter_next() for tagged iteration]
    Signed-off-by: Matthew Wilcox
    Cc: Hugh Dickins
    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Instead of a 'goto restart', we can now use radix_tree_iter_retry() to
    restart from our current position. This will make a difference when
    there are more ways to happen across an indirect pointer. And it
    eliminates some confusing gotos.

    [vbabka@suse.cz: remove now-obsolete-and-misleading comment]
    Signed-off-by: Matthew Wilcox
    Cc: Hugh Dickins
    Cc: Konstantin Khlebnikov
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • With huge pages, it is convenient to have the radix tree be able to
    return an entry that covers multiple indices. Previous attempts to deal
    with the problem have involved inserting N duplicate entries, which is a
    waste of memory and leads to problems trying to handle aliased tags, or
    probing the tree multiple times to find alternative entries which might
    cover the requested index.

    This approach inserts one canonical entry into the tree for a given
    range of indices, and may also insert other entries in order to ensure
    that lookups find the canonical entry.

    This solution only tolerates inserting powers of two that are greater
    than the fanout of the tree. If we wish to expand the radix tree's
    abilities to support large-ish pages that are smaller than the fanout at the
    penultimate level of the tree, then we would need to add one more step
    in lookup to ensure that any sibling nodes in the final level of the
    tree are dereferenced and we return the canonical entry that they
    reference.

    Signed-off-by: Matthew Wilcox
    Cc: Johannes Weiner
    Cc: Matthew Wilcox
    Cc: "Kirill A. Shutemov"
    Cc: Ross Zwisler
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • There are various email addresses for me throughout the kernel. Use the
    one that will always be valid.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • After the OOM killer is disabled during suspend operation, any
    !__GFP_NOFAIL && __GFP_FS allocations are forced to fail. Thus, any
    !__GFP_NOFAIL && !__GFP_FS allocations should be forced to fail as well.

    Signed-off-by: Tetsuo Handa
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
    While oom_killer_disable() is called by freeze_processes() after all
    user threads except the current thread are frozen, it is possible that
    kernel threads invoke the OOM killer and send SIGKILL to the current
    thread due to sharing the thawed victim's memory. Therefore, checking
    for SIGKILL is preferable to checking TIF_MEMDIE.

    Signed-off-by: Tetsuo Handa
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Add a new column to pool stats, which will tell how many pages ideally
    can be freed by class compaction, so it will be easier to analyze
    zsmalloc fragmentation.

    At the moment, we have only numbers of FULL and ALMOST_EMPTY classes,
    but they don't tell us how badly the class is fragmented internally.

    The new /sys/kernel/debug/zsmalloc/zramX/classes output looks as follows:

    class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable
    [..]
    12 224 0 2 146 5 8 4 4
    13 240 0 0 0 0 0 1 0
    14 256 1 13 1840 1672 115 1 10
    15 272 0 0 0 0 0 1 0
    [..]
    49 816 0 3 745 735 149 1 2
    51 848 3 4 361 306 76 4 8
    52 864 12 14 378 268 81 3 21
    54 896 1 12 117 57 26 2 12
    57 944 0 0 0 0 0 3 0
    [..]
    Total 26 131 12709 10994 1071 134

    For example, from this particular output we can easily conclude that
    class-896 is heavily fragmented -- it occupies 26 pages, 12 of which
    can be freed by compaction.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • When unmapping a huge class page in zs_unmap_object, the page will be
    unmapped by kmap_atomic. The "!area->huge" branch in __zs_unmap_object
    is always true, and no code sets "area->huge" now, so we can drop it.

    Signed-off-by: YiPing Xu
    Reviewed-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    YiPing Xu
     
  • We have PAGE_ALIGNED() in mm.h, so let's use it instead of
    IS_ALIGNED() for checking the PAGE_SIZE-aligned case.
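
    For illustration only (a trivial sketch; the two checks are
    equivalent, PAGE_ALIGNED() is simply the more readable spelling):

    #include <linux/mm.h>

    static bool buf_is_page_aligned(const void *buf)
    {
            /* Equivalent to IS_ALIGNED((unsigned long)buf, PAGE_SIZE). */
            return PAGE_ALIGNED(buf);
    }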

    Signed-off-by: Shawn Lin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shawn Lin
     
  • mem_cgroup_print_oom_info is always called under oom_lock, so
    oom_info_lock is redundant.

    Signed-off-by: Vladimir Davydov
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • uncharge_list() does an unusual list walk because the function can take
    regular lists with dedicated list_heads as well as singleton lists where
    a single page is passed via the page->lru list node.

    This can sometimes lead to confusion as well as suggestions to replace
    the loop with a list_for_each_entry(), which wouldn't work.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Setting the original memory.limit_in_bytes hardlimit is subject to a
    race condition when the desired value is below the current usage. The
    code tries a few times to first reclaim and then see if the usage has
    dropped to where we would like it to be, but there is no locking, and
    the workload is free to continue making new charges up to the old limit.
    Thus, attempting to shrink a workload relies on pure luck and hope that
    the workload happens to cooperate.

    To fix this in the cgroup2 memory.max knob, do it the other way round:
    set the limit first, then try enforcement. And if reclaim is not able
    to succeed, trigger OOM kills in the group. Keep going until the new
    limit is met, we run out of OOM victims and there's only unreclaimable
    memory left, or the task writing to memory.max is killed. This allows
    users to shrink groups reliably, and the behavior is consistent with
    what happens when new charges are attempted in excess of memory.max.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When setting memory.high below usage, nothing happens until the next
    charge comes along, and then it will only reclaim its own charge and not
    the now potentially huge excess of the new memory.high. This can cause
    groups to stay in excess of their memory.high indefinitely.

    To fix that, when shrinking memory.high, kick off a reclaim cycle that
    goes after the delta.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    Upstream has supported parallel page initialisation for x86 and the
    boot time is improved greatly. Some tests have been done for Power.

    Here are the results I obtained with different memory sizes.

    * 4GB memory:
    boot time is as follows:
    with patch vs without patch: 10.4s vs 24.5s
    boot time is improved by 57%
    * 200GB memory:
    boot time looks the same with and without patches.
    boot time is about 38s
    * 32TB memory:
    boot time looks the same with and without patches
    boot time is about 160s.
    The boot time is much shorter than x86 with 24TB memory.
    From community discussion, it takes about 694s for an x86 24TB system.

    Parallel initialisation improves performance by deferring memory
    initialisation to kswapd with N kthreads; theoretically it should
    improve performance.

    In testing on x86, performance is improved greatly with huge memory.
    But on the Power platform, it is improved greatly only with less than
    100GB memory. For huge memory, it is not improved greatly, but it
    saves time with several threads at least, as the following information
    shows (32TB system log):

    [ 22.648169] node 9 initialised, 16607461 pages in 280ms
    [ 22.783772] node 3 initialised, 23937243 pages in 410ms
    [ 22.858877] node 6 initialised, 29179347 pages in 490ms
    [ 22.863252] node 2 initialised, 29179347 pages in 490ms
    [ 22.907545] node 0 initialised, 32049614 pages in 540ms
    [ 22.920891] node 15 initialised, 32212280 pages in 550ms
    [ 22.923236] node 4 initialised, 32306127 pages in 550ms
    [ 22.923384] node 12 initialised, 32314319 pages in 550ms
    [ 22.924754] node 8 initialised, 32314319 pages in 550ms
    [ 22.940780] node 13 initialised, 33353677 pages in 570ms
    [ 22.940796] node 11 initialised, 33353677 pages in 570ms
    [ 22.941700] node 5 initialised, 33353677 pages in 570ms
    [ 22.941721] node 10 initialised, 33353677 pages in 570ms
    [ 22.941876] node 7 initialised, 33353677 pages in 570ms
    [ 22.944946] node 14 initialised, 33353677 pages in 570ms
    [ 22.946063] node 1 initialised, 33345485 pages in 580ms

    It saves at least about 550*16 ms, although that is negligible
    compared to the total boot time of about 160 seconds. What's more, for
    huge memory machines the boot time is much shorter on Power than on
    x86 even without these patches.

    So this patchset is still worth enabling for Power.

    This patch (of 2):

    This patch is based on Mel Gorman's old patch from the mailing list,
    https://lkml.org/lkml/2015/5/5/280, which was discussed; here it is
    fixed with a completion to wait for all memory to be initialised in
    page_alloc_init_late(). That fixes the OOM problem on x86 with 24TB
    memory, which allocates memory in late initialisation. But for the
    Power platform with 32TB memory, it causes a call trace in
    vfs_caches_init->inode_init(), because the inode hash table needs more
    memory. So this patch allocates 1GB per 0.25TB/node for large systems,
    as mentioned in https://lkml.org/lkml/2015/5/1/627

    This call trace is found on Power with 32TB memory, 1024 CPUs and 16
    nodes. Currently, only 2GB*16=32GB is allocated for early
    initialisation, but the dentry cache hash table needs 16GB and the
    inode cache hash table needs another 16GB, so the system does not have
    enough memory for them. The dmesg log is as follows:

    Dentry cache hash table entries: 2147483648 (order: 18,17179869184 bytes)
    vmalloc: allocation failure, allocated 16021913600 of 17179934720 bytes
    swapper/0: page allocation failure: order:0,mode:0x2080020
    CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.0-0-ppc64
    Call Trace:
    .dump_stack+0xb4/0xb664 (unreliable)
    .warn_alloc_failed+0x114/0x160
    .__vmalloc_area_node+0x1a4/0x2b0
    .__vmalloc_node_range+0xe4/0x110
    .__vmalloc_node+0x40/0x50
    .alloc_large_system_hash+0x134/0x2a4
    .inode_init+0xa4/0xf0
    .vfs_caches_init+0x80/0x144
    .start_kernel+0x40c/0x4e0
    start_here_common+0x20/0x4a4

    Signed-off-by: Li Zhang
    Acked-by: Mel Gorman
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zhang
     
  • split_huge_pmd() tries to munlock the page with munlock_vma_page().
    That requires the page to be locked.

    If the page is already locked by the caller, we would get a deadlock:

    Unable to find swap-space signature
    INFO: task trinity-c85:1907 blocked for more than 120 seconds.
    Not tainted 4.4.0-00032-gf19d0bdced41-dirty #1606
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    trinity-c85 D ffff88084d997608 0 1907 309 0x00000000
    Call Trace:
    schedule+0x9f/0x1c0
    schedule_timeout+0x48e/0x600
    io_schedule_timeout+0x1c3/0x390
    bit_wait_io+0x29/0xd0
    __wait_on_bit_lock+0x94/0x140
    __lock_page+0x1d4/0x280
    __split_huge_pmd+0x5a8/0x10f0
    split_huge_pmd_address+0x1d9/0x230
    try_to_unmap_one+0x540/0xc70
    rmap_walk_anon+0x284/0x810
    rmap_walk_locked+0x11e/0x190
    try_to_unmap+0x1b1/0x4b0
    split_huge_page_to_list+0x49d/0x18a0
    follow_page_mask+0xa36/0xea0
    SyS_move_pages+0xaf3/0x1570
    entry_SYSCALL_64_fastpath+0x12/0x6b
    2 locks held by trinity-c85/1907:
    #0: (&mm->mmap_sem){++++++}, at: SyS_move_pages+0x933/0x1570
    #1: (&anon_vma->rwsem){++++..}, at: split_huge_page_to_list+0x402/0x18a0

    I don't think the deadlock is triggerable without the
    split_huge_page() simplification patchset.

    But munlock_vma_page() is wrong here: we want to munlock the page
    unconditionally, with no need for the rmap lookup that
    munlock_vma_page() does.

    Let's use clear_page_mlock() instead. It can be called under ptl.

    Fixes: e90309c9f772 ("thp: allow mlocked THP again")
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The freeze_page() and unfreeze_page() helpers have evolved into rather
    complex beasts. It would be nice to cut the complexity of this code.

    This patch rewrites freeze_page() using standard try_to_unmap().
    unfreeze_page() is rewritten with remove_migration_ptes().

    The result is much simpler.

    But the new variant is somewhat slower for PTE-mapped THPs. The
    current helpers iterate over the VMAs the compound page is mapped
    into, and then over the ptes within each VMA. The new helpers iterate
    over small pages, then over the VMAs each small page is mapped into,
    and only then find the relevant pte.

    We have a shortcut for PMD-mapped THP: we directly install migration
    entries on PMD split.

    I don't think the slowdown is critical, considering how much simpler
    the result is and that split_huge_page() is quite rare nowadays. It
    only happens due to memory pressure or migration.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Make remove_migration_ptes() available to be used in split_huge_page().

    A new parameter 'locked' is added: as with try_to_unmap() we need a
    way to indicate that the caller holds the rmap lock.

    We also shouldn't try to mlock() pte-mapped huge pages: pte-mapped THP
    pages are never mlocked.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Add support for two ttu_flags:

    - TTU_SPLIT_HUGE_PMD splits the PMD if it's there, before trying to
    unmap the page;

    - TTU_RMAP_LOCKED indicates that the caller holds the relevant rmap
    lock.

    Also, change rwc->done to !page_mapcount() instead of !page_mapped().
    try_to_unmap() works at the pte level, so we are really interested in
    the mappedness of this small page rather than of the compound page
    it's a part of.
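
    A hypothetical caller sketch illustrating the two new flags (not the
    actual freeze_page() code from the later patch):

    #include <linux/rmap.h>

    static int unmap_thp_locked(struct page *page)
    {
            /* Split any PMD mapping first, and tell the rmap walk that
             * the caller already holds the relevant rmap lock. */
            return try_to_unmap(page, TTU_MIGRATION | TTU_SPLIT_HUGE_PMD |
                                      TTU_RMAP_LOCKED);
    }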

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • This patchset rewrites freeze_page() and unfreeze_page() using
    try_to_unmap() and remove_migration_ptes(). Result is much simpler, but
    somewhat slower.

    Migration 8GiB worth of PMD-mapped THP:

    Baseline 20.21 +/- 0.393
    Patched 20.73 +/- 0.082
    Slowdown 1.03x

    It's 3% slower, compared to 14% in v1. I don't think it should be a stopper.

    Splitting of PTE-mapped pages slowed more. But this is not a common
    case.

    Migration 8GiB worth of PTE-mapped THP:

    Baseline 20.39 +/- 0.225
    Patched 22.43 +/- 0.496
    Slowdown 1.10x

    rmap_walk_locked() is the same as rmap_walk(), but the caller takes care
    of the relevant rmap lock.

    This is preparation for switching THP splitting from custom rmap walk in
    freeze_page()/unfreeze_page() to the generic one.

    There is no support for KSM pages for now: not clear which lock is
    implied.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The primary use case for devm_memremap_pages() is to allocate a memmap
    array from persistent memory. That capability requires vmem_altmap,
    which requires SPARSEMEM_VMEMMAP.

    Also, without SPARSEMEM_VMEMMAP the addition of ZONE_DEVICE expands
    ZONES_WIDTH and triggers the:

    "Unfortunate NUMA and NUMA Balancing config, growing page-frame for
    last_cpupid."

    ...warning in mm/memory.c. SPARSEMEM_VMEMMAP=n && ZONE_DEVICE=y is not
    a configuration we should worry about supporting.

    Signed-off-by: Dan Williams
    Reported-by: Vlastimil Babka
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Use the normal mechanism to make the logging output consistently
    "percpu:" instead of a mix of "PERCPU:" and "percpu:"

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Most of the mm subsystem uses pr_<level> so make it consistent.

    Miscellanea:

    - Realign arguments
    - Add missing newline to format
    - kmemleak-test.c has a "kmemleak: " prefix added to the
    "Kmemleak testing" logging message via pr_fmt

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Kernel style prefers a single string over split strings when the string is
    'user-visible'.

    Miscellanea:

    - Add a missing newline
    - Realign arguments

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • There are a mixture of pr_warning and pr_warn uses in mm. Use pr_warn
    consistently.

    Miscellanea:

    - Coalesce formats
    - Realign arguments

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • ZONE_DEVICE (merged in 4.3) and ZONE_CMA (proposed) are examples of new
    mm zones that are bumping up against the current maximum limit of 4
    zones, i.e. 2 bits in page->flags for the GFP_ZONE_TABLE.

    The GFP_ZONE_TABLE poses an interesting constraint since
    include/linux/gfp.h gets included by the 32-bit portion of a 64-bit
    build. We need to be careful to only build the table for zones that
    have a corresponding gfp_t flag. GFP_ZONES_SHIFT is introduced for this
    purpose. This patch does not attempt to solve the problem of adding a
    new zone that also has a corresponding GFP_ flag.

    Vlastimil points out that ZONE_DEVICE, by depending on x86_64 and
    SPARSEMEM_VMEMMAP implies that SECTIONS_WIDTH is zero. In other words
    even though ZONE_DEVICE does not fit in GFP_ZONE_TABLE it is free to
    consume another bit in page->flags (expand ZONES_WIDTH) with room to
    spare.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=110931
    Fixes: 033fbae988fc ("mm: ZONE_DEVICE for "device memory"")
    Signed-off-by: Dan Williams
    Reported-by: Mark
    Reported-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Sudip Mukherjee
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • - Do not take memcg_limit_mutex for resetting limits - the cgroup cannot
    be altered from userspace anymore, so no need to protect them.

    - Use plain page_counter_limit() for resetting ->memory and ->memsw
    limits instead of the mem_cgroup_resize_* helpers - we enlarge the
    limits, so there is no need for special handling.

    - Reset ->swap and ->tcpmem limits as well.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • online_pages() simply returns an error value if
    memory_notify(MEM_GOING_ONLINE, &arg) returns a value that is not what
    we want for successfully onlining the target pages. This patch aims to
    print more failure information in online_pages(), like offline_pages()
    does.

    This patch also converts printk(KERN_<level>) to pr_<level>(), and
    moves __offline_pages() to not print failure information with
    KERN_INFO, following David Rientjes's suggestion [1].

    [1] https://lkml.org/lkml/2016/2/24/1094

    Signed-off-by: Chen Yucong
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Yucong
     
  • Commit 647757197cd3 ("mm: clarify __GFP_NOFAIL deprecation status") was
    incomplete and didn't remove the comment about __GFP_NOFAIL being
    deprecated in buffered_rmqueue.

    Let's get rid of this leftover but keep the WARN_ON_ONCE for order > 1,
    because we should really discourage using __GFP_NOFAIL with higher
    order allocations as those are just too subtle.

    Signed-off-by: Michal Hocko
    Reviewed-by: Nikolay Borisov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • CMA allocation should be guaranteed to succeed by definition, but,
    unfortunately, it sometimes fails. It is hard to track down the
    problem, because it is related to page reference manipulation and we
    don't have any facility to analyze it.

    This patch adds tracepoints to track down page reference manipulation.
    With them, we can find the exact reason for a failure and fix the
    problem. The following is an example of the tracepoint output. (Note:
    this example is from a stale version that prints flags as a number;
    the current version prints them as a human-readable string.)

    -9018 [004] 92.678375: page_ref_set: pfn=0x17ac9 flags=0x0 count=1 mapcount=0 mapping=(nil) mt=4 val=1
    -9018 [004] 92.678378: kernel_stack:
    => get_page_from_freelist (ffffffff81176659)
    => __alloc_pages_nodemask (ffffffff81176d22)
    => alloc_pages_vma (ffffffff811bf675)
    => handle_mm_fault (ffffffff8119e693)
    => __do_page_fault (ffffffff810631ea)
    => trace_do_page_fault (ffffffff81063543)
    => do_async_page_fault (ffffffff8105c40a)
    => async_page_fault (ffffffff817581d8)
    [snip]
    -9018 [004] 92.678379: page_ref_mod: pfn=0x17ac9 flags=0x40048 count=2 mapcount=1 mapping=0xffff880015a78dc1 mt=4 val=1
    [snip]
    ...
    ...
    -9131 [001] 93.174468: test_pages_isolated: start_pfn=0x17800 end_pfn=0x17c00 fin_pfn=0x17ac9 ret=fail
    [snip]
    -9018 [004] 93.174843: page_ref_mod_and_test: pfn=0x17ac9 flags=0x40068 count=0 mapcount=0 mapping=0xffff880015a78dc1 mt=4 val=-1 ret=1
    => release_pages (ffffffff8117c9e4)
    => free_pages_and_swap_cache (ffffffff811b0697)
    => tlb_flush_mmu_free (ffffffff81199616)
    => tlb_finish_mmu (ffffffff8119a62c)
    => exit_mmap (ffffffff811a53f7)
    => mmput (ffffffff81073f47)
    => do_exit (ffffffff810794e9)
    => do_group_exit (ffffffff81079def)
    => SyS_exit_group (ffffffff81079e74)
    => entry_SYSCALL_64_fastpath (ffffffff817560b6)

    This output shows that the problem comes from the exit path. In the
    exit path, to improve performance, pages are not freed immediately;
    they are gathered and processed in batches. During this process,
    migration is not possible and the CMA allocation fails. This problem
    is hard to find without this page reference tracepoint facility.

    Enabling this feature bloats the kernel text by 30 KB in my
    configuration.

    text data bss dec hex filename
    12127327 2243616 1507328 15878271 f2487f vmlinux_disabled
    12157208 2258880 1507328 15923416 f2f8d8 vmlinux_enabled

    Note that, due to header file dependency problem between mm.h and
    tracepoint.h, this feature has to open code the static key functions for
    tracepoints. Proposed by Steven Rostedt in following link.

    https://lkml.org/lkml/2015/12/9/699

    [arnd@arndb.de: crypto/async_pq: use __free_page() instead of put_page()]
    [iamjoonsoo.kim@lge.com: fix build failure for xtensa]
    [akpm@linux-foundation.org: tweak Kconfig text, per Vlastimil]
    Signed-off-by: Joonsoo Kim
    Acked-by: Michal Nazarewicz
    Acked-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Sergey Senozhatsky
    Acked-by: Steven Rostedt
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • The success of CMA allocation largely depends on the success of
    migration, and the key factor there is the page reference count.
    Until now, page references have been manipulated by directly calling
    atomic functions, so we cannot track who manipulates them and where.
    That makes it hard to find the actual reason for a CMA allocation
    failure. CMA allocation should be guaranteed to succeed, so finding
    the offending place is really important.

    In this patch, call sites where the page reference is manipulated are
    converted to the newly introduced wrapper functions. This is a
    preparation step for adding a tracepoint to each page reference
    manipulation function. With this facility, we can easily find the
    reason for a CMA allocation failure. There is no functional change in
    this patch.

    In addition, this patch also converts reference read sites. It will
    help a second step that renames page._count to something else and
    prevents later attempts to access it directly (suggested by Andrew).
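
    An illustrative before/after of the conversion described above (a
    sketch; the real wrappers live in include/linux/page_ref.h):

    #include <linux/page_ref.h>

    static void take_ref(struct page *page)
    {
            /* before: atomic_inc(&page->_count); */
            page_ref_inc(page);     /* wrapper: a tracepoint can hook here */
    }

    static int ref_count(struct page *page)
    {
            /* before: atomic_read(&page->_count); */
            return page_ref_count(page);
    }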

    Signed-off-by: Joonsoo Kim
    Acked-by: Michal Nazarewicz
    Acked-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Sergey Senozhatsky
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • THP defrag is enabled by default to direct reclaim/compact but not wake
    kswapd in the event of a THP allocation failure. The problem is that
    THP allocation requests potentially enter reclaim/compaction. This
    potentially incurs a severe stall that is not guaranteed to be offset by
    reduced TLB misses. While there has been considerable effort to reduce
    the impact of reclaim/compaction, it is still a high cost and workloads
    that should fit in memory fail to do so. Specifically, a simple
    anon/file streaming workload will enter direct reclaim on NUMA at least
    even though the working set size is 80% of RAM. It's been years and
    it's time to throw in the towel.

    First, this patch defines THP defrag as follows:

    madvise: A failed allocation will direct reclaim/compact if the application requests it
    never: Neither reclaim/compact nor wake kswapd
    defer: A failed allocation will wake kswapd/kcompactd
    always: A failed allocation will direct reclaim/compact (historical behaviour)
    khugepaged defrag will enter direct reclaim/compaction but not wake kswapd.

    Next it sets the default defrag option to be "madvise" to only enter
    direct reclaim/compaction for applications that specifically requested
    it.
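
    (Applications make that request with the standard madvise(2) hint; a
    minimal user-space sketch for illustration, not part of the patch:)

    #include <stdlib.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t len = 64UL << 20;        /* 64 MiB */
            void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (buf == MAP_FAILED)
                    return 1;
            /* Hint: back this range with THP; under the new "madvise"
             * default, a failed huge page allocation for this range may
             * still direct reclaim/compact. */
            madvise(buf, len, MADV_HUGEPAGE);
            /* ... touch memory ... */
            return 0;
    }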

    Lastly, it removes a check from the page allocator slowpath that is
    related to __GFP_THISNODE to allow "defer" to work. The callers that
    really care are slub/slab and they are updated accordingly. The slab
    one may be surprising because it also corrects a comment as kswapd was
    never woken up by that path.

    This means that by default a THP fault will no longer stall for most
    applications, which is the ideal for most users: they get THP if it is
    immediately available. There are still options for users that prefer a
    stall at startup of a new application, by either restoring historical
    behaviour with "always" or picking a half-way point with "defer",
    where kswapd does some of the work in the background and wakes
    kcompactd if necessary. THP defrag for khugepaged remains enabled and
    will enter direct reclaim/compaction but not wake kswapd or kcompactd.

    After this patch a THP allocation failure will quickly fall back and
    rely on khugepaged to recover the situation at some time in the
    future. In some cases this will reduce THP usage, but the benefit of
    THP is hard to measure and not a universal win, whereas a stall for
    reclaim/compaction is definitely measurable and can be painful.

    The first test for this is using "usemem" to read a large file and write
    a large anonymous mapping (to avoid the zero page) multiple times. The
    total size of the mappings is 80% of RAM and the benchmark simply
    measures how long it takes to complete. It uses multiple threads to see
    if that is a factor. On UMA, the performance is almost identical so is
    not reported but on NUMA, we see this

    usemem
    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Amean System-1 102.86 ( 0.00%) 46.81 ( 54.50%)
    Amean System-4 37.85 ( 0.00%) 34.02 ( 10.12%)
    Amean System-7 48.12 ( 0.00%) 46.89 ( 2.56%)
    Amean System-12 51.98 ( 0.00%) 56.96 ( -9.57%)
    Amean System-21 80.16 ( 0.00%) 79.05 ( 1.39%)
    Amean System-30 110.71 ( 0.00%) 107.17 ( 3.20%)
    Amean System-48 127.98 ( 0.00%) 124.83 ( 2.46%)
    Amean Elapsd-1 185.84 ( 0.00%) 105.51 ( 43.23%)
    Amean Elapsd-4 26.19 ( 0.00%) 25.58 ( 2.33%)
    Amean Elapsd-7 21.65 ( 0.00%) 21.62 ( 0.16%)
    Amean Elapsd-12 18.58 ( 0.00%) 17.94 ( 3.43%)
    Amean Elapsd-21 17.53 ( 0.00%) 16.60 ( 5.33%)
    Amean Elapsd-30 17.45 ( 0.00%) 17.13 ( 1.84%)
    Amean Elapsd-48 15.40 ( 0.00%) 15.27 ( 0.82%)

    For a single thread, the benchmark completes 43.23% faster with this
    patch applied, with smaller benefits as the thread count increases.
    Similarly, notice the large reduction in system CPU usage in most
    cases. The overall CPU time is

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    User 10357.65 10438.33
    System 3988.88 3543.94
    Elapsed 2203.01 1634.41

    Which is substantial. Now, the reclaim figures

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Minor Faults 128458477 278352931
    Major Faults 2174976 225
    Swap Ins 16904701 0
    Swap Outs 17359627 0
    Allocation stalls 43611 0
    DMA allocs 0 0
    DMA32 allocs 19832646 19448017
    Normal allocs 614488453 580941839
    Movable allocs 0 0
    Direct pages scanned 24163800 0
    Kswapd pages scanned 0 0
    Kswapd pages reclaimed 0 0
    Direct pages reclaimed 20691346 0
    Compaction stalls 42263 0
    Compaction success 938 0
    Compaction failures 41325 0

    This patch eliminates almost all swapping and direct reclaim activity.
    There is still overhead but it's from NUMA balancing which does not
    identify that it's pointless trying to do anything with this workload.

    I also tried the thpscale benchmark which forces a corner case where
    compaction can be used heavily and measures the latency of whether base
    or huge pages were used

    thpscale Fault Latencies
    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Amean fault-base-1 5288.84 ( 0.00%) 2817.12 ( 46.73%)
    Amean fault-base-3 6365.53 ( 0.00%) 3499.11 ( 45.03%)
    Amean fault-base-5 6526.19 ( 0.00%) 4363.06 ( 33.15%)
    Amean fault-base-7 7142.25 ( 0.00%) 4858.08 ( 31.98%)
    Amean fault-base-12 13827.64 ( 0.00%) 10292.11 ( 25.57%)
    Amean fault-base-18 18235.07 ( 0.00%) 13788.84 ( 24.38%)
    Amean fault-base-24 21597.80 ( 0.00%) 24388.03 (-12.92%)
    Amean fault-base-30 26754.15 ( 0.00%) 19700.55 ( 26.36%)
    Amean fault-base-32 26784.94 ( 0.00%) 19513.57 ( 27.15%)
    Amean fault-huge-1 4223.96 ( 0.00%) 2178.57 ( 48.42%)
    Amean fault-huge-3 2194.77 ( 0.00%) 2149.74 ( 2.05%)
    Amean fault-huge-5 2569.60 ( 0.00%) 2346.95 ( 8.66%)
    Amean fault-huge-7 3612.69 ( 0.00%) 2997.70 ( 17.02%)
    Amean fault-huge-12 3301.75 ( 0.00%) 6727.02 (-103.74%)
    Amean fault-huge-18 6696.47 ( 0.00%) 6685.72 ( 0.16%)
    Amean fault-huge-24 8000.72 ( 0.00%) 9311.43 (-16.38%)
    Amean fault-huge-30 13305.55 ( 0.00%) 9750.45 ( 26.72%)
    Amean fault-huge-32 9981.71 ( 0.00%) 10316.06 ( -3.35%)

    The average time to fault pages is substantially reduced in the majority
    of cases but with the obvious caveat that fewer THPs are actually used
    in this adverse workload

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Percentage huge-1 0.71 ( 0.00%) 14.04 (1865.22%)
    Percentage huge-3 10.77 ( 0.00%) 33.05 (206.85%)
    Percentage huge-5 60.39 ( 0.00%) 38.51 (-36.23%)
    Percentage huge-7 45.97 ( 0.00%) 34.57 (-24.79%)
    Percentage huge-12 68.12 ( 0.00%) 40.07 (-41.17%)
    Percentage huge-18 64.93 ( 0.00%) 47.82 (-26.35%)
    Percentage huge-24 62.69 ( 0.00%) 44.23 (-29.44%)
    Percentage huge-30 43.49 ( 0.00%) 55.38 ( 27.34%)
    Percentage huge-32 50.72 ( 0.00%) 51.90 ( 2.35%)

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Minor Faults 37429143 47564000
    Major Faults 1916 1558
    Swap Ins 1466 1079
    Swap Outs 2936863 149626
    Allocation stalls 62510 3
    DMA allocs 0 0
    DMA32 allocs 6566458 6401314
    Normal allocs 216361697 216538171
    Movable allocs 0 0
    Direct pages scanned 25977580 17998
    Kswapd pages scanned 0 3638931
    Kswapd pages reclaimed 0 207236
    Direct pages reclaimed 8833714 88
    Compaction stalls 103349 5
    Compaction success 270 4
    Compaction failures 103079 1

    Note again that while this does swap as it's an aggressive workload,
    the direct reclaim activity and allocation stalls are substantially
    reduced.
    There is some kswapd activity but ftrace showed that the kswapd activity
    was due to normal wakeups from 4K pages being allocated.
    Compaction-related stalls and activity are almost eliminated.

    I also tried the stutter benchmark. For this, I do not have figures for
    NUMA but it's something that does impact UMA so I'll report what is
    available

    stutter
    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Min mmap 7.3571 ( 0.00%) 7.3438 ( 0.18%)
    1st-qrtle mmap 7.5278 ( 0.00%) 17.9200 (-138.05%)
    2nd-qrtle mmap 7.6818 ( 0.00%) 21.6055 (-181.25%)
    3rd-qrtle mmap 11.0889 ( 0.00%) 21.8881 (-97.39%)
    Max-90% mmap 27.8978 ( 0.00%) 22.1632 ( 20.56%)
    Max-93% mmap 28.3202 ( 0.00%) 22.3044 ( 21.24%)
    Max-95% mmap 28.5600 ( 0.00%) 22.4580 ( 21.37%)
    Max-99% mmap 29.6032 ( 0.00%) 25.5216 ( 13.79%)
    Max mmap 4109.7289 ( 0.00%) 4813.9832 (-17.14%)
    Mean mmap 12.4474 ( 0.00%) 19.3027 (-55.07%)

    This benchmark is trying to fault an anonymous mapping while there is a
    heavy IO load -- a scenario that desktop users used to complain about
    frequently. This shows a mix because the ideal case of mapping with THP
    is not hit as often. However, note that 99% of the mappings complete
    13.79% faster. The CPU usage here is particularly interesting

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    User 67.50 0.99
    System 1327.88 91.30
    Elapsed 2079.00 2128.98

    And once again we look at the reclaim figures

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Minor Faults 335241922 1314582827
    Major Faults 715 819
    Swap Ins 0 0
    Swap Outs 0 0
    Allocation stalls 532723 0
    DMA allocs 0 0
    DMA32 allocs 1822364341 1177950222
    Normal allocs 1815640808 1517844854
    Movable allocs 0 0
    Direct pages scanned 21892772 0
    Kswapd pages scanned 20015890 41879484
    Kswapd pages reclaimed 19961986 41822072
    Direct pages reclaimed 21892741 0
    Compaction stalls 1065755 0
    Compaction success 514 0
    Compaction failures 1065241 0

    Allocation stalls and all direct reclaim activity are eliminated, as
    are compaction-related stalls.

    THP gives impressive gains in some cases, but only if THPs are quickly
    available. We're not going to reach the point where they are
    completely free, so let's finally take the costs out of the fast paths
    and defer the cost to kswapd, kcompactd and khugepaged, where it
    belongs.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • If an oom killed thread calls mempool_alloc(), it is possible that it'll
    loop forever if there are no elements on the freelist since
    __GFP_NOMEMALLOC prevents it from accessing needed memory reserves in
    oom conditions.

    Only set __GFP_NOMEMALLOC if there are elements on the freelist. If
    there are no free elements, allow allocations without the bit set so
    that memory reserves can be accessed if needed.

    Additionally, using mempool_alloc() with __GFP_NOMEMALLOC is not
    supported since the implementation can loop forever without accessing
    memory reserves when needed.
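
    A sketch of the allocation-mask logic described above, for
    illustration only (not the exact mempool.c change):

    #include <linux/gfp.h>
    #include <linux/mempool.h>

    static void *alloc_element(mempool_t *pool, gfp_t gfp_mask)
    {
            gfp_t gfp_temp = gfp_mask;

            if (likely(pool->curr_nr))
                    /* Free elements exist: stay away from the emergency
                     * reserves, we can always fall back to the freelist. */
                    gfp_temp |= __GFP_NOMEMALLOC;
            /* else: freelist is empty, leave __GFP_NOMEMALLOC clear so
             * the allocation may dip into memory reserves if needed. */

            return pool->alloc(gfp_temp, pool->pool_data);
    }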

    Signed-off-by: David Rientjes
    Cc: Greg Thelen
    Cc: Michal Hocko
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • In machines with 140G of memory and enterprise flash storage, we have
    seen read and write bursts routinely exceed the kswapd watermarks and
    cause thundering herds in direct reclaim. Unfortunately, the only way
    to tune kswapd aggressiveness is through adjusting min_free_kbytes - the
    system's emergency reserves - which is entirely unrelated to the
    system's latency requirements. In order to get kswapd to maintain a
    250M buffer of free memory, the emergency reserves need to be set to 1G.
    That is a lot of memory wasted for no good reason.

    On the other hand, it's reasonable to assume that allocation bursts and
    overall allocation concurrency scale with memory capacity, so it makes
    sense to make kswapd aggressiveness a function of that as well.

    Change the kswapd watermark scale factor from the currently fixed 25% of
    the tunable emergency reserve to a tunable 0.1% of memory.

    Beyond 1G of memory, this will produce bigger watermark steps than the
    current formula in default settings. Ensure that the new formula never
    chooses steps smaller than that, i.e. 25% of the emergency reserve.

    On a 140G machine, this raises the default watermark steps - the
    distance between min and low, and low and high - from 16M to 143M.
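
    A sketch of the new step calculation, for illustration only (based on
    the description above, not copied from the kernel's watermark setup;
    the watermark_scale_factor tunable is in units of 1/10000 and defaults
    to 10, i.e. 0.1% of memory):

    #include <linux/mm.h>

    static void set_zone_watermarks(struct zone *zone, unsigned long min_wmark,
                                    int watermark_scale_factor)
    {
            /* Step is the larger of 25% of the emergency reserve share
             * and watermark_scale_factor/10000 of the zone's memory. */
            unsigned long step = max_t(u64, min_wmark >> 2,
                                       mult_frac(zone->managed_pages,
                                                 watermark_scale_factor, 10000));

            zone->watermark[WMARK_MIN]  = min_wmark;
            zone->watermark[WMARK_LOW]  = min_wmark + step;
            zone->watermark[WMARK_HIGH] = min_wmark + step * 2;
    }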

    Signed-off-by: Johannes Weiner
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner