04 Apr, 2018

1 commit

  • Pull sparc updates from David Miller:

    1) Add support for ADI (Application Data Integrity) found in more
    recent sparc64 cpus. Essentially this is key-based access to
    virtual memory, and if the key encoded in the virtual address is
    wrong you get a trap.

    The mm changes were reviewed by Andrew Morton and others.

    Work by Khalid Aziz.

    2) Validate DAX completion index range properly, from Rob Gardner.

    3) Add proper Kconfig deps for DAX driver. From Guenter Roeck.
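
    The keyed-access idea above can be modeled in plain C. This is only an
    illustrative sketch: real ADI hardware stores a version tag per cache
    line and traps on mismatch, and the bit positions below are hypothetical,
    chosen just to show how a tag rides in the unused upper bits of a
    64-bit virtual address.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tag field in the upper 4 bits of a 64-bit address. */
#define TAG_SHIFT 60
#define TAG_MASK  0xFULL

/* Encode a version tag into the upper bits of a virtual address. */
static uint64_t set_tag(uint64_t va, uint64_t tag)
{
    return (va & ~(TAG_MASK << TAG_SHIFT)) | ((tag & TAG_MASK) << TAG_SHIFT);
}

/* Recover the tag from a tagged address. */
static uint64_t get_tag(uint64_t va)
{
    return (va >> TAG_SHIFT) & TAG_MASK;
}

/* Hardware would trap on mismatch; here we just model the check. */
static int access_ok(uint64_t va, uint64_t memory_tag)
{
    return get_tag(va) == memory_tag;
}
```

    A mismatched tag (here, `access_ok` returning 0) is what the new
    "Memory Corruption Detected" trap handler reports on real hardware.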

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next:
    sparc64: Make atomic_xchg() an inline function rather than a macro.
    sparc64: Properly range check DAX completion index
    sparc: Make auxiliary vectors for ADI available on 32-bit as well
    sparc64: Oracle DAX driver depends on SPARC64
    sparc64: Update signal delivery to use new helper functions
    sparc64: Add support for ADI (Application Data Integrity)
    mm: Allow arch code to override copy_highpage()
    mm: Clear arch specific VM flags on protection change
    mm: Add address parameter to arch_validate_prot()
    sparc64: Add auxiliary vectors to report platform ADI properties
    sparc64: Add handler for "Memory Corruption Detected" trap
    sparc64: Add HV fault type handlers for ADI related faults
    sparc64: Add support for ADI register fields, ASIs and traps
    mm, swap: Add infrastructure for saving page metadata on swap
    signals, sparc: Add signal codes for ADI violations

    Linus Torvalds
     

03 Apr, 2018

10 commits

  • Pull removal of in-kernel calls to syscalls from Dominik Brodowski:
    "System calls are interaction points between userspace and the kernel.
    Therefore, system call functions such as sys_xyzzy() or
    compat_sys_xyzzy() should only be called from userspace via the
    syscall table, but not from elsewhere in the kernel.

    At least on 64-bit x86, it will likely be a hard requirement from
    v4.17 onwards to not call system call functions in the kernel: It is
    better to use a different calling convention for system calls
    there, where struct pt_regs is decoded on-the-fly in a syscall wrapper
    which then hands processing over to the actual syscall function. This
    means that only those parameters which are actually needed for a
    specific syscall are passed on during syscall entry, instead of
    filling in six CPU registers with random user space content all the
    time (which may cause serious trouble down the call chain). Those
    x86-specific patches will be pushed through the x86 tree in the near
    future.

    Moreover, rules on how data may be accessed may differ between kernel
    data and user data. This is another reason why calling sys_xyzzy() is
    generally a bad idea, and -- at most -- acceptable in arch-specific
    code.

    This patchset removes all in-kernel calls to syscall functions in the
    kernel with the exception of arch/. On top of this, it cleans up the
    three places where many syscalls are referenced or prototyped, namely
    kernel/sys_ni.c, include/linux/syscalls.h and include/linux/compat.h"
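
    The wrapper convention described in the quote can be sketched in
    user-space C. The struct layout and names below are illustrative, not
    the kernel's actual definitions: the point is that the wrapper decodes
    only the registers a given syscall needs from pt_regs, instead of
    handing all six registers to every syscall function.

```c
#include <stdint.h>

/* Simplified stand-in for the x86-64 register save area. */
struct pt_regs { uint64_t di, si, dx, r10, r8, r9; };

/* The actual syscall logic, taking only the parameters it needs. */
static long do_readahead(int fd, int64_t offset, uint64_t count)
{
    (void)fd; (void)offset;
    /* real work elided; echo count so the flow is visible */
    return (long)count;
}

/* Wrapper: decodes pt_regs on the fly, as the x86 entry code will. */
static long wrapped_sys_readahead(const struct pt_regs *regs)
{
    return do_readahead((int)regs->di, (int64_t)regs->si, regs->dx);
}
```

    Because only `di`, `si`, and `dx` are read, the remaining registers
    carrying "random user space content" never reach the syscall body.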

    * 'syscalls-next' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux: (109 commits)
    bpf: whitelist all syscalls for error injection
    kernel/sys_ni: remove {sys_,sys_compat} from cond_syscall definitions
    kernel/sys_ni: sort cond_syscall() entries
    syscalls/x86: auto-create compat_sys_*() prototypes
    syscalls: sort syscall prototypes in include/linux/compat.h
    net: remove compat_sys_*() prototypes from net/compat.h
    syscalls: sort syscall prototypes in include/linux/syscalls.h
    kexec: move sys_kexec_load() prototype to syscalls.h
    x86/sigreturn: use SYSCALL_DEFINE0
    x86: fix sys_sigreturn() return type to be long, not unsigned long
    x86/ioport: add ksys_ioperm() helper; remove in-kernel calls to sys_ioperm()
    mm: add ksys_readahead() helper; remove in-kernel calls to sys_readahead()
    mm: add ksys_mmap_pgoff() helper; remove in-kernel calls to sys_mmap_pgoff()
    mm: add ksys_fadvise64_64() helper; remove in-kernel call to sys_fadvise64_64()
    fs: add ksys_fallocate() wrapper; remove in-kernel calls to sys_fallocate()
    fs: add ksys_p{read,write}64() helpers; remove in-kernel calls to syscalls
    fs: add ksys_truncate() wrapper; remove in-kernel calls to sys_truncate()
    fs: add ksys_sync_file_range helper(); remove in-kernel calls to syscall
    kernel: add ksys_setsid() helper; remove in-kernel call to sys_setsid()
    kernel: add ksys_unshare() helper; remove in-kernel calls to sys_unshare()
    ...

    Linus Torvalds
     
  • Pull removal of obsolete architecture ports from Arnd Bergmann:
    "This removes the entire architecture code for blackfin, cris, frv,
    m32r, metag, mn10300, score, and tile, including the associated device
    drivers.

    I have been working with the (former) maintainers for each one to
    ensure that my interpretation was right and the code is definitely
    unused in mainline kernels. Many had fond memories of working on the
    respective ports to start with and getting them included in upstream,
    but also saw no point in keeping the port alive without any users.

    In the end, it seems that while the eight architectures are extremely
    different, they all suffered the same fate: There was one company in
    charge of an SoC line, a CPU microarchitecture and a software
    ecosystem, which was more costly than licensing newer off-the-shelf
    CPU cores from a third party (typically ARM, MIPS, or RISC-V). It
    seems that all the SoC product lines are still around, but have not
    used the custom CPU architectures for several years at this point. In
    contrast, CPU instruction sets that remain popular and have actively
    maintained kernel ports tend to all be used across multiple licensees.

    [ See the new nds32 port merged in the previous commit for the next
    generation of "one company in charge of an SoC line, a CPU
    microarchitecture and a software ecosystem" - Linus ]

    The removal came out of a discussion that is now documented at
    https://lwn.net/Articles/748074/. Unlike the original plans, I'm not
    marking any ports as deprecated but removing them all at once after I
    made sure that they are all unused. Some architectures (notably tile,
    mn10300, and blackfin) are still being shipped in products with old
    kernels, but those products will never be updated to newer kernel
    releases.

    After this series, we still have a few architectures without mainline
    gcc support:

    - unicore32 and hexagon both have very outdated gcc releases, but the
    maintainers promised to work on providing something newer. At least
    in case of hexagon, this will only be llvm, not gcc.

    - openrisc, risc-v and nds32 are still in the process of finishing
    their support or getting it added to mainline gcc in the first
    place. They all have patched gcc-7.3 ports that work to some
    degree, but complete upstream support won't happen before gcc-8.1.
    Csky posted their first kernel patch set last week; their situation
    will be similar.

    [ Palmer Dabbelt points out that RISC-V support is in mainline gcc
    since gcc-7, although gcc-7.3.0 is the recommended minimum - Linus ]"

    This really says it all:

    2498 files changed, 95 insertions(+), 467668 deletions(-)

    * tag 'arch-removal' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic: (74 commits)
    MAINTAINERS: UNICORE32: Change email account
    staging: iio: remove iio-trig-bfin-timer driver
    tty: hvc: remove tile driver
    tty: remove bfin_jtag_comm and hvc_bfin_jtag drivers
    serial: remove tile uart driver
    serial: remove m32r_sio driver
    serial: remove blackfin drivers
    serial: remove cris/etrax uart drivers
    usb: Remove Blackfin references in USB support
    usb: isp1362: remove blackfin arch glue
    usb: musb: remove blackfin port
    usb: host: remove tilegx platform glue
    pwm: remove pwm-bfin driver
    i2c: remove bfin-twi driver
    spi: remove blackfin related host drivers
    watchdog: remove bfin_wdt driver
    can: remove bfin_can driver
    mmc: remove bfin_sdh driver
    input: misc: remove blackfin rotary driver
    input: keyboard: remove bf54x driver
    ...

    Linus Torvalds
     
  • Pull x86 mm updates from Ingo Molnar:

    - Extend the memmap= boot parameter syntax to allow the redeclaration
    and dropping of existing ranges, and to support all e820 range types
    (Jan H. Schönherr)

    - Improve the W+X boot time security checks to remove false positive
    warnings on Xen (Jan Beulich)

    - Support booting as Xen PVH guest (Juergen Gross)

    - Improved 5-level paging (LA57) support; in particular, it is now
    possible to have a single kernel image for both 4-level and 5-level
    hardware (Kirill A. Shutemov)

    - AMD hardware RAM encryption support (SME/SEV) fixes (Tom Lendacky)

    - Preparatory commits for hardware-encrypted RAM support on Intel CPUs.
    (Kirill A. Shutemov)

    - Improved Intel-MID support (Andy Shevchenko)

    - Show EFI page tables in page_tables debug files (Andy Lutomirski)

    - ... plus misc fixes and smaller cleanups

    * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (56 commits)
    x86/cpu/tme: Fix spelling: "configuation" -> "configuration"
    x86/boot: Fix SEV boot failure from change to __PHYSICAL_MASK_SHIFT
    x86/mm: Update comment in detect_tme() regarding x86_phys_bits
    x86/mm/32: Remove unused node_memmap_size_bytes() & CONFIG_NEED_NODE_MEMMAP_SIZE logic
    x86/mm: Remove pointless checks in vmalloc_fault
    x86/platform/intel-mid: Add special handling for ACPI HW reduced platforms
    ACPI, x86/boot: Introduce the ->reduced_hw_early_init() ACPI callback
    ACPI, x86/boot: Split out acpi_generic_reduce_hw_init() and export
    x86/pconfig: Provide defines and helper to run MKTME_KEY_PROG leaf
    x86/pconfig: Detect PCONFIG targets
    x86/tme: Detect if TME and MKTME is activated by BIOS
    x86/boot/compressed/64: Handle 5-level paging boot if kernel is above 4G
    x86/boot/compressed/64: Use page table in trampoline memory
    x86/boot/compressed/64: Use stack from trampoline memory
    x86/boot/compressed/64: Make sure we have a 32-bit code segment
    x86/mm: Do not use paravirtualized calls in native_set_p4d()
    kdump, vmcoreinfo: Export pgtable_l5_enabled value
    x86/boot/compressed/64: Prepare new top-level page table for trampoline
    x86/boot/compressed/64: Set up trampoline memory
    x86/boot/compressed/64: Save and restore trampoline memory
    ...

    Linus Torvalds
     
  • Using this helper allows us to avoid the in-kernel calls to the
    sys_readahead() syscall. The ksys_ prefix denotes that this function is
    meant as a drop-in replacement for the syscall. In particular, it uses the
    same calling convention as sys_readahead().
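
    The pattern is simple enough to sketch in a few lines. This is an
    illustrative stand-in (the `_demo` names are hypothetical, not kernel
    symbols): the syscall body moves into a kernel-internal helper with the
    same calling convention, the syscall entry point becomes a thin
    pass-through, and in-kernel callers switch to the helper.

```c
/* Kernel-internal helper: drop-in replacement for the syscall. */
static long ksys_readahead_demo(int fd, long long offset,
                                unsigned long count)
{
    (void)fd; (void)offset; (void)count;
    /* actual readahead logic would live here */
    return 0;
}

/* Userspace-facing entry point: nothing but a pass-through. */
long sys_readahead_demo(int fd, long long offset, unsigned long count)
{
    return ksys_readahead_demo(fd, offset, count);
}
```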

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Andrew Morton
    Cc: linux-mm@kvack.org
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     
  • Using this helper allows us to avoid the in-kernel calls to the
    sys_mmap_pgoff() syscall. The ksys_ prefix denotes that this function is
    meant as a drop-in replacement for the syscall. In particular, it uses the
    same calling convention as sys_mmap_pgoff().

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Andrew Morton
    Cc: linux-mm@kvack.org
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     
  • Using the ksys_fadvise64_64() helper allows us to avoid the in-kernel
    calls to the sys_fadvise64_64() syscall. The ksys_ prefix denotes that
    this function is meant as a drop-in replacement for the syscall. In
    particular, it uses the same calling convention as sys_fadvise64_64().

    Some compat stubs called sys_fadvise64(), which then just passed through
    the arguments to sys_fadvise64_64(). Get rid of this indirection, and call
    ksys_fadvise64_64() directly.

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Andrew Morton
    Cc: linux-mm@kvack.org
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     
  • Using the mm-internal kernel_[sg]et_mempolicy() helper allows us to get
    rid of the mm-internal calls to the sys_[sg]et_mempolicy() syscalls.

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Al Viro
    Cc: linux-mm@kvack.org
    Cc: Andrew Morton
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     
  • Using the mm-internal kernel_mbind() helper allows us to get rid of the
    mm-internal call to the sys_mbind() syscall.

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Al Viro
    Cc: linux-mm@kvack.org
    Cc: Andrew Morton
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     
  • Move compat_sys_move_pages() to mm/migrate.c and make it call a newly
    introduced helper -- kernel_move_pages() -- instead of the syscall.

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Al Viro
    Cc: linux-mm@kvack.org
    Cc: Andrew Morton
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     
  • Move compat_sys_migrate_pages() to mm/mempolicy.c and make it call a newly
    introduced helper -- kernel_migrate_pages() -- instead of the syscall.

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Al Viro
    Cc: linux-mm@kvack.org
    Cc: Andrew Morton
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     

29 Mar, 2018

5 commits

  • A crash is observed when kmemleak_scan accesses the object->pointer,
    likely due to the following race.

    TASK A                     TASK B                     TASK C
    kmemleak_write
      (with "scan" and
       NOT "scan=on")
    kmemleak_scan()
                               create_object
                               kmem_cache_alloc fails
                               kmemleak_disable
                               kmemleak_do_cleanup
                               kmemleak_free_enabled = 0
                                                          kfree
                                                          kmemleak_free bails out
                                                          (kmemleak_free_enabled is 0)
                                                          slub frees object->pointer
    update_checksum
      crash - object->pointer
      freed (DEBUG_PAGEALLOC)

    kmemleak_do_cleanup waits for the scan thread to complete, but not for
    direct call to kmemleak_scan via kmemleak_write. So add a wait for
    kmemleak_scan completion before disabling kmemleak_free, and while at it
    fix the comment on stop_scan_thread.

    [vinmenon@codeaurora.org: fix stop_scan_thread comment]
    Link: http://lkml.kernel.org/r/1522219972-22809-1-git-send-email-vinmenon@codeaurora.org
    Link: http://lkml.kernel.org/r/1522063429-18992-1-git-send-email-vinmenon@codeaurora.org
    Signed-off-by: Vinayak Menon
    Reviewed-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vinayak Menon
     
  • There are a couple of places where parameter description and function
    name do not match the actual code. Fix it.

    Link: http://lkml.kernel.org/r/1520843448-17347-1-git-send-email-honglei.wang@oracle.com
    Signed-off-by: Honglei Wang
    Acked-by: Tejun Heo
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Honglei Wang
     
  • Attempting to hotplug CPUs with CONFIG_VM_EVENT_COUNTERS enabled can
    cause vmstat_update() to report a BUG due to preemption not being
    disabled around smp_processor_id().

    Discovered on Ubiquiti EdgeRouter Pro with Cavium Octeon II processor.

    BUG: using smp_processor_id() in preemptible [00000000] code:
    kworker/1:1/269
    caller is vmstat_update+0x50/0xa0
    CPU: 0 PID: 269 Comm: kworker/1:1 Not tainted
    4.16.0-rc4-Cavium-Octeon-00009-gf83bbd5-dirty #1
    Workqueue: mm_percpu_wq vmstat_update
    Call Trace:
    show_stack+0x94/0x128
    dump_stack+0xa4/0xe0
    check_preemption_disabled+0x118/0x120
    vmstat_update+0x50/0xa0
    process_one_work+0x144/0x348
    worker_thread+0x150/0x4b8
    kthread+0x110/0x140
    ret_from_kernel_thread+0x14/0x1c

    Link: http://lkml.kernel.org/r/1520881552-25659-1-git-send-email-steven.hill@cavium.com
    Signed-off-by: Steven J. Hill
    Reviewed-by: Andrew Morton
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven J. Hill
     
  • This patch fixes commit 5f48f0bd4e36 ("mm, page_owner: skip unnecessary
    stack_trace entries").

    If we skip the first two entries, the logic that checks for a count
    value of 2 to detect recursion is broken, and the code will recurse one
    level deep.

    So we need to check for only one call of _RET_IP (__set_page_owner)
    when checking for recursion.

    Current Backtrace while checking for recursion:-

    (save_stack) from (__set_page_owner) // (But recursion returns true here)
    (__set_page_owner) from (get_page_from_freelist)
    (get_page_from_freelist) from (__alloc_pages_nodemask)
    (__alloc_pages_nodemask) from (depot_save_stack)
    (depot_save_stack) from (save_stack) // recursion should return true here
    (save_stack) from (__set_page_owner)
    (__set_page_owner) from (get_page_from_freelist)
    (get_page_from_freelist) from (__alloc_pages_nodemask)
    (__alloc_pages_nodemask) from (depot_save_stack)
    (depot_save_stack) from (save_stack)
    (save_stack) from (__set_page_owner)
    (__set_page_owner) from (get_page_from_freelist)

    Correct Backtrace with fix:

    (save_stack) from (__set_page_owner) // recursion returned true here
    (__set_page_owner) from (get_page_from_freelist)
    (get_page_from_freelist) from (__alloc_pages_nodemask)
    (__alloc_pages_nodemask) from (depot_save_stack)
    (depot_save_stack) from (save_stack)
    (save_stack) from (__set_page_owner)
    (__set_page_owner) from (get_page_from_freelist)

    Link: http://lkml.kernel.org/r/1521607043-34670-1-git-send-email-maninder1.s@samsung.com
    Fixes: 5f48f0bd4e36 ("mm, page_owner: skip unnecessary stack_trace entries")
    Signed-off-by: Maninder Singh
    Signed-off-by: Vaneet Narang
    Acked-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Oscar Salvador
    Cc: Greg Kroah-Hartman
    Cc: Ayush Mittal
    Cc: Prakash Gupta
    Cc: Vinayak Menon
    Cc: Vasyl Gomonovych
    Cc: Amit Sahrawat
    Cc:
    Cc: Vaneet Narang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Maninder Singh
     
  • All the root caches are linked into slab_root_caches, which was
    introduced by commit 510ded33e075 ("slab: implement slab_root_caches
    list"), but that commit failed to add SLAB's kmem_cache to the list.

    While experimenting with opt-in/opt-out kmem accounting, I noticed
    system crashes due to a NULL dereference inside cache_from_memcg_idx()
    while dereferencing kmem_cache.memcg_params.memcg_caches. A clean
    upstream kernel will not see these crashes, but SLAB should be
    consistent with SLUB, which does link its boot caches (kmem_cache_node
    and kmem_cache) into slab_root_caches.

    Link: http://lkml.kernel.org/r/20180319210020.60289-1-shakeelb@google.com
    Fixes: 510ded33e075c ("slab: implement slab_root_caches list")
    Signed-off-by: Shakeel Butt
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Cc: Greg Thelen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     

27 Mar, 2018

2 commits


26 Mar, 2018

1 commit


23 Mar, 2018

9 commits

  • Commit 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and
    madvised allocations") changed the page allocator to no longer detect
    thp allocations based on __GFP_NORETRY.

    It did not, however, modify the mem cgroup try_charge() path to avoid
    oom kill for either khugepaged collapsing or thp faulting. It is never
    expected to oom kill a process to allocate a hugepage for thp; reclaim
    is governed by the thp defrag mode and MADV_HUGEPAGE, but allocations
    (and charging) should fallback instead of oom killing processes.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803191409420.124411@chino.kir.corp.google.com
    Fixes: 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations")
    Signed-off-by: David Rientjes
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Commit 726d061fbd36 ("mm: vmscan: kick flushers when we encounter dirty
    pages on the LRU") added flusher invocation to shrink_inactive_list()
    when many dirty pages on the LRU are encountered.

    However, shrink_inactive_list() doesn't wake up flushers for legacy
    cgroup reclaim, so the next commit bbef938429f5 ("mm: vmscan: remove old
    flusher wakeup from direct reclaim path") removed the only source of
    flusher's wake up in legacy mem cgroup reclaim path.

    This leads to premature OOM if there are too many dirty pages in the cgroup:
    # mkdir /sys/fs/cgroup/memory/test
    # echo $$ > /sys/fs/cgroup/memory/test/tasks
    # echo 50M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
    # dd if=/dev/zero of=tmp_file bs=1M count=100
    Killed

    dd invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=0

    Call Trace:
    dump_stack+0x46/0x65
    dump_header+0x6b/0x2ac
    oom_kill_process+0x21c/0x4a0
    out_of_memory+0x2a5/0x4b0
    mem_cgroup_out_of_memory+0x3b/0x60
    mem_cgroup_oom_synchronize+0x2ed/0x330
    pagefault_out_of_memory+0x24/0x54
    __do_page_fault+0x521/0x540
    page_fault+0x45/0x50

    Task in /test killed as a result of limit of /test
    memory: usage 51200kB, limit 51200kB, failcnt 73
    memory+swap: usage 51200kB, limit 9007199254740988kB, failcnt 0
    kmem: usage 296kB, limit 9007199254740988kB, failcnt 0
    Memory cgroup stats for /test: cache:49632KB rss:1056KB rss_huge:0KB shmem:0KB
    mapped_file:0KB dirty:49500KB writeback:0KB swap:0KB inactive_anon:0KB
    active_anon:1168KB inactive_file:24760KB active_file:24960KB unevictable:0KB
    Memory cgroup out of memory: Kill process 3861 (bash) score 88 or sacrifice child
    Killed process 3876 (dd) total-vm:8484kB, anon-rss:1052kB, file-rss:1720kB, shmem-rss:0kB
    oom_reaper: reaped process 3876 (dd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

    Wake up flushers in legacy cgroup reclaim too.

    Link: http://lkml.kernel.org/r/20180315164553.17856-1-aryabinin@virtuozzo.com
    Fixes: bbef938429f5 ("mm: vmscan: remove old flusher wakeup from direct reclaim path")
    Signed-off-by: Andrey Ryabinin
    Tested-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Tejun Heo
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • This reverts commit b92df1de5d28 ("mm: page_alloc: skip over regions of
    invalid pfns where possible"). The commit is meant to be a boot init
    speed up skipping the loop in memmap_init_zone() for invalid pfns.

    But given some specific memory mapping on x86_64 (or more generally
    theoretically anywhere but on arm with CONFIG_HAVE_ARCH_PFN_VALID) the
    implementation also skips valid pfns which is plain wrong and causes
    'kernel BUG at mm/page_alloc.c:1389!'

    crash> log | grep -e BUG -e RIP -e Call.Trace -e move_freepages_block -e rmqueue -e freelist -A1
    kernel BUG at mm/page_alloc.c:1389!
    invalid opcode: 0000 [#1] SMP
    --
    RIP: 0010: move_freepages+0x15e/0x160
    --
    Call Trace:
    move_freepages_block+0x73/0x80
    __rmqueue+0x263/0x460
    get_page_from_freelist+0x7e1/0x9e0
    __alloc_pages_nodemask+0x176/0x420
    --

    crash> page_init_bug -v | grep RAM
    1000 - 9bfff System RAM (620.00 KiB)
    100000 - 430bffff System RAM ( 1.05 GiB = 1071.75 MiB = 1097472.00 KiB)
    4b0c8000 - 4bf9cfff System RAM ( 14.83 MiB = 15188.00 KiB)
    4bfac000 - 646b1fff System RAM (391.02 MiB = 400408.00 KiB)
    7b788000 - 7b7fffff System RAM (480.00 KiB)
    100000000 - 67fffffff System RAM ( 22.00 GiB)

    crash> page_init_bug | head -6
    7b788000 - 7b7fffff System RAM (480.00 KiB)
    1fffff00000000 0 1 DMA32 4096 1048575
    505736 505344 505855
    0 0 0 DMA 1 4095
    1fffff00000400 0 1 DMA32 4096 1048575
    BUG, zones differ!

    crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b787000 7b788000
    PAGE PHYSICAL MAPPING INDEX CNT FLAGS
    ffffea0001e00000 78000000 0 0 0 0
    ffffea0001ed7fc0 7b5ff000 0 0 0 0
    ffffea0001ed8000 7b600000 0 0 0 0 <<<<
    ffffea0001ede1c0 7b787000 0 0 0 0
    ffffea0001ede200 7b788000 0 0 1 1fffff00000000

    Link: http://lkml.kernel.org/r/20180316143855.29838-1-neelx@redhat.com
    Fixes: b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns where possible")
    Signed-off-by: Daniel Vacek
    Acked-by: Ard Biesheuvel
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Pavel Tatashin
    Cc: Paul Burton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Vacek
     
  • shmem_unused_huge_shrink() gets called from reclaim path. Waiting for
    page lock may lead to deadlock there.

    There was a bug report that may be attributed to this:

    http://lkml.kernel.org/r/alpine.LRH.2.11.1801242349220.30642@mail.ewheeler.net

    Replace lock_page() with trylock_page() and skip the page if we failed
    to lock it. We will get to the page on the next scan.

    We can test for PageTransHuge() outside the page lock as we only
    need protection against splitting the page under us. Holding a pin
    on the page is enough for this.
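
    The fix pattern is easy to model: in a reclaim-path shrinker, a
    blocking lock_page() becomes trylock_page(), and pages that cannot be
    locked are skipped rather than waited on. The sketch below uses a
    plain flag instead of the kernel's real page lock, so the names with
    `_demo` are illustrative.

```c
struct page_demo { int locked; int processed; };

/* Non-blocking acquire: returns 1 on success, 0 if already locked. */
static int trylock_page_demo(struct page_demo *p)
{
    if (p->locked)
        return 0;
    p->locked = 1;
    return 1;
}

/* Scan pass: never blocks; skipped pages are retried on the next scan. */
static void scan_demo(struct page_demo *pages, int n)
{
    for (int i = 0; i < n; i++) {
        if (!trylock_page_demo(&pages[i]))
            continue;            /* someone else holds the lock; skip */
        pages[i].processed = 1;  /* split/shrink work would go here */
        pages[i].locked = 0;     /* unlock_page() */
    }
}
```

    The deadlock disappears because the scan can no longer sleep on a
    lock held by the very task it is trying to reclaim on behalf of.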

    Link: http://lkml.kernel.org/r/20180316210830.43738-1-kirill.shutemov@linux.intel.com
    Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure")
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Eric Wheeler
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Tetsuo Handa
    Cc: Hugh Dickins
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • deferred_split_scan() gets called from reclaim path. Waiting for page
    lock may lead to deadlock there.

    Replace lock_page() with trylock_page() and skip the page if we failed
    to lock it. We will get to the page on the next scan.

    Link: http://lkml.kernel.org/r/20180315150747.31945-1-kirill.shutemov@linux.intel.com
    Fixes: 9a982250f773 ("thp: introduce deferred_split_huge_page()")
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • khugepaged is not yet able to convert PTE-mapped huge pages back to
    PMD-mapped ones. We do not collapse such pages; see the check in
    khugepaged_scan_pmd().

    But if between khugepaged_scan_pmd() and __collapse_huge_page_isolate()
    somebody managed to instantiate THP in the range and then split the PMD
    back to PTEs we would have a problem --
    VM_BUG_ON_PAGE(PageCompound(page)) will get triggered.

    It's possible since we drop mmap_sem during collapse to re-take for
    write.

    Replace the VM_BUG_ON() with graceful collapse fail.

    Link: http://lkml.kernel.org/r/20180315152353.27989-1-kirill.shutemov@linux.intel.com
    Fixes: b1caa957ae6d ("khugepaged: ignore pmd tables with THP mapped with ptes")
    Signed-off-by: Kirill A. Shutemov
    Cc: Laura Abbott
    Cc: Jerome Marchand
    Cc: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • A vma with vm_pgoff large enough to overflow a loff_t type when
    converted to a byte offset can be passed via the remap_file_pages system
    call. The hugetlbfs mmap routine uses the byte offset to calculate
    reservations and file size.

    A sequence such as:

    mmap(0x20a00000, 0x600000, 0, 0x66033, -1, 0);
    remap_file_pages(0x20a00000, 0x600000, 0, 0x20000000000000, 0);

    will result in the following when task exits/file closed,

    kernel BUG at mm/hugetlb.c:749!
    Call Trace:
    hugetlbfs_evict_inode+0x2f/0x40
    evict+0xcb/0x190
    __dentry_kill+0xcb/0x150
    __fput+0x164/0x1e0
    task_work_run+0x84/0xa0
    exit_to_usermode_loop+0x7d/0x80
    do_syscall_64+0x18b/0x190
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    The overflowed pgoff value causes hugetlbfs to try to set up a mapping
    with a negative range (end < start) that leaves invalid state which
    causes the BUG.

    The previous overflow fix to this code was incomplete and did not take
    the remap_file_pages system call into account.
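
    The arithmetic behind the bug can be shown directly. The helper below
    is an illustrative sketch (the name and check are hypothetical, not
    the kernel's actual fix): converting a page offset to a byte offset
    must be validated before shifting, because a pgoff as large as the one
    passed via remap_file_pages() above overflows a signed 64-bit loff_t.

```c
#include <stdint.h>

#define PAGE_SHIFT 12  /* 4 KiB pages */

/* Convert a page offset to a byte offset, rejecting values whose
 * shifted result would not fit in a signed 64-bit loff_t. */
static int pgoff_to_bytes_checked(uint64_t pgoff, int64_t *bytes)
{
    if (pgoff > (uint64_t)INT64_MAX >> PAGE_SHIFT)
        return -1;                       /* would overflow loff_t */
    *bytes = (int64_t)(pgoff << PAGE_SHIFT);
    return 0;
}
```

    With the pgoff from the repro (0x20000000000000, i.e. 2^53), the shift
    by 12 needs 65 bits, so an unchecked conversion wraps to an invalid
    (negative) range, which is exactly the end < start state that trips
    the BUG.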

    [mike.kravetz@oracle.com: v3]
    Link: http://lkml.kernel.org/r/20180309002726.7248-1-mike.kravetz@oracle.com
    [akpm@linux-foundation.org: include mmdebug.h]
    [akpm@linux-foundation.org: fix -ve left shift count on sh]
    Link: http://lkml.kernel.org/r/20180308210502.15952-1-mike.kravetz@oracle.com
    Fixes: 045c7a3f53d9 ("hugetlbfs: fix offset overflow in hugetlbfs mmap")
    Signed-off-by: Mike Kravetz
    Reported-by: Nic Losby
    Acked-by: Michal Hocko
    Cc: "Kirill A . Shutemov"
    Cc: Yisheng Xie
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Dave Jones reported fs_reclaim lockdep warnings.

    ============================================
    WARNING: possible recursive locking detected
    4.15.0-rc9-backup-debug+ #1 Not tainted
    --------------------------------------------
    sshd/24800 is trying to acquire lock:
    (fs_reclaim){+.+.}, at: [] fs_reclaim_acquire.part.102+0x5/0x30

    but task is already holding lock:
    (fs_reclaim){+.+.}, at: [] fs_reclaim_acquire.part.102+0x5/0x30

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(fs_reclaim);
    lock(fs_reclaim);

    *** DEADLOCK ***

    May be due to missing lock nesting notation

    2 locks held by sshd/24800:
    #0: (sk_lock-AF_INET6){+.+.}, at: [] tcp_sendmsg+0x19/0x40
    #1: (fs_reclaim){+.+.}, at: [] fs_reclaim_acquire.part.102+0x5/0x30

    stack backtrace:
    CPU: 3 PID: 24800 Comm: sshd Not tainted 4.15.0-rc9-backup-debug+ #1
    Call Trace:
    dump_stack+0xbc/0x13f
    __lock_acquire+0xa09/0x2040
    lock_acquire+0x12e/0x350
    fs_reclaim_acquire.part.102+0x29/0x30
    kmem_cache_alloc+0x3d/0x2c0
    alloc_extent_state+0xa7/0x410
    __clear_extent_bit+0x3ea/0x570
    try_release_extent_mapping+0x21a/0x260
    __btrfs_releasepage+0xb0/0x1c0
    btrfs_releasepage+0x161/0x170
    try_to_release_page+0x162/0x1c0
    shrink_page_list+0x1d5a/0x2fb0
    shrink_inactive_list+0x451/0x940
    shrink_node_memcg.constprop.88+0x4c9/0x5e0
    shrink_node+0x12d/0x260
    try_to_free_pages+0x418/0xaf0
    __alloc_pages_slowpath+0x976/0x1790
    __alloc_pages_nodemask+0x52c/0x5c0
    new_slab+0x374/0x3f0
    ___slab_alloc.constprop.81+0x47e/0x5a0
    __slab_alloc.constprop.80+0x32/0x60
    __kmalloc_track_caller+0x267/0x310
    __kmalloc_reserve.isra.40+0x29/0x80
    __alloc_skb+0xee/0x390
    sk_stream_alloc_skb+0xb8/0x340
    tcp_sendmsg_locked+0x8e6/0x1d30
    tcp_sendmsg+0x27/0x40
    inet_sendmsg+0xd0/0x310
    sock_write_iter+0x17a/0x240
    __vfs_write+0x2ab/0x380
    vfs_write+0xfb/0x260
    SyS_write+0xb6/0x140
    do_syscall_64+0x1e5/0xc05
    entry_SYSCALL64_slow_path+0x25/0x25

    This warning is caused by commit d92a8cfcb37e ("locking/lockdep:
    Rework FS_RECLAIM annotation") which replaced the use of
    lockdep_{set,clear}_current_reclaim_state() in __perform_reclaim()
    and lockdep_trace_alloc() in slab_pre_alloc_hook() with
    fs_reclaim_acquire()/ fs_reclaim_release().

    Since __kmalloc_reserve() from __alloc_skb() adds __GFP_NOMEMALLOC |
    __GFP_NOWARN to gfp_mask, and the reclaim path simply propagates
    __GFP_NOMEMALLOC, fs_reclaim_acquire() in slab_pre_alloc_hook() tries
    to grab the 'fake' lock again while __perform_reclaim() already
    holds the 'fake' lock.

    The

    /* this guy won't enter reclaim */
    if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
    return false;

    test which causes slab_pre_alloc_hook() to try to grab the 'fake' lock
    was added by commit cf40bd16fdad ("lockdep: annotate reclaim context
    (__GFP_NOFS)"). But that test is outdated, because a PF_MEMALLOC thread
    won't enter reclaim regardless of __GFP_NOMEMALLOC after commit
    341ce06f69ab ("page allocator: calculate the alloc_flags for allocation
    only once") added the PF_MEMALLOC safeguard (

    /* Avoid recursion of direct reclaim */
    if (p->flags & PF_MEMALLOC)
    goto nopage;

    in __alloc_pages_slowpath()).

    Thus, let's fix the outdated test by removing the __GFP_NOMEMALLOC check
    and allow __need_fs_reclaim() to return false.

    Link: http://lkml.kernel.org/r/201802280650.FJC73911.FOSOMLJVFFQtHO@I-love.SAKURA.ne.jp
    Fixes: d92a8cfcb37ecd13 ("locking/lockdep: Rework FS_RECLAIM annotation")
    Signed-off-by: Tetsuo Handa
    Reported-by: Dave Jones
    Tested-by: Dave Jones
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Cc: Ingo Molnar
    Cc: Nikolay Borisov
    Cc: Michal Hocko
    Cc: [4.14+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Alexander reported a use of uninitialized memory in __mpol_equal(),
    caused by incorrect use of preferred_node.

    When a mempolicy is in mode MPOL_PREFERRED with the MPOL_F_LOCAL flag,
    numa_node_id() is used instead of preferred_node; __mpol_equal(),
    however, reads preferred_node without first checking for MPOL_F_LOCAL.

    [akpm@linux-foundation.org: slight comment tweak]
    Link: http://lkml.kernel.org/r/4ebee1c2-57f6-bcb8-0e2d-1833d1ee0bb7@huawei.com
    Fixes: fc36b8d3d819 ("mempolicy: use MPOL_F_LOCAL to Indicate Preferred Local Policy")
    Signed-off-by: Yisheng Xie
    Reported-by: Alexander Potapenko
    Tested-by: Alexander Potapenko
    Reviewed-by: Andrew Morton
    Cc: Dmitriy Vyukov
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yisheng Xie
     

20 Mar, 2018

3 commits

  • Pull percpu fixes from Tejun Heo:
    "Late percpu pull request for v4.16-rc6.

    - percpu allocator pool replenishing no longer triggers OOM or
    warning messages.

    Also, the alloc interface now understands __GFP_NORETRY and
    __GFP_NOWARN. This is to allow avoiding OOMs from userland
    triggered actions like bpf map creation.

    Also added cond_resched() in alloc loop.

    - percpu allocation now can be interrupted by kill sigs to avoid
    deadlocking OOM killer.

    - Added Dennis Zhou as a co-maintainer.

    He has rewritten the area map allocator, understands most of the
    code base and has been responsive for all bug reports"

    * 'for-4.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
    percpu_ref: Update doc to dissuade users from depending on internal RCU grace periods
    mm: Allow to kill tasks doing pcpu_alloc() and waiting for pcpu_balance_workfn()
    percpu: include linux/sched.h for cond_resched()
    percpu: add a schedule point in pcpu_balance_workfn()
    percpu: allow select gfp to be passed to underlying allocators
    percpu: add __GFP_NORETRY semantics to the percpu balancing path
    percpu: match chunk allocator declarations with definitions
    percpu: add Dennis Zhou as a percpu co-maintainer

    Linus Torvalds
     
  • In case of memory deficit and low percpu memory pages,
    pcpu_balance_workfn() holds pcpu_alloc_mutex for a long
    time (as it makes memory allocations itself and waits
    for memory reclaim). If tasks doing pcpu_alloc() are
    chosen by the OOM killer, they can't exit, because they
    are waiting for the mutex.

    This patch makes pcpu_alloc() care about fatal signals
    and use mutex_lock_killable() when the GFP flags allow
    it. This guarantees a task does not miss SIGKILL
    from the OOM killer.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Tejun Heo

    Kirill Tkhai
     
  • The microblaze build broke due to the missing declaration for the
    recently added cond_resched() invocation. Let's include linux/sched.h
    explicitly.

    Signed-off-by: Tejun Heo
    Reported-by: kbuild test robot

    Tejun Heo
     

18 Mar, 2018

4 commits

  • ADI is a new feature supported on SPARC M7 and newer processors to allow
    hardware to catch rogue accesses to memory. ADI is supported for data
    fetches only and not instruction fetches. An app can enable ADI on its
    data pages, set version tags on them and use versioned addresses to
    access the data pages. Upper bits of the address contain the version
    tag. On M7 processors, upper four bits (bits 63-60) contain the version
    tag. If a rogue app attempts to access ADI enabled data pages, its
    access is blocked and processor generates an exception. Please see
    Documentation/sparc/adi.txt for further details.

    This patch extends mprotect to enable ADI (TSTATE.mcde), enable/disable
    MCD (Memory Corruption Detection) on selected memory ranges, enable
    TTE.mcd in PTEs, return ADI parameters to userspace and save/restore ADI
    version tags on page swap out/in or migration. ADI is not enabled by
    default for any task. A task must explicitly enable ADI on a memory
    range and set version tag for ADI to be effective for the task.

    Signed-off-by: Khalid Aziz
    Cc: Khalid Aziz
    Reviewed-by: Anthony Yznaga
    Signed-off-by: David S. Miller

    Khalid Aziz
     
  • When protection bits are changed on a VMA, some of the architecture-
    specific flags should be cleared as well. An example of this is the
    PKEY flags on x86. This patch generalizes the current code that clears
    PKEY flags for x86, so that other architectures can provide similar
    functionality.

    Signed-off-by: Khalid Aziz
    Cc: Khalid Aziz
    Reviewed-by: Anthony Yznaga
    Acked-by: Andrew Morton
    Signed-off-by: David S. Miller

    Khalid Aziz
     
  • A protection flag may not be valid across the entire address space,
    and hence arch_validate_prot() might need the address a protection bit
    is being set on to ensure it is a valid protection flag. For example, sparc
    processors support memory corruption detection (as part of ADI feature)
    flag on memory addresses mapped on to physical RAM but not on PFN mapped
    pages or addresses mapped on to devices. This patch adds address to the
    parameters being passed to arch_validate_prot() so protection bits can
    be validated in the relevant context.

    Signed-off-by: Khalid Aziz
    Cc: Khalid Aziz
    Reviewed-by: Anthony Yznaga
    Acked-by: Michael Ellerman (powerpc)
    Acked-by: Andrew Morton
    Signed-off-by: David S. Miller

    Khalid Aziz
     
  • If a processor supports special metadata for a page, for example ADI
    version tags on SPARC M7, this metadata must be saved when the page is
    swapped out. The same metadata must be restored when the page is swapped
    back in. This patch adds two new architecture specific functions -
    arch_do_swap_page() to be called when a page is swapped in, and
    arch_unmap_one() to be called when a page is being unmapped for swap
    out. These architecture hooks allow page metadata to be saved if the
    architecture supports it.

    Signed-off-by: Khalid Aziz
    Cc: Khalid Aziz
    Acked-by: Jerome Marchand
    Reviewed-by: Anthony Yznaga
    Acked-by: Andrew Morton
    Signed-off-by: David S. Miller

    Khalid Aziz
     

16 Mar, 2018

2 commits

  • Tile was the only remaining architecture to implement alloc_remap(),
    and since that is being removed, there is no point in keeping this
    function.

    Removing all callers simplifies the mem_map handling.

    Reviewed-by: Pavel Tatashin
    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     
  • The CONFIG_MPU option was only defined on blackfin, and that architecture
    is now being removed, so the respective code can be simplified.

    A lot of other microcontrollers have an MPU, but I suspect that if we
    want to bring that support back, we'd do it differently anyway.

    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     

15 Mar, 2018

2 commits

  • This reverts commit 864b75f9d6b0100bb24fdd9a20d156e7cda9b5ae.

    Commit 864b75f9d6b0 ("mm/page_alloc: fix memmap_init_zone pageblock
    alignment") modified the logic in memmap_init_zone() to initialize
    struct pages associated with invalid PFNs, to appease a VM_BUG_ON()
    in move_freepages(), which is redundant by its own admission, and
    dereferences struct page fields to obtain the zone without checking
    whether the struct pages in question are valid to begin with.

    Commit 864b75f9d6b0 only makes it worse, since the rounding it does
    may cause pfn to assume the same value it had in a prior iteration of
    the loop, resulting in an infinite loop and a hang very early in the
    boot. Also, since it doesn't perform the same rounding on start_pfn
    itself but only on intermediate values following an invalid PFN, we
    may still hit the same VM_BUG_ON() as before.

    So instead, let's fix this at the core, and ensure that the BUG
    check doesn't dereference struct page fields of invalid pages.

    Fixes: 864b75f9d6b0 ("mm/page_alloc: fix memmap_init_zone pageblock alignment")
    Tested-by: Jan Glauber
    Tested-by: Shanker Donthineni
    Cc: Daniel Vacek
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Paul Burton
    Cc: Pavel Tatashin
    Cc: Vlastimil Babka
    Cc: Andrew Morton
    Signed-off-by: Ard Biesheuvel
    Signed-off-by: Linus Torvalds

    Ard Biesheuvel
     
  • Thomas Gleixner
     

10 Mar, 2018

1 commit

  • Commit b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns
    where possible") introduced a bug where move_freepages() triggers a
    VM_BUG_ON() on uninitialized page structure due to pageblock alignment.
    To fix this, simply align the skipped pfns in memmap_init_zone() the
    same way as in move_freepages_block().

    Seen in one of the RHEL reports:

    crash> log | grep -e BUG -e RIP -e Call.Trace -e move_freepages_block -e rmqueue -e freelist -A1
    kernel BUG at mm/page_alloc.c:1389!
    invalid opcode: 0000 [#1] SMP
    --
    RIP: 0010:[] [] move_freepages+0x15e/0x160
    RSP: 0018:ffff88054d727688 EFLAGS: 00010087
    --
    Call Trace:
    [] move_freepages_block+0x73/0x80
    [] __rmqueue+0x263/0x460
    [] get_page_from_freelist+0x7e1/0x9e0
    [] __alloc_pages_nodemask+0x176/0x420
    --
    RIP [] move_freepages+0x15e/0x160
    RSP

    crash> page_init_bug -v | grep RAM
    1000 - 9bfff System RAM (620.00 KiB)
    100000 - 430bffff System RAM ( 1.05 GiB = 1071.75 MiB = 1097472.00 KiB)
    4b0c8000 - 4bf9cfff System RAM ( 14.83 MiB = 15188.00 KiB)
    4bfac000 - 646b1fff System RAM (391.02 MiB = 400408.00 KiB)
    7b788000 - 7b7fffff System RAM (480.00 KiB)
    100000000 - 67fffffff System RAM ( 22.00 GiB)

    crash> page_init_bug | head -6
    7b788000 - 7b7fffff System RAM (480.00 KiB)
    1fffff00000000 0 1 DMA32 4096 1048575
    505736 505344 505855
    0 0 0 DMA 1 4095
    1fffff00000400 0 1 DMA32 4096 1048575
    BUG, zones differ!

    Note that this range follows two unpopulated sections
    68000000-77ffffff in this zone. 7b788000-7b7fffff is the first one
    after a gap. This makes memmap_init_zone() skip all the pfns up to the
    beginning of this range. But this range is not pageblock (2M) aligned.
    In fact no range has to be.

    crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b787000 7b788000
    PAGE PHYSICAL MAPPING INDEX CNT FLAGS
    ffffea0001e00000 78000000 0 0 0 0
    ffffea0001ed7fc0 7b5ff000 0 0 0 0
    ffffea0001ed8000 7b600000 0 0 0 0 <<<<
    ffffea0001ede1c0 7b787000 0 0 0 0
    ffffea0001ede200 7b788000 0 0 1 1fffff00000000

    Top part of page flags should contain nodeid and zonenr, which is not
    the case for page ffffea0001ed8000 here (<<<).

    crash> log | grep -o fffea0001ed[^\ ]* | sort -u
    fffea0001ed8000
    fffea0001eded20
    fffea0001edffc0

    crash> bt -r | grep -o fffea0001ed[^\ ]* | sort -u
    fffea0001ed8000
    fffea0001eded00
    fffea0001eded20
    fffea0001edffc0

    Initialization of the whole beginning of the section is skipped up to
    the start of the range due to commit b92df1de5d28. Now any code
    calling move_freepages_block() (like reusing the page from a freelist as
    in this example) with a page from the beginning of the range will get
    the page rounded down to start_page ffffea0001ed8000 and passed to
    move_freepages() which crashes on assertion getting wrong zonenr.

    > VM_BUG_ON(page_zone(start_page) != page_zone(end_page));

    Note, page_zone() derives the zone from page flags here.

    From similar machine before commit b92df1de5d28:

    crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b7fe000 7b7ff000
    PAGE PHYSICAL MAPPING INDEX CNT FLAGS
    fffff73941e00000 78000000 0 0 1 1fffff00000000
    fffff73941ed7fc0 7b5ff000 0 0 1 1fffff00000000
    fffff73941ed8000 7b600000 0 0 1 1fffff00000000
    fffff73941edff80 7b7fe000 0 0 1 1fffff00000000
    fffff73941edffc0 7b7ff000 ffff8e67e04d3ae0 ad84 1 1fffff00020068 uptodate,lru,active,mappedtodisk

    All the pages since the beginning of the section are initialized, so
    move_freepages() is not going to blow up.

    The same machine with this fix applied:

    crash> kmem -p 77fff000 78000000 7b5ff000 7b600000 7b7fe000 7b7ff000
    PAGE PHYSICAL MAPPING INDEX CNT FLAGS
    ffffea0001e00000 78000000 0 0 0 0
    ffffea0001e00000 7b5ff000 0 0 0 0
    ffffea0001ed8000 7b600000 0 0 1 1fffff00000000
    ffffea0001edff80 7b7fe000 0 0 1 1fffff00000000
    ffffea0001edffc0 7b7ff000 ffff88017fb13720 8 2 1fffff00020068 uptodate,lru,active,mappedtodisk

    At least the bare minimum of pages is initialized preventing the crash
    as well.

    Customers started to report this as soon as 7.4 (where b92df1de5d28 was
    merged in RHEL) was released. I remember reports from
    September/October-ish times. It's not easily reproduced and happens on
    a handful of machines only. I guess that's why. But that does not make
    it less serious, I think.

    Though there actually is a report here:
    https://bugzilla.kernel.org/show_bug.cgi?id=196443

    And there are reports for Fedora from July:
    https://bugzilla.redhat.com/show_bug.cgi?id=1473242
    and CentOS:
    https://bugs.centos.org/view.php?id=13964
    and we internally track several dozen reports for RHEL bug
    https://bugzilla.redhat.com/show_bug.cgi?id=1525121

    Link: http://lkml.kernel.org/r/0485727b2e82da7efbce5f6ba42524b429d0391a.1520011945.git.neelx@redhat.com
    Fixes: b92df1de5d28 ("mm: page_alloc: skip over regions of invalid pfns where possible")
    Signed-off-by: Daniel Vacek
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Paul Burton
    Cc: Pavel Tatashin
    Cc: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Vacek