28 Feb, 2018

3 commits

  • commit b7f0554a56f21fb3e636a627450a9add030889be upstream.

    Until there is a solution to the dma-to-dax vs truncate problem it is
    not safe to allow V4L2, Exynos, and other frame vector users to create
    long-standing / irrevocable memory registrations against filesystem-dax
    vmas.

    [dan.j.williams@intel.com: add comment for vma_is_fsdax() check in get_vaddr_frames(), per Jan]
    Link: http://lkml.kernel.org/r/151197874035.26211.4061781453123083667.stgit@dwillia2-desk3.amr.corp.intel.com
    Link: http://lkml.kernel.org/r/151068939985.7446.15684639617389154187.stgit@dwillia2-desk3.amr.corp.intel.com
    Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
    Signed-off-by: Dan Williams
    Reviewed-by: Jan Kara
    Cc: Inki Dae
    Cc: Seung-Woo Kim
    Cc: Joonyoung Shim
    Cc: Kyungmin Park
    Cc: Mauro Carvalho Chehab
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Christoph Hellwig
    Cc: Doug Ledford
    Cc: Hal Rosenstock
    Cc: Jason Gunthorpe
    Cc: Jeff Moyer
    Cc: Ross Zwisler
    Cc: Sean Hefty
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     
  • commit 2bb6d2837083de722bfdc369cb0d76ce188dd9b4 upstream.

    Patch series "introduce get_user_pages_longterm()", v2.

    Here is a new get_user_pages api for cases where a driver intends to
    keep an elevated page count indefinitely. This is distinct from usages
    like iov_iter_get_pages where the elevated page counts are transient.
    The iov_iter_get_pages cases immediately turn around and submit the
    pages to a device driver which will put_page when the i/o operation
    completes (under kernel control).

    In the longterm case userspace is responsible for dropping the page
    reference at some undefined point in the future. This is untenable for
    the filesystem-dax case, where the filesystem is in control of the lifetime
    of the block / page and needs reasonable limits on how long it can wait
    for pages in a mapping to become idle.

    Fixing filesystems to actually wait for dax pages to be idle before
    blocks from a truncate/hole-punch operation are repurposed is saved for
    a later patch series.

    Also, allowing longterm registration of dax mappings is a future patch
    series that introduces a "map with lease" semantic where the kernel can
    revoke a lease and force userspace to drop its page references.

    I have also tagged these for -stable to purposely break cases that might
    assume that longterm memory registrations for filesystem-dax mappings
    were supported by the kernel. The behavior regression this policy
    change implies is one of the reasons we maintain the "dax enabled.
    Warning: EXPERIMENTAL, use at your own risk" notification when mounting
    a filesystem in dax mode.

    It is worth noting the device-dax interface does not suffer the same
    constraints since it does not support file space management operations
    like hole-punch.

    This patch (of 4):

    Until there is a solution to the dma-to-dax vs truncate problem it is
    not safe to allow long-standing memory registrations against
    filesystem-dax vmas. Device-dax vmas do not have this problem and are
    explicitly allowed.

    This is temporary until a "memory registration with layout-lease"
    mechanism can be implemented for the affected sub-systems (RDMA and
    V4L2).

    [akpm@linux-foundation.org: use kcalloc()]
    Link: http://lkml.kernel.org/r/151068939435.7446.13560129395419350737.stgit@dwillia2-desk3.amr.corp.intel.com
    Fixes: 3565fce3a659 ("mm, x86: get_user_pages() for dax mappings")
    Signed-off-by: Dan Williams
    Suggested-by: Christoph Hellwig
    Cc: Doug Ledford
    Cc: Hal Rosenstock
    Cc: Inki Dae
    Cc: Jan Kara
    Cc: Jason Gunthorpe
    Cc: Jeff Moyer
    Cc: Joonyoung Shim
    Cc: Kyungmin Park
    Cc: Mauro Carvalho Chehab
    Cc: Mel Gorman
    Cc: Ross Zwisler
    Cc: Sean Hefty
    Cc: Seung-Woo Kim
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     
  • commit d0f0931de936a0a468d7e59284d39581c16d3a73 upstream.

    When the pmd_devmap() checks were added by 5c7fb56e5e3f ("mm, dax:
    dax-pmd vs thp-pmd vs hugetlbfs-pmd") to add better support for DAX huge
    pages, they were all added to the end of if() statements after existing
    pmd_trans_huge() checks. So, things like:

    - if (pmd_trans_huge(*pmd))
    + if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))

    When further checks were added after pmd_trans_unstable() checks by
    commit 7267ec008b5c ("mm: postpone page table allocation until we have
    page to map") they were also added at the end of the conditional:

    + if (pmd_trans_unstable(fe->pmd) || pmd_devmap(*fe->pmd))

    This ordering is fine for pmd_trans_huge(), but doesn't work for
    pmd_trans_unstable(). This is because DAX huge pages trip the bad_pmd()
    check inside of pmd_none_or_trans_huge_or_clear_bad() (called by
    pmd_trans_unstable()), which prints out a warning and returns 1. So, we
    do end up doing the right thing, but only after spamming dmesg with
    suspicious-looking messages:

    mm/pgtable-generic.c:39: bad pmd ffff8808daa49b88(84000001006000a5)

    Reorder these checks in a helper so that pmd_devmap() is checked first,
    avoiding the error messages, and add a comment explaining why the
    ordering is important.

    Fixes: commit 7267ec008b5c ("mm: postpone page table allocation until we have page to map")
    Link: http://lkml.kernel.org/r/20170522215749.23516-1-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Cc: Pawel Lebioda
    Cc: "Darrick J. Wong"
    Cc: Alexander Viro
    Cc: Christoph Hellwig
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: Matthew Wilcox
    Cc: "Kirill A . Shutemov"
    Cc: Dave Jiang
    Cc: Xiong Zhou
    Cc: Eryu Guan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Ross Zwisler
     

25 Feb, 2018

5 commits

  • commit f1f5929cd9715c1cdfe07a890f12ac7d2c5304ec upstream.

    Compiling shmem.c with SHMEM and TRANSPARENT_HUGE_PAGECACHE enabled
    raises warnings on two unused functions when CONFIG_TMPFS and
    CONFIG_SYSFS are both disabled:

    mm/shmem.c:390:20: warning: `shmem_format_huge' defined but not used [-Wunused-function]
    static const char *shmem_format_huge(int huge)
    ^~~~~~~~~~~~~~~~~
    mm/shmem.c:373:12: warning: `shmem_parse_huge' defined but not used [-Wunused-function]
    static int shmem_parse_huge(const char *str)
    ^~~~~~~~~~~~~~~~

    A conditional compilation on tmpfs or sysfs removes the warnings.
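
    As a standalone illustration of the technique (a sketch only; the
    FEATURE_* macros below are made-up stand-ins for CONFIG_TMPFS and
    CONFIG_SYSFS, not the actual mm/shmem.c hunk), helpers that are only
    used by optional features can be compiled out so -Wunused-function
    stays quiet when both features are off:

    #include <stdio.h>

    /* stand-ins for CONFIG_SYSFS / CONFIG_TMPFS; flip to 1 to compile the helper in */
    #define FEATURE_SYSFS 0
    #define FEATURE_TMPFS 0

    #if FEATURE_SYSFS || FEATURE_TMPFS
    /* only referenced by the optional sysfs/tmpfs code paths */
    static const char *format_huge(int huge)
    {
            return huge ? "always" : "never";
    }
    #endif

    int main(void)
    {
    #if FEATURE_SYSFS || FEATURE_TMPFS
            printf("huge=%s\n", format_huge(1));
    #endif
            return 0;
    }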

    Link: http://lkml.kernel.org/r/20161118055749.11313-1-jeremy.lefaure@lse.epita.fr
    Signed-off-by: Jérémy Lefaure
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jérémy Lefaure
     
  • commit 23f919d4ad0eb325595f10f55be4301b2965d6d6 upstream.

    After enabling -Wmaybe-uninitialized warnings, we get a false-positive
    warning for shmem:

    mm/shmem.c: In function `shmem_getpage_gfp':
    include/linux/spinlock.h:332:21: error: `info' may be used uninitialized in this function [-Werror=maybe-uninitialized]

    This can be easily avoided, since the correct 'info' pointer is known at
    the time we first enter the function, so we can simply move the
    initialization up. Moving it before the first label avoids the warning
    and lets us remove two later initializations.

    Note that the function is so hard to read that it not only confuses the
    compiler, but also most readers; without this patch it could easily
    break if one of the 'goto's changed.
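
    A compilable sketch of the general pattern (the names are hypothetical,
    not the shmem code): when a variable is only assigned on some paths
    before a backward goto, -Wmaybe-uninitialized can fire at the use;
    hoisting the assignment above the first label silences the warning and
    makes later re-assignments redundant:

    #include <stdio.h>

    struct info { int value; };

    static struct info *lookup_info(int key)
    {
            static struct info cached = { 42 };
            (void)key;
            return &cached;
    }

    static int getpage_like(int key, int retries)
    {
            struct info *info = lookup_info(key);   /* moved above the label */
            int found;

    repeat:
            found = (info->value == 42);
            if (!found && retries-- > 0)
                    goto repeat;                    /* backward jump that confused the analysis */
            return found;
    }

    int main(void)
    {
            printf("%d\n", getpage_like(1, 3));
            return 0;
    }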

    Link: https://www.spinics.net/lists/kernel/msg2368133.html
    Link: http://lkml.kernel.org/r/20161024205725.786455-1-arnd@arndb.de
    Signed-off-by: Arnd Bergmann
    Acked-by: Michal Hocko
    Acked-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Andreas Gruenbacher
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Arnd Bergmann
     
  • [ Upstream commit 7f6f60a1ba52538c16f26930bfbcfe193d9d746a ]

    earlyprintk=efi,keep does not work any more: it triggers the warning
    in mm/early_ioremap.c, WARN_ON(system_state != SYSTEM_BOOTING), and
    boot just hangs because the warning is printed via earlyprintk from
    within the earlyprintk implementation code itself.

    This is caused by a newly introduced middle state in:

    69a78ff226fe ("init: Introduce SYSTEM_SCHEDULING state")

    early_ioremap() is fine in both the SYSTEM_BOOTING and SYSTEM_SCHEDULING
    states, so the original condition should be updated accordingly.

    Signed-off-by: Dave Young
    Acked-by: Thomas Gleixner
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: bp@suse.de
    Cc: linux-efi@vger.kernel.org
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20171209041610.GA3249@dhcp-128-65.nay.redhat.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Dave Young
     
  • commit f35157417215ec138c920320c746fdb3e04ef1d5 upstream.

    Provide a function, kmemdup_nul(), that will create a NUL-terminated string
    from an unterminated character array where the length is known in advance.

    This is better than kstrndup() in situations where we already know the
    string length as the strnlen() in kstrndup() is superfluous.
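
    A userspace sketch of the same idea (kmemdup_nul() itself is a kernel
    helper that uses kmalloc() and gfp flags; the function below is only an
    illustration): duplicate a known-length, possibly unterminated buffer
    and NUL-terminate it, skipping the extra length scan a kstrndup()-style
    helper would perform:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static char *memdup_nul(const char *s, size_t len)
    {
            char *buf = malloc(len + 1);

            if (!buf)
                    return NULL;
            memcpy(buf, s, len);      /* no strnlen() pass: the length is already known */
            buf[len] = '\0';
            return buf;
    }

    int main(void)
    {
            const char raw[4] = { 'd', 'a', 't', 'a' };   /* not NUL-terminated */
            char *s = memdup_nul(raw, sizeof(raw));

            if (s) {
                    printf("%s\n", s);
                    free(s);
            }
            return 0;
    }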

    Signed-off-by: David Howells
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    David Howells
     
  • commit bb422a738f6566f7439cd347d54e321e4fe92a9f upstream.

    Syzbot caught an oops at unregister_shrinker() because the combination of
    commit 1d3d4437eae1bb29 ("vmscan: per-node deferred work") and fault
    injection made register_shrinker() fail, and the caller of
    register_shrinker() did not check for failure.

    ----------
    [ 554.881422] FAULT_INJECTION: forcing a failure.
    [ 554.881422] name failslab, interval 1, probability 0, space 0, times 0
    [ 554.881438] CPU: 1 PID: 13231 Comm: syz-executor1 Not tainted 4.14.0-rc8+ #82
    [ 554.881443] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    [ 554.881445] Call Trace:
    [ 554.881459] dump_stack+0x194/0x257
    [ 554.881474] ? arch_local_irq_restore+0x53/0x53
    [ 554.881486] ? find_held_lock+0x35/0x1d0
    [ 554.881507] should_fail+0x8c0/0xa40
    [ 554.881522] ? fault_create_debugfs_attr+0x1f0/0x1f0
    [ 554.881537] ? check_noncircular+0x20/0x20
    [ 554.881546] ? find_next_zero_bit+0x2c/0x40
    [ 554.881560] ? ida_get_new_above+0x421/0x9d0
    [ 554.881577] ? find_held_lock+0x35/0x1d0
    [ 554.881594] ? __lock_is_held+0xb6/0x140
    [ 554.881628] ? check_same_owner+0x320/0x320
    [ 554.881634] ? lock_downgrade+0x990/0x990
    [ 554.881649] ? find_held_lock+0x35/0x1d0
    [ 554.881672] should_failslab+0xec/0x120
    [ 554.881684] __kmalloc+0x63/0x760
    [ 554.881692] ? lock_downgrade+0x990/0x990
    [ 554.881712] ? register_shrinker+0x10e/0x2d0
    [ 554.881721] ? trace_event_raw_event_module_request+0x320/0x320
    [ 554.881737] register_shrinker+0x10e/0x2d0
    [ 554.881747] ? prepare_kswapd_sleep+0x1f0/0x1f0
    [ 554.881755] ? _down_write_nest_lock+0x120/0x120
    [ 554.881765] ? memcpy+0x45/0x50
    [ 554.881785] sget_userns+0xbcd/0xe20
    (...snipped...)
    [ 554.898693] kasan: CONFIG_KASAN_INLINE enabled
    [ 554.898724] kasan: GPF could be caused by NULL-ptr deref or user memory access
    [ 554.898732] general protection fault: 0000 [#1] SMP KASAN
    [ 554.898737] Dumping ftrace buffer:
    [ 554.898741] (ftrace buffer empty)
    [ 554.898743] Modules linked in:
    [ 554.898752] CPU: 1 PID: 13231 Comm: syz-executor1 Not tainted 4.14.0-rc8+ #82
    [ 554.898755] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    [ 554.898760] task: ffff8801d1dbe5c0 task.stack: ffff8801c9e38000
    [ 554.898772] RIP: 0010:__list_del_entry_valid+0x7e/0x150
    [ 554.898775] RSP: 0018:ffff8801c9e3f108 EFLAGS: 00010246
    [ 554.898780] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
    [ 554.898784] RDX: 0000000000000000 RSI: ffff8801c53c6f98 RDI: ffff8801c53c6fa0
    [ 554.898788] RBP: ffff8801c9e3f120 R08: 1ffff100393c7d55 R09: 0000000000000004
    [ 554.898791] R10: ffff8801c9e3ef70 R11: 0000000000000000 R12: 0000000000000000
    [ 554.898795] R13: dffffc0000000000 R14: 1ffff100393c7e45 R15: ffff8801c53c6f98
    [ 554.898800] FS: 0000000000000000(0000) GS:ffff8801db300000(0000) knlGS:0000000000000000
    [ 554.898804] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
    [ 554.898807] CR2: 00000000dbc23000 CR3: 00000001c7269000 CR4: 00000000001406e0
    [ 554.898813] DR0: 0000000020000000 DR1: 0000000020000000 DR2: 0000000000000000
    [ 554.898816] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
    [ 554.898818] Call Trace:
    [ 554.898828] unregister_shrinker+0x79/0x300
    [ 554.898837] ? perf_trace_mm_vmscan_writepage+0x750/0x750
    [ 554.898844] ? down_write+0x87/0x120
    [ 554.898851] ? deactivate_super+0x139/0x1b0
    [ 554.898857] ? down_read+0x150/0x150
    [ 554.898864] ? check_same_owner+0x320/0x320
    [ 554.898875] deactivate_locked_super+0x64/0xd0
    [ 554.898883] deactivate_super+0x141/0x1b0
    ----------

    Since allowing register_shrinker() callers to call unregister_shrinker()
    when register_shrinker() failed can simplify the error recovery path, this
    patch makes unregister_shrinker() a no-op when register_shrinker() failed.
    Also, reset shrinker->nr_deferred in case unregister_shrinker() is
    erroneously called twice.
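
    A minimal userspace sketch of that pattern (the field and function names
    are stand-ins, not the vmscan code): a failed setup leaves the per-object
    state NULL, and teardown both bails out on NULL and resets the pointer so
    a second call stays harmless:

    #include <stdio.h>
    #include <stdlib.h>

    struct shrinker_like {
            long *nr_deferred;     /* NULL means "never successfully registered" */
    };

    static int register_it(struct shrinker_like *s, int simulate_failure)
    {
            if (simulate_failure)
                    return -1;                       /* nr_deferred stays NULL */
            s->nr_deferred = calloc(4, sizeof(long));
            return s->nr_deferred ? 0 : -1;
    }

    static void unregister_it(struct shrinker_like *s)
    {
            if (!s->nr_deferred)                     /* no-op after a failed register */
                    return;
            free(s->nr_deferred);
            s->nr_deferred = NULL;                   /* keeps a double unregister harmless */
    }

    int main(void)
    {
            struct shrinker_like s = { NULL };

            if (register_it(&s, 1))
                    fprintf(stderr, "register failed\n");
            unregister_it(&s);    /* safe even though register failed */
            unregister_it(&s);    /* safe even though already called  */
            return 0;
    }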

    Signed-off-by: Tetsuo Handa
    Signed-off-by: Aliaksei Karaliou
    Reported-by: syzbot
    Cc: Glauber Costa
    Cc: Al Viro
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Tetsuo Handa
     

22 Feb, 2018

1 commit

  • commit af27d9403f5b80685b79c88425086edccecaf711 upstream.

    We get a warning about some slow configurations in randconfig kernels:

    mm/memory.c:83:2: error: #warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_cpupid. [-Werror=cpp]

    The warning is reasonable by itself, but gets in the way of randconfig
    build testing, so I'm hiding it whenever CONFIG_COMPILE_TEST is set.
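
    A toy version of that approach (COMPILE_TEST below is a stand-in macro
    for CONFIG_COMPILE_TEST and the message is abbreviated): the diagnostic
    still fires for real configurations but stays quiet for compile-test
    builds:

    #include <stdio.h>

    #define COMPILE_TEST 1     /* stand-in for CONFIG_COMPILE_TEST */

    #if !COMPILE_TEST
    #warning "Unfortunate NUMA config, growing page-frame for last_cpupid."
    #endif

    int main(void)
    {
            puts("builds cleanly when the compile-test knob is set");
            return 0;
    }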

    The warning was added in 2013 in commit 75980e97dacc ("mm: fold
    page->_last_nid into page->flags where possible").

    Cc: stable@vger.kernel.org
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Arnd Bergmann
     

04 Feb, 2018

1 commit

  • [ Upstream commit bde5f6bc68db51128f875a756e9082a6c6ff7b4c ]

    kmemleak_scan() will scan struct page for each node, which can be really
    large, resulting in a soft lockup. We have seen a soft lockup when
    doing a scan while compiling a kernel:

    watchdog: BUG: soft lockup - CPU#53 stuck for 22s! [bash:10287]
    [...]
    Call Trace:
    kmemleak_scan+0x21a/0x4c0
    kmemleak_write+0x312/0x350
    full_proxy_write+0x5a/0xa0
    __vfs_write+0x33/0x150
    vfs_write+0xad/0x1a0
    SyS_write+0x52/0xc0
    do_syscall_64+0x61/0x1a0
    entry_SYSCALL64_slow_path+0x25/0x25

    Fix this by adding a cond_resched() call every MAX_SCAN_SIZE.

    Link: http://lkml.kernel.org/r/1511439788-20099-1-git-send-email-xieyisheng1@huawei.com
    Signed-off-by: Yisheng Xie
    Suggested-by: Catalin Marinas
    Acked-by: Catalin Marinas
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Yisheng Xie
     

31 Jan, 2018

5 commits

  • commit c73322d098e4b6f5f0f0fa1330bf57e218775539 upstream.

    Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and
    cleanups".

    Jia reported a scenario in which the kswapd of a node indefinitely spins
    at 100% CPU usage. We have seen similar cases at Facebook.

    The kernel's current method of judging its ability to reclaim a node (or
    whether to back off and sleep) is based on the amount of scanned pages
    in proportion to the amount of reclaimable pages. In Jia's and our
    scenarios, there are no reclaimable pages in the node, however, and the
    condition for backing off is never met. Kswapd busyloops in an attempt
    to restore the watermarks while having nothing to work with.

    This series reworks the definition of an unreclaimable node based not on
    scanning but on whether kswapd is able to actually reclaim pages in
    MAX_RECLAIM_RETRIES (16) consecutive runs. This is the same criteria
    the page allocator uses for giving up on direct reclaim and invoking the
    OOM killer. If it cannot free any pages, kswapd will go to sleep and
    leave further attempts to direct reclaim invocations, which will either
    make progress and re-enable kswapd, or invoke the OOM killer.

    Patch #1 fixes the immediate problem Jia reported, the remainder are
    smaller fixlets, cleanups, and overall phasing out of the old method.

    Patch #6 is the odd one out. It's a nice cleanup to get_scan_count(),
    and directly related to #5, but in itself not relevant to the series.

    If the whole series is too ambitious for 4.11, I would consider the
    first three patches fixes, the rest cleanups.

    This patch (of 9):

    Jia He reports a problem with kswapd spinning at 100% CPU when
    requesting more hugepages than memory available in the system:

    $ echo 4000 >/proc/sys/vm/nr_hugepages

    top - 13:42:59 up 3:37, 1 user, load average: 1.09, 1.03, 1.01
    Tasks: 1 total, 1 running, 0 sleeping, 0 stopped, 0 zombie
    %Cpu(s): 0.0 us, 12.5 sy, 0.0 ni, 85.5 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
    KiB Mem: 31371520 total, 30915136 used, 456384 free, 320 buffers
    KiB Swap: 6284224 total, 115712 used, 6168512 free. 48192 cached Mem

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    76 root 20 0 0 0 0 R 100.0 0.000 217:17.29 kswapd3

    At that time, there are no reclaimable pages left in the node, but as
    kswapd fails to restore the high watermarks it refuses to go to sleep.

    Kswapd needs to back away from nodes that fail to balance. Up until
    commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of
    nodes") kswapd had such a mechanism. It considered zones whose
    theoretically reclaimable pages it had reclaimed six times over as
    unreclaimable and backed away from them. This guard was erroneously
    removed as the patch changed the definition of a balanced node.

    However, simply restoring this code wouldn't help in the case reported
    here: there *are* no reclaimable pages that could be scanned until the
    threshold is met. Kswapd would stay awake anyway.

    Introduce a new and much simpler way of backing off. If kswapd runs
    through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
    page, make it back off from the node. This is the same number of shots
    direct reclaim takes before declaring OOM. Kswapd will go to sleep on
    that node until a direct reclaimer manages to reclaim some pages, thus
    proving the node reclaimable again.
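
    A userspace sketch of that back-off rule (the structure and helper names
    are stand-ins; in the kernel the counter lives in per-node state and is
    also reset when direct reclaim makes progress):

    #include <stdio.h>
    #include <stdbool.h>

    #define MAX_RECLAIM_RETRIES 16

    struct node_state {
            int kswapd_failures;   /* consecutive runs that reclaimed nothing */
    };

    static bool node_reclaimable(const struct node_state *n)
    {
            return n->kswapd_failures < MAX_RECLAIM_RETRIES;
    }

    static void record_run(struct node_state *n, unsigned long nr_reclaimed)
    {
            if (nr_reclaimed)
                    n->kswapd_failures = 0;          /* any progress resets the counter */
            else
                    n->kswapd_failures++;
    }

    int main(void)
    {
            struct node_state n = { 0 };

            while (node_reclaimable(&n))
                    record_run(&n, 0);               /* simulate a node with nothing to reclaim */

            printf("kswapd backs off after %d empty runs\n", n.kswapd_failures);
            return 0;
    }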

    [hannes@cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count]
    Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org
    [shakeelb@google.com: fix condition for throttle_direct_reclaim]
    Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com
    Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Signed-off-by: Shakeel Butt
    Reported-by: Jia He
    Tested-by: Jia He
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Cc: Dmitry Shmidt
    Signed-off-by: Greg Kroah-Hartman

    Johannes Weiner
     
  • commit b050e3769c6b4013bb937e879fc43bf1847ee819 upstream.

    Since commit 97a16fc82a7c ("mm, page_alloc: only enforce watermarks for
    order-0 allocations"), __zone_watermark_ok() check for high-order
    allocations will shortcut per-migratetype free list checks for
    ALLOC_HARDER allocations, and return true as long as there's free page
    of any migratetype. The intention is that ALLOC_HARDER can allocate
    from MIGRATE_HIGHATOMIC free lists, while normal allocations can't.

    However, as a side effect, the watermark check will then also return
    true when there are pages only on the MIGRATE_ISOLATE list, or (prior to
    CMA conversion to ZONE_MOVABLE) on the MIGRATE_CMA list. Since the
    allocation cannot actually obtain isolated pages, and might not be able
    to obtain CMA pages, this can result in a false positive.

    The condition should be rare and perhaps the outcome is not a fatal one.
    Still, it's better if the watermark check is correct. There also
    shouldn't be a performance tradeoff here.

    Link: http://lkml.kernel.org/r/20171102125001.23708-1-vbabka@suse.cz
    Fixes: 97a16fc82a7c ("mm, page_alloc: only enforce watermarks for order-0 allocations")
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Rik van Riel
    Cc: David Rientjes
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     
  • commit e048cb32f69038aa1c8f11e5c1b331be4181659d upstream.

    The align_offset parameter is used by bitmap_find_next_zero_area_off()
    to represent the offset of map's base from the previous alignment
    boundary; the function ensures that the returned index, plus the
    align_offset, honors the specified align_mask.

    The logic introduced by commit b5be83e308f7 ("mm: cma: align to physical
    address, not CMA region position") has the cma driver calculate the
    offset to the *next* alignment boundary. In most cases, the base
    alignment is greater than that specified when making allocations,
    resulting in a zero offset whether we align up or down. In the example
    given with the commit, the base alignment (8MB) was half the requested
    alignment (16MB) so the math also happened to work since the offset is
    8MB in both directions. However, when requesting allocations with an
    alignment greater than twice that of the base, the returned index would
    not be correctly aligned.
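
    A small worked example of the offset arithmetic (a sketch, not the
    cma_bitmap_aligned_offset() code): the bitmap search needs the base's
    offset past the previous alignment boundary, and rounding toward the
    next boundary only happens to agree while the requested alignment is no
    more than twice the base alignment:

    #include <stdio.h>

    static unsigned long off_from_prev(unsigned long base, unsigned long align)
    {
            return base & (align - 1);                           /* distance past previous boundary */
    }

    static unsigned long off_to_next(unsigned long base, unsigned long align)
    {
            return (align - (base & (align - 1))) & (align - 1); /* distance up to next boundary */
    }

    int main(void)
    {
            unsigned long base = 8UL << 20;    /* region base aligned to 8MB */

            /* 16MB request: both give 8MB, so the wrong formula stays hidden */
            printf("align 16MB: prev=%luMB next=%luMB\n",
                   off_from_prev(base, 16UL << 20) >> 20,
                   off_to_next(base, 16UL << 20) >> 20);

            /* 32MB request: 8MB vs 24MB, so the returned index is misaligned */
            printf("align 32MB: prev=%luMB next=%luMB\n",
                   off_from_prev(base, 32UL << 20) >> 20,
                   off_to_next(base, 32UL << 20) >> 20);
            return 0;
    }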

    Also, the align_order arguments of cma_bitmap_aligned_mask() and
    cma_bitmap_aligned_offset() should not be negative so the argument type
    was made unsigned.

    Fixes: b5be83e308f7 ("mm: cma: align to physical address, not CMA region position")
    Link: http://lkml.kernel.org/r/20170628170742.2895-1-opendmb@gmail.com
    Signed-off-by: Angus Clark
    Signed-off-by: Doug Berger
    Acked-by: Gregory Fong
    Cc: Doug Berger
    Cc: Angus Clark
    Cc: Laura Abbott
    Cc: Vlastimil Babka
    Cc: Greg Kroah-Hartman
    Cc: Lucas Stach
    Cc: Catalin Marinas
    Cc: Shiraz Hashim
    Cc: Jaewon Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Greg Kroah-Hartman

    Doug Berger
     
  • commit 18365225f0440d09708ad9daade2ec11275c3df9 upstream.

    Laurent Dufour has noticed that hwpoisoned pages are kept charged. In
    his particular case he has hit a bad_page("page still charged to
    cgroup") when onlining a hwpoison page. While this looks like something
    that shouldn't happen in the first place, because onlining hwpages and
    returning them to the page allocator makes only little sense, it shows a
    real problem.

    hwpoison pages do not get freed usually so we do not uncharge them (at
    least not since commit 0a31bc97c80c ("mm: memcontrol: rewrite uncharge
    API")). Each charge pins memcg (since e8ea14cc6ead ("mm: memcontrol:
    take a css reference for each charged page")) as well and so the
    mem_cgroup and the associated state will never go away. Fix this leak
    by forcibly uncharging a LRU hwpoisoned page in delete_from_lru_cache().
    We also have to tweak uncharge_list because it cannot rely on zero ref
    count for these pages.

    [akpm@linux-foundation.org: coding-style fixes]
    Fixes: 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API")
    Link: http://lkml.kernel.org/r/20170502185507.GB19165@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reported-by: Laurent Dufour
    Tested-by: Laurent Dufour
    Reviewed-by: Balbir Singh
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     
  • commit 561b5e0709e4a248c67d024d4d94b6e31e3edf2f upstream.

    Commit 1be7107fbe18 ("mm: larger stack guard gap, between vmas") has
    introduced a regression in some Rust and Java environments which are
    trying to implement their own stack guard page. They are punching a new
    MAP_FIXED mapping inside the existing stack vma.

    This will confuse expand_{downwards,upwards} into thinking that the
    stack expansion would in fact get us too close to an existing non-stack
    vma, which is correct behavior wrt safety. It is a real regression on
    the other hand.

    Let's work around the problem by considering a PROT_NONE mapping as a part
    of the stack. This is a gross hack, but overflowing into such a mapping
    would trap anyway, and we can only hope that userspace knows what it is
    doing and handles it properly.
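
    For reference, a minimal userspace example of what such runtimes do (a
    plain anonymous area stands in for a real thread stack): a PROT_NONE
    mapping is punched into the existing area with MAP_FIXED, and with this
    workaround the kernel counts that vma as part of the stack for the
    guard-gap check:

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t sz = 1UL << 20;             /* 1MB area standing in for a thread stack */
            char *area = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (area == MAP_FAILED)
                    return 1;

            /* overlay a PROT_NONE guard page at the low end of the area */
            if (mmap(area, 4096, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED)
                    return 1;

            printf("guard page installed at %p\n", (void *)area);
            return 0;
    }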

    Fixes: 1be7107fbe18 ("mm: larger stack guard gap, between vmas")
    Link: http://lkml.kernel.org/r/20170705182849.GA18027@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Debugged-by: Vlastimil Babka
    Cc: Ben Hutchings
    Cc: Willy Tarreau
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

17 Jan, 2018

1 commit

  • commit fd5bb66cd934987e49557455b6497fc006521940 upstream.

    Change the zpool/compressor param callback function to release the
    zswap_pools_lock spinlock before calling param_set_charp, since that
    function may sleep when it calls kmalloc with GFP_KERNEL.

    While this problem has existed for a while, I wasn't able to trigger it
    using a tight loop changing either/both the zpool and compressor params; I
    think it's very unlikely to be an issue on the stable kernels, especially
    since most zswap users will change the compressor and/or zpool from sysfs
    only one time each boot - or zero times, if they add the params to the
    kernel boot.

    Fixes: c99b42c3529e ("zswap: use charp for zswap param strings")
    Link: http://lkml.kernel.org/r/20170126155821.4545-1-ddstreet@ieee.org
    Signed-off-by: Dan Streetman
    Reported-by: Sergey Senozhatsky
    Cc: Michal Hocko
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Greg Kroah-Hartman

    Dan Streetman
     

05 Jan, 2018

1 commit

  • The kaiser update made an interesting choice, never to free any shadow
    page tables. Contention on global spinlock was worrying, particularly
    with it held across page table scans when freeing. Something had to be
    done: I was going to add refcounting; but simply never to free them is
    an appealing choice, minimizing contention without complicating the code
    (the more a page table is found already, the less the spinlock is used).

    But leaking pages in this way is also a worry: can we get away with it?
    At the very least, we need a count to show how bad it actually gets:
    in principle, one might end up wasting about 1/256 of memory that way
    (1/512 for when direct-mapped pages have to be user-mapped, plus 1/512
    for when they are user-mapped from the vmalloc area on another occasion
    (but we don't have vmalloc'ed stacks, so only large ldts are vmalloc'ed)).

    Add per-cpu stat NR_KAISERTABLE: including 256 at startup for the
    shared pgd entries, and 1 for each intermediate page table added
    thereafter for user-mapping - but leave out the 1 per mm, for its
    shadow pgd, because that distracts from the monotonic increase.
    Shown in /proc/vmstat as nr_overhead (0 if kaiser not enabled).

    In practice, it doesn't look so bad so far: more like 1/12000 after
    nine hours of gtests below; and movable pageblock segregation should
    tend to cluster the kaiser tables into a subset of the address space
    (if not, they will be bad for compaction too). But production may
    tell a different story: keep an eye on this number, and bring back
    lighter freeing if it gets out of control (maybe a shrinker).

    ["nr_overhead" should of course say "nr_kaisertable", if it needs
    to stay; but for the moment we are being coy, preferring that when
    Joe Blow notices a new line in his /proc/vmstat, he does not get
    too curious about what this "kaiser" stuff might be.]

    Signed-off-by: Hugh Dickins
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     

14 Dec, 2017

3 commits

  • [ Upstream commit 1aedcafbf32b3f232c159b14cd0d423fcfe2b861 ]

    Use BUG_ON(in_interrupt()) in zs_map_object(). This is not a new
    BUG_ON(), it's always been there, but was recently changed to
    VM_BUG_ON(). There are several problems there. First, we use
    per-CPU mappings both in zsmalloc and in zram, and an interrupt may easily
    corrupt those buffers. Second, and more importantly, we believe it's
    possible to start leaking sensitive information. Consider the following
    case:

    -> process P
       swap out
        zram
         per-cpu mapping CPU1
          compress page A
           -> IRQ

             swap out
              zram
               per-cpu mapping CPU1
                compress page B
                write page from per-cpu mapping CPU1 to zsmalloc pool
             iret

    -> process P
       write page from per-cpu mapping CPU1 to zsmalloc pool [*]
       return

    * so we store overwritten data that actually belongs to another
    page (task) and potentially contains sensitive data. And when
    process P later page faults, it's going to read (swap in) that
    other task's data.

    Link: http://lkml.kernel.org/r/20170929045140.4055-1-sergey.senozhatsky@gmail.com
    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Sergey Senozhatsky
     
  • commit ced108037c2aa542b3ed8b7afd1576064ad1362a upstream.

    In the prot_numa case, we are under down_read(mmap_sem). It's critical
    not to clear the pmd intermittently, to avoid a race with MADV_DONTNEED,
    which is also under down_read(mmap_sem):

    CPU0:                                CPU1:
                                         change_huge_pmd(prot_numa=1)
                                          pmdp_huge_get_and_clear_notify()
    madvise_dontneed()
     zap_pmd_range()
      pmd_trans_huge(*pmd) == 0 (without ptl)
      // skip the pmd
                                          set_pmd_at();
                                          // pmd is re-established

    The race makes MADV_DONTNEED miss the huge pmd and not clear it,
    which may break userspace.

    Found by code analysis; never seen triggered.

    Link: http://lkml.kernel.org/r/20170302151034.27829-3-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    [jwang: adjust context for 4.9 ]
    Signed-off-by: Jack Wang
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     
  • commit 0a85e51d37645e9ce57e5e1a30859e07810ed07c upstream.

    Patch series "thp: fix few MADV_DONTNEED races"

    For MADV_DONTNEED to work properly with huge pages, it's critical to not
    clear pmd intermittently unless you hold down_write(mmap_sem).

    Otherwise MADV_DONTNEED can miss the THP which can lead to userspace
    breakage.

    See example of such race in commit message of patch 2/4.

    All these races were found by code inspection. I haven't seen them
    triggered. I don't think it's worth applying them to stable@.

    This patch (of 4):

    Restructure code in preparation for a fix.

    Link: http://lkml.kernel.org/r/20170302151034.27829-2-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    [jwang: adjust context for 4.9]
    Signed-off-by: Jack Wang
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     

10 Dec, 2017

2 commits

  • [ Upstream commit 2df26639e708a88dcc22171949da638a9998f3bc ]

    Jia He has noticed that commit b9f00e147f27 ("mm, page_alloc: reduce
    branches in zone_statistics") has an unintentional side effect: remote
    node allocation requests are accounted as NUMA_MISS rather than
    NUMA_HIT and NUMA_OTHER if such a request doesn't use __GFP_OTHER_NODE.

    There are potentially many of these because the flag is used very rarely
    while we have many users of __alloc_pages_node.

    Fix this by simply ignoring __GFP_OTHER_NODE (it can be removed in a
    follow up patch) and treat all allocations that were satisfied from the
    preferred zone's node as NUMA_HITS because this is the same node we
    requested the allocation from in most cases. If this is not the local
    node then we just account it as NUMA_OTHER rather than NUMA_LOCAL.

    One downside would be that an allocation request for a node which is
    outside of the mempolicy nodemask would be reported as a hit, which is a
    bit weird, but that was the case before b9f00e147f27 already.

    Fixes: b9f00e147f27 ("mm, page_alloc: reduce branches in zone_statistics")
    Link: http://lkml.kernel.org/r/20170102153057.9451-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Jia He
    Reviewed-by: Vlastimil Babka # with cbmc[1] superpowers
    Acked-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Taku Izumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     
  • commit 687cb0884a714ff484d038e9190edc874edcf146 upstream.

    tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory
    space. In this case, tlb->fullmm is true. Some archs like arm64
    don't flush the TLB when tlb->fullmm is true:

    commit 5a7862e83000 ("arm64: tlbflush: avoid flushing when fullmm == 1"),

    which causes leaking of TLB entries.

    Will clarifies his patch:
    "Basically, we tag each address space with an ASID (PCID on x86) which
    is resident in the TLB. This means we can elide TLB invalidation when
    pulling down a full mm because we won't ever assign that ASID to
    another mm without doing TLB invalidation elsewhere (which actually
    just nukes the whole TLB).

    I think that means that we could potentially not fault on a kernel
    uaccess, because we could hit in the TLB"

    There could be a window between complete_signal() sending an IPI to other
    cores and all threads sharing this mm really being kicked off their cores.
    In this window, the oom reaper may call tlb_flush_mmu_tlbonly() to
    flush the TLB and then free pages. However, due to the above problem, the
    TLB entries are not really flushed on arm64. Other threads can still
    access these pages through stale TLB entries. Moreover, a copy_to_user()
    can also write to these pages without generating a page fault, causing
    use-after-free bugs.

    This patch gathers each vma instead of gathering the full vm space. In
    this case tlb->fullmm is not true. The behavior of the oom reaper becomes
    similar to munmapping before do_exit, which should be safe for all archs.

    Link: http://lkml.kernel.org/r/20171107095453.179940-1-wangnan0@huawei.com
    Fixes: aac453635549 ("mm, oom: introduce oom reaper")
    Signed-off-by: Wang Nan
    Acked-by: Michal Hocko
    Acked-by: David Rientjes
    Cc: Minchan Kim
    Cc: Will Deacon
    Cc: Bob Liu
    Cc: Ingo Molnar
    Cc: Roman Gushchin
    Cc: Konstantin Khlebnikov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    [backported to 4.9 stable tree]
    Signed-off-by: Michal Hocko
    Signed-off-by: Greg Kroah-Hartman

    Wang Nan
     

05 Dec, 2017

4 commits

  • commit 6ea8d958a2c95a1d514015d4e29ba21a8c0a1a91 upstream.

    MADV_WILLNEED has always been a noop for DAX (formerly XIP) mappings.
    Unfortunately madvise_willneed() doesn't communicate this information
    properly to the generic madvise syscall implementation. The calling
    convention is quite subtle there. madvise_vma() is supposed to either
    return an error or update &prev; otherwise the main loop will never
    advance to the next vma and it will keep looping forever without a way
    to get out of the kernel.

    It seems this has been broken since its introduction. Nobody has noticed
    because nobody seems to be using MADV_WILLNEED on these DAX mappings.

    [mhocko@suse.com: rewrite changelog]
    Link: http://lkml.kernel.org/r/20171127115318.911-1-guoxuenan@huawei.com
    Fixes: fe77ba6f4f97 ("[PATCH] xip: madvice/fadvice: execute in place")
    Signed-off-by: chenjie
    Signed-off-by: guoxuenan
    Acked-by: Michal Hocko
    Cc: Minchan Kim
    Cc: zhangyi (F)
    Cc: Miao Xie
    Cc: Mike Rapoport
    Cc: Shaohua Li
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Kirill A. Shutemov
    Cc: David Rientjes
    Cc: Anshuman Khandual
    Cc: Rik van Riel
    Cc: Carsten Otte
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    chenjie
     
  • commit 31383c6865a578834dd953d9dbc88e6b19fe3997 upstream.

    Patch series "device-dax: fix unaligned munmap handling"

    When device-dax is operating in huge-page mode we want it to behave like
    hugetlbfs and fail attempts to split vmas into unaligned ranges. It
    would be messy to teach the munmap path about device-dax alignment
    constraints in the same (hstate) way that hugetlbfs communicates this
    constraint. Instead, these patches introduce a new ->split() vm
    operation.

    This patch (of 2):

    The device-dax interface has similar constraints as hugetlbfs in that it
    requires the munmap path to unmap in huge page aligned units. Rather
    than add more custom vma handling code in __split_vma() introduce a new
    vm operation to perform this vma specific check.

    Link: http://lkml.kernel.org/r/151130418135.4029.6783191281930729710.stgit@dwillia2-desk3.amr.corp.intel.com
    Fixes: dee410792419 ("/dev/dax, core: file operations and dax-mmap")
    Signed-off-by: Dan Williams
    Cc: Jeff Moyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     
  • commit 63cd448908b5eb51d84c52f02b31b9b4ccd1cb5a upstream.

    If the call to __alloc_contig_migrate_range() in alloc_contig_range()
    returns -EBUSY, processing continues so that test_pages_isolated() is called
    where there is a tracepoint to identify the busy pages. However, it is
    possible for busy pages to become available between the calls to these
    two routines. In this case, the range of pages may be allocated.
    Unfortunately, the original return code (ret == -EBUSY) is still set and
    returned to the caller. Therefore, the caller believes the pages were
    not allocated and they are leaked.

    Update the comment to indicate that allocation is still possible even if
    __alloc_contig_migrate_range returns -EBUSY. Also, clear return code in
    this case so that it is not accidentally used or returned to caller.

    Link: http://lkml.kernel.org/r/20171122185214.25285-1-mike.kravetz@oracle.com
    Fixes: 8ef5849fa8a2 ("mm/cma: always check which page caused allocation failure")
    Signed-off-by: Mike Kravetz
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Acked-by: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Laura Abbott
    Cc: Michal Hocko
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mike Kravetz
     
  • commit a8f97366452ed491d13cf1e44241bc0b5740b1f0 upstream.

    Currently, we unconditionally make the page table entry dirty in touch_pmd().
    It may result in a false-positive can_follow_write_pmd().

    We can avoid the situation by only making the page table entry
    dirty if the caller asks for write access -- FOLL_WRITE.

    The patch also changes touch_pud() in the same way.

    Signed-off-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Signed-off-by: Linus Torvalds
    [Salvatore Bonaccorso: backport for 4.9:
    - Adjust context
    - Drop specific part for PUD-sized transparent hugepages. Support
    for PUD-sized transparent hugepages was added in v4.11-rc1
    ]
    Signed-off-by: Ben Hutchings
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     

24 Nov, 2017

2 commits

  • commit 373c4557d2aa362702c4c2d41288fb1e54990b7c upstream.

    This matters at least for the mincore syscall, which will otherwise copy
    uninitialized memory from the page allocator to userspace. It is
    probably also a correctness error for /proc/$pid/pagemap, but I haven't
    tested that.

    Removing the `walk->hugetlb_entry` condition in walk_hugetlb_range() has
    no effect because the caller already checks for that.

    This only reports holes in hugetlb ranges to callers who have specified
    a hugetlb_entry callback.

    This issue was found using an AFL-based fuzzer.
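
    For context, a minimal mincore() call of the kind such a fuzzer exercises
    (an ordinary anonymous mapping here, not a hugetlb one): the vector the
    kernel fills in is exactly what the missing hole handling could leave
    partially uninitialized before being copied to userspace:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            long page = sysconf(_SC_PAGESIZE);
            size_t len = 16 * page;
            unsigned char vec[16];
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED)
                    return 1;

            memset(p, 0, page);                     /* fault in the first page */
            if (mincore(p, len, vec) == 0)
                    printf("first page resident: %d\n", vec[0] & 1);

            munmap(p, len);
            return 0;
    }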

    v2:
    - don't crash on ->pte_hole==NULL (Andrew Morton)
    - add Cc stable (Andrew Morton)

    Changed for 4.4/4.9 stable backport:
    - fix up conflict in the huge_pte_offset() call

    Fixes: 1e25a271c8ac ("mincore: apply page table walker on do_mincore()")
    Signed-off-by: Jann Horn
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     
  • commit d135e5750205a21a212a19dbb05aeb339e2cbea7 upstream.

    In reset_deferred_meminit() we determine the number of pages that must not
    be deferred. We initialize pages for at least 2G of memory, but also
    pages for reserved memory in this node.

    The reserved memory is determined in this function:
    memblock_reserved_memory_within(), which operates over physical
    addresses and returns a size in bytes. However, reset_deferred_meminit()
    assumes that this function operates with pfns and returns a page
    count.

    The result is that in the best case the machine boots slower than expected
    due to initializing more pages than needed in a single thread, and in the
    worst case it panics because fewer pages than needed are initialized early.
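
    A tiny arithmetic sketch of the unit mismatch (PAGE_SHIFT hard-coded to
    the common 4K case): a byte count misread as a page count overstates the
    amount of memory to initialize early by a factor of PAGE_SIZE:

    #include <stdio.h>

    #define PAGE_SHIFT 12UL    /* 4K pages */

    int main(void)
    {
            unsigned long reserved_bytes = 512UL << 20;           /* 512MB reserved on the node */
            unsigned long misread_pages  = reserved_bytes;        /* bytes treated as a page count */
            unsigned long actual_pages   = reserved_bytes >> PAGE_SHIFT;

            printf("misread: %lu pages (covers %lu GB)\n",
                   misread_pages, misread_pages >> (30 - PAGE_SHIFT));
            printf("correct: %lu pages\n", actual_pages);
            return 0;
    }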

    Link: http://lkml.kernel.org/r/20171021011707.15191-1-pasha.tatashin@oracle.com
    Fixes: 864b9a393dcb ("mm: consider memblock reservations for deferred memory initialization sizing")
    Signed-off-by: Pavel Tatashin
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Pavel Tatashin
     

21 Oct, 2017

2 commits

  • [ Upstream commit c6e28895a4372992961888ffaadc9efc643b5bfe ]

    In case CONFIG_SLUB_DEBUG_ON=n, find_mergeable() gets debug features from
    the command line but never checks if there are features from the
    SLAB_NEVER_MERGE set.

    As a result, caches selected by slub_debug are always mergeable if they
    have been created without a custom constructor set or without one of the
    SLAB_* debug features on.

    This moves the SLAB_NEVER_MERGE check below the flags update from the
    command line to make sure it won't merge the slab cache if one of the
    debug features is on.

    Link: http://lkml.kernel.org/r/20170101124451.GA4740@lp-laptop-d
    Signed-off-by: Grygorii Maistrenko
    Reviewed-by: Pekka Enberg
    Acked-by: David Rientjes
    Acked-by: Christoph Lameter
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Grygorii Maistrenko
     
  • [ Upstream commit ddffe98d166f4a93d996d5aa628fd745311fc1e7 ]

    To identify that page-table pages are allocated from the bootmem
    allocator, a magic number is set in page->lru.next.

    But the page->lru list is initialized in reserve_bootmem_region(). So when
    calling free_pagetable(), the function cannot find the magic number in the
    pages, and free_pagetable() frees the pages by free_reserved_page(), not
    put_page_bootmem().

    But if the pages are allocated from the bootmem allocator and used as page
    tables, the pages have the private flag set. So before freeing the pages, we
    should clear the private flag with put_page_bootmem().

    Before applying the commit 7bfec6f47bb0 ("mm, page_alloc: check multiple
    page fields with a single branch"), we could find the following visible
    issue:

    BUG: Bad page state in process kworker/u1024:1
    page:ffffea103cfd8040 count:0 mapcount:0 mappi
    flags: 0x6fffff80000800(private)
    page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
    bad because of flags: 0x800(private)

    Call Trace:
    [...] dump_stack+0x63/0x87
    [...] bad_page+0x114/0x130
    [...] free_pages_prepare+0x299/0x2d0
    [...] free_hot_cold_page+0x31/0x150
    [...] __free_pages+0x25/0x30
    [...] free_pagetable+0x6f/0xb4
    [...] remove_pagetable+0x379/0x7ff
    [...] vmemmap_free+0x10/0x20
    [...] sparse_remove_one_section+0x149/0x180
    [...] __remove_pages+0x2e9/0x4f0
    [...] arch_remove_memory+0x63/0xc0
    [...] remove_memory+0x8c/0xc0
    [...] acpi_memory_device_remove+0x79/0xa5
    [...] acpi_bus_trim+0x5a/0x8d
    [...] acpi_bus_trim+0x38/0x8d
    [...] acpi_device_hotplug+0x1b7/0x418
    [...] acpi_hotplug_work_fn+0x1e/0x29
    [...] process_one_work+0x152/0x400
    [...] worker_thread+0x125/0x4b0
    [...] kthread+0xd8/0xf0
    [...] ret_from_fork+0x22/0x40

    And the issue still silently occurs.

    Until the page-table pages allocated from the bootmem allocator are freed,
    page->freelist is never used. So the patch sets the magic number in
    page->freelist instead of page->lru.next.

    [isimatu.yasuaki@jp.fujitsu.com: fix merge issue]
    Link: http://lkml.kernel.org/r/722b1cc4-93ac-dd8b-2be2-7a7e313b3b0b@gmail.com
    Link: http://lkml.kernel.org/r/2c29bd9f-5b67-02d0-18a3-8828e78bbb6f@gmail.com
    Signed-off-by: Yasuaki Ishimatsu
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: Dave Hansen
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Xishi Qiu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Yasuaki Ishimatsu
     

12 Oct, 2017

1 commit

  • commit 4d4bbd8526a8fbeb2c090ea360211fceff952383 upstream.

    Andrea has noticed that the oom_reaper doesn't invalidate the range via
    mmu notifiers (mmu_notifier_invalidate_range_start/end) and that can
    corrupt the memory of the kvm guest for example.

    tlb_flush_mmu_tlbonly already invokes mmu notifiers but that is not
    sufficient as per Andrea:

    "mmu_notifier_invalidate_range cannot be used in replacement of
    mmu_notifier_invalidate_range_start/end. For KVM
    mmu_notifier_invalidate_range is a noop and rightfully so. A MMU
    notifier implementation has to implement either ->invalidate_range
    method or the invalidate_range_start/end methods, not both. And if you
    implement invalidate_range_start/end like KVM is forced to do, calling
    mmu_notifier_invalidate_range in common code is a noop for KVM.

    For those MMU notifiers that can get away only implementing
    ->invalidate_range, the ->invalidate_range is implicitly called by
    mmu_notifier_invalidate_range_end(). And only those secondary MMUs
    that share the same pagetable with the primary MMU (like AMD iommuv2)
    can get away only implementing ->invalidate_range"

    As the callback is allowed to sleep and the implementation is out of the
    hands of the MM, it is safer to simply bail out if there is an mmu
    notifier registered. In order not to fail too early, make the
    mm_has_notifiers check under the oom_lock and have a little nap before
    failing, to give the current oom victim some more time to exit.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20170913113427.2291-1-mhocko@kernel.org
    Fixes: aac453635549 ("mm, oom: introduce oom reaper")
    Signed-off-by: Michal Hocko
    Reported-by: Andrea Arcangeli
    Reviewed-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

08 Oct, 2017

1 commit

  • [ Upstream commit bfc7228b9a9647e1c353e50b40297a2929801759 ]

    The system may panic when initialisation is done while almost all the
    memory is assigned to huge pages using the kernel command line
    parameter hugepages=xxxx. The panic may occur like this:

    Unable to handle kernel paging request for data at address 0x00000000
    Faulting instruction address: 0xc000000000302b88
    Oops: Kernel access of bad area, sig: 11 [#1]
    SMP NR_CPUS=2048 [ 0.082424] NUMA
    pSeries
    Modules linked in:
    CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.9.0-15-generic #16-Ubuntu
    task: c00000021ed01600 task.stack: c00000010d108000
    NIP: c000000000302b88 LR: c000000000270e04 CTR: c00000000016cfd0
    REGS: c00000010d10b2c0 TRAP: 0300 Not tainted (4.9.0-15-generic)
    MSR: 8000000002009033 [ 0.082770] CR: 28424422 XER: 00000000
    CFAR: c0000000003d28b8 DAR: 0000000000000000 DSISR: 40000000 SOFTE: 1
    GPR00: c000000000270e04 c00000010d10b540 c00000000141a300 c00000010fff6300
    GPR04: 0000000000000000 00000000026012c0 c00000010d10b630 0000000487ab0000
    GPR08: 000000010ee90000 c000000001454fd8 0000000000000000 0000000000000000
    GPR12: 0000000000004400 c00000000fb80000 00000000026012c0 00000000026012c0
    GPR16: 00000000026012c0 0000000000000000 0000000000000000 0000000000000002
    GPR20: 000000000000000c 0000000000000000 0000000000000000 00000000024200c0
    GPR24: c0000000016eef48 0000000000000000 c00000010fff7d00 00000000026012c0
    GPR28: 0000000000000000 c00000010fff7d00 c00000010fff6300 c00000010d10b6d0
    NIP mem_cgroup_soft_limit_reclaim+0xf8/0x4f0
    LR do_try_to_free_pages+0x1b4/0x450
    Call Trace:
    do_try_to_free_pages+0x1b4/0x450
    try_to_free_pages+0xf8/0x270
    __alloc_pages_nodemask+0x7a8/0xff0
    new_slab+0x104/0x8e0
    ___slab_alloc+0x620/0x700
    __slab_alloc+0x34/0x60
    kmem_cache_alloc_node_trace+0xdc/0x310
    mem_cgroup_init+0x158/0x1c8
    do_one_initcall+0x68/0x1d0
    kernel_init_freeable+0x278/0x360
    kernel_init+0x24/0x170
    ret_from_kernel_thread+0x5c/0x74
    Instruction dump:
    eb81ffe0 eba1ffe8 ebc1fff0 ebe1fff8 4e800020 3d230001 e9499a42 3d220004
    3929acd8 794a1f24 7d295214 eac90100 2fa90000 419eff74 3b200000
    ---[ end trace 342f5208b00d01b6 ]---

    This is a chicken-and-egg issue where the kernel tries to get free memory
    when allocating per-node data in mem_cgroup_init(), but in that path
    mem_cgroup_soft_limit_reclaim() is called, which assumes that these data
    are allocated.

    As mem_cgroup_soft_limit_reclaim() is best effort, it should return when
    these data are not yet allocated.

    This patch also fixes potential null pointer access in
    mem_cgroup_remove_from_trees() and mem_cgroup_update_tree().

    Link: http://lkml.kernel.org/r/1487856999-16581-2-git-send-email-ldufour@linux.vnet.ibm.com
    Signed-off-by: Laurent Dufour
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Acked-by: Balbir Singh
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Laurent Dufour
     

27 Sep, 2017

1 commit

  • commit 4855e4a7f29d6d10b0b9c84e189c770c9a94e91e upstream.

    There is a race between page freeing and unreserve_highatomic_pageblock().

    CPU 0                                 CPU 1

    free_hot_cold_page
      mt = get_pfnblock_migratetype
      set_pcppage_migratetype(page, mt)
                                          unreserve_highatomic_pageblock
                                            spin_lock_irqsave(&zone->lock)
                                            move_freepages_block
                                            set_pageblock_migratetype(page)
                                            spin_unlock_irqrestore(&zone->lock)
      free_pcppages_bulk
        __free_one_page(mt)
    Acked-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Sangseok Lee
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Cc: Miles Chen
    Signed-off-by: Greg Kroah-Hartman

    Minchan Kim
     

14 Sep, 2017

1 commit

  • commit de0c799bba2610a8e1e9a50d76a28614520a4cd4 upstream.

    Seen while reading the code: in handle_mm_fault(), in the case where
    arch_vma_access_permitted() is failing, the call to
    mem_cgroup_oom_disable() is not made.

    To fix that, move the call to mem_cgroup_oom_enable() after calling
    arch_vma_access_permitted(), as it should not have entered the memcg OOM.

    Link: http://lkml.kernel.org/r/1504625439-31313-1-git-send-email-ldufour@linux.vnet.ibm.com
    Fixes: bae473a423f6 ("mm: introduce fault_env")
    Signed-off-by: Laurent Dufour
    Acked-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Laurent Dufour
     

07 Sep, 2017

1 commit

  • commit c461ad6a63b37ba74632e90c063d14823c884247 upstream.

    Wendy Wang reported off-list that a RAS HWPOISON-SOFT test case failed
    and bisected it to the commit 479f854a207c ("mm, page_alloc: defer
    debugging checks of pages allocated from the PCP").

    The problem is that a page that was poisoned with madvise() is reused.
    The commit removed a check that would trigger if DEBUG_VM was enabled
    but re-enabling the check only fixes the problem as a side-effect by
    printing a bad_page warning and recovering.

    The root of the problem is that an madvise() can leave a poisoned page
    on the per-cpu list. This patch drains all per-cpu lists after pages
    are poisoned so that they will not be reused. Wendy reports that the
    test case in question passes with this patch applied. While this could
    be done in a targeted fashion, it is over-complicated for such a rare
    operation.

    Link: http://lkml.kernel.org/r/20170828133414.7qro57jbepdcyz5x@techsingularity.net
    Fixes: 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
    Signed-off-by: Mel Gorman
    Reported-by: Wang, Wendy
    Tested-by: Wang, Wendy
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: "Hansen, Dave"
    Cc: "Luck, Tony"
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     

30 Aug, 2017

3 commits

  • commit 91b540f98872a206ea1c49e4aa6ea8eed0886644 upstream.

    In the recently introduced memblock_discard() there is a reversed-logic bug:
    memory of the static array is freed instead of the dynamically allocated one.

    Link: http://lkml.kernel.org/r/1503511441-95478-2-git-send-email-pasha.tatashin@oracle.com
    Fixes: 3010f876500f ("mm: discard memblock data later")
    Signed-off-by: Pavel Tatashin
    Reported-by: Woody Suwalski
    Tested-by: Woody Suwalski
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Pavel Tatashin
     
  • commit 263630e8d176d87308481ebdcd78ef9426739c6b upstream.

    If madvise(..., MADV_FREE) split a transparent hugepage, it called
    put_page() before unlock_page().

    This was wrong because put_page() can free the page, e.g. if a
    concurrent madvise(..., MADV_DONTNEED) has removed it from the memory
    mapping. put_page() then rightfully complained about freeing a locked
    page.

    Fix this by moving the unlock_page() before put_page().

    This bug was found by syzkaller, which encountered the following splat:

    BUG: Bad page state in process syzkaller412798 pfn:1bd800
    page:ffffea0006f60000 count:0 mapcount:0 mapping: (null) index:0x20a00
    flags: 0x200000000040019(locked|uptodate|dirty|swapbacked)
    raw: 0200000000040019 0000000000000000 0000000000020a00 00000000ffffffff
    raw: ffffea0006f60020 ffffea0006f60020 0000000000000000 0000000000000000
    page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
    bad because of flags: 0x1(locked)
    Modules linked in:
    CPU: 1 PID: 3037 Comm: syzkaller412798 Not tainted 4.13.0-rc5+ #35
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:16 [inline]
    dump_stack+0x194/0x257 lib/dump_stack.c:52
    bad_page+0x230/0x2b0 mm/page_alloc.c:565
    free_pages_check_bad+0x1f0/0x2e0 mm/page_alloc.c:943
    free_pages_check mm/page_alloc.c:952 [inline]
    free_pages_prepare mm/page_alloc.c:1043 [inline]
    free_pcp_prepare mm/page_alloc.c:1068 [inline]
    free_hot_cold_page+0x8cf/0x12b0 mm/page_alloc.c:2584
    __put_single_page mm/swap.c:79 [inline]
    __put_page+0xfb/0x160 mm/swap.c:113
    put_page include/linux/mm.h:814 [inline]
    madvise_free_pte_range+0x137a/0x1ec0 mm/madvise.c:371
    walk_pmd_range mm/pagewalk.c:50 [inline]
    walk_pud_range mm/pagewalk.c:108 [inline]
    walk_p4d_range mm/pagewalk.c:134 [inline]
    walk_pgd_range mm/pagewalk.c:160 [inline]
    __walk_page_range+0xc3a/0x1450 mm/pagewalk.c:249
    walk_page_range+0x200/0x470 mm/pagewalk.c:326
    madvise_free_page_range.isra.9+0x17d/0x230 mm/madvise.c:444
    madvise_free_single_vma+0x353/0x580 mm/madvise.c:471
    madvise_dontneed_free mm/madvise.c:555 [inline]
    madvise_vma mm/madvise.c:664 [inline]
    SYSC_madvise mm/madvise.c:832 [inline]
    SyS_madvise+0x7d3/0x13c0 mm/madvise.c:760
    entry_SYSCALL_64_fastpath+0x1f/0xbe

    Here is a C reproducer:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define MADV_FREE 8
    #define PAGE_SIZE 4096

    static void *mapping;
    static const size_t mapping_size = 0x1000000;

    static void *madvise_thrproc(void *arg)
    {
        /* each thread issues one madvise() call with the advice passed in arg */
        madvise(mapping, mapping_size, (long)arg);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];

        for (;;) {
            mapping = mmap(NULL, mapping_size, PROT_WRITE,
                           MAP_POPULATE|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);

            /* unmap one page in the middle of the mapping */
            munmap((char *)mapping + mapping_size / 2, PAGE_SIZE);

            /* race MADV_DONTNEED against MADV_FREE on the same range */
            pthread_create(&t[0], 0, madvise_thrproc, (void *)MADV_DONTNEED);
            pthread_create(&t[1], 0, madvise_thrproc, (void *)MADV_FREE);
            pthread_join(t[0], NULL);
            pthread_join(t[1], NULL);
            munmap(mapping, mapping_size);
        }
    }

    Note: to see the splat, CONFIG_TRANSPARENT_HUGEPAGE=y and
    CONFIG_DEBUG_VM=y are needed.
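
    Usage note (the exact command line is an assumption, not part of the
    original report): the reproducer builds with a plain C compiler plus
    the pthread library, e.g. "gcc -pthread repro.c -o repro", and it
    loops forever, so watch dmesg for the bad_page splat and stop it
    manually.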

    Google Bug Id: 64696096

    Link: http://lkml.kernel.org/r/20170823205235.132061-1-ebiggers3@gmail.com
    Fixes: 854e9ed09ded ("mm: support madvise(MADV_FREE)")
    Signed-off-by: Eric Biggers
    Acked-by: David Rientjes
    Acked-by: Minchan Kim
    Acked-by: Michal Hocko
    Cc: Dmitry Vyukov
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Eric Biggers
     
  • commit 435c0b87d661da83771c30ed775f7c37eed193fb upstream.

    /sys/kernel/mm/transparent_hugepage/shmem_enabled controls whether we
    want to allocate huge pages when allocating pages for the private
    in-kernel shmem mount.

    Unfortunately, as Dan noticed, I've screwed it up and the only way to
    make the kernel allocate huge pages for that mount is to use "force"
    there. All other values are effectively ignored.
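
    For reference, a minimal sketch of reading that knob from C follows.
    It is just a plain sysfs file read; everything except the knob path
    quoted above is illustrative.

    #include <stdio.h>

    int main(void)
    {
        char buf[128];
        FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/shmem_enabled", "r");

        if (!f) {
            perror("fopen");
            return 1;
        }
        if (fgets(buf, sizeof(buf), f))
            printf("shmem_enabled: %s", buf);    /* selected value shown in [brackets] */
        fclose(f);
        return 0;
    }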

    Link: http://lkml.kernel.org/r/20170822144254.66431-1-kirill.shutemov@linux.intel.com
    Fixes: 5a6e75f8110c ("shmem: prepare huge= mount option and sysfs knob")
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     

25 Aug, 2017

2 commits

  • commit 197e7e521384a23b9e585178f3f11c9fa08274b9 upstream.

    The 'move_pages()' system call was introduced long long ago with the
    same permission checks as for sending a signal (except using
    CAP_SYS_NICE instead of CAP_KILL for the overriding capability).

    That turns out to not be a great choice - while the system call really
    only moves physical page allocations around (and you need other
    capabilities to do a lot of it), you can check the return value to map
    out some of the virtual address choices and defeat ASLR of a binary
    that still shares your uid.

    So change the access checks to the more common 'ptrace_may_access()'
    model instead.

    This tightens the access checks for the uid, and also effectively
    changes the CAP_SYS_NICE check to CAP_SYS_PTRACE, but it's unlikely
    that anybody really _uses_ this legacy system call any more (we have
    better NUMA placement models these days), so I expect nobody to notice.

    Famous last words.
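
    For illustration only, here is a hedged sketch of the probing vector
    described above: calling move_pages() with nodes == NULL merely
    queries per-page status, and before this change the resulting
    per-address error pattern could be read for another process sharing
    the caller's uid. The target pid and the probed address below are
    placeholders, and the program needs libnuma (link with -lnuma).

    #define _GNU_SOURCE
    #include <numaif.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        void *probe = (void *)0x400000;    /* illustrative address to test */
        int status = 0;
        long ret;

        if (argc < 2) {
            fprintf(stderr, "usage: %s <pid>\n", argv[0]);
            return 1;
        }

        /* With nodes == NULL, move_pages() does not move anything: it only
         * reports status, i.e. the NUMA node if the page is mapped and
         * present, or a negative errno such as -ENOENT or -EFAULT. */
        ret = move_pages(atoi(argv[1]), 1, &probe, NULL, &status, 0);
        printf("move_pages() = %ld, status[0] = %d\n", ret, status);
        return 0;
    }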

    Reported-by: Otto Ebeling
    Acked-by: Eric W. Biederman
    Cc: Willy Tarreau
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     
  • commit 73223e4e2e3867ebf033a5a8eb2e5df0158ccc99 upstream.

    I hit a use-after-free issue when executing trinity and reproduced it
    with KASAN enabled. The related call trace is as follows.

    BUG: KASan: use after free in SyS_get_mempolicy+0x3c8/0x960 at addr ffff8801f582d766
    Read of size 2 by task syz-executor1/798

    INFO: Allocated in mpol_new.part.2+0x74/0x160 age=3 cpu=1 pid=799
    __slab_alloc+0x768/0x970
    kmem_cache_alloc+0x2e7/0x450
    mpol_new.part.2+0x74/0x160
    mpol_new+0x66/0x80
    SyS_mbind+0x267/0x9f0
    system_call_fastpath+0x16/0x1b
    INFO: Freed in __mpol_put+0x2b/0x40 age=4 cpu=1 pid=799
    __slab_free+0x495/0x8e0
    kmem_cache_free+0x2f3/0x4c0
    __mpol_put+0x2b/0x40
    SyS_mbind+0x383/0x9f0
    system_call_fastpath+0x16/0x1b
    INFO: Slab 0xffffea0009cb8dc0 objects=23 used=8 fp=0xffff8801f582de40 flags=0x200000000004080
    INFO: Object 0xffff8801f582d760 @offset=5984 fp=0xffff8801f582d600

    Bytes b4 ffff8801f582d750: ae 01 ff ff 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ
    Object ffff8801f582d760: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
    Object ffff8801f582d770: 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkk.
    Redzone ffff8801f582d778: bb bb bb bb bb bb bb bb ........
    Padding ffff8801f582d8b8: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ
    Memory state around the buggy address:
    ffff8801f582d600: fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff8801f582d680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    >ffff8801f582d700: fc fc fc fc fc fc fc fc fc fc fc fc fb fb fb fc

    A !shared memory policy is not protected against parallel removal by
    another thread; that is normally prevented by holding the mmap_sem.
    do_get_mempolicy(), however, drops the lock midway while the policy
    can still be accessed later.

    The early, premature up_read is a historical artifact from the times
    when put_user was called in this path (see
    https://lwn.net/Articles/124754/), but that has been gone since
    8bccd85ffbaf ("[PATCH] Implement sys_* do_* layering in the memory
    policy layer."). With the current mempolicy reference-count model,
    however, the premature release is what introduces this use-after-free.

    Fix the issue by removing the premature release.
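
    For context, here is a minimal, hedged sketch of the two syscalls
    involved, using the <numaif.h> wrappers from libnuma (link with
    -lnuma). It only exercises the mbind()/get_mempolicy() API on the
    calling process; it is not the trinity-generated race that produced
    the KASAN report above.

    #define _GNU_SOURCE
    #include <numaif.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        long page = sysconf(_SC_PAGESIZE);
        unsigned long nodemask = 1;    /* allow node 0 only */
        int mode = -1;
        void *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                       MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* Set a VMA policy for the mapping, then read it back by address;
         * the MPOL_F_ADDR lookup is the kind of path touched by the fix. */
        if (mbind(p, page, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0))
            perror("mbind");
        if (get_mempolicy(&mode, NULL, 0, p, MPOL_F_ADDR))
            perror("get_mempolicy");
        printf("mode = %d (MPOL_BIND = %d)\n", mode, MPOL_BIND);

        munmap(p, page);
        return 0;
    }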

    Link: http://lkml.kernel.org/r/1502950924-27521-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang
    Acked-by: Michal Hocko
    Cc: Minchan Kim
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    zhong jiang