19 Jun, 2017

1 commit

  • The stack guard page is a useful feature to reduce the risk of stack
    smashing into a different mapping. We have been using a single-page gap,
    which is sufficient to prevent the stack from being adjacent to a
    different mapping. But this seems to be insufficient in light of the
    stack usage in userspace. E.g. glibc uses alloca() allocations as large
    as 64kB in many commonly used functions. Others use constructs like
    gid_t buffer[NGROUPS_MAX], which is 256kB, or stack strings sized by
    MAX_ARG_STRLEN.

    This is especially dangerous for suid binaries with the default unlimited
    stack size limit, because those applications can be tricked into consuming
    a large portion of the stack, and a single glibc call could then jump over
    the guard page. These attacks are not theoretical, unfortunately.

    Make those attacks less probable by increasing the stack guard gap
    to 1MB (on systems with 4k pages; but make it depend on the page size
    because systems with larger base pages might cap stack allocations in
    the PAGE_SIZE units) which should cover larger alloca() and VLA stack
    allocations. It is obviously not a full fix because the problem is somewhat
    inherent, but it should reduce the attack surface a lot.

    One could argue that the gap size should be configurable from userspace,
    but that can be done later when somebody finds that the new 1MB is wrong
    for some special case applications. For now, add a kernel command line
    option (stack_guard_gap) to specify the stack gap size (in page units).

    Implementation-wise, first delete all the old code for the stack guard page:
    because although we could get away with accounting one extra page in a
    stack vma, accounting a larger gap can break userspace - case in point,
    a program run with "ulimit -S -v 20000" failed when the 1MB gap was
    counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
    and strict non-overcommit mode.

    Instead of keeping the gap inside the stack vma, maintain the stack guard
    gap as a gap between vmas: use vm_start_gap() in place of vm_start
    (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
    places which need to respect the gap - mainly arch_get_unmapped_area()
    and the vma tree's subtree_gap support for that.
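
    For illustration, a simplified sketch of what a vm_start_gap() helper along
    these lines can look like (based on the description above, not necessarily
    the verbatim upstream code):

    static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
    {
            unsigned long vm_start = vma->vm_start;

            /* leave a stack_guard_gap-sized hole below a growing-down stack */
            if (vma->vm_flags & VM_GROWSDOWN) {
                    vm_start -= stack_guard_gap;
                    if (vm_start > vma->vm_start)   /* underflow check */
                            vm_start = 0;
            }
            return vm_start;
    }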

    Original-patch-by: Oleg Nesterov
    Original-patch-by: Michal Hocko
    Signed-off-by: Hugh Dickins
    Acked-by: Michal Hocko
    Tested-by: Helge Deller # parisc
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

17 Jun, 2017

4 commits

  • Commit e1587a494540 ("mm: vmpressure: fix sending wrong events on
    underflow") declared that reclaimed pages can exceed the scanned pages due
    to THP reclaim.

    That is incorrect because a THP will be split into normal pages and the
    loop runs again, which increments the scanned page count.

    [akpm@linux-foundation.org: tweak comment text]
    Link: http://lkml.kernel.org/r/1496824266-25235-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhongjiang
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhongjiang
     
  • In do_huge_pmd_numa_page(), we attempt to handle a migrating thp pmd by
    waiting until the pmd is unlocked before we return and retry. However,
    we can race with migrate_misplaced_transhuge_page():

    // do_huge_pmd_numa_page              // migrate_misplaced_transhuge_page()
    // Holds 0 refs on page               // Holds 2 refs on page

    vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
    /* ... */
    if (pmd_trans_migrating(*vmf->pmd)) {
        page = pmd_page(*vmf->pmd);
        spin_unlock(vmf->ptl);
                                          ptl = pmd_lock(mm, pmd);
                                          if (page_count(page) != 2) {
                                              /* roll back */
                                          }
                                          /* ... */
                                          mlock_migrate_page(new_page, page);
                                          /* ... */
                                          spin_unlock(ptl);
                                          put_page(page);
                                          put_page(page); // page freed here
        wait_on_page_locked(page);
        goto out;
    }

    This can result in the freed page having its waiters flag set
    unexpectedly, which trips the PAGE_FLAGS_CHECK_AT_PREP checks in the
    page alloc/free functions. This has been observed on arm64 KVM guests.

    We can avoid this by having do_huge_pmd_numa_page() take a reference on
    the page before dropping the pmd lock, mirroring what we do in
    __migration_entry_wait().

    When we hit the race, migrate_misplaced_transhuge_page() will see the
    reference and abort the migration, as it may do today in other cases.
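
    A simplified sketch of the fixed wait path in do_huge_pmd_numa_page(),
    assuming the function's existing out/out_unlock labels (a sketch of the
    idea, not the verbatim patch):

    if (unlikely(pmd_trans_migrating(*vmf->pmd))) {
            page = pmd_page(*vmf->pmd);
            /* pin the page before dropping the lock, as __migration_entry_wait() does */
            if (!get_page_unless_zero(page))
                    goto out_unlock;
            spin_unlock(vmf->ptl);
            wait_on_page_locked(page);
            put_page(page);
            goto out;
    }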

    Fixes: b8916634b77bffb2 ("mm: Prevent parallel splits during THP migration")
    Link: http://lkml.kernel.org/r/1497349722-6731-2-git-send-email-will.deacon@arm.com
    Signed-off-by: Mark Rutland
    Signed-off-by: Will Deacon
    Acked-by: Steve Capper
    Acked-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Rutland
     
  • I saw need_resched() warnings when swapping on a large swapfile (TBs)
    because continuously allocating many pages in swap_cgroup_prepare() took
    too long.

    We already cond_resched() when freeing pages in swap_cgroup_swapoff(). Do
    the same for the page allocation.
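
    A sketch of the idea in swap_cgroup_prepare()'s allocation loop
    (illustrative; the exact placement of the check may differ in the patch):

    for (idx = 0; idx < ctrl->length; idx++) {
            page = alloc_page(GFP_KERNEL | __GFP_ZERO);
            if (!page)
                    goto not_enough_page;
            ctrl->map[idx] = page;

            /* yield periodically so TB-sized swapfiles don't trigger warnings */
            if (!(idx % SWAP_CLUSTER_MAX))
                    cond_resched();
    }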

    Link: http://lkml.kernel.org/r/20170604200109.17606-1-yuzhao@google.com
    Signed-off-by: Yu Zhao
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yu Zhao
     
  • memory_failure() chooses a recovery action function based on the page
    flags. For huge pages it uses the tail page flags which don't have
    anything interesting set, resulting in:

    > Memory failure: 0x9be3b4: Unknown page state
    > Memory failure: 0x9be3b4: recovery action for unknown page: Failed

    Instead, save a copy of the head page's flags if this is a huge page;
    this means that if there are no relevant flags for this tail page, we use
    the head page's flags instead. This results in the me_huge_page() recovery
    action being called:

    > Memory failure: 0x9b7969: recovery action for huge page: Delayed

    For hugepages that have not yet been allocated, this allows the hugepage
    to be dequeued.
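
    Roughly, the relevant change in memory_failure() amounts to something like
    this (a sketch of the idea, not the verbatim diff):

    /*
     * Tail pages carry no interesting flags, so snapshot the head page's
     * flags for huge pages before choosing a recovery action.
     */
    if (PageHuge(p))
            page_flags = hpage->flags;
    else
            page_flags = p->flags;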

    Fixes: 524fca1e7356 ("HWPOISON: fix misjudgement of page_action() for errors on mlocked pages")
    Link: http://lkml.kernel.org/r/20170524130204.21845-1-james.morse@arm.com
    Signed-off-by: James Morse
    Tested-by: Punit Agrawal
    Acked-by: Punit Agrawal
    Acked-by: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    James Morse
     

03 Jun, 2017

9 commits

  • We have seen an early OOM killer invocation on ppc64 systems with
    crashkernel=4096M:

    kthreadd invoked oom-killer: gfp_mask=0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=7, order=0, oom_score_adj=0
    kthreadd cpuset=/ mems_allowed=7
    CPU: 0 PID: 2 Comm: kthreadd Not tainted 4.4.68-1.gd7fe927-default #1
    Call Trace:
    dump_stack+0xb0/0xf0 (unreliable)
    dump_header+0xb0/0x258
    out_of_memory+0x5f0/0x640
    __alloc_pages_nodemask+0xa8c/0xc80
    kmem_getpages+0x84/0x1a0
    fallback_alloc+0x2a4/0x320
    kmem_cache_alloc_node+0xc0/0x2e0
    copy_process.isra.25+0x260/0x1b30
    _do_fork+0x94/0x470
    kernel_thread+0x48/0x60
    kthreadd+0x264/0x330
    ret_from_kernel_thread+0x5c/0xa4

    Mem-Info:
    active_anon:0 inactive_anon:0 isolated_anon:0
    active_file:0 inactive_file:0 isolated_file:0
    unevictable:0 dirty:0 writeback:0 unstable:0
    slab_reclaimable:5 slab_unreclaimable:73
    mapped:0 shmem:0 pagetables:0 bounce:0
    free:0 free_pcp:0 free_cma:0
    Node 7 DMA free:0kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:52428800kB managed:110016kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:320kB slab_unreclaimable:4672kB kernel_stack:1152kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
    lowmem_reserve[]: 0 0 0 0
    Node 7 DMA: 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 0kB
    0 total pagecache pages
    0 pages in swap cache
    Swap cache stats: add 0, delete 0, find 0/0
    Free swap = 0kB
    Total swap = 0kB
    819200 pages RAM
    0 pages HighMem/MovableOnly
    817481 pages reserved
    0 pages cma reserved
    0 pages hwpoisoned

    The reason is that the managed memory is too low (only 110MB) while the
    rest of the 50GB is still waiting for the deferred initialization to
    be done. update_defer_init estimates the initial memory to initialize
    to at least 2GB, but it doesn't consider any memory allocated in that
    range. In this particular case we've had

    Reserving 4096MB of memory at 128MB for crashkernel (System RAM: 51200MB)

    so the low 2GB is mostly depleted.

    Fix this by considering memblock allocations in the initial static
    initialization estimation. Move the max_initialise computation to
    reset_deferred_meminit and implement a simple memblock_reserved_memory
    helper which iterates all reserved blocks and sums the size of all that
    start below the given address. The cumulative size is then added on top
    of the initial estimation. This is still not ideal because
    reset_deferred_meminit doesn't consider holes and so a reservation might
    be above the initial estimation, which we ignore, but let's keep the
    logic simple until we really need to handle more complicated cases.
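
    A minimal sketch of such a helper, following the description above (the
    helper name, signature and region clipping in the actual patch may differ):

    /* simplified: sum up the reserved blocks that start below @limit */
    static unsigned long memblock_reserved_below(phys_addr_t limit)
    {
            struct memblock_region *rgn;
            unsigned long size = 0;

            for_each_memblock(reserved, rgn)
                    if (rgn->base < limit)
                            size += rgn->size;

            return size;
    }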

    Fixes: 3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
    Link: http://lkml.kernel.org/r/20170531104010.GI27783@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Tested-by: Srikar Dronamraju
    Cc: [4.2+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • KVM uses get_user_pages() to resolve its stage2 faults. KVM sets the
    FOLL_HWPOISON flag causing faultin_page() to return -EHWPOISON when it
    finds a VM_FAULT_HWPOISON. KVM handles these hwpoison pages as a
    special case. (check_user_page_hwpoison())

    When huge pages are involved, this doesn't work so well.
    get_user_pages() calls follow_hugetlb_page(), which stops early if it
    receives VM_FAULT_HWPOISON from hugetlb_fault(), eventually returning
    -EFAULT to the caller. The step to map this to -EHWPOISON based on the
    FOLL_ flags is missing. The hwpoison special case is skipped, and
    -EFAULT is returned to user-space, causing Qemu or kvmtool to exit.

    Instead, move this VM_FAULT_* to errno mapping code into a header file
    and use it from faultin_page() and follow_hugetlb_page().

    With this, KVM works as expected.

    This isn't a problem for arm64 today as we haven't enabled
    MEMORY_FAILURE, but I can't see any reason this doesn't happen on x86
    too, so I think this should be a fix. This doesn't apply earlier than
    stable's v4.11.1 due to all sorts of cleanup.
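
    The shared mapping helper looks roughly like this (a sketch consistent
    with the description above; treat the details as approximate):

    static inline int vm_fault_to_errno(int vm_fault, int foll_flags)
    {
            if (vm_fault & VM_FAULT_OOM)
                    return -ENOMEM;
            if (vm_fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
                    return (foll_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;
            if (vm_fault & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
                    return -EFAULT;
            return 0;
    }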

    [james.morse@arm.com: add vm_fault_to_errno() call to faultin_page(), as suggested]
    Link: http://lkml.kernel.org/r/20170525171035.16359-1-james.morse@arm.com
    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20170524160900.28786-1-james.morse@arm.com
    Signed-off-by: James Morse
    Acked-by: Punit Agrawal
    Acked-by: Naoya Horiguchi
    Cc: "Kirill A . Shutemov"
    Cc: [4.11.1+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    James Morse
     
  • Kefeng reported that when running the following test, the Mlocked count in
    /proc/meminfo increases permanently:

    [1] testcase
    linux:~ # cat test_mlockal
    grep Mlocked /proc/meminfo
    for j in `seq 0 10`
    do
            for i in `seq 4 15`
            do
                    ./p_mlockall >> log &
            done
            sleep 0.2
    done
    # wait some time to let the mlock counter decrease; 5s may not be enough
    sleep 5
    grep Mlocked /proc/meminfo

    linux:~ # cat p_mlockall.c
    #include <stdlib.h>
    #include <stdio.h>
    #include <sys/mman.h>

    #define SPACE_LEN 4096

    int main(int argc, char **argv)
    {
            int ret;
            void *adr = malloc(SPACE_LEN);
            if (!adr)
                    return -1;

            ret = mlockall(MCL_CURRENT | MCL_FUTURE);
            printf("mlockall ret = %d\n", ret);

            ret = munlockall();
            printf("munlockall ret = %d\n", ret);

            free(adr);
            return 0;
    }

    In __munlock_pagevec() we should decrement NR_MLOCK for each page where
    we clear the PageMlocked flag. Commit 1ebb7cc6a583 ("mm: munlock: batch
    NR_MLOCK zone state updates") has introduced a bug where we don't
    decrement NR_MLOCK for pages where we clear the flag, but fail to
    isolate them from the lru list (e.g. when the pages are on some other
    cpu's percpu pagevec). Since PageMlocked stays cleared, the NR_MLOCK
    accounting gets permanently disrupted by this.

    Fix it by counting the number of pages whose PageMlocked flag is cleared.
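
    A simplified sketch of the accounting in __munlock_pagevec() after the fix
    (the real patch tracks the delta slightly differently, but the idea is the
    same):

    int munlocked = 0;

    for (i = 0; i < nr; i++) {
            struct page *page = pvec->pages[i];

            if (TestClearPageMlocked(page)) {
                    /* every cleared flag is accounted, even if isolation fails */
                    munlocked++;
                    if (__munlock_isolate_lru_page(page, false))
                            continue;
                    __munlock_isolation_failed(page);
            }
            /* ... */
    }
    __mod_zone_page_state(zone, NR_MLOCK, -munlocked);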

    Fixes: 1ebb7cc6a583 ("mm: munlock: batch NR_MLOCK zone state updates")
    Link: http://lkml.kernel.org/r/1495678405-54569-1-git-send-email-xieyisheng1@huawei.com
    Signed-off-by: Yisheng Xie
    Reported-by: Kefeng Wang
    Tested-by: Kefeng Wang
    Cc: Vlastimil Babka
    Cc: Joern Engel
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Xishi Qiu
    Cc: zhongjiang
    Cc: Hanjun Guo
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yisheng Xie
     
  • On failing to migrate a page, soft_offline_huge_page() performs the
    necessary update to the hugepage ref-count.

    But when !hugepage_migration_supported(), unmap_and_move_huge_page()
    also decrements the page ref-count for the hugepage. The combined
    behaviour leaves the ref-count in an inconsistent state.

    This leads to soft lockups when running the overcommitted hugepage test
    from mce-tests suite.

    Soft offlining pfn 0x83ed600 at process virtual address 0x400000000000
    soft offline: 0x83ed600: migration failed 1, type 1fffc00000008008 (uptodate|head)
    INFO: rcu_preempt detected stalls on CPUs/tasks:
    Tasks blocked on level-0 rcu_node (CPUs 0-7): P2715
    (detected by 7, t=5254 jiffies, g=963, c=962, q=321)
    thugetlb_overco R running task 0 2715 2685 0x00000008
    Call trace:
    dump_backtrace+0x0/0x268
    show_stack+0x24/0x30
    sched_show_task+0x134/0x180
    rcu_print_detail_task_stall_rnp+0x54/0x7c
    rcu_check_callbacks+0xa74/0xb08
    update_process_times+0x34/0x60
    tick_sched_handle.isra.7+0x38/0x70
    tick_sched_timer+0x4c/0x98
    __hrtimer_run_queues+0xc0/0x300
    hrtimer_interrupt+0xac/0x228
    arch_timer_handler_phys+0x3c/0x50
    handle_percpu_devid_irq+0x8c/0x290
    generic_handle_irq+0x34/0x50
    __handle_domain_irq+0x68/0xc0
    gic_handle_irq+0x5c/0xb0

    Address this by changing the putback_active_hugepage() in
    soft_offline_huge_page() to putback_movable_pages().

    This only triggers on systems that enable memory failure handling
    (ARCH_SUPPORTS_MEMORY_FAILURE) but not hugepage migration
    (!ARCH_ENABLE_HUGEPAGE_MIGRATION).

    I imagine this wasn't triggered as there aren't many systems running
    this configuration.
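
    The error path in soft_offline_huge_page() then looks roughly like this
    (a sketch, not the verbatim patch):

    if (ret) {
            pr_info("soft offline: %#lx: migration failed %d\n", pfn, ret);
            /*
             * unmap_and_move_huge_page() may already have dropped the page
             * reference, so let the generic helper put back whatever is
             * still left on the list instead of touching hpage directly.
             */
            if (!list_empty(&pagelist))
                    putback_movable_pages(&pagelist);
            if (ret > 0)
                    ret = -EIO;
    }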

    [akpm@linux-foundation.org: remove dead comment, per Naoya]
    Link: http://lkml.kernel.org/r/20170525135146.32011-1-punit.agrawal@arm.com
    Reported-by: Manoj Iyer
    Tested-by: Manoj Iyer
    Suggested-by: Naoya Horiguchi
    Signed-off-by: Punit Agrawal
    Cc: Joonsoo Kim
    Cc: Wanpeng Li
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: [3.14+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Punit Agrawal
     
  • When the pmd_devmap() checks were added by 5c7fb56e5e3f ("mm, dax:
    dax-pmd vs thp-pmd vs hugetlbfs-pmd") to add better support for DAX huge
    pages, they were all added to the end of if() statements after existing
    pmd_trans_huge() checks. So, things like:

    - if (pmd_trans_huge(*pmd))
    + if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))

    When further checks were added after pmd_trans_unstable() checks by
    commit 7267ec008b5c ("mm: postpone page table allocation until we have
    page to map") they were also added at the end of the conditional:

    + if (pmd_trans_unstable(fe->pmd) || pmd_devmap(*fe->pmd))

    This ordering is fine for pmd_trans_huge(), but doesn't work for
    pmd_trans_unstable(). This is because DAX huge pages trip the bad_pmd()
    check inside of pmd_none_or_trans_huge_or_clear_bad() (called by
    pmd_trans_unstable()), which prints out a warning and returns 1. So, we
    do end up doing the right thing, but only after spamming dmesg with
    suspicious looking messages:

    mm/pgtable-generic.c:39: bad pmd ffff8808daa49b88(84000001006000a5)

    Reorder these checks in a helper so that pmd_devmap() is checked first,
    avoiding the error messages, and add a comment explaining why the
    ordering is important.
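
    A helper along these lines is essentially a reordered combination of the
    two checks (sketch):

    /*
     * Check pmd_devmap() first: a DAX huge pmd would otherwise trip the
     * "bad pmd" warning inside pmd_trans_unstable().
     */
    static inline bool pmd_devmap_trans_unstable(pmd_t *pmd)
    {
            return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
    }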

    Fixes: 7267ec008b5c ("mm: postpone page table allocation until we have page to map")
    Link: http://lkml.kernel.org/r/20170522215749.23516-1-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Reviewed-by: Jan Kara
    Cc: Pawel Lebioda
    Cc: "Darrick J. Wong"
    Cc: Alexander Viro
    Cc: Christoph Hellwig
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: Matthew Wilcox
    Cc: "Kirill A . Shutemov"
    Cc: Dave Jiang
    Cc: Xiong Zhou
    Cc: Eryu Guan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • Roman Gushchin has reported that the OOM killer can trivially select the
    next OOM victim when a thread doing memory allocation from the page fault
    path was selected as the first OOM victim.

    allocate invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null), order=0, oom_score_adj=0
    allocate cpuset=/ mems_allowed=0
    CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
    Call Trace:
    oom_kill_process+0x219/0x3e0
    out_of_memory+0x11d/0x480
    __alloc_pages_slowpath+0xc84/0xd40
    __alloc_pages_nodemask+0x245/0x260
    alloc_pages_vma+0xa2/0x270
    __handle_mm_fault+0xca9/0x10c0
    handle_mm_fault+0xf3/0x210
    __do_page_fault+0x240/0x4e0
    trace_do_page_fault+0x37/0xe0
    do_async_page_fault+0x19/0x70
    async_page_fault+0x28/0x30
    ...
    Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
    Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB
    allocate: page allocation failure: order:0, mode:0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null)
    allocate cpuset=/ mems_allowed=0
    CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
    Call Trace:
    __alloc_pages_slowpath+0xd32/0xd40
    __alloc_pages_nodemask+0x245/0x260
    alloc_pages_vma+0xa2/0x270
    __handle_mm_fault+0xca9/0x10c0
    handle_mm_fault+0xf3/0x210
    __do_page_fault+0x240/0x4e0
    trace_do_page_fault+0x37/0xe0
    do_async_page_fault+0x19/0x70
    async_page_fault+0x28/0x30
    ...
    oom_reaper: reaped process 492 (allocate), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    ...
    allocate invoked oom-killer: gfp_mask=0x0(), nodemask=(null), order=0, oom_score_adj=0
    allocate cpuset=/ mems_allowed=0
    CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
    Call Trace:
    oom_kill_process+0x219/0x3e0
    out_of_memory+0x11d/0x480
    pagefault_out_of_memory+0x68/0x80
    mm_fault_error+0x8f/0x190
    ? handle_mm_fault+0xf3/0x210
    __do_page_fault+0x4b2/0x4e0
    trace_do_page_fault+0x37/0xe0
    do_async_page_fault+0x19/0x70
    async_page_fault+0x28/0x30
    ...
    Out of memory: Kill process 233 (firewalld) score 10 or sacrifice child
    Killed process 233 (firewalld) total-vm:246076kB, anon-rss:20956kB, file-rss:0kB, shmem-rss:0kB

    There is a race window where the OOM reaper completes reclaiming the
    first victim's memory while nothing but mutex_trylock() prevents the
    first victim from calling out_of_memory() from pagefault_out_of_memory()
    after a memory allocation from the page fault path failed due to it being
    selected as an OOM victim.

    This is a side effect of commit 9a67f6488eca926f ("mm: consolidate
    GFP_NOFAIL checks in the allocator slowpath") because that commit
    silently changed the behavior from

    /* Avoid allocations with no watermarks from looping endlessly */

    to

    /*
    * Give up allocations without trying memory reserves if selected
    * as an OOM victim
    */

    in __alloc_pages_slowpath() by moving the location of the TIF_MEMDIE flag
    check. I noticed this change but didn't post a patch because I thought it
    was an acceptable change, apart from the noise from warn_alloc(), since
    !__GFP_NOFAIL allocations are allowed to fail. But we overlooked that a
    failing memory allocation from the page fault path makes a difference due
    to the race window explained above.

    While it might be possible to add a check to pagefault_out_of_memory()
    that prevents the first victim from calling out_of_memory(), or to remove
    out_of_memory() from pagefault_out_of_memory(), changing
    pagefault_out_of_memory() does not suppress the noise from warn_alloc()
    when the allocating thread was selected as an OOM victim. There is little
    point in printing similar backtraces and memory information from both
    out_of_memory() and warn_alloc().

    Instead, if we guarantee that the current thread can try allocations with
    no watermarks once, when the thread looping inside
    __alloc_pages_slowpath() was selected as an OOM victim, we can follow the
    "who can use memory reserves" rules, suppress the noise from warn_alloc()
    and prevent memory allocations from the page fault path from calling
    pagefault_out_of_memory().

    If we take the comment literally, this patch would do

    - if (test_thread_flag(TIF_MEMDIE))
    - goto nopage;
    + if (alloc_flags == ALLOC_NO_WATERMARKS || (gfp_mask & __GFP_NOMEMALLOC))
    + goto nopage;

    because gfp_pfmemalloc_allowed() returns false if __GFP_NOMEMALLOC is
    given. But if I recall correctly (I couldn't find the message), the
    condition is meant to apply only to OOM victims despite the comment.
    Therefore, this patch preserves the TIF_MEMDIE check.
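
    Given that, the resulting check in __alloc_pages_slowpath() presumably
    combines both conditions, along the lines of (sketch):

    /* Avoid allocations with no watermarks from looping endlessly */
    if (test_thread_flag(TIF_MEMDIE) &&
        (alloc_flags == ALLOC_NO_WATERMARKS || (gfp_mask & __GFP_NOMEMALLOC)))
            goto nopage;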

    Fixes: 9a67f6488eca926f ("mm: consolidate GFP_NOFAIL checks in the allocator slowpath")
    Link: http://lkml.kernel.org/r/201705192112.IAF69238.OQOHSJLFOFFMtV@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Reported-by: Roman Gushchin
    Tested-by: Roman Gushchin
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: [4.11]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • memcg_propagate_slab_attrs() abuses the sysfs attribute file functions
    to propagate settings from the root kmem_cache to a newly created
    kmem_cache. It does that with:

    attr->show(root, buf);
    attr->store(new, buf, strlen(buf));

    Aside from being lazy and absurd hackery, this is broken because it does
    not check the return value of the show() function.

    Some of the show() functions return 0 without touching the buffer. That
    means in such a case the store function is called with the stale content
    of the previous show(). That causes nonsense like invoking
    kmem_cache_shrink() on a newly created kmem_cache. In the worst case it
    would cause handing in an uninitialized buffer.

    This should be rewritten properly by adding a propagate() callback to
    those slub_attributes which must be propagated, avoiding that insane
    conversion to and from ASCII, but that's too large for a hot fix.

    Check at least the return value of the show() function, so calling
    store() with stale content is prevented.
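
    In memcg_propagate_slab_attrs() that amounts to something like the
    following (sketch; variable names follow the snippet above loosely):

    len = attr->show(root, buf);
    if (len > 0)            /* only propagate if show() produced data */
            attr->store(new, buf, len);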

    Steven said:
    "It can cause a deadlock with get_online_cpus() that has been uncovered
    by recent cpu hotplug and lockdep changes that Thomas and Peter have
    been doing.

    Possible unsafe locking scenario:

        CPU0                            CPU1
        ----                            ----
        lock(cpu_hotplug.lock);
                                        lock(slab_mutex);
                                        lock(cpu_hotplug.lock);
        lock(slab_mutex);

    *** DEADLOCK ***"

    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1705201244540.2255@nanos
    Signed-off-by: Thomas Gleixner
    Reported-by: Steven Rostedt
    Acked-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc: Christoph Hellwig
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • While converting the drm_[cm]alloc* helpers to kvmalloc* variants, Chris
    Wilson wondered why we want to try kmalloc before the vmalloc fallback
    even for larger allocation requests. Let's clarify that one larger
    physically contiguous block is less likely to fragment memory than many
    scattered pages, which can prevent more large blocks from being created.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20170517080932.21423-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Suggested-by: Chris Wilson
    Reviewed-by: Chris Wilson
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • "err" needs to be left set to -EFAULT if split_huge_page succeeds.
    Otherwise if "err" gets clobbered with zero and write_protect_page
    fails, try_to_merge_one_page() will succeed instead of returning -EFAULT
    and then try_to_merge_with_ksm_page() will continue thinking kpage is a
    PageKsm when in fact it's still an anonymous page. Eventually it'll
    crash in page_add_anon_rmap.

    This has been reproduced on a Fedora 25 kernel but it can be reproduced
    with upstream too.

    The bug was introduced by commit f765f540598a ("ksm: prepare to new THP
    semantics") in v4.5.

    page:fffff67546ce1cc0 count:4 mapcount:2 mapping:ffffa094551e36e1 index:0x7f0f46673
    flags: 0x2ffffc0004007c(referenced|uptodate|dirty|lru|active|swapbacked)
    page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))
    page->mem_cgroup:ffffa09674bf0000
    ------------[ cut here ]------------
    kernel BUG at mm/rmap.c:1222!
    CPU: 1 PID: 76 Comm: ksmd Not tainted 4.9.3-200.fc25.x86_64 #1
    RIP: do_page_add_anon_rmap+0x1c4/0x240
    Call Trace:
    page_add_anon_rmap+0x18/0x20
    try_to_merge_with_ksm_page+0x50b/0x780
    ksm_scan_thread+0x1211/0x1410
    ? prepare_to_wait_event+0x100/0x100
    ? try_to_merge_with_ksm_page+0x780/0x780
    kthread+0xd9/0xf0
    ? kthread_park+0x60/0x60
    ret_from_fork+0x25/0x30

    Fixes: f765f54059 ("ksm: prepare to new THP semantics")
    Link: http://lkml.kernel.org/r/20170513131040.21732-1-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reported-by: Federico Simoncelli
    Acked-by: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

13 May, 2017

8 commits

  • Although there were a ton of free swap and anonymous LRU pages in eligible
    zones, OOM happened.

    balloon invoked oom-killer: gfp_mask=0x17080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK), nodemask=(null), order=0, oom_score_adj=0
    CPU: 7 PID: 1138 Comm: balloon Not tainted 4.11.0-rc6-mm1-zram-00289-ge228d67e9677-dirty #17
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
    Call Trace:
    oom_kill_process+0x21d/0x3f0
    out_of_memory+0xd8/0x390
    __alloc_pages_slowpath+0xbc1/0xc50
    __alloc_pages_nodemask+0x1a5/0x1c0
    pte_alloc_one+0x20/0x50
    __pte_alloc+0x1e/0x110
    __handle_mm_fault+0x919/0x960
    handle_mm_fault+0x77/0x120
    __do_page_fault+0x27a/0x550
    trace_do_page_fault+0x43/0x150
    do_async_page_fault+0x2c/0x90
    async_page_fault+0x28/0x30
    Mem-Info:
    active_anon:424716 inactive_anon:65314 isolated_anon:0
    active_file:52 inactive_file:46 isolated_file:0
    unevictable:0 dirty:27 writeback:0 unstable:0
    slab_reclaimable:3967 slab_unreclaimable:4125
    mapped:133 shmem:43 pagetables:1674 bounce:0
    free:4637 free_pcp:225 free_cma:0
    Node 0 active_anon:1698864kB inactive_anon:261256kB active_file:208kB inactive_file:184kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:532kB dirty:108kB writeback:0kB shmem:172kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
    DMA free:7316kB min:32kB low:44kB high:56kB active_anon:8064kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:464kB slab_unreclaimable:40kB kernel_stack:0kB pagetables:24kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
    lowmem_reserve[]: 0 992 992 1952
    DMA32 free:9088kB min:2048kB low:3064kB high:4080kB active_anon:952176kB inactive_anon:0kB active_file:36kB inactive_file:0kB unevictable:0kB writepending:88kB present:1032192kB managed:1019388kB mlocked:0kB slab_reclaimable:13532kB slab_unreclaimable:16460kB kernel_stack:3552kB pagetables:6672kB bounce:0kB free_pcp:56kB local_pcp:24kB free_cma:0kB
    lowmem_reserve[]: 0 0 0 959
    Movable free:3644kB min:1980kB low:2960kB high:3940kB active_anon:738560kB inactive_anon:261340kB active_file:188kB inactive_file:640kB unevictable:0kB writepending:20kB present:1048444kB managed:1010816kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:832kB local_pcp:60kB free_cma:0kB
    lowmem_reserve[]: 0 0 0 0
    DMA: 1*4kB (E) 0*8kB 18*16kB (E) 10*32kB (E) 10*64kB (E) 9*128kB (ME) 8*256kB (E) 2*512kB (E) 2*1024kB (E) 0*2048kB 0*4096kB = 7524kB
    DMA32: 417*4kB (UMEH) 181*8kB (UMEH) 68*16kB (UMEH) 48*32kB (UMEH) 14*64kB (MH) 3*128kB (M) 1*256kB (H) 1*512kB (M) 2*1024kB (M) 0*2048kB 0*4096kB = 9836kB
    Movable: 1*4kB (M) 1*8kB (M) 1*16kB (M) 1*32kB (M) 0*64kB 1*128kB (M) 2*256kB (M) 4*512kB (M) 1*1024kB (M) 0*2048kB 0*4096kB = 3772kB
    378 total pagecache pages
    17 pages in swap cache
    Swap cache stats: add 17325, delete 17302, find 0/27
    Free swap = 978940kB
    Total swap = 1048572kB
    524157 pages RAM
    0 pages HighMem/MovableOnly
    12629 pages reserved
    0 pages cma reserved
    0 pages hwpoisoned
    [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
    [ 433] 0 433 4904 5 14 3 82 0 upstart-udev-br
    [ 438] 0 438 12371 5 27 3 191 -1000 systemd-udevd

    Investigation showed that the page skipping in isolate_lru_pages makes
    reclaim a no-op because it easily returns a zero nr_taken, so LRU
    shrinking does effectively nothing and just raises the priority
    aggressively. Finally, OOM happens.

    The problem is that get_scan_count determines nr_to_scan using eligible
    zones, so although the priority drops to zero, it cannot reclaim any
    pages if the LRU contains mostly ineligible pages.

    get_scan_count:

    size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
    size = size >> sc->priority;

    Assume sc->priority is 0 and the LRU list is as follows.

    N-N-N-N-H-H-H-H-H-H-H-H-H-H-H-H-H-H-H-H

    (I.e., a few eligible pages are at the head of the LRU but the others are
    almost all ineligible pages.)

    In that case, size becomes 4 so the VM wants to scan 4 pages, but the 4
    pages at the tail of the LRU are not eligible pages. If skipped pages are
    counted as scanned, no pages remain to be reclaimed after scanning those
    4 pages, so it ends up with OOM happening.

    This patch makes isolate_lru_pages keep scanning pages until it
    encounters pages from eligible zones.
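
    Conceptually, the isolate_lru_pages() loop changes along these lines
    (simplified sketch; the real code keeps both a total and an eligible-only
    scan counter):

    for (total_scan = 0;
         scan < nr_to_scan && nr_taken < nr_to_scan && !list_empty(src);
         total_scan++) {
            struct page *page = lru_to_page(src);

            if (page_zonenum(page) > sc->reclaim_idx) {
                    /* skipped pages no longer consume the scan budget */
                    list_move(&page->lru, &pages_skipped);
                    continue;
            }

            /* only pages from eligible zones count toward nr_to_scan */
            scan++;
            /* ... isolate the page ... */
    }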

    [akpm@linux-foundation.org: clean up mind-bending `for' statement. Tweak comment text]
    Fixes: 3db65812d688 ("Revert "mm, vmscan: account for skipped pages as a partial scan"")
    Link: http://lkml.kernel.org/r/1494457232-27401-1-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • We have encountered need_resched warnings in __collapse_huge_page_copy()
    while doing {clear,copy}_user_highpage() over HPAGE_PMD_NR source pages.

    mm->mmap_sem is held for write, but the iteration is well bounded.

    Reschedule as needed.
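
    Conceptually, the copy loop gains a reschedule point per subpage
    (simplified sketch, not the exact hunk):

    for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
         _pte++, page++, _address += PAGE_SIZE) {
            /* clear_user_highpage() or copy_user_highpage() for one subpage */
            /* ... */
            cond_resched();   /* iteration is bounded, so this is cheap */
    }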

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705101426380.109808@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Currently, we do not invalidate page tables during invalidate_inode_pages2()
    for DAX. That could result in e.g. a 2MiB zero page being mapped into
    page tables while there were already underlying blocks allocated, and
    thus data seen through mmap differed from data seen by read(2).
    The following sequence reproduces the problem:

    - open an mmap over a 2MiB hole

    - read from a 2MiB hole, faulting in a 2MiB zero page

    - write to the hole with write(3p). The write succeeds but we
    incorrectly leave the 2MiB zero page mapping intact.

    - via the mmap, read the data that was just written. Since the zero
    page mapping is still intact we read back zeroes instead of the new
    data.

    Fix the problem by unconditionally calling invalidate_inode_pages2_range()
    in dax_iomap_actor() for new block allocations and by properly
    invalidating page tables in invalidate_inode_pages2_range() for DAX
    mappings.

    Fixes: c6dcf52c23d2 ("mm: Invalidate DAX radix tree entries only if appropriate")
    Link: http://lkml.kernel.org/r/20170510085419.27601-3-jack@suse.cz
    Signed-off-by: Jan Kara
    Signed-off-by: Ross Zwisler
    Cc: Dan Williams
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Patch series "mm,dax: Fix data corruption due to mmap inconsistency",
    v4.

    This series fixes data corruption that can happen for DAX mounts when
    page faults race with write(2) and as a result page tables get out of
    sync with block mappings in the filesystem and thus data seen through
    mmap is different from data seen through read(2).

    The series passes testing with t_mmap_stale test program from Ross and
    also other mmap related tests on DAX filesystem.

    This patch (of 4):

    dax_invalidate_mapping_entry() currently removes DAX exceptional entries
    only if they are clean and unlocked. This is done via:

    invalidate_mapping_pages()
    invalidate_exceptional_entry()
    dax_invalidate_mapping_entry()

    However, for page cache pages removed in invalidate_mapping_pages()
    there is an additional criteria which is that the page must not be
    mapped. This is noted in the comments above invalidate_mapping_pages()
    and is checked in invalidate_inode_page().

    For DAX entries this means that we can end up in a situation where a
    DAX exceptional entry, either a huge zero page or a regular DAX entry,
    could end up mapped but without an associated radix tree entry. This is
    inconsistent with the rest of the DAX code and with what happens in the
    page cache case.

    We aren't able to unmap the DAX exceptional entry because according to
    its comments invalidate_mapping_pages() isn't allowed to block, and
    unmap_mapping_range() takes a write lock on the mapping->i_mmap_rwsem.

    Since we essentially never have unmapped DAX entries to evict from the
    radix tree, just remove dax_invalidate_mapping_entry().

    Fixes: c6dcf52c23d2 ("mm: Invalidate DAX radix tree entries only if appropriate")
    Link: http://lkml.kernel.org/r/20170510085419.27601-2-jack@suse.cz
    Signed-off-by: Ross Zwisler
    Signed-off-by: Jan Kara
    Reported-by: Jan Kara
    Cc: Dan Williams
    Cc: [4.10+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     
  • Commit 1f5307b1e094 ("mm, vmalloc: properly track vmalloc users") pulled
    the asm/pgtable.h include dependency into linux/vmalloc.h, and that
    turned out to be a bad idea for some architectures. E.g. m68k fails
    with

    In file included from arch/m68k/include/asm/pgtable_mm.h:145:0,
    from arch/m68k/include/asm/pgtable.h:4,
    from include/linux/vmalloc.h:9,
    from arch/m68k/kernel/module.c:9:
    arch/m68k/include/asm/mcf_pgtable.h: In function 'nocache_page':
    >> arch/m68k/include/asm/mcf_pgtable.h:339:43: error: 'init_mm' undeclared (first use in this function)
    #define pgd_offset_k(address) pgd_offset(&init_mm, address)

    as spotted by the kernel build bot. nios2 fails for another reason:

    In file included from include/asm-generic/io.h:767:0,
    from arch/nios2/include/asm/io.h:61,
    from include/linux/io.h:25,
    from arch/nios2/include/asm/pgtable.h:18,
    from include/linux/mm.h:70,
    from include/linux/pid_namespace.h:6,
    from include/linux/ptrace.h:9,
    from arch/nios2/include/uapi/asm/elf.h:23,
    from arch/nios2/include/asm/elf.h:22,
    from include/linux/elf.h:4,
    from include/linux/module.h:15,
    from init/main.c:16:
    include/linux/vmalloc.h: In function '__vmalloc_node_flags':
    include/linux/vmalloc.h:99:40: error: 'PAGE_KERNEL' undeclared (first use in this function); did you mean 'GFP_KERNEL'?

    which is due to the newly added #include <asm/pgtable.h>, which on nios2
    includes <linux/io.h> and thus <asm/io.h> and <asm-generic/io.h>, which
    again includes <linux/vmalloc.h>.

    Tweaking that around just turns out to be a bigger headache than necessary.
    This patch reverts 1f5307b1e094 and reimplements the original fix in a
    different way. __vmalloc_node_flags can stay static inline which will
    cover vmalloc* functions. We only have one external user
    (kvmalloc_node) and we can export __vmalloc_node_flags_caller and
    provide the caller directly. This is much simpler and it doesn't really
    need any games with header files.
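
    Roughly, the exported variant just forwards the caller down to
    __vmalloc_node(), and kvmalloc_node() records its own caller (a sketch;
    exact gfp handling elided):

    /* mm/vmalloc.c */
    void *__vmalloc_node_flags_caller(unsigned long size, int node, gfp_t flags,
                                      void *caller)
    {
            return __vmalloc_node(size, 1, flags, PAGE_KERNEL, node, caller);
    }

    /* kvmalloc_node() can then pass its own caller for /proc/vmallocinfo */
    return __vmalloc_node_flags_caller(size, node, flags,
                                       __builtin_return_address(0));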

    [akpm@linux-foundation.org: coding-style fixes]
    [mhocko@kernel.org: revert old comment]
    Link: http://lkml.kernel.org/r/20170509211054.GB16325@dhcp22.suse.cz
    Fixes: 1f5307b1e094 ("mm, vmalloc: properly track vmalloc users")
    Link: http://lkml.kernel.org/r/20170509153702.GR6481@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Cc: Tobias Klauser
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • One return case of `__collapse_huge_page_swapin()` does not invoke the
    tracepoint, while every other return case does. This commit adds the
    tracepoint invocation for that case.

    Link: http://lkml.kernel.org/r/20170507101813.30187-1-sj38.park@gmail.com
    Signed-off-by: SeongJae Park
    Cc: Kirill A. Shutemov
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    SeongJae Park
     
  • After commit e2ecc8a79ed4 ("mm, vmstat: print non-populated zones in
    zoneinfo"), /proc/zoneinfo will show unpopulated zones.

    A memoryless node, having no populated zones at all, was previously
    ignored, but will now trigger the WARN() in is_zone_first_populated().

    Remove this warning, as its only purpose was to warn of a situation that
    has since been enabled.

    Aside: The "per-node stats" are still printed under the first populated
    zone, but that's not necessarily the first stanza any more. I'm not
    sure which criteria is more important with regard to not breaking
    parsers, but it looks a little weird to the eye.

    Fixes: e2ecc8a79ed4 ("mm, vmstat: print node-based stats in zoneinfo file")
    Link: http://lkml.kernel.org/r/1493854905-10918-1-git-send-email-arbab@linux.vnet.ibm.com
    Signed-off-by: Reza Arbab
    Cc: David Rientjes
    Cc: Anshuman Khandual
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Reza Arbab
     
  • Laurent Dufour has noticed that hwpoisoned pages are kept charged. In
    his particular case he has hit a bad_page("page still charged to
    cgroup") when onlining a hwpoison page. While this looks like something
    that shouldn't happen in the first place, because onlining hwpoison pages
    and returning them to the page allocator makes little sense, it shows a
    real problem.

    hwpoison pages do not get freed usually so we do not uncharge them (at
    least not since commit 0a31bc97c80c ("mm: memcontrol: rewrite uncharge
    API")). Each charge pins memcg (since e8ea14cc6ead ("mm: memcontrol:
    take a css reference for each charged page")) as well and so the
    mem_cgroup and the associated state will never go away. Fix this leak
    by forcibly uncharging a LRU hwpoisoned page in delete_from_lru_cache().
    We also have to tweak uncharge_list because it cannot rely on zero ref
    count for these pages.
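
    The delete_from_lru_cache() part of the fix looks roughly like this
    (a sketch of the idea, not the verbatim patch):

    static int delete_from_lru_cache(struct page *p)
    {
            if (!isolate_lru_page(p)) {
                    ClearPageActive(p);
                    ClearPageUnevictable(p);

                    /*
                     * A poisoned page may never drop its refcount to 0, so
                     * uncharge it from its memcg by hand.
                     */
                    mem_cgroup_uncharge(p);

                    /* drop the reference taken by isolate_lru_page() */
                    put_page(p);
                    return 0;
            }
            return -EIO;
    }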

    [akpm@linux-foundation.org: coding-style fixes]
    Fixes: 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API")
    Link: http://lkml.kernel.org/r/20170502185507.GB19165@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reported-by: Laurent Dufour
    Tested-by: Laurent Dufour
    Reviewed-by: Balbir Singh
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

12 May, 2017

1 commit

  • Pull more arm64 updates from Catalin Marinas:

    - Silence module allocation failures when CONFIG_ARM*_MODULE_PLTS is
    enabled. This requires a check for __GFP_NOWARN in alloc_vmap_area()

    - Improve/sanitise user tagged pointers handling in the kernel

    - Inline asm fixes/cleanups

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
    arm64: Silence first allocation with CONFIG_ARM64_MODULE_PLTS=y
    ARM: Silence first allocation with CONFIG_ARM_MODULE_PLTS=y
    mm: Silence vmap() allocation failures based on caller gfp_flags
    arm64: uaccess: suppress spurious clang warning
    arm64: atomic_lse: match asm register sizes
    arm64: armv8_deprecated: ensure extension of addr
    arm64: uaccess: ensure extension of access_ok() addr
    arm64: ensure extension of smp_store_release value
    arm64: xchg: hazard against entire exchange variable
    arm64: documentation: document tagged pointer stack constraints
    arm64: entry: improve data abort handling of tagged pointers
    arm64: hw_breakpoint: fix watchpoint matching for tagged pointers
    arm64: traps: fix userspace cache maintenance emulation on a tagged pointer

    Linus Torvalds
     

11 May, 2017

2 commits

  • If the caller has set __GFP_NOWARN, don't print the following message:

    vmap allocation for size 15736832 failed: use vmalloc=<size> to increase
    size.

    This can happen with the ARM/Linux or ARM64/Linux module loader built
    with CONFIG_ARM{,64}_MODULE_PLTS=y which does a first attempt at loading
    a large module from module space, then falls back to vmalloc space.
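
    The warning in alloc_vmap_area() is then gated on the caller's gfp flags,
    roughly (sketch):

    if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit())
            pr_warn("vmap allocation for size %lu failed: use vmalloc=<size> to increase size\n",
                    size);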

    Acked-by: Michal Hocko
    Signed-off-by: Florian Fainelli
    Signed-off-by: Catalin Marinas

    Florian Fainelli
     
  • Pull RCU updates from Ingo Molnar:
    "The main changes are:

    - Debloat RCU headers

    - Parallelize SRCU callback handling (plus overlapping patches)

    - Improve the performance of Tree SRCU on a CPU-hotplug stress test

    - Documentation updates

    - Miscellaneous fixes"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits)
    rcu: Open-code the rcu_cblist_n_lazy_cbs() function
    rcu: Open-code the rcu_cblist_n_cbs() function
    rcu: Open-code the rcu_cblist_empty() function
    rcu: Separately compile large rcu_segcblist functions
    srcu: Debloat the header
    srcu: Adjust default auto-expediting holdoff
    srcu: Specify auto-expedite holdoff time
    srcu: Expedite first synchronize_srcu() when idle
    srcu: Expedited grace periods with reduced memory contention
    srcu: Make rcutorture writer stalls print SRCU GP state
    srcu: Exact tracking of srcu_data structures containing callbacks
    srcu: Make SRCU be built by default
    srcu: Fix Kconfig botch when SRCU not selected
    rcu: Make non-preemptive schedule be Tasks RCU quiescent state
    srcu: Expedite srcu_schedule_cbs_snp() callback invocation
    srcu: Parallelize callback handling
    kvm: Move srcu_struct fields to end of struct kvm
    rcu: Fix typo in PER_RCU_NODE_PERIOD header comment
    rcu: Use true/false in assignment to bool
    rcu: Use bool value directly
    ...

    Linus Torvalds
     

10 May, 2017

1 commit


09 May, 2017

14 commits

  • Merge more updates from Andrew Morton:

    - the rest of MM

    - various misc things

    - procfs updates

    - lib/ updates

    - checkpatch updates

    - kdump/kexec updates

    - add kvmalloc helpers, use them

    - time helper updates for Y2038 issues. We're almost ready to remove
    current_fs_time() but that awaits a btrfs merge.

    - add tracepoints to DAX

    * emailed patches from Andrew Morton : (114 commits)
    drivers/staging/ccree/ssi_hash.c: fix build with gcc-4.4.4
    selftests/vm: add a test for virtual address range mapping
    dax: add tracepoint to dax_insert_mapping()
    dax: add tracepoint to dax_writeback_one()
    dax: add tracepoints to dax_writeback_mapping_range()
    dax: add tracepoints to dax_load_hole()
    dax: add tracepoints to dax_pfn_mkwrite()
    dax: add tracepoints to dax_iomap_pte_fault()
    mtd: nand: nandsim: convert to memalloc_noreclaim_*()
    treewide: convert PF_MEMALLOC manipulations to new helpers
    mm: introduce memalloc_noreclaim_{save,restore}
    mm: prevent potential recursive reclaim due to clearing PF_MEMALLOC
    mm/huge_memory.c: deposit a pgtable for DAX PMD faults when required
    mm/huge_memory.c: use zap_deposited_table() more
    time: delete CURRENT_TIME_SEC and CURRENT_TIME
    gfs2: replace CURRENT_TIME with current_time
    apparmorfs: replace CURRENT_TIME with current_time()
    lustre: replace CURRENT_TIME macro
    fs: ubifs: replace CURRENT_TIME_SEC with current_time
    fs: ufs: use ktime_get_real_ts64() for birthtime
    ...

    Linus Torvalds
     
  • The previous patch ("mm: prevent potential recursive reclaim due to
    clearing PF_MEMALLOC") has shown that simply setting and clearing
    PF_MEMALLOC in current->flags can result in wrongly clearing a
    pre-existing PF_MEMALLOC flag and potentially lead to recursive reclaim.
    Let's introduce helpers that support proper nesting by saving the
    previous stat of the flag, similar to the existing memalloc_noio_* and
    memalloc_nofs_* helpers. Convert existing setting/clearing of
    PF_MEMALLOC within mm to the new helpers.

    There are no known issues with the converted code, but the change makes
    it more robust.
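
    The new helpers are small wrappers around current->flags, modelled after
    the existing memalloc_noio_* helpers, roughly (sketch of the pattern):

    static inline unsigned int memalloc_noreclaim_save(void)
    {
            unsigned int flags = current->flags & PF_MEMALLOC;

            current->flags |= PF_MEMALLOC;
            return flags;   /* remember whether the flag was already set */
    }

    static inline void memalloc_noreclaim_restore(unsigned int flags)
    {
            current->flags = (current->flags & ~PF_MEMALLOC) | flags;
    }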

    Link: http://lkml.kernel.org/r/20170405074700.29871-3-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Suggested-by: Michal Hocko
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Andrey Ryabinin
    Cc: Boris Brezillon
    Cc: Chris Leech
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: Josef Bacik
    Cc: Lee Duncan
    Cc: Michal Hocko
    Cc: Richard Weinberger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Patch series "more robust PF_MEMALLOC handling"

    This series aims to unify the setting and clearing of PF_MEMALLOC, which
    prevents recursive reclaim. There are some places that clear the flag
    unconditionally from current->flags, which may result in clearing a
    pre-existing flag. This already resulted in a bug report that Patch 1
    fixes (without the new helpers, to make backporting easier). Patch 2
    introduces the new helpers, modelled after existing memalloc_noio_* and
    memalloc_nofs_* helpers, and converts mm core to use them. Patches 3
    and 4 convert non-mm code.

    This patch (of 4):

    __alloc_pages_direct_compact() sets PF_MEMALLOC to prevent deadlock
    during page migration by lock_page() (see the comment in
    __unmap_and_move()). Then it unconditionally clears the flag, which can
    clear a pre-existing PF_MEMALLOC flag and result in recursive reclaim.
    This was not a problem until commit a8161d1ed609 ("mm, page_alloc:
    restructure direct compaction handling in slowpath"), because direct
    compaction was called only after direct reclaim, which was skipped when
    the PF_MEMALLOC flag was set.

    Even now it's only a theoretical issue, as the new callsite of
    __alloc_pages_direct_compact() is reached only for costly orders and
    when gfp_pfmemalloc_allowed() is true, which means either
    __GFP_NOMEMALLOC is in gfp_flags or in_interrupt() is true. There is no
    such known context, but let's play it safe and make
    __alloc_pages_direct_compact() robust for cases where PF_MEMALLOC is
    already set.

    Fixes: a8161d1ed609 ("mm, page_alloc: restructure direct compaction handling in slowpath")
    Link: http://lkml.kernel.org/r/20170405074700.29871-2-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Reported-by: Andrey Ryabinin
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Boris Brezillon
    Cc: Chris Leech
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: Josef Bacik
    Cc: Lee Duncan
    Cc: Michal Hocko
    Cc: Richard Weinberger
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Although all architectures use a deposited page table for THP on
    anonymous VMAs, some architectures (s390 and powerpc) require the
    deposited storage even for file backed VMAs due to quirks of their MMUs.

    This patch adds support for depositing a table in DAX PMD fault handling
    path for archs that require it. Other architectures should see no
    functional changes.

    Link: http://lkml.kernel.org/r/20170411174233.21902-3-oohall@gmail.com
    Signed-off-by: Oliver O'Halloran
    Cc: Reza Arbab
    Cc: Balbir Singh
    Cc: linux-nvdimm@ml01.01.org
    Cc: Oliver O'Halloran
    Cc: Aneesh Kumar K.V
    Cc: "Kirill A. Shutemov"
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oliver O'Halloran
     
  • Depending on the flags of the PMD being zapped there may or may not be a
    deposited pgtable to be freed. In two of the three cases this is open
    coded while the third uses the zap_deposited_table() helper. This patch
    converts the others to use the helper to clean things up a bit.

    Link: http://lkml.kernel.org/r/20170411174233.21902-2-oohall@gmail.com
    Cc: Reza Arbab
    Cc: Balbir Singh
    Cc: linux-nvdimm@ml01.01.org
    Cc: Oliver O'Halloran
    Cc: Aneesh Kumar K.V
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oliver O'Halloran
     
  • Commit afddba49d18f ("fs: introduce write_begin, write_end, and
    perform_write aops") introduced AOP_FLAG_UNINTERRUPTIBLE flag which was
    checked in pagecache_write_begin(), but that check was removed by
    4e02ed4b4a2f ("fs: remove prepare_write/commit_write").

    Between these two commits, commit d9414774dc0c ("cifs: Convert cifs to
    new aops.") added a check in cifs_write_begin(), but that check was soon
    removed by commit a98ee8c1c707 ("[CIFS] fix regression in
    cifs_write_begin/cifs_write_end").

    Therefore, AOP_FLAG_UNINTERRUPTIBLE flag is checked nowhere. Let's
    remove this flag. This patch has no functionality changes.

    Link: http://lkml.kernel.org/r/1489294781-53494-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Reviewed-by: Jeff Layton
    Reviewed-by: Christoph Hellwig
    Cc: Nick Piggin
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • __vmalloc* allows users to provide gfp flags for the underlying
    allocation. This API is quite popular

    $ git grep "=[[:space:]]__vmalloc\|return[[:space:]]*__vmalloc" | wc -l
    77

    The only problem is that many people are not aware that they really want
    to give __GFP_HIGHMEM along with other flags, because there is really no
    reason to consume precious low memory on CONFIG_HIGHMEM systems for pages
    which are mapped to the kernel vmalloc space. About half of the users
    don't use this flag, though. This signals that we have made the API
    unnecessarily complex.

    This patch simply uses __GFP_HIGHMEM implicitly when allocating pages to
    be mapped to the vmalloc space. Current users which add __GFP_HIGHMEM
    are simplified and drop the flag.

    Link: http://lkml.kernel.org/r/20170307141020.29107-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Matthew Wilcox
    Cc: Al Viro
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Cristopher Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Currently vzalloc() is used in the swap code to allocate various data
    structures, such as the swap cache, the swap slots cache, cluster info,
    etc., because the size may be too large on some systems, so that a normal
    kzalloc() may fail. But using kzalloc() has some advantages, for example,
    less memory fragmentation, less TLB pressure, etc. So change the data
    structure allocation in the swap code to use kvzalloc(), which will try
    kzalloc() first and fall back to vzalloc() if kzalloc() fails.

    In general, although kmalloc() will reduce the number of high-order
    pages in short term, vmalloc() will cause more pain for memory
    fragmentation in the long term. And the swap data structure allocation
    that is changed in this patch is expected to be long term allocation.

    From Dave Hansen:
    "for example, we have a two-page data structure. vmalloc() takes two
    effectively random order-0 pages, probably from two different 2M pages
    and pins them. That "kills" two 2M pages. kmalloc(), allocating two
    *contiguous* pages, will not cross a 2M boundary. That means it will
    only "kill" the possibility of a single 2M page. More 2M pages == less
    fragmentation."

    The allocation in this patch occurs at swapon time, which is usually done
    during system boot, so usually we have a high chance of allocating the
    contiguous pages successfully.

    The allocation for swap_map[] in struct swap_info_struct is not changed,
    because that is usually quite large and vmalloc_to_page() is used for
    it. That makes it a little harder to change.
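
    A typical conversion elsewhere in the patch therefore follows this pattern
    (illustrative example using cluster_info; the exact sizes and call sites
    are in the patch itself):

    /* try kzalloc() first, with a transparent vzalloc() fallback */
    cluster_info = kvzalloc(nr_clusters * sizeof(*cluster_info), GFP_KERNEL);
    if (!cluster_info)
            return -ENOMEM;
    /* ... */
    kvfree(cluster_info);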

    Link: http://lkml.kernel.org/r/20170407064911.25447-1-ying.huang@intel.com
    Signed-off-by: Huang Ying
    Acked-by: Tim Chen
    Acked-by: Michal Hocko
    Acked-by: Rik van Riel
    Cc: Dave Hansen
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • There are many code paths opencoding kvmalloc. Let's use the helper
    instead. The main difference to kvmalloc is that those users are
    usually not considering all the aspects of the memory allocator. E.g.
    allocation requests
    Reviewed-by: Boris Ostrovsky # Xen bits
    Acked-by: Kees Cook
    Acked-by: Vlastimil Babka
    Acked-by: Andreas Dilger # Lustre
    Acked-by: Christian Borntraeger # KVM/s390
    Acked-by: Dan Williams # nvdim
    Acked-by: David Sterba # btrfs
    Acked-by: Ilya Dryomov # Ceph
    Acked-by: Tariq Toukan # mlx4
    Acked-by: Leon Romanovsky # mlx5
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Herbert Xu
    Cc: Anton Vorontsov
    Cc: Colin Cross
    Cc: Tony Luck
    Cc: "Rafael J. Wysocki"
    Cc: Ben Skeggs
    Cc: Kent Overstreet
    Cc: Santosh Raspatur
    Cc: Hariprasad S
    Cc: Yishai Hadas
    Cc: Oleg Drokin
    Cc: "Yan, Zheng"
    Cc: Alexander Viro
    Cc: Alexei Starovoitov
    Cc: Eric Dumazet
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • vhost code uses __GFP_REPEAT when allocating vhost_virtqueue and
    vhost_vsock because it would really like to prefer kmalloc to the
    vmalloc fallback - see 23cc5a991c7a ("vhost-net: extend device
    allocation to vmalloc") for more context. Michael Tsirkin has also
    noted:

    "__GFP_REPEAT overhead is during allocation time. Using vmalloc means
    all accesses are slowed down. Allocation is not on data path, accesses
    are."

    The same applies to the other vhost_kvzalloc users.

    Let's teach kvmalloc_node to handle __GFP_REPEAT properly (a sketch of
    the resulting flag handling follows this entry). There are two things
    to be careful about. First, we should prevent the OOM killer from
    being invoked, so __GFP_NORETRY has to be applied by default;
    secondly, __GFP_REPEAT has to be overridden for !costly order
    requests, because __GFP_REPEAT is ignored for !costly orders anyway.

    Supporting __GFP_REPEAT-like semantics for !costly requests is
    possible but would require changes in the page allocator. That is out
    of the scope of this patch.

    This patch shouldn't introduce any functional change.

    Link: http://lkml.kernel.org/r/20170306103032.2540-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Michael S. Tsirkin
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
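
    A sketch of the resulting flag handling, close in spirit to the
    mm/util.c code of that era but not a verbatim copy:

        void *kvmalloc_node(size_t size, gfp_t flags, int node)
        {
                gfp_t kmalloc_flags = flags;
                void *ret;

                if (size > PAGE_SIZE) {
                        /* we have a fallback, so do not warn about failures */
                        kmalloc_flags |= __GFP_NOWARN;

                        /*
                         * Honour __GFP_REPEAT only for costly orders; for
                         * !costly requests (where it is ignored anyway) make
                         * sure the allocator fails rather than retries, which
                         * also keeps the OOM killer out of the picture.
                         */
                        if (!(kmalloc_flags & __GFP_REPEAT) ||
                            size <= PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)
                                kmalloc_flags |= __GFP_NORETRY;
                }

                ret = kmalloc_node(size, kmalloc_flags, node);

                /* a vmalloc fallback makes no sense for sub-page requests */
                if (ret || size <= PAGE_SIZE)
                        return ret;

                return __vmalloc_node_flags(size, node, flags);
        }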
     
  • __vmalloc_node_flags used to be static inline, but this was changed
    by "mm: introduce kv[mz]alloc helpers" because kvmalloc_node needs to
    use it as well and that code lives outside of the vmalloc proper. I
    hadn't realized that changing this would lead to a subtle bug, though.
    The function is also responsible for tracking its caller, which is
    then printed by /proc/vmallocinfo. If __vmalloc_node_flags is not
    inline, then only its direct users (e.g. v[mz]alloc) are recorded as
    callers, which reduces the usefulness of this debugging feature
    considerably. It simply doesn't help to see that the given range
    belongs to vmalloc as a caller:

    0xffffc90002c79000-0xffffc90002c7d000 16384 vmalloc+0x16/0x18 pages=3 vmalloc N0=3
    0xffffc90002c81000-0xffffc90002c85000 16384 vmalloc+0x16/0x18 pages=3 vmalloc N1=3
    0xffffc90002c8d000-0xffffc90002c91000 16384 vmalloc+0x16/0x18 pages=3 vmalloc N1=3
    0xffffc90002c95000-0xffffc90002c99000 16384 vmalloc+0x16/0x18 pages=3 vmalloc N1=3

    We really want to catch the _caller_ of the vmalloc function. Fix
    this issue by making __vmalloc_node_flags static inline again (a
    sketch of why inlining matters for caller tracking follows this
    entry).

    Link: http://lkml.kernel.org/r/20170502134657.12381-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
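
    A sketch of why the inlining matters for caller tracking; the
    __vmalloc_node() call below mirrors the vmalloc code of that time and
    should be treated as illustrative:

        /*
         * As a static inline, __builtin_return_address(0) is evaluated in
         * the body of the real caller (e.g. a driver), so /proc/vmallocinfo
         * records that caller. As an out-of-line function it would record
         * whichever internal wrapper called it, e.g. vmalloc() itself.
         */
        static inline void *__vmalloc_node_flags(unsigned long size,
                                                 int node, gfp_t flags)
        {
                return __vmalloc_node(size, 1, flags, PAGE_KERNEL, node,
                                      __builtin_return_address(0));
        }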
     
  • Patch series "kvmalloc", v5.

    There are many open-coded kmalloc-with-vmalloc-fallback instances in
    the tree. Most of them are not careful enough or simply do not care
    about the underlying semantics of the kmalloc/page allocator, which
    means that a) some vmalloc fallbacks are basically unreachable because
    the kmalloc part will keep retrying until it succeeds, and b) the page
    allocator can invoke really disruptive steps like the OOM killer to
    move forward, which doesn't sound appropriate when a vmalloc fallback
    is available.

    As can be seen, implementing kvmalloc requires quite an intimate
    knowledge of the page allocator and the memory reclaim internals,
    which strongly suggests that the helper should be implemented in the
    memory subsystem proper.

    Most callers I could find have been converted to use the helper
    instead; this is patch 6. There are some more relying on __GFP_REPEAT
    in the networking stack, which I have converted as well, and Eric
    Dumazet was not opposed [2] to converting them.

    [1] http://lkml.kernel.org/r/20170130094940.13546-1-mhocko@kernel.org
    [2] http://lkml.kernel.org/r/1485273626.16328.301.camel@edumazet-glaptop3.roam.corp.google.com

    This patch (of 9):

    Using kmalloc with a vmalloc fallback for larger allocations is a
    common pattern in the kernel code. Yet we do not have any common
    helper for that, so users have invented their own helpers, some of
    them quite creatively. Let's just add kv[mz]alloc and make sure it is
    implemented properly (a sketch of the new helpers follows this entry).
    This implementation makes sure not to apply a large memory pressure
    for > PAGE_SIZE requests (__GFP_NORETRY) and also not to warn about
    allocation failures. This also rules out the OOM killer, as vmalloc is
    a more appropriate fallback than a disruptive user-visible action.

    This patch also changes some existing users and removes the helpers
    that are specific to them. In some cases this is not possible (e.g.
    ext4_kvmalloc, libcfs_kvzalloc) because those seem to be broken and
    require GFP_NO{FS,IO} context, which is not vmalloc compatible in
    general (note that the page table allocation is GFP_KERNEL). Those
    need to be fixed separately.

    While we are at it, document the unsupported gfp mask in
    __vmalloc{_node} because there seems to be a lot of confusion out
    there. kvmalloc_node will warn about flags that are not compatible
    with (i.e. not a superset of) GFP_KERNEL, to catch new abusers.
    Existing ones will have to die slowly.

    [sfr@canb.auug.org.au: f2fs fixup]
    Link: http://lkml.kernel.org/r/20170320163735.332e64b7@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20170306103032.2540-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Stephen Rothwell
    Reviewed-by: Andreas Dilger [ext4 part]
    Acked-by: Vlastimil Babka
    Cc: John Hubbard
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
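
    A sketch of the helper family the series introduces; the signatures
    follow what the changelog suggests, see include/linux/mm.h for the
    authoritative definitions:

        /* try kmalloc first, fall back to vmalloc for larger requests */
        void *kvmalloc_node(size_t size, gfp_t flags, int node);

        static inline void *kvmalloc(size_t size, gfp_t flags)
        {
                return kvmalloc_node(size, flags, NUMA_NO_NODE);
        }

        static inline void *kvzalloc(size_t size, gfp_t flags)
        {
                return kvmalloc(size, flags | __GFP_ZERO);
        }

        /* memory from either path is released with kvfree() */
        void kvfree(const void *addr);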
     
  • The main goal of direct compaction is to form a high-order page for
    allocation, but it should also help against long-term fragmentation when
    possible.

    Most lower-than-pageblock-order compactions are for non-movable
    allocations, which means that if we compact in a movable pageblock and
    terminate as soon as we create the high-order page, it's unlikely that
    the fallback heuristics will claim the whole block. Instead there might
    be a single unmovable page in a pageblock full of movable pages, and the
    next unmovable allocation might pick another pageblock and increase
    long-term fragmentation.

    To help against such scenarios, this patch changes the termination
    criteria for compaction so that the current pageblock is finished
    even when a suitable high-order page already exists (an illustrative
    sketch of the new criteria follows this entry). Note that the
    high-order page might have formed elsewhere in the zone due to
    parallel activity, but this patch doesn't try to detect that.

    This is only done with sync compaction, because async compaction is
    limited to pageblocks of the same migratetype, where it cannot result
    in a migratetype fallback. (Async compaction also eagerly skips
    order-aligned blocks where isolation fails, which is against the goal
    of migrating away as much of the pageblock as possible.)

    As a result of this patch, long-term memory fragmentation should be
    reduced.

    In testing based on 4.9 kernel with stress-highalloc from mmtests
    configured for order-4 GFP_KERNEL allocations, this patch has reduced
    the number of unmovable allocations falling back to movable pageblocks
    by 20%. The number

    Link: http://lkml.kernel.org/r/20170307131545.28577-9-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
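
    An illustrative sketch of the changed termination criteria; cc is the
    usual compact_control and suitable_page_found stands in for the
    existing free-page check, so the real mm/compaction.c diff differs in
    detail:

        if (suitable_page_found) {
                /* movable allocations cannot cause a migratetype fallback */
                if (cc->migratetype == MIGRATE_MOVABLE)
                        return COMPACT_SUCCESS;

                /* async compaction, or the current pageblock is already done */
                if (cc->mode == MIGRATE_ASYNC ||
                    IS_ALIGNED(cc->migrate_pfn, pageblock_nr_pages))
                        return COMPACT_SUCCESS;

                /* otherwise keep migrating up to the pageblock boundary */
                return COMPACT_CONTINUE;
        }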
     
  • The migrate scanner in async compaction is currently limited to
    MIGRATE_MOVABLE pageblocks. This is a heuristic intended to reduce
    latency, based on the assumption that non-MOVABLE pageblocks are
    unlikely to contain movable pages.

    However, with the exception of THPs, most high-order allocations are
    not movable. Should the async compaction succeed, this increases the
    chance that the non-MOVABLE allocations will fall back to a MOVABLE
    pageblock, making the long-term fragmentation worse.

    This patch attempts to help the situation by changing async direct
    compaction so that the migrate scanner only scans pageblocks of the
    requested migratetype (a sketch of this check follows this entry). If
    it's a non-MOVABLE type and there are such pageblocks that do contain
    movable pages, chances are that the allocation can succeed within one
    of those pageblocks, removing the need for a fallback. If that fails,
    the subsequent sync attempt will ignore this restriction.

    In testing based on 4.9 kernel with stress-highalloc from mmtests
    configured for order-4 GFP_KERNEL allocations, this patch has reduced
    the number of unmovable allocations falling back to movable pageblocks
    by 30%. The number of movable allocations falling back is reduced by
    12%.

    Link: http://lkml.kernel.org/r/20170307131545.28577-8-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
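
    A sketch approximating the new migrate scanner check; the
    compact_control field names follow the usual layout, and the exact
    helper in mm/compaction.c may differ:

        static bool suitable_migration_source(struct compact_control *cc,
                                              struct page *page)
        {
                int block_mt;

                /* only async direct compaction is restricted */
                if (cc->mode != MIGRATE_ASYNC || !cc->direct_compaction)
                        return true;

                block_mt = get_pageblock_migratetype(page);

                if (cc->migratetype == MIGRATE_MOVABLE)
                        return is_migrate_movable(block_mt);

                /* scan only pageblocks of the requested migratetype */
                return block_mt == cc->migratetype;
        }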