31 Jul, 2014

1 commit

  • commit 694617474e33b8603fc76e090ed7d09376514b1a upstream.

    The patch 3e374919b314f20e2a04f641ebc1093d758f66a4 is supposed to fix the
    problem where kmem_cache_create incorrectly reports duplicate cache name
    and fails. The problem is described in the header of that patch.

    However, the patch doesn't really fix the problem because of these
    reasons:

    * the logic to test for debugging is reversed. It was intended to perform
    the check only if slub debugging is enabled (which implies that caches
    with the same parameters are not merged). Therefore, there should be
    #if !defined(CONFIG_SLUB) || defined(CONFIG_SLUB_DEBUG_ON)
    The current code has the condition reversed and performs the test if
    debugging is disabled.

    * slub debugging may be enabled or disabled based on kernel command line,
    CONFIG_SLUB_DEBUG_ON is just the default settings. Therefore the test
    based on definition of CONFIG_SLUB_DEBUG_ON is unreliable.

    This patch fixes the problem by removing the test
    "!defined(CONFIG_SLUB_DEBUG_ON)". Therefore, duplicate names are never
    checked if the SLUB allocator is used.
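
    As a rough illustration of the resulting guard (a minimal standalone
    sketch, not the kernel source; the CHECK_DUPLICATE_NAMES macro is
    invented for the example), the duplicate-name check stays compiled in
    only when the SLUB allocator is not selected at all:

    #include <stdio.h>

    /* After the fix, only the allocator choice decides whether the
     * duplicate cache name check is built. */
    #if !defined(CONFIG_SLUB)
    #define CHECK_DUPLICATE_NAMES 1
    #else
    #define CHECK_DUPLICATE_NAMES 0
    #endif

    int main(void)
    {
            printf("duplicate cache name check enabled: %d\n",
                   CHECK_DUPLICATE_NAMES);
            return 0;
    }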

    Note to stable kernel maintainers: when backporting this patch, please
    also backport the patch 3e374919b314f20e2a04f641ebc1093d758f66a4.

    Acked-by: David Rientjes
    Acked-by: Christoph Lameter
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Pekka Enberg
    Signed-off-by: Jiri Slaby

    Mikulas Patocka
     

28 Jul, 2014

4 commits

  • commit 7f88f88f83ed609650a01b18572e605ea50cd163 upstream.

    Commit 248ac0e1943a ("mm/vmalloc: remove guard page from between vmap
    blocks") had the side effect of making vmap_area.va_end member point to
    the next vmap_area.va_start. This was creating an artificial reference
    to vmalloc'ed objects and kmemleak was rarely reporting vmalloc() leaks.

    This patch marks the vmap_area containing pointers explicitly and
    reduces the min ref_count to 2 as vm_struct still contains a reference
    to the vmalloc'ed object. The kmemleak add_scan_area() function has
    been improved to allow a SIZE_MAX argument covering the rest of the
    object (for simpler calling sites).
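
    A minimal hedged sketch of how a caller might use that (the wrapper
    example_register_scan_area() is made up for illustration; only
    kmemleak_scan_area() is the real API):

    #include <linux/kmemleak.h>
    #include <linux/gfp.h>
    #include <linux/kernel.h>

    /* Ask kmemleak to scan from the given offset to the end of the
     * containing object, instead of computing the remaining size by hand. */
    static void example_register_scan_area(void *obj, size_t offset)
    {
            kmemleak_scan_area((char *)obj + offset, SIZE_MAX, GFP_KERNEL);
    }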

    Signed-off-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Catalin Marinas
     
  • commit b1a366500bd537b50c3aad26dc7df083ec03a448 upstream.

    shmem_fault() is the actual culprit in trinity's hole-punch starvation,
    and the most significant cause of such problems: since a page faulted is
    one that then appears page_mapped(), needing unmap_mapping_range() and
    i_mmap_mutex to be unmapped again.

    But it is not the only way in which a page can be brought into a hole in
    the radix_tree while that hole is being punched; and Vlastimil's testing
    implies that if enough other processors are busy filling in the hole,
    then shmem_undo_range() can be kept from completing indefinitely.

    shmem_file_splice_read() is the main other user of SGP_CACHE, which can
    instantiate shmem pagecache pages in the read-only case (without holding
    i_mutex, so perhaps concurrently with a hole-punch). Probably it's
    silly not to use SGP_READ already (using the ZERO_PAGE for holes): which
    ought to be safe, but might bring surprises - not a change to be rushed.

    shmem_read_mapping_page_gfp() is an internal interface used by
    drivers/gpu/drm GEM (and next by uprobes): it should be okay. And
    shmem_file_read_iter() uses the SGP_DIRTY variant of SGP_CACHE, when
    called internally by the kernel (perhaps for a stacking filesystem,
    which might rely on holes to be reserved): it's unclear whether it could
    be provoked to keep hole-punch busy or not.

    We could apply the same umbrella as now used in shmem_fault() to
    shmem_file_splice_read() and the others; but it looks ugly, and use over
    a range raises questions - should it actually be per page? can these get
    starved themselves?

    The origin of this part of the problem is my v3.1 commit d0823576bf4b
    ("mm: pincer in truncate_inode_pages_range"), once it was duplicated
    into shmem.c. It seemed like a nice idea at the time, to ensure
    (barring RCU lookup fuzziness) that there's an instant when the entire
    hole is empty; but the indefinitely repeated scans to ensure that make
    it vulnerable.

    Revert that "enhancement" to hole-punch from shmem_undo_range(), but
    retain the unproblematic rescanning when it's truncating; add a couple
    of comments there.

    Remove the "indices[0] >= end" test: that is now handled satisfactorily
    by the inner loop, and mem_cgroup_uncharge_start()/end() are too light
    to be worth avoiding here.

    But if we do not always loop indefinitely, we do need to handle the case
    of swap swizzled back to page before shmem_free_swap() gets it: add a
    retry for that case, as suggested by Konstantin Khlebnikov; and for the
    case of page swizzled back to swap, as suggested by Johannes Weiner.

    Signed-off-by: Hugh Dickins
    Reported-by: Sasha Levin
    Suggested-by: Vlastimil Babka
    Cc: Konstantin Khlebnikov
    Cc: Johannes Weiner
    Cc: Lukas Czerner
    Cc: Dave Jones
    Cc: [3.1+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Hugh Dickins
     
  • commit 8e205f779d1443a94b5ae81aa359cb535dd3021e upstream.

    Commit f00cdc6df7d7 ("shmem: fix faulting into a hole while it's
    punched") was buggy: Sasha sent a lockdep report to remind us that
    grabbing i_mutex in the fault path is a no-no (write syscall may already
    hold i_mutex while faulting user buffer).

    We tried a completely different approach (see following patch) but that
    proved inadequate: good enough for a rational workload, but not good
    enough against trinity - which forks off so many mappings of the object
    that contention on i_mmap_mutex while hole-puncher holds i_mutex builds
    into serious starvation when concurrent faults force the puncher to fall
    back to single-page unmap_mapping_range() searches of the i_mmap tree.

    So return to the original umbrella approach, but keep away from i_mutex
    this time. We really don't want to bloat every shmem inode with a new
    mutex or completion, just to protect this unlikely case from trinity.
    So extend the original with wait_queue_head on stack at the hole-punch
    end, and wait_queue item on the stack at the fault end.

    This involves further use of i_lock to guard against the races: lockdep
    has been happy so far, and I see fs/inode.c:unlock_new_inode() holds
    i_lock around wake_up_bit(), which is comparable to what we do here.
    i_lock is more convenient, but we could switch to shmem's info->lock.
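
    A simplified sketch of the on-stack waitqueue pattern described above
    (not the shmem code itself; the two helpers are hypothetical and the
    i_lock handshaking is omitted):

    #include <linux/wait.h>
    #include <linux/sched.h>

    /* Hole-punch side: the wait_queue_head lives on the puncher's stack. */
    static void puncher_side(void)
    {
            DECLARE_WAIT_QUEUE_HEAD_ONSTACK(waitq);

            /* ... publish &waitq where the fault path can find it ... */
            /* ... perform the hole punch ... */
            wake_up_all(&waitq);
    }

    /* Fault side: the wait entry lives on the faulting task's stack. */
    static void faulter_side(wait_queue_head_t *waitq)
    {
            DEFINE_WAIT(wait);

            prepare_to_wait(waitq, &wait, TASK_UNINTERRUPTIBLE);
            /* (a real user would re-check its condition before sleeping) */
            schedule();
            finish_wait(waitq, &wait);
    }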

    This issue has been tagged with CVE-2014-4171, which will require commit
    f00cdc6df7d7 and this and the following patch to be backported: we
    suggest to 3.1+, though in fact the trinity forkbomb effect might go
    back as far as 2.6.16, when madvise(,,MADV_REMOVE) came in - or might
    not, since much has changed, with i_mmap_mutex a spinlock before 3.0.
    Anyone running trinity on 3.0 and earlier? I don't think we need care.

    Signed-off-by: Hugh Dickins
    Reported-by: Sasha Levin
    Tested-by: Sasha Levin
    Cc: Vlastimil Babka
    Cc: Konstantin Khlebnikov
    Cc: Johannes Weiner
    Cc: Lukas Czerner
    Cc: Dave Jones
    Cc: [3.1+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Hugh Dickins
     
  • commit f00cdc6df7d7cfcabb5b740911e6788cb0802bdb upstream.

    Trinity finds that mmap access to a hole while it's punched from shmem
    can prevent the madvise(MADV_REMOVE) or fallocate(FALLOC_FL_PUNCH_HOLE)
    from completing, until the reader chooses to stop; with the puncher's
    hold on i_mutex locking out all other writers until it can complete.

    It appears that the tmpfs fault path is too light in comparison with its
    hole-punching path, lacking an i_data_sem to obstruct it; but we don't
    want to slow down the common case.

    Extend shmem_fallocate()'s existing range notification mechanism, so
    shmem_fault() can refrain from faulting pages into the hole while it's
    punched, waiting instead on i_mutex (when safe to sleep; or repeatedly
    faulting when not).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Hugh Dickins
    Reported-by: Sasha Levin
    Tested-by: Sasha Levin
    Cc: Dave Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Signed-off-by: Jiri Slaby

    Hugh Dickins
     

22 Jul, 2014

1 commit

  • commit b738d764652dc5aab1c8939f637112981fce9e0e upstream.

    shrink_inactive_list() used to wait 0.1s to avoid congestion when all
    the pages that were isolated from the inactive list were dirty but not
    under active writeback. That makes no real sense, and apparently causes
    major interactivity issues under some loads since 3.11.

    The ostensible reason for it was to wait for kswapd to start writing
    pages, but that seems questionable as well, since the congestion wait
    code seems to trigger for kswapd itself as well. Also, the logic behind
    delaying anything when we haven't actually started writeback is not
    clear - it only delays actually starting that writeback.

    We'll still trigger the congestion waiting if

    (a) the process is kswapd, and we hit pages flagged for immediate
    reclaim

    (b) the process is not kswapd, and the zone backing dev writeback is
    actually congested.

    This probably needs to be revisited, but as it is this fixes a reported
    regression.
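
    A hedged sketch of the remaining throttling policy (names and structure
    simplified, not the exact vmscan.c hunk):

    #include <linux/backing-dev.h>
    #include <linux/mmzone.h>
    #include <linux/jiffies.h>

    static void maybe_throttle(bool is_kswapd, bool hit_immediate_reclaim,
                               struct zone *zone)
    {
            if (is_kswapd && hit_immediate_reclaim)
                    congestion_wait(BLK_RW_ASYNC, HZ / 10);
            else if (!is_kswapd)
                    /* sleeps only if the zone's backing device is congested */
                    wait_iff_congested(zone, BLK_RW_ASYNC, HZ / 10);
    }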

    [mhocko@suse.cz: backport to 3.12 stable tree]
    Fixes: e2be15f6c3ee ('mm: vmscan: stall page reclaim and writeback pages based on dirty/writepage pages encountered')
    Reported-by: Felipe Contreras
    Pinpointed-by: Hillf Danton
    Cc: Michal Hocko
    Cc: Andrew Morton
    Cc: Mel Gorman
    Signed-off-by: Linus Torvalds
    Signed-off-by: Michal Hocko
    Signed-off-by: Jiri Slaby

    Linus Torvalds
     

18 Jul, 2014

4 commits

  • commit 391acf970d21219a2a5446282d3b20eace0c0d7a upstream.

    When running with kernel 3.15-rc7+, the following bug occurs:
    [ 9969.258987] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586
    [ 9969.359906] in_atomic(): 1, irqs_disabled(): 0, pid: 160655, name: python
    [ 9969.441175] INFO: lockdep is turned off.
    [ 9969.488184] CPU: 26 PID: 160655 Comm: python Tainted: G A 3.15.0-rc7+ #85
    [ 9969.581032] Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 1.39 11/16/2012
    [ 9969.706052] ffffffff81a20e60 ffff8803e941fbd0 ffffffff8162f523 ffff8803e941fd18
    [ 9969.795323] ffff8803e941fbe0 ffffffff8109995a ffff8803e941fc58 ffffffff81633e6c
    [ 9969.884710] ffffffff811ba5dc ffff880405c6b480 ffff88041fdd90a0 0000000000002000
    [ 9969.974071] Call Trace:
    [ 9970.003403] [] dump_stack+0x4d/0x66
    [ 9970.065074] [] __might_sleep+0xfa/0x130
    [ 9970.130743] [] mutex_lock_nested+0x3c/0x4f0
    [ 9970.200638] [] ? kmem_cache_alloc+0x1bc/0x210
    [ 9970.272610] [] cpuset_mems_allowed+0x27/0x140
    [ 9970.344584] [] ? __mpol_dup+0x63/0x150
    [ 9970.409282] [] __mpol_dup+0xe5/0x150
    [ 9970.471897] [] ? __mpol_dup+0x63/0x150
    [ 9970.536585] [] ? copy_process.part.23+0x606/0x1d40
    [ 9970.613763] [] ? trace_hardirqs_on+0xd/0x10
    [ 9970.683660] [] ? monotonic_to_bootbased+0x2f/0x50
    [ 9970.759795] [] copy_process.part.23+0x670/0x1d40
    [ 9970.834885] [] do_fork+0xd8/0x380
    [ 9970.894375] [] ? __audit_syscall_entry+0x9c/0xf0
    [ 9970.969470] [] SyS_clone+0x16/0x20
    [ 9971.030011] [] stub_clone+0x69/0x90
    [ 9971.091573] [] ? system_call_fastpath+0x16/0x1b

    The cause is that cpuset_mems_allowed() tries to take
    mutex_lock(&callback_mutex) under rcu_read_lock (which is held in
    __mpol_dup()). And in cpuset_mems_allowed(), the access to the cpuset is
    under rcu_read_lock, so in __mpol_dup we can reduce the rcu_read_lock
    protection region to protect only the access to the cpuset in
    current_cpuset_is_being_rebound(), and thus avoid this bug.
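
    A hedged sketch of the idea (not the upstream hunk; rebind_if_needed()
    is an invented wrapper): keep the RCU read-side section only around the
    cpuset access, so cpuset_mems_allowed(), which takes a mutex and may
    sleep, is never called under rcu_read_lock():

    #include <linux/cpuset.h>
    #include <linux/rcupdate.h>
    #include <linux/sched.h>

    static void rebind_if_needed(struct task_struct *tsk, nodemask_t *mems)
    {
            bool rebinding;

            rcu_read_lock();
            rebinding = current_cpuset_is_being_rebound();
            rcu_read_unlock();

            if (rebinding)
                    *mems = cpuset_mems_allowed(tsk);       /* may sleep */
    }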

    This patch is a temporary solution that just addresses the bug
    mentioned above; it cannot fix the long-standing issue with cpuset.mems
    rebinding on fork():

    "When the forker's task_struct is duplicated (which includes
    ->mems_allowed) and it races with an update to cpuset_being_rebound
    in update_tasks_nodemask() then the task's mems_allowed doesn't get
    updated. And the child task's mems_allowed can be wrong if the
    cpuset's nodemask changes before the child has been added to the
    cgroup's tasklist."

    Signed-off-by: Gu Zheng
    Acked-by: Li Zefan
    Signed-off-by: Tejun Heo
    Signed-off-by: Jiri Slaby

    Gu Zheng
     
  • commit d05f0cdcbe6388723f1900c549b4850360545201 upstream.

    In v2.6.34 commit 9d8cebd4bcd7 ("mm: fix mbind vma merge problem")
    introduced vma merging to mbind(), but it should have also changed the
    convention of passing start vma from queue_pages_range() (formerly
    check_range()) to new_vma_page(): vma merging may have already freed
    that structure, resulting in BUG at mm/mempolicy.c:1738 and probably
    worse crashes.

    Fixes: 9d8cebd4bcd7 ("mm: fix mbind vma merge problem")
    Reported-by: Naoya Horiguchi
    Tested-by: Naoya Horiguchi
    Signed-off-by: Hugh Dickins
    Acked-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Hugh Dickins
     
  • commit dc78327c0ea7da5186d8cbc1647bd6088c5c9fa5 upstream.

    With a kernel configured with ARM64_64K_PAGES && !TRANSPARENT_HUGEPAGE,
    the following is triggered at early boot:

    SMP: Total of 8 processors activated.
    devtmpfs: initialized
    Unable to handle kernel NULL pointer dereference at virtual address 00000008
    pgd = fffffe0000050000
    [00000008] *pgd=00000043fba00003, *pmd=00000043fba00003, *pte=00e0000078010407
    Internal error: Oops: 96000006 [#1] SMP
    Modules linked in:
    CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.15.0-rc864k+ #44
    task: fffffe03bc040000 ti: fffffe03bc080000 task.ti: fffffe03bc080000
    PC is at __list_add+0x10/0xd4
    LR is at free_one_page+0x270/0x638
    ...
    Call trace:
    __list_add+0x10/0xd4
    free_one_page+0x26c/0x638
    __free_pages_ok.part.52+0x84/0xbc
    __free_pages+0x74/0xbc
    init_cma_reserved_pageblock+0xe8/0x104
    cma_init_reserved_areas+0x190/0x1e4
    do_one_initcall+0xc4/0x154
    kernel_init_freeable+0x204/0x2a8
    kernel_init+0xc/0xd4

    This happens because init_cma_reserved_pageblock() calls
    __free_one_page() with pageblock_order as page order but it is bigger
    than MAX_ORDER. This in turn causes accesses past zone->free_list[].

    Fix the problem by changing init_cma_reserved_pageblock() such that it
    splits pageblock into individual MAX_ORDER pages if pageblock is bigger
    than a MAX_ORDER page.
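
    A hedged sketch of that split, simplified from the upstream change
    (set_page_refcounted() is an mm-internal helper):

    #include <linux/mm.h>
    #include <linux/mmzone.h>

    static void cma_free_reserved_pageblock(struct page *page)
    {
            if (pageblock_order >= MAX_ORDER) {
                    unsigned long i = pageblock_nr_pages;
                    struct page *p = page;

                    /* free the oversized pageblock in MAX_ORDER-1 chunks */
                    do {
                            set_page_refcounted(p);
                            __free_pages(p, MAX_ORDER - 1);
                            p += MAX_ORDER_NR_PAGES;
                    } while (i -= MAX_ORDER_NR_PAGES);
            } else {
                    set_page_refcounted(page);
                    __free_pages(page, pageblock_order);
            }
    }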

    In cases where !CONFIG_HUGETLB_PAGE_SIZE_VARIABLE, which is all
    architectures except for ia64, powerpc and tile at the moment, the
    “pageblock_order > MAX_ORDER” condition will be optimised out since both
    sides of the operator are constants. In cases where pageblock size is
    variable, the performance degradation should not be significant anyway
    since init_cma_reserved_pageblock() is called only at boot time at most
    MAX_CMA_AREAS times which by default is eight.

    Signed-off-by: Michal Nazarewicz
    Reported-by: Mark Salter
    Tested-by: Mark Salter
    Tested-by: Christopher Covington
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Marek Szyprowski
    Cc: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Michal Nazarewicz
     
  • commit 7cd2b0a34ab8e4db971920eef8982f985441adfb upstream.

    Oleg reports a division by zero error on zero-length write() to the
    percpu_pagelist_fraction sysctl:

    divide error: 0000 [#1] SMP DEBUG_PAGEALLOC
    CPU: 1 PID: 9142 Comm: badarea_io Not tainted 3.15.0-rc2-vm-nfs+ #19
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    task: ffff8800d5aeb6e0 ti: ffff8800d87a2000 task.ti: ffff8800d87a2000
    RIP: 0010: percpu_pagelist_fraction_sysctl_handler+0x84/0x120
    RSP: 0018:ffff8800d87a3e78 EFLAGS: 00010246
    RAX: 0000000000000f89 RBX: ffff88011f7fd000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000010
    RBP: ffff8800d87a3e98 R08: ffffffff81d002c8 R09: ffff8800d87a3f50
    R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000060
    R13: ffffffff81c3c3e0 R14: ffffffff81cfddf8 R15: ffff8801193b0800
    FS: 00007f614f1e9740(0000) GS:ffff88011f440000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007f614f1fa000 CR3: 00000000d9291000 CR4: 00000000000006e0
    Call Trace:
    proc_sys_call_handler+0xb3/0xc0
    proc_sys_write+0x14/0x20
    vfs_write+0xba/0x1e0
    SyS_write+0x46/0xb0
    tracesys+0xe1/0xe6

    However, if the percpu_pagelist_fraction sysctl is set by the user, it
    is also impossible to restore it to the kernel default since the user
    cannot write 0 to the sysctl.

    This patch allows the user to write 0 to restore the default behavior.
    It still requires a fraction equal to or larger than 8, however, as
    stated by the documentation for sanity. If a value in the range [1, 7]
    is written, the sysctl will return EINVAL.

    This successfully solves the divide by zero issue at the same time.
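
    A hedged sketch of the validation rule described above (not the actual
    sysctl handler; MIN_FRACTION and validate_pagelist_fraction() are
    invented names):

    #include <linux/errno.h>

    #define MIN_FRACTION 8          /* smallest accepted non-zero fraction */

    static int validate_pagelist_fraction(int new_fraction)
    {
            if (new_fraction == 0)
                    return 0;       /* 0 restores the kernel default */
            if (new_fraction < MIN_FRACTION)
                    return -EINVAL; /* values 1..7 are rejected */
            return 0;               /* 8 or larger is accepted */
    }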

    Signed-off-by: David Rientjes
    Reported-by: Oleg Drokin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    David Rientjes
     

02 Jul, 2014

10 commits

  • commit 1c8349a17137b93f0a83f276c764a6df1b9a116e upstream.

    When we perform a data integrity sync we tag all the dirty pages with
    PAGECACHE_TAG_TOWRITE at the start of ext4_da_writepages. Later we check
    for this tag in write_cache_pages_da and create a struct
    mpage_da_data containing contiguously indexed pages tagged with this
    tag and sync these pages with a call to mpage_da_map_and_submit. This
    process is done in a while loop until all the PAGECACHE_TAG_TOWRITE
    pages are synced. We also do journal start and stop in each iteration.
    journal_stop could initiate a journal commit which would call
    ext4_writepage, which in turn will call ext4_bio_write_page even for
    delayed OR unwritten buffers. When ext4_bio_write_page is called for
    such buffers, even though it does not sync them it clears the
    PAGECACHE_TAG_TOWRITE of the corresponding page, and hence these pages
    are not synced by the currently running data integrity sync. We
    end up with dirty pages although the sync has completed.

    This could cause a potential data loss when the sync call is followed
    by a truncate_pagecache call, which is exactly the case in
    collapse_range. (It will cause generic/127 failure in xfstests)

    To avoid this issue, we can use set_page_writeback_keepwrite instead of
    set_page_writeback, which doesn't clear TOWRITE tag.
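
    A hedged sketch of that choice (not the ext4 diff itself;
    mark_page_under_writeback() is an invented wrapper):

    #include <linux/mm.h>
    #include <linux/page-flags.h>

    static void mark_page_under_writeback(struct page *page, bool keep_towrite)
    {
            if (keep_towrite)
                    /* leaves PAGECACHE_TAG_TOWRITE set for the running sync */
                    set_page_writeback_keepwrite(page);
            else
                    set_page_writeback(page);
    }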

    Signed-off-by: Namjae Jeon
    Signed-off-by: Ashish Sangwan
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Jan Kara
    Signed-off-by: Jiri Slaby

    Namjae Jeon
     
  • commit 71abdc15adf8c702a1dd535f8e30df50758848d2 upstream.

    When kswapd exits, it can end up taking locks that were previously held
    by allocating tasks while they waited for reclaim. Lockdep currently
    warns about this:

    On Wed, May 28, 2014 at 06:06:34PM +0800, Gu Zheng wrote:
    > inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage.
    > kswapd2/1151 [HC0[0]:SC0[0]:HE1:SE1] takes:
    > (&sig->group_rwsem){+++++?}, at: exit_signals+0x24/0x130
    > {RECLAIM_FS-ON-W} state was registered at:
    > mark_held_locks+0xb9/0x140
    > lockdep_trace_alloc+0x7a/0xe0
    > kmem_cache_alloc_trace+0x37/0x240
    > flex_array_alloc+0x99/0x1a0
    > cgroup_attach_task+0x63/0x430
    > attach_task_by_pid+0x210/0x280
    > cgroup_procs_write+0x16/0x20
    > cgroup_file_write+0x120/0x2c0
    > vfs_write+0xc0/0x1f0
    > SyS_write+0x4c/0xa0
    > tracesys+0xdd/0xe2
    > irq event stamp: 49
    > hardirqs last enabled at (49): _raw_spin_unlock_irqrestore+0x36/0x70
    > hardirqs last disabled at (48): _raw_spin_lock_irqsave+0x2b/0xa0
    > softirqs last enabled at (0): copy_process.part.24+0x627/0x15f0
    > softirqs last disabled at (0): (null)
    >
    > other info that might help us debug this:
    > Possible unsafe locking scenario:
    >
    > CPU0
    > ----
    > lock(&sig->group_rwsem);
    >
    > lock(&sig->group_rwsem);
    >
    > *** DEADLOCK ***
    >
    > no locks held by kswapd2/1151.
    >
    > stack backtrace:
    > CPU: 30 PID: 1151 Comm: kswapd2 Not tainted 3.10.39+ #4
    > Call Trace:
    > dump_stack+0x19/0x1b
    > print_usage_bug+0x1f7/0x208
    > mark_lock+0x21d/0x2a0
    > __lock_acquire+0x52a/0xb60
    > lock_acquire+0xa2/0x140
    > down_read+0x51/0xa0
    > exit_signals+0x24/0x130
    > do_exit+0xb5/0xa50
    > kthread+0xdb/0x100
    > ret_from_fork+0x7c/0xb0

    This is because the kswapd thread is still marked as a reclaimer at the
    time of exit. But because it is exiting, nobody is actually waiting on
    it to make reclaim progress anymore, and it's nothing but a regular
    thread at this point. Be tidy and strip it of all its powers
    (PF_MEMALLOC, PF_SWAPWRITE, PF_KSWAPD, and the lockdep reclaim state)
    before returning from the thread function.
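
    A hedged sketch of that cleanup (close to, but not verbatim, the
    kswapd() exit path):

    #include <linux/sched.h>

    static void kswapd_strip_powers(struct task_struct *tsk)
    {
            tsk->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD);
            tsk->reclaim_state = NULL;
            lockdep_clear_current_reclaim_state();
    }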

    Signed-off-by: Johannes Weiner
    Reported-by: Gu Zheng
    Cc: Yasuaki Ishimatsu
    Cc: Tang Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Johannes Weiner
     
  • commit 7f39dda9d86fb4f4f17af0de170decf125726f8c upstream.

    Trinity reports BUG:

    sleeping function called from invalid context at kernel/locking/rwsem.c:47
    in_atomic(): 0, irqs_disabled(): 0, pid: 5787, name: trinity-c27

    __might_sleep < down_write < __put_anon_vma < page_get_anon_vma <
    migrate_pages < compact_zone < compact_zone_order < try_to_compact_pages ..

    Right, since conversion to mutex then rwsem, we should not put_anon_vma()
    from inside an rcu_read_lock()ed section: fix the two places that did so.
    And add might_sleep() to anon_vma_free(), as suggested by Peter Zijlstra.

    Fixes: 88c22088bf23 ("mm: optimize page_lock_anon_vma() fast-path")
    Reported-by: Dave Jones
    Signed-off-by: Hugh Dickins
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Hugh Dickins
     
  • commit 3ba08129e38437561df44c36b7ea9081185d5333 upstream.

    Currently the memory error handler handles action optional errors in a
    deferred manner by default. And if a recovery-aware application wants
    to handle them immediately, it can do so by setting the PF_MCE_EARLY
    flag. However, such a signal can be sent only to the main thread, so
    it's problematic if the application wants to have a dedicated thread to
    handle such signals.

    So this patch adds dedicated thread support to memory error handler. We
    have PF_MCE_EARLY flags for each thread separately, so with this patch
    AO signal is sent to the thread with PF_MCE_EARLY flag set, not the main
    thread. If you want to implement a dedicated thread, you call prctl()
    to set PF_MCE_EARLY on the thread.
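
    From userspace this uses the existing prctl interface; a minimal sketch
    of marking the calling thread (enable_early_mce_kill() is just an
    example wrapper):

    #include <sys/prctl.h>
    #include <stdio.h>

    static int enable_early_mce_kill(void)
    {
            /* request early BUS_MCEERR_AO delivery for this thread */
            if (prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0) < 0) {
                    perror("prctl(PR_MCE_KILL)");
                    return -1;
            }
            return 0;
    }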

    Memory error handler collects processes to be killed, so this patch lets
    it check PF_MCE_EARLY flag on each thread in the collecting routines.

    No behavioral change for all non-early kill cases.

    Tony said:

    : The old behavior was crazy - someone with a multithreaded process might
    : well expect that if they call prctl(PF_MCE_EARLY) in just one thread, then
    : that thread would see the SIGBUS with si_code = BUS_MCEERR_A0 - even if
    : that thread wasn't the main thread for the process.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Tony Luck
    Cc: Kamil Iskra
    Cc: Andi Kleen
    Cc: Borislav Petkov
    Cc: Chen Gong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Naoya Horiguchi
     
  • commit 74614de17db6fb472370c426d4f934d8d616edf2 upstream.

    When Linux sees an "action optional" machine check (where h/w has reported
    an error that is not in the current execution path) we generally do not
    want to signal a process, since most processes do not have a SIGBUS
    handler - we'd just prematurely terminate the process for a problem that
    they might never actually see.

    task_early_kill() decides whether to consider a process - and it checks
    whether this specific process has been marked for early signals with
    "prctl", or if the system administrator has requested early signals for
    all processes using /proc/sys/vm/memory_failure_early_kill.

    But for the MF_ACTION_REQUIRED case we must not defer. The error is in
    the execution path of the current thread so we must send the SIGBUS
    immediately.

    Fix by passing a flag argument through collect_procs*() to
    task_early_kill() so it knows whether we can defer or must take action.

    Signed-off-by: Tony Luck
    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Borislav Petkov
    Cc: Chen Gong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Tony Luck
     
  • commit a70ffcac741d31a406c1d2b832ae43d658e7e1cf upstream.

    When a thread in a multi-threaded application hits a machine check because
    of an uncorrectable error in memory - we want to send the SIGBUS with
    si.si_code = BUS_MCEERR_AR to that thread. Currently we fail to do that
    if the active thread is not the primary thread in the process.
    collect_procs() just finds primary threads and this test:

    if ((flags & MF_ACTION_REQUIRED) && t == current) {

    will see that the thread we found isn't the current thread and so send a
    si.si_code = BUS_MCEERR_AO to the primary (and nothing to the active
    thread at this time).

    We can fix this by checking whether "current" shares the same mm with the
    process that collect_procs() said owned the page. If so, we send the
    SIGBUS to current (with code BUS_MCEERR_AR).
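
    A hedged sketch of that decision (simplified; pick_signal_target() is
    an invented helper):

    #include <linux/sched.h>

    static struct task_struct *pick_signal_target(struct task_struct *owner,
                                                  bool action_required)
    {
            /* the faulting thread shares the mm: signal it directly */
            if (action_required && current->mm && current->mm == owner->mm)
                    return current;
            return owner;
    }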

    Signed-off-by: Tony Luck
    Signed-off-by: Naoya Horiguchi
    Reported-by: Otto Bruggeman
    Cc: Andi Kleen
    Cc: Borislav Petkov
    Cc: Chen Gong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Tony Luck
     
  • commit e58469bafd0524e848c3733bc3918d854595e20f upstream.

    The test_bit operations in get/set pageblock flags are expensive. This
    patch reads the bitmap on a word basis and uses shifts and masks to
    isolate the bits of interest. Similarly, masks are used to set a local
    copy of the bitmap and cmpxchg is then used to update the bitmap if
    there have been no other changes made in parallel.
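
    A hedged sketch of the word-based update (not the pageblock code
    itself; set_bits_in_word() is an invented helper):

    #include <linux/atomic.h>

    static void set_bits_in_word(unsigned long *word, unsigned long mask,
                                 unsigned long flags)
    {
            unsigned long old_word, new_word;

            do {
                    old_word = *word;
                    /* splice the bits of interest into a local copy */
                    new_word = (old_word & ~mask) | (flags & mask);
                    /* retry if someone else changed the word meanwhile */
            } while (cmpxchg(word, old_word, new_word) != old_word);
    }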

    In a test running dd onto tmpfs the overhead of the pageblock-related
    functions went from 1.27% in profiles to 0.5%.

    In addition to the performance benefits, this patch closes races that are
    possible between:

    a) get_ and set_pageblock_migratetype(), where get_pageblock_migratetype()
    reads part of the bits before and other part of the bits after
    set_pageblock_migratetype() has updated them.

    b) set_pageblock_migratetype() and set_pageblock_skip(), where the non-atomic
    read-modify-update set bit operation in set_pageblock_skip() will cause
    lost updates to some bits changed in the set_pageblock_migratetype().

    Joonsoo Kim first reported the case a) via code inspection. Vlastimil
    Babka's testing with a debug patch showed that either a) or b) occurs
    roughly once per mmtests' stress-highalloc benchmark (although not
    necessarily in the same pageblock). Furthermore, during development of
    unrelated compaction patches, it was observed that with frequent calls to
    {start,undo}_isolate_page_range() the race occurs several thousand
    times and has resulted in NULL pointer dereferences in move_freepages()
    and free_one_page() in places where free_list[migratetype] is
    manipulated by e.g. list_move(). Further debugging confirmed that
    migratetype had an invalid value of 6, causing out-of-bounds access to
    the free_list array.

    That confirmed that the race exists, although it may be extremely rare,
    and currently only fatal where page isolation is performed due to
    memory hot remove. Races on pageblocks being updated by
    set_pageblock_migratetype(), where both the old and new migratetype are
    lower than MIGRATE_RESERVE, currently cannot result in an invalid value
    being observed, although theoretically they may still lead to
    unexpected creation or destruction of MIGRATE_RESERVE pageblocks.
    Furthermore, things could get suddenly worse when memory isolation is
    used more, or when new migratetypes are added.

    After this patch, the race is no longer observed in testing.

    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Reported-by: Joonsoo Kim
    Reported-and-tested-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit 675becce15f320337499bc1a9356260409a5ba29 upstream.

    throttle_direct_reclaim() is meant to trigger during swap-over-network,
    during which the min watermark is treated as a pfmemalloc reserve. It
    throttles on the first node in the zonelist, but this is flawed.

    The user-visible impact is that a process running on a CPU whose local
    memory node has no ZONE_NORMAL will stall for prolonged periods of time,
    possibly indefinitely. This is due to throttle_direct_reclaim thinking
    the pfmemalloc reserves are depleted when in fact they don't exist on
    that node.

    On a NUMA machine running a 32-bit kernel (I know) allocation requests
    from CPUs on node 1 would detect no pfmemalloc reserves and the process
    gets throttled. This patch adjusts throttling of direct reclaim to
    throttle based on the first node in the zonelist that has a usable
    ZONE_NORMAL or lower zone.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit c177c81e09e517bbf75b67762cdab1b83aba6976 upstream.

    Currently hugepage migration is available for all archs which support
    pmd-level hugepages, but testing has been done only for x86_64 and there
    are bugs on other archs. So to avoid breaking such archs, this patch
    limits the availability strictly to x86_64 until developers of other
    archs get interested in enabling this feature.

    Simply disabling hugepage migration on non-x86_64 archs is not enough to
    fix the reported problem where sys_move_pages() hits the BUG_ON() in
    follow_page(FOLL_GET), so let's fix this by checking if hugepage
    migration is supported in vma_migratable().

    Signed-off-by: Naoya Horiguchi
    Reported-by: Michael Ellerman
    Tested-by: Michael Ellerman
    Acked-by: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: Tony Luck
    Cc: Russell King
    Cc: Martin Schwidefsky
    Cc: James Hogan
    Cc: Ralf Baechle
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Naoya Horiguchi
     
  • commit d4c54919ed86302094c0ca7d48a8cbd4ee753e92 upstream.

    The page table walker doesn't check non-present hugetlb entries in the
    common path, so hugetlb_entry() callbacks must check them. The reason
    for this behavior is that some callers want to handle it in their own way.

    [ I think that reason is bogus, btw - it should just do what the regular
    code does, which is to call the "pte_hole()" function for such hugetlb
    entries - Linus]

    However, some callers don't check it now, which causes unpredictable
    results, for example when we have a race between migrating a hugepage
    and reading /proc/pid/numa_maps. This patch fixes it by adding
    !pte_present checks on the buggy callbacks.

    This bug has existed for years and became visible with the introduction
    of hugepage migration.

    ChangeLog v2:
    - fix if condition (check !pte_present() instead of pte_present())

    Reported-by: Sasha Levin
    Signed-off-by: Naoya Horiguchi
    Cc: Rik van Riel
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    [ Backported to 3.15. Signed-off-by: Josh Boyer ]
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Naoya Horiguchi
     

20 Jun, 2014

5 commits

  • commit 624483f3ea82598ab0f62f1bdb9177f531ab1892 upstream.

    While working on the address sanitizer for the kernel I discovered a
    use-after-free bug in __put_anon_vma.

    For the last anon_vma, anon_vma->root is freed before the child
    anon_vma. Later, in anon_vma_free(anon_vma), we reference the already
    freed anon_vma->root to check its rwsem.

    This fixes it by freeing the child anon_vma before freeing
    anon_vma->root.

    Signed-off-by: Andrey Ryabinin
    Acked-by: Peter Zijlstra
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Andrey Ryabinin
     
  • commit 3e030ecc0fc7de10fd0da10c1c19939872a31717 upstream.

    When a memory error happens on an in-use page or (free and in-use)
    hugepage, the victim page is isolated with its refcount set to one.

    When you try to unpoison it later, unpoison_memory() calls put_page()
    for it twice in order to bring the page back to the free page pool
    (buddy or free hugepage list). However, if another memory error occurs
    on the page which we are unpoisoning, memory_failure() returns without
    releasing the refcount which was incremented in the same call at first,
    which results in a memory leak and inconsistent num_poisoned_pages
    statistics. This patch fixes it.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Naoya Horiguchi
     
  • commit 49e068f0b73dd042c186ffa9b420a9943e90389a upstream.

    The compaction freepage scanner implementation in isolate_freepages()
    starts by taking the current cc->free_pfn value as the first pfn. In a
    for loop, it scans from this first pfn to the end of the pageblock, and
    then subtracts pageblock_nr_pages from the first pfn to obtain the first
    pfn for the next for loop iteration.

    This means that when cc->free_pfn starts at offset X rather than being
    aligned on a pageblock boundary, the scanner will start at offset X in
    every scanned pageblock, ignoring potentially many free pages. Currently
    this can happen when

    a) zone's end pfn is not pageblock aligned, or

    b) through zone->compact_cached_free_pfn with CONFIG_HOLES_IN_ZONE
    enabled and a hole spanning the beginning of a pageblock

    This patch fixes the problem by aligning the initial pfn in
    isolate_freepages() to pageblock boundary. This also permits replacing
    the end-of-pageblock alignment within the for loop with a simple
    pageblock_nr_pages increment.
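
    A hedged sketch of the alignment (first_free_scan_pfn() is an invented
    helper; the real change adjusts isolate_freepages() directly):

    #include <linux/kernel.h>
    #include <linux/mmzone.h>

    static unsigned long first_free_scan_pfn(unsigned long cached_free_pfn)
    {
            /* start each scan at a pageblock boundary */
            return round_down(cached_free_pfn, pageblock_nr_pages);
    }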

    Signed-off-by: Vlastimil Babka
    Reported-by: Heesub Shin
    Acked-by: Minchan Kim
    Cc: Mel Gorman
    Acked-by: Joonsoo Kim
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Acked-by: Rik van Riel
    Cc: Dongjun Shin
    Cc: Sunghwan Yun
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Vlastimil Babka
     
  • commit 7ed695e069c3cbea5e1fd08f84a04536da91f584 upstream.

    Compaction of a zone is finished when the migrate scanner (which begins
    at the zone's lowest pfn) meets the free page scanner (which begins at
    the zone's highest pfn). This is detected in compact_zone() and in the
    case of direct compaction, the compact_blockskip_flush flag is set so
    that kswapd later resets the cached scanner pfn's, and a new compaction
    may again start at the zone's borders.

    The meeting of the scanners can happen during either scanner's activity.
    However, it may currently fail to be detected when it occurs in the free
    page scanner, due to two problems. First, isolate_freepages() keeps
    free_pfn at the highest block where it isolated pages from, for the
    purposes of not missing the pages that are returned back to allocator
    when migration fails. Second, failing to isolate enough free pages due
    to scanners meeting results in -ENOMEM being returned by
    migrate_pages(), which makes compact_zone() bail out immediately without
    calling compact_finished() that would detect scanners meeting.

    This failure to detect scanners meeting might result in repeated
    attempts at compaction of a zone that keep starting from the cached
    pfn's close to the meeting point, and quickly failing through the
    -ENOMEM path, without the cached pfns being reset, over and over. This
    has been observed (through additional tracepoints) in the third phase of
    the mmtests stress-highalloc benchmark, where the allocator runs on an
    otherwise idle system. The problem was observed in the DMA32 zone,
    which was used as a fallback to the preferred Normal zone, but on the
    4GB system it was actually the largest zone. The problem is even
    amplified for such fallback zone - the deferred compaction logic, which
    could (after being fixed by a previous patch) reset the cached scanner
    pfn's, is only applied to the preferred zone and not for the fallbacks.

    The problem in the third phase of the benchmark was further amplified by
    commit 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy") which
    resulted in a non-deterministic regression of the allocation success
    rate from ~85% to ~65%. This occurs in about half of benchmark runs,
    making bisection problematic. It is unlikely that the commit itself is
    buggy, but it should put more pressure on the DMA32 zone during phases 1
    and 2, which may leave it more fragmented in phase 3 and expose the bugs
    that this patch fixes.

    The fix is to make scanners meeting in isolate_freepages() stay that way,
    and to check in compact_zone() for scanners meeting when migrate_pages()
    returns -ENOMEM. The result is that compact_finished() also detects
    scanners meeting and sets the compact_blockskip_flush flag to make
    kswapd reset the scanner pfn's.

    The results in stress-highalloc benchmark show that the "regression" by
    commit 81c0a2bb515f in phase 3 no longer occurs, and phase 1 and 2
    allocation success rates are also significantly improved.

    Signed-off-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Vlastimil Babka
     
  • commit d3132e4b83e6bd383c74d716f7281d7c3136089c upstream.

    Compaction caches pfn's for its migrate and free scanners to avoid
    scanning the whole zone each time. In compact_zone(), the cached values
    are read to set up initial values for the scanners. There are several
    situations when these cached pfn's are reset to the first and last pfn
    of the zone, respectively. One of these situations is when a compaction
    has been deferred for a zone and is now being restarted during a direct
    compaction, which is also done in compact_zone().

    However, compact_zone() currently reads the cached pfn's *before*
    resetting them. This means the reset doesn't affect the compaction that
    performs it, and with good chance also subsequent compactions, as
    update_pageblock_skip() is likely to be called and update the cached
    pfn's to those being processed. Another chance for a successful reset
    is when a direct compaction detects that migration and free scanners
    meet (which has its own problems addressed by another patch) and sets
    update_pageblock_skip flag which kswapd uses to do the reset because it
    goes to sleep.

    This is clearly a bug that results in non-deterministic behavior, so
    this patch moves the cached pfn reset to be performed *before* the
    values are read.

    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Vlastimil Babka
     

09 Jun, 2014

2 commits

  • commit 5a838c3b60e3a36ade764cf7751b8f17d7c9c2da upstream.

    pcpu_chunk_struct_size = sizeof(struct pcpu_chunk) +
    BITS_TO_LONGS(pcpu_unit_pages) * sizeof(unsigned long)

    It could hardly ever be bigger than PAGE_SIZE even on a large-scale
    machine, but for consistency with its counterpart pcpu_mem_zalloc(),
    use pcpu_mem_free() instead.

    Commit b4916cb17c26 ("percpu: make pcpu_free_chunk() use
    pcpu_mem_free() instead of kfree()") addressed this problem, but
    missed this one.

    tj: commit message updated

    Signed-off-by: Jianyu Zhan
    Signed-off-by: Tejun Heo
    Fixes: 099a19d91ca4 ("percpu: allow limited allocation before slab is online")
    Signed-off-by: Jiri Slaby

    Jianyu Zhan
     
  • commit d5c9fde3dae750889168807038243ff36431d276 upstream.

    It is possible for "limit - setpoint + 1" to equal zero, after getting
    truncated to a 32-bit variable, resulting in a divide by zero error.

    Using the fully 64-bit divide functions avoids this problem. It also
    causes pos_ratio_polynom() to return the correct value when
    (setpoint - limit) exceeds 2^32.

    Also uninline pos_ratio_polynom, at Andrew's request.
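
    A hedged sketch of the overflow-safe division (safe_ratio() is an
    invented helper, not the writeback code):

    #include <linux/math64.h>
    #include <linux/types.h>

    static u64 safe_ratio(u64 numerator, u64 limit, u64 setpoint)
    {
            /* keep the divisor in 64 bits so it cannot truncate to zero */
            return div64_u64(numerator, limit - setpoint + 1);
    }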

    Signed-off-by: Rik van Riel
    Reviewed-by: Michal Hocko
    Cc: Aneesh Kumar K.V
    Cc: Mel Gorman
    Cc: Nishanth Aravamudan
    Cc: Luiz Capitulino
    Cc: Masayoshi Mizuma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Rik van Riel
     

06 Jun, 2014

3 commits

  • commit b985194c8c0a130ed155b71662e39f7eaea4876f upstream.

    When handling a free hugepage in memory failure, a race can happen if
    another thread hwpoisons this hugepage concurrently. So we need to
    check PageHWPoison instead of !PageHWPoison.

    If hwpoison_filter(p) returns true or a race happens, then we need to
    unlock_page(hpage).

    Signed-off-by: Chen Yucong
    Reviewed-by: Naoya Horiguchi
    Tested-by: Naoya Horiguchi
    Reviewed-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Chen Yucong
     
  • commit dd18dbc2d42af75fffa60c77e0f02220bc329829 upstream.

    It's critical for split_huge_page() (and migration) to catch and freeze
    all PMDs on the rmap walk. It gets tricky if there's a concurrent fork()
    or mremap() since we usually copy/move page table entries on dup_mm() or
    move_page_tables() without the rmap lock taken. To make this work we
    rely on the rmap walk order to not miss any entry. We expect to see the
    destination VMA after the source one for this to work correctly.

    But after switching rmap implementation to interval tree it's not always
    possible to preserve expected walk order.

    It works fine for dup_mm() since the new VMA has the same
    vma_start_pgoff() / vma_last_pgoff() and we explicitly insert the dst
    VMA after the src one with vma_interval_tree_insert_after().

    But on move_vma() the destination VMA can be merged into an adjacent one
    and as a result shifted left in the interval tree. Fortunately, we can
    detect the situation and prevent the race with the rmap walk by moving
    page table entries under the rmap lock. See commit 38a76013ad80.

    Problem is that we miss the lock when we move transhuge PMD. Most
    likely this bug caused the crash[1].

    [1] http://thread.gmane.org/gmane.linux.kernel.mm/96473

    Fixes: 108d6642ad81 ("mm anon rmap: remove anon_vma_moveto_tail")

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Andrea Arcangeli
    Cc: Rik van Riel
    Acked-by: Michel Lespinasse
    Cc: Dave Jones
    Cc: David Miller
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Kirill A. Shutemov
     
  • commit 1b17844b29ae042576bea588164f2f1e9590a8bc upstream.

    fixup_user_fault() is used by the futex code when the direct user access
    fails, and the futex code wants it to either map in the page in a usable
    form or return an error. It relied on handle_mm_fault() to map the
    page, and correctly checked the error return from that, but while that
    does map the page, it doesn't actually guarantee that the page will be
    mapped with sufficient permissions to be then accessed.

    So do the appropriate tests of the vma access rights by hand.

    [ Side note: arguably handle_mm_fault() could just do that itself, but
    we have traditionally done it in the caller, because some callers -
    notably get_user_pages() - have been able to access pages even when
    they are mapped with PROT_NONE. Maybe we should re-visit that design
    decision, but in the meantime this is the minimal patch. ]

    Found by Dave Jones running his trinity tool.

    Reported-by: Dave Jones
    Acked-by: Hugh Dickins
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Linus Torvalds
     

04 Jun, 2014

4 commits

  • commit d49ad9355420c743c736bfd1dee9eaa5b1a7722a upstream.

    When two threads have the same badness score, it's preferable to kill
    the thread group leader so that the actual process name is printed to
    the kernel log rather than the thread group name which may be shared
    amongst several processes.

    This was the behavior when select_bad_process() used to do
    for_each_process(), but it now iterates threads instead and leads to
    ambiguity.

    Signed-off-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    David Rientjes
     
  • commit 4d4048be8a93769350efa31d2482a038b7de73d0 upstream.

    find_lock_task_mm() expects it is called under rcu or tasklist lock, but
    it seems that at least oom_unkillable_task()->task_in_mem_cgroup() and
    mem_cgroup_out_of_memory()->oom_badness() can call it lockless.

    Perhaps we could fix the callers, but this patch simply adds rcu lock
    into find_lock_task_mm(). This also allows to simplify a bit one of its
    callers, oom_kill_process().

    Signed-off-by: Oleg Nesterov
    Cc: Sergey Dyasly
    Cc: Sameer Nanda
    Cc: "Eric W. Biederman"
    Cc: Frederic Weisbecker
    Cc: Mandeep Singh Baines
    Cc: "Ma, Xindong"
    Reviewed-by: Michal Hocko
    Cc: "Tu, Xiaobing"
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Oleg Nesterov
     
  • commit ad96244179fbd55b40c00f10f399bc04739b8e1f upstream.

    At least out_of_memory() calls has_intersects_mems_allowed() without
    even rcu_read_lock(); this is obviously buggy.

    Add the necessary rcu_read_lock(). This means that we can not simply
    return from the loop, we need "bool ret" and "break".

    While at it, swap the names of task_struct's (the argument and the
    local). This cleans up the code a little bit and avoids the unnecessary
    initialization.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Sergey Dyasly
    Tested-by: Sergey Dyasly
    Reviewed-by: Sameer Nanda
    Cc: "Eric W. Biederman"
    Cc: Frederic Weisbecker
    Cc: Mandeep Singh Baines
    Cc: "Ma, Xindong"
    Reviewed-by: Michal Hocko
    Cc: "Tu, Xiaobing"
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Oleg Nesterov
     
  • commit 1da4db0cd5c8a31d4468ec906b413e75e604b465 upstream.

    Change oom_kill.c to use for_each_thread() rather than the racy
    while_each_thread() which can loop forever if we race with exit.

    Note also that most users were buggy even if while_each_thread() was
    fine, the task can exit even _before_ rcu_read_lock().

    Fortunately the new for_each_thread() only requires the stable
    task_struct, so this change fixes both problems.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Sergey Dyasly
    Tested-by: Sergey Dyasly
    Reviewed-by: Sameer Nanda
    Cc: "Eric W. Biederman"
    Cc: Frederic Weisbecker
    Cc: Mandeep Singh Baines
    Cc: "Ma, Xindong"
    Reviewed-by: Michal Hocko
    Cc: "Tu, Xiaobing"
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Oleg Nesterov
     

29 May, 2014

1 commit

  • commit 7848a4bf51b34f41fcc9bd77e837126d99ae84e3 upstream.

    The soft lockup in freeing gigantic hugepages fixed in commit
    55f67141a892 ("mm: hugetlb: fix softlockup when a large number of
    hugepages are freed.") can also happen in return_unused_surplus_pages(),
    so let's fix it there as well.

    Signed-off-by: Masayoshi Mizuma
    Signed-off-by: Naoya Horiguchi
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Aneesh Kumar
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Mizuma, Masayoshi
     

15 May, 2014

3 commits

  • commit 55f67141a8927b2be3e51840da37b8a2320143ed upstream.

    When I decrease the value of nr_hugepages in procfs a lot, a softlockup
    happens. It is because there is no chance of a context switch during
    this process.

    On the other hand, when I allocate a large number of hugepages, there is
    some chance of a context switch, hence softlockup doesn't happen during
    that process. So it is necessary to add a context switch to the freeing
    process, the same as in the allocating process, to avoid softlockup.
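
    A hedged sketch of the idea (not the hugetlb diff; the loop shown is
    just illustrative):

    #include <linux/sched.h>

    static void free_many_pages_example(unsigned long nr_pages)
    {
            while (nr_pages--) {
                    /* ... free one huge page ... */
                    cond_resched();         /* let the scheduler run */
            }
    }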

    When I freed 12 TB of hugepages with kernel-2.6.32-358.el6, the freeing
    process occupied a CPU for over 150 seconds and the following softlockup
    message appeared twice or more.

    $ echo 6000000 > /proc/sys/vm/nr_hugepages
    $ cat /proc/sys/vm/nr_hugepages
    6000000
    $ grep ^Huge /proc/meminfo
    HugePages_Total: 6000000
    HugePages_Free: 6000000
    HugePages_Rsvd: 0
    HugePages_Surp: 0
    Hugepagesize: 2048 kB
    $ echo 0 > /proc/sys/vm/nr_hugepages

    BUG: soft lockup - CPU#16 stuck for 67s! [sh:12883] ...
    Pid: 12883, comm: sh Not tainted 2.6.32-358.el6.x86_64 #1
    Call Trace:
    free_pool_huge_page+0xb8/0xd0
    set_max_huge_pages+0x128/0x190
    hugetlb_sysctl_handler_common+0x113/0x140
    hugetlb_sysctl_handler+0x1e/0x20
    proc_sys_call_handler+0x97/0xd0
    proc_sys_write+0x14/0x20
    vfs_write+0xb8/0x1a0
    sys_write+0x51/0x90
    __audit_syscall_exit+0x265/0x290
    system_call_fastpath+0x16/0x1b

    I have not confirmed this problem with upstream kernels because I am
    not able to prepare a machine equipped with 12TB of memory now. However
    I confirmed that the number of freed hugepages was directly proportional
    to the required time.

    I measured the required times on a smaller machine. The measurements
    showed 130-145 hugepages freed per millisecond.

    Amount of decreasing      Required time    Decreasing rate
    hugepages                 (msec)           (pages/msec)
    ------------------------------------------------------------
    10,000 pages == 20GB      70 - 74          135-142
    30,000 pages == 60GB      208 - 229        131-144

    It means that decreasing 6TB of hugepages will trigger a softlockup
    with the default threshold of 20 sec, at this decreasing rate.

    Signed-off-by: Masayoshi Mizuma
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Wanpeng Li
    Cc: Aneesh Kumar
    Cc: KOSAKI Motohiro
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Mizuma, Masayoshi
     
  • commit 57e68e9cd65b4b8eb4045a1e0d0746458502554c upstream.

    A BUG_ON(!PageLocked) was triggered in mlock_vma_page() by Sasha Levin
    fuzzing with trinity. The call site try_to_unmap_cluster() does not lock
    the pages other than its check_page parameter (which is already locked).

    The BUG_ON in mlock_vma_page() is not documented and its purpose is
    somewhat unclear, but apparently it serializes against page migration,
    which could otherwise fail to transfer the PG_mlocked flag. This would
    not be fatal, as the page would be eventually encountered again, but
    NR_MLOCK accounting would become distorted nevertheless. This patch adds
    a comment to the BUG_ON in mlock_vma_page() and munlock_vma_page() to that
    effect.

    The call site try_to_unmap_cluster() is fixed so that for page !=
    check_page, trylock_page() is attempted (to avoid possible deadlocks as we
    already have check_page locked) and mlock_vma_page() is performed only
    upon success. If the page lock cannot be obtained, the page is left
    without PG_mlocked, which is again not a problem in the whole unevictable
    memory design.
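
    A hedged sketch of the locking rule (simplified;
    maybe_mlock_cluster_page() is an invented helper and mlock_vma_page()
    is an mm-internal function):

    #include <linux/pagemap.h>

    static void maybe_mlock_cluster_page(struct page *page,
                                         struct page *check_page)
    {
            if (page == check_page)
                    return;         /* already locked and handled by the caller */
            if (!trylock_page(page))
                    return;         /* avoid deadlock; page stays without PG_mlocked */
            mlock_vma_page(page);   /* requires the page lock */
            unlock_page(page);
    }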

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Bob Liu
    Reported-by: Sasha Levin
    Cc: Wanpeng Li
    Cc: Michel Lespinasse
    Cc: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Vlastimil Babka
     
  • commit 3a025760fc158b3726eac89ee95d7f29599e9dfa upstream.

    On NUMA systems, a node may start thrashing cache or even swap anonymous
    pages while there are still free pages on remote nodes.

    This is a result of commits 81c0a2bb515f ("mm: page_alloc: fair zone
    allocator policy") and fff4068cba48 ("mm: page_alloc: revert NUMA aspect
    of fair allocation policy").

    Before those changes, the allocator would first try all allowed zones,
    including those on remote nodes, before waking any kswapds. But now,
    the allocator fastpath doubles as the fairness pass, which in turn can
    only consider the local node to prevent remote spilling based on
    exhausted fairness batches alone. Remote nodes are only considered in
    the slowpath, after the kswapds are woken up. But if remote nodes still
    have free memory, kswapd should not be woken to rebalance the local node
    or it may thrash cache or swap prematurely.

    Fix this by adding one more unfair pass over the zonelist that is
    allowed to spill to remote nodes after the local fairness pass fails but
    before entering the slowpath and waking the kswapds.

    This also gets rid of the GFP_THISNODE exemption from the fairness
    protocol because the unfair pass is no longer tied to kswapd, which
    GFP_THISNODE is not allowed to wake up.

    However, because remote spills can be more frequent now - we prefer them
    over local kswapd reclaim - the allocation batches on remote nodes could
    underflow more heavily. When resetting the batches, use
    atomic_long_read() directly instead of zone_page_state() to calculate the
    delta as the latter filters negative counter values.

    Signed-off-by: Johannes Weiner
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Johannes Weiner
     

05 May, 2014

2 commits

  • commit 5acda9d12dcf1ad0d9a5a2a7c646de3472fa7555 upstream.

    After commit 839a8e8660b6 ("writeback: replace custom worker pool
    implementation with unbound workqueue") when device is removed while we
    are writing to it we crash in bdi_writeback_workfn() ->
    set_worker_desc() because bdi->dev is NULL.

    This can happen because even though bdi_unregister() cancels all pending
    flushing work, nothing really prevents new ones from being queued from
    balance_dirty_pages() or other places.

    Fix the problem by clearing BDI_registered bit in bdi_unregister() and
    checking it before scheduling of any flushing work.
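
    A hedged sketch of that check (bdi_may_queue_writeback() is an invented
    helper; the BDI_registered bit is assumed to live in bdi->state as in
    kernels of this era):

    #include <linux/backing-dev.h>
    #include <linux/bitops.h>

    static bool bdi_may_queue_writeback(struct backing_dev_info *bdi)
    {
            /* only queue flushing work while the bdi is still registered */
            return test_bit(BDI_registered, &bdi->state);
    }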

    Fixes: 839a8e8660b6777e7fe4e80af1a048aebe2b5977

    Reviewed-by: Tejun Heo
    Signed-off-by: Jan Kara
    Cc: Derek Basehore
    Cc: Jens Axboe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Jan Kara
     
  • commit 6ca738d60c563d5c6cf6253ee4b8e76fa77b2b9e upstream.

    bdi_wakeup_thread_delayed() used the mod_delayed_work() function to
    schedule work to writeback dirty inodes. The problem with this is that
    it can delay work that is scheduled for immediate execution, such as the
    work from sync_inodes_sb(). This can happen since mod_delayed_work()
    can now steal work from a work_queue. This fixes the problem by using
    queue_delayed_work() instead. This is a regression caused by commit
    839a8e8660b6 ("writeback: replace custom worker pool implementation with
    unbound workqueue").

    The reason that this causes a problem is that laptop-mode will change
    the delay, dirty_writeback_centisecs, to 60000 (10 minutes) by default.
    In the case that bdi_wakeup_thread_delayed() races with
    sync_inodes_sb(), sync will be stopped for 10 minutes and trigger a hung
    task. Even if dirty_writeback_centisecs is not long enough to cause a
    hung task, we still don't want to delay sync for that long.

    We fix the problem by using queue_delayed_work() when we want to
    schedule writeback sometime in future. This function doesn't change the
    timer if it is already armed.
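
    A hedged sketch contrasting the two calls (kick_writeback() is an
    invented wrapper):

    #include <linux/workqueue.h>

    static void kick_writeback(struct workqueue_struct *wq,
                               struct delayed_work *dwork,
                               unsigned long delay)
    {
            /* no-op if already pending: keeps an earlier, shorter timer,
             * unlike mod_delayed_work() which would re-arm it */
            queue_delayed_work(wq, dwork, delay);
    }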

    For the same reason, we also change bdi_writeback_workfn() to
    immediately queue the work again in the case that the work_list is not
    empty. The same problem can happen if the sync work is run on the
    rescue worker.

    [jack@suse.cz: update changelog, add comment, use bdi_wakeup_thread_delayed()]
    Signed-off-by: Derek Basehore
    Reviewed-by: Jan Kara
    Cc: Alexander Viro
    Reviewed-by: Tejun Heo
    Cc: Greg Kroah-Hartman
    Cc: "Darrick J. Wong"
    Cc: Derek Basehore
    Cc: Kees Cook
    Cc: Benson Leung
    Cc: Sonny Rao
    Cc: Luigi Semenzato
    Cc: Jens Axboe
    Cc: Dave Chinner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Signed-off-by: Jiri Slaby

    Derek Basehore