14 Dec, 2014

3 commits

  • The slab shrinkers are currently invoked from the zonelist walkers in
    kswapd, direct reclaim, and zone reclaim, all of which roughly gauge the
    eligible LRU pages and assemble a nodemask to pass to NUMA-aware
    shrinkers, which then again have to walk over the nodemask. This is
    redundant code, extra runtime work, and fairly inaccurate when it comes to
    the estimation of actually scannable LRU pages. The code duplication will
    only get worse when making the shrinkers cgroup-aware and requiring them
    to have out-of-band cgroup hierarchy walks as well.

    Instead, invoke the shrinkers from shrink_zone(), which is where all
    reclaimers end up, to avoid this duplication.

    Take the count for eligible LRU pages out of get_scan_count(), which
    considers many more factors than just the availability of swap space, like
    zone_reclaimable_pages() currently does. Accumulate the number over all
    visited lruvecs to get the per-zone value.

    Some nodes have multiple zones due to memory addressing restrictions. To
    avoid putting too much pressure on the shrinkers, only invoke them once
    for each such node, using the class zone of the allocation as the pivot
    zone.

    For now, this integrates the slab shrinking better into the reclaim logic
    and gets rid of duplicative invocations from kswapd, direct reclaim, and
    zone reclaim. It also prepares for cgroup-awareness, allowing
    memcg-capable shrinkers to be added at the lruvec level without much
    duplication of both code and runtime work.

    This changes kswapd behavior, which used to invoke the shrinkers for each
    zone, but with scan ratios gathered from the entire node, resulting in
    meaningless pressure quantities on multi-zone nodes.

    Zone reclaim behavior also changes. It used to shrink slabs until the
    same amount of pages were shrunk as were reclaimed from the LRUs. Now it
    merely invokes the shrinkers once with the zone's scan ratio, which makes
    the shrinkers go easier on caches that implement aging and would prefer
    feeding back pressure from recently used slab objects to unused LRU pages.
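
    As a rough sketch of where the invocation now lives (illustrative
    only, not the verbatim mm/vmscan.c code; helper names follow the
    kernel's existing ones):

    static void shrink_zone(struct zone *zone, struct scan_control *sc,
                            bool is_classzone)
    {
            unsigned long lru_pages = 0;

            /* shrink the zone's lruvecs, accumulating eligible LRU pages */

            if (global_reclaim(sc) && is_classzone)
                    shrink_slab(sc->gfp_mask, zone_to_nid(zone),
                                sc->nr_scanned, lru_pages);
    }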

    [vdavydov@parallels.com: assure class zone is populated]
    Signed-off-by: Johannes Weiner
    Cc: Dave Chinner
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • No-brainer conversion: collect_procs_file() only schedules a process
    for a later kill; share the lock, similarly to the anon vma variant.
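
    A hedged sketch of the resulting shape (the real walk lives in
    collect_procs_file(); task collection details omitted):

    i_mmap_lock_read(mapping);
    vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
            /* only schedules the kill, so the shared lock suffices */
    }
    i_mmap_unlock_read(mapping);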

    Signed-off-by: Davidlohr Bueso
    Acked-by: "Kirill A. Shutemov"
    Acked-by: Hugh Dickins
    Cc: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Cc: Rik van Riel
    Cc: Srikar Dronamraju
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Convert all open coded mutex_lock/unlock calls to the
    i_mmap_[lock/unlock]_write() helpers.
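
    For reference, the conversion pattern:

    /* before */
    mutex_lock(&mapping->i_mmap_mutex);
    /* ... walk or modify mapping->i_mmap ... */
    mutex_unlock(&mapping->i_mmap_mutex);

    /* after */
    i_mmap_lock_write(mapping);
    /* ... walk or modify mapping->i_mmap ... */
    i_mmap_unlock_write(mapping);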

    Signed-off-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Acked-by: "Kirill A. Shutemov"
    Acked-by: Hugh Dickins
    Cc: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Cc: Srikar Dronamraju
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

11 Dec, 2014

3 commits

  • Merge first patchbomb from Andrew Morton:
    - a few minor cifs fixes
    - dma-debug updates
    - ocfs2
    - slab
    - about half of MM
    - procfs
    - kernel/exit.c
    - panic.c tweaks
    - printk updates
    - lib/ updates
    - checkpatch updates
    - fs/binfmt updates
    - the drivers/rtc tree
    - nilfs
    - kmod fixes
    - more kernel/exit.c
    - various other misc tweaks and fixes

    * emailed patches from Andrew Morton : (190 commits)
    exit: pidns: fix/update the comments in zap_pid_ns_processes()
    exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting
    exit: exit_notify: re-use "dead" list to autoreap current
    exit: reparent: call forget_original_parent() under tasklist_lock
    exit: reparent: avoid find_new_reaper() if no children
    exit: reparent: introduce find_alive_thread()
    exit: reparent: introduce find_child_reaper()
    exit: reparent: document the ->has_child_subreaper checks
    exit: reparent: s/while_each_thread/for_each_thread/ in find_new_reaper()
    exit: reparent: fix the cross-namespace PR_SET_CHILD_SUBREAPER reparenting
    exit: reparent: fix the dead-parent PR_SET_CHILD_SUBREAPER reparenting
    exit: proc: don't try to flush /proc/tgid/task/tgid
    exit: release_task: fix the comment about group leader accounting
    exit: wait: drop tasklist_lock before psig->c* accounting
    exit: wait: don't use zombie->real_parent
    exit: wait: cleanup the ptrace_reparented() checks
    usermodehelper: kill the kmod_thread_locker logic
    usermodehelper: don't use CLONE_VFORK for ____call_usermodehelper()
    fs/hfs/catalog.c: fix comparison bug in hfs_cat_keycmp
    nilfs2: fix the nilfs_iget() vs. nilfs_new_inode() races
    ...

    Linus Torvalds
     
  • Memory hotplug and failure mechanisms have several places where pcplists
    are drained so that pages are returned to the buddy allocator and can be
    e.g. prepared for offlining. Since this is always done in the context of
    a single zone, we can now reduce the pcplists drain to that single zone.

    The change should make memory offlining due to hotremove or failure
    faster and no longer disturb unrelated pcplists.
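
    Illustratively, the offlining and recovery paths can now do (a hedged
    sketch, not the exact call sites):

    /* spill only the pcplists of the zone being offlined */
    drain_all_pages(zone);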

    Signed-off-by: Vlastimil Babka
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Yasuaki Ishimatsu
    Cc: Zhang Yanfei
    Cc: Xishi Qiu
    Cc: Vladimir Davydov
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Marek Szyprowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The functions for draining per-cpu pages back to buddy allocators
    currently always operate on all zones. There are however several cases
    where the drain is only needed in the context of a single zone, and
    spilling other pcplists is a waste of time both due to the extra
    spilling and later refilling.

    This patch introduces a new zone pointer parameter to drain_all_pages()
    and changes the dummy parameter of drain_local_pages() to also be a zone
    pointer. When NULL is passed, the functions operate on all zones as
    usual. Passing a specific zone pointer reduces the work to that single
    zone.

    All callers are updated to pass the NULL pointer in this patch.
    Conversion to single zone (where appropriate) is done in further
    patches.
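
    The resulting calling convention, in short:

    void drain_local_pages(struct zone *zone);  /* this CPU's pcplists */
    void drain_all_pages(struct zone *zone);    /* pcplists of all CPUs */

    drain_all_pages(NULL);                  /* old behaviour: every zone */
    drain_all_pages(page_zone(page));       /* only this page's zone */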

    Signed-off-by: Vlastimil Babka
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Yasuaki Ishimatsu
    Cc: Zhang Yanfei
    Cc: Xishi Qiu
    Cc: Vladimir Davydov
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Marek Szyprowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

22 Oct, 2014

1 commit

  • When an uncorrected error happens, if the poisoned page is still
    referenced by more than one user after error recovery, the recovery is
    not successful. But currently the displayed result is wrong.
    Before this patch:

    MCE 0x44e336: dirty mlocked LRU page recovery: Recovered
    MCE 0x44e336: dirty mlocked LRU page still referenced by 1 users
    mce: Memory error not recovered

    After this patch:

    MCE 0x44e336: dirty mlocked LRU page recovery: Failed
    MCE 0x44e336: dirty mlocked LRU page still referenced by 1 users
    mce: Memory error not recovered

    Signed-off-by: Chen, Gong
    Link: http://lkml.kernel.org/r/1406530260-26078-3-git-send-email-gong.chen@linux.intel.com
    Acked-by: Naoya Horiguchi
    Acked-by: Tony Luck
    Signed-off-by: Borislav Petkov

    Chen, Gong
     

07 Aug, 2014

1 commit

  • When a hwpoison page is locked it could change state due to parallel
    modifications. The original compound page can be torn down, and this 4k
    page may then become part of a differently-sized compound page or a
    standalone regular page.

    Check after the lock if the page is still the same compound page.

    We could go back, grab the new head page and try again but it should be
    quite rare, so I thought this was safest. A retry loop would be more
    difficult to test and may have more side effects.

    The hwpoison code by design only tries to handle cases that are
    reasonably common in workloads, as visible in page-flags.

    I'm not really that concerned about handling this (likely rare case),
    just not crashing on it.
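
    Roughly the shape of the added check (a hedged sketch; the exact
    message and error path may differ):

    lock_page(hpage);

    /* the page may have changed compound pages while we took the lock */
    if (compound_head(p) != hpage) {
            action_result(pfn, "different compound page after locking",
                          IGNORED);
            res = -EBUSY;
            goto out;
    }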

    Signed-off-by: Andi Kleen
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     

31 Jul, 2014

2 commits

  • hwpoison_user_mappings() could fail for various reasons, so the
    printk()s reporting those reasons should be issued at each failure
    check inside hwpoison_user_mappings().

    And currently we don't call action_result() when hwpoison_user_mappings()
    fails, which is not consistent with other exit points of memory error
    handler. So this patch fixes these messaging problems.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Chen Yucong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • A recent fix from Chen Yucong, commit 0bc1f8b0682c ("hwpoison: fix the
    handling path of the victimized page frame that belong to non-LRU")
    rejects going into the unmapping operation for hugetlbfs/thp pages, which
    results in failed error containment on such pages. This patch fixes it.

    With this patch, hwpoison functional tests in mce-test testsuite pass.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Chen Yucong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

24 Jul, 2014

1 commit

  • I triggered VM_BUG_ON() in vma_address() when I tried to migrate an
    anonymous hugepage with mbind() in kernel v3.16-rc3. This is because
    pgoff's calculation in rmap_walk_anon() fails to take compound_order()
    into account, producing an incorrect value.

    This patch introduces page_to_pgoff(), which gets the page's offset in
    units of PAGE_CACHE_SIZE.

    Kirill pointed out that the page cache tree should natively handle
    hugepages, and that in order to make hugetlbfs fit it, page->index of a
    hugetlbfs page should be in units of PAGE_CACHE_SIZE. This is beyond
    this patch, but page_to_pgoff() confines the point to be fixed to a
    single function.
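
    The helper looks roughly like this (a sketch; the hugetlbfs branch is
    the part Kirill's remark refers to):

    static inline pgoff_t page_to_pgoff(struct page *page)
    {
            if (unlikely(PageHeadHuge(page)))
                    return page->index << compound_order(page);
            else
                    return page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
    }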

    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Cc: Joonsoo Kim
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Hillf Danton
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

04 Jul, 2014

1 commit

  • Until now, the kernel has had the same policy for handling victimized
    page frames that belong to kernel-space (reserved/slab-subsystem) or
    non-LRU (unknown page state). In other words, the result of handling
    either of these victimized page frames is (IGNORED | FAILED), and the
    return value of memory_failure() is -EBUSY.

    This patch is to avoid that memory_failure() returns very soon due to
    the "true" value of (!PageLRU(p)), and it also ensures that
    action_result() can report more precise information("reserved kernel",
    "kernel slab", and "unknown page state") instead of "non LRU",
    especially for memory errors which are detected by memory-scrubbing.

    Andi said:

    : While running the mcelog test suite on 3.14 I hit the following VM_BUG_ON:
    :
    : soft_offline: 0x56d4: unknown non LRU page type 3ffff800008000
    : page:ffffea000015b400 count:3 mapcount:2097169 mapping: (null) index:0xffff8800056d7000
    : page flags: 0x3ffff800004081(locked|slab|head)
    : ------------[ cut here ]------------
    : kernel BUG at mm/rmap.c:1495!
    :
    : I think what happened is that a LRU page turned into a slab page in
    : parallel with offlining. memory_failure initially tests for this case,
    : but doesn't retest later after the page has been locked.
    :
    : ...
    :
    : I ran this patch in a loop over night with some stress plus
    : the mcelog test suite running in a loop. I cannot guarantee it hit it,
    : but it should have given it a good beating.
    :
    : The kernel survived with no messages, although the mcelog test suite
    : got killed at some point because it couldn't fork anymore. Probably
    : some unrelated problem.
    :
    : So the patch is ok for me for .16.

    Signed-off-by: Chen Yucong
    Acked-by: Naoya Horiguchi
    Reported-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Yucong
     

05 Jun, 2014

7 commits

  • Currently the memory error handler handles action optional errors in a
    deferred manner by default. And if a recovery aware application wants
    to handle them immediately, it can do so by setting the PF_MCE_EARLY
    flag. However, such signals can be sent only to the main thread, so
    it's problematic if the application wants to have a dedicated thread
    to handle such signals.

    So this patch adds dedicated thread support to the memory error handler.
    We have the PF_MCE_EARLY flag for each thread separately, so with this
    patch the AO signal is sent to the thread with the PF_MCE_EARLY flag
    set, not the main thread. If you want to implement a dedicated thread,
    you call prctl() to set PF_MCE_EARLY on that thread.

    Memory error handler collects processes to be killed, so this patch lets
    it check PF_MCE_EARLY flag on each thread in the collecting routines.

    No behavioral change for all non-early kill cases.

    Tony said:

    : The old behavior was crazy - someone with a multithreaded process might
    : well expect that if they call prctl(PF_MCE_EARLY) in just one thread, then
    : that thread would see the SIGBUS with si_code = BUS_MCEERR_A0 - even if
    : that thread wasn't the main thread for the process.
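
    From userspace, the dedicated thread opts in like this (a minimal
    sketch; error handling reduced to a perror):

    #include <stdio.h>
    #include <sys/prctl.h>

    /* run in the dedicated recovery thread: request early AO SIGBUS */
    static void enable_early_mce_kill(void)
    {
            if (prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0))
                    perror("prctl(PR_MCE_KILL)");
    }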

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Tony Luck
    Cc: Kamil Iskra
    Cc: Andi Kleen
    Cc: Borislav Petkov
    Cc: Chen Gong
    Cc: [3.2+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • When Linux sees an "action optional" machine check (where h/w has reported
    an error that is not in the current execution path) we generally do not
    want to signal a process, since most processes do not have a SIGBUS
    handler - we'd just prematurely terminate the process for a problem that
    they might never actually see.

    task_early_kill() decides whether to consider a process - and it checks
    whether this specific process has been marked for early signals with
    "prctl", or if the system administrator has requested early signals for
    all processes using /proc/sys/vm/memory_failure_early_kill.

    But for the MF_ACTION_REQUIRED case we must not defer. The error is in
    the execution path of the current thread so we must send the SIGBUS
    immediately.

    Fix by passing a flag argument through collect_procs*() to
    task_early_kill() so it knows whether we can defer or must take action.
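
    Conceptually, the check becomes (a hedged sketch, not necessarily the
    verbatim code):

    static int task_early_kill(struct task_struct *tsk, int force_early)
    {
            if (!tsk->mm)
                    return 0;
            if (force_early)
                    return 1;       /* MF_ACTION_REQUIRED: never defer */
            if (tsk->flags & PF_MCE_PROCESS)
                    return !!(tsk->flags & PF_MCE_EARLY);   /* prctl */
            return sysctl_memory_failure_early_kill;        /* sysctl */
    }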

    Signed-off-by: Tony Luck
    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Borislav Petkov
    Cc: Chen Gong
    Cc: [3.2+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tony Luck
     
  • When a thread in a multi-threaded application hits a machine check because
    of an uncorrectable error in memory - we want to send the SIGBUS with
    si.si_code = BUS_MCEERR_AR to that thread. Currently we fail to do that
    if the active thread is not the primary thread in the process.
    collect_procs() just finds primary threads and this test:

    if ((flags & MF_ACTION_REQUIRED) && t == current) {

    will see that the thread we found isn't the current thread and so send a
    si.si_code = BUS_MCEERR_AO to the primary (and nothing to the active
    thread at this time).

    We can fix this by checking whether "current" shares the same mm with the
    process that collect_procs() said owned the page. If so, we send the
    SIGBUS to current (with code BUS_MCEERR_AR).
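
    A hedged sketch of the fix in the signaling path (variable names as in
    mm/memory-failure.c; surrounding details omitted):

    if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm) {
            si.si_code = BUS_MCEERR_AR;
            ret = force_sig_info(SIGBUS, &si, current);
    } else {
            si.si_code = BUS_MCEERR_AO;
            ret = send_sig_info(SIGBUS, &si, t);
    }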

    Signed-off-by: Tony Luck
    Signed-off-by: Naoya Horiguchi
    Reported-by: Otto Bruggeman
    Cc: Andi Kleen
    Cc: Borislav Petkov
    Cc: Chen Gong
    Cc: [3.2+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tony Luck
     
  • The comment about pages under writeback is far from the relevant code, so
    let's move it to the right place.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Memory migration uses a callback defined by the caller to determine how to
    allocate destination pages. When migration fails for a source page,
    however, it frees the destination page back to the system.

    This patch adds a memory migration callback defined by the caller to
    determine how to free destination pages. If a caller, such as memory
    compaction, builds its own freelist for migration targets, this can reuse
    already freed memory instead of scanning additional memory.

    If the caller provides a function to handle freeing of destination pages,
    it is called when page migration fails. If the caller passes NULL then
    freeing back to the system will be handled as usual. This patch
    introduces no functional change.
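
    After this change the migration entry point takes the extra callback,
    roughly:

    /* new: lets the caller take back an unused migration target */
    typedef void free_page_t(struct page *page, unsigned long private);

    int migrate_pages(struct list_head *from, new_page_t get_new_page,
                      free_page_t put_new_page, unsigned long private,
                      enum migrate_mode mode, int reason);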

    Signed-off-by: David Rientjes
    Reviewed-by: Naoya Horiguchi
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Replace places where __get_cpu_var() is used for an address calculation
    with this_cpu_ptr().
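
    The pattern, as applied for example to the per-cpu data in
    mm/memory-failure.c:

    /* before */
    struct memory_failure_cpu *mf_cpu = &__get_cpu_var(memory_failure_cpu);

    /* after */
    struct memory_failure_cpu *mf_cpu = this_cpu_ptr(&memory_failure_cpu);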

    Signed-off-by: Christoph Lameter
    Cc: Tejun Heo
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • kmem_cache_{create,destroy,shrink} need to get a stable value of
    cpu/node online mask, because they init/destroy/access per-cpu/node
    kmem_cache parts, which can be allocated or destroyed on cpu/mem
    hotplug. To protect against cpu hotplug, these functions use
    {get,put}_online_cpus. However, they do nothing to synchronize with
    memory hotplug - taking the slab_mutex does not eliminate the
    possibility of race as described in patch 2.

    What we need there is something like get_online_cpus, but for memory.
    We already have lock_memory_hotplug, which serves for the purpose, but
    it's a bit of a hammer right now, because it's backed by a mutex. As a
    result, it imposes some limitations on locking order, which are not
    desirable, and it can't be used just like get_online_cpus. That's why in
    patch 1 I substitute it with get/put_online_mems, which work exactly
    like get/put_online_cpus except they block not cpu, but memory hotplug.

    [ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAKed it
    myself, because it used an rw semaphore for get/put_online_mems,
    making them deadlock prone. ]

    This patch (of 2):

    {un}lock_memory_hotplug, which is used to synchronize against memory
    hotplug, is currently backed by a mutex, which makes it a bit of a
    hammer - threads that only want to get a stable value of online nodes
    mask won't be able to proceed concurrently. Also, it imposes some
    strong locking ordering rules on it, which narrows down the set of its
    usage scenarios.

    This patch introduces get/put_online_mems, which are the same as
    get/put_online_cpus, but for memory hotplug, i.e. executing code inside
    a get/put_online_mems section guarantees a stable value of online
    nodes, present pages, etc.

    lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
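
    Usage mirrors the cpu hotplug primitives (a hedged sketch; the
    per-node body is hypothetical):

    get_online_mems();
    for_each_online_node(nid)
            init_node_parts(nid);   /* hypothetical per-node setup */
    put_online_mems();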

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tang Chen
    Cc: Zhang Yanfei
    Cc: Toshi Kani
    Cc: Xishi Qiu
    Cc: Jiang Liu
    Cc: Rafael J. Wysocki
    Cc: David Rientjes
    Cc: Wen Congyang
    Cc: Yasuaki Ishimatsu
    Cc: Lai Jiangshan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

24 May, 2014

2 commits

  • When a memory error happens on an in-use page or (free and in-use)
    hugepage, the victim page is isolated with its refcount set to one.

    When you try to unpoison it later, unpoison_memory() calls put_page()
    for it twice in order to bring the page back to the free page pool
    (buddy or free hugepage list). However, if another memory error occurs
    on the page which we are unpoisoning, memory_failure() returns without
    releasing the refcount which was incremented at the start of the same
    call, which results in a memory leak and inconsistent
    num_poisoned_pages statistics. This patch fixes it.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: [2.6.32+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • For handling a free hugepage in memory failure, a race can happen if
    another thread hwpoisons this hugepage concurrently. So we need to
    check PageHWPoison instead of !PageHWPoison.

    If hwpoison_filter(p) returns true or a race happens, then we need to
    unlock_page(hpage).
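
    In the shape of code (a hedged sketch of the corrected paths; the
    surrounding bookkeeping is omitted):

    lock_page(hpage);
    if (hwpoison_filter(p)) {
            unlock_page(hpage);     /* the filter case must drop the lock */
            put_page(hpage);
            return 0;
    }
    if (PageHWPoison(hpage)) {      /* lost the race: already poisoned */
            unlock_page(hpage);
            put_page(hpage);
            return 0;
    }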

    Signed-off-by: Chen Yucong
    Reviewed-by: Naoya Horiguchi
    Tested-by: Naoya Horiguchi
    Reviewed-by: Andi Kleen
    Cc: [2.6.36+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Yucong
     

04 Apr, 2014

1 commit

  • Pull cgroup updates from Tejun Heo:
    "A lot updates for cgroup:

    - The biggest one is cgroup's conversion to kernfs. cgroup took
    after the long abandoned vfs-entangled sysfs implementation and
    made it even more convoluted over time. cgroup's internal objects
    were fused with vfs objects which also brought in vfs locking and
    object lifetime rules. Naturally, there are places where vfs rules
    don't fit and nasty hacks, such as credential switching or lock
    dance interleaving inode mutex and cgroup_mutex with object serial
    number comparison thrown in to decide whether the operation is
    actually necessary, needed to be employed.

    After conversion to kernfs, internal object lifetime and locking
    rules are mostly isolated from vfs interactions allowing shedding
    of several nasty hacks and overall simplification. This will also
    allow implementation of operations which may affect multiple cgroups
    which weren't possible before as it would have required nesting
    i_mutexes.

    - Various simplifications including dropping of module support,
    easier cgroup name/path handling, simplified cgroup file type
    handling and task_cg_lists optimization.

    - Preparatory changes for the planned unified hierarchy, which is still
    a patchset away from being actually operational. The dummy
    hierarchy is updated to serve as the default unified hierarchy.
    Controllers which aren't claimed by other hierarchies are
    associated with it, which BTW was what the dummy hierarchy was for
    anyway.

    - Various fixes from Li and others. This pull request includes some
    patches to add missing slab.h to various subsystems. This was
    triggered by the xattr.h include removal from cgroup.h. cgroup.h
    indirectly got included into a lot of files, which brought in xattr.h,
    which brought in slab.h.

    There are several merge commits - one to pull in kernfs updates
    necessary for converting cgroup (already in upstream through
    driver-core), others for interfering changes in the fixes branch"

    * 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (74 commits)
    cgroup: remove useless argument from cgroup_exit()
    cgroup: fix spurious lockdep warning in cgroup_exit()
    cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c
    cgroup: break kernfs active_ref protection in cgroup directory operations
    cgroup: fix cgroup_taskset walking order
    cgroup: implement CFTYPE_ONLY_ON_DFL
    cgroup: make cgrp_dfl_root mountable
    cgroup: drop const from @buffer of cftype->write_string()
    cgroup: rename cgroup_dummy_root and related names
    cgroup: move ->subsys_mask from cgroupfs_root to cgroup
    cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding
    cgroup: remove NULL checks from [pr_cont_]cgroup_{name|path}()
    cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root
    cgroup: reorganize cgroup bootstrapping
    cgroup: relocate setting of CGRP_DEAD
    cpuset: use rcu_read_lock() to protect task_cs()
    cgroup_freezer: document freezer_fork() subtleties
    cgroup: update cgroup_transfer_tasks() to either succeed or fail
    cgroup: drop task_lock() protection around task->cgroups
    cgroup: update how a newly forked task gets associated with css_set
    ...

    Linus Torvalds
     

04 Mar, 2014

1 commit

  • Commit bf6bddf1924e ("mm: introduce compaction and migration for
    ballooned pages") introduces page_count(page) into memory compaction
    which dereferences page->first_page if PageTail(page).

    This results in a very rare NULL pointer dereference on the
    aforementioned page_count(page). Indeed, anything that does
    compound_head(), including page_count() is susceptible to racing with
    prep_compound_page() and seeing a NULL or dangling page->first_page
    pointer.

    This patch uses Andrea's implementation of compound_trans_head() that
    deals with such a race and makes it the default compound_head()
    implementation. This includes a read memory barrier that ensures that
    if PageTail(page) is still true, we return a head page that is neither
    NULL nor dangling. The patch then adds a store memory barrier to
    prep_compound_page() to ensure page->first_page is set.

    This is the safest way to ensure we see the head page that we are
    expecting, PageTail(page) is already in the unlikely() path and the
    memory barriers are unfortunately required.

    Hugetlbfs is the exception, we don't enforce a store memory barrier
    during init since no race is possible.
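
    The resulting helper, roughly (a sketch of the barrier placement):

    static inline struct page *compound_head(struct page *page)
    {
            if (unlikely(PageTail(page))) {
                    struct page *head = page->first_page;

                    /*
                     * Pairs with the store barrier in prep_compound_page():
                     * if the page is still a tail, head is neither NULL
                     * nor dangling.
                     */
                    smp_rmb();
                    if (likely(PageTail(page)))
                            return head;
            }
            return page;
    }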

    Signed-off-by: David Rientjes
    Cc: Holger Kiehl
    Cc: Christoph Lameter
    Cc: Rafael Aquini
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: "Kirill A. Shutemov"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

12 Feb, 2014

1 commit

  • mm/memory-failure.c::hwpoison_filter_task() has been reaching into
    cgroup to extract the associated ino to be used as a filtering
    criterion. This is an implementation detail which shouldn't be
    depended upon from outside cgroup proper and is about to change with
    the scheduled kernfs conversion.

    This patch introduces a proper interface to determine the associated
    ino, cgroup_ino(), and updates hwpoison_filter_task() to use it
    instead of reaching directly into cgroup.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Cc: Andi Kleen
    Cc: Wu Fengguang

    Tejun Heo
     

11 Feb, 2014

1 commit

  • mce-test detected a test failure when injecting error to a thp tail
    page. This is because we take page refcount of the tail page in
    madvise_hwpoison() while the fix in commit a3e0f9e47d5e
    ("mm/memory-failure.c: transfer page count from head page to tail page
    after split thp") assumes that we always take refcount on the head page.

    When a real memory error happens we take refcount on the head page where
    memory_failure() is called without MF_COUNT_INCREASED set, so it seems
    to me that testing memory error on thp tail page using madvise makes
    little sense.

    This patch cancels the refcount transfer in the !MF_COUNT_INCREASED
    case so that the testing stays valid.

    [akpm@linux-foundation.org: s/&&/&/]
    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Wanpeng Li
    Cc: Chen Gong
    Cc: [3.9+: a3e0f9e47d5e]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

24 Jan, 2014

1 commit

  • After thp split in hwpoison_user_mappings(), we hold the page lock on
    the former head page, not on the raw error page, when we run
    try_to_unmap, hence we are in danger of a race condition.

    I found in the RHEL7 MCE-relay testing that we have "bad page" error
    when a memory error happens on a thp tail page used by qemu-kvm:

    Triggering MCE exception on CPU 10
    mce: [Hardware Error]: Machine check events logged
    MCE exception done on CPU 10
    MCE 0x38c535: Killing qemu-kvm:8418 due to hardware memory corruption
    MCE 0x38c535: dirty LRU page recovery: Recovered
    qemu-kvm[8418]: segfault at 20 ip 00007ffb0f0f229a sp 00007fffd6bc5240 error 4 in qemu-kvm[7ffb0ef14000+420000]
    BUG: Bad page state in process qemu-kvm pfn:38c400
    page:ffffea000e310000 count:0 mapcount:0 mapping: (null) index:0x7ffae3c00
    page flags: 0x2fffff0008001d(locked|referenced|uptodate|dirty|swapbacked)
    Modules linked in: hwpoison_inject mce_inject vhost_net macvtap macvlan ...
    CPU: 0 PID: 8418 Comm: qemu-kvm Tainted: G M -------------- 3.10.0-54.0.1.el7.mce_test_fixed.x86_64 #1
    Hardware name: NEC NEC Express5800/R120b-1 [N8100-1719F]/MS-91E7-001, BIOS 4.6.3C19 02/10/2011
    Call Trace:
    dump_stack+0x19/0x1b
    bad_page.part.59+0xcf/0xe8
    free_pages_prepare+0x148/0x160
    free_hot_cold_page+0x31/0x140
    free_hot_cold_page_list+0x46/0xa0
    release_pages+0x1c1/0x200
    free_pages_and_swap_cache+0xad/0xd0
    tlb_flush_mmu.part.46+0x4c/0x90
    tlb_finish_mmu+0x55/0x60
    exit_mmap+0xcb/0x170
    mmput+0x67/0xf0
    vhost_dev_cleanup+0x231/0x260 [vhost_net]
    vhost_net_release+0x3f/0x90 [vhost_net]
    __fput+0xe9/0x270
    ____fput+0xe/0x10
    task_work_run+0xc4/0xe0
    do_exit+0x2bb/0xa40
    do_group_exit+0x3f/0xa0
    get_signal_to_deliver+0x1d0/0x6e0
    do_signal+0x48/0x5e0
    do_notify_resume+0x71/0xc0
    retint_signal+0x48/0x8c

    The reason for this bug is that a page fault happens before unlocking
    the head page at the end of memory_failure(). This strange page fault
    is trying to access address 0x20 and I'm not sure why qemu-kvm does
    this, but anyway as a result the SIGSEGV makes qemu-kvm exit and on the
    way we catch the bad page bug/warning because we try to free a locked
    page (which was the former head page.)

    To fix this, this patch shifts the page lock from the head page to the
    tail page just after the thp split. SIGSEGV still happens, but it
    affects only the error-affected VMs, not the whole system.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Wanpeng Li
    Cc: [3.9+] # a3e0f9e47d5ef "mm/memory-failure.c: transfer page count from head page to tail page after split thp"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

22 Jan, 2014

2 commits

  • Parts of putback_lru_pages() and putback_movable_pages() are
    duplicated, which makes it confusing which one we should use. We can
    remove putback_lru_pages() since it is not really needed now. This
    makes the code easier to understand and maintain.

    The comment on putback_movable_pages() is also stale now, so fix it.

    Signed-off-by: Joonsoo Kim
    Reviewed-by: Wanpeng Li
    Cc: Christoph Lameter
    Cc: Naoya Horiguchi
    Cc: Rafael Aquini
    Cc: Vlastimil Babka
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • [akpm@linux-foundation.org: s/cache/pagecache/]
    Signed-off-by: Zhi Yong Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhi Yong Wu
     

03 Jan, 2014

1 commit

  • Memory failures on thp tail pages cause kernel panic like below:

    mce: [Hardware Error]: Machine check events logged
    MCE exception done on CPU 7
    BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
    IP: [] dequeue_hwpoisoned_huge_page+0x131/0x1e0
    PGD bae42067 PUD ba47d067 PMD 0
    Oops: 0000 [#1] SMP
    ...
    CPU: 7 PID: 128 Comm: kworker/7:2 Tainted: G M O 3.13.0-rc4-131217-1558-00003-g83b7df08e462 #25
    ...
    Call Trace:
    me_huge_page+0x3e/0x50
    memory_failure+0x4bb/0xc20
    mce_process_work+0x3e/0x70
    process_one_work+0x171/0x420
    worker_thread+0x11b/0x3a0
    ? manage_workers.isra.25+0x2b0/0x2b0
    kthread+0xe4/0x100
    ? kthread_create_on_node+0x190/0x190
    ret_from_fork+0x7c/0xb0
    ? kthread_create_on_node+0x190/0x190
    ...
    RIP dequeue_hwpoisoned_huge_page+0x131/0x1e0
    CR2: 0000000000000058

    The chain of events leading to this problem is as follows:
    - when we have a memory error on a thp tail page, the memory error
    handler grabs a refcount of the head page to keep the thp under us.
    - Before unmapping the error page from processes, we split the thp,
    where page refcounts of both of head/tail pages don't change.
    - Then we call try_to_unmap() over the error page (which was a tail
    page before). Since we didn't pin the error page to handle the memory
    error, this error page is freed and removed from the LRU list.
    - We never have the error page on the LRU list, so the first page state
    check returns "unknown page," then we move to the second check
    with the saved page flag.
    - The saved page flag has PG_tail set, so the second page state check
    returns "hugepage."
    - We call me_huge_page() for the freed error page, then we hit the
    above panic.

    The root cause is that we didn't move the refcount from the head page
    to the tail page after splitting the thp. So this patch does that.

    This panic was introduced by commit 524fca1e73 ("HWPOISON: fix
    misjudgement of page_action() for errors on mlocked pages"). Note that
    we did have the same refcount problem before this commit, but it was
    just ignored because we had only the first page state check, which
    returned "unknown page." The commit changed the refcount problem from
    "doesn't work" to "kernel panic."

    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Wanpeng Li
    Cc: Andi Kleen
    Cc: [3.9+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

19 Dec, 2013

1 commit

  • After a successful hugetlb page migration by soft offline, the source
    page will be freed either into the hugepage_freelists or into buddy
    (for an over-committed page). If the page is in buddy, page_hstate(page)
    will be NULL. It will hit a NULL pointer dereference in
    dequeue_hwpoisoned_huge_page().

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
    IP: [] dequeue_hwpoisoned_huge_page+0x131/0x1d0
    PGD c23762067 PUD c24be2067 PMD 0
    Oops: 0000 [#1] SMP

    So check PageHuge(page) after migrate_pages() completes successfully.

    Signed-off-by: Jianguo Wu
    Tested-by: Naoya Horiguchi
    Reviewed-by: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianguo Wu
     

15 Nov, 2013

1 commit

  • This patch enhances the type safety for the kfifo API. It is now safe
    to put const data into a non const FIFO and the API will now generate a
    compiler warning when reading from the fifo where the destination
    address is pointing to a const variable.

    As a side effect, kfifo_put() now expects the value of an element
    instead of a pointer to the element. This was suggested by Russell
    King. It makes the handling of kfifo_put() easier since there is no
    need to create a helper variable for taking the address of an element
    or to pass integers of different sizes.

    IMHO the API break is okay, since there are currently only six users of
    kfifo_put().

    The code is also cleaner by kicking out the "if (0)" expressions.
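
    A small sketch of the new calling convention (assuming a fixed-size
    int fifo):

    #include <linux/kfifo.h>
    #include <linux/printk.h>

    static DEFINE_KFIFO(fifo, int, 32);

    static void kfifo_demo(void)
    {
            int v;

            kfifo_put(&fifo, 42);           /* was: kfifo_put(&fifo, &val) */
            if (kfifo_get(&fifo, &v))
                    pr_info("got %d\n", v);
    }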

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Stefani Seibold
    Cc: Russell King
    Cc: Hauke Mehrtens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stefani Seibold
     

13 Nov, 2013

1 commit

  • Chen Gong pointed out that set/unset_migratetype_isolate() was done in
    different functions in mm/memory-failure.c, which makes the code less
    readable/maintainable. So this patch does it in soft_offline_page().

    With this patch, we get to hold lock_memory_hotplug() longer but it's
    not a problem because races between memory hotplug and soft offline are
    very rare.

    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Chen, Gong
    Acked-by: Andi Kleen
    Reviewed-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

01 Oct, 2013

2 commits

  • If the page is poisoned by software injection with the
    MF_COUNT_INCREASED flag, a false report is generated during the 2nd
    attempt at page recovery.

    This patch fixes it by reporting the first attempt at free buddy page
    recovery if MF_COUNT_INCREASED is set.

    Before patch:

    [ 346.332041] Injecting memory failure at pfn 200010
    [ 346.332189] MCE 0x200010: free buddy, 2nd try page recovery: Delayed

    After patch:

    [ 297.742600] Injecting memory failure at pfn 200010
    [ 297.742941] MCE 0x200010: free buddy page recovery: Delayed

    Reviewed-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Signed-off-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • PageTransHuge() can't guarantee the page is a transparent huge page
    since it returns true for both transparent huge and hugetlbfs pages.

    This patch fixes it by also checking that the page is not a hugetlbfs
    page.

    Before patch:

    [ 121.571128] Injecting memory failure at pfn 23a200
    [ 121.571141] MCE 0x23a200: huge page recovery: Delayed
    [ 140.355100] MCE: Memory failure is now running on 0x23a200

    After patch:

    [ 94.290793] Injecting memory failure at pfn 23a000
    [ 94.290800] MCE 0x23a000: huge page recovery: Delayed
    [ 105.722303] MCE: Software-unpoisoned page 0x23a000
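
    The distinguishing check, as a hedged sketch:

    /*
     * PageTransHuge() is true for any compound head, including a
     * hugetlbfs head, so rule out hugetlbfs explicitly.
     */
    if (!PageHuge(p) && PageTransHuge(hpage)) {
            /* only now is it safe to treat hpage as a transparent hugepage */
    }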

    Signed-off-by: Wanpeng Li
    Reviewed-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     

13 Sep, 2013

1 commit

  • Pull vfs pile 4 from Al Viro:
    "list_lru pile, mostly"

    This came out of Andrew's pile, Al ended up doing the merge work so that
    Andrew didn't have to.

    Additionally, a few fixes.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (42 commits)
    super: fix for destroy lrus
    list_lru: dynamically adjust node arrays
    shrinker: Kill old ->shrink API.
    shrinker: convert remaining shrinkers to count/scan API
    staging/lustre/libcfs: cleanup linux-mem.h
    staging/lustre/ptlrpc: convert to new shrinker API
    staging/lustre/obdclass: convert lu_object shrinker to count/scan API
    staging/lustre/ldlm: convert to shrinkers to count/scan API
    hugepage: convert huge zero page shrinker to new shrinker API
    i915: bail out earlier when shrinker cannot acquire mutex
    drivers: convert shrinkers to new count/scan API
    fs: convert fs shrinkers to new scan/count API
    xfs: fix dquot isolation hang
    xfs-convert-dquot-cache-lru-to-list_lru-fix
    xfs: convert dquot cache lru to list_lru
    xfs: rework buffer dispose list tracking
    xfs-convert-buftarg-lru-to-generic-code-fix
    xfs: convert buftarg LRU to generic code
    fs: convert inode and dentry shrinking to be node aware
    vmscan: per-node deferred work
    ...

    Linus Torvalds
     

12 Sep, 2013

4 commits

  • Injecting memory failure for page 0x19d0 at 0xb77d2000
    MCE 0x19d0: non LRU page recovery: Ignored
    MCE: Software-unpoisoned page 0x19d0
    BUG: Bad page state in process bash pfn:019d0
    page:f3461a00 count:0 mapcount:0 mapping: (null) index:0x0
    page flags: 0x40000404(referenced|reserved)
    Modules linked in: nfsd auth_rpcgss i915 nfs_acl nfs lockd video drm_kms_helper drm bnep rfcomm sunrpc bluetooth psmouse parport_pc ppdev lp serio_raw fscache parport gpio_ich lpc_ich mac_hid i2c_algo_bit tpm_tis wmi usb_storage hid_generic usbhid hid e1000e firewire_ohci firewire_core ahci ptp libahci pps_core crc_itu_t
    CPU: 3 PID: 2123 Comm: bash Not tainted 3.11.0-rc6+ #12
    Hardware name: LENOVO 7034DD7/ , BIOS 9HKT47AUS 01//2012
    00000000 00000000 e9625ea0 c15ec49b f3461a00 e9625eb8 c15ea119 c17cbf18
    ef084314 000019d0 f3461a00 e9625ed8 c110dc8a f3461a00 00000001 00000000
    f3461a00 40000404 00000000 e9625ef8 c110dcc1 f3461a00 f3461a00 000019d0
    Call Trace:
    dump_stack+0x41/0x52
    bad_page+0xcf/0xeb
    free_pages_prepare+0x12a/0x140
    free_hot_cold_page+0x21/0x110
    __put_single_page+0x21/0x30
    put_page+0x25/0x40
    unpoison_memory+0x107/0x200
    hwpoison_unpoison+0x20/0x30
    simple_attr_write+0xb6/0xd0
    vfs_write+0xa0/0x1b0
    SyS_write+0x4f/0x90
    sysenter_do_call+0x12/0x22
    Disabling lock debugging due to kernel taint

    Testcase:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/types.h>
    #include <sys/mman.h>

    #define PAGES_TO_TEST 1
    #define PAGE_SIZE 4096

    int main(void)
    {
            char *mem;

            mem = mmap(NULL, PAGES_TO_TEST * PAGE_SIZE,
                       PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);

            if (madvise(mem, PAGES_TO_TEST * PAGE_SIZE, MADV_HWPOISON) == -1)
                    return -1;

            munmap(mem, PAGES_TO_TEST * PAGE_SIZE);

            return 0;
    }

    There is one page reference count for the default empty zero page;
    madvise_hwpoison adds another one via get_user_pages_fast.
    memory_hwpoison drops one page reference count since it's a non-LRU
    page. unpoison_memory then releases the last page reference count and
    frees the empty zero page to the buddy system, which is not correct
    since the empty zero page has the PG_reserved flag. This patch fixes
    it by not reducing the page reference count below 1 for the empty zero
    page.

    Signed-off-by: Wanpeng Li
    Reviewed-by: Naoya Horiguchi
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • Drop the forward declaration of __soft_offline_page().

    Signed-off-by: Wanpeng Li
    Reviewed-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • Setting the pageblock migration type takes zone->lock, which is heavily
    contended, to avoid races. However, soft offline will set the pageblock
    migration type twice during get page if the page is in use, is not a
    hugetlbfs page, and is not on the lru list. It is unnecessary to set
    the pageblock migration type and take the heavily contended zone->lock
    again if the first round of get page has already set the pageblock to
    the right migration type.

    The trick here is that the migration type is MIGRATE_ISOLATE. There are
    only two other places that can change MIGRATE_ISOLATE besides hwpoison.
    One is memory hotplug; however, we hold lock_memory_hotplug(), which
    avoids the race. The second is CMA, which unmovable page allocation
    requests can't fall back to. So it's safe here.

    Signed-off-by: Wanpeng Li
    Cc: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • Replace atomic_long_sub() with atomic_long_dec() since the page is a
    normal page instead of a hugetlbfs page or thp.
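
    Concretely (the num_poisoned_pages counter as in mm/memory-failure.c):

    /* before */
    atomic_long_sub(nr_pages, &num_poisoned_pages);

    /* after: a normal page contributes exactly one */
    atomic_long_dec(&num_poisoned_pages);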

    Signed-off-by: Wanpeng Li
    Reviewed-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li