04 Apr, 2014

1 commit

  • Pull cgroup updates from Tejun Heo:
    "A lot updates for cgroup:

    - The biggest one is cgroup's conversion to kernfs. cgroup took
    after the long abandoned vfs-entangled sysfs implementation and
    made it even more convoluted over time. cgroup's internal objects
    were fused with vfs objects which also brought in vfs locking and
    object lifetime rules. Naturally, there are places where vfs rules
    don't fit and nasty hacks, such as credential switching or lock
    dance interleaving inode mutex and cgroup_mutex with object serial
    number comparison thrown in to decide whether the operation is
    actually necessary, needed to be employed.

    After conversion to kernfs, internal object lifetime and locking
    rules are mostly isolated from vfs interactions allowing shedding
    of several nasty hacks and overall simplification. This will also
    allow implementation of operations which may affect multiple cgroups
    which weren't possible before as it would have required nesting
    i_mutexes.

    - Various simplifications including dropping of module support,
    easier cgroup name/path handling, simplified cgroup file type
    handling and task_cg_lists optimization.

    - Preparatory changes for the planned unified hierarchy, which is still
    a patchset away from being actually operational. The dummy
    hierarchy is updated to serve as the default unified hierarchy.
    Controllers which aren't claimed by other hierarchies are
    associated with it, which BTW was what the dummy hierarchy was for
    anyway.

    - Various fixes from Li and others. This pull request includes some
    patches to add the missing slab.h include to various subsystems. This
    was triggered by the removal of the xattr.h include from cgroup.h:
    cgroup.h indirectly got included into a lot of files, which brought in
    xattr.h, which in turn brought in slab.h.

    There are several merge commits - one to pull in kernfs updates
    necessary for converting cgroup (already in upstream through
    driver-core), others for interfering changes in the fixes branch"

    * 'for-3.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (74 commits)
    cgroup: remove useless argument from cgroup_exit()
    cgroup: fix spurious lockdep warning in cgroup_exit()
    cgroup: Use RCU_INIT_POINTER(x, NULL) in cgroup.c
    cgroup: break kernfs active_ref protection in cgroup directory operations
    cgroup: fix cgroup_taskset walking order
    cgroup: implement CFTYPE_ONLY_ON_DFL
    cgroup: make cgrp_dfl_root mountable
    cgroup: drop const from @buffer of cftype->write_string()
    cgroup: rename cgroup_dummy_root and related names
    cgroup: move ->subsys_mask from cgroupfs_root to cgroup
    cgroup: treat cgroup_dummy_root as an equivalent hierarchy during rebinding
    cgroup: remove NULL checks from [pr_cont_]cgroup_{name|path}()
    cgroup: use cgroup_setup_root() to initialize cgroup_dummy_root
    cgroup: reorganize cgroup bootstrapping
    cgroup: relocate setting of CGRP_DEAD
    cpuset: use rcu_read_lock() to protect task_cs()
    cgroup_freezer: document freezer_fork() subtleties
    cgroup: update cgroup_transfer_tasks() to either succeed or fail
    cgroup: drop task_lock() protection around task->cgroups
    cgroup: update how a newly forked task gets associated with css_set
    ...

    Linus Torvalds
     

04 Mar, 2014

1 commit

  • Commit bf6bddf1924e ("mm: introduce compaction and migration for
    ballooned pages") introduces page_count(page) into memory compaction
    which dereferences page->first_page if PageTail(page).

    This results in a very rare NULL pointer dereference on the
    aforementioned page_count(page). Indeed, anything that does
    compound_head(), including page_count(), is susceptible to racing with
    prep_compound_page() and seeing a NULL or dangling page->first_page
    pointer.

    This patch uses Andrea's implementation of compound_trans_head() that
    deals with such a race and makes it the default compound_head()
    implementation. This includes a read memory barrier that ensures that
    if PageTail(head) is true, we return a head page that is neither
    NULL nor dangling. The patch then adds a store memory barrier to
    prep_compound_page() to ensure page->first_page is set.

    This is the safest way to ensure we see the head page that we are
    expecting; PageTail(page) is already in the unlikely() path and the
    memory barriers are unfortunately required.

    Hugetlbfs is the exception: we don't enforce a store memory barrier
    during init since no race is possible.
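
    A minimal sketch of the barrier pairing described above, modeled on
    the compound_trans_head() of that era (details are an approximation,
    not necessarily the exact merged code):

    static inline struct page *compound_head(struct page *page)
    {
            if (unlikely(PageTail(page))) {
                    struct page *head = page->first_page;

                    /*
                     * Pairs with the smp_wmb() in prep_compound_page():
                     * if PageTail is still set after first_page was read,
                     * the head pointer is neither NULL nor dangling.
                     */
                    smp_rmb();
                    if (likely(PageTail(page)))
                            return head;
            }
            return page;
    }

    /* writer side, in prep_compound_page() */
    p->first_page = page;
    smp_wmb();              /* publish first_page before PG_tail */
    __SetPageTail(p);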

    Signed-off-by: David Rientjes
    Cc: Holger Kiehl
    Cc: Christoph Lameter
    Cc: Rafael Aquini
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: "Kirill A. Shutemov"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

12 Feb, 2014

1 commit

  • mm/memory-failure.c::hwpoison_filter_task() has been reaching into
    cgroup to extract the associated ino to be used as a filtering
    criterion. This is an implementation detail which shouldn't be
    depended upon from outside cgroup proper and is about to change with
    the scheduled kernfs conversion.

    This patch introduces a proper interface to determine the associated
    ino, cgroup_ino(), and updates hwpoison_filter_task() to use it
    instead of reaching directly into cgroup.
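
    A rough sketch of the pre-kernfs interface this introduces (the
    dentry-based body is an assumption; after the kernfs conversion the
    ino would come from the kernfs node instead):

    static inline ino_t cgroup_ino(struct cgroup *cgrp)
    {
            if (cgrp->dentry)
                    return cgrp->dentry->d_inode->i_ino;
            else
                    return 0;
    }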

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Cc: Andi Kleen
    Cc: Wu Fengguang

    Tejun Heo
     

11 Feb, 2014

1 commit

  • mce-test detected a test failure when injecting an error into a thp
    tail page. This is because we take the page refcount of the tail page
    in madvise_hwpoison(), while the fix in commit a3e0f9e47d5e
    ("mm/memory-failure.c: transfer page count from head page to tail page
    after split thp") assumes that we always take the refcount on the head
    page.

    When a real memory error happens, we take the refcount on the head page
    when memory_failure() is called without MF_COUNT_INCREASED set, so it
    seems to me that testing a memory error on a thp tail page using
    madvise makes little sense.

    This patch cancels moving the refcount in the !MF_COUNT_INCREASED case
    so that testing remains valid.

    [akpm@linux-foundation.org: s/&&/&/]
    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Wanpeng Li
    Cc: Chen Gong
    Cc: [3.9+: a3e0f9e47d5e]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

24 Jan, 2014

1 commit

  • After a thp split in hwpoison_user_mappings(), we keep the page lock
    on the former head page rather than on the raw error page, hence we
    are in danger of a race condition.

    I found in the RHEL7 MCE-relay testing that we have "bad page" error
    when a memory error happens on a thp tail page used by qemu-kvm:

    Triggering MCE exception on CPU 10
    mce: [Hardware Error]: Machine check events logged
    MCE exception done on CPU 10
    MCE 0x38c535: Killing qemu-kvm:8418 due to hardware memory corruption
    MCE 0x38c535: dirty LRU page recovery: Recovered
    qemu-kvm[8418]: segfault at 20 ip 00007ffb0f0f229a sp 00007fffd6bc5240 error 4 in qemu-kvm[7ffb0ef14000+420000]
    BUG: Bad page state in process qemu-kvm pfn:38c400
    page:ffffea000e310000 count:0 mapcount:0 mapping: (null) index:0x7ffae3c00
    page flags: 0x2fffff0008001d(locked|referenced|uptodate|dirty|swapbacked)
    Modules linked in: hwpoison_inject mce_inject vhost_net macvtap macvlan ...
    CPU: 0 PID: 8418 Comm: qemu-kvm Tainted: G M -------------- 3.10.0-54.0.1.el7.mce_test_fixed.x86_64 #1
    Hardware name: NEC NEC Express5800/R120b-1 [N8100-1719F]/MS-91E7-001, BIOS 4.6.3C19 02/10/2011
    Call Trace:
    dump_stack+0x19/0x1b
    bad_page.part.59+0xcf/0xe8
    free_pages_prepare+0x148/0x160
    free_hot_cold_page+0x31/0x140
    free_hot_cold_page_list+0x46/0xa0
    release_pages+0x1c1/0x200
    free_pages_and_swap_cache+0xad/0xd0
    tlb_flush_mmu.part.46+0x4c/0x90
    tlb_finish_mmu+0x55/0x60
    exit_mmap+0xcb/0x170
    mmput+0x67/0xf0
    vhost_dev_cleanup+0x231/0x260 [vhost_net]
    vhost_net_release+0x3f/0x90 [vhost_net]
    __fput+0xe9/0x270
    ____fput+0xe/0x10
    task_work_run+0xc4/0xe0
    do_exit+0x2bb/0xa40
    do_group_exit+0x3f/0xa0
    get_signal_to_deliver+0x1d0/0x6e0
    do_signal+0x48/0x5e0
    do_notify_resume+0x71/0xc0
    retint_signal+0x48/0x8c

    The reason for this bug is that a page fault happens before unlocking
    the head page at the end of memory_failure(). This strange page fault
    is trying to access address 0x20 and I'm not sure why qemu-kvm does
    this, but anyway as a result the SIGSEGV makes qemu-kvm exit, and on
    the way we catch the bad page bug/warning because we try to free a
    locked page (which was the former head page).

    To fix this, this patch shifts the page lock from the head page to the
    tail page just after the thp split. SIGSEGV still happens, but it
    affects only the error-affected VMs, not the whole system.
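
    A minimal sketch of the lock/refcount shift in hwpoison_user_mappings()
    after split_huge_page(); the out-parameter and variable names are
    assumptions based on the surrounding description:

    /* hpage was the locked, pinned thp head; p is the raw error page */
    if (hpage != p) {
            if (!(flags & MF_COUNT_INCREASED)) {
                    put_page(hpage);        /* drop the pin on the old head */
                    get_page(p);            /* pin the raw error page */
            }
            lock_page(p);                   /* shift the lock to the tail */
            unlock_page(hpage);
            *hpagep = p;                    /* report the newly locked page */
    }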

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Wanpeng Li
    Cc: [3.9+] # a3e0f9e47d5ef "mm/memory-failure.c: transfer page count from head page to tail page after split thp"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

22 Jan, 2014

2 commits

  • Parts of putback_lru_pages() and putback_movable_pages() are
    duplicated, so it can be confusing which one should be used. We can
    remove putback_lru_pages() since it is not really needed now. This
    makes the code easier to understand and maintain.

    Also, the comment on putback_movable_pages() is stale now, so fix it.

    Signed-off-by: Joonsoo Kim
    Reviewed-by: Wanpeng Li
    Cc: Christoph Lameter
    Cc: Naoya Horiguchi
    Cc: Rafael Aquini
    Cc: Vlastimil Babka
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • [akpm@linux-foundation.org: s/cache/pagecache/]
    Signed-off-by: Zhi Yong Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhi Yong Wu
     

03 Jan, 2014

1 commit

  • Memory failures on thp tail pages cause kernel panic like below:

    mce: [Hardware Error]: Machine check events logged
    MCE exception done on CPU 7
    BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
    IP: [] dequeue_hwpoisoned_huge_page+0x131/0x1e0
    PGD bae42067 PUD ba47d067 PMD 0
    Oops: 0000 [#1] SMP
    ...
    CPU: 7 PID: 128 Comm: kworker/7:2 Tainted: G M O 3.13.0-rc4-131217-1558-00003-g83b7df08e462 #25
    ...
    Call Trace:
    me_huge_page+0x3e/0x50
    memory_failure+0x4bb/0xc20
    mce_process_work+0x3e/0x70
    process_one_work+0x171/0x420
    worker_thread+0x11b/0x3a0
    ? manage_workers.isra.25+0x2b0/0x2b0
    kthread+0xe4/0x100
    ? kthread_create_on_node+0x190/0x190
    ret_from_fork+0x7c/0xb0
    ? kthread_create_on_node+0x190/0x190
    ...
    RIP dequeue_hwpoisoned_huge_page+0x131/0x1e0
    CR2: 0000000000000058

    The sequence of events leading to this problem is shown below:
    - When we have a memory error on a thp tail page, the memory error
    handler grabs a refcount on the head page to keep the thp under us.
    - Before unmapping the error page from processes, we split the thp,
    where the page refcounts of both head and tail pages don't change.
    - Then we call try_to_unmap() over the error page (which was a tail
    page before). Since we didn't pin the error page to handle the memory
    error, this error page is freed and removed from the LRU list.
    - We never have the error page on the LRU list, so the first page state
    check returns "unknown page," and we move to the second check
    with the saved page flags.
    - The saved page flags have PG_tail set, so the second page state check
    returns "hugepage."
    - We call me_huge_page() for the freed error page, and hit the above panic.

    The root cause is that we didn't move the refcount from the head page
    to the tail page after splitting the thp, so this patch does exactly that.
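
    A minimal sketch of the refcount transfer added after split_huge_page()
    in hwpoison_user_mappings() (names assumed from the description):

    if (hpage != p) {
            put_page(hpage);        /* drop the ref taken on the old head */
            get_page(p);            /* pin the raw error page instead */
    }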

    This panic was introduced by commit 524fca1e73 ("HWPOISON: fix
    misjudgement of page_action() for errors on mlocked pages"). Note that we
    did have the same refcount problem before this commit, but it was just
    ignored because we had only the first page state check, which returned
    "unknown page." The commit changed the refcount problem from "doesn't work" to
    "kernel panic."

    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Wanpeng Li
    Cc: Andi Kleen
    Cc: [3.9+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

19 Dec, 2013

1 commit

  • After a successful hugetlb page migration by soft offline, the source
    page will either be freed into the hugepage_freelists or into buddy
    (for an over-commit page). If the page is in buddy, page_hstate(page)
    will be NULL, and we will hit a NULL pointer dereference in
    dequeue_hwpoisoned_huge_page().

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
    IP: [] dequeue_hwpoisoned_huge_page+0x131/0x1d0
    PGD c23762067 PUD c24be2067 PMD 0
    Oops: 0000 [#1] SMP

    So check PageHuge(page) after migrate_pages() succeeds.
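
    A minimal sketch of the fix in soft_offline_huge_page() after a
    successful migration (helper names as used by memory-failure.c of that
    era; treat the exact shape as an assumption):

    if (PageHuge(page)) {
            /* still a hugetlbfs page: pull it off the free list */
            set_page_hwpoison_huge_page(hpage);
            dequeue_hwpoisoned_huge_page(hpage);
            atomic_long_add(1 << compound_order(hpage),
                            &num_poisoned_pages);
    } else {
            /* freed to buddy (over-commit): poison just this page */
            SetPageHWPoison(page);
            atomic_long_inc(&num_poisoned_pages);
    }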

    Signed-off-by: Jianguo Wu
    Tested-by: Naoya Horiguchi
    Reviewed-by: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianguo Wu
     

15 Nov, 2013

1 commit

  • This patch enhances the type safety for the kfifo API. It is now safe
    to put const data into a non const FIFO and the API will now generate a
    compiler warning when reading from the fifo where the destination
    address is pointing to a const variable.

    As a side effect, kfifo_put() now expects the value of an element
    instead of a pointer to the element. This was suggested by Russell
    King. It makes the handling of kfifo_put() easier since there is no
    need to create a helper variable for taking the address of an element
    or to pass integers of different sizes.

    IMHO the API break is okay, since there are currently only six users of
    kfifo_put().

    The code is also cleaner after kicking out the "if (0)" expressions.
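
    For illustration, the API change at a call site (a hedged usage
    sketch, not taken from a real driver):

    DECLARE_KFIFO(fifo, unsigned char, 16);

    INIT_KFIFO(fifo);

    /* old API: a helper variable was needed to pass an address */
    /* unsigned char c = 0x42; kfifo_put(&fifo, &c); */

    /* new API: pass the element's value directly */
    kfifo_put(&fifo, 0x42);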

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Stefani Seibold
    Cc: Russell King
    Cc: Hauke Mehrtens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stefani Seibold
     

13 Nov, 2013

1 commit

  • Chen Gong pointed out that set/unset_migratetype_isolate() were done
    in different functions in mm/memory-failure.c, which makes the code
    less readable/maintainable. So this patch does both in
    soft_offline_page().

    With this patch, we get to hold lock_memory_hotplug() longer but it's
    not a problem because races between memory hotplug and soft offline are
    very rare.

    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Chen, Gong
    Acked-by: Andi Kleen
    Reviewed-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

01 Oct, 2013

2 commits

  • If the page is poisoned by software injection with the
    MF_COUNT_INCREASED flag, a false "2nd try" report is printed during
    page recovery.

    This patch fixes it by reporting a first-attempt free buddy page
    recovery if MF_COUNT_INCREASED is set.

    Before patch:

    [ 346.332041] Injecting memory failure at pfn 200010
    [ 346.332189] MCE 0x200010: free buddy, 2nd try page recovery: Delayed

    After patch:

    [ 297.742600] Injecting memory failure at pfn 200010
    [ 297.742941] MCE 0x200010: free buddy page recovery: Delayed

    Reviewed-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Signed-off-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • PageTransHuge() can't guarantee the page is a transparent huge page
    since it returns true for both transparent huge and hugetlbfs pages.

    This patch fixes it by also checking that the page is not a hugetlbfs page.
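
    A minimal sketch of the tightened check (placement in unpoison_memory()
    is inferred from the log messages below; treat it as an assumption):

    /* PageTransHuge() alone also matches hugetlbfs pages, so guard it */
    if (!PageHuge(page) && PageTransHuge(page)) {
            pr_info("MCE: Memory failure is now running on %#lx\n", pfn);
            return;
    }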

    Before patch:

    [ 121.571128] Injecting memory failure at pfn 23a200
    [ 121.571141] MCE 0x23a200: huge page recovery: Delayed
    [ 140.355100] MCE: Memory failure is now running on 0x23a200

    After patch:

    [ 94.290793] Injecting memory failure at pfn 23a000
    [ 94.290800] MCE 0x23a000: huge page recovery: Delayed
    [ 105.722303] MCE: Software-unpoisoned page 0x23a000

    Signed-off-by: Wanpeng Li
    Reviewed-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     

13 Sep, 2013

1 commit

  • Pull vfs pile 4 from Al Viro:
    "list_lru pile, mostly"

    This came out of Andrew's pile; Al ended up doing the merge work so
    that Andrew didn't have to.

    Additionally, a few fixes.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (42 commits)
    super: fix for destroy lrus
    list_lru: dynamically adjust node arrays
    shrinker: Kill old ->shrink API.
    shrinker: convert remaining shrinkers to count/scan API
    staging/lustre/libcfs: cleanup linux-mem.h
    staging/lustre/ptlrpc: convert to new shrinker API
    staging/lustre/obdclass: convert lu_object shrinker to count/scan API
    staging/lustre/ldlm: convert to shrinkers to count/scan API
    hugepage: convert huge zero page shrinker to new shrinker API
    i915: bail out earlier when shrinker cannot acquire mutex
    drivers: convert shrinkers to new count/scan API
    fs: convert fs shrinkers to new scan/count API
    xfs: fix dquot isolation hang
    xfs-convert-dquot-cache-lru-to-list_lru-fix
    xfs: convert dquot cache lru to list_lru
    xfs: rework buffer dispose list tracking
    xfs-convert-buftarg-lru-to-generic-code-fix
    xfs: convert buftarg LRU to generic code
    fs: convert inode and dentry shrinking to be node aware
    vmscan: per-node deferred work
    ...

    Linus Torvalds
     

12 Sep, 2013

9 commits

  • Injecting memory failure for page 0x19d0 at 0xb77d2000
    MCE 0x19d0: non LRU page recovery: Ignored
    MCE: Software-unpoisoned page 0x19d0
    BUG: Bad page state in process bash pfn:019d0
    page:f3461a00 count:0 mapcount:0 mapping: (null) index:0x0
    page flags: 0x40000404(referenced|reserved)
    Modules linked in: nfsd auth_rpcgss i915 nfs_acl nfs lockd video drm_kms_helper drm bnep rfcomm sunrpc bluetooth psmouse parport_pc ppdev lp serio_raw fscache parport gpio_ich lpc_ich mac_hid i2c_algo_bit tpm_tis wmi usb_storage hid_generic usbhid hid e1000e firewire_ohci firewire_core ahci ptp libahci pps_core crc_itu_t
    CPU: 3 PID: 2123 Comm: bash Not tainted 3.11.0-rc6+ #12
    Hardware name: LENOVO 7034DD7/ , BIOS 9HKT47AUS 01//2012
    00000000 00000000 e9625ea0 c15ec49b f3461a00 e9625eb8 c15ea119 c17cbf18
    ef084314 000019d0 f3461a00 e9625ed8 c110dc8a f3461a00 00000001 00000000
    f3461a00 40000404 00000000 e9625ef8 c110dcc1 f3461a00 f3461a00 000019d0
    Call Trace:
    dump_stack+0x41/0x52
    bad_page+0xcf/0xeb
    free_pages_prepare+0x12a/0x140
    free_hot_cold_page+0x21/0x110
    __put_single_page+0x21/0x30
    put_page+0x25/0x40
    unpoison_memory+0x107/0x200
    hwpoison_unpoison+0x20/0x30
    simple_attr_write+0xb6/0xd0
    vfs_write+0xa0/0x1b0
    SyS_write+0x4f/0x90
    sysenter_do_call+0x12/0x22
    Disabling lock debugging due to kernel taint

    Testcase:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/types.h>

    #define PAGES_TO_TEST 1
    #define PAGE_SIZE 4096

    int main(void)
    {
            char *mem;

            mem = mmap(NULL, PAGES_TO_TEST * PAGE_SIZE,
                       PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);

            if (madvise(mem, PAGES_TO_TEST * PAGE_SIZE, MADV_HWPOISON) == -1)
                    return -1;

            munmap(mem, PAGES_TO_TEST * PAGE_SIZE);

            return 0;
    }

    There is one page reference count on the default empty zero page;
    madvise_hwpoison adds another one via get_user_pages_fast. The hwpoison
    handler drops one page reference count since it's a non-LRU page.
    unpoison_memory then releases the last page reference count and frees
    the empty zero page to the buddy system, which is not correct since the
    empty zero page has the PG_reserved flag set. This patch fixes it by
    not reducing the page reference count below 1 for the empty zero page.

    Signed-off-by: Wanpeng Li
    Reviewed-by: Naoya Horiguchi
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • Drop the forward declaration of __soft_offline_page().

    Signed-off-by: Wanpeng Li
    Reviewed-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • Setting the pageblock migration type takes zone->lock, which is
    heavily contended, to avoid races. However, soft offline will set the
    pageblock migration type twice during get-page if the page is in use,
    not a hugetlbfs page, and not on the LRU list. It is unnecessary to
    set the pageblock migration type and take the heavily contended
    zone->lock again if the first get-page round has already set the
    pageblock to the right migration type.

    The trick here is that the migration type is MIGRATE_ISOLATE. There
    are two other places besides hwpoison that can change MIGRATE_ISOLATE.
    One is memory hotplug; however, we hold lock_memory_hotplug(), which
    avoids the race. The second is CMA, which unmovable page allocation
    requests can't fall back to. So it's safe here.

    Signed-off-by: Wanpeng Li
    Cc: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • Replace atomic_long_sub() with atomic_long_dec() since the page is a
    normal page rather than a hugetlbfs page or thp.

    Signed-off-by: Wanpeng Li
    Reviewed-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • There is a race between hwpoison pages and unpoison pages:
    memory_failure sets the page hwpoison flag and increases
    num_poisoned_pages without holding the page lock, and one page count
    is accounted against thp for num_poisoned_pages. However, unpoison can
    occur before memory_failure holds the page lock and splits the
    transparent hugepage; unpoison will then decrease num_poisoned_pages
    by 1 << compound_order since memory_failure has not yet split the
    transparent hugepage with the page lock held. That means we account
    one page for hwpoison but 1 << compound_order for unpoison. This patch
    fixes it by inserting a PageTransHuge check before doing
    TestClearPageHWPoison, so that unpoison fails without clearing
    PageHWPoison or decreasing num_poisoned_pages.

    CPU A                                   CPU B
    memory_failure
      TestSetPageHWPoison(p);
      if (PageHuge(p))
          nr_pages = 1 << compound_order(hpage);
      else
          nr_pages = 1;
      atomic_long_add(nr_pages, &num_poisoned_pages);
                                            unpoison_memory
                                              nr_pages = 1 << compound_trans_order(page);
                                              if (TestClearPageHWPoison(p))
                                                  atomic_long_sub(nr_pages, &num_poisoned_pages);
      lock page
      if (!PageHWPoison(p))
          unlock page and return
      hwpoison_user_mappings
      if (PageTransHuge(hpage))
          split_huge_page(hpage);
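
    A minimal sketch of the check this patch inserts into unpoison_memory()
    before TestClearPageHWPoison() (exact placement is an assumption):

    /*
     * A thp here means memory_failure() is working on the page but has
     * not yet taken the page lock and split it; yield to it rather than
     * unbalancing num_poisoned_pages.
     */
    if (PageTransHuge(page)) {
            pr_info("MCE: Memory failure is now running on %#lx\n", pfn);
            return;
    }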

    Signed-off-by: Wanpeng Li
    Suggested-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • The compound lock was introduced by commit e9da73d67 ("thp:
    compound_lock."); it is used to serialize put_page() against
    __split_huge_page_refcount(). In addition, transparent hugepages are
    split in the hwpoison handler and just one subpage is poisoned. It is
    unnecessary to hold the compound lock for a hugetlbfs page. This patch
    replaces compound_trans_order with compound_order in the places where
    the page is a hugetlbfs page.

    Signed-off-by: Wanpeng Li
    Reviewed-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • memory_failure() stores the page flags of the error page before doing
    the unmap, and (only) if the first check with the page flags at that
    time decides the error page is unknown, it does a second check with
    the stored page flags, since memory_failure() unmaps the error pages
    before doing page_action(). This unmapping changes the page state; in
    particular, page_remove_rmap() (called from try_to_unmap_one()) clears
    PG_mlocked, so page_action() can't catch mlocked pages after that.

    However, memory_failure() can't handle memory errors on dirty mlocked
    pages correctly: try_to_unmap_one() will move the dirty bit from the
    pte to the physical page, and the second check loses it since it
    checks the stored page flags. This patch fixes it by restoring the
    PG_dirty flag to the stored page flags if the page is dirty.
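
    A minimal sketch of the restore step (the page_flags snapshot variable
    is an assumption based on the description above):

    /*
     * try_to_unmap() has moved the pte dirty bit into the struct page,
     * so fold it back into the snapshot taken before unmapping.
     */
    if (PageDirty(p))
            page_flags |= (1UL << PG_dirty);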

    Testcase:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/mman.h>

    #define PAGES_TO_TEST 2
    #define PAGE_SIZE 4096

    int main(void)
    {
            char *mem;
            int i;

            mem = mmap(NULL, PAGES_TO_TEST * PAGE_SIZE,
                       PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, 0, 0);

            for (i = 0; i < PAGES_TO_TEST; i++)
                    mem[i * PAGE_SIZE] = 'a';

            if (madvise(mem, PAGES_TO_TEST * PAGE_SIZE, MADV_HWPOISON) == -1)
                    return -1;

            return 0;
    }

    Before patch:

    [ 912.839247] Injecting memory failure for page 7dfb8 at 7f6b4e37b000
    [ 912.839257] MCE 0x7dfb8: clean mlocked LRU page recovery: Recovered
    [ 912.845550] MCE 0x7dfb8: clean mlocked LRU page still referenced by 1 users
    [ 912.852586] Injecting memory failure for page 7e6aa at 7f6b4e37c000
    [ 912.852594] MCE 0x7e6aa: clean mlocked LRU page recovery: Recovered
    [ 912.858936] MCE 0x7e6aa: clean mlocked LRU page still referenced by 1 users

    After patch:

    [ 163.590225] Injecting memory failure for page 91bc2f at 7f9f5b0e5000
    [ 163.590264] MCE 0x91bc2f: dirty mlocked LRU page recovery: Recovered
    [ 163.596680] MCE 0x91bc2f: dirty mlocked LRU page still referenced by 1 users
    [ 163.603831] Injecting memory failure for page 91cdd3 at 7f9f5b0e6000
    [ 163.603852] MCE 0x91cdd3: dirty mlocked LRU page recovery: Recovered
    [ 163.610305] MCE 0x91cdd3: dirty mlocked LRU page still referenced by 1 users

    Signed-off-by: Wanpeng Li
    Reviewed-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • Soft offline code expects that MIGRATE_ISOLATE is set on the target
    page only during soft offlining work. But currently it doesn't work as
    expected when get_any_page() fails and returns a negative value. As a
    result, end users can be left with unexpectedly isolated pages. This
    patch just fixes it.

    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Wanpeng Li
    Cc: Andi Kleen
    Cc: Fengguang Wu
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Currently migrate_huge_page() takes a pointer to the hugepage to be
    migrated as an argument, instead of taking a pointer to a list of
    hugepages to be migrated. This behavior was introduced in commit
    189ebff28 ("hugetlb: simplify migrate_huge_page()"), and was OK
    because until now hugepage migration has been enabled only for
    soft-offlining, which migrates only one hugepage in a single call.

    But the situation will change with the later patches in this series,
    which enable other users of page migration to support hugepage
    migration. They can kick off migration for both normal pages and
    hugepages in a single call, so we need to go back to the original
    implementation, which uses linked lists to collect the hugepages to be
    migrated.

    With this patch, soft_offline_huge_page() switches to use migrate_pages(),
    and migrate_huge_page() is not used any more. So let's remove it.
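
    A minimal sketch of the converted call site in soft_offline_huge_page()
    (the five-argument migrate_pages() signature is the one of that era;
    treat the details as assumptions):

    LIST_HEAD(pagelist);

    list_move(&hpage->lru, &pagelist);
    ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
                        MIGRATE_SYNC, MR_MEMORY_FAILURE);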

    Signed-off-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Reviewed-by: Wanpeng Li
    Acked-by: Hillf Danton
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

11 Sep, 2013

1 commit

  • Pass the node of the current zone being reclaimed to shrink_slab(),
    allowing the shrink_control nodemask to be set appropriately for
    node-aware shrinkers.
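
    A minimal caller-side sketch (field and helper names assumed from the
    shrink_control of that era):

    nodes_clear(shrink->nodes_to_scan);
    node_set(zone_to_nid(zone), shrink->nodes_to_scan);

    shrink_slab(shrink, sc->nr_scanned, lru_pages);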

    Signed-off-by: Dave Chinner
    Signed-off-by: Glauber Costa
    Acked-by: Mel Gorman
    Cc: "Theodore Ts'o"
    Cc: Adrian Hunter
    Cc: Al Viro
    Cc: Artem Bityutskiy
    Cc: Arve Hjønnevåg
    Cc: Carlos Maiolino
    Cc: Christoph Hellwig
    Cc: Chuck Lever
    Cc: Daniel Vetter
    Cc: David Rientjes
    Cc: Gleb Natapov
    Cc: Greg Thelen
    Cc: J. Bruce Fields
    Cc: Jan Kara
    Cc: Jerome Glisse
    Cc: John Stultz
    Cc: KAMEZAWA Hiroyuki
    Cc: Kent Overstreet
    Cc: Kirill A. Shutemov
    Cc: Marcelo Tosatti
    Cc: Mel Gorman
    Cc: Steven Whitehouse
    Cc: Thomas Hellstrom
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Dave Chinner
     

07 Sep, 2013

1 commit

  • Pull trivial tree from Jiri Kosina:
    "The usual trivial updates all over the tree -- mostly typo fixes and
    documentation updates"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (52 commits)
    doc: Documentation/cputopology.txt fix typo
    treewide: Convert retrun typos to return
    Fix comment typo for init_cma_reserved_pageblock
    Documentation/trace: Correcting and extending tracepoint documentation
    mm/hotplug: fix a typo in Documentation/memory-hotplug.txt
    power: Documentation: Update s2ram link
    doc: fix a typo in Documentation/00-INDEX
    Documentation/printk-formats.txt: No casts needed for u64/s64
    doc: Fix typo "is is" in Documentations
    treewide: Fix printks with 0x%#
    zram: doc fixes
    Documentation/kmemcheck: update kmemcheck documentation
    doc: documentation/hwspinlock.txt fix typo
    PM / Hibernate: add section for resume options
    doc: filesystems : Fix typo in Documentations/filesystems
    scsi/megaraid fixed several typos in comments
    ppc: init_32: Fix error typo "CONFIG_START_KERNEL"
    treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks
    page_isolation: Fix a comment typo in test_pages_isolated()
    doc: fix a typo about irq affinity
    ...

    Linus Torvalds
     


11 Jul, 2013

1 commit

  • If the firmware indicates in the GHES error data entry that the error
    threshold has been exceeded for a corrected error event, then we try
    to soft-offline the page. This could be called in interrupt context,
    so we queue this up similarly to how we handle memory failure
    scenarios.
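
    A rough sketch of the queueing path described above (the
    MF_SOFT_OFFLINE flag and memory_failure_queue() are from this era's
    kernel; the exact hunks are assumptions):

    /* in the GHES handler, possibly in interrupt context */
    pfn = mem_err->physical_addr >> PAGE_SHIFT;
    memory_failure_queue(pfn, 0, MF_SOFT_OFFLINE);

    /* later, in process context, the queued work runs */
    if (entry.flags & MF_SOFT_OFFLINE)
            soft_offline_page(pfn_to_page(entry.pfn), entry.flags);
    else
            memory_failure(entry.pfn, entry.trapno, entry.flags);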

    Signed-off-by: Naveen N. Rao
    Acked-by: Borislav Petkov
    Signed-off-by: Tony Luck

    Naveen N. Rao
     

04 Jul, 2013

1 commit

  • After a successful page migration by soft offlining, the source page is
    not properly freed and it's never reusable even if we unpoison it
    afterward.

    This is caused by a race between freeing the page and setting
    PG_hwpoison. In successful soft offlining, the source page is put (and
    the refcount becomes 0) by putback_lru_page() in unmap_and_move(),
    where it's linked to a pagevec and the actual freeing back to buddy is
    delayed. So if PG_hwpoison is set on the page before freeing, the
    freeing does not function as expected (in that case freeing aborts in
    the free_pages_prepare() check).

    This patch tries to make sure to free the source page before setting
    PG_hwpoison on it. To avoid reallocating, the page keeps
    MIGRATE_ISOLATE until after setting PG_hwpoison.

    This patch also removes obsolete comments about "keeping elevated
    refcount" because what they say is not true. Unlike memory_failure(),
    soft_offline_page() uses no special page isolation code, and the
    soft-offlined pages have no elevated refcount.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

30 Apr, 2013

1 commit

  • Currently page_action() does not check the dirty flag to determine
    whether the error page is a "clean mlocked/unevictable LRU" page.
    This doesn't cause any misjudgement because we do the matching against
    "dirty mlocked/unevictable LRU" just before the check. But in order to
    make the code consistent and/or to avoid a potential regression, we
    had better check the dirty flag explicitly.

    Signed-off-by: Naoya Horiguchi
    Suggested-by: Chen Gong
    Cc: Andi Kleen
    Cc: Tony Luck
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

24 Feb, 2013

8 commits

  • error_states[] has two separate states, "unevictable LRU page" and
    "mlocked LRU page", and the former one has the higher priority now.
    But because of that, the latter one is rarely chosen, since pages with
    PageMlocked very likely have PG_unevictable set as well. On the other
    hand, PG_unevictable without PageMlocked is common for ramfs or
    SHM_LOCKed shared memory, so reversing the priority of these two
    states helps us clearly distinguish them.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Chen Gong
    Cc: Tony Luck
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • memory_failure() can't handle memory errors on mlocked pages
    correctly, because page_action() judges such errors as ones on
    "unknown pages" instead of ones on "unevictable LRU page" or "mlocked
    LRU page". In order to determine the page_state, page_action() checks
    the page flags at the time of the judgement, but those page flags are
    not the same as the ones just after memory_failure() is called,
    because memory_failure() unmaps the error pages before doing
    page_action(). This unmapping changes the page state; in particular,
    page_remove_rmap() (called from try_to_unmap_one()) clears PG_mlocked,
    so page_action() can't catch mlocked pages after that.

    With this patch, we store the page flags of the error page before
    doing the unmap, and (only) if the first check with the page flags at
    that time decides the error page is unknown, we do the second check
    with the stored page flags. This implementation doesn't change error
    handling for the page types for which the first check can determine
    the page state correctly.
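
    A minimal sketch of the two-pass check in memory_failure() (structure
    inferred from the description; error_states[] ends with a catch-all
    entry whose mask is 0, i.e. "unknown page"):

    page_flags = p->flags;          /* snapshot before unmapping */

    /* ... unmap the error page; this may clear PG_mlocked etc. ... */

    for (ps = error_states;; ps++)
            if ((p->flags & ps->mask) == ps->res)
                    break;
    if (!ps->mask)                  /* first check said "unknown page" */
            for (ps = error_states;; ps++)
                    if ((page_flags & ps->mask) == ps->res)
                            break;
    res = page_action(ps, p, pfn);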

    [akpm@linux-foundation.org: tweak comments]
    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Cc: Chen Gong
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • No functional change, but the only purpose of the offlining argument
    to migrate_pages() etc. was to ensure that __unmap_and_move() could migrate a
    KSM page for memory hotremove (which took ksm_thread_mutex) but not for
    other callers. Now all cases are safe, remove the arg.

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: Petr Holasek
    Cc: Andrea Arcangeli
    Cc: Izik Eidus
    Cc: Gerald Schaefer
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • num_poisoned_pages counts up the number of pages isolated by memory
    errors. But for thp, only one subpage is isolated because memory error
    handler splits it, so it's wrong to add (1 << compound_trans_order).

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Currently soft_offline_page() is hard to maintain because it has many
    return points and goto statements. All of this mess comes from
    get_any_page().

    This function should only get the page refcount, as the name implies,
    but it also does some page-isolating actions like SetPageHWPoison()
    and dequeuing a hugepage. This patch corrects it and introduces some
    internal subroutines to make the soft offlining code more readable and
    maintainable.

    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Andi Kleen
    Cc: Tony Luck
    Cc: Wu Fengguang
    Cc: Xishi Qiu
    Cc: Jiang Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Since MCE is an x86 concept, and this code is in mm/, it would be better
    to use the name num_poisoned_pages instead of mce_bad_pages.

    [akpm@linux-foundation.org: fix mm/sparse.c]
    Signed-off-by: Xishi Qiu
    Signed-off-by: Jiang Liu
    Suggested-by: Borislav Petkov
    Reviewed-by: Wanpeng Li
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • There are too many return points randomly intermingled with some
    "goto done" return points. So adjust the function structure: one path
    for the success case, the other for the failure case. Use
    atomic_long_inc instead of atomic_long_add.

    Signed-off-by: Xishi Qiu
    Signed-off-by: Jiang Liu
    Suggested-by: Andrew Morton
    Cc: Borislav Petkov
    Cc: Wanpeng Li
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • When doing

    $ echo paddr > /sys/devices/system/memory/soft_offline_page

    to offline a *free* page, the value of mce_bad_pages will be
    incremented, and the page gets the HWPoison flag set, but it is still
    managed by the page buddy allocator.

    $ cat /proc/meminfo | grep HardwareCorrupted

    shows the value.

    If we offline the same page, the value of mce_bad_pages will be
    incremented *again*; this means the value is now incorrect. Assume the
    page is still free during this short time:

    soft_offline_page()
        get_any_page()
            "else if (is_free_buddy_page(p))" branch returns 0
        "goto done";
        "atomic_long_add(1, &mce_bad_pages);"

    This patch:

    Move the poisoned page check to the beginning of the function in order
    to fix the error.
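
    A minimal sketch of the reordering (a hypothetical shape; the message
    text is illustrative only):

    /* bail out at the top of soft_offline_page() */
    if (PageHWPoison(page)) {
            pr_info("soft offline: %#lx: page already poisoned\n", pfn);
            return -EBUSY;
    }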

    Signed-off-by: Xishi Qiu
    Signed-off-by: Jiang Liu
    Tested-by: Naoya Horiguchi
    Cc: Borislav Petkov
    Cc: Wanpeng Li
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     

17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for these workloads driven by perf, which is
    bad but is potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers which is not reported by
    the tool by default and is sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore batch
    handles PTEs but I no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

12 Dec, 2012

1 commit

  • action_result() fails to print out "dirty" even if an error occurred
    on a dirty pagecache page, because by the time we check PageDirty in
    action_result() the flag has already been cleared by page isolation,
    even if the page was dirty before error handling. This can break some
    applications that monitor this message, so it should be fixed.

    There are several callers of action_result() except page_action(), but
    either of them are not for LRU pages but for free pages or kernel pages,
    so we don't have to consider dirty or not for them.

    Note that PG_dirty can be set outside page locks as described in commit
    6746aff74da2 ("HWPOISON: shmem: call set_page_dirty() with locked
    page"), so this patch does not completely closes the race window, but
    just narrows it.

    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Andi Kleen
    Cc: Tony Luck
    Cc: "Jun'ichi Nomura"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi