19 Apr, 2014

1 commit

  • Sasha Levin has reported two THP BUGs[1][2]. I believe both of them
    have the same root cause. Let's look at them one by one.

    The first bug[1] is "kernel BUG at mm/huge_memory.c:1829!". It's the
    BUG_ON(mapcount != page_mapcount(page)) in __split_huge_page(). From my
    testing I see that page_mapcount() is higher than mapcount here.

    I think it happens due to a race between zap_huge_pmd() and
    page_check_address_pmd(): page_check_address_pmd() misses the PMD which
    is being zapped:

    CPU0                                          CPU1
                                                  zap_huge_pmd()
                                                    pmdp_get_and_clear()
    __split_huge_page()
      anon_vma_interval_tree_foreach()
        __split_huge_page_splitting()
          page_check_address_pmd()
            mm_find_pmd()
              /*
               * We check if the PMD is present without taking ptl: no
               * serialization against zap_huge_pmd(). We miss this PMD,
               * it's not accounted to 'mapcount' in __split_huge_page().
               */
              pmd_present(pmd) == 0

      BUG_ON(mapcount != page_mapcount(page)) // CRASH!!!

                                                    page_remove_rmap(page)
                                                      atomic_add_negative(-1, &page->_mapcount)

    The second bug[2] is "kernel BUG at mm/huge_memory.c:1371!".
    It's VM_BUG_ON_PAGE(!PageHead(page), page) in zap_huge_pmd().

    This happens in a similar way:

    CPU0                                          CPU1
                                                  zap_huge_pmd()
                                                    pmdp_get_and_clear()
                                                    page_remove_rmap(page)
                                                      atomic_add_negative(-1, &page->_mapcount)
    __split_huge_page()
      anon_vma_interval_tree_foreach()
        __split_huge_page_splitting()
          page_check_address_pmd()
            mm_find_pmd()
              pmd_present(pmd) == 0  /* The same comment as above */

      /*
       * No crash this time since we already decremented page->_mapcount in
       * zap_huge_pmd().
       */
      BUG_ON(mapcount != page_mapcount(page))

      /*
       * We split the compound page here into small pages without
       * serialization against zap_huge_pmd()
       */
      __split_huge_page_refcount()
                                                    VM_BUG_ON_PAGE(!PageHead(page), page); // CRASH!!!

    So my understanding is that the problem is the pmd_present() check in
    mm_find_pmd(), done without taking the page table lock.

    The bug was introduced by me in commit 117b0791ac42. Sorry for that. :(

    Let's open code mm_find_pmd() in page_check_address_pmd() and do the
    check under the page table lock.

    Note that __page_check_address() does the same for PTE entries
    if sync != 0.
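
    Roughly, the patched check looks like this (a sketch only; the
    splitting-state checks and further page validation are elided):

    pmd_t *page_check_address_pmd(struct page *page, struct mm_struct *mm,
                                  unsigned long address,
                                  enum page_check_address_pmd_flag flag,
                                  spinlock_t **ptl)
    {
            pgd_t *pgd;
            pud_t *pud;
            pmd_t *pmd;

            if (address & ~HPAGE_PMD_MASK)
                    return NULL;

            pgd = pgd_offset(mm, address);
            if (!pgd_present(*pgd))
                    return NULL;
            pud = pud_offset(pgd, address);
            if (!pud_present(*pud))
                    return NULL;
            pmd = pmd_offset(pud, address);

            *ptl = pmd_lock(mm, pmd);   /* serializes against zap_huge_pmd() */
            if (!pmd_present(*pmd) || pmd_page(*pmd) != page)
                    goto unlock;
            /* ... splitting/non-splitting checks as before ... */
            return pmd;
    unlock:
            spin_unlock(*ptl);
            return NULL;
    }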

    I've stress tested the split and zap code paths for 36+ hours by now and
    don't see crashes with the patch applied; before the patch the crashes
    triggered within hours of testing.
    [2] https://lkml.kernel.org/g/

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Sasha Levin
    Tested-by: Sasha Levin
    Cc: Bob Liu
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Dave Jones
    Cc: Vlastimil Babka
    Cc: [3.13+]

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

08 Apr, 2014

2 commits

  • mem_cgroup_newpage_charge is used only for charging anonymous memory,
    so it is better to rename it to mem_cgroup_charge_anon.

    mem_cgroup_cache_charge is used for file-backed memory, so rename it to
    mem_cgroup_charge_file.
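
    For illustration, the charging entry points keep their existing
    signatures and only change name (a sketch):

    /* before */
    int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask);
    int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask);

    /* after */
    int mem_cgroup_charge_anon(struct page *page, struct mm_struct *mm, gfp_t gfp_mask);
    int mem_cgroup_charge_file(struct page *page, struct mm_struct *mm, gfp_t gfp_mask);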

    Signed-off-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The main motivation behind this patch is to provide a way to disable THP
    for jobs where the code cannot be modified, and using a malloc hook with
    madvise is not an option (i.e. statically allocated data). This patch
    allows us to do just that, without affecting other jobs running on the
    system.

    We need to do this sort of thing for jobs where THP hurts performance,
    due to the possibility of increased remote memory accesses that can be
    created by situations such as the following:

    When you touch 1 byte of an untouched, contiguous 2MB chunk, a THP will
    be handed out, and the THP will be stuck on whatever node the chunk was
    originally referenced from. If many remote nodes need to do work on
    that same chunk, they'll be making remote accesses.

    With THP disabled, 4K pages can be handed out to separate nodes as
    they're needed, greatly reducing the amount of remote accesses to
    memory.

    This patch is based on some of my work combined with some
    suggestions/patches given by Oleg Nesterov. The main goal here is to
    add a prctl switch to allow us to disable THP on a per-mm_struct
    basis.
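
    A minimal sketch of such a wrapper, assuming the PR_SET_THP_DISABLE
    prctl added later in this series (the numeric value 41 is the one the
    series uses); the flag is intended to be inherited by the wrapped job:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/prctl.h>

    #ifndef PR_SET_THP_DISABLE
    #define PR_SET_THP_DISABLE 41   /* as defined by this patch set */
    #endif

    int main(int argc, char **argv)
    {
            /* Mark this mm as THP-disabled, then exec the real job. */
            if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0))
                    perror("prctl(PR_SET_THP_DISABLE)");
            if (argc > 1) {
                    execvp(argv[1], &argv[1]);
                    perror("execvp");
            }
            return 1;
    }

    The prctl_wrapper_mmv3 used in the test runs below presumably does
    something similar before exec'ing thp_pthread.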

    Here's a bit of test data with the new patch in place...

    First with the flag unset:

    # perf stat -a ./prctl_wrapper_mmv3 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g
    Setting thp_disabled for this task...
    thp_disable: 0
    Set thp_disabled state to 0
    Process pid = 18027

    TYPE:  CPUS   MAX WALL  MIN WALL  SYS    USER   TOTCPU  TOTCPU/CPU  TOT_PF/WALL_SEC  TOT_PF/SYS_SEC        PF/WSEC/CPU  NODES
           512    1.120     0.060     0.000  0.110  0.110   0.000       28571428864      -9223372036854775808  55803572     23

    Performance counter stats for './prctl_wrapper_mmv3_hack 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g':

    273719072.841402 task-clock # 641.026 CPUs utilized [100.00%]
    1,008,986 context-switches # 0.000 M/sec [100.00%]
    7,717 CPU-migrations # 0.000 M/sec [100.00%]
    1,698,932 page-faults # 0.000 M/sec
    355,222,544,890,379 cycles # 1.298 GHz [100.00%]
    536,445,412,234,588 stalled-cycles-frontend # 151.02% frontend cycles idle [100.00%]
    409,110,531,310,223 stalled-cycles-backend # 115.17% backend cycles idle [100.00%]
    148,286,797,266,411 instructions # 0.42 insns per cycle
    # 3.62 stalled cycles per insn [100.00%]
    27,061,793,159,503 branches # 98.867 M/sec [100.00%]
    1,188,655,196 branch-misses # 0.00% of all branches

    427.001706337 seconds time elapsed

    Now with the flag set:

    # perf stat -a ./prctl_wrapper_mmv3 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g
    Setting thp_disabled for this task...
    thp_disable: 1
    Set thp_disabled state to 1
    Process pid = 144957

    TYPE:  CPUS   MAX WALL  MIN WALL  SYS    USER   TOTCPU  TOTCPU/CPU  TOT_PF/WALL_SEC  TOT_PF/SYS_SEC        PF/WSEC/CPU  NODES
           512    0.620     0.260     0.250  0.320  0.570   0.001       51612901376      128000000000          100806448    23

    Performance counter stats for './prctl_wrapper_mmv3_hack 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g':

    138789390.540183 task-clock # 641.959 CPUs utilized [100.00%]
    534,205 context-switches # 0.000 M/sec [100.00%]
    4,595 CPU-migrations # 0.000 M/sec [100.00%]
    63,133,119 page-faults # 0.000 M/sec
    147,977,747,269,768 cycles # 1.066 GHz [100.00%]
    200,524,196,493,108 stalled-cycles-frontend # 135.51% frontend cycles idle [100.00%]
    105,175,163,716,388 stalled-cycles-backend # 71.07% backend cycles idle [100.00%]
    180,916,213,503,160 instructions # 1.22 insns per cycle
    # 1.11 stalled cycles per insn [100.00%]
    26,999,511,005,868 branches # 194.536 M/sec [100.00%]
    714,066,351 branch-misses # 0.00% of all branches

    216.196778807 seconds time elapsed

    As with previous versions of the patch, we're getting about a 2x
    performance increase here. Here's a link to the test case I used, along
    with the little wrapper to activate the flag:

    http://oss.sgi.com/projects/memtests/thp_pthread_mmprctlv3.tar.gz

    This patch (of 3):

    Revert commit 8e72033f2a48 and add in code to fix up any issues caused
    by the revert.

    The revert is necessary because hugepage_madvise would return -EINVAL
    when VM_NOHUGEPAGE is set, which will break subsequent chunks of this
    patch set.

    Here's a snip of an e-mail from Gerald detailing the original purpose of
    this code, and providing justification for the revert:

    "The intent of commit 8e72033f2a48 was to guard against any future
    programming errors that may result in an madvice(MADV_HUGEPAGE) on
    guest mappings, which would crash the kernel.

    Martin suggested adding the bit to arch/s390/mm/pgtable.c, if
    8e72033f2a48 was to be reverted, because that check will also prevent
    a kernel crash in the case described above; it will now send a
    SIGSEGV instead.

    This would now also allow to do the madvise on other parts, if
    needed, so it is a more flexible approach. One could also say that
    it would have been better to do it this way right from the
    beginning..."

    Signed-off-by: Alex Thorlton
    Suggested-by: Oleg Nesterov
    Tested-by: Christian Borntraeger
    Cc: Gerald Schaefer
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Paolo Bonzini
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Johannes Weiner
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Thorlton
     

04 Apr, 2014

1 commit

  • I've realized that there's no need for do_huge_pmd_wp_zero_page_fallback().
    We can just split the zero page with split_huge_page_pmd() and return
    VM_FAULT_FALLBACK. handle_pte_fault() will then handle the
    write-protection fault for us.
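
    Schematically, when we cannot allocate a replacement huge page for the
    zero page, the path in do_huge_pmd_wp_page() now reduces to something
    like this (a sketch, not the exact diff):

    if (unlikely(!new_page)) {
            if (!page) {    /* the write fault hit the huge zero page */
                    split_huge_page_pmd(vma, address, pmd);
                    ret |= VM_FAULT_FALLBACK;
                    goto out;   /* handle_pte_fault() redoes the fault on PTEs */
            }
            /* ... non-zero-page fallback as before ... */
    }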

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

04 Mar, 2014

1 commit

  • Daniel Borkmann reported a VM_BUG_ON assertion failing:

    ------------[ cut here ]------------
    kernel BUG at mm/mlock.c:528!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: ccm arc4 iwldvm [...]
    video
    CPU: 3 PID: 2266 Comm: netsniff-ng Not tainted 3.14.0-rc2+ #8
    Hardware name: LENOVO 2429BP3/2429BP3, BIOS G4ET37WW (1.12 ) 05/29/2012
    task: ffff8801f87f9820 ti: ffff88002cb44000 task.ti: ffff88002cb44000
    RIP: 0010:[] [] munlock_vma_pages_range+0x2e0/0x2f0
    Call Trace:
    do_munmap+0x18f/0x3b0
    vm_munmap+0x41/0x60
    SyS_munmap+0x22/0x30
    system_call_fastpath+0x1a/0x1f
    RIP munlock_vma_pages_range+0x2e0/0x2f0
    ---[ end trace a0088dcf07ae10f2 ]---

    because munlock_vma_pages_range() thinks it's unexpectedly in the middle
    of a THP page. This can be reproduced with default config since 3.11
    kernels. A reproducer can be found in the kernel's selftest directory
    for networking by running ./psock_tpacket.

    The problem is that an order=2 compound page (allocated by
    alloc_one_pg_vec_page()) is part of the munlocked VM_MIXEDMAP vma (mapped
    by packet_mmap()) and is mistaken for a THP page, assumed to be order=9.

    The checks for THP in munlock came with commit ff6a6da60b89 ("mm:
    accelerate munlock() treatment of THP pages"), i.e. since 3.9, but did
    not trigger a bug. It just makes munlock_vma_pages_range() skip such
    compound pages until the next 512-pages-aligned page, when it encounters
    a head page. This is however not a problem for vma's where mlocking has
    no effect anyway, but it can distort the accounting.

    Since commit 7225522bb429 ("mm: munlock: batch non-THP page isolation
    and munlock+putback using pagevec") this can trigger a VM_BUG_ON in
    PageTransHuge() check.

    This patch fixes the issue by adding VM_MIXEDMAP flag to VM_SPECIAL, a
    list of flags that make vma's non-mlockable and non-mergeable. The
    reasoning is that VM_MIXEDMAP vma's are similar to VM_PFNMAP, which is
    already on the VM_SPECIAL list, and both are intended for non-LRU pages
    where mlocking makes no sense anyway. Related Lkml discussion can be
    found in [2].

    [1] tools/testing/selftests/net/psock_tpacket
    [2] https://lkml.org/lkml/2014/1/10/427
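
    The change itself is roughly a one-liner in include/linux/mm.h (sketch):

    /* before */
    #define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP)

    /* after: VM_MIXEDMAP vmas are now non-mlockable and non-mergeable too */
    #define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP | VM_MIXEDMAP)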

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Daniel Borkmann
    Reported-by: Daniel Borkmann
    Tested-by: Daniel Borkmann
    Cc: Thomas Hellstrom
    Cc: John David Anglin
    Cc: HATAYAMA Daisuke
    Cc: Konstantin Khlebnikov
    Cc: Carsten Otte
    Cc: Jared Hulbert
    Tested-by: Hannes Frederic Sowa
    Cc: Kirill A. Shutemov
    Acked-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: [3.11.x+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

26 Feb, 2014

1 commit

  • Masayoshi Mizuma reported an application hang under the memcg limit. It
    happens on a write-protection fault to the huge zero page.

    If we successfully allocate a huge page to replace the zero page but hit
    the memcg limit, we need to split the zero page with split_huge_page_pmd()
    and fall back to small pages.

    The other part of the problem is that VM_FAULT_OOM has special meaning
    in the do_huge_pmd_wp_page() context. __handle_mm_fault() expects the
    page to be split if it sees VM_FAULT_OOM and it will retry the page fault
    handling. This causes an infinite loop if the page was not split.

    do_huge_pmd_wp_zero_page_fallback() can return VM_FAULT_OOM if it failed
    to allocate one small page, so fallback to small pages will not help.

    The solution for this part is to replace VM_FAULT_OOM with
    VM_FAULT_FALLBACK if fallback is required.
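
    A sketch of the resulting charge-failure path in do_huge_pmd_wp_page()
    (simplified; vm-event accounting and other bookkeeping elided):

    if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
            put_page(new_page);
            if (page) {
                    split_huge_page(page);
                    put_page(page);
            } else {
                    /* huge zero page: split the pmd itself */
                    split_huge_page_pmd(vma, address, pmd);
            }
            ret |= VM_FAULT_FALLBACK;   /* not VM_FAULT_OOM: no endless retry */
            goto out;
    }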

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Masayoshi Mizuma
    Reviewed-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

17 Feb, 2014

1 commit

  • Archs like ppc64 don't do a TLB flush in the set_pte/pmd functions when
    using a hash table MMU, for various reasons (the flush is handled as part
    of the PTE modification when necessary).

    ppc64 thus doesn't implement flush_tlb_range for hash-based MMUs.

    Additionally, ppc64 requires the TLB flushing to be batched within ptl
    locks.

    The reason to do that is to ensure that the hash page table is in sync
    with the Linux page table.

    We track the hpte index in the Linux PTE, and if we clear it without
    flushing the hash table and drop the ptl lock, another cpu can update the
    pte and we can end up with a duplicate entry in the hash table, which is
    fatal.

    We also want to keep set_pte_at simpler by not requiring it to do a hash
    flush for performance reasons. We do that by assuming that set_pte_at()
    is never *ever* called on a PTE that is already valid.

    This was the case until the NUMA code went in, which broke that
    assumption.

    Fix that by introducing a new pair of helpers to set _PAGE_NUMA in a
    way similar to ptep/pmdp_set_wrprotect(), with a generic implementation
    using set_pte_at() and a powerpc specific one using the appropriate
    mechanism needed to keep the hash table in sync.
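
    The generic helper is essentially a thin wrapper around set_pte_at()
    (a sketch of the asm-generic version; powerpc overrides it to also keep
    the hash table in sync):

    #ifndef ptep_set_numa
    static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr,
                                     pte_t *ptep)
    {
            pte_t ptent = *ptep;

            ptent = pte_mknuma(ptent);
            set_pte_at(mm, addr, ptep, ptent);
    }
    #endif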

    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Benjamin Herrenschmidt

    Aneesh Kumar K.V
     

28 Jan, 2014

1 commit

  • Pull powerpc mremap fix from Ben Herrenschmidt:
    "This is the patch that I had sent after -rc8 and which we decided to
    wait before merging. It's based on a different tree than my -next
    branch (it needs some pre-reqs that were in -rc4 or so while my -next
    is based on -rc1) so I left it as a separate branch for you to pull.
    It's identical to the request I did 2 or 3 weeks back.

    This fixes crashes in mremap with THP on powerpc.

    The fix however requires a small change in the generic code. It moves
    a condition into a helper we can override from the arch which is
    harmless, but it *also* slightly changes the order of the set_pmd and
    the withdraw & deposit, which should be fine according to Kirill (who
    wrote that code) but I agree -rc8 is a bit late...

    It was acked by Kirill and Andrew told me to just merge it via powerpc"

    * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
    powerpc/thp: Fix crash on mremap

    Linus Torvalds
     

24 Jan, 2014

3 commits

  • Code that is obj-y (always built-in) or dependent on a bool Kconfig
    (built-in or absent) can never be modular. So using module_init as an
    alias for __initcall can be somewhat misleading.

    Fix these up now, so that we can relocate module_init from init.h into
    module.h in the future. If we don't do this, we'd have to add module.h
    to obviously non-modular code, and that would be a worse thing.

    The audit targets the following module_init users for change:
    mm/ksm.c bool KSM
    mm/mmap.c bool MMU
    mm/huge_memory.c bool TRANSPARENT_HUGEPAGE
    mm/mmu_notifier.c bool MMU_NOTIFIER

    Note that direct use of __initcall is discouraged, vs. one of the
    priority categorized subgroups. As __initcall gets mapped onto
    device_initcall, our use of subsys_initcall (which makes sense for these
    files) will thus change this registration from level 6-device to level
    4-subsys (i.e. slightly earlier).

    However, no impact of that difference has been observed during testing.

    One might think that core_initcall (l2) or postcore_initcall (l3) would
    be more appropriate for anything in mm/ but if we look at some actual
    init functions themselves, we see things like:

    mm/huge_memory.c --> hugepage_init --> hugepage_init_sysfs
    mm/mmap.c --> init_user_reserve --> sysctl_user_reserve_kbytes
    mm/ksm.c --> ksm_init --> sysfs_create_group

    and hence the choice of subsys_initcall (l4) seems reasonable, and at
    the same time minimizes the risk of changing the priority too
    drastically all at once. We can adjust further in the future.

    Also, several instances of missing ";" at EOL are fixed.
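
    For example, in mm/huge_memory.c the registration becomes (sketch):

    static int __init hugepage_init(void)
    {
            /* ... sysfs setup, khugepaged start, etc. ... */
            return 0;
    }
    subsys_initcall(hugepage_init);    /* was: module_init(hugepage_init) */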

    Signed-off-by: Paul Gortmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Gortmaker
     
  • min_free_kbytes may be raised during THP's initialization. Sometimes,
    this will change the value which was set by the user. Showing this
    message will clarify this confusion.

    Only show this message when changing a value which was set by the user
    according to Michal Hocko's suggestion.

    Show the old value of min_free_kbytes, according to Dave Hansen's
    suggestion. This gives the user the chance to restore the old value of
    min_free_kbytes.
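
    Roughly, the THP init path now remembers whether the value was set by the
    user and prints the old value when raising it (a sketch, assuming a
    user_min_free_kbytes tracking variable as in this patch):

    if (recommended_min > min_free_kbytes) {
            if (user_min_free_kbytes >= 0)
                    pr_info("raising min_free_kbytes from %d to %lu to help transparent hugepage allocations\n",
                            min_free_kbytes, recommended_min);
            min_free_kbytes = recommended_min;
            setup_per_zone_wmarks();
    }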

    Signed-off-by: Han Pingtian
    Reviewed-by: Michal Hocko
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Han Pingtian
     
  • Most of the VM_BUG_ON assertions are performed on a page. Usually, when
    one of these assertions fails we'll get a BUG_ON with a call stack and
    the registers.

    I've recently noticed, based on requests to add a small piece of code
    that dumps the page at various VM_BUG_ON sites, that the page dump is
    quite useful to people debugging issues in mm.

    This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what
    VM_BUG_ON() does, also dumps the page before executing the actual
    BUG_ON.
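
    The macro is a small wrapper around dump_page() plus BUG() (sketch of
    the CONFIG_DEBUG_VM case):

    #define VM_BUG_ON_PAGE(cond, page)                                  \
            do {                                                        \
                    if (unlikely(cond)) {                               \
                            dump_page(page);                            \
                            BUG();                                      \
                    }                                                   \
            } while (0)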

    [akpm@linux-foundation.org: fix up includes]
    Signed-off-by: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

15 Jan, 2014

1 commit

  • This patch fixes the crash below:

    NIP [c00000000004cee4] .__hash_page_thp+0x2a4/0x440
    LR [c0000000000439ac] .hash_page+0x18c/0x5e0
    ...
    Call Trace:
    [c000000736103c40] [00001ffffb000000] 0x1ffffb000000(unreliable)
    [437908.479693] [c000000736103d50] [c0000000000439ac] .hash_page+0x18c/0x5e0
    [437908.479699] [c000000736103e30] [c00000000000924c] .do_hash_page+0x4c/0x58

    On ppc64 we use the pgtable for storing the hpte slot information and
    store the address of the pgtable at a constant offset (PTRS_PER_PMD) from
    the pmd. On mremap, when we switch the pmd, we need to withdraw and
    deposit the pgtable again, so that we find the pgtable at the
    PTRS_PER_PMD offset from the new pmd.

    We also want to move the withdraw and deposit before the set_pmd so
    that, when a page fault finds the pmd trans huge, we can be sure that
    the pgtable can be located at the offset.

    Signed-off-by: Aneesh Kumar K.V
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Benjamin Herrenschmidt

    Aneesh Kumar K.V
     

12 Jan, 2014

1 commit

  • We see General Protection Fault on RSI in copy_page_rep: that RSI is
    what you get from a NULL struct page pointer.

    RIP: 0010:[] [] copy_page_rep+0x5/0x10
    RSP: 0000:ffff880136e15c00 EFLAGS: 00010286
    RAX: ffff880000000000 RBX: ffff880136e14000 RCX: 0000000000000200
    RDX: 6db6db6db6db6db7 RSI: db73880000000000 RDI: ffff880dd0c00000
    RBP: ffff880136e15c18 R08: 0000000000000200 R09: 000000000005987c
    R10: 000000000005987c R11: 0000000000000200 R12: 0000000000000001
    R13: ffffea00305aa000 R14: 0000000000000000 R15: 0000000000000000
    FS: 00007f195752f700(0000) GS:ffff880c7fc20000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000093010000 CR3: 00000001458e1000 CR4: 00000000000027e0
    Call Trace:
    copy_user_huge_page+0x93/0xab
    do_huge_pmd_wp_page+0x710/0x815
    handle_mm_fault+0x15d8/0x1d70
    __do_page_fault+0x14d/0x840
    do_page_fault+0x2f/0x90
    page_fault+0x22/0x30

    do_huge_pmd_wp_page() tests is_huge_zero_pmd(orig_pmd) four times: but
    since shrink_huge_zero_page() can free the huge_zero_page, and we have
    no hold of our own on it here (except where the fourth test holds
    page_table_lock and has checked pmd_same), it's possible for it to
    answer yes the first time, but no to the second or third test. Change
    all those last three to tests for NULL page.

    (Note: this is not the same issue as trinity's DEBUG_PAGEALLOC BUG
    in copy_page_rep with RSI: ffff88009c422000, reported by Sasha Levin
    in https://lkml.org/lkml/2013/3/29/103. I believe that one is due
    to the source page being split, and a tail page freed, while copy
    is in progress; and not a problem without DEBUG_PAGEALLOC, since
    the pmd_same check will prevent a miscopy from being made visible.)

    Fixes: 97ae17497e99 ("thp: implement refcounting for huge zero page")
    Signed-off-by: Hugh Dickins
    Cc: stable@vger.kernel.org # v3.10 v3.11 v3.12
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

03 Jan, 2014

1 commit

  • Sasha Levin reported the following warning being triggered

    WARNING: CPU: 28 PID: 35287 at mm/huge_memory.c:887 copy_huge_pmd+0x145/0x3a0()
    Call Trace:
    copy_huge_pmd+0x145/0x3a0
    copy_page_range+0x3f2/0x560
    dup_mmap+0x2c9/0x3d0
    dup_mm+0xad/0x150
    copy_process+0xa68/0x12e0
    do_fork+0x96/0x270
    SyS_clone+0x16/0x20
    stub_clone+0x69/0x90

    This warning was introduced by "mm: numa: Avoid unnecessary disruption
    of NUMA hinting during migration" for paranoia reasons but the warning
    is bogus. I was thinking of parallel races between NUMA hinting faults
    and forks, but this warning would also be triggered by a parallel reclaim
    splitting a THP during a fork. Remove the bogus warning.

    Signed-off-by: Mel Gorman
    Reported-by: Sasha Levin
    Cc: Alex Thorlton
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

19 Dec, 2013

7 commits

  • THP migration can fail for a variety of reasons. Avoid flushing the TLB
    to deal with THP migration races until the copy is ready to start.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • There are a few subtle races, between change_protection_range (used by
    mprotect and change_prot_numa) on one side, and NUMA page migration and
    compaction on the other side.

    The basic race is that there is a time window between when the PTE gets
    made non-present (PROT_NONE or NUMA), and the TLB is flushed.

    During that time, a CPU may continue writing to the page.

    This is fine most of the time, however compaction or the NUMA migration
    code may come in, and migrate the page away.

    When that happens, the CPU may continue writing, through the cached
    translation, to what is no longer the current memory location of the
    process.

    This only affects x86, which has a somewhat optimistic pte_accessible.
    All other architectures appear to be safe, and will either always flush,
    or flush whenever there is a valid mapping, even with no permissions
    (SPARC).

    The basic race looks like this:

    CPU A                       CPU B                       CPU C

                                                            load TLB entry
    make entry PTE/PMD_NUMA
                                fault on entry
                                                            read/write old page
                                start migrating page
                                change PTE/PMD to new page
                                                            read/write old page [*]
    flush TLB
                                                            reload TLB from new entry
                                                            read/write new page
                                                            lose data

    [*] the old page may belong to a new user at this point!

    The obvious fix is to flush remote TLB entries, by making pte_accessible
    aware of the fact that PROT_NONE and PROT_NUMA memory may still be
    accessible if there is a TLB flush pending for the mm.

    This should fix both NUMA migration and compaction.
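
    On x86 this makes pte_accessible() report a PROT_NONE/NUMA pte as still
    accessible while a TLB flush is pending for the mm; roughly:

    #define pte_accessible(mm, pte)                                     \
            ((pte_flags(pte) & _PAGE_PRESENT) ||                        \
             ((pte_flags(pte) & (_PAGE_PROTNONE | _PAGE_NUMA)) &&       \
              mm_tlb_flush_pending(mm)))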

    [mgorman@suse.de: fix build]
    Signed-off-by: Rik van Riel
    Signed-off-by: Mel Gorman
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • do_huge_pmd_numa_page() handles the case where there is parallel THP
    migration. However, by the time it is checked the NUMA hinting
    information has already been disrupted. This patch adds an earlier
    check with some helpers.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • On a protection change it is no longer clear if the page should be still
    accessible. This patch clears the NUMA hinting fault bits on a
    protection change.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The anon_vma lock prevents parallel THP splits and any associated
    complexity that arises when handling splits during THP migration. This
    patch checks if the lock was successfully acquired and bails from THP
    migration if it failed for any reason.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • If the PMD is flushed then a parallel fault in handle_mm_fault() will
    enter the pmd_none and do_huge_pmd_anonymous_page() path where it'll
    attempt to insert a huge zero page. This is wasteful so the patch
    avoids clearing the PMD when setting pmd_numa.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Base pages are unmapped and flushed from cache and TLB during normal
    page migration and replaced with a migration entry that causes any
    parallel NUMA hinting fault or gup to block until migration completes.

    THP does not unmap pages due to a lack of support for migration entries
    at a PMD level. This allows races with get_user_pages and
    get_user_pages_fast which commit 3f926ab945b6 ("mm: Close races between
    THP migration and PMD numa clearing") made worse by introducing a
    pmd_clear_flush().

    This patch forces get_user_page (fast and normal) on a pmd_numa page to
    go through the slow get_user_page path where it will serialise against
    THP migration and properly account for the NUMA hinting fault. On the
    migration side the page table lock is taken for each PTE update.
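
    In the x86 fast-gup path this amounts to bailing out to the slow path
    when the pmd is marked NUMA (a sketch from gup_pmd_range()):

    if (unlikely(pmd_large(pmd))) {
            /*
             * NUMA hinting faults need to be handled in the GUP slowpath
             * for accounting purposes and so that they can be serialised
             * against THP migration.
             */
            if (pmd_numa(pmd))
                    return 0;       /* fall back to the slow get_user_pages */
            if (!gup_huge_pmd(pmd, addr, next, write, pages, nr))
                    return 0;
    }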

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

13 Dec, 2013

1 commit

  • Andrey Vagin reported a crash on the VM_BUG_ON() in pgtable_pmd_page_dtor()
    with the following backtrace:

    free_pgd_range+0x2bf/0x410
    free_pgtables+0xce/0x120
    unmap_region+0xe0/0x120
    do_munmap+0x249/0x360
    move_vma+0x144/0x270
    SyS_mremap+0x3b9/0x510
    system_call_fastpath+0x16/0x1b

    The crash can be reproduced with this test case:

    #define _GNU_SOURCE
    #include <sys/mman.h>

    #define MB (1024 * 1024UL)
    #define GB (1024 * MB)

    int main(int argc, char **argv)
    {
            char *p;
            int i;

            p = mmap((void *) GB, 10 * MB, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
            for (i = 0; i < 10 * MB; i += 4096)
                    p[i] = 1;
            mremap(p, 10 * MB, 10 * MB, MREMAP_FIXED | MREMAP_MAYMOVE,
                   (void *)(2 * GB));
            return 0;
    }

    Due to the split PMD lock, we now store preallocated PTE tables for THP
    pages per PMD table. That means we need to move them to the other PMD
    table if the huge PMD is moved there.
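
    The fix makes move_huge_pmd() carry the deposited PTE table along with
    the huge pmd; a sketch of the resulting flow:

    if (__pmd_trans_huge_lock(old_pmd, vma, &old_ptl) == 1) {
            new_ptl = pmd_lockptr(mm, new_pmd);
            if (new_ptl != old_ptl)
                    spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
            pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
            VM_BUG_ON(!pmd_none(*new_pmd));
            set_pmd_at(mm, new_addr, new_pmd, pmd_mksoft_dirty(pmd));
            if (new_ptl != old_ptl) {
                    pgtable_t pgtable;

                    /* move the preallocated PTE page table as well */
                    pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
                    pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
                    spin_unlock(new_ptl);
            }
            spin_unlock(old_ptl);
    }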

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Andrey Vagin
    Tested-by: Andrey Vagin
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Nov, 2013

4 commits

  • Only trivial cases left. Let's convert them altogether.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • With split page table lock we can't know which lock we need to take
    before we find the relevant pmd.

    Let's move lock taking inside the function.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • With split ptlock it's important to know which lock
    pmd_trans_huge_lock() took. This patch adds one more parameter to the
    function to return the lock.

    In most places migration to the new API is trivial. The exception is
    move_huge_pmd(): we need to take two locks if the pmd tables are different.
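
    Callers now receive the lock they must drop; typical usage after the
    conversion looks like this (sketch):

    spinlock_t *ptl;

    if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
            /* *pmd is a stable huge pmd; ptl is the lock protecting it */
            /* ... operate on the huge pmd ... */
            spin_unlock(ptl);
    }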

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • With split page table lock for PMD level we can't hold mm->page_table_lock
    while updating nr_ptes.

    Let's convert it to atomic_long_t to avoid races.
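
    Updates then become lock-free atomics (sketch):

    /* struct mm_struct */
    atomic_long_t nr_ptes;                 /* was: unsigned long nr_ptes */

    /* updates no longer need mm->page_table_lock */
    atomic_long_inc(&mm->nr_ptes);
    atomic_long_dec(&mm->nr_ptes);
    atomic_long_read(&mm->nr_ptes);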

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: Naoya Horiguchi
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

13 Nov, 2013

3 commits

  • Khugepaged will scan/free HPAGE_PMD_NR normal pages and replace them with
    a hugepage which is allocated from the node of the first scanned normal
    page, but this policy is too rough and may produce unexpected results for
    upper users.

    The problem is that the original page balancing among all nodes will be
    broken after khugepaged starts. Consider the case where the first scanned
    normal page is allocated from node A while most of the other scanned
    normal pages are allocated from node B or C. Khugepaged will always
    allocate the hugepage from node A, which causes extra memory pressure on
    node A that was not there before khugepaged started.

    This patch tries to fix this problem by making khugepaged allocate the
    hugepage from the node with the maximum count of scanned normal page
    hits, so that the effect on the original page balancing is minimized.

    The other problem is that if the scanned normal pages are equally
    allocated from nodes A, B and C, node A will still suffer extra memory
    pressure after khugepaged starts.

    Andrew Davidoff reported a related issue several days ago. He wanted his
    application to interleave among all nodes, and "numactl
    --interleave=all ./test" was used to run the testcase, but the result
    wasn't as expected.

    cat /proc/2814/numa_maps:
    7f50bd440000 interleave:0-3 anon=51403 dirty=51403 N0=435 N1=435 N2=435 N3=50098

    The end result showed that most pages were from Node3 instead of being
    interleaved among node0-3, which was unreasonable.

    This patch also fixes this issue by allocating hugepages round-robin from
    all nodes that have the same hit record; after this patch the result was
    as expected:

    7f78399c0000 interleave:0-3 anon=51403 dirty=51403 N0=12723 N1=12723 N2=13235 N3=12722

    The simple testcase is like this:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main()
    {
            char *p;
            int i;
            int j;

            for (i = 0; i < 200; i++) {
                    p = (char *)malloc(1048576);
                    printf("malloc done\n");

                    if (p == 0) {
                            printf("Out of memory\n");
                            return 1;
                    }
                    for (j = 0; j < 1048576; j++) {
                            p[j] = 'A';
                    }
                    printf("touched memory\n");

                    sleep(1);
            }
            printf("enter sleep\n");
            while (1) {
                    sleep(100);
            }
    }
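
    A simplified sketch of the resulting node selection, assuming the
    per-scan khugepaged_node_load[] hit counters this patch introduces:

    static int khugepaged_find_target_node(void)
    {
            static int last_khugepaged_target_node = NUMA_NO_NODE;
            int nid, target_node = 0, max_value = 0;

            /* pick the node that contributed most of the scanned pages */
            for (nid = 0; nid < MAX_NUMNODES; nid++)
                    if (khugepaged_node_load[nid] > max_value) {
                            max_value = khugepaged_node_load[nid];
                            target_node = nid;
                    }

            /* round robin among nodes with the same (maximum) hit record */
            if (target_node <= last_khugepaged_target_node)
                    for (nid = last_khugepaged_target_node + 1;
                         nid < MAX_NUMNODES; nid++)
                            if (khugepaged_node_load[nid] == max_value) {
                                    target_node = nid;
                                    break;
                            }

            last_khugepaged_target_node = target_node;
            return target_node;
    }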

    [akpm@linux-foundation.org: make last_khugepaged_target_node local to khugepaged_find_target_node()]
    Reported-by: Andrew Davidoff
    Tested-by: Andrew Davidoff
    Signed-off-by: Bob Liu
    Cc: Andrea Arcangeli
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Yasuaki Ishimatsu
    Cc: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
     
  • Move alloc_hugepage() to a better place; there is no need for a separate
    #ifndef CONFIG_NUMA.

    Signed-off-by: Bob Liu
    Reviewed-by: Yasuaki Ishimatsu
    Acked-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Andrew Davidoff
    Cc: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
     
  • Since commit 13ece886d99c ("thp: transparent hugepage config choice"),
    transparent hugepage support is disabled by default, and
    TRANSPARENT_HUGEPAGE_ALWAYS is configured when TRANSPARENT_HUGEPAGE=y.

    And since commit d39d33c332c6 ("thp: enable direct defrag"), defrag is
    enabled for all transparent hugepage page faults by default, not only in
    MADV_HUGEPAGE regions.

    Signed-off-by: Jianguo Wu
    Reviewed-by: Wanpeng Li
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianguo Wu
     

01 Nov, 2013

1 commit

  • Resolve cherry-picking conflicts:

    Conflicts:
    mm/huge_memory.c
    mm/memory.c
    mm/mprotect.c

    See this upstream merge commit for more details:

    52469b4fcd4f Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

29 Oct, 2013

5 commits

  • THP migration uses the page lock to guard against parallel allocations
    but there are cases like this still open

    Task A                                  Task B
    ---------------------                   ---------------------
    do_huge_pmd_numa_page                   do_huge_pmd_numa_page
    lock_page
    mpol_misplaced == -1
    unlock_page
    goto clear_pmdnuma
                                            lock_page
                                            mpol_misplaced == 2
                                            migrate_misplaced_transhuge
    pmd = pmd_mknonnuma
    set_pmd_at

    During hours of testing, one crashed with weird errors and while I have
    no direct evidence, I suspect something like the race above happened.
    This patch extends the page lock to being held until the pmd_numa is
    cleared to prevent migration starting in parallel while the pmd_numa is
    being cleared. It also flushes the old pmd entry and orders pagetable
    insertion before rmap insertion.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Cc:
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-9-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
  • There are three callers of task_numa_fault():

    - do_huge_pmd_numa_page():
    Accounts against the current node, not the node where the
    page resides, unless we migrated, in which case it accounts
    against the node we migrated to.

    - do_numa_page():
    Accounts against the current node, not the node where the
    page resides, unless we migrated, in which case it accounts
    against the node we migrated to.

    - do_pmd_numa_page():
    Accounts not at all when the page isn't migrated, otherwise
    accounts against the node we migrated towards.

    This seems wrong to me; all three sites should have the same
    semantics. Furthermore, we should account against where the page really
    is; we already know where the task is.

    So modify all three sites to always account; we did after all receive
    the fault; and always account to where the page is after migration,
    regardless of success.

    They all still differ on when they clear the PTE/PMD; ideally that
    would get sorted too.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Cc:
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-8-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
  • THP migrations are serialised by the page lock but on its own that does
    not prevent THP splits. If the page is split during THP migration then
    the pmd_same checks will prevent page table corruption, but the page
    unlock and other fix-ups could potentially cause corruption. This patch
    takes the anon_vma lock to prevent parallel splits during migration.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Cc:
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-7-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
  • The locking for migrating THP is unusual. While normal page migration
    prevents parallel accesses using a migration PTE, THP migration relies on
    a combination of the page_table_lock, the page lock and the existence of
    the NUMA hinting PTE to guarantee safety, but there is a bug in the scheme.

    If a THP page is currently being migrated and another thread traps a
    fault on the same page it checks if the page is misplaced. If it is not,
    then pmd_numa is cleared. The problem is that it checks if the page is
    misplaced without holding the page lock meaning that the racing thread
    can be migrating the THP when the second thread clears the NUMA bit
    and faults a stale page.

    This patch checks if the page is potentially being migrated and, if so,
    stalls using lock_page before checking whether the page is misplaced.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Cc:
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-6-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
  • If another task handled a hinting fault in parallel then do not double
    account for it.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Cc:
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-5-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     

17 Oct, 2013

1 commit

  • Occasionally we hit the BUG_ON(pmd_trans_huge(*pmd)) at the end of
    __split_huge_page_pmd(): seen when doing madvise(,,MADV_DONTNEED).

    It's invalid: we don't always have down_write of mmap_sem there: a racing
    do_huge_pmd_wp_page() might have copied-on-write to another huge page
    before our split_huge_page() got the anon_vma lock.

    Forget the BUG_ON, just go back and try again if this happens.

    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Naoya Horiguchi
    Cc: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

09 Oct, 2013

4 commits

  • Adjust numa_scan_period in task_numa_placement, depending on how much
    useful work the numa code can do. The more local faults there are in a
    given scan window the longer the period (and hence the slower the scan rate)
    during the next window. If there are excessive shared faults then the scan
    period will decrease, with the amount of scaling depending on the ratio of
    shared to private faults. If the preferred node changes then the scan rate
    is reset to recheck if the task is properly placed.

    Signed-off-by: Rik van Riel
    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-59-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • And here's a little something to make sure not the whole world ends up
    in a single group.

    As while we don't migrate shared executable pages, we do scan/fault on
    them. And since everybody links to libc, everybody ends up in the same
    group.

    Suggested-by: Rik van Riel
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Link: http://lkml.kernel.org/r/1381141781-10992-47-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Change the per-page last fault tracking to use cpu,pid instead of
    nid,pid. This will allow us to try and look up the alternate task more
    easily. Note that even though it is the cpu that is stored in the page
    flags, the mpol_misplaced decision is still based on the node.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Link: http://lkml.kernel.org/r/1381141781-10992-43-git-send-email-mgorman@suse.de
    [ Fixed build failure on 32-bit systems. ]
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Ideally it would be possible to distinguish between NUMA hinting faults that
    are private to a task and those that are shared. If treated identically
    there is a risk that shared pages bounce between nodes depending on
    the order they are referenced by tasks. Ultimately what is desirable is
    that task private pages remain local to the task while shared pages are
    interleaved between sharing tasks running on different nodes to give good
    average performance. This is further complicated by THP as even
    applications that partition their data may not be partitioning on a huge
    page boundary.

    To start with, this patch assumes that multi-threaded or multi-process
    applications partition their data and that in general the private accesses
    are more important for cpu->memory locality in the general case. Also,
    no new infrastructure is required to treat private pages properly but
    interleaving for shared pages requires additional infrastructure.

    To detect private accesses the pid of the last accessing task is required
    but the storage requirements are high. This patch borrows heavily from
    Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking"
    to encode some bits from the last accessing task in the page flags as
    well as the node information. Collisions will occur but it is better than
    just depending on the node information. Node information is then used to
    determine if a page needs to migrate. The PID information is used to detect
    private/shared accesses. The preferred NUMA node is selected based on where
    the maximum number of approximately private faults were measured. Shared
    faults are not taken into consideration for a few reasons.

    First, if there are many tasks sharing the page then they'll all move
    towards the same node. The node will be compute overloaded and then
    scheduled away later only to bounce back again. Alternatively the shared
    tasks would just bounce around nodes because the fault information is
    effectively noise. Either way accounting for shared faults the same as
    private faults can result in lower performance overall.

    The second reason is based on a hypothetical workload that has a small
    number of very important, heavily accessed private pages but a large shared
    array. The shared array would dominate the number of faults and be selected
    as a preferred node even though it's the wrong decision.

    The third reason is that multiple threads in a process will race each
    other to fault the shared page making the fault information unreliable.

    Signed-off-by: Mel Gorman
    [ Fix compilation error when !NUMA_BALANCING. ]
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-30-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman