12 Sep, 2013

40 commits

  • A memory cgroup with (1) multiple threshold notifications and (2) at least
    one threshold >=2G was not reliable. Specifically the notifications would
    either not fire or would not fire in the proper order.

    The __mem_cgroup_threshold() signaling logic depends on keeping 64 bit
    thresholds in sorted order. mem_cgroup_usage_register_event() sorts them
    with compare_thresholds(), which returns the difference of two 64 bit
    thresholds as an int. If the difference is positive but has bit[31] set,
    then sort() treats the difference as negative and breaks sort order.

    This fix compares the two arbitrary 64 bit thresholds, returning the
    classic -1, 0, 1 result.
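
    A minimal sketch of the comparator change (not a verbatim copy of the
    kernel code), assuming the mem_cgroup_threshold layout used by memcg:

    static int compare_thresholds(const void *a, const void *b)
    {
            const struct mem_cgroup_threshold *_a = a;
            const struct mem_cgroup_threshold *_b = b;

            /* Broken: a positive 64 bit difference with bit[31] set looks
             * negative once truncated to int, confusing sort(). */
            /* return _a->threshold - _b->threshold; */

            /* Fixed: compare the values directly and return -1, 0, 1. */
            if (_a->threshold > _b->threshold)
                    return 1;
            if (_a->threshold < _b->threshold)
                    return -1;
            return 0;
    }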

    The test below sets two notifications (at 0x1000 and 0x81001000):
    cd /sys/fs/cgroup/memory
    mkdir x
    for x in 4096 2164264960; do
    cgroup_event_listener x/memory.usage_in_bytes $x | sed "s/^/$x listener:/" &
    done
    echo $$ > x/cgroup.procs
    anon_leaker 500M

    v3.11-rc7 fails to signal the 4096 event listener:
    Leaking...
    Done leaking pages.

    Patched v3.11-rc7 properly notifies:
    Leaking...
    4096 listener:2013:8:31:14:13:36
    Done leaking pages.

    The fixed bug is old. It appears to date back to the introduction of
    memcg threshold notifications in v2.6.34-rc1-116-g2e72b6347c94 ("memcg:
    implement memory thresholds").

    Signed-off-by: Greg Thelen
    Acked-by: Michal Hocko
    Acked-by: Kirill A. Shutemov
    Acked-by: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
  • Use the helper function instead of __GFP_ZERO.
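
    The summary does not name the call site; as a generic illustration of
    the idiom (struct foo and ptr are hypothetical names), the change swaps
    an explicit __GFP_ZERO for the zeroing helper:

    struct foo *ptr;

    /* before: zeroing via an explicit gfp flag */
    ptr = kmalloc(sizeof(*ptr), GFP_KERNEL | __GFP_ZERO);

    /* after: the kzalloc() helper expresses the same thing */
    ptr = kzalloc(sizeof(*ptr), GFP_KERNEL);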

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • pgoff is not used after the statement "pgoff = vma->vm_pgoff;", so the
    assignment is redundant.

    Signed-off-by: Yanchuan Nian
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yanchuan Nian
     
  • madvise_hwpoison() has two locals called "ret". Fix it all up.

    Cc: Wanpeng Li
    Cc: Naoya Horiguchi
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
    The return value set outside the for loop is always zero, which means
    madvise_hwpoison() reports success; however, that is not true when
    soft_offline_page() fails and returns an error value.
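
    A rough sketch of the intended fix, based on the description above
    (surrounding loop, printks and error paths omitted): propagate the
    soft_offline_page() error instead of falling through to the final
    return 0.

    if (bhv == MADV_SOFT_OFFLINE) {
            ret = soft_offline_page(page, MF_COUNT_INCREASED);
            if (ret)
                    return ret;     /* report the failure to the caller */
            continue;
    }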

    Signed-off-by: Wanpeng Li
    Reviewed-by: Naoya Horiguchi
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • Injecting memory failure for page 0x19d0 at 0xb77d2000
    MCE 0x19d0: non LRU page recovery: Ignored
    MCE: Software-unpoisoned page 0x19d0
    BUG: Bad page state in process bash pfn:019d0
    page:f3461a00 count:0 mapcount:0 mapping: (null) index:0x0
    page flags: 0x40000404(referenced|reserved)
    Modules linked in: nfsd auth_rpcgss i915 nfs_acl nfs lockd video drm_kms_helper drm bnep rfcomm sunrpc bluetooth psmouse parport_pc ppdev lp serio_raw fscache parport gpio_ich lpc_ich mac_hid i2c_algo_bit tpm_tis wmi usb_storage hid_generic usbhid hid e1000e firewire_ohci firewire_core ahci ptp libahci pps_core crc_itu_t
    CPU: 3 PID: 2123 Comm: bash Not tainted 3.11.0-rc6+ #12
    Hardware name: LENOVO 7034DD7/ , BIOS 9HKT47AUS 01//2012
    00000000 00000000 e9625ea0 c15ec49b f3461a00 e9625eb8 c15ea119 c17cbf18
    ef084314 000019d0 f3461a00 e9625ed8 c110dc8a f3461a00 00000001 00000000
    f3461a00 40000404 00000000 e9625ef8 c110dcc1 f3461a00 f3461a00 000019d0
    Call Trace:
    dump_stack+0x41/0x52
    bad_page+0xcf/0xeb
    free_pages_prepare+0x12a/0x140
    free_hot_cold_page+0x21/0x110
    __put_single_page+0x21/0x30
    put_page+0x25/0x40
    unpoison_memory+0x107/0x200
    hwpoison_unpoison+0x20/0x30
    simple_attr_write+0xb6/0xd0
    vfs_write+0xa0/0x1b0
    SyS_write+0x4f/0x90
    sysenter_do_call+0x12/0x22
    Disabling lock debugging due to kernel taint

    Testcase:

    #define _GNU_SOURCE
    #include <sys/mman.h>

    #define PAGES_TO_TEST 1
    #define PAGE_SIZE 4096

    int main(void)
    {
            char *mem;

            mem = mmap(NULL, PAGES_TO_TEST * PAGE_SIZE,
                       PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);

            if (madvise(mem, PAGES_TO_TEST * PAGE_SIZE, MADV_HWPOISON) == -1)
                    return -1;

            munmap(mem, PAGES_TO_TEST * PAGE_SIZE);

            return 0;
    }

    The default empty zero page holds one page reference, and
    madvise_hwpoison() takes another via get_user_pages_fast(). The hwpoison
    handler then drops one reference because the page is a non-LRU page, and
    unpoison_memory() releases the last reference and frees the empty zero
    page to the buddy system, which is not correct since the empty zero page
    has the PG_reserved flag set. This patch fixes it by never dropping the
    reference count of the empty zero page below 1.

    Signed-off-by: Wanpeng Li
    Reviewed-by: Naoya Horiguchi
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
    Hwpoison injection doesn't implement a read method for the
    corrupt-pfn/unpoison-pfn attributes:

    # cat /sys/kernel/debug/hwpoison/corrupt-pfn
    cat: /sys/kernel/debug/hwpoison/corrupt-pfn: Permission denied
    # cat /sys/kernel/debug/hwpoison/unpoison-pfn
    cat: /sys/kernel/debug/hwpoison/unpoison-pfn: Permission denied

    This patch changes the permission of corrupt-pfn/unpoison-pfn to 0200.
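
    Presumably this is a one-line mode change where the debugfs files are
    created, roughly as follows (the directory and fops names are
    illustrative):

    /* write-only: these attributes have no read method */
    debugfs_create_file("corrupt-pfn", 0200, hwpoison_dir,
                        NULL, &hwpoison_fops);
    debugfs_create_file("unpoison-pfn", 0200, hwpoison_dir,
                        NULL, &unpoison_fops);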

    Signed-off-by: Wanpeng Li
    Reviewed-by: Naoya Horiguchi
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
    madvise hwpoison injection will poison the read-only empty zero page if
    there is no write access before the poisoning. The empty zero page's
    reference count is increased for hwpoison; a subsequent attempt to
    poison the zero page returns immediately since the page already has
    PG_hwpoison set, yet the page reference count is still increased by
    get_user_pages_fast(). The unpoison path unpoisons the empty zero page
    and drops the reference count successfully the first time, but
    subsequent unpoisons of the empty zero page return immediately since the
    page has already been unpoisoned, without dropping the empty zero page's
    reference count.

    This patch fixes it by making madvise_hwpoison() put the page and return
    immediately (without calling memory_failure() or soft_offline_page())
    when the page is already hwpoisoned.
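
    A sketch of that change in madvise_hwpoison()'s per-page loop, following
    the description above (surrounding code omitted):

    if (PageHWPoison(page)) {
            /* already poisoned: drop the extra reference taken by
             * get_user_pages_fast() and bail out right away */
            put_page(page);
            return 0;
    }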

    Testcase:

    #define _GNU_SOURCE
    #include <sys/mman.h>

    #define PAGES_TO_TEST 3
    #define PAGE_SIZE 4096

    int main(void)
    {
            char *mem;

            mem = mmap(NULL, PAGES_TO_TEST * PAGE_SIZE,
                       PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);

            if (madvise(mem, PAGES_TO_TEST * PAGE_SIZE, MADV_HWPOISON) == -1)
                    return -1;

            munmap(mem, PAGES_TO_TEST * PAGE_SIZE);

            return 0;
    }

    Add printk to dump page reference count:

    [ 93.075959] Injecting memory failure for page 0x19d0 at 0xb77d8000
    [ 93.076207] MCE 0x19d0: non LRU page recovery: Ignored
    [ 93.076209] pfn 0x19d0, page count = 1 after memory failure
    [ 93.076220] Injecting memory failure for page 0x19d0 at 0xb77d9000
    [ 93.076221] MCE 0x19d0: already hardware poisoned
    [ 93.076222] pfn 0x19d0, page count = 2 after memory failure
    [ 93.076224] Injecting memory failure for page 0x19d0 at 0xb77da000
    [ 93.076224] MCE 0x19d0: already hardware poisoned
    [ 93.076225] pfn 0x19d0, page count = 3 after memory failure

    Signed-off-by: Wanpeng Li
    Suggested-by: Naoya Horiguchi
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
    Add '#' to the printk format in madvise_hwpoison() so the page number
    and address are printed with a 0x prefix.

    Before patch:

    [ 95.892866] Injecting memory failure for page 19d0 at b7786000
    [ 95.893151] MCE 0x19d0: non LRU page recovery: Ignored

    After patch:

    [ 95.892866] Injecting memory failure for page 0x19d0 at 0xb7786000
    [ 95.893151] MCE 0x19d0: non LRU page recovery: Ignored
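
    Presumably just the format specifiers switching from %lx to %#lx, along
    these lines:

    pr_info("Injecting memory failure for page %#lx at %#lx\n",
            page_to_pfn(p), start);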

    Signed-off-by: Wanpeng Li
    Reviewed-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
    Drop the forward declaration of __soft_offline_page().

    Signed-off-by: Wanpeng Li
    Reviewed-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
    Setting a pageblock's migration type takes zone->lock, which is heavily
    contended, to avoid races. However, soft offline currently sets the
    pageblock migration type twice while getting the page if the page is in
    use, is not a hugetlbfs page and is not on the LRU list. There is no
    need to set the pageblock migration type and take the heavily contended
    zone->lock again if the first round of getting the page has already set
    the pageblock to the right migration type.

    The trick here is that the migration type is MIGRATE_ISOLATE. Only two
    other parts of the kernel can change MIGRATE_ISOLATE besides hwpoison.
    One is memory hotplug, but we hold lock_memory_hotplug(), which avoids
    the race. The second is CMA, which unmovable page allocation requests
    can't fall back to. So it's safe here.

    Signed-off-by: Wanpeng Li
    Cc: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
    Replace atomic_long_sub() with atomic_long_dec() since the page is a
    normal page rather than a hugetlbfs page or THP.
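
    In other words, roughly:

    /* before */
    atomic_long_sub(1, &num_poisoned_pages);
    /* after: same effect, clearer for a single normal page */
    atomic_long_dec(&num_poisoned_pages);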

    Signed-off-by: Wanpeng Li
    Reviewed-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
    There is a race between hwpoison and unpoison: memory_failure() sets the
    page hwpoisoned and increases num_poisoned_pages without holding the
    page lock, and only one page is accounted in num_poisoned_pages even for
    a thp. However, unpoison can run before memory_failure() takes the page
    lock and splits the transparent hugepage, so unpoison decreases
    num_poisoned_pages by 1 << compound_order, since memory_failure() has
    not yet split the transparent hugepage with the page lock held. That
    means we account one page for hwpoison but 1 << compound_order pages for
    unpoison. This patch fixes it by inserting a PageTransHuge check before
    TestClearPageHWPoison; in that case unpoison fails without clearing
    PageHWPoison or decreasing num_poisoned_pages.

    A (memory_failure)                 B (unpoison_memory)

    TestSetPageHWPoison(p);
    if (PageHuge(p))
        nr_pages = 1 << compound_order(hpage);
    else
        nr_pages = 1;
    atomic_long_add(nr_pages, &num_poisoned_pages);
                                       nr_pages = 1 << compound_trans_order(page);
                                       if (TestClearPageHWPoison(p))
                                           atomic_long_sub(nr_pages, &num_poisoned_pages);
    lock page
    if (!PageHWPoison(p))
        unlock page and return
    hwpoison_user_mappings
    if (PageTransHuge(hpage))
        split_huge_page(hpage);
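
    A sketch of the check added to unpoison_memory(), per the description
    above:

    /*
     * A thp seen here is still being handled by memory_failure(), which
     * has not yet split it under the page lock; yield to it rather than
     * clearing PageHWPoison and mis-accounting 1 << compound_order pages.
     */
    if (!PageHuge(page) && PageTransHuge(page))
            return;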

    Signed-off-by: Wanpeng Li
    Suggested-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
    The compound lock was introduced by commit e9da73d67 ("thp:
    compound_lock") to serialize put_page() against
    __split_huge_page_refcount(). In addition, transparent hugepages are
    split in the hwpoison handler so that only one subpage is poisoned.
    There is no need to hold the compound lock for a hugetlbfs page. This
    patch replaces compound_trans_order() with compound_order() in the
    places where the page is known to be a hugetlbfs page.

    Signed-off-by: Wanpeng Li
    Reviewed-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
    memory_failure() stores the page flags of the error page before
    unmapping, and (only) if the first check against the page flags at that
    time decides the error page is unknown, it does a second check against
    the stored page flags. This is because memory_failure() unmaps the error
    page before calling page_action(), and the unmapping changes the page
    state; in particular page_remove_rmap() (called from try_to_unmap_one())
    clears PG_mlocked, so page_action() can't catch mlocked pages after
    that.

    However, memory_failure() can't handle memory errors on dirty mlocked
    pages correctly: try_to_unmap_one() moves the dirty bit from the pte to
    the physical page, and the second check loses it because it looks at the
    stored page flags. This patch fixes it by restoring the PG_dirty flag in
    the stored page flags when the page is dirty.
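
    A sketch of the fix in memory_failure(), assuming the saved flags live
    in a local such as page_flags (the variable name is illustrative):

    /* try_to_unmap() may have moved the dirty bit from the pte to the
     * struct page; reflect that in the flags saved before unmapping so
     * the second page_state lookup sees a dirty page. */
    if (PageDirty(p))
            page_flags |= (1UL << PG_dirty);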

    Testcase:

    #define _GNU_SOURCE
    #include <sys/mman.h>

    #define PAGES_TO_TEST 2
    #define PAGE_SIZE 4096

    int main(void)
    {
            char *mem;
            int i;

            mem = mmap(NULL, PAGES_TO_TEST * PAGE_SIZE,
                       PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, 0, 0);

            for (i = 0; i < PAGES_TO_TEST; i++)
                    mem[i * PAGE_SIZE] = 'a';

            if (madvise(mem, PAGES_TO_TEST * PAGE_SIZE, MADV_HWPOISON) == -1)
                    return -1;

            return 0;
    }

    Before patch:

    [ 912.839247] Injecting memory failure for page 7dfb8 at 7f6b4e37b000
    [ 912.839257] MCE 0x7dfb8: clean mlocked LRU page recovery: Recovered
    [ 912.845550] MCE 0x7dfb8: clean mlocked LRU page still referenced by 1 users
    [ 912.852586] Injecting memory failure for page 7e6aa at 7f6b4e37c000
    [ 912.852594] MCE 0x7e6aa: clean mlocked LRU page recovery: Recovered
    [ 912.858936] MCE 0x7e6aa: clean mlocked LRU page still referenced by 1 users

    After patch:

    [ 163.590225] Injecting memory failure for page 91bc2f at 7f9f5b0e5000
    [ 163.590264] MCE 0x91bc2f: dirty mlocked LRU page recovery: Recovered
    [ 163.596680] MCE 0x91bc2f: dirty mlocked LRU page still referenced by 1 users
    [ 163.603831] Injecting memory failure for page 91cdd3 at 7f9f5b0e6000
    [ 163.603852] MCE 0x91cdd3: dirty mlocked LRU page recovery: Recovered
    [ 163.610305] MCE 0x91cdd3: dirty mlocked LRU page still referenced by 1 users

    Signed-off-by: Wanpeng Li
    Reviewed-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
    Soft offline code expects that MIGRATE_ISOLATE is set on the target page
    only during soft offlining work. But currently it doesn't work as
    expected when get_any_page() fails and returns a negative value. As a
    result, end users can be left with unexpectedly isolated pages. This
    patch just fixes it.

    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Wanpeng Li
    Cc: Andi Kleen
    Cc: Fengguang Wu
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
    Set _mapcount to PAGE_BUDDY_MAPCOUNT_VALUE to mark the page as buddy,
    rather than using the magic number -2.
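
    For reference, marking a page as buddy amounts to something like the
    following (a sketch; the summary does not name the exact call site
    being cleaned up):

    /* before: magic number */
    atomic_set(&page->_mapcount, -2);
    /* after: self-documenting constant */
    atomic_set(&page->_mapcount, PAGE_BUDDY_MAPCOUNT_VALUE);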

    Signed-off-by: Wang Sheng-Hui
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Sheng-Hui
     
    The feature prevents untrusted filesystems (e.g. FUSE mounts created by
    unprivileged users) from growing a large number of dirty pages before
    throttling. For such filesystems balance_dirty_pages() always checks bdi
    counters against bdi limits, i.e. even if the global "nr_dirty" is under
    "freerun", it is not allowed to skip the bdi checks. The only use case
    for now is fuse: it sets bdi max_ratio to 1% by default and system
    administrators are supposed to expect that this limit won't be exceeded.

    The feature is on if a BDI is marked by BDI_CAP_STRICTLIMIT flag. A
    filesystem may set the flag when it initializes its BDI.
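
    A filesystem opting in would do something like the following when
    setting up its backing_dev_info (a sketch, not the exact fuse hunk):

    /* mark this BDI as strictly limited: balance_dirty_pages() will
     * always enforce the per-bdi limits for it */
    bdi->capabilities |= BDI_CAP_STRICTLIMIT;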

    The problematic scenario comes from the fact that nobody pays attention to
    the NR_WRITEBACK_TEMP counter (i.e. number of pages under fuse
    writeback). The implementation of fuse writeback releases the original
    page (by calling end_page_writeback) almost immediately. A fuse request
    queued for real processing bears a copy of the original page. Hence, if
    the userspace fuse daemon doesn't finalize write requests in a timely
    manner, an aggressive mmap writer can pollute virtually all memory with
    those temporary fuse page copies. They are carefully accounted in
    NR_WRITEBACK_TEMP, but nobody cares.

    To make further explanations shorter, let me use "NR_WRITEBACK_TEMP
    problem" as a shortcut for "the possibility of uncontrolled growth in
    the amount of RAM consumed by temporary pages allocated by the kernel
    fuse code to process writeback".

    The problem was very easy to reproduce. There is a trivial example
    filesystem implementation in fuse userspace distribution: fusexmp_fh.c. I
    added "sleep(1);" to the write methods, then recompiled and mounted it.
    Then I created a huge file on the mount point and ran a simple program
    which mmap-ed the file to a memory region, then wrote data to the
    region. An hour later I observed almost all RAM consumed by fuse
    writeback. Since
    then some unrelated changes in kernel fuse made it more difficult to
    reproduce, but it is still possible now.

    Putting this theoretical happens-in-the-lab thing aside, there is another
    thing that really hurts real-world (FUSE) users: the write-through page
    cache policy FUSE currently uses. I.e. when handling write(2), kernel
    fuse populates the page cache and flushes user data to the server
    synchronously. This is excessively suboptimal. Pavel Emelyanov's patches
    ("writeback cache policy") solve the problem, but they also make resolving
    NR_WRITEBACK_TEMP problem absolutely necessary. Otherwise, simply copying
    a huge file to a fuse mount would result in memory starvation. Miklos,
    the maintainer of FUSE, believes the strictlimit feature is the way to go.

    And eventually, putting FUSE topics aside, there is one more use case
    for the strictlimit feature. Using a slow USB stick (mass storage) in a
    machine with a huge amount of RAM installed is a well-known pain. Let's
    do a simple computation. Assuming 64GB of RAM installed, the existing
    implementation of balance_dirty_pages will start throttling only after
    9.6GB of RAM becomes dirty (freerun == 15% of total RAM). So the command
    "cp 9GB_file /media/my-usb-storage/" may return in a few seconds, but
    the subsequent "umount /media/my-usb-storage/" will take more than two
    hours if the effective throughput of the storage is, say, 1MB/sec.

    After inclusion of the strictlimit feature, it will be trivial to add a
    knob (e.g. /sys/devices/virtual/bdi/x:y/strictlimit) to enable it on
    demand, manually or via a udev rule. Maybe I'm wrong, but it seems quite
    natural to want to limit the amount of dirty memory for devices we do
    not fully trust (in the sense of sustainable throughput).

    [akpm@linux-foundation.org: fix warning in page-writeback.c]
    Signed-off-by: Maxim Patlasov
    Cc: Jan Kara
    Cc: Miklos Szeredi
    Cc: Wu Fengguang
    Cc: Pavel Emelyanov
    Cc: James Bottomley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Maxim Patlasov
     
    '*lenp' may be less than "sizeof(kbuf)", so we must check this before
    the next copy_to_user().

    pdflush_proc_obsolete() is called by the sysctl whose 'procname' is
    "nr_pdflush_threads"; if the user passes a buffer shorter than
    "sizeof(kbuf)", it will cause a problem.

    Signed-off-by: Chen Gang
    Reviewed-by: Jan Kara
    Cc: Tejun Heo
    Cc: Jeff Moyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
    In alloc_new_pmd(), if pud_alloc() succeeds but pmd_alloc() fails, avoid
    leaking `pud'.
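
    A sketch of the error path, following the description:

    pud = pud_alloc(mm, pgd, addr);
    if (!pud)
            return NULL;

    pmd = pmd_alloc(mm, pud, addr);
    if (!pmd) {
            pud_free(mm, pud);      /* don't leak the freshly allocated pud */
            return NULL;
    }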

    Signed-off-by: Chen Gang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
    Use the wrapper function get_vm_area_size() to calculate the size of a
    vm area.
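
    The helper simply hides the guard-page arithmetic, so call sites read as
    follows (a sketch):

    /* before: open-coded, easy to forget the guard page */
    size = area->size - PAGE_SIZE;
    /* after */
    size = get_vm_area_size(area);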

    Signed-off-by: Wanpeng Li
    Cc: Dave Hansen
    Cc: Rik van Riel
    Cc: Fengguang Wu
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: Yasuaki Ishimatsu
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Jiri Kosina
    Cc: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
    After commit 9bdac9142407 ("sparsemem: Put mem map for one node
    together."), vmemmap for one node is allocated together; the logic is
    similar to that of memory allocation for pageblock flags. This patch
    introduces alloc_usemap_and_memmap() to extract the common logic of
    memory allocation for pageblock flags and vmemmap.

    Signed-off-by: Wanpeng Li
    Cc: Dave Hansen
    Cc: Rik van Riel
    Cc: Fengguang Wu
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: Yasuaki Ishimatsu
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Jiri Kosina
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
    This patch is based on KOSAKI's work, with a little more description
    added; please refer to https://lkml.org/lkml/2012/6/14/74.

    I found that the system can currently enter a state where there are
    lots of free pages in a zone but only order-0 and order-1 pages, which
    means the zone is heavily fragmented; a high-order allocation can then
    make the direct reclaim path stall for a long time (e.g. 60 seconds),
    especially in a no-swap, no-compaction environment. This problem
    happened on v3.4, but the issue seems to still exist in the current
    tree. The reason is that do_try_to_free_pages() enters a live lock:

    kswapd will go to sleep if the zones have been fully scanned and are
    still not balanced, as kswapd thinks there's little point in trying all
    over again and wants to avoid an infinite loop. Instead it changes the
    order from high-order to order-0, because kswapd considers order-0 the
    most important; see commit 73ce02e9 for the details. If the watermarks
    are ok, kswapd will go back to sleep and may leave
    zone->all_unreclaimable == 0. It assumes high-order users can still
    perform direct reclaim if they wish.

    Direct reclaim continues to reclaim for a high order which is not a
    COSTLY_ORDER, without invoking the oom-killer, until kswapd turns on
    zone->all_unreclaimable. This is to avoid a too-early oom-kill. So
    direct reclaim depends on kswapd to break this loop.

    In the worst case, direct reclaim may continue to reclaim pages forever
    while kswapd sleeps forever, until someone like a watchdog detects it
    and finally kills the process. As described in:
    http://thread.gmane.org/gmane.linux.kernel.mm/103737

    We can't turn on zone->all_unreclaimable from the direct reclaim path,
    because the direct reclaim path doesn't take any lock, so that approach
    would be racy. Thus this patch removes the zone->all_unreclaimable field
    completely and recalculates the zone's reclaimable state every time.

    Note: we can't take the approach of having direct reclaim look at
    zone->pages_scanned directly while kswapd continues to use
    zone->all_unreclaimable, because that is racy. Commit 929bea7c71
    ("vmscan: all_unreclaimable() use zone->all_unreclaimable as a name")
    describes the details.
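
    The recalculated state boils down to a small predicate, roughly along
    these lines:

    bool zone_reclaimable(struct zone *zone)
    {
            /* give up once 6x the reclaimable pages have been scanned
             * without making progress */
            return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
    }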

    [akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()]
    Cc: Aaditya Kumar
    Cc: Ying Han
    Cc: Nick Piggin
    Acked-by: Rik van Riel
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Bob Liu
    Cc: Neil Zhang
    Cc: Russell King - ARM Linux
    Reviewed-by: Michal Hocko
    Acked-by: Minchan Kim
    Acked-by: Johannes Weiner
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Lisa Du
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lisa Du
     
  • Currently munlock_vma_pages_range() calls follow_page_mask() to obtain
    each individual struct page. This entails repeated full page table
    translations and page table lock taken for each page separately.

    This patch avoids the costly follow_page_mask() where possible, by
    iterating over the ptes within a single pmd under a single page table
    lock. The first pte is obtained by get_locked_pte() for the non-THP page
    acquired by the initial follow_page_mask(). The rest of the on-stack
    pagevec for munlock is filled up using the pte walk, as long as
    pte_present() and vm_normal_page() are sufficient to obtain the struct
    page.

    After this patch, a 14% speedup was measured for munlocking a 56GB large
    memory area with THP disabled.
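
    The pte walk relies on the usual pte_offset_map_lock() idiom; a
    simplified sketch of such a fill loop (pmd, vma, addr and end come from
    the caller, and the pinning and pagevec bookkeeping are omitted):

    pte_t *pte, *start_pte;
    spinlock_t *ptl;

    start_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
    for (; addr < end; pte++, addr += PAGE_SIZE) {
            struct page *page;

            if (!pte_present(*pte))
                    break;
            page = vm_normal_page(vma, addr, *pte);
            if (!page || PageTransCompound(page))
                    break;
            /* take a reference and add the page to the on-stack pagevec */
    }
    pte_unmap_unlock(start_pte, ptl);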

    Signed-off-by: Vlastimil Babka
    Cc: Jörn Engel
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    The performance of the fast path in munlock_vma_range() can be further
    improved by avoiding the atomic ops of a redundant get_page()/put_page()
    pair.

    When calling get_page() during page isolation, we already have the pin
    from follow_page_mask(). This pin will then be returned by
    __pagevec_lru_add(), after which we do not reference the pages anymore.

    After this patch, an 8% speedup was measured for munlocking a 56GB large
    memory area with THP disabled.

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Jörn Engel
    Acked-by: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • After introducing batching by pagevecs into munlock_vma_range(), we can
    further improve performance by bypassing the copying into per-cpu pagevec
    and the get_page/put_page pair associated with that. Instead we perform
    LRU putback directly from our pagevec. However, this is possible only for
    single-mapped pages that are evictable after munlock. Unevictable pages
    require rechecking after being put on the unevictable list, so for those
    we fall back to putback_lru_page(), which handles that.

    After this patch, a 13% speedup was measured for munlocking a 56GB large
    memory area with THP disabled.

    [akpm@linux-foundation.org: clarify comment]
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Jörn Engel
    Acked-by: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    Building on the previous patch, which introduced batched isolation in
    munlock_vma_range(), we can also batch the updates of the NR_MLOCK page
    stats. After the whole pagevec is processed for page isolation, the
    stats are updated only once with the number of successful isolations.
    There were, however, no measurable performance gains.

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Jörn Engel
    Acked-by: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Currently, munlock_vma_range() calls munlock_vma_page on each page in a
    loop, which results in repeated taking and releasing of the lru_lock
    spinlock for isolating pages one by one. This patch batches the munlock
    operations using an on-stack pagevec, so that isolation is done under
    single lru_lock. For THP pages, the old behavior is preserved as they
    might be split while putting them into the pagevec. After this patch, a
    9% speedup was measured for munlocking a 56GB large memory area with THP
    disabled.

    A new function __munlock_pagevec() is introduced that takes a pagevec
    and:
    1) clears PageMlocked and isolates all pages under lru_lock. Zone page
       stats can also be updated using the variant which assumes disabled
       interrupts.
    2) finishes the munlock and lru putback on all pages under their
       lock_page. Note that previously, lock_page also covered the
       PageMlocked clearing and page isolation, but it is not needed for
       those operations.

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Jörn Engel
    Acked-by: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • In munlock_vma_range(), lru_add_drain() is currently called in a loop
    before each munlock_vma_page() call.

    This is suboptimal for performance when munlocking many pages. The
    benefits of per-cpu pagevec for batching the LRU putback are removed since
    the pagevec only holds at most one page from the previous loop's
    iteration.

    The lru_add_drain() call also does not serve any purpose for correctness
    - it does not even drain the pagevecs of all CPUs. The munlock code already
    expects and handles situations where a page cannot be isolated from the
    LRU (e.g. because it is on some per-cpu pagevec).

    The history of the (uncommented) call also suggests that it appeared
    there as an oversight rather than intentionally. Before commit ff6a6da6
    ("mm: accelerate munlock() treatment of THP pages") the call happened
    only once upon entering the function. That commit moved the call into
    the while loop. So while the other changes in the commit improved munlock
    performance for THP pages, it introduced the abovementioned suboptimal
    per-cpu pagevec usage.

    Further in history, before commit 408e82b7 ("mm: munlock use
    follow_page"), munlock_vma_pages_range() was just a wrapper around
    __mlock_vma_pages_range which performed both mlock and munlock depending
    on a flag. However, before ba470de4 ("mmap: handle mlocked pages during
    map, remap, unmap") the function handled only mlock, not munlock. The
    lru_add_drain call thus comes from the implementation in commit b291f000
    ("mlock: mlocked pages are unevictable") and was intended only for
    mlocking, not munlocking. The original intention of draining the LRU
    pagevec at mlock time was to ensure the pages were on the LRU before the
    lock operation so that they could be placed on the unevictable list
    immediately. There is very little motivation to do the same in the
    munlock path, particularly for every single page.

    This patch therefore removes the call completely. After removing the
    call, a 10% speedup was measured for munlock() of a 56GB large memory area
    with THP disabled.

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Jörn Engel
    Acked-by: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The goal of this patch series is to improve performance of munlock() of
    large mlocked memory areas on systems without THP. This is motivated by
    reported very long times of crash recovery of processes with such areas,
    where munlock() can take several seconds. See
    http://lwn.net/Articles/548108/

    The work was driven by a simple benchmark (to be included in mmtests) that
    mmaps() e.g. 56GB with MAP_LOCKED | MAP_POPULATE and measures the time of
    munlock(). Profiling was performed by attaching operf --pid to the
    process and sending a signal to trigger the munlock() part and then notify
    back the monitoring wrapper to stop operf, so that only munlock() appears
    in the profile.

    The profiles have shown that CPU time is spent mostly by atomic operations
    and repeated locking per single pages. This series aims to reduce both, starting
    from simpler to more complex changes.

    Patch 1 performs a simple cleanup in putback_lru_page() so that the page
    lru base type is not determined unless it is actually needed.

    Patch 2 removes an unnecessary call to lru_add_drain() which drains the per-cpu
    pagevec after each munlocked page is put there.

    Patch 3 changes munlock_vma_range() to use an on-stack pagevec for isolating
    multiple non-THP pages under a single lru_lock instead of locking and
    processing each page separately.

    Patch 4 changes the NR_MLOCK accounting to be called only once per the pvec
    introduced by previous patch.

    Patch 5 uses the introduced pagevec to batch also the work of putback_lru_page
    when possible, bypassing the per-cpu pvec and associated overhead.

    Patch 6 removes a redundant get_page/put_page pair which saves costly atomic
    operations.

    Patch 7 avoids calling follow_page_mask() on each individual page, and obtains
    multiple page references under a single page table lock where possible.

    Measurements were made using 3.11-rc3 as a baseline. The first set of
    measurements shows the possibly ideal conditions where batching should
    help the most. All memory is allocated from a single NUMA node and THP is
    disabled.

    timedmunlock
    3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3
    0 1 2 3 4 5 6 7
    Elapsed min 3.38 ( 0.00%) 3.39 ( -0.13%) 3.00 ( 11.33%) 2.70 ( 20.20%) 2.67 ( 21.11%) 2.37 ( 29.88%) 2.20 ( 34.91%) 1.91 ( 43.59%)
    Elapsed mean 3.39 ( 0.00%) 3.40 ( -0.23%) 3.01 ( 11.33%) 2.70 ( 20.26%) 2.67 ( 21.21%) 2.38 ( 29.88%) 2.21 ( 34.93%) 1.92 ( 43.46%)
    Elapsed stddev 0.01 ( 0.00%) 0.01 (-43.09%) 0.01 ( 15.42%) 0.01 ( 23.42%) 0.00 ( 89.78%) 0.01 ( -7.15%) 0.00 ( 76.69%) 0.02 (-91.77%)
    Elapsed max 3.41 ( 0.00%) 3.43 ( -0.52%) 3.03 ( 11.29%) 2.72 ( 20.16%) 2.67 ( 21.63%) 2.40 ( 29.50%) 2.21 ( 35.21%) 1.96 ( 42.39%)
    Elapsed range 0.03 ( 0.00%) 0.04 (-51.16%) 0.02 ( 6.27%) 0.02 ( 14.67%) 0.00 ( 88.90%) 0.03 (-19.18%) 0.01 ( 73.70%) 0.06 (-113.35%)

    The second set of measurements simulates the worst possible conditions for
    batching by using numactl --interleave, so that there is in fact only one
    page per pagevec. Even in this case the series seems to improve
    performance thanks to reduced atomic operations and removal of
    lru_add_drain().

    timedmunlock
    3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3
    0 1 2 3 4 5 6 7
    Elapsed min 4.00 ( 0.00%) 4.04 ( -0.93%) 3.87 ( 3.37%) 3.72 ( 6.94%) 3.81 ( 4.72%) 3.69 ( 7.82%) 3.64 ( 8.92%) 3.41 ( 14.81%)
    Elapsed mean 4.17 ( 0.00%) 4.15 ( 0.51%) 4.03 ( 3.49%) 3.89 ( 6.84%) 3.86 ( 7.48%) 3.89 ( 6.69%) 3.70 ( 11.27%) 3.48 ( 16.59%)
    Elapsed stddev 0.16 ( 0.00%) 0.08 ( 50.76%) 0.10 ( 41.58%) 0.16 ( 4.59%) 0.05 ( 72.38%) 0.19 (-12.91%) 0.05 ( 68.09%) 0.06 ( 66.03%)
    Elapsed max 4.34 ( 0.00%) 4.32 ( 0.56%) 4.19 ( 3.62%) 4.12 ( 5.15%) 3.91 ( 9.88%) 4.12 ( 5.25%) 3.80 ( 12.58%) 3.56 ( 18.08%)
    Elapsed range 0.34 ( 0.00%) 0.28 ( 17.91%) 0.32 ( 6.45%) 0.40 (-15.73%) 0.10 ( 70.06%) 0.43 (-24.84%) 0.15 ( 55.32%) 0.15 ( 56.16%)

    For completeness, a third set of measurements shows the situation where
    THP is enabled and allocations are again done on a single NUMA node. Here
    munlock() is already very fast thanks to huge pages, and this series does
    not compromise that performance. It seems that the removal of call to
    lru_add_drain() still helps a bit.

    timedmunlock
    3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3 3.11-rc3
    0 1 2 3 4 5 6 7
    Elapsed min 0.01 ( 0.00%) 0.01 ( -0.11%) 0.01 ( 6.59%) 0.01 ( 5.41%) 0.01 ( 5.45%) 0.01 ( 5.03%) 0.01 ( 6.08%) 0.01 ( 5.20%)
    Elapsed mean 0.01 ( 0.00%) 0.01 ( -0.27%) 0.01 ( 6.39%) 0.01 ( 5.30%) 0.01 ( 5.32%) 0.01 ( 5.03%) 0.01 ( 5.97%) 0.01 ( 5.22%)
    Elapsed stddev 0.00 ( 0.00%) 0.00 ( -9.59%) 0.00 ( 10.77%) 0.00 ( 3.24%) 0.00 ( 24.42%) 0.00 ( 31.86%) 0.00 ( -7.46%) 0.00 ( 6.11%)
    Elapsed max 0.01 ( 0.00%) 0.01 ( -0.01%) 0.01 ( 6.83%) 0.01 ( 5.42%) 0.01 ( 5.79%) 0.01 ( 5.53%) 0.01 ( 6.08%) 0.01 ( 5.26%)
    Elapsed range 0.00 ( 0.00%) 0.00 ( 7.30%) 0.00 ( 24.38%) 0.00 ( 6.10%) 0.00 ( 30.79%) 0.00 ( 42.52%) 0.00 ( 6.11%) 0.00 ( 10.07%)

    This patch (of 7):

    In putback_lru_page(), since commit c53954a092 ("mm: remove lru
    parameter from __lru_cache_add and lru_cache_add_lru") it is no longer
    necessary to determine the lru list via page_lru_base_type().

    This patch replaces it with a simple flag, is_unevictable, which records
    whether the page was put on the unevictable list. This is the only
    information that matters in subsequent tests.

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Jörn Engel
    Acked-by: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    Pavel reported that if a vma area gets unmapped and then mapped (or
    expanded) in place, the soft dirty tracker won't be able to recognize
    this situation, since it works at the pte level and the ptes get zapped
    on unmap, losing the soft dirty bit of course.

    So to resolve this situation we need to track actions at the vma level,
    which is where the VM_SOFTDIRTY flag comes in. When a new vma area is
    created (or an old one expanded) we set this bit, and keep it there
    until the application asks for the soft dirty bit to be cleared.

    Thus when a user space application tracks memory changes it can now
    detect whether a vma area has been renewed.

    Reported-by: Pavel Emelyanov
    Signed-off-by: Cyrill Gorcunov
    Cc: Andy Lutomirski
    Cc: Matt Mackall
    Cc: Xiao Guangrong
    Cc: Marcelo Tosatti
    Cc: KOSAKI Motohiro
    Cc: Stephen Rothwell
    Cc: Peter Zijlstra
    Cc: "Aneesh Kumar K.V"
    Cc: Rob Landley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • cpuset_zone_allowed is changed to cpuset_zone_allowed_softwall and the
    comment is moved to __cpuset_node_allowed_softwall. So fix this comment.

    Signed-off-by: SeungHun Lee
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    SeungHun Lee
     
  • I am working with a tool that simulates oracle database I/O workload.
    This tool (orion, to be specific) allocates hugetlbfs pages using
    shmget() with the SHM_HUGETLB flag. It then
    does aio into these pages from flash disks using various common block
    sizes used by database. I am looking at performance with two of the most
    common block sizes - 1M and 64K. aio performance with these two block
    sizes plunged after Transparent HugePages was introduced in the kernel.
    Here are performance numbers:

                 pre-THP      2.6.39       3.11-rc5
    1M read      8384 MB/s    5629 MB/s    6501 MB/s
    64K read     7867 MB/s    4576 MB/s    4251 MB/s

    I have narrowed the performance impact down to the overheads introduced by
    THP in __get_page_tail() and put_compound_page() routines. perf top shows
    >40% of cycles being spent in these two routines. Every time direct I/O
    to hugetlbfs pages starts, kernel calls get_page() to grab a reference to
    the pages and calls put_page() when I/O completes to put the reference
    away. THP introduced significant amount of locking overhead to get_page()
    and put_page() when dealing with compound pages because hugepages can be
    split underneath get_page() and put_page(). It added this overhead
    irrespective of whether it is dealing with hugetlbfs pages or transparent
    hugepages. This resulted in 20%-45% drop in aio performance when using
    hugetlbfs pages.

    Since hugetlbfs pages can not be split, there is no reason to go through
    all the locking overhead for these pages from what I can see. I added
    code to __get_page_tail() and put_compound_page() to bypass all the
    locking code when working with hugetlbfs pages. This improved performance
    significantly. Performance numbers with this patch:

                 pre-THP      3.11-rc5     3.11-rc5 + Patch
    1M read      8384 MB/s    6501 MB/s    8371 MB/s
    64K read     7867 MB/s    4251 MB/s    6510 MB/s

    Performance with 64K read is still lower than what it was before THP, but
    still a 53% improvement. It does mean there is more work to be done but I
    will take a 53% improvement for now.

    Please take a look at the following patch and let me know if it looks
    reasonable.

    [akpm@linux-foundation.org: tweak comments]
    Signed-off-by: Khalid Aziz
    Cc: Pravin B Shelar
    Cc: Christoph Lameter
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Andi Kleen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Khalid Aziz
     
  • If kswapd was reclaiming for a high order and resets it to 0 due to
    fragmentation it will still call compact_pgdat. For the most part, this
    will fail a compaction_suitable() test and not compact but it is
    unnecessarily sloppy. It could be fixed in the caller but fix it in the
    API instead.

    [dhillf@gmail.com: pointed out that it was a potential problem]
    Signed-off-by: Mel Gorman
    Cc: Hillf Danton
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    The memcg_cache_params structure contains the common part and a union,
    which represents two different types of data: one for root caches and
    another for child caches.

    The size of the child data is fixed. The size of the memcg_caches array
    is calculated at runtime.

    Currently the size of memcg_cache_params for root caches is calculated
    incorrectly, because it includes the size of parameters for child caches.

    ssize_t size = memcg_caches_array_size(num_groups);
    size *= sizeof(void *);

    size += sizeof(struct memcg_cache_params);
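
    The root-cache allocation should count only the shared head of the
    structure plus the pointer array; a sketch of the corrected calculation,
    assuming memcg_caches is the per-memcg pointer array at the end of the
    structure:

    size = memcg_caches_array_size(num_groups);
    size *= sizeof(void *);
    /* only the common head, not the child-cache part of the union */
    size += offsetof(struct memcg_cache_params, memcg_caches);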

    v2: Fix a typo in calculations

    Signed-off-by: Andrey Vagin
    Cc: Glauber Costa
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Vagin
     
    Currently, early_pfn_to_nid() on architectures that support memblock
    goes over memblock.memory one entry at a time, so it takes too many
    tries near the end.

    We can use the existing memblock_search() to find the node id for a
    given pfn, which can save some time on bigger systems that have many
    entries in the memblock.memory array.

    Here are the timing differences for several machines. In each case with
    the patch less time was spent in __early_pfn_to_nid().

                            3.11-rc5    with patch    difference (%)
                            --------    ----------    --------------
    UV1: 256 nodes  9TB:      411.66        402.47     -9.19 (2.23%)
    UV2: 255 nodes 16TB:     1141.02       1138.12     -2.90 (0.25%)
    UV2:  64 nodes  2TB:      128.15        126.53     -1.62 (1.26%)
    UV2:  32 nodes  2TB:      121.87        121.07     -0.80 (0.66%)
    Time in seconds.

    Signed-off-by: Yinghai Lu
    Cc: Tejun Heo
    Acked-by: Russ Anderson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     
    new_vma_page() is called only by page migration initiated from
    do_mbind(), where pages to be migrated are queued onto a pagelist by
    queue_pages_range(). queue_pages_range() confirms that a queued page
    belongs to some vma, so the !vma case is not supposed to happen. This
    patch adds a BUG_ON() to catch this unexpected case.

    Signed-off-by: Naoya Horiguchi
    Reported-by: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
    The function check_range() (and its family) is not well named, because
    it does not only check something, but also moves pages from list to list
    in order to migrate them. So queue_pages_*range is a more suitable name.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Wanpeng Li
    Cc: Hillf Danton
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
    Now that hugepage migration is enabled, although restricted to pmd-based
    hugepages for now (due to lack of testing), we should allocate
    migratable hugepages from ZONE_MOVABLE if possible.

    This patch makes the GFP flags used in hugepage allocation dependent on
    migration support, not only on the value of hugepages_treat_as_movable.
    It causes no behavioral change on architectures which do not support
    hugepage migration.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Reviewed-by: Wanpeng Li
    Cc: Hillf Danton
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
    Currently hugepage migration works well only for pmd-based hugepages
    (mainly due to lack of testing), so we had better not enable migration
    of other levels of hugepages until we are ready for it.

    Some users of hugepage migration (mbind, move_pages, and migrate_pages)
    do a page table walk and check pud/pmd_huge() there, so they are safe.
    But the other users (soft offline and memory hot-remove) don't do this,
    so without this patch they can try to migrate unexpected types of
    hugepages.

    To prevent this, we introduce hugepage_migration_support() as an
    architecture-dependent check of whether hugepages are implemented on a
    pmd basis or not. On some architectures multiple sizes of hugepages are
    available, so hugepage_migration_support() also checks the hugepage
    size.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Hillf Danton
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi