26 Sep, 2014

17 commits

  • commit 5dab29113ca56335c78be3f98bf5ddf2ef8eb6a6 upstream.

    ALLOC_NO_WATERMARKS is set in a few cases: always by kswapd, always for
    __GFP_MEMALLOC, and sometimes for swap-over-NFS, certain tasks, etc. Each of
    these cases is a relatively rare event, but the ALLOC_NO_WATERMARKS check is
    an unlikely branch in the fast path. This patch moves the check out of the
    fast path to after it has been determined that the watermarks have not
    been met. This helps the common fast path at the cost of making the slow
    path slower and hitting kswapd with a performance cost. It's a reasonable
    tradeoff.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
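
    For reference, the reordering can be sketched roughly as follows (not the
    upstream diff; the flag and helper names follow the mm/page_alloc.c
    conventions of the time):

        /* Fast path: the common case pays only for the watermark check. */
        if (!zone_watermark_ok(zone, order, mark, classzone_idx, alloc_flags)) {
                /*
                 * Rare callers (kswapd, __GFP_MEMALLOC, swap-over-NFS, ...)
                 * may ignore watermarks entirely; that is now checked only
                 * after the watermark test has already failed.
                 */
                if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS))
                        goto try_this_zone;

                /* ... otherwise fall through to zlc/zone_reclaim handling ... */
        }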
     
  • commit a6e21b14f22041382e832d30deda6f26f37b1097 upstream.

    Currently it's calculated once per zone in the zonelist.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit d34c5fa06fade08a689fc171bf756fba2858ae73 upstream.

    A node/zone index is used to check if pages are compatible for merging
    but this happens unconditionally even if the buddy page is not free. Defer
    the calculation as long as possible. Ideally we would check the zone boundary
    but nodes can overlap.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit d8846374a85f4290a473a4e2a64c1ba046c4a0e1 upstream.

    There is no need to calculate zone_idx(preferred_zone) multiple times
    or use the pgdat to figure it out.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit 664eeddeef6539247691197c1ac124d4aa872ab6 upstream.

    If cpusets are not in use then we still check a global variable on every
    page allocation. Use jump labels to avoid the overhead.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
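
    A minimal sketch of the jump-label approach (using the static_key API of
    that era; helper names are illustrative rather than the exact upstream
    code):

        #include <linux/jump_label.h>

        static struct static_key cpusets_enabled_key = STATIC_KEY_INIT_FALSE;

        static inline bool cpusets_enabled(void)
        {
                /* Compiles to a patched no-op branch while the key is false. */
                return static_key_false(&cpusets_enabled_key);
        }

        /* Flipped when the first non-root cpuset appears / the last one goes away. */
        static inline void cpuset_inc(void)
        {
                static_key_slow_inc(&cpusets_enabled_key);
        }

        static inline void cpuset_dec(void)
        {
                static_key_slow_dec(&cpusets_enabled_key);
        }

        /*
         * The allocator fast path then becomes, roughly:
         *
         *      if (cpusets_enabled() &&
         *          (alloc_flags & ALLOC_CPUSET) &&
         *          !cpuset_zone_allowed_softwall(zone, gfp_mask))
         *              continue;
         */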
     
  • commit 800a1e750c7b04c2aa2459afca77e936e01c0029 upstream.

    If a zone cannot be used for a dirty page then it gets marked "full" which
    is cached in the zlc and later potentially skipped by allocation requests
    that have nothing to do with dirty zones.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit 65bb371984d6a2c909244eb749e482bb40b72e36 upstream.

    The zlc is used on NUMA machines to quickly skip over zones that are full.
    However, it is always updated, even for the first zone scanned when the
    zlc might not even be active. As it is a write to a bitmap that
    potentially bounces a cache line, it is deceptively expensive even though
    most machines will not notice. Only update the zlc if it was active.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit 5bcc9f86ef09a933255ee66bd899d4601785dad5 upstream.

    MIGRATE_RESERVE pages should not get misplaced on the free_list of another
    migratetype, otherwise they might get allocated prematurely and e.g.
    fragment the MIGRATE_RESERVE pageblocks. While this cannot be avoided
    completely when allocating new MIGRATE_RESERVE pageblocks in the
    min_free_kbytes sysctl handler, we should prevent the misplacement where
    possible.

    Currently, it is possible for the misplacement to happen when a
    MIGRATE_RESERVE page is allocated on pcplist through rmqueue_bulk() as a
    fallback for other desired migratetype, and then later freed back
    through free_pcppages_bulk() without being actually used. This happens
    because free_pcppages_bulk() uses get_freepage_migratetype() to choose
    the free_list, and rmqueue_bulk() calls set_freepage_migratetype() with
    the *desired* migratetype and not the page's original MIGRATE_RESERVE
    migratetype.

    This patch fixes the problem by moving the call to
    set_freepage_migratetype() from rmqueue_bulk() down to
    __rmqueue_smallest() and __rmqueue_fallback(), where the actual page's
    migratetype (i.e. the free_list the page is taken from) is used.
    Note that this migratetype might be different from the pageblock's
    migratetype due to freepage stealing decisions. This is OK, as page
    stealing never uses MIGRATE_RESERVE as a fallback, and also takes care
    to leave all MIGRATE_CMA pages on the correct freelist.

    Therefore, as an additional benefit, the call to
    get_pageblock_migratetype() from rmqueue_bulk() when CMA is enabled, can
    be removed completely. This relies on the fact that MIGRATE_CMA
    pageblocks are created only during system init, and the above. The
    related is_migrate_isolate() check is also unnecessary, as memory
    isolation has other ways to move pages between freelists, and drain pcp
    lists containing pages that should be isolated. The buffered_rmqueue()
    can also benefit from calling get_freepage_migratetype() instead of
    get_pageblock_migratetype().

    Signed-off-by: Vlastimil Babka
    Reported-by: Yong-Taek Lee
    Reported-by: Bartlomiej Zolnierkiewicz
    Suggested-by: Joonsoo Kim
    Acked-by: Joonsoo Kim
    Suggested-by: Mel Gorman
    Acked-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Marek Szyprowski
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Michal Nazarewicz
    Cc: "Wang, Yalin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Vlastimil Babka
     
  • commit e0b9daeb453e602a95ea43853dc12d385558ce1f upstream.

    We're going to want to manipulate the migration mode for compaction in the
    page allocator, and currently compact_control's sync field is only a bool.

    Currently, we only do MIGRATE_ASYNC or MIGRATE_SYNC_LIGHT compaction
    depending on the value of this bool. Convert the bool to enum
    migrate_mode and pass the migration mode in directly. Later, we'll want
    to avoid MIGRATE_SYNC_LIGHT for thp allocations in the pagefault path to
    avoid unnecessary latency.

    This also alters compaction triggered from sysfs, either for the entire
    system or for a node, to force MIGRATE_SYNC.

    [akpm@linux-foundation.org: fix build]
    [iamjoonsoo.kim@lge.com: use MIGRATE_SYNC in alloc_contig_range()]
    Signed-off-by: David Rientjes
    Suggested-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Greg Thelen
    Cc: Naoya Horiguchi
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    David Rientjes
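
    The resulting mode looks roughly like this (a sketch; the comments
    paraphrase the semantics rather than quote migrate_mode.h):

        enum migrate_mode {
                MIGRATE_ASYNC,          /* never block, bail out on contention */
                MIGRATE_SYNC_LIGHT,     /* allow some blocking, avoid the expensive waits */
                MIGRATE_SYNC,           /* fully synchronous, e.g. sysfs-triggered compaction */
        };

        struct compact_control {
                /* ... */
                enum migrate_mode mode; /* replaces the old "bool sync" */
        };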
     
  • commit 68711a746345c44ae00c64d8dbac6a9ce13ac54a upstream.

    Memory migration uses a callback defined by the caller to determine how to
    allocate destination pages. When migration fails for a source page,
    however, it frees the destination page back to the system.

    This patch adds a memory migration callback defined by the caller to
    determine how to free destination pages. If a caller, such as memory
    compaction, builds its own freelist for migration targets, this can reuse
    already freed memory instead of scanning additional memory.

    If the caller provides a function to handle freeing of destination pages,
    it is called when page migration fails. If the caller passes NULL then
    freeing back to the system will be handled as usual. This patch
    introduces no functional change.

    Signed-off-by: David Rientjes
    Reviewed-by: Naoya Horiguchi
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    David Rientjes
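
    The interface change can be pictured roughly like this (a sketch; the
    exact upstream prototypes may differ in detail):

        /*
         * Called when migration of a source page fails and the destination
         * page obtained from get_new_page must be given back to the caller.
         */
        typedef void free_page_t(struct page *page, unsigned long private);

        /*
         * migrate_pages() gains the optional put_new_page callback; passing
         * NULL keeps the previous behaviour of freeing to the page allocator.
         */
        int migrate_pages(struct list_head *from, new_page_t get_new_page,
                          free_page_t put_new_page, unsigned long private,
                          enum migrate_mode mode, int reason);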
     
  • commit d26914d11751b23ca2e8747725f2cae10c2f2c1b upstream.

    Since put_mems_allowed() is strictly optional (it is a seqcount retry), we
    don't need to evaluate the function if the allocation was in fact
    successful, saving an smp_rmb and some loads and comparisons on some
    relatively fast paths.

    Since the naming, get/put_mems_allowed() does suggest a mandatory
    pairing, rename the interface, as suggested by Mel, to resemble the
    seqcount interface.

    This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(),
    where it is important to note that the return value of the latter call
    is inverted from its previous incarnation.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
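
    The renamed pair is used like a seqcount read side. A sketch of the
    typical retry loop (attempt_allocation() is a stand-in for whatever
    allocation attempt is being wrapped):

        unsigned int cpuset_mems_cookie;
        struct page *page;

        do {
                cpuset_mems_cookie = read_mems_allowed_begin();

                /* Pick zones according to current->mems_allowed and try it. */
                page = attempt_allocation(gfp_mask, order);

                /*
                 * Inverted sense compared to put_mems_allowed():
                 * read_mems_allowed_retry() returns true when mems_allowed
                 * changed underneath us, so a failed allocation is worth
                 * retrying with the updated mask.
                 */
        } while (!page && read_mems_allowed_retry(cpuset_mems_cookie));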
     
  • commit 943dca1a1fcbccb58de944669b833fd38a6c809b upstream.

    Yasuaki Ishimatsu reported that memory hot-add spent more than 5 _hours_
    on a 9TB memory machine because onlining memory sections is too slow, and
    we found that setup_zone_migrate_reserve() spent >90% of the time.

    The problem is that setup_zone_migrate_reserve() scans all pageblocks
    unconditionally, but that is only necessary if the number of reserved
    blocks was reduced (i.e. memory hot-remove).

    Moreover, the maximum number of MIGRATE_RESERVE pageblocks per zone is
    currently 2, so the number of reserved pageblocks almost never changes.

    This patch adds zone->nr_migrate_reserve_block to maintain the number of
    MIGRATE_RESERVE pageblocks, which reduces the overhead of
    setup_zone_migrate_reserve() dramatically. The following table shows the
    time to online a memory section.

    Amount of memory     | 128GB | 192GB | 256GB |
    ---------------------------------------------
    linux-3.12           |  23.9 |  31.4 |  44.5 |
    This patch           |   8.3 |   8.3 |   8.6 |
    Mel's proposal patch |  10.9 |  19.2 |  31.3 |
    ---------------------------------------------
    (milliseconds)

    128GB : 4 nodes and each node has 32GB of memory
    192GB : 6 nodes and each node has 32GB of memory
    256GB : 8 nodes and each node has 32GB of memory

    (*1) Mel proposed his idea in the following thread:
    https://lkml.org/lkml/2013/10/30/272

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Yasuaki Ishimatsu
    Reported-by: Yasuaki Ishimatsu
    Tested-by: Yasuaki Ishimatsu
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Yasuaki Ishimatsu
     
  • commit de6c60a6c115acaa721cfd499e028a413d1fcbf3 upstream.

    Currently there are several functions to manipulate the deferred
    compaction state variables. The remaining case where the variables are
    touched directly is when a successful allocation occurs in direct
    compaction, or is expected to be successful in the future by kswapd.
    Here, the lowest order that is expected to fail is updated, and in the
    case of successful allocation, the deferred status and counter is reset
    completely.

    Create a new function compaction_defer_reset() to encapsulate this
    functionality and make it easier to understand the code. No functional
    change.

    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Vlastimil Babka
     
  • commit 0cbef29a782162a3896487901eca4550bfa397ef upstream.

    When __rmqueue_fallback() doesn't find a free block with the required size
    it splits a larger page and puts the rest of the page onto the free list.

    But it has one serious mistake. When putting pages back,
    __rmqueue_fallback() always uses start_migratetype if the type is not CMA.
    However, __rmqueue_fallback() is only called when the entire
    start_migratetype queue is empty. That means __rmqueue_fallback() always
    puts memory back on the wrong queue, except when try_to_steal_freepages()
    changed the pageblock type (i.e. the requested size is smaller than half
    of the pageblock). The end result is that the anti-fragmentation framework
    increases fragmentation instead of decreasing it.

    Mel's original anti-fragmentation code does the right thing. But commit
    47118af076f6 ("mm: mmzone: MIGRATE_CMA migration type added") broke it.

    This patch restores sane and old behavior. It also removes an incorrect
    comment which was introduced by commit fef903efcf0c ("mm/page_alloc.c:
    restructure free-page stealing code and fix a bug").

    Signed-off-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    KOSAKI Motohiro
     
  • commit 52c8f6a5aeb0bdd396849ecaa72d96f8175528f5 upstream.

    In general, every tracepoint should have zero overhead when it is
    disabled. However, trace_mm_page_alloc_extfrag() is one exception. It
    evaluates "new_type == start_migratetype" even if the tracepoint is
    disabled.

    The comparison can be moved into the tracepoint's TP_fast_assign(), which
    exists for exactly this purpose. This patch does that.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    KOSAKI Motohiro
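
    The structure of the fix looks roughly like this (field and argument
    names are illustrative, not the exact upstream TRACE_EVENT definition):

        TRACE_EVENT(mm_page_alloc_extfrag,

                TP_PROTO(struct page *page, int alloc_order, int fallback_order,
                         int alloc_migratetype, int fallback_migratetype,
                         int new_migratetype),

                TP_ARGS(page, alloc_order, fallback_order,
                        alloc_migratetype, fallback_migratetype, new_migratetype),

                TP_STRUCT__entry(
                        __field(int, change_ownership)
                        /* ... other fields ... */
                ),

                TP_fast_assign(
                        /* Only evaluated when the tracepoint is enabled. */
                        __entry->change_ownership =
                                (new_migratetype == alloc_migratetype);
                        /* ... assign the remaining fields ... */
                ),

                TP_printk("change_ownership=%d", __entry->change_ownership)
        );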
     
  • commit 70ef57e6c22c3323dce179b7d0d433c479266612 upstream.

    We had a report about strange OOM killer strikes on a PPC machine
    although there was plenty of swap free and tons of anonymous memory
    that could have been swapped out. In the end it turned out that the OOM
    was a side effect of zone reclaim, which wasn't unmapping and swapping
    out, so the system was pushed to OOM. Although this sounds like a bug
    somewhere in the kswapd vs. zone reclaim vs. direct reclaim interaction,
    the numactl output on the said hardware suggests that zone reclaim
    should not have been enabled in the first place:

    node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    node 0 size: 0 MB
    node 0 free: 0 MB
    node 2 cpus:
    node 2 size: 7168 MB
    node 2 free: 6019 MB
    node distances:
    node   0   2
      0:  10  40
      2:  40  10

    So all the CPUs are associated with Node0, which doesn't have any memory,
    while Node2 contains all the available memory. The node distances cause
    zone_reclaim_mode to be enabled automatically.

    Zone reclaim is intended to keep allocations local, but this doesn't
    make any sense on memoryless nodes. So let's exclude such nodes in
    init_zone_allows_reclaim(), which evaluates the zone reclaim behavior and
    the suitable reclaim_nodes.

    Signed-off-by: Michal Hocko
    Acked-by: David Rientjes
    Acked-by: Nishanth Aravamudan
    Tested-by: Nishanth Aravamudan
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Michal Hocko
     
  • commit da8c757b080ee84f219fa2368cb5dd23ac304fc0 upstream.

    If you echo -1 > /proc/sys/vm/min_free_kbytes, the system will hang.
    Changing proc_dointvec() to proc_dointvec_minmax() in
    min_free_kbytes_sysctl_handler() prevents this from happening.

    mhocko said:

    : You can still do echo $BIG_VALUE > /proc/sys/vm/min_free_kbytes and make
    : your machine unusable but I agree that proc_dointvec_minmax is more
    : suitable here as we already have:
    :
    : .proc_handler = min_free_kbytes_sysctl_handler,
    : .extra1 = &zero,
    :
    : It used to work properly but then 6fce56ec91b5 ("sysctl: Remove references
    : to ctl_name and strategy from the generic sysctl table") has removed
    : sysctl_intvec strategy and so extra1 is ignored.

    Signed-off-by: Han Pingtian
    Acked-by: Michal Hocko
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Han Pingtian
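
    The fix boils down to letting the generic helper honour the table's
    existing lower bound (.extra1 = &zero). A simplified sketch of the
    resulting handler:

        int min_free_kbytes_sysctl_handler(struct ctl_table *table, int write,
                                           void __user *buffer, size_t *length,
                                           loff_t *ppos)
        {
                int rc;

                /*
                 * proc_dointvec_minmax() rejects values below *table->extra1,
                 * so a negative write can no longer reach the watermark code.
                 */
                rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
                if (rc)
                        return rc;

                if (write)
                        setup_per_zone_wmarks();
                return 0;
        }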
     

19 Aug, 2014

1 commit

  • commit b104a35d32025ca740539db2808aa3385d0f30eb upstream.

    The page allocator relies on __GFP_WAIT to determine if ALLOC_CPUSET
    should be set in allocflags. ALLOC_CPUSET controls if a page allocation
    should be restricted only to the set of allowed cpuset mems.

    The transparent hugepage code clears __GFP_WAIT when defrag is disabled,
    to prevent the fault path from using memory compaction or direct reclaim.
    Thus, it is unfairly able to allocate outside of its cpuset mems
    restriction as a side-effect.

    This patch ensures that ALLOC_CPUSET is only cleared when the gfp mask is
    truly GFP_ATOMIC by verifying it is also not a thp allocation.

    Signed-off-by: David Rientjes
    Reported-by: Alex Thorlton
    Tested-by: Alex Thorlton
    Cc: Bob Liu
    Cc: Dave Hansen
    Cc: Hedi Berriche
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Srivatsa S. Bhat
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    David Rientjes
     

18 Jul, 2014

2 commits

  • commit dc78327c0ea7da5186d8cbc1647bd6088c5c9fa5 upstream.

    With a kernel configured with ARM64_64K_PAGES && !TRANSPARENT_HUGEPAGE,
    the following is triggered at early boot:

    SMP: Total of 8 processors activated.
    devtmpfs: initialized
    Unable to handle kernel NULL pointer dereference at virtual address 00000008
    pgd = fffffe0000050000
    [00000008] *pgd=00000043fba00003, *pmd=00000043fba00003, *pte=00e0000078010407
    Internal error: Oops: 96000006 [#1] SMP
    Modules linked in:
    CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.15.0-rc864k+ #44
    task: fffffe03bc040000 ti: fffffe03bc080000 task.ti: fffffe03bc080000
    PC is at __list_add+0x10/0xd4
    LR is at free_one_page+0x270/0x638
    ...
    Call trace:
    __list_add+0x10/0xd4
    free_one_page+0x26c/0x638
    __free_pages_ok.part.52+0x84/0xbc
    __free_pages+0x74/0xbc
    init_cma_reserved_pageblock+0xe8/0x104
    cma_init_reserved_areas+0x190/0x1e4
    do_one_initcall+0xc4/0x154
    kernel_init_freeable+0x204/0x2a8
    kernel_init+0xc/0xd4

    This happens because init_cma_reserved_pageblock() calls
    __free_one_page() with pageblock_order as page order but it is bigger
    than MAX_ORDER. This in turn causes accesses past zone->free_list[].

    Fix the problem by changing init_cma_reserved_pageblock() such that it
    splits pageblock into individual MAX_ORDER pages if pageblock is bigger
    than a MAX_ORDER page.

    In cases where !CONFIG_HUGETLB_PAGE_SIZE_VARIABLE, which is all
    architectures except ia64, powerpc and tile at the moment, the
    “pageblock_order > MAX_ORDER” condition will be optimised out since both
    sides of the operator are constants. Where the pageblock size is
    variable, the performance degradation should not be significant anyway,
    since init_cma_reserved_pageblock() is called only at boot time, at most
    MAX_CMA_AREAS times, which by default is eight.

    Signed-off-by: Michal Nazarewicz
    Reported-by: Mark Salter
    Tested-by: Mark Salter
    Tested-by: Christopher Covington
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Marek Szyprowski
    Cc: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Michal Nazarewicz
     
  • commit 7cd2b0a34ab8e4db971920eef8982f985441adfb upstream.

    Oleg reports a division by zero error on zero-length write() to the
    percpu_pagelist_fraction sysctl:

    divide error: 0000 [#1] SMP DEBUG_PAGEALLOC
    CPU: 1 PID: 9142 Comm: badarea_io Not tainted 3.15.0-rc2-vm-nfs+ #19
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    task: ffff8800d5aeb6e0 ti: ffff8800d87a2000 task.ti: ffff8800d87a2000
    RIP: 0010: percpu_pagelist_fraction_sysctl_handler+0x84/0x120
    RSP: 0018:ffff8800d87a3e78 EFLAGS: 00010246
    RAX: 0000000000000f89 RBX: ffff88011f7fd000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000010
    RBP: ffff8800d87a3e98 R08: ffffffff81d002c8 R09: ffff8800d87a3f50
    R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000060
    R13: ffffffff81c3c3e0 R14: ffffffff81cfddf8 R15: ffff8801193b0800
    FS: 00007f614f1e9740(0000) GS:ffff88011f440000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007f614f1fa000 CR3: 00000000d9291000 CR4: 00000000000006e0
    Call Trace:
    proc_sys_call_handler+0xb3/0xc0
    proc_sys_write+0x14/0x20
    vfs_write+0xba/0x1e0
    SyS_write+0x46/0xb0
    tracesys+0xe1/0xe6

    However, if the percpu_pagelist_fraction sysctl is set by the user, it
    is also impossible to restore it to the kernel default since the user
    cannot write 0 to the sysctl.

    This patch allows the user to write 0 to restore the default behavior.
    It still requires a fraction equal to or larger than 8, however, as
    stated by the documentation for sanity. If a value in the range [1, 7]
    is written, the sysctl will return EINVAL.

    This successfully solves the divide by zero issue at the same time.

    Signed-off-by: David Rientjes
    Reported-by: Oleg Drokin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    David Rientjes
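
    The resulting input handling can be sketched as follows (simplified from
    the upstream handler; the per-zone pcp recalculation and locking are
    omitted):

        #define MIN_PERCPU_PAGELIST_FRACTION    (8)

        int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *table, int write,
                                                    void __user *buffer, size_t *length,
                                                    loff_t *ppos)
        {
                int old_fraction = percpu_pagelist_fraction;
                int ret;

                ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
                if (!write || ret < 0)
                        return ret;

                /*
                 * 0 restores the kernel default sizing of the pcp lists;
                 * values in [1, 7] are rejected instead of being divided by
                 * later, which is what caused the original oops.
                 */
                if (percpu_pagelist_fraction &&
                    percpu_pagelist_fraction < MIN_PERCPU_PAGELIST_FRACTION) {
                        percpu_pagelist_fraction = old_fraction;
                        return -EINVAL;
                }

                /* ... recompute pcp->high for every zone and cpu if it changed ... */
                return 0;
        }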
     

02 Jul, 2014

1 commit

  • commit e58469bafd0524e848c3733bc3918d854595e20f upstream.

    The test_bit operations in the get/set pageblock flags helpers are
    expensive. This patch reads the bitmap on a word basis and uses shifts and
    masks to isolate the bits of interest. Similarly, masks are used to build
    a local copy of the word, and cmpxchg is then used to update the bitmap
    only if no other changes have been made in parallel.

    In a test running dd onto tmpfs the overhead of the pageblock-related
    functions went from 1.27% in profiles to 0.5%.

    In addition to the performance benefits, this patch closes races that are
    possible between:

    a) get_ and set_pageblock_migratetype(), where get_pageblock_migratetype()
    reads part of the bits before and other part of the bits after
    set_pageblock_migratetype() has updated them.

    b) set_pageblock_migratetype() and set_pageblock_skip(), where the non-atomic
    read-modify-update set bit operation in set_pageblock_skip() will cause
    lost updates to some bits changed in the set_pageblock_migratetype().

    Joonsoo Kim first reported the case a) via code inspection. Vlastimil
    Babka's testing with a debug patch showed that either a) or b) occurs
    roughly once per mmtests' stress-highalloc benchmark (although not
    necessarily in the same pageblock). Furthermore, during development of
    unrelated compaction patches it was observed that with frequent calls to
    {start,undo}_isolate_page_range() the race occurs several thousand times
    and results in NULL pointer dereferences in move_freepages() and
    free_one_page() in places where free_list[migratetype] is manipulated by
    e.g. list_move(). Further debugging confirmed that the migratetype had the
    invalid value of 6, causing out-of-bounds access to the free_list array.

    That confirmed that the race exists, although it may be extremely rare,
    and is currently only fatal where page isolation is performed due to
    memory hot-remove. Races on pageblocks being updated by
    set_pageblock_migratetype(), where both the old and new migratetype are
    lower than MIGRATE_RESERVE, currently cannot result in an invalid value
    being observed, although theoretically they may still lead to
    unexpected creation or destruction of MIGRATE_RESERVE pageblocks.
    Furthermore, things could suddenly get worse when memory isolation is
    used more, or when new migratetypes are added.

    After this patch, the race has no longer been observed in testing.

    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Reported-by: Joonsoo Kim
    Reported-and-tested-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Mel Gorman
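
    The word-based update at the heart of the change can be sketched like
    this (illustrative; the real set_pageblock_flags_group() derives
    word_bitidx, mask and flags from the pageblock's bit index):

        unsigned long *bitmap;          /* the zone's pageblock flags bitmap */
        unsigned long word_bitidx;      /* which word holds this pageblock's bits */
        unsigned long mask, flags;      /* bits of interest and their new value */
        unsigned long word, old_word;

        word = ACCESS_ONCE(bitmap[word_bitidx]);
        for (;;) {
                /*
                 * Build the new value from a local copy of the word, then
                 * publish it atomically; retry if another CPU modified the
                 * word in the meantime.
                 */
                old_word = cmpxchg(&bitmap[word_bitidx], word,
                                   (word & ~mask) | flags);
                if (old_word == word)
                        break;
                word = old_word;
        }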
     

15 May, 2014

1 commit

  • commit 3a025760fc158b3726eac89ee95d7f29599e9dfa upstream.

    On NUMA systems, a node may start thrashing cache or even swap anonymous
    pages while there are still free pages on remote nodes.

    This is a result of commits 81c0a2bb515f ("mm: page_alloc: fair zone
    allocator policy") and fff4068cba48 ("mm: page_alloc: revert NUMA aspect
    of fair allocation policy").

    Before those changes, the allocator would first try all allowed zones,
    including those on remote nodes, before waking any kswapds. But now,
    the allocator fastpath doubles as the fairness pass, which in turn can
    only consider the local node to prevent remote spilling based on
    exhausted fairness batches alone. Remote nodes are only considered in
    the slowpath, after the kswapds are woken up. But if remote nodes still
    have free memory, kswapd should not be woken to rebalance the local node
    or it may thrash cache or swap prematurely.

    Fix this by adding one more unfair pass over the zonelist that is
    allowed to spill to remote nodes after the local fairness pass fails but
    before entering the slowpath and waking the kswapds.

    This also gets rid of the GFP_THISNODE exemption from the fairness
    protocol because the unfair pass is no longer tied to kswapd, which
    GFP_THISNODE is not allowed to wake up.

    However, because remote spills can be more frequent now - we prefer them
    over local kswapd reclaim - the allocation batches on remote nodes could
    underflow more heavily. When resetting the batches, use
    atomic_long_read() directly instead of zone_page_state() to calculate the
    delta as the latter filters negative counter values.

    Signed-off-by: Johannes Weiner
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Johannes Weiner
     

03 Apr, 2014

1 commit

  • commit 668f9abbd4334e6c29fa8acd71635c4f9101caa7 upstream.

    Commit bf6bddf1924e ("mm: introduce compaction and migration for
    ballooned pages") introduces page_count(page) into memory compaction
    which dereferences page->first_page if PageTail(page).

    This results in a very rare NULL pointer dereference on the
    aforementioned page_count(page). Indeed, anything that does
    compound_head(), including page_count() is susceptible to racing with
    prep_compound_page() and seeing a NULL or dangling page->first_page
    pointer.

    This patch uses Andrea's implementation of compound_trans_head() that
    deals with such a race and makes it the default compound_head()
    implementation. This includes a read memory barrier that ensures that
    if PageTail(head) is true that we return a head page that is neither
    NULL nor dangling. The patch then adds a store memory barrier to
    prep_compound_page() to ensure page->first_page is set.

    This is the safest way to ensure we see the head page that we are
    expecting, PageTail(page) is already in the unlikely() path and the
    memory barriers are unfortunately required.

    Hugetlbfs is the exception, we don't enforce a store memory barrier
    during init since no race is possible.

    Signed-off-by: David Rientjes
    Cc: Holger Kiehl
    Cc: Christoph Lameter
    Cc: Rafael Aquini
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    David Rientjes
     

23 Mar, 2014

1 commit

  • commit 27329369c9ecf37771b2a65202cbf5578cff3331 upstream.

    Jan Stancek reports manual page migration encountering allocation
    failures after some pages when there is still plenty of memory free, and
    bisected the problem down to commit 81c0a2bb515f ("mm: page_alloc: fair
    zone allocator policy").

    The problem is that GFP_THISNODE obeys the zone fairness allocation
    batches on one hand, but doesn't reset them and wake kswapd on the other
    hand. After a few of those allocations, the batches are exhausted and
    the allocations fail.

    Fixing this means either having GFP_THISNODE wake up kswapd, or
    GFP_THISNODE not participating in zone fairness at all. The latter
    seems safer as an acute bugfix, we can clean up later.

    Reported-by: Jan Stancek
    Signed-off-by: Johannes Weiner
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Johannes Weiner
     

10 Jan, 2014

1 commit

  • commit fff4068cba484e6b0abe334ed6b15d5a215a3b25 upstream.

    Commit 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy") meant
    to bring aging fairness among zones in system, but it was overzealous
    and badly regressed basic workloads on NUMA systems.

    Due to the way kswapd and page allocator interacts, we still want to
    make sure that all zones in any given node are used equally for all
    allocations to maximize memory utilization and prevent thrashing on the
    highest zone in the node.

    While the same principle applies to NUMA nodes - memory utilization is
    obviously improved by spreading allocations throughout all nodes -
    remote references can be costly and so many workloads prefer locality
    over memory utilization. The original change assumed that
    zone_reclaim_mode would be a good enough predictor for that, but it
    turned out to be as indicative as a coin flip.

    Revert the NUMA aspect of the fairness until we can find a proper way to
    make it configurable and agree on a sane default.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Michal Hocko
    Signed-off-by: Mel Gorman
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Johannes Weiner
     

01 Oct, 2013

1 commit

  • This reverts commit cea27eb2a202 ("mm/memory-hotplug: fix lowmem count
    overflow when offline pages").

    The bug addressed by commit cea27eb2a202 was fixed in another way by
    commit 3dcc0571cd64 ("mm: correctly update zone->managed_pages"). That
    commit enhances memory_hotplug.c to adjust totalhigh_pages when
    hot-removing memory; for details please refer to:

    http://marc.info/?l=linux-mm&m=136957578620221&w=2

    As a result, commit cea27eb2a202 currently causes duplicated decreasing
    of totalhigh_pages, thus the revert.

    Signed-off-by: Joonyoung Shim
    Reviewed-by: Wanpeng Li
    Cc: Jiang Liu
    Cc: KOSAKI Motohiro
    Cc: Bartlomiej Zolnierkiewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonyoung Shim
     

12 Sep, 2013

14 commits

    Set _mapcount to PAGE_BUDDY_MAPCOUNT_VALUE, not the magic number -2, to
    mark the page as buddy.

    Signed-off-by: Wang Sheng-Hui
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Sheng-Hui
     
    This patch is based on KOSAKI's work, with a little more description
    added; please refer to https://lkml.org/lkml/2012/6/14/74.

    I found the system can enter a state in which a zone has lots of free
    pages but only order-0 and order-1 ones, meaning the zone is heavily
    fragmented. A high-order allocation can then stall the direct reclaim
    path for a long time (e.g. 60 seconds), especially in an environment with
    no swap and no compaction. This problem happened on v3.4, but the issue
    still lives in the current tree; the reason is that do_try_to_free_pages
    enters a live lock:

    kswapd will go to sleep if the zones have been fully scanned and are still
    not balanced, as kswapd thinks there is little point in trying all over
    again and wants to avoid an infinite loop. Instead it changes the order
    from high-order to order-0 because kswapd thinks order-0 is the most
    important; see commit 73ce02e9 for details. If the watermarks are OK,
    kswapd will go back to sleep and may leave zone->all_unreclaimable = 0.
    It assumes high-order users can still perform direct reclaim if they wish.

    Direct reclaim continues to reclaim for a high order that is not a
    COSTLY_ORDER, without invoking the oom-killer, until kswapd turns on
    zone->all_unreclaimable. This is to avoid a premature oom-kill.
    So it means direct reclaim depends on kswapd to break this loop.

    In the worst case, direct reclaim may continue page reclaim forever while
    kswapd sleeps forever, until something like a watchdog detects it and
    finally kills the process, as described in:
    http://thread.gmane.org/gmane.linux.kernel.mm/103737

    We can't turn on zone->all_unreclaimable from the direct reclaim path
    because the direct reclaim path doesn't take any lock, so that approach is
    racy. Thus this patch removes the zone->all_unreclaimable field completely
    and recalculates the zone's reclaimable state every time.

    Note: we can't have direct reclaim look at zone->pages_scanned directly
    while kswapd continues to use zone->all_unreclaimable, because that is
    racy. Commit 929bea7c71 ("vmscan: all_unreclaimable() use
    zone->all_unreclaimable as a name") describes the details.

    [akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()]
    Cc: Aaditya Kumar
    Cc: Ying Han
    Cc: Nick Piggin
    Acked-by: Rik van Riel
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Bob Liu
    Cc: Neil Zhang
    Cc: Russell King - ARM Linux
    Reviewed-by: Michal Hocko
    Acked-by: Minchan Kim
    Acked-by: Johannes Weiner
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Lisa Du
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lisa Du
     
  • cpuset_zone_allowed is changed to cpuset_zone_allowed_softwall and the
    comment is moved to __cpuset_node_allowed_softwall. So fix this comment.

    Signed-off-by: SeungHun Lee
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    SeungHun Lee
     
    The current early_pfn_to_nid() on architectures that support memblock
    walks memblock.memory one entry at a time, so it takes too many tries near
    the end.

    We can use the existing memblock_search() to find the node id for a given
    pfn, which saves some time on bigger systems that have many entries in the
    memblock.memory array.

    Here are the timing differences for several machines. In each case with
    the patch less time was spent in __early_pfn_to_nid().

                             3.11-rc5   with patch   difference (%)
                             --------   ----------   --------------
    UV1: 256 nodes  9TB:       411.66       402.47    -9.19 (2.23%)
    UV2: 255 nodes 16TB:      1141.02      1138.12    -2.90 (0.25%)
    UV2:  64 nodes  2TB:       128.15       126.53    -1.62 (1.26%)
    UV2:  32 nodes  2TB:       121.87       121.07    -0.80 (0.66%)
    Time in seconds.

    Signed-off-by: Yinghai Lu
    Cc: Tejun Heo
    Acked-by: Russ Anderson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     
    Until now we couldn't offline memory blocks that contain hugepages because
    a hugepage was considered an unmovable page. But now, with this patch
    series, a hugepage has become movable, so by using hugepage migration we
    can offline such memory blocks.

    What's different from other users of hugepage migration is that we need to
    decompose all the hugepages inside the target memory block into free buddy
    pages after hugepage migration, because otherwise free hugepages remaining
    in the memory block interfere with the memory offlining. For this reason
    we introduce the new functions dissolve_free_huge_page() and
    dissolve_free_huge_pages().

    Other than that, this patch straightforwardly adds the hugepage migration
    code: hugepage handling in the functions that scan over pfns and collect
    pages to be migrated, and a hugepage allocation path in
    alloc_migrate_target().

    As for larger hugepages (1GB on x86_64), it's not easy to hot-remove them
    because they are larger than a memory block, so for now we simply leave
    that case to fail as it is.

    [yongjun_wei@trendmicro.com.cn: remove duplicated include]
    Signed-off-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Cc: Hillf Danton
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Wei Yongjun
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Use "zone_is_empty()" instead of "if (zone->spanned_pages)".
    Simplify the code, no functional change.

    Signed-off-by: Xishi Qiu
    Cc: Cody P Schafer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
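
    For reference, the helper is a trivial wrapper (a sketch of the mmzone.h
    definition):

        static inline bool zone_is_empty(struct zone *zone)
        {
                return zone->spanned_pages == 0;
        }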
     
  • The main idea behind this patchset is to reduce the vmstat update overhead
    by avoiding interrupt enable/disable and the use of per cpu atomics.

    This patch (of 3):

    It is better to have a separate folding function because
    refresh_cpu_vm_stats() also does other things like expire pages in the
    page allocator caches.

    If we have a separate function then refresh_cpu_vm_stats() is only called
    from the local cpu which allows additional optimizations.

    The folding function is only called when a cpu is being downed and
    therefore no other processor will be accessing the counters. Also
    simplifies synchronization.

    [akpm@linux-foundation.org: fix UP build]
    Signed-off-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    CC: Tejun Heo
    Cc: Joonsoo Kim
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
    We rarely allocate a page with ALLOC_NO_WATERMARKS, and it is used in the
    slow path. To help compiler optimization, add the unlikely() macro to the
    ALLOC_NO_WATERMARKS check.

    This patch doesn't have any effect right now, because gcc already
    optimizes this properly. But we cannot assume that gcc always gets it
    right, and nobody re-evaluates whether gcc still makes the proper
    optimization after their changes; for example, it is not optimized
    properly on v3.10. So adding a compiler hint here is reasonable.

    Signed-off-by: Joonsoo Kim
    Acked-by: Johannes Weiner
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
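
    unlikely() is a thin wrapper around __builtin_expect(). A tiny
    self-contained illustration of the hint (the flag value is made up for the
    demo; the kernel's check lives in get_page_from_freelist()):

        #include <stdio.h>

        #define unlikely(x)             __builtin_expect(!!(x), 0)
        #define ALLOC_NO_WATERMARKS     0x04    /* illustrative value for the demo */

        static int watermarks_ignored(int alloc_flags)
        {
                /* The compiler keeps the rare branch out of the hot code path. */
                if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS))
                        return 1;
                return 0;
        }

        int main(void)
        {
                printf("%d %d\n", watermarks_ignored(0),
                       watermarks_ignored(ALLOC_NO_WATERMARKS));
                return 0;
        }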
     
  • Each zone that holds userspace pages of one workload must be aged at a
    speed proportional to the zone size. Otherwise, the time an individual
    page gets to stay in memory depends on the zone it happened to be
    allocated in. Asymmetry in the zone aging creates rather unpredictable
    aging behavior and results in the wrong pages being reclaimed, activated
    etc.

    But exactly this happens right now because of the way the page allocator
    and kswapd interact. The page allocator uses per-node lists of all zones
    in the system, ordered by preference, when allocating a new page. When
    the first iteration does not yield any results, kswapd is woken up and the
    allocator retries. Due to the way kswapd reclaims zones below the high
    watermark while a zone can be allocated from when it is above the low
    watermark, the allocator may keep kswapd running while kswapd reclaim
    ensures that the page allocator can keep allocating from the first zone in
    the zonelist for extended periods of time. Meanwhile the other zones
    rarely see new allocations and thus get aged much slower in comparison.

    The result is that the occasional page placed in lower zones gets
    relatively more time in memory, even gets promoted to the active list
    after its peers have long been evicted. Meanwhile, the bulk of the
    working set may be thrashing on the preferred zone even though there may
    be significant amounts of memory available in the lower zones.

    Even the most basic test -- repeatedly reading a file slightly bigger than
    memory -- shows how broken the zone aging is. In this scenario, no single
    page should be able to stay in memory long enough to get referenced twice
    and activated, but activation happens in spades:

    $ grep active_file /proc/zoneinfo
    nr_inactive_file 0
    nr_active_file 0
    nr_inactive_file 0
    nr_active_file 8
    nr_inactive_file 1582
    nr_active_file 11994
    $ cat data data data data >/dev/null
    $ grep active_file /proc/zoneinfo
    nr_inactive_file 0
    nr_active_file 70
    nr_inactive_file 258753
    nr_active_file 443214
    nr_inactive_file 149793
    nr_active_file 12021

    Fix this with a very simple round robin allocator. Each zone is allowed a
    batch of allocations that is proportional to the zone's size, after which
    it is treated as full. The batch counters are reset when all zones have
    been tried and the allocator enters the slowpath and kicks off kswapd
    reclaim. Allocation and reclaim is now fairly spread out to all
    available/allowable zones:

    $ grep active_file /proc/zoneinfo
    nr_inactive_file 0
    nr_active_file 0
    nr_inactive_file 174
    nr_active_file 4865
    nr_inactive_file 53
    nr_active_file 860
    $ cat data data data data >/dev/null
    $ grep active_file /proc/zoneinfo
    nr_inactive_file 0
    nr_active_file 0
    nr_inactive_file 666622
    nr_active_file 4988
    nr_inactive_file 190969
    nr_active_file 937

    When zone_reclaim_mode is enabled, allocations will now spread out to all
    zones on the local node, not just the first preferred zone (which on a 4G
    node might be a tiny Normal zone).

    Signed-off-by: Johannes Weiner
    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Paul Bolle
    Cc: Zlatko Calusic
    Tested-by: Kevin Hilman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Allocations that do not have to respect the watermarks are rare
    high-priority events. Reorder the code such that per-zone dirty limits
    and future checks important only to regular page allocations are ignored
    in these extraordinary situations.

    Signed-off-by: Johannes Weiner
    Cc: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Paul Bolle
    Tested-by: Zlatko Calusic
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    We should not compare loop+1 with the loop end inside the loop body. Just
    duplicate the two lines of code to avoid it.

    That will help a bit when we have a huge number of pages on a system with
    16TiB of memory.

    Signed-off-by: Yinghai Lu
    Cc: Mel Gorman
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     
    In the current code, the value of fallback_migratetype that is printed
    via the mm_page_alloc_extfrag tracepoint is the value of the
    migratetype *after* it has been set to the preferred migratetype (if the
    ownership was changed). Obviously that wouldn't have been the original
    intent. (We already have a separate 'change_ownership' field to tell
    whether the ownership of the pageblock was changed from the
    fallback_migratetype to the preferred type.)

    The intent of the fallback_migratetype field is to show the migratetype
    from which we borrowed pages in order to satisfy the allocation request.
    So fix the code to print that value correctly.

    Signed-off-by: Srivatsa S. Bhat
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Cody P Schafer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Srivatsa S. Bhat
     
  • The free-page stealing code in __rmqueue_fallback() is somewhat hard to
    follow, and has an incredible amount of subtlety hidden inside!

    First off, there is a minor bug in the reporting of change-of-ownership of
    pageblocks. Under some conditions, we try to move up to
    pageblock_nr_pages pages to the preferred allocation list. But
    we change the ownership of that pageblock to the preferred type only if we
    manage to successfully move at least half of that pageblock (or if
    page_group_by_mobility_disabled is set).

    However, the current code ignores the latter part and sets the
    'migratetype' variable to the preferred type, irrespective of whether we
    actually changed the pageblock migratetype of that block or not. So, the
    page_alloc_extfrag tracepoint can end up printing incorrect info (i.e.,
    'change_ownership' might be shown as 1 when it must have been 0).

    So fixing this involves moving the update of the 'migratetype' variable to
    the right place. But looking closer, we observe that the 'migratetype'
    variable is used subsequently for checks such as "is_migrate_cma()".
    Obviously the intent there is to check if the *fallback* type is
    MIGRATE_CMA, but since we already set the 'migratetype' variable to
    start_migratetype, we end up checking if the *preferred* type is
    MIGRATE_CMA!!

    To make things more interesting, this actually doesn't cause a bug in
    practice, because we never change *anything* if the fallback type is CMA.

    So, restructure the code in such a way that it is trivial to understand
    what is going on, and also fix the above mentioned bug. And while at it,
    also add a comment explaining the subtlety behind the migratetype used in
    the call to expand().

    [akpm@linux-foundation.org: remove unneeded `inline', small coding-style fix]
    Signed-off-by: Srivatsa S. Bhat
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Cody P Schafer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Srivatsa S. Bhat
     
  • Fix all errors reported by checkpatch and some small spelling mistakes.

    Signed-off-by: Pintu Kumar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pintu Kumar