07 Sep, 2017

2 commits

  • For ages we have been relying on the TIF_MEMDIE thread flag to mark OOM
    victims and then, among other things, to give these threads full access
    to memory reserves. There are a few shortcomings of this implementation,
    though.

    First of all and the most serious one is that the full access to memory
    reserves is quite dangerous because we leave no safety room for the
    system to operate and potentially do last emergency steps to move on.

    Secondly this flag is per task_struct while the OOM killer operates on
    mm_struct granularity so all processes sharing the given mm are killed.
    Giving the full access to all these task_structs could lead to a quick
    memory reserves depletion. We have tried to reduce this risk by giving
    TIF_MEMDIE only to the main thread and to the currently allocating task,
    but that doesn't really solve the problem, while it surely opens up room
    for corner cases - e.g. GFP_NO{FS,IO} requests might loop inside the
    allocator without access to memory reserves because a particular thread
    was not the group leader.

    Now that we have the oom reaper and that all oom victims are reapable
    after 1b51e65eab64 ("oom, oom_reaper: allow to reap mm shared by the
    kthreads") we can be more conservative and grant only partial access to
    memory reserves because there are reasonable chances of the parallel
    memory freeing. We still want some access to reserves because we do not
    want other consumers to eat up the victim's freed memory. oom victims
    will still contend with __GFP_HIGH users but those shouldn't be so
    aggressive to starve oom victims completely.

    Introduce an ALLOC_OOM flag and give all tsk_is_oom_victim tasks access
    to half of the reserves. This makes the access to reserves independent
    of which task has passed through mark_oom_victim. Also drop any usage
    of TIF_MEMDIE from the page allocator proper and replace it with
    tsk_is_oom_victim as well, finally making page_alloc.c completely
    TIF_MEMDIE free.

    CONFIG_MMU=n doesn't have oom reaper so let's stick to the original
    ALLOC_NO_WATERMARKS approach.

    There is a demand to make the oom killer memcg aware which will imply
    many tasks killed at once. This change will allow such a usecase
    without worrying about complete memory reserves depletion.
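
    A minimal sketch of what the partial access could look like in the
    watermark check (the exact placement in __zone_watermark_ok() and the
    surrounding ALLOC_HIGH/ALLOC_HARDER handling are assumptions based on
    the description above):

      /* __zone_watermark_ok(): lower the threshold for privileged callers */
      if (alloc_flags & ALLOC_HIGH)
              min -= min / 2;

      if (unlikely(alloc_harder)) {
              /*
               * OOM victims can try harder than normal ALLOC_HARDER users
               * on the grounds that they are about to exit and free memory,
               * but they only get half of the reserve, not all of it.
               */
              if (alloc_flags & ALLOC_OOM)
                      min -= min / 2;
              else
                      min -= min / 4;
      }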

    Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • build_all_zonelists gets a zone parameter to initialize the zone's
    pagesets. There is only a single user which passes a non-NULL zone
    parameter, and that one doesn't really need the rest of
    build_all_zonelists (see commit 6dcd73d7011b ("memory-hotplug: allocate
    zone's pcp before onlining pages")).

    Therefore remove setup_zone_pageset from build_all_zonelists and call it
    from its only user directly. This also removes a pointless zonelists
    rebuilding, which is always good.
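
    A sketch of the resulting call sites (the memory hotplug onlining path
    as the single remaining caller is inferred from the commit referenced
    above; exact names and placement are assumptions):

      /* online_pages(): the zone is about to get its first pages */
      if (!populated_zone(zone)) {
              need_zonelists_rebuild = 1;
              setup_zone_pageset(zone);       /* called directly now */
      }

      /* ... and build_all_zonelists() no longer needs a zone argument */
      build_all_zonelists(pgdat);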

    Link: http://lkml.kernel.org/r/20170721143915.14161-5-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Shaohua Li
    Cc: Toshi Kani
    Cc: Wen Congyang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

03 Aug, 2017

1 commit

  • Nadav Amit identified a theoretical race between page reclaim and
    mprotect due to TLB flushes being batched outside of the PTL being held.

    He described the race as follows:

    CPU0                                CPU1
    ----                                ----
                                        user accesses memory using RW PTE
                                        [PTE now cached in TLB]
    try_to_unmap_one()
    ==> ptep_get_and_clear()
    ==> set_tlb_ubc_flush_pending()
                                        mprotect(addr, PROT_READ)
                                        ==> change_pte_range()
                                        ==> [ PTE non-present - no flush ]

                                        user writes using cached RW PTE
    ...

    try_to_unmap_flush()

    The same type of race exists for reads when protecting for PROT_NONE and
    also exists for operations that can leave an old TLB entry behind such
    as munmap, mremap and madvise.

    For some operations like mprotect, it's not necessarily a data integrity
    issue but it is a correctness issue, as there is a window where an
    mprotect that limits access still allows access. For munmap, it's
    potentially a data integrity issue, although the race is massive, as an
    munmap, mmap and return to userspace must all complete within the window
    between reclaim dropping the PTL and flushing the TLB. However, it's
    theoretically possible, so handle this issue by flushing the mm if
    reclaim is potentially batching TLB flushes at the time.
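
    A minimal sketch of the fix, a helper called under the PTL before PTEs
    are modified by mprotect/munmap-like operations; the helper name and the
    mm_struct field tracking a pending batched flush are assumptions:

      /* flush a TLB flush that reclaim batched for this mm, if any */
      void flush_tlb_batched_pending(struct mm_struct *mm)
      {
              if (mm->tlb_flush_batched) {
                      flush_tlb_mm(mm);

                      /* don't reorder the clearing before the flush */
                      barrier();
                      mm->tlb_flush_batched = false;
              }
      }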

    Other instances where a flush is required for a present pte should be ok
    as either the page lock is held preventing parallel reclaim or a page
    reference count is elevated preventing a parallel free leading to
    corruption. In the case of page_mkclean there isn't an obvious path
    that userspace could take advantage of without using the operations that
    are guarded by this patch. Other users such as gup, which races with
    reclaim, look only at PTEs. Huge page variants should be ok as they
    don't race with reclaim. mincore only looks at PTEs. userfault should
    also be ok: if a parallel reclaim takes place, it will either fault the
    page back in or read some of the data before the flush occurs and
    trigger a fault.

    Note that a variant of this patch was acked by Andy Lutomirski but this
    was for the x86 parts on top of his PCID work which didn't make the 4.13
    merge window as expected. His ack is dropped from this version and
    there will be a follow-on patch on top of PCID that will include his
    ack.

    [akpm@linux-foundation.org: tweak comments]
    [akpm@linux-foundation.org: fix spello]
    Link: http://lkml.kernel.org/r/20170717155523.emckq2esjro6hf3z@suse.de
    Reported-by: Nadav Amit
    Signed-off-by: Mel Gorman
    Cc: Andy Lutomirski
    Cc: [v4.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

13 Jul, 2017

1 commit

  • __GFP_REPEAT was designed to add a retry-but-eventually-fail semantic to
    the page allocator. This has been true, but only for allocation requests
    larger than PAGE_ALLOC_COSTLY_ORDER. It has always been ignored for
    smaller sizes. This is a bit unfortunate because there is no way to
    express the same semantic for those requests and they are considered too
    important to fail, so they might end up looping in the page allocator
    forever, similarly to GFP_NOFAIL requests.

    Now that the whole tree has been cleaned up and accidental or misled
    usage of the __GFP_REPEAT flag has been removed for !costly requests, we
    can give the original flag a better name and, more importantly, a more
    useful semantic. Let's rename it to __GFP_RETRY_MAYFAIL, which tells the
    user that the allocator will try really hard but there is no promise of
    success. This works independently of the order and overrides the default
    allocator behavior. Page allocator users have several levels of
    guarantee vs. cost options (take GFP_KERNEL as an example):

    - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
    attempt to free memory at all. The most light weight mode which even
    doesn't kick the background reclaim. Should be used carefully because
    it might deplete the memory and the next user might hit the more
    aggressive reclaim

    - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic
    allocation without any attempt to free memory from the current
    context but can wake kswapd to reclaim memory if the zone is below
    the low watermark. Can be used from either atomic contexts or when
    the request is a performance optimization and there is another
    fallback for a slow path.

    - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
    non sleeping allocation with an expensive fallback so it can access
    some portion of memory reserves. Usually used from interrupt/bh
    context with an expensive slow path fallback.

    - GFP_KERNEL - both background and direct reclaim are allowed and the
    _default_ page allocator behavior is used. That means that !costly
    allocation requests are basically nofail but there is no guarantee of
    that behavior so failures have to be checked properly by callers
    (e.g. OOM killer victim is allowed to fail currently).

    - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
    and all allocation requests fail early rather than cause disruptive
    reclaim (one round of reclaim in this implementation). The OOM killer
    is not invoked.

    - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
    behavior and all allocation requests try really hard. The request
    will fail if the reclaim cannot make any progress. The OOM killer
    won't be triggered.

    - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
    and all allocation requests will loop endlessly until they succeed.
    This might be really dangerous especially for larger orders.

    Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL because
    they already rely on this semantic. No new users are added.
    __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
    there is no progress and we have already passed the OOM point.

    This means that all the reclaim opportunities have been exhausted except
    the most disruptive one (the OOM killer) and a user defined fallback
    behavior is more sensible than keeping on retrying in the page allocator.
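
    As an illustration of the intended usage (a hypothetical caller, not
    part of the patch), a request that prefers its own fallback over
    disrupting the system could look like:

      void *buf;

      /* try hard, but fail rather than invoke the OOM killer */
      buf = kmalloc(size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
      if (!buf)
              buf = vmalloc(size);    /* user defined fallback */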

    [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
    [mhocko@suse.com: semantic fix]
    Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
    [mhocko@kernel.org: address other thing spotted by Vlastimil]
    Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Alex Belits
    Cc: Chris Wilson
    Cc: Christoph Hellwig
    Cc: Darrick J. Wong
    Cc: David Daney
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: NeilBrown
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

09 May, 2017

3 commits

  • The main goal of direct compaction is to form a high-order page for
    allocation, but it should also help against long-term fragmentation when
    possible.

    Most lower-than-pageblock-order compactions are for non-movable
    allocations, which means that if we compact in a movable pageblock and
    terminate as soon as we create the high-order page, it's unlikely that
    the fallback heuristics will claim the whole block. Instead there might
    be a single unmovable page in a pageblock full of movable pages, and the
    next unmovable allocation might pick another pageblock and increase
    long-term fragmentation.

    To help against such scenarios, this patch changes the termination
    criteria for compaction so that the current pageblock is finished even
    though the high-order page already exists. Note that it might be
    possible that the high-order page formed elsewhere in the zone due to
    parallel activity, but this patch doesn't try to detect that.

    This is only done with sync compaction, because async compaction is
    limited to pageblock of the same migratetype, where it cannot result in
    a migratetype fallback. (Async compaction also eagerly skips
    order-aligned blocks where isolation fails, which is against the goal of
    migrating away as much of the pageblock as possible.)

    As a result of this patch, long-term memory fragmentation should be
    reduced.
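
    A sketch of the changed termination check in __compact_finished() once a
    suitable fallback is found (condition and field names are assumptions
    based on the description above):

      /* a high-order page can now be allocated from this pageblock */
      if (migratetype == MIGRATE_MOVABLE)
              return COMPACT_SUCCESS;

      /*
       * Stealing for a non-movable allocation: with sync compaction,
       * keep migrating until the scanner reaches a pageblock boundary
       * so the whole block can be claimed by the fallback heuristics.
       */
      if (cc->mode == MIGRATE_ASYNC ||
          IS_ALIGNED(cc->migrate_pfn, pageblock_nr_pages))
              return COMPACT_SUCCESS;

      return COMPACT_CONTINUE;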

    In testing based on 4.9 kernel with stress-highalloc from mmtests
    configured for order-4 GFP_KERNEL allocations, this patch has reduced
    the number of unmovable allocations falling back to movable pageblocks
    by 20%. The number

    Link: http://lkml.kernel.org/r/20170307131545.28577-9-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Preparation patch. We are going to need migratetype at lower layers
    than compact_zone() and compact_finished().

    Link: http://lkml.kernel.org/r/20170307131545.28577-7-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Patch series "try to reduce fragmenting fallbacks", v3.

    Last year, Johannes Weiner reported a regression in page mobility
    grouping [1] and while the exact cause was not found, I've come up with
    some ways to improve it by reducing the number of allocations falling
    back to a different migratetype and causing permanent fragmentation.

    The series was tested with mmtests stress-highalloc modified to do
    GFP_KERNEL order-4 allocations, on 4.9 with "mm, vmscan: fix zone
    balance check in prepare_kswapd_sleep" (without that, kcompactd indeed
    wasn't woken up) on UMA machine with 4GB memory. There were 5 repeats
    of each run, as the extfrag stats are quite volatile (note the stats
    below are sums, not averages, as it was less perl hacking for me).

    Success rates are the same, already high due to the low allocation order
    used, so I'm not including them.

    Compaction stats:
    (the patches are stacked, and I haven't measured the non-functional-changes
    patches separately)

    patch 1 patch 2 patch 3 patch 4 patch 7 patch 8
    Compaction stalls 22449 24680 24846 19765 22059 17480
    Compaction success 12971 14836 14608 10475 11632 8757
    Compaction failures 9477 9843 10238 9290 10426 8722
    Page migrate success 3109022 3370438 3312164 1695105 1608435 2111379
    Page migrate failure 911588 1149065 1028264 1112675 1077251 1026367
    Compaction pages isolated 7242983 8015530 7782467 4629063 4402787 5377665
    Compaction migrate scanned 980838938 987367943 957690188 917647238 947155598 1018922197
    Compaction free scanned 557926893 598946443 602236894 594024490 541169699 763651731
    Compaction cost 10243 10578 10304 8286 8398 9440

    Compaction stats are mostly within noise until patch 4, which decreases
    the number of compactions, and migrations. Part of that could be due to
    more pageblocks marked as unmovable, and async compaction skipping
    those. This changes a bit with patch 7, but not so much. Patch 8
    increases free scanner stats and migrations, which comes from the
    changed termination criteria. Interestingly, the number of compactions
    decreases - probably the fully compacted pageblock satisfies multiple
    subsequent allocations, so it amortizes.

    Next comes the extfrag tracepoint, where "fragmenting" means that an
    allocation had to fallback to a pageblock of another migratetype which
    wasn't fully free (which is almost all of the fallbacks). I have
    locally added another tracepoint for "Page steal" into
    steal_suitable_fallback() which triggers in situations where we are
    allowed to do move_freepages_block(). If we decide to also do
    set_pageblock_migratetype(), it's "Pages steal with pageblock" with a
    breakdown of which allocation migratetype we are stealing for and from
    which fallback migratetype. The last part "due to counting" comes from
    patch 4 and counts the events where the counting of movable pages
    allowed us to change pageblock's migratetype, while the number of free
    pages alone wouldn't be enough to cross the threshold.

    patch 1 patch 2 patch 3 patch 4 patch 7 patch 8
    Page alloc extfrag event 10155066 8522968 10164959 15622080 13727068 13140319
    Extfrag fragmenting 10149231 8517025 10159040 15616925 13721391 13134792
    Extfrag fragmenting for unmovable 159504 168500 184177 97835 70625 56948
    Extfrag fragmenting unmovable placed with movable 153613 163549 172693 91740 64099 50917
    Extfrag fragmenting unmovable placed with reclaim. 5891 4951 11484 6095 6526 6031
    Extfrag fragmenting for reclaimable 4738 4829 6345 4822 5640 5378
    Extfrag fragmenting reclaimable placed with movable 1836 1902 1851 1579 1739 1760
    Extfrag fragmenting reclaimable placed with unmov. 2902 2927 4494 3243 3901 3618
    Extfrag fragmenting for movable 9984989 8343696 9968518 15514268 13645126 13072466
    Pages steal 179954 192291 210880 123254 94545 81486
    Pages steal with pageblock 22153 18943 20154 33562 29969 33444
    Pages steal with pageblock for unmovable 14350 12858 13256 20660 19003 20852
    Pages steal with pageblock for unmovable from mov. 12812 11402 11683 19072 17467 19298
    Pages steal with pageblock for unmovable from recl. 1538 1456 1573 1588 1536 1554
    Pages steal with pageblock for movable 7114 5489 5965 11787 10012 11493
    Pages steal with pageblock for movable from unmov. 6885 5291 5541 11179 9525 10885
    Pages steal with pageblock for movable from recl. 229 198 424 608 487 608
    Pages steal with pageblock for reclaimable 689 596 933 1115 954 1099
    Pages steal with pageblock for reclaimable from unmov. 273 219 537 658 547 667
    Pages steal with pageblock for reclaimable from mov. 416 377 396 457 407 432
    Pages steal with pageblock due to counting 11834 10075 7530
    ... for unmovable 8993 7381 4616
    ... for movable 2792 2653 2851
    ... for reclaimable 49 41 63

    What we can see is that "Extfrag fragmenting for unmovable" and "...
    placed with movable" drops with almost each patch, which is good as we
    are polluting less movable pageblocks with unmovable pages.

    The most significant change is patch 4 with movable page counting. On
    the other hand it increases "Extfrag fragmenting for movable" by 50%.
    "Pages steal" drops though, so these movable allocation fallbacks find
    only small free pages and are not allowed to steal whole pageblocks
    back. "Pages steal with pageblock" raises, because the patch increases
    the chances of pageblock migratetype changes to happen. This affects
    all migratetypes.

    The summary is that patch 4 is not a clear win wrt these stats, but I
    believe that the tradeoff it makes is a good one. There's less
    pollution of movable pageblocks by unmovable allocations. There's less
    stealing between pageblocks, and the steals that remain have a higher
    chance of also changing the migratetype of the pageblock itself, so it
    should more faithfully reflect the migratetype of the pages within the
    pageblock.
    The increase of movable allocations falling back to unmovable pageblock
    might look dramatic, but those allocations can be migrated by compaction
    when needed, and other patches in the series (7-9) improve that aspect.

    Patches 7 and 8 continue the trend of reduced unmovable fallbacks and
    also reduce the impact on movable fallbacks from patch 4.

    [1] https://www.spinics.net/lists/linux-mm/msg114237.html

    This patch (of 8):

    Currently there are (mostly by accident) no holes in struct
    compact_control (on x86_64), but we are going to add more bool flags, so
    place them all together at the end of the structure. While at it, just
    order all fields from largest to smallest.

    Link: http://lkml.kernel.org/r/20170307131545.28577-2-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

04 May, 2017

3 commits

  • Introduce two helpers, is_migrate_highatomic() and is_migrate_highatomic_page().

    Simplify the code, no functional changes.
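
    The helpers are presumably simple static inlines along these lines (a
    sketch; where exactly they live is not stated above):

      static inline bool is_migrate_highatomic(enum migratetype migratetype)
      {
              return migratetype == MIGRATE_HIGHATOMIC;
      }

      static inline bool is_migrate_highatomic_page(struct page *page)
      {
              return get_pageblock_migratetype(page) == MIGRATE_HIGHATOMIC;
      }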

    [akpm@linux-foundation.org: use static inlines rather than macros, per mhocko]
    Link: http://lkml.kernel.org/r/58B94F15.6060606@huawei.com
    Signed-off-by: Xishi Qiu
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • NR_PAGES_SCANNED counts number of pages scanned since the last page free
    event in the allocator. This was used primarily to measure the
    reclaimability of zones and nodes, and determine when reclaim should
    give up on them. In that role, it has been replaced in the preceding
    patches by a different mechanism.

    Being implemented as an efficient vmstat counter, it was automatically
    exported to userspace as well. It's however unlikely that anyone
    outside the kernel is using this counter in any meaningful way.

    Remove the counter and the unused pgdat_reclaimable().

    Link: http://lkml.kernel.org/r/20170228214007.5621-8-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Hillf Danton
    Acked-by: Michal Hocko
    Cc: Jia He
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and
    cleanups".

    Jia reported a scenario in which the kswapd of a node indefinitely spins
    at 100% CPU usage. We have seen similar cases at Facebook.

    The kernel's current method of judging its ability to reclaim a node (or
    whether to back off and sleep) is based on the amount of scanned pages
    in proportion to the amount of reclaimable pages. In Jia's and our
    scenarios, there are no reclaimable pages in the node, however, and the
    condition for backing off is never met. Kswapd busyloops in an attempt
    to restore the watermarks while having nothing to work with.

    This series reworks the definition of an unreclaimable node based not on
    scanning but on whether kswapd is able to actually reclaim pages in
    MAX_RECLAIM_RETRIES (16) consecutive runs. This is the same criteria
    the page allocator uses for giving up on direct reclaim and invoking the
    OOM killer. If it cannot free any pages, kswapd will go to sleep and
    leave further attempts to direct reclaim invocations, which will either
    make progress and re-enable kswapd, or invoke the OOM killer.

    Patch #1 fixes the immediate problem Jia reported, the remainder are
    smaller fixlets, cleanups, and overall phasing out of the old method.

    Patch #6 is the odd one out. It's a nice cleanup to get_scan_count(),
    and directly related to #5, but in itself not relevant to the series.

    If the whole series is too ambitious for 4.11, I would consider the
    first three patches fixes, the rest cleanups.

    This patch (of 9):

    Jia He reports a problem with kswapd spinning at 100% CPU when
    requesting more hugepages than memory available in the system:

    $ echo 4000 >/proc/sys/vm/nr_hugepages

    top - 13:42:59 up 3:37, 1 user, load average: 1.09, 1.03, 1.01
    Tasks: 1 total, 1 running, 0 sleeping, 0 stopped, 0 zombie
    %Cpu(s): 0.0 us, 12.5 sy, 0.0 ni, 85.5 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
    KiB Mem: 31371520 total, 30915136 used, 456384 free, 320 buffers
    KiB Swap: 6284224 total, 115712 used, 6168512 free. 48192 cached Mem

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    76 root 20 0 0 0 0 R 100.0 0.000 217:17.29 kswapd3

    At that time, there are no reclaimable pages left in the node, but as
    kswapd fails to restore the high watermarks it refuses to go to sleep.

    Kswapd needs to back away from nodes that fail to balance. Up until
    commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of
    nodes") kswapd had such a mechanism. It considered zones whose
    theoretically reclaimable pages it had reclaimed six times over as
    unreclaimable and backed away from them. This guard was erroneously
    removed as the patch changed the definition of a balanced node.

    However, simply restoring this code wouldn't help in the case reported
    here: there *are* no reclaimable pages that could be scanned until the
    threshold is met. Kswapd would stay awake anyway.

    Introduce a new and much simpler way of backing off. If kswapd runs
    through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
    page, make it back off from the node. This is the same number of shots
    direct reclaim takes before declaring OOM. Kswapd will go to sleep on
    that node until a direct reclaimer manages to reclaim some pages, thus
    proving the node reclaimable again.
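
    A sketch of the backoff, assuming a per-node failure counter (the field
    name and exact reset point are assumptions):

      /* balance_pgdat(): remember runs that reclaimed nothing */
      if (!sc.nr_reclaimed)
              pgdat->kswapd_failures++;

      /* prepare_kswapd_sleep(): too many failed runs, let kswapd sleep */
      if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
              return true;

      /* direct reclaim: progress proves the node reclaimable again */
      if (nr_reclaimed)
              pgdat->kswapd_failures = 0;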

    [hannes@cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count]
    Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org
    [shakeelb@google.com: fix condition for throttle_direct_reclaim]
    Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com
    Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Signed-off-by: Shakeel Butt
    Reported-by: Jia He
    Tested-by: Jia He
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

08 Apr, 2017

1 commit

  • We currently have two specific WQ_RECLAIM workqueues in the mm code:
    vmstat_wq for updating pcp stats and lru_add_drain_wq dedicated to
    draining per-cpu lru caches. This seems more than necessary because both
    can run on a single WQ. Neither blocks on locks requiring a memory
    allocation nor performs any allocations itself. We will save one
    rescuer thread this way.

    On the other hand, drain_all_pages() queues work on the system wq, which
    doesn't have a rescuer and so depends on memory allocation (when all
    workers are stuck allocating, new ones cannot be created).

    Initially we thought this would be more of a theoretical problem but
    Hugh Dickins has reported:

    : 4.11-rc has been giving me hangs after hours of swapping load. At
    : first they looked like memory leaks ("fork: Cannot allocate memory");
    : but for no good reason I happened to do "cat /proc/sys/vm/stat_refresh"
    : before looking at /proc/meminfo one time, and the stat_refresh stuck
    : in D state, waiting for completion of flush_work like many kworkers.
    : kthreadd waiting for completion of flush_work in drain_all_pages().

    This worker should be using WQ_RECLAIM as well in order to guarantee
    forward progress. We can reuse the same one as for lru draining and
    vmstat.
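
    A sketch of the consolidation: one rescuer-backed workqueue shared by
    vmstat, lru draining and the pcp drain (the workqueue name is an
    assumption; the workqueue API spells the flag WQ_MEM_RECLAIM):

      static struct workqueue_struct *mm_percpu_wq;

      static int __init init_mm_percpu_wq(void)
      {
              /* WQ_MEM_RECLAIM guarantees a rescuer thread */
              mm_percpu_wq = alloc_workqueue("mm_percpu_wq", WQ_MEM_RECLAIM, 0);
              return mm_percpu_wq ? 0 : -ENOMEM;
      }

      /* drain_all_pages() then queues its per-cpu drain work here */
      queue_work_on(cpu, mm_percpu_wq, work);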

    Link: http://lkml.kernel.org/r/20170307131751.24936-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Suggested-by: Tetsuo Handa
    Acked-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Tested-by: Yang Li
    Tested-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

25 Feb, 2017

1 commit

  • Current rmap code can miss a VMA that maps a PTE-mapped THP if the first
    subpage of the THP was unmapped from the VMA.

    We need to walk rmap for the whole range of offsets that THP covers, not
    only the first one.

    vma_address() also needs to be corrected to check the range instead of
    the first subpage.

    Link: http://lkml.kernel.org/r/20170129173858.45174-6-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Hillf Danton
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Srikar Dronamraju
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

23 Feb, 2017

3 commits

  • Logic on whether we can reap pages from the VMA should match what we
    have in madvise_dontneed(). In particular, we should skip VM_PFNMAP
    VMAs, but we don't at the moment.

    Let's just extract the condition under which we can shoot down pages
    from a VMA with MADV_DONTNEED into a separate function and use it in
    both places.
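
    The extracted predicate is presumably something like this (a sketch; the
    helper name is an assumption):

      /* can pages be shot down from this VMA the MADV_DONTNEED way? */
      static inline bool can_madv_dontneed_vma(struct vm_area_struct *vma)
      {
              return !(vma->vm_flags & (VM_LOCKED | VM_HUGETLB | VM_PFNMAP));
      }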

    Link: http://lkml.kernel.org/r/20170118122429.43661-4-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • A "compact_daemon_wake" vmstat exists that represents the number of
    times kcompactd has woken up. This doesn't represent how much work it
    actually did, though.

    It's useful to understand how much compaction work is being done by
    kcompactd versus other methods such as direct compaction and explicitly
    triggered per-node (or system) compaction.

    This adds two new vmstats: "compact_daemon_migrate_scanned" and
    "compact_daemon_free_scanned" to represent the number of pages kcompactd
    has scanned as part of its migration scanner and freeing scanner,
    respectively.

    These values are still accounted for in the general
    "compact_migrate_scanned" and "compact_free_scanned" for compatibility.

    It could be argued that explicitly triggered compaction could also be
    tracked separately, and that could be added if others find it useful.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1612071749390.69852@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • In __free_one_page() we do the buddy merging arithmetics on "page/buddy
    index", which is just the lower MAX_ORDER bits of pfn. The operations
    we do that affect the higher bits are bitwise AND and subtraction (in
    that order), where the final result will be the same with the higher
    bits left unmasked, as long as these bits are equal for both buddies -
    which must be true by the definition of a buddy.

    We can therefore use pfns directly instead of "index" and skip the
    zeroing of the >MAX_ORDER bits. This can help a bit by itself, although
    the compiler might be smart enough already. It also helps the next patch
    to avoid page_to_pfn() for memory hole checks.
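
    The buddy arithmetic on pfns then boils down to a single XOR; a sketch
    of the helper (the name is an assumption):

      /*
       * Buddies of a given order differ only in the bit at position
       * 'order'; the bits above MAX_ORDER are equal by definition of a
       * buddy, so the pfn does not need to be masked down to an index.
       */
      static inline unsigned long
      __find_buddy_pfn(unsigned long page_pfn, unsigned int order)
      {
              return page_pfn ^ (1UL << order);
      }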

    Link: http://lkml.kernel.org/r/20161216120009.20064-1-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: "Kirill A. Shutemov"
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

26 Dec, 2016

1 commit

  • Add a new page flag, PageWaiters, to indicate the page waitqueue has
    tasks waiting. This can be tested rather than testing waitqueue_active
    which requires another cacheline load.

    This bit is always set when the page has tasks on page_waitqueue(page),
    and is set and cleared under the waitqueue lock. It may be set when
    there are no tasks on the waitqueue, which will cause a harmless extra
    wakeup check that will clear the bit.

    The generic bit-waitqueue infrastructure is no longer used for pages.
    Instead, waitqueues are used directly with a custom key type. The
    generic code was not flexible enough to have PageWaiters manipulation
    under the waitqueue lock (which simplifies concurrency).

    This improves the performance of page lock intensive microbenchmarks by
    2-3%.

    Putting two bits in the same word opens the opportunity to remove the
    memory barrier between clearing the lock bit and testing the waiters
    bit, after some work on the arch primitives (e.g., ensuring memory
    operand widths match and cover both bits).

    Signed-off-by: Nicholas Piggin
    Cc: Dave Hansen
    Cc: Bob Peterson
    Cc: Steven Whitehouse
    Cc: Andrew Lutomirski
    Cc: Andreas Gruenbacher
    Cc: Peter Zijlstra
    Cc: Mel Gorman
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     

15 Dec, 2016

2 commits

  • Add orig_pte field to vm_fault structure to allow ->page_mkwrite
    handlers to fully handle the fault.

    This also allows us to save some passing of extra arguments around.

    Link: http://lkml.kernel.org/r/1479460644-25076-8-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Currently we have two different structures for passing fault information
    around - struct vm_fault and struct fault_env. DAX will need more
    information in struct vm_fault to handle its faults, so the content of
    that structure would become even closer to fault_env. Furthermore it
    would need to generate struct fault_env to be able to call some of the
    generic functions. So at this point I don't think there's much use in
    keeping these two structures separate. Just embed into struct vm_fault
    all that is needed to use it for both purposes.

    Link: http://lkml.kernel.org/r/1479460644-25076-2-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Kirill A. Shutemov
    Cc: Ross Zwisler
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

08 Oct, 2016

2 commits

  • Several people have reported premature OOMs for order-2 allocations
    (stack) due to OOM rework in 4.7. In the scenario (parallel kernel
    build and dd writing to two drives) many pageblocks get marked as
    Unmovable and compaction free scanner struggles to isolate free pages.
    Joonsoo Kim pointed out that the free scanner skips pageblocks that are
    not movable to prevent filling them and forcing non-movable allocations
    to fall back to other pageblocks. Such a heuristic makes sense to help
    prevent long-term fragmentation, but premature OOMs are a relatively
    more urgent problem. As a compromise, this patch disables the heuristic
    only for the ultimate compaction priority.

    Link: http://lkml.kernel.org/r/20160906135258.18335-5-vbabka@suse.cz
    Reported-by: Ralf-Peter Rohbeck
    Reported-by: Arkadiusz Miskiewicz
    Reported-by: Olaf Hering
    Suggested-by: Joonsoo Kim
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Patch series "make direct compaction more deterministic")

    This is mostly a followup to Michal's oom detection rework, which
    highlighted the need for direct compaction to provide better feedback in
    reclaim/compaction loop, so that it can reliably recognize when
    compaction cannot make further progress, and allocation should invoke
    OOM killer or fail. We've discussed this at LSF/MM [1] where I proposed
    expanding the async/sync migration mode used in compaction to more
    general "priorities". This patchset adds one new priority that just
    overrides all the heuristics and makes compaction fully scan all zones.
    I don't currently think that we need more fine-grained priorities, but
    we'll see. Other than that there's some smaller fixes and cleanups,
    mainly related to the THP-specific hacks.

    I've tested this with stress-highalloc in GFP_KERNEL order-4 and
    THP-like order-9 scenarios. There's some improvement for compaction
    stats for the order-4, which is likely due to the better watermarks
    handling. In the previous version I reported mostly noise wrt
    compaction stats, and decreased direct reclaim - now the reclaim is
    without difference. I believe this is due to the less aggressive
    compaction priority increase in patch 6.

    "before" is a mmotm tree prior to 4.7 release plus the first part of the
    series that was sent and merged separately

    before after
    order-4:

    Compaction stalls 27216 30759
    Compaction success 19598 25475
    Compaction failures 7617 5283
    Page migrate success 370510 464919
    Page migrate failure 25712 27987
    Compaction pages isolated 849601 1041581
    Compaction migrate scanned 143146541 101084990
    Compaction free scanned 208355124 144863510
    Compaction cost 1403 1210

    order-9:

    Compaction stalls 7311 7401
    Compaction success 1634 1683
    Compaction failures 5677 5718
    Page migrate success 194657 183988
    Page migrate failure 4753 4170
    Compaction pages isolated 498790 456130
    Compaction migrate scanned 565371 524174
    Compaction free scanned 4230296 4250744
    Compaction cost 215 203

    [1] https://lwn.net/Articles/684611/

    This patch (of 11):

    A recent patch has added a whole_zone flag that compaction sets when
    scanning starts from the zone boundary, in order to report that the zone
    has been fully scanned in one attempt. For allocations that want to try
    really hard or cannot fail, we will want to introduce a mode where
    scanning the whole zone is guaranteed regardless of the cached positions.

    This patch reuses the whole_zone flag in such a way that if it is already
    passed as true to compaction, the cached scanner positions are ignored.
    Employing this flag during reclaim/compaction loop will be done in the
    next patch. This patch however converts compaction invoked from
    userspace via procfs to use this flag. Before this patch, the cached
    positions were first reset to zone boundaries and then read back from
    struct zone, so there was a window where a parallel compaction could
    replace the reset values, making the manual compaction less effective.
    Using the flag instead of performing reset is more robust.
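
    A sketch of how the flag overrides the cached positions at the start of
    compact_zone() (field names follow common compaction code and are
    assumptions):

      if (cc->whole_zone) {
              /* ignore cached positions and scan the full zone */
              cc->migrate_pfn = zone->zone_start_pfn;
              cc->free_pfn = pageblock_start_pfn(zone_end_pfn(zone) - 1);
      } else {
              cc->migrate_pfn = zone->compact_cached_migrate_pfn[sync];
              cc->free_pfn = zone->compact_cached_free_pfn;
      }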

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20160810091226.6709-2-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Tested-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

29 Jul, 2016

4 commits

  • Async compaction detects contention either due to failing trylock on
    zone->lock or lru_lock, or by need_resched(). Since 1f9efdef4f3f ("mm,
    compaction: khugepaged should not give up due to need_resched()") the
    code got quite complicated to distinguish these two up to the
    __alloc_pages_slowpath() level, so different decisions could be taken
    for khugepaged allocations.

    After the recent changes, khugepaged allocations don't check for
    contended compaction anymore, so we again don't need to distinguish lock
    and sched contention, and simplify the current convoluted code a lot.

    However, I believe it's also possible to simplify even more and
    completely remove the check for contended compaction after the initial
    async compaction for costly orders, which was originally aimed at THP
    page fault allocations. There are several reasons why this can be done
    now:

    - with the new defaults, THP page faults no longer do reclaim/compaction at
    all, unless the system admin has overridden the default, or application has
    indicated via madvise that it can benefit from THP's. In both cases, it
    means that the potential extra latency is expected and worth the benefits.
    - even if reclaim/compaction proceeds after this patch where it previously
    wouldn't, the second compaction attempt is still async and will detect the
    contention and back off, if the contention persists
    - there are still heuristics like deferred compaction and pageblock skip bits
    in place that prevent excessive THP page fault latencies

    Link: http://lkml.kernel.org/r/20160721073614.24395-9-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The fair zone allocation policy interleaves allocation requests between
    zones to avoid an age inversion problem whereby new pages are reclaimed
    to balance a zone. Reclaim is now node-based so this should no longer
    be an issue and the fair zone allocation policy is not free. This patch
    removes it.

    Link: http://lkml.kernel.org/r/1467970510-21195-30-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • As reclaim is now per-node based, convert zone_reclaim to be
    node_reclaim. It is possible that a node will be reclaimed multiple
    times if it has multiple zones, but this is unavoidable without caching
    all nodes traversed so far. The documentation and interface to
    userspace are the same from a configuration perspective and will be
    similar in behaviour unless the node-local allocation requests were also
    limited to lower zones.

    Link: http://lkml.kernel.org/r/1467970510-21195-24-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages on both zone and node
    logic. Most reclaim logic is based on the node counters but the retry
    logic uses the zone counters which do not distinguish inactive and
    active sizes. It would be possible to leave the LRU counters on a
    per-zone basis but it's a heavier calculation across multiple cache
    lines that is much more frequent than the retry checks.

    Other than the LRU counters, this is mostly a mechanical patch but note
    that it introduces a number of anomalies. For example, the scans are
    per-zone but using per-node counters. We also mark a node as congested
    when a zone is congested. This causes weird problems that are fixed
    later but is easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    the nodes being on LRU then there are two potential solutions

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    keeps those pages on a separate list until a reclaim for highmem pages
    arrives that splices the highmem pages back onto the LRU. It potentially
    could be implemented similar to the UNEVICTABLE list.

    That would reduce the skip rate with the potential corner case is that
    highmem pages have to be scanned and reclaimed to free lowmem slab pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

27 Jul, 2016

3 commits

  • The idea is borrowed from Peter's patch from the patchset on speculative
    page faults [1]:

    Instead of passing around the endless list of function arguments,
    replace the lot with a single structure so we can change context without
    endless function signature changes.

    The changes are mostly mechanical with exception of faultaround code:
    filemap_map_pages() got reworked a bit.

    This patch is preparation for the next one.

    [1] http://lkml.kernel.org/r/20141020222841.302891540@infradead.org

    Link: http://lkml.kernel.org/r/1466021202-61880-9-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • This patch adds swapin readahead to khugepaged to improve the THP
    collapse rate. When khugepaged scans pages, a few of them can be in the
    swap area.

    With the patch, THP can collapse 4kB pages into a THP when there are up
    to max_ptes_swap swap ptes in a 2MB range.

    The patch was tested with a test program that allocates 400B of memory,
    writes to it, and then sleeps. I force the system to swap it all out.
    Afterwards, the test program touches the area by writing; it skips a
    page in each 20 pages of the area.

    Without the patch, the system did not do swapin readahead. The THP rate
    was 65% of the program's memory and did not change over time.

    With this patch, after 10 minutes of waiting khugepaged had collapsed
    99% of the program's memory.

    [kirill.shutemov@linux.intel.com: trivial cleanup of exit path of the function]
    [kirill.shutemov@linux.intel.com: __collapse_huge_page_swapin(): drop unused 'pte' parameter]
    [kirill.shutemov@linux.intel.com: do not hold anon_vma lock during swap in]
    Signed-off-by: Ebru Akagunduz
    Acked-by: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Joonsoo Kim
    Cc: Xie XiuQi
    Cc: Cyrill Gorcunov
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Cc: Aneesh Kumar K.V
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ebru Akagunduz
     
  • This patch is motivated by Hugh's and Vlastimil's concern [1].

    There are two ways to get a freepage from the allocator. One is using
    the normal memory allocation API and the other is __isolate_free_page(),
    which is used internally for compaction and pageblock isolation. The
    latter usage is rather tricky since it doesn't do the whole post
    allocation processing done by the normal API.

    One problematic thing I already know is that a poisoned page would not
    be checked if it is allocated by __isolate_free_page(). Perhaps there
    would be more.

    We could add more debug logic for allocated pages in the future and this
    separation would cause more problems. I'd like to fix this situation at
    this time. The solution is simple. This patch commonizes some logic for
    newly allocated pages and uses it on all sites. This will solve the
    problem.
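
    A sketch of the commonized post-allocation processing shared by the
    normal allocation path and __isolate_free_page() users (the hook name
    and the exact set of debug hooks are assumptions):

      void post_alloc_hook(struct page *page, unsigned int order,
                           gfp_t gfp_flags)
      {
              set_page_private(page, 0);
              set_page_refcounted(page);

              arch_alloc_page(page, order);
              kernel_map_pages(page, 1 << order, 1);
              kernel_poison_pages(page, 1 << order, 1);
              kasan_alloc_pages(page, order);
              set_page_owner(page, order, gfp_flags);
      }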

    [1] http://marc.info/?i=alpine.LSU.2.11.1604270029350.7066%40eggly.anvils%3E

    [iamjoonsoo.kim@lge.com: mm-page_alloc-introduce-post-allocation-processing-on-page-allocator-v3]
    Link: http://lkml.kernel.org/r/1464230275-25791-7-git-send-email-iamjoonsoo.kim@lge.com
    Link: http://lkml.kernel.org/r/1466150259-27727-9-git-send-email-iamjoonsoo.kim@lge.com
    Link: http://lkml.kernel.org/r/1464230275-25791-7-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Alexander Potapenko
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

25 Jun, 2016

1 commit

  • Commit d0164adc89f6 ("mm, page_alloc: distinguish between being unable
    to sleep, unwilling to sleep and avoiding waking kswapd") modified
    __GFP_WAIT to explicitly identify the difference between atomic callers
    and those that were unwilling to sleep. Later the definition was
    removed entirely.

    The GFP_RECLAIM_MASK is the set of flags that affect watermark checking
    and reclaim behaviour but __GFP_ATOMIC was never added. Without it,
    atomic users of the slab allocator strip the __GFP_ATOMIC flag and
    cannot access the page allocator atomic reserves. This patch addresses
    the problem.
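
    The fix amounts to adding __GFP_ATOMIC to the mask; roughly (the exact
    set of neighbouring flags is kernel-version dependent):

      /* mm/internal.h */
      #define GFP_RECLAIM_MASK (__GFP_RECLAIM|__GFP_HIGH|__GFP_IO|__GFP_FS|\
                              __GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\
                              __GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC|\
                              __GFP_ATOMIC)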

    The user-visible impact depends on the workload but potentially atomic
    allocations unnecessarily fail without this patch.

    Link: http://lkml.kernel.org/r/20160610093832.GK2527@techsingularity.net
    Signed-off-by: Mel Gorman
    Reported-by: Marcin Wojtas
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: [4.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

24 May, 2016

2 commits

  • All the callers of vm_mmap seem to check for the failure already and
    bail out in one way or another on the error which means that we can
    change it to use killable version of vm_mmap_pgoff and return -EINTR if
    the current task gets killed while waiting for mmap_sem. This also
    means that vm_mmap_pgoff can be killable by default and drop the
    additional parameter.

    This will help in the OOM conditions when the oom victim might be stuck
    waiting for the mmap_sem for write which in turn can block oom_reaper
    which relies on the mmap_sem for read to make a forward progress and
    reclaim the address space of the victim.

    Please note that load_elf_binary is ignoring vm_mmap error for
    current->personality & MMAP_PAGE_ZERO case but that shouldn't be a
    problem because the address is not used anywhere and we never return to
    the userspace if we got killed.

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Andrea Arcangeli
    Cc: Al Viro
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • This is a follow up work for oom_reaper [1]. As the async OOM killing
    depends on oom_sem for read, we would really appreciate it if a holder
    for write didn't stand in the way. This patchset changes many of the
    down_write calls to be killable to help those cases when the writer is
    blocked and waiting for readers to release the lock and so help
    __oom_reap_task to process the oom victim.

    Most of the patches are really trivial because the lock is held from
    shallow syscall paths where we can return EINTR trivially and allow the
    current task to die (note that EINTR will never get to userspace as
    the task has a fatal signal pending). Others seem to be easy as well as
    the callers are already handling fatal errors and bail and return to
    userspace, which should be sufficient to handle the failure gracefully.
    I am not familiar with all those code paths so a deeper review is really
    appreciated.

    As this work is touching more areas which are not directly connected I
    have tried to keep the CC list as small as possible and people who I
    believed would be familiar are CCed only to the specific patches (all
    should have received the cover though).

    This patchset is based on linux-next and it depends on
    down_write_killable for rw_semaphores which got merged into tip
    locking/rwsem branch and it is merged into this next tree. I guess it
    would be easiest to route these patches via mmotm because of the
    dependency on the tip tree but if respective maintainers prefer other
    way I have no objections.

    I haven't covered all the mmap_write(mm->mmap_sem) instances here

    $ git grep "down_write(.*\)" next/master | wc -l
    98
    $ git grep "down_write(.*\)" | wc -l
    62

    I have tried to cover those which should be relatively easy to review in
    this series because this alone should be a nice improvement. Other
    places can be changed on top.

    [0] http://lkml.kernel.org/r/1456752417-9626-1-git-send-email-mhocko@kernel.org
    [1] http://lkml.kernel.org/r/1452094975-551-1-git-send-email-mhocko@kernel.org
    [2] http://lkml.kernel.org/r/1456750705-7141-1-git-send-email-mhocko@kernel.org

    This patch (of 18):

    This is the first step in making mmap_sem write waiters killable. It
    focuses on the trivial ones which are taking the lock early after
    entering the syscall and they are not changing state before.

    Therefore it is very easy to change them to use down_write_killable and
    immediately return with -EINTR. This will allow the waiter to pass away
    without blocking the mmap_sem which might be required to make a forward
    progress. E.g. the oom reaper will need the lock for reading to
    dismantle the OOM victim address space.
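
    The conversion pattern in those shallow syscall paths is simply (a
    sketch):

      /* before: uninterruptible, may block the oom reaper indefinitely */
      down_write(&mm->mmap_sem);

      /* after: back off if the task has been killed in the meantime */
      if (down_write_killable(&mm->mmap_sem))
              return -EINTR;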

    The only tricky function in this patch is vm_mmap_pgoff which has many
    call sites via vm_mmap. To reduce the risk keep vm_mmap with the
    original non-killable semantic for now.

    vm_munmap callers do not bother checking the return value so open code
    it into the munmap syscall path for now for simplicity.

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Konstantin Khlebnikov
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

21 May, 2016

1 commit

  • COMPACT_COMPLETE now means that the compaction migration and free
    scanners met. This is not very useful information if somebody just wants
    to use this feedback and make decisions based on it. The current caller
    might be a poor guy who just happened to scan a tiny portion of the zone
    and that could be the reason no suitable pages were compacted. Make sure
    we distinguish between full and partial zone walks.

    Consumers should treat COMPACT_PARTIAL_SKIPPED as a potential success
    and be optimistic in retrying.
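
    A sketch of the distinction at the point where the scanners meet in
    __compact_finished(), assuming a whole_zone flag that records whether
    the scan started at the zone boundary:

      if (cc->free_pfn <= cc->migrate_pfn) {
              /* let the next compaction start anew */
              reset_cached_positions(zone);

              if (cc->whole_zone)
                      return COMPACT_COMPLETE;        /* full zone walk */

              /* only a part of the zone was scanned */
              return COMPACT_PARTIAL_SKIPPED;
      }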

    The existing users of COMPACT_COMPLETE are conservatively changed to use
    COMPACT_PARTIAL_SKIPPED as well, but some of them should probably be
    reconsidered to defer the compaction only for COMPACT_COMPLETE with the
    new semantic.

    This patch shouldn't introduce any functional changes.

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Hillf Danton
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Tetsuo Handa
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

20 May, 2016

4 commits

  • The classzone_idx can be inferred from preferred_zoneref so remove the
    unnecessary field and save stack space.

    Signed-off-by: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The allocator fast path looks up the first usable zone in a zonelist and
    then get_page_from_freelist does the same job in the zonelist iterator.
    This patch preserves the necessary information.

    4.6.0-rc2 4.6.0-rc2
    fastmark-v1r20 initonce-v1r20
    Min alloc-odr0-1 364.00 ( 0.00%) 359.00 ( 1.37%)
    Min alloc-odr0-2 262.00 ( 0.00%) 260.00 ( 0.76%)
    Min alloc-odr0-4 214.00 ( 0.00%) 214.00 ( 0.00%)
    Min alloc-odr0-8 186.00 ( 0.00%) 186.00 ( 0.00%)
    Min alloc-odr0-16 173.00 ( 0.00%) 173.00 ( 0.00%)
    Min alloc-odr0-32 165.00 ( 0.00%) 165.00 ( 0.00%)
    Min alloc-odr0-64 161.00 ( 0.00%) 162.00 ( -0.62%)
    Min alloc-odr0-128 159.00 ( 0.00%) 161.00 ( -1.26%)
    Min alloc-odr0-256 168.00 ( 0.00%) 170.00 ( -1.19%)
    Min alloc-odr0-512 180.00 ( 0.00%) 181.00 ( -0.56%)
    Min alloc-odr0-1024 190.00 ( 0.00%) 190.00 ( 0.00%)
    Min alloc-odr0-2048 196.00 ( 0.00%) 196.00 ( 0.00%)
    Min alloc-odr0-4096 202.00 ( 0.00%) 202.00 ( 0.00%)
    Min alloc-odr0-8192 206.00 ( 0.00%) 205.00 ( 0.49%)
    Min alloc-odr0-16384 206.00 ( 0.00%) 205.00 ( 0.49%)

    The benefit is negligible and the results are within the noise but each
    cycle counts.

    Signed-off-by: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • alloc_flags is a bitmask of flags, but it is signed, which does not
    necessarily generate the best code depending on the compiler. Even where
    there is no impact, it makes more sense for this to be unsigned.

    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Many developers already know that the reference count field of struct
    page is _count and that it is an atomic type. They may try to manipulate
    it directly, which would defeat the purpose of the page reference count
    tracepoints. To prevent direct _count modification, this patch renames it
    to _refcount and adds a warning comment to the code. After that,
    developers who need to handle the reference count will find that the
    field should not be accessed directly.
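
    An illustrative excerpt of what the renamed field looks like; the real
    struct page obviously has many more members and the comment wording may
    differ:

    struct page_excerpt {
            /*
             * Usage count. *DO NOT USE DIRECTLY* - go through the
             * page_ref_* / get_page() / put_page() helpers instead.
             */
            atomic_t _refcount;
    };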

    [akpm@linux-foundation.org: fix comments, per Vlastimil]
    [akpm@linux-foundation.org: Documentation/vm/transhuge.txt too]
    [sfr@canb.auug.org.au: sync ethernet driver changes]
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Stephen Rothwell
    Cc: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Johannes Berg
    Cc: "David S. Miller"
    Cc: Sunil Goutham
    Cc: Chris Metcalf
    Cc: Manish Chopra
    Cc: Yuval Mintz
    Cc: Tariq Toukan
    Cc: Saeed Mahameed
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

26 Mar, 2016

1 commit

  • This patch (of 5):

    This is based on the idea from Mel Gorman discussed during LSFMM 2015
    and independently brought up by Oleg Nesterov.

    The OOM killer currently allows killing only a single task, in the hope
    that the task will terminate in a reasonable time and free up its memory.
    Such a task (the oom victim) gets access to memory reserves via
    mark_oom_victim to allow forward progress should additional memory be
    needed during the exit path.

    It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
    construct workloads which break the core assumption mentioned above: the
    OOM victim might take an unbounded amount of time to exit because it
    might be blocked in the uninterruptible state, waiting for an event (e.g.
    a lock) held by another task looping in the page allocator.

    This patch reduces the probability of such a lockup by introducing a
    specialized kernel thread (oom_reaper) which tries to reclaim additional
    memory by preemptively reaping the anonymous or swapped out memory owned
    by the oom victim, under the assumption that such memory won't be needed
    once its owner has been killed and removed from userspace anyway. There
    is one notable exception, though: if the OOM victim was in the process of
    coredumping, the result would be incomplete. This is considered a
    reasonable constraint because overall system health is more important
    than the debuggability of a particular application.

    A kernel thread has been chosen because we need a reliable way of
    invocation: workqueue context is not appropriate because all the workers
    might be busy (e.g. allocating memory), and kswapd, which sounds like
    another good fit, is not appropriate either because it might get blocked
    on locks during reclaim as well.

    oom_reaper has to take mmap_sem of the target task for reading, so the
    solution is not 100% reliable because the semaphore might be held, or
    waited for, for write, but the probability of a stall is reduced
    considerably compared to basically any lock blocking forward progress as
    described above. To avoid blocking on the lock without making any forward
    progress, we use only a trylock and retry 10 times with a short sleep in
    between. Users of mmap_sem which take it for write should be carefully
    reviewed to use _killable waiting as much as possible and to reduce
    allocation requests done with the lock held to an absolute minimum,
    reducing the risk even further.
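
    A minimal sketch of that retry policy; the helper names and the retry
    count follow the description above rather than the merged code:

    #define MAX_OOM_REAP_RETRIES    10

    /* Does the actual work under down_read_trylock(&mm->mmap_sem). */
    static bool __oom_reap_task(struct task_struct *tsk);

    static void oom_reap_task(struct task_struct *tsk)
    {
            int attempts = 0;

            /* Never block on mmap_sem; just back off and retry a few times. */
            while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk))
                    schedule_timeout_idle(HZ/10);
    }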

    The API between the oom killer and the oom reaper is quite trivial.
    wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only a
    NULL->mm transition, and oom_reaper clears it atomically once it is done
    with the work. This means that only a single mm_struct can be reaped at a
    time. As the operation is potentially disruptive we try to limit it to
    the necessary minimum, and the reaper blocks any updates while it
    operates on an mm. The mm_struct is pinned by mm_count to allow a
    parallel exit_mmap, and a race is detected by
    atomic_inc_not_zero(mm_users).
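
    The hand-off could then look roughly like the following sketch, assuming
    a single mm_to_reap slot and an oom_reaper_wait waitqueue as described:

    static struct mm_struct *mm_to_reap;
    static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);

    static void wake_oom_reaper(struct mm_struct *mm)
    {
            struct mm_struct *old_mm;

            /* Pin with mm_count so a parallel exit_mmap() stays possible. */
            atomic_inc(&mm->mm_count);

            /* Only the NULL -> mm transition queues work for the reaper. */
            old_mm = cmpxchg(&mm_to_reap, NULL, mm);
            if (!old_mm)
                    wake_up(&oom_reaper_wait);
            else
                    mmdrop(mm);     /* another victim is being reaped already */
    }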

    Signed-off-by: Michal Hocko
    Suggested-by: Oleg Nesterov
    Suggested-by: Mel Gorman
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Cc: Mel Gorman
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

18 Mar, 2016

4 commits

  • Most of the mm subsystem uses the pr_<level> printk wrappers, so make it
    consistent.

    Miscellanea:

    - Realign arguments
    - Add missing newline to format
    - kmemleak-test.c has a "kmemleak: " prefix added to the
    "Kmemleak testing" logging message via pr_fmt

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • The success of CMA allocation largely depends on the success of
    migration, and a key factor there is the page reference count. Until now,
    page references have been manipulated by calling atomic functions
    directly, so we cannot track who manipulates them and where, which makes
    it hard to find the actual reason for a CMA allocation failure. CMA
    allocation should be guaranteed to succeed, so finding the offending
    place is really important.

    In this patch, call sites where the page reference is manipulated are
    converted to newly introduced wrapper functions. This is a preparation
    step for adding a tracepoint to each page reference manipulation
    function. With this facility, we can easily find the reason for a CMA
    allocation failure. There is no functional change in this patch.

    In addition, this patch also converts the reference read sites. It will
    help the second step, which renames page._count to something else and
    prevents later attempts to access it directly (suggested by Andrew).
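
    The wrappers are plain one-liners around the existing atomics, so the
    conversion itself cannot change behaviour. Roughly, using the field name
    as it is before the rename (the tracepoint hooks come with a later patch
    in the series):

    static inline int page_ref_count(struct page *page)
    {
            return atomic_read(&page->_count);
    }

    static inline void page_ref_inc(struct page *page)
    {
            atomic_inc(&page->_count);
    }

    static inline int page_ref_dec_and_test(struct page *page)
    {
            return atomic_dec_and_test(&page->_count);
    }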

    Signed-off-by: Joonsoo Kim
    Acked-by: Michal Nazarewicz
    Acked-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Sergey Senozhatsky
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Similarly to direct reclaim/compaction, kswapd attempts to combine
    reclaim and compaction in order to make a memory allocation of a given
    order available.

    The details differ from direct reclaim e.g. in having high watermark as
    a goal. The code involved in kswapd's reclaim/compaction decisions has
    evolved to be quite complex.

    Testing reveals that it doesn't actually work in at least one scenario,
    and closer inspection suggests that it could be greatly simplified
    without compromising on the goal (make a high-order page available) or
    efficiency (don't reclaim too much). The simplification relies on doing
    all compaction in kcompactd, which is simply woken up when high
    watermarks are reached by kswapd's reclaim.

    The scenario where kswapd compaction doesn't work was found with mmtests
    test stress-highalloc configured to attempt order-9 allocations without
    direct reclaim, just waking up kswapd. There was no compaction attempt
    from kswapd during the whole test. Some added instrumentation shows
    what happens:

    - balance_pgdat() sets end_zone to Normal, as it's not balanced
    - reclaim is attempted on DMA zone, which sets nr_attempted to 99, but
    it cannot reclaim anything, so sc.nr_reclaimed is 0
    - for zones DMA32 and Normal, kswapd_shrink_zone uses testorder=0, so
    it merely checks if high watermarks were reached for base pages.
    This is true, so no reclaim is attempted. For DMA, testorder=0
    wasn't used, as compaction_suitable() returned COMPACT_SKIPPED
    - even though the pgdat_needs_compaction flag wasn't set to false, no
    compaction happens due to the condition sc.nr_reclaimed >
    nr_attempted being false (as 0 < 99)
    - priority-- due to nr_reclaimed being 0, repeat until priority reaches 0
    - pgdat_balanced() is false as only the small zone DMA appears balanced
    (curiously in that check, watermark appears OK and compaction_suitable()
    returns COMPACT_PARTIAL, because a lower classzone_idx is used there)

    Now, even if it was decided that reclaim shouldn't be attempted on the
    DMA zone, the scenario would be the same, as (sc.nr_reclaimed=0 >
    nr_attempted=0) is also false. The condition really should use >= as the
    comment suggests. Then there is a mismatch in the check for setting
    pgdat_needs_compaction to false, which uses the low watermark while the
    rest uses the high watermark, and who knows what other subtleties.
    Hopefully this demonstrates that the current state is unsustainable.

    Luckily we can simplify this a lot. The reclaim/compaction decisions
    make sense for the direct reclaim scenario, but in kswapd our primary
    goal is to reach the high watermark in order-0 pages. Afterwards we can
    attempt compaction just once. Unlike direct reclaim, we don't reclaim
    extra pages (over the high watermark); the current code already
    disallows that for good reasons.

    After this patch, we simply wake up kcompactd to process the pgdat,
    after we have either succeeded or failed to reach the high watermarks in
    kswapd, which goes to sleep. We pass kswapd's order and classzone_idx,
    so kcompactd can apply the same criteria to determine which zones are
    worth compacting. Note that we use the classzone_idx from
    wakeup_kswapd(), not balanced_classzone_idx which can include higher
    zones that kswapd tried to balance too, but didn't consider them in
    pgdat_balanced().
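
    Schematically, with names following the description (the exact call site
    in the merged patch may differ):

    /* Called once kswapd has finished (or given up) balancing the node. */
    static void kswapd_hand_over_to_kcompactd(pg_data_t *pgdat, int order,
                                              int classzone_idx)
    {
            /*
             * Let kcompactd judge the same order/classzone_idx that kswapd
             * was woken for; kswapd itself no longer compacts anything.
             */
            wakeup_kcompactd(pgdat, order, classzone_idx);
    }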

    Since kswapd now cannot create high-order pages itself, we need to
    adjust how it determines the zones to be balanced. The key element here
    is adding a "highorder" parameter to zone_balanced, which, when set to
    false, makes it consider only the order-0 watermark instead of the
    desired higher order (this was done previously by kswapd_shrink_zone(),
    but not elsewhere). This false value is passed, for example, from
    pgdat_balanced().
    Importantly, wakeup_kswapd() uses true to make sure kswapd and thus
    kcompactd are woken up for a high-order allocation failure.
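
    A sketch of the zone_balanced() change; the body is simplified and the
    balance_gap handling of the real function is only hinted at:

    static bool zone_balanced(struct zone *zone, int order, bool highorder,
                              unsigned long balance_gap, int classzone_idx)
    {
            unsigned long mark = high_wmark_pages(zone) + balance_gap;

            /*
             * When highorder is false (e.g. from pgdat_balanced()), only the
             * order-0 watermark matters; high orders are kcompactd's job.
             */
            if (!highorder)
                    order = 0;

            return zone_watermark_ok_safe(zone, order, mark, classzone_idx);
    }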

    The last thing is to decide what to do with pageblock_skip bitmap
    handling. Compaction maintains a pageblock_skip bitmap to record
    pageblocks where isolation recently failed. This bitmap can be reset by
    three ways:

    1) direct compaction is restarting after going through the full deferred cycle

    2) kswapd goes to sleep, and some other direct compaction has previously
    finished scanning the whole zone and set zone->compact_blockskip_flush.
    Note that a successful direct compaction clears this flag.

    3) compaction was invoked manually via trigger in /proc

    Case 2) is somewhat fuzzy to begin with, but after introducing kcompactd
    we should update it. Both the check for direct compaction in 1) and the
    check for setting the flush flag in 2) use current_is_kswapd(), which
    doesn't work for kcompactd. Thus, this patch adds a bool
    direct_compaction to compact_control for use in 2). For case 1) we remove
    the check completely - unlike the former kswapd compaction, kcompactd
    does use the deferred compaction functionality, so flushing tied to
    restarting from deferred compaction makes sense here.
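
    For case 2) the distinction boils down to a flag in compact_control,
    roughly as below (struct and helper names are illustrative):

    struct compact_control_sketch {
            struct zone *zone;
            bool direct_compaction;         /* false for kcompactd */
    };

    /* Both scanners met: only direct compaction requests the bitmap flush. */
    static void mark_blockskip_flush(struct compact_control_sketch *cc)
    {
            if (cc->direct_compaction)
                    cc->zone->compact_blockskip_flush = true;
    }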

    Note that when kswapd goes to sleep, kcompactd is woken up, so it will
    see the flushed pageblock_skip bits. This is different from when the
    former kswapd compaction observed the bits and I believe it makes more
    sense. Kcompactd can afford to be more thorough than a direct
    compaction trying to limit allocation latency, or kswapd whose primary
    goal is to reclaim.

    For testing, I used stress-highalloc configured to do order-9
    allocations with GFP_NOWAIT|__GFP_HIGH|__GFP_COMP, so they relied just
    on kswapd/kcompactd reclaim/compaction (the interfering kernel builds in
    phases 1 and 2 work as usual):

    stress-highalloc
    4.5-rc1+before 4.5-rc1+after
    -nodirect -nodirect
    Success 1 Min 1.00 ( 0.00%) 5.00 (-66.67%)
    Success 1 Mean 1.40 ( 0.00%) 6.20 (-55.00%)
    Success 1 Max 2.00 ( 0.00%) 7.00 (-16.67%)
    Success 2 Min 1.00 ( 0.00%) 5.00 (-66.67%)
    Success 2 Mean 1.80 ( 0.00%) 6.40 (-52.38%)
    Success 2 Max 3.00 ( 0.00%) 7.00 (-16.67%)
    Success 3 Min 34.00 ( 0.00%) 62.00 ( 1.59%)
    Success 3 Mean 41.80 ( 0.00%) 63.80 ( 1.24%)
    Success 3 Max 53.00 ( 0.00%) 65.00 ( 2.99%)

    User 3166.67 3181.09
    System 1153.37 1158.25
    Elapsed 1768.53 1799.37

    4.5-rc1+before 4.5-rc1+after
    -nodirect -nodirect
    Direct pages scanned 32938 32797
    Kswapd pages scanned 2183166 2202613
    Kswapd pages reclaimed 2152359 2143524
    Direct pages reclaimed 32735 32545
    Percentage direct scans 1% 1%
    THP fault alloc 579 612
    THP collapse alloc 304 316
    THP splits 0 0
    THP fault fallback 793 778
    THP collapse fail 11 16
    Compaction stalls 1013 1007
    Compaction success 92 67
    Compaction failures 920 939
    Page migrate success 238457 721374
    Page migrate failure 23021 23469
    Compaction pages isolated 504695 1479924
    Compaction migrate scanned 661390 8812554
    Compaction free scanned 13476658 84327916
    Compaction cost 262 838

    After this patch we see improvements in allocation success rate
    (especially for phase 3) along with increased compaction activity. The
    compaction stalls (direct compaction) in the interfering kernel builds
    (probably THP's) also decreased somewhat thanks to kcompactd activity,
    yet THP alloc successes improved a bit.

    Note that elapsed and user time aren't so useful for this benchmark
    because the background interference is unpredictable. They are just a
    quick way to spot major unexpected differences. System time is somewhat
    more useful, and it didn't increase.

    Also (after adjusting mmtests' ftrace monitor):

    Time kswapd awake 2547781 2269241
    Time kcompactd awake 0 119253
    Time direct compacting 939937 557649
    Time kswapd compacting 0 0
    Time kcompactd compacting 0 119099

    The decrease of the overall time spent compacting appears not to match
    the increased compaction stats. I suspect the tasks get rescheduled and,
    since the ftrace monitor doesn't see that, the reported time is wall
    time, not CPU time. But arguably direct compactors care about overall
    latency anyway, whether busy compacting or waiting for CPU, and that
    latency seems to have almost halved.

    It's also interesting how much time kswapd spent awake just going
    through all the priorities and failing to even try compacting, over and
    over.

    We can also configure stress-highalloc to perform both direct
    reclaim/compaction and wake up kswapd/kcompactd, by using
    GFP_KERNEL|__GFP_HIGH|__GFP_COMP:

    stress-highalloc
    4.5-rc1+before 4.5-rc1+after
    -direct -direct
    Success 1 Min 4.00 ( 0.00%) 9.00 (-50.00%)
    Success 1 Mean 8.00 ( 0.00%) 10.00 (-19.05%)
    Success 1 Max 12.00 ( 0.00%) 11.00 ( 15.38%)
    Success 2 Min 4.00 ( 0.00%) 9.00 (-50.00%)
    Success 2 Mean 8.20 ( 0.00%) 10.00 (-16.28%)
    Success 2 Max 13.00 ( 0.00%) 11.00 ( 8.33%)
    Success 3 Min 75.00 ( 0.00%) 74.00 ( 1.33%)
    Success 3 Mean 75.60 ( 0.00%) 75.20 ( 0.53%)
    Success 3 Max 77.00 ( 0.00%) 76.00 ( 0.00%)

    User 3344.73 3246.04
    System 1194.24 1172.29
    Elapsed 1838.04 1836.76

    4.5-rc1+before 4.5-rc1+after
    -direct -direct
    Direct pages scanned 125146 120966
    Kswapd pages scanned 2119757 2135012
    Kswapd pages reclaimed 2073183 2108388
    Direct pages reclaimed 124909 120577
    Percentage direct scans 5% 5%
    THP fault alloc 599 652
    THP collapse alloc 323 354
    THP splits 0 0
    THP fault fallback 806 793
    THP collapse fail 17 16
    Compaction stalls 2457 2025
    Compaction success 906 518
    Compaction failures 1551 1507
    Page migrate success 2031423 2360608
    Page migrate failure 32845 40852
    Compaction pages isolated 4129761 4802025
    Compaction migrate scanned 11996712 21750613
    Compaction free scanned 214970969 344372001
    Compaction cost 2271 2694

    In this scenario, this patch doesn't change the overall success rate as
    direct compaction already tries all it can. There is, however, a
    significant reduction in direct compaction stalls (that is, the number of
    allocations that went into direct compaction). The number of successes
    (i.e. direct compaction stalls that ended up with a successful
    allocation) is reduced by the same number. This means the offload to
    kcompactd is working as expected, and direct compaction is reduced either
    due to detecting contention, or due to compaction deferred by kcompactd.
    In the previous version of this patchset there was some apparent
    reduction of success rate, but the changes in this version (such as
    using sync compaction only), new baseline kernel, and/or averaging
    results from 5 executions (my bet), made this go away.

    Ftrace-based stats seem to roughly agree:

    Time kswapd awake 2532984 2326824
    Time kcompactd awake 0 257916
    Time direct compacting 864839 735130
    Time kswapd compacting 0 0
    Time kcompactd compacting 0 257585

    Signed-off-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Currently /proc/kpageflags returns nothing for "tail" buddy pages, which
    is inconvenient when trying to grasp how free pages are distributed.
    This patch sets KPF_BUDDY for such pages.

    With this patch:

    $ grep MemFree /proc/meminfo ; tools/vm/page-types -b buddy
    MemFree: 3134992 kB
    flags page-count MB symbolic-flags long-symbolic-flags
    0x0000000000000400 779272 3044 __________B_______________________________ buddy
    0x0000000000000c00 4385 17 __________BM______________________________ buddy,mmap
    total 783657 3061

    783657 pages is 3134628 kB (with 4 kB pages), which is roughly
    consistent with the global counter, so it's OK.
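
    The reporting side amounts to something like this helper built from the
    existing predicates (the helper itself is illustrative; in the kernel
    the checks live inline in stable_page_flags()):

    static u64 kpf_buddy_bit(struct page *page)
    {
            /* Head page of a free buddy block. */
            if (PageBuddy(page))
                    return 1ULL << KPF_BUDDY;

            /* Tail pages of a free buddy block are now reported as well. */
            if (page_count(page) == 0 && is_free_buddy_page(page))
                    return 1ULL << KPF_BUDDY;

            return 0;
    }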

    [akpm@linux-foundation.org: update comment, per Naoya]
    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Vladimir Davydov
    Cc: Konstantin Khlebnikov
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi