29 Jul, 2016

40 commits

  • The fair zone allocation policy interleaves allocation requests between
    zones to avoid an age inversion problem whereby new pages are reclaimed
    to balance a zone. Reclaim is now node-based so this should no longer
    be an issue and the fair zone allocation policy is not free. This patch
    removes it.

    Link: http://lkml.kernel.org/r/1467970510-21195-30-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This is convenient when tracking down why the skip count is high because
    it'll show what classzone kswapd woke up at and what zones are being
    isolated.

    Link: http://lkml.kernel.org/r/1467970510-21195-29-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The buffer_heads_over_limit limit in kswapd is inconsistent with direct
    reclaim behaviour. It may force an an attempt to reclaim from all zones
    and then not reclaim at all because higher zones were balanced than
    required by the original request.

    This patch will causes kswapd to consider reclaiming from all zones if
    buffer_heads_over_limit. However, if there are eligible zones for the
    allocation request that woke kswapd then no reclaim will occur even if
    buffer_heads_over_limit. This avoids kswapd over-reclaiming just
    because buffer_heads_over_limit.

    [mgorman@techsingularity.net: fix comment about buffer_heads_over_limit]
    Link: http://lkml.kernel.org/r/1468404004-5085-2-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-28-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • As pointed out by Minchan Kim, the first call to prepare_kswapd_sleep()
    always passes in 0 for `remaining' and the second call can trivially
    check the parameter in advance.

    Suggested-by: Minchan Kim
    Link: http://lkml.kernel.org/r/1467970510-21195-27-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The scan_control structure has enough information available for
    compaction_ready() to make a decision. The classzone_idx manipulations
    in shrink_zones() are no longer necessary as the highest populated zone
    is no longer used to determine if shrink_slab should be called or not.

    [mgorman@techsingularity.net remove redundant check in shrink_zones()]
    Link: http://lkml.kernel.org/r/1468588165-12461-3-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-26-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Hillf Danton
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • shrink_node receives all information it needs about classzone_idx from
    sc->reclaim_idx so remove the aliases.

    Link: http://lkml.kernel.org/r/1467970510-21195-25-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Hillf Danton
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • As reclaim is now per-node based, convert zone_reclaim to be
    node_reclaim. It is possible that a node will be reclaimed multiple
    times if it has multiple zones but this is unavoidable without caching
    all nodes traversed so far. The documentation and interface to
    userspace is the same from a configuration perspective and will will be
    similar in behaviour unless the node-local allocation requests were also
    limited to lower zones.

    Link: http://lkml.kernel.org/r/1467970510-21195-24-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The ac_classzone_idx is used as the basis for waking kswapd and that is
    based on the preferred zoneref. If the preferred zoneref's first zone
    is lower than what is available on other nodes, it's possible that
    kswapd is woken on a zone with only higher, but still eligible, zones.
    As classzone_idx is strictly adhered to now, it causes a problem because
    eligible pages are skipped.

    For example, node 0 has only DMA32 and node 1 has only NORMAL. An
    allocating context running on node 0 may wake kswapd on node 1 telling
    it to skip all NORMAL pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-23-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Hillf Danton
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • kswapd is woken when zones are below the low watermark but the wakeup
    decision is not taking the classzone into account. Now that reclaim is
    node-based, it is only required to wake kswapd once per node and only if
    all zones are unbalanced for the requested classzone.

    Note that one node might be checked multiple times if the zonelist is
    ordered by node because there is no cheap way of tracking what nodes
    have already been visited. For zone-ordering, each node should be
    checked only once.

    Link: http://lkml.kernel.org/r/1467970510-21195-22-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • As reclaim is now node-based, it follows that page write activity due to
    page reclaim should also be accounted for on the node. For consistency,
    also account page writes and page dirtying on a per-node basis.

    After this patch, there are a few remaining zone counters that may appear
    strange but are fine. NUMA stats are still per-zone as this is a
    user-space interface that tools consume. NR_MLOCK, NR_SLAB_*,
    NR_PAGETABLE, NR_KERNEL_STACK and NR_BOUNCE are all allocations that
    potentially pin low memory and cannot trivially be reclaimed on demand.
    This information is still useful for debugging a page allocation failure
    warning.

    Link: http://lkml.kernel.org/r/1467970510-21195-21-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • There are now a number of accounting oddities such as mapped file pages
    being accounted for on the node while the total number of file pages are
    accounted on the zone. This can be coped with to some extent but it's
    confusing so this patch moves the relevant file-based accounted. Due to
    throttling logic in the page allocator for reliable OOM detection, it is
    still necessary to track dirty and writeback pages on a per-zone basis.

    [mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
    Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • NR_FILE_PAGES is the number of file pages.
    NR_FILE_MAPPED is the number of mapped file pages.
    NR_ANON_PAGES is the number of mapped anon pages.

    This is unhelpful naming as it's easy to confuse NR_FILE_MAPPED and
    NR_ANON_PAGES for mapped pages. This patch renames NR_ANON_PAGES so we
    have

    NR_FILE_PAGES is the number of file pages.
    NR_FILE_MAPPED is the number of mapped file pages.
    NR_ANON_MAPPED is the number of mapped anon pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-19-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Reclaim makes decisions based on the number of pages that are mapped but
    it's mixing node and zone information. Account NR_FILE_MAPPED and
    NR_ANON_PAGES pages on the node.

    Link: http://lkml.kernel.org/r/1467970510-21195-18-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Historically dirty pages were spread among zones but now that LRUs are
    per-node it is more appropriate to consider dirty pages in a node.

    Link: http://lkml.kernel.org/r/1467970510-21195-17-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Signed-off-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Working set and refault detection is still zone-based, fix it.

    Link: http://lkml.kernel.org/r/1467970510-21195-16-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Memcg needs adjustment after moving LRUs to the node. Limits are
    tracked per memcg but the soft-limit excess is tracked per zone. As
    global page reclaim is based on the node, it is easy to imagine a
    situation where a zone soft limit is exceeded even though the memcg
    limit is fine.

    This patch moves the soft limit tree the node. Technically, all the
    variable names should also change but people are already familiar by the
    meaning of "mz" even if "mn" would be a more appropriate name now.

    Link: http://lkml.kernel.org/r/1467970510-21195-15-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Earlier patches focused on having direct reclaim and kswapd use data
    that is node-centric for reclaiming but shrink_node() itself still uses
    too much zone information. This patch removes unnecessary zone-based
    information with the most important decision being whether to continue
    reclaim or not. Some memcg APIs are adjusted as a result even though
    memcg itself still uses some zone information.

    [mgorman@techsingularity.net: optimization]
    Link: http://lkml.kernel.org/r/1468588165-12461-2-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-14-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • kswapd scans from highest to lowest for a zone that requires balancing.
    This was necessary when reclaim was per-zone to fairly age pages on
    lower zones. Now that we are reclaiming on a per-node basis, any
    eligible zone can be used and pages will still be aged fairly. This
    patch avoids reclaiming excessively unless buffer_heads are over the
    limit and it's necessary to reclaim from a higher zone than requested by
    the waker of kswapd to relieve low memory pressure.

    [hillf.zj@alibaba-inc.com: Force kswapd reclaim no more than needed]
    Link: http://lkml.kernel.org/r/1466518566-30034-12-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-13-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Signed-off-by: Hillf Danton
    Acked-by: Vlastimil Babka
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Reclaim may stall if there is too much dirty or congested data on a
    node. This was previously based on zone flags and the logic for
    clearing the flags is in two places. As congestion/dirty tracking is
    now tracked on a per-node basis, we can remove some duplicate logic.

    Link: http://lkml.kernel.org/r/1467970510-21195-12-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Hillf Danton
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Direct reclaim iterates over all zones in the zonelist and shrinking
    them but this is in conflict with node-based reclaim. In the default
    case, only shrink once per node.

    Link: http://lkml.kernel.org/r/1467970510-21195-11-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • kswapd goes through some complex steps trying to figure out if it should
    stay awake based on the classzone_idx and the requested order. It is
    unnecessarily complex and passes in an invalid classzone_idx to
    balance_pgdat(). What matters most of all is whether a larger order has
    been requsted and whether kswapd successfully reclaimed at the previous
    order. This patch irons out the logic to check just that and the end
    result is less headache inducing.

    Link: http://lkml.kernel.org/r/1467970510-21195-10-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The balance gap was introduced to apply equal pressure to all zones when
    reclaiming for a higher zone. With node-based LRU, the need for the
    balance gap is removed and the code is dead so remove it.

    [vbabka@suse.cz: Also remove KSWAPD_ZONE_BALANCE_GAP_RATIO]
    Link: http://lkml.kernel.org/r/1467970510-21195-9-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Patch "mm: vmscan: Begin reclaiming pages on a per-node basis" started
    thinking of reclaim in terms of nodes but kswapd is still zone-centric.
    This patch gets rid of many of the node-based versus zone-based
    decisions.

    o A node is considered balanced when any eligible lower zone is balanced.
    This eliminates one class of age-inversion problem because we avoid
    reclaiming a newer page just because it's in the wrong zone
    o pgdat_balanced disappears because we now only care about one zone being
    balanced.
    o Some anomalies related to writeback and congestion tracking being based on
    zones disappear.
    o kswapd no longer has to take care to reclaim zones in the reverse order
    that the page allocator uses.
    o Most importantly of all, reclaim from node 0 with multiple zones will
    have similar aging and reclaiming characteristics as every
    other node.

    Link: http://lkml.kernel.org/r/1467970510-21195-8-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • kswapd checks all eligible zones to see if they need balancing even if
    it was woken for a lower zone. This made sense when we reclaimed on a
    per-zone basis because we wanted to shrink zones fairly so avoid
    age-inversion problems. Ideally this is completely unnecessary when
    reclaiming on a per-node basis. In theory, there may still be anomalies
    when all requests are for lower zones and very old pages are preserved
    in higher zones but this should be the exceptional case.

    Link: http://lkml.kernel.org/r/1467970510-21195-7-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patch makes reclaim decisions on a per-node basis. A reclaimer
    knows what zone is required by the allocation request and skips pages
    from higher zones. In many cases this will be ok because it's a
    GFP_HIGHMEM request of some description. On 64-bit, ZONE_DMA32 requests
    will cause some problems but 32-bit devices on 64-bit platforms are
    increasingly rare. Historically it would have been a major problem on
    32-bit with big Highmem:Lowmem ratios but such configurations are also
    now rare and even where they exist, they are not encouraged. If it
    really becomes a problem, it'll manifest as very low reclaim
    efficiencies.

    Link: http://lkml.kernel.org/r/1467970510-21195-6-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Hillf Danton
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Zone padding separates write-intensive fields used by page allocation,
    compaction and vmstats but the comments are a little misleading and need
    clarification.

    Link: http://lkml.kernel.org/r/1467970510-21195-5-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages on both zone and node
    logic. Most reclaim logic is based on the node counters but the retry
    logic uses the zone counters which do not distinguish inactive and
    active sizes. It would be possible to leave the LRU counters on a
    per-zone basis but it's a heavier calculation across multiple cache
    lines that is much more frequent than the retry checks.

    Other than the LRU counters, this is mostly a mechanical patch but note
    that it introduces a number of anomalies. For example, the scans are
    per-zone but using per-node counters. We also mark a node as congested
    when a zone is congested. This causes weird problems that are fixed
    later but is easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    the nodes being on LRU then there are two potential solutions

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    keeps those pages on a separate list until a reclaim for highmem pages
    arrives that splices the highmem pages back onto the LRU. It potentially
    could be implemented similar to the UNEVICTABLE list.

    That would reduce the skip rate with the potential corner case is that
    highmem pages have to be scanned and reclaimed to free lowmem slab pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Node-based reclaim requires node-based LRUs and locking. This is a
    preparation patch that just moves the lru_lock to the node so later
    patches are easier to review. It is a mechanical change but note this
    patch makes contention worse because the LRU lock is hotter and direct
    reclaim and kswapd can contend on the same lock even when reclaiming
    from different zones.

    Link: http://lkml.kernel.org/r/1467970510-21195-3-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Patchset: "Move LRU page reclaim from zones to nodes v9"

    This series moves LRUs from the zones to the node. While this is a
    current rebase, the test results were based on mmotm as of June 23rd.
    Conceptually, this series is simple but there are a lot of details.
    Some of the broad motivations for this are;

    1. The residency of a page partially depends on what zone the page was
    allocated from. This is partially combatted by the fair zone allocation
    policy but that is a partial solution that introduces overhead in the
    page allocator paths.

    2. Currently, reclaim on node 0 behaves slightly different to node 1. For
    example, direct reclaim scans in zonelist order and reclaims even if
    the zone is over the high watermark regardless of the age of pages
    in that LRU. Kswapd on the other hand starts reclaim on the highest
    unbalanced zone. A difference in distribution of file/anon pages due
    to when they were allocated results can result in a difference in
    again. While the fair zone allocation policy mitigates some of the
    problems here, the page reclaim results on a multi-zone node will
    always be different to a single-zone node.
    it was scheduled on as a result.

    3. kswapd and the page allocator scan zones in the opposite order to
    avoid interfering with each other but it's sensitive to timing. This
    mitigates the page allocator using pages that were allocated very recently
    in the ideal case but it's sensitive to timing. When kswapd is allocating
    from lower zones then it's great but during the rebalancing of the highest
    zone, the page allocator and kswapd interfere with each other. It's worse
    if the highest zone is small and difficult to balance.

    4. slab shrinkers are node-based which makes it harder to identify the exact
    relationship between slab reclaim and LRU reclaim.

    The reason we have zone-based reclaim is that we used to have
    large highmem zones in common configurations and it was necessary
    to quickly find ZONE_NORMAL pages for reclaim. Today, this is much
    less of a concern as machines with lots of memory will (or should) use
    64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are
    rare. Machines that do use highmem should have relatively low highmem:lowmem
    ratios than we worried about in the past.

    Conceptually, moving to node LRUs should be easier to understand. The
    page allocator plays fewer tricks to game reclaim and reclaim behaves
    similarly on all nodes.

    The series has been tested on a 16 core UMA machine and a 2-socket 48
    core NUMA machine. The UMA results are presented in most cases as the NUMA
    machine behaved similarly.

    pagealloc
    ---------

    This is a microbenchmark that shows the benefit of removing the fair zone
    allocation policy. It was tested uip to order-4 but only orders 0 and 1 are
    shown as the other orders were comparable.

    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 nodelru-v9
    Min total-odr0-1 490.00 ( 0.00%) 457.00 ( 6.73%)
    Min total-odr0-2 347.00 ( 0.00%) 329.00 ( 5.19%)
    Min total-odr0-4 288.00 ( 0.00%) 273.00 ( 5.21%)
    Min total-odr0-8 251.00 ( 0.00%) 239.00 ( 4.78%)
    Min total-odr0-16 234.00 ( 0.00%) 222.00 ( 5.13%)
    Min total-odr0-32 223.00 ( 0.00%) 211.00 ( 5.38%)
    Min total-odr0-64 217.00 ( 0.00%) 208.00 ( 4.15%)
    Min total-odr0-128 214.00 ( 0.00%) 204.00 ( 4.67%)
    Min total-odr0-256 250.00 ( 0.00%) 230.00 ( 8.00%)
    Min total-odr0-512 271.00 ( 0.00%) 269.00 ( 0.74%)
    Min total-odr0-1024 291.00 ( 0.00%) 282.00 ( 3.09%)
    Min total-odr0-2048 303.00 ( 0.00%) 296.00 ( 2.31%)
    Min total-odr0-4096 311.00 ( 0.00%) 309.00 ( 0.64%)
    Min total-odr0-8192 316.00 ( 0.00%) 314.00 ( 0.63%)
    Min total-odr0-16384 317.00 ( 0.00%) 315.00 ( 0.63%)
    Min total-odr1-1 742.00 ( 0.00%) 712.00 ( 4.04%)
    Min total-odr1-2 562.00 ( 0.00%) 530.00 ( 5.69%)
    Min total-odr1-4 457.00 ( 0.00%) 433.00 ( 5.25%)
    Min total-odr1-8 411.00 ( 0.00%) 381.00 ( 7.30%)
    Min total-odr1-16 381.00 ( 0.00%) 356.00 ( 6.56%)
    Min total-odr1-32 372.00 ( 0.00%) 346.00 ( 6.99%)
    Min total-odr1-64 372.00 ( 0.00%) 343.00 ( 7.80%)
    Min total-odr1-128 375.00 ( 0.00%) 351.00 ( 6.40%)
    Min total-odr1-256 379.00 ( 0.00%) 351.00 ( 7.39%)
    Min total-odr1-512 385.00 ( 0.00%) 355.00 ( 7.79%)
    Min total-odr1-1024 386.00 ( 0.00%) 358.00 ( 7.25%)
    Min total-odr1-2048 390.00 ( 0.00%) 362.00 ( 7.18%)
    Min total-odr1-4096 390.00 ( 0.00%) 362.00 ( 7.18%)
    Min total-odr1-8192 388.00 ( 0.00%) 363.00 ( 6.44%)

    This shows a steady improvement throughout. The primary benefit is from
    reduced system CPU usage which is obvious from the overall times;

    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623nodelru-v8
    User 189.19 191.80
    System 2604.45 2533.56
    Elapsed 2855.30 2786.39

    The vmstats also showed that the fair zone allocation policy was definitely
    removed as can be seen here;

    4.7.0-rc3 4.7.0-rc3
    mmotm-20160623 nodelru-v8
    DMA32 allocs 28794729769 0
    Normal allocs 48432501431 77227309877
    Movable allocs 0 0

    tiobench on ext4
    ----------------

    tiobench is a benchmark that artifically benefits if old pages remain resident
    while new pages get reclaimed. The fair zone allocation policy mitigates this
    problem so pages age fairly. While the benchmark has problems, it is important
    that tiobench performance remains constant as it implies that page aging
    problems that the fair zone allocation policy fixes are not re-introduced.

    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 nodelru-v9
    Min PotentialReadSpeed 89.65 ( 0.00%) 90.21 ( 0.62%)
    Min SeqRead-MB/sec-1 82.68 ( 0.00%) 82.01 ( -0.81%)
    Min SeqRead-MB/sec-2 72.76 ( 0.00%) 72.07 ( -0.95%)
    Min SeqRead-MB/sec-4 75.13 ( 0.00%) 74.92 ( -0.28%)
    Min SeqRead-MB/sec-8 64.91 ( 0.00%) 65.19 ( 0.43%)
    Min SeqRead-MB/sec-16 62.24 ( 0.00%) 62.22 ( -0.03%)
    Min RandRead-MB/sec-1 0.88 ( 0.00%) 0.88 ( 0.00%)
    Min RandRead-MB/sec-2 0.95 ( 0.00%) 0.92 ( -3.16%)
    Min RandRead-MB/sec-4 1.43 ( 0.00%) 1.34 ( -6.29%)
    Min RandRead-MB/sec-8 1.61 ( 0.00%) 1.60 ( -0.62%)
    Min RandRead-MB/sec-16 1.80 ( 0.00%) 1.90 ( 5.56%)
    Min SeqWrite-MB/sec-1 76.41 ( 0.00%) 76.85 ( 0.58%)
    Min SeqWrite-MB/sec-2 74.11 ( 0.00%) 73.54 ( -0.77%)
    Min SeqWrite-MB/sec-4 80.05 ( 0.00%) 80.13 ( 0.10%)
    Min SeqWrite-MB/sec-8 72.88 ( 0.00%) 73.20 ( 0.44%)
    Min SeqWrite-MB/sec-16 75.91 ( 0.00%) 76.44 ( 0.70%)
    Min RandWrite-MB/sec-1 1.18 ( 0.00%) 1.14 ( -3.39%)
    Min RandWrite-MB/sec-2 1.02 ( 0.00%) 1.03 ( 0.98%)
    Min RandWrite-MB/sec-4 1.05 ( 0.00%) 0.98 ( -6.67%)
    Min RandWrite-MB/sec-8 0.89 ( 0.00%) 0.92 ( 3.37%)
    Min RandWrite-MB/sec-16 0.92 ( 0.00%) 0.93 ( 1.09%)

    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 approx-v9
    User 645.72 525.90
    System 403.85 331.75
    Elapsed 6795.36 6783.67

    This shows that the series has little or not impact on tiobench which is
    desirable and a reduction in system CPU usage. It indicates that the fair
    zone allocation policy was removed in a manner that didn't reintroduce
    one class of page aging bug. There were only minor differences in overall
    reclaim activity

    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623nodelru-v8
    Minor Faults 645838 647465
    Major Faults 573 640
    Swap Ins 0 0
    Swap Outs 0 0
    DMA allocs 0 0
    DMA32 allocs 46041453 44190646
    Normal allocs 78053072 79887245
    Movable allocs 0 0
    Allocation stalls 24 67
    Stall zone DMA 0 0
    Stall zone DMA32 0 0
    Stall zone Normal 0 2
    Stall zone HighMem 0 0
    Stall zone Movable 0 65
    Direct pages scanned 10969 30609
    Kswapd pages scanned 93375144 93492094
    Kswapd pages reclaimed 93372243 93489370
    Direct pages reclaimed 10969 30609
    Kswapd efficiency 99% 99%
    Kswapd velocity 13741.015 13781.934
    Direct efficiency 100% 100%
    Direct velocity 1.614 4.512
    Percentage direct scans 0% 0%

    kswapd activity was roughly comparable. There were differences in direct
    reclaim activity but negligible in the context of the overall workload
    (velocity of 4 pages per second with the patches applied, 1.6 pages per
    second in the baseline kernel).

    pgbench read-only large configuration on ext4
    ---------------------------------------------

    pgbench is a database benchmark that can be sensitive to page reclaim
    decisions. This also checks if removing the fair zone allocation policy
    is safe

    pgbench Transactions
    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 nodelru-v8
    Hmean 1 188.26 ( 0.00%) 189.78 ( 0.81%)
    Hmean 5 330.66 ( 0.00%) 328.69 ( -0.59%)
    Hmean 12 370.32 ( 0.00%) 380.72 ( 2.81%)
    Hmean 21 368.89 ( 0.00%) 369.00 ( 0.03%)
    Hmean 30 382.14 ( 0.00%) 360.89 ( -5.56%)
    Hmean 32 428.87 ( 0.00%) 432.96 ( 0.95%)

    Negligible differences again. As with tiobench, overall reclaim activity
    was comparable.

    bonnie++ on ext4
    ----------------

    No interesting performance difference, negligible differences on reclaim
    stats.

    paralleldd on ext4
    ------------------

    This workload uses varying numbers of dd instances to read large amounts of
    data from disk.

    4.7.0-rc3 4.7.0-rc3
    mmotm-20160623 nodelru-v9
    Amean Elapsd-1 186.04 ( 0.00%) 189.41 ( -1.82%)
    Amean Elapsd-3 192.27 ( 0.00%) 191.38 ( 0.46%)
    Amean Elapsd-5 185.21 ( 0.00%) 182.75 ( 1.33%)
    Amean Elapsd-7 183.71 ( 0.00%) 182.11 ( 0.87%)
    Amean Elapsd-12 180.96 ( 0.00%) 181.58 ( -0.35%)
    Amean Elapsd-16 181.36 ( 0.00%) 183.72 ( -1.30%)

    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 nodelru-v9
    User 1548.01 1552.44
    System 8609.71 8515.08
    Elapsed 3587.10 3594.54

    There is little or no change in performance but some drop in system CPU usage.

    4.7.0-rc3 4.7.0-rc3
    mmotm-20160623 nodelru-v9
    Minor Faults 362662 367360
    Major Faults 1204 1143
    Swap Ins 22 0
    Swap Outs 2855 1029
    DMA allocs 0 0
    DMA32 allocs 31409797 28837521
    Normal allocs 46611853 49231282
    Movable allocs 0 0
    Direct pages scanned 0 0
    Kswapd pages scanned 40845270 40869088
    Kswapd pages reclaimed 40830976 40855294
    Direct pages reclaimed 0 0
    Kswapd efficiency 99% 99%
    Kswapd velocity 11386.711 11369.769
    Direct efficiency 100% 100%
    Direct velocity 0.000 0.000
    Percentage direct scans 0% 0%
    Page writes by reclaim 2855 1029
    Page writes file 0 0
    Page writes anon 2855 1029
    Page reclaim immediate 771 1628
    Sector Reads 293312636 293536360
    Sector Writes 18213568 18186480
    Page rescued immediate 0 0
    Slabs scanned 128257 132747
    Direct inode steals 181 56
    Kswapd inode steals 59 1131

    It basically shows that kswapd was active at roughly the same rate in
    both kernels. There was also comparable slab scanning activity and direct
    reclaim was avoided in both cases. There appears to be a large difference
    in numbers of inodes reclaimed but the workload has few active inodes and
    is likely a timing artifact.

    stutter
    -------

    stutter simulates a simple workload. One part uses a lot of anonymous
    memory, a second measures mmap latency and a third copies a large file.
    The primary metric is checking for mmap latency.

    stutter
    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623 nodelru-v8
    Min mmap 16.6283 ( 0.00%) 13.4258 ( 19.26%)
    1st-qrtle mmap 54.7570 ( 0.00%) 34.9121 ( 36.24%)
    2nd-qrtle mmap 57.3163 ( 0.00%) 46.1147 ( 19.54%)
    3rd-qrtle mmap 58.9976 ( 0.00%) 47.1882 ( 20.02%)
    Max-90% mmap 59.7433 ( 0.00%) 47.4453 ( 20.58%)
    Max-93% mmap 60.1298 ( 0.00%) 47.6037 ( 20.83%)
    Max-95% mmap 73.4112 ( 0.00%) 82.8719 (-12.89%)
    Max-99% mmap 92.8542 ( 0.00%) 88.8870 ( 4.27%)
    Max mmap 1440.6569 ( 0.00%) 121.4201 ( 91.57%)
    Mean mmap 59.3493 ( 0.00%) 42.2991 ( 28.73%)
    Best99%Mean mmap 57.2121 ( 0.00%) 41.8207 ( 26.90%)
    Best95%Mean mmap 55.9113 ( 0.00%) 39.9620 ( 28.53%)
    Best90%Mean mmap 55.6199 ( 0.00%) 39.3124 ( 29.32%)
    Best50%Mean mmap 53.2183 ( 0.00%) 33.1307 ( 37.75%)
    Best10%Mean mmap 45.9842 ( 0.00%) 20.4040 ( 55.63%)
    Best5%Mean mmap 43.2256 ( 0.00%) 17.9654 ( 58.44%)
    Best1%Mean mmap 32.9388 ( 0.00%) 16.6875 ( 49.34%)

    This shows a number of improvements with the worst-case outlier greatly
    improved.

    Some of the vmstats are interesting

    4.7.0-rc4 4.7.0-rc4
    mmotm-20160623nodelru-v8
    Swap Ins 163 502
    Swap Outs 0 0
    DMA allocs 0 0
    DMA32 allocs 618719206 1381662383
    Normal allocs 891235743 564138421
    Movable allocs 0 0
    Allocation stalls 2603 1
    Direct pages scanned 216787 2
    Kswapd pages scanned 50719775 41778378
    Kswapd pages reclaimed 41541765 41777639
    Direct pages reclaimed 209159 0
    Kswapd efficiency 81% 99%
    Kswapd velocity 16859.554 14329.059
    Direct efficiency 96% 0%
    Direct velocity 72.061 0.001
    Percentage direct scans 0% 0%
    Page writes by reclaim 6215049 0
    Page writes file 6215049 0
    Page writes anon 0 0
    Page reclaim immediate 70673 90
    Sector Reads 81940800 81680456
    Sector Writes 100158984 98816036
    Page rescued immediate 0 0
    Slabs scanned 1366954 22683

    While this is not guaranteed in all cases, this particular test showed
    a large reduction in direct reclaim activity. It's also worth noting
    that no page writes were issued from reclaim context.

    This series is not without its hazards. There are at least three areas
    that I'm concerned with even though I could not reproduce any problems in
    that area.

    1. Reclaim/compaction is going to be affected because the amount of reclaim is
    no longer targetted at a specific zone. Compaction works on a per-zone basis
    so there is no guarantee that reclaiming a few THP's worth page pages will
    have a positive impact on compaction success rates.

    2. The Slab/LRU reclaim ratio is affected because the frequency the shrinkers
    are called is now different. This may or may not be a problem but if it
    is, it'll be because shrinkers are not called enough and some balancing
    is required.

    3. The anon/file reclaim ratio may be affected. Pages about to be dirtied are
    distributed between zones and the fair zone allocation policy used to do
    something very similar for anon. The distribution is now different but not
    necessarily in any way that matters but it's still worth bearing in mind.

    VM statistic counters for reclaim decisions are zone-based. If the kernel
    is to reclaim on a per-node basis then we need to track per-node
    statistics but there is no infrastructure for that. The most notable
    change is that the old node_page_state is renamed to
    sum_zone_node_page_state. The new node_page_state takes a pglist_data and
    uses per-node stats but none exist yet. There is some renaming such as
    vm_stat to vm_zone_stat and the addition of vm_node_stat and the renaming
    of mod_state to mod_zone_state. Otherwise, this is mostly a mechanical
    patch with no functional change. There is a lot of similarity between the
    node and zone helpers which is unfortunate but there was no obvious way of
    reusing the code and maintaining type safety.

    Link: http://lkml.kernel.org/r/1467970510-21195-2-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Joonsoo Kim
    Cc: Hillf Danton
    Cc: Michal Hocko
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The helper early_page_nid_uninitialised() has been dead since commit
    974a786e63c9 ("mm, page_alloc: remove MIGRATE_RESERVE") so remove the
    dead code.

    Link: http://lkml.kernel.org/r/1468008031-3848-2-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
    changing cpuset's mems") has added TIF_MEMDIE and PF_EXITING check but
    it is checking the flag on the current task rather than the given one.

    This doesn't make much sense and it is actually wrong. If the current
    task which updates the nodemask of a cpuset got killed by the OOM killer
    then a part of the cpuset cgroup processes would have incompatible
    nodemask which is surprising to say the least.

    The comment suggests the intention was to skip oom victim or an exiting
    task so we should be checking the given task. But even then it would be
    layering violation because it is the memory allocator to interpret the
    TIF_MEMDIE meaning. Simply drop both checks. All tasks in the cpuset
    should simply follow the same mask.

    Link: http://lkml.kernel.org/r/1467029719-17602-3-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: David Rientjes
    Cc: Miao Xie
    Cc: Miao Xie
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • freezing_slow_path() is checking TIF_MEMDIE to skip OOM killed tasks.
    It is, however, checking the flag on the current task rather than the
    given one. This is really confusing because freezing() can be called
    also on !current tasks. It would end up working correctly for its main
    purpose because __refrigerator will be always called on the current task
    so the oom victim will never get frozen. But it could lead to
    surprising results when a task which is freezing a cgroup got oom killed
    because only part of the cgroup would get frozen. This is highly
    unlikely but worth fixing as the resulting code would be more clear
    anyway.

    Link: http://lkml.kernel.org/r/1467029719-17602-2-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: David Rientjes
    Cc: Miao Xie
    Cc: Miao Xie
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The caller __alloc_pages_direct_compact() already checked (order == 0)
    so there's no need to check again.

    Link: http://lkml.kernel.org/r/1465973568-3496-1-git-send-email-opensource.ganesh@gmail.com
    Signed-off-by: Ganesh Mahendran
    Cc: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ganesh Mahendran
     
  • Commit 23047a96d7cf ("mm: workingset: per-cgroup cache thrash
    detection") added a page->mem_cgroup lookup to the cache eviction,
    refault, and activation paths, as well as locking to the activation
    path, and the vm-scalability tests showed a regression of -23%.

    While the test in question is an artificial worst-case scenario that
    doesn't occur in real workloads - reading two sparse files in parallel
    at full CPU speed just to hammer the LRU paths - there is still some
    optimizations that can be done in those paths.

    Inline the lookup functions to eliminate calls. Also, page->mem_cgroup
    doesn't need to be stabilized when counting an activation; we merely
    need to hold the RCU lock to prevent the memcg from being freed.

    This cuts down on overhead quite a bit:

    23047a96d7cfcfca 063f6715e77a7be5770d6081fe
    ---------------- --------------------------
    %stddev %change %stddev
    \ | \
    21621405 +- 0% +11.3% 24069657 +- 2% vm-scalability.throughput

    [linux@roeck-us.net: drop unnecessary include file]
    [hannes@cmpxchg.org: add WARN_ON_ONCE()s]
    Link: http://lkml.kernel.org/r/20160707194024.GA26580@cmpxchg.org
    Link: http://lkml.kernel.org/r/20160624175101.GA3024@cmpxchg.org
    Reported-by: Ye Xiaolong
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Guenter Roeck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • We need to assure the comment is consistent with the code.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/1466171914-21027-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     
  • "mm, oom: fortify task_will_free_mem" has dropped task_lock around
    task_will_free_mem in oom_kill_process bacause it assumed that a
    potential race when the selected task exits will not be a problem as the
    oom_reaper will call exit_oom_victim.

    Tetsuo was objecting that nommu doesn't have oom_reaper so the race
    would be still possible. The code would be racy and lockup prone
    theoretically in other aspects without the oom reaper anyway so I didn't
    considered this a big deal. But it seems that further changes I am
    planning in this area will benefit from stable task->mm in this path as
    well. So let's drop find_lock_task_mm from task_will_free_mem and call
    it from under task_lock as we did previously. Just pull the task->mm !=
    NULL check inside the function.

    Link: http://lkml.kernel.org/r/1467201562-6709-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: Vladimir Davydov
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The only case where the oom_reaper is not triggered for the oom victim
    is when it shares the memory with a kernel thread (aka use_mm) or with
    the global init. After "mm, oom: skip vforked tasks from being
    selected" the victim cannot be a vforked task of the global init so we
    are left with clone(CLONE_VM) (without CLONE_SIGHAND). use_mm() users
    are quite rare as well.

    In order to help forward progress for the OOM killer, make sure that
    this really rare case will not get in the way - we do this by hiding the
    mm from the oom killer by setting MMF_OOM_REAPED flag for it.
    oom_scan_process_thread will ignore any TIF_MEMDIE task if it has
    MMF_OOM_REAPED flag set to catch these oom victims.

    After this patch we should guarantee forward progress for the OOM killer
    even when the selected victim is sharing memory with a kernel thread or
    global init as long as the victims mm is still alive.

    Link: http://lkml.kernel.org/r/1466426628-15074-11-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Oleg Nesterov
    Cc: Vladimir Davydov
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • oom_reaper relies on the mmap_sem for read to do its job. Many places
    which might block readers have been converted to use down_write_killable
    and that has reduced chances of the contention a lot. Some paths where
    the mmap_sem is held for write can take other locks and they might
    either be not prepared to fail due to fatal signal pending or too
    impractical to be changed.

    This patch introduces MMF_OOM_NOT_REAPABLE flag which gets set after the
    first attempt to reap a task's mm fails. If the flag is present after
    the failure then we set MMF_OOM_REAPED to hide this mm from the oom
    killer completely so it can go and chose another victim.

    As a result a risk of OOM deadlock when the oom victim would be blocked
    indefinetly and so the oom killer cannot make any progress should be
    mitigated considerably while we still try really hard to perform all
    reclaim attempts and stay predictable in the behavior.

    Link: http://lkml.kernel.org/r/1466426628-15074-10-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Oleg Nesterov
    Cc: Vladimir Davydov
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The 0-day robot has encountered the following:

    Out of memory: Kill process 3914 (trinity-c0) score 167 or sacrifice child
    Killed process 3914 (trinity-c0) total-vm:55864kB, anon-rss:1512kB, file-rss:1088kB, shmem-rss:25616kB
    oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26488kB
    oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
    oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
    oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:27296kB
    oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:28148kB

    oom_reaper is trying to reap the same task again and again.

    This is possible only when the oom killer is bypassed because of
    task_will_free_mem because we skip over tasks with MMF_OOM_REAPED
    already set during select_bad_process. Teach task_will_free_mem to skip
    over MMF_OOM_REAPED tasks as well because they will be unlikely to free
    anything more.

    Analyzed by Tetsuo Handa.

    Link: http://lkml.kernel.org/r/1466426628-15074-9-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Oleg Nesterov
    Cc: Tetsuo Handa
    Cc: Vladimir Davydov
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • task_will_free_mem is rather weak. It doesn't really tell whether the
    task has chance to drop its mm. 98748bd72200 ("oom: consider
    multi-threaded tasks in task_will_free_mem") made a first step into making
    it more robust for multi-threaded applications so now we know that the
    whole process is going down and probably drop the mm.

    This patch builds on top for more complex scenarios where mm is shared
    between different processes - CLONE_VM without CLONE_SIGHAND, or in kernel
    use_mm().

    Make sure that all processes sharing the mm are killed or exiting. This
    will allow us to replace try_oom_reaper by wake_oom_reaper because
    task_will_free_mem implies the task is reapable now. Therefore all paths
    which bypass the oom killer are now reapable and so they shouldn't lock up
    the oom killer.

    Link: http://lkml.kernel.org/r/1466426628-15074-8-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Oleg Nesterov
    Cc: Vladimir Davydov
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko