29 Jul, 2016

7 commits

  • If per-zone LRU accounting is available then there is no point
    approximating whether reclaim and compaction should retry based on pgdat
    statistics. This is effectively a revert of "mm, vmstat: remove zone
    and node double accounting by approximating retries" with the difference
    that inactive/active stats are still available. This preserves the
    history of why the approximation was retried and why it had to be
    reverted to handle OOM kills on 32-bit systems.

    Link: http://lkml.kernel.org/r/1469110261-7365-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Minchan Kim
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The number of LRU pages, dirty pages and writeback pages must be
    accounted for on both zones and nodes because of the reclaim retry
    logic, compaction retry logic and highmem calculations all depending on
    per-zone stats.

    Many lowmem allocations are immune from OOM kill due to a check in
    __alloc_pages_may_oom for (ac->high_zoneidx < ZONE_NORMAL) since commit
    03668b3ceb0c ("oom: avoid oom killer for lowmem allocations"). The
    exception is costly high-order allocations or allocations that cannot
    fail. If the __alloc_pages_may_oom avoids OOM-kill for low-order lowmem
    allocations then it would fall through to __alloc_pages_direct_compact.

    This patch will blindly retry reclaim for zone-constrained allocations
    in should_reclaim_retry up to MAX_RECLAIM_RETRIES. This is not ideal
    but without per-zone stats there are not many alternatives. The impact
    is that zone-constrained allocations may delay before considering the
    OOM killer.

    As there is no guarantee enough memory can ever be freed to satisfy
    compaction, this patch avoids retrying compaction for zone-constrained
    allocations.

    In combination, that means that the per-node stats can be used when
    deciding whether to continue reclaim using a rough approximation. While
    it is possible this will make the wrong decision on occasion, it will
    not infinite loop as the number of reclaim attempts is capped by
    MAX_RECLAIM_RETRIES.
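
    As a rough illustration of the retry cap described above, a minimal
    sketch (not the actual mm/page_alloc.c code; the helper name and the
    no_progress_loops counter are assumptions for illustration):

    /* Blindly retry reclaim for zone-constrained requests, bounded by
     * MAX_RECLAIM_RETRIES, since there are no per-zone stats to check. */
    static bool should_retry_zone_constrained(struct alloc_context *ac,
                                              int no_progress_loops)
    {
            if (ac->high_zoneidx < ZONE_NORMAL)
                    return no_progress_loops < MAX_RECLAIM_RETRIES;
            return false;
    }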

    The final step is calculating the number of dirtyable highmem pages.
    Those calculations only care about the global count of file pages in
    highmem, so this patch uses a global counter instead of per-zone stats,
    which is sufficient.

    In combination, this allows the per-zone LRU and dirty state counters to
    be removed.

    [mgorman@techsingularity.net: fix acct_highmem_file_pages()]
    Link: http://lkml.kernel.org/r/1468853426-12858-4-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-35-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Suggested by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • As reclaim is now per-node based, convert zone_reclaim to be
    node_reclaim. It is possible that a node will be reclaimed multiple
    times if it has multiple zones but this is unavoidable without caching
    all nodes traversed so far. The documentation and interface to
    userspace are the same from a configuration perspective and the behaviour
    will be similar unless the node-local allocation requests were also
    limited to lower zones.

    Link: http://lkml.kernel.org/r/1467970510-21195-24-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Memcg needs adjustment after moving LRUs to the node. Limits are
    tracked per memcg but the soft-limit excess is tracked per zone. As
    global page reclaim is based on the node, it is easy to imagine a
    situation where a zone soft limit is exceeded even though the memcg
    limit is fine.

    This patch moves the soft limit tree to the node. Technically, all the
    variable names should also change but people are already familiar with
    the meaning of "mz" even if "mn" would be a more appropriate name now.

    Link: http://lkml.kernel.org/r/1467970510-21195-15-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Earlier patches focused on having direct reclaim and kswapd use data
    that is node-centric for reclaiming but shrink_node() itself still uses
    too much zone information. This patch removes unnecessary zone-based
    information with the most important decision being whether to continue
    reclaim or not. Some memcg APIs are adjusted as a result even though
    memcg itself still uses some zone information.

    [mgorman@techsingularity.net: optimization]
    Link: http://lkml.kernel.org/r/1468588165-12461-2-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-14-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The balance gap was introduced to apply equal pressure to all zones when
    reclaiming for a higher zone. With node-based LRU, the need for the
    balance gap is removed and the code is dead so remove it.

    [vbabka@suse.cz: Also remove KSWAPD_ZONE_BALANCE_GAP_RATIO]
    Link: http://lkml.kernel.org/r/1467970510-21195-9-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages at both zone and node
    level. Most reclaim logic is based on the node counters but the retry
    logic uses the zone counters, which do not distinguish inactive and
    active sizes. It would be possible to leave the LRU counters on a
    per-zone basis only, but that is a heavier calculation across multiple
    cache lines and is much more frequent than the retry checks.

    Other than the LRU counters, this is mostly a mechanical patch but note
    that it introduces a number of anomalies. For example, the scans are
    per-zone but use per-node counters. We also mark a node as congested
    when a zone is congested. These anomalies cause problems that are fixed
    by later patches, but splitting the changes this way is easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    the LRU lists being node-based, there are two potential solutions:

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    keeps those pages on a separate list until a reclaim for highmem pages
    arrives that splices the highmem pages back onto the LRU. It potentially
    could be implemented similar to the UNEVICTABLE list.

    That would reduce the skip rate, with the potential corner case that
    highmem pages have to be scanned and reclaimed to free lowmem slab pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

21 May, 2016

1 commit

  • __alloc_pages_slowpath has traditionally relied on the direct reclaim
    and did_some_progress as an indicator that it makes sense to retry
    allocation rather than declaring OOM. shrink_zones had to rely on
    zone_reclaimable if shrink_zone didn't make any progress, to prevent a
    premature OOM killer invocation - the LRU might be full of dirty or
    writeback pages and direct reclaim cannot clean those up.

    zone_reclaimable allows rescanning the reclaimable lists several times
    and restarting if a page is freed. This is really subtle behavior and it
    might lead to a livelock when a single freed page keeps the allocator
    looping but the current task will not be able to allocate that single
    page. The OOM killer would be more appropriate than looping without any
    progress for an unbounded amount of time.

    This patch changes the OOM detection logic and pulls it out of
    shrink_zone, which is at too low a level to be appropriate for high-level
    decisions such as OOM, which is a per-zonelist property. It is
    __alloc_pages_slowpath which knows how many attempts have been made and
    what progress has been achieved so far, therefore it is the more
    appropriate place to implement this logic.

    The new heuristic is implemented in should_reclaim_retry helper called
    from __alloc_pages_slowpath. It tries to be more deterministic and
    easier to follow. It builds on an assumption that retrying makes sense
    only if the currently reclaimable memory + free pages would allow the
    current allocation request to succeed (as per __zone_watermark_ok) at
    least for one zone in the usable zonelist.

    This alone wouldn't be sufficient, though, because the writeback might
    get stuck and reclaimable pages might be pinned for a really long time
    or even depend on the current allocation context. Therefore there is a
    backoff mechanism implemented which reduces the reclaim target after
    each reclaim round without any progress. This means that we should
    eventually converge to only NR_FREE_PAGES as the target and fail on the
    wmark check and proceed to OOM. The backoff is simple and linear with
    1/16 of the reclaimable pages for each round without any progress. We
    are optimistic and reset counter for successful reclaim rounds.
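
    The heuristic and its backoff can be sketched roughly as follows (a
    simplified illustration of the idea above, not the exact
    should_reclaim_retry() implementation; MAX_RECLAIM_RETRIES is assumed to
    be 16 and the field names are indicative only):

    static bool should_reclaim_retry_sketch(struct alloc_context *ac, int order,
                                            int alloc_flags, int no_progress_loops)
    {
            struct zoneref *z;
            struct zone *zone;

            if (no_progress_loops > MAX_RECLAIM_RETRIES)
                    return false;

            for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
                                            ac->high_zoneidx, ac->nodemask) {
                    unsigned long available = zone_reclaimable_pages(zone);

                    /* linear backoff: 1/16 of the target per round without progress */
                    available -= DIV_ROUND_UP(no_progress_loops * available,
                                              MAX_RECLAIM_RETRIES);
                    available += zone_page_state_snapshot(zone, NR_FREE_PAGES);

                    /* retry while at least one zone could satisfy the request */
                    if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
                                            ac->classzone_idx, alloc_flags,
                                            available))
                            return true;
            }
            return false;   /* converged towards NR_FREE_PAGES only: go OOM */
    }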

    Costly high-order allocations mostly preserve their semantics: those
    without __GFP_REPEAT fail right away while those which have the flag set
    will back off after the amount of reclaimable pages reaches the
    equivalent of the requested order. The only difference is that if there
    was no progress during the reclaim we rely on the zone watermark check.
    This is a more logical thing to do than the previous 1<<order based
    retries.

    Signed-off-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Tetsuo Handa
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

13 May, 2016

1 commit

  • This will provide full accuracy for the mapcount calculation in
    write-protect faults, so page pinning will not get broken by
    false-positive copy-on-writes.

    total_mapcount() isn't the right calculation needed in
    reuse_swap_page(), so this introduces a page_trans_huge_mapcount()
    that is effectively the fully accurate return value for page_mapcount()
    when dealing with Transparent Hugepages; however, we only use
    page_trans_huge_mapcount() during COW faults, where it is strictly
    needed, due to its higher runtime cost.

    This also provides, at practically zero cost, the total_mapcount
    information which is needed to know if we can still relocate the page
    anon_vma to the local vma. If page_trans_huge_mapcount() returns 1 we
    can reuse the page no matter if it's a pte or a pmd_trans_huge
    triggering the fault, but we can only relocate the page anon_vma to
    the local vma->anon_vma if we're sure it's only this "vma" mapping the
    whole THP physical range.
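
    A sketch of how that reuse decision could look in a write-protect fault
    path (illustrative only; the real do_wp_page()/do_huge_pmd_wp_page()
    plumbing, locking and error handling are omitted):

    static bool can_reuse_thp(struct page *page, struct vm_area_struct *vma,
                              unsigned long address)
    {
            int total_mapcount;

            if (page_trans_huge_mapcount(page, &total_mapcount) > 1)
                    return false;   /* mapped elsewhere: must COW-copy */

            if (total_mapcount == 1)
                    /* sole mapper of the whole THP range: we may also take
                     * over the anon_vma for this vma */
                    page_move_anon_rmap(page, vma, address);

            return true;            /* safe to reuse without copying */
    }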

    Kirill A. Shutemov discovered the problem with moving the page
    anon_vma to the local vma->anon_vma in a previous version of this
    patch and another problem in the way page_move_anon_rmap() was called.

    Andrew Morton discovered that CONFIG_SWAP=n wouldn't build in a
    previous version, because reuse_swap_page must be a macro to call
    page_trans_huge_mapcount from swap.h, so this uses a macro again
    instead of an inline function. With this change at least it's a less
    dangerous usage than it was before, because "page" is used only once
    now, while with the previous code reuse_swap_page(page++) would have
    called page_mapcount on page+1 and it would have increased page twice
    instead of just once.

    Dean Luick noticed an uninitialized variable that could result in a
    rmap inefficiency for the non-THP case in a previous version.

    Mike Marciniszyn said:

    : Our RDMA tests are seeing an issue with memory locking that bisects to
    : commit 61f5d698cc97 ("mm: re-enable THP")
    :
    : The test program registers two rather large MRs (512M) and RDMA
    : writes data to a passive peer using the first and RDMA reads it back
    : into the second MR and compares that data. The sizes are chosen randomly
    : between 0 and 1024 bytes.
    :
    : The test will get through a few loops and then start failing the
    : data comparison.

    Signed-off-by: Andrea Arcangeli
    Reviewed-by: "Kirill A. Shutemov"
    Reviewed-by: Dean Luick
    Tested-by: Alex Williamson
    Tested-by: Mike Marciniszyn
    Tested-by: Josh Collier
    Cc: Marc Haber
    Cc: [4.5]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

06 May, 2016

1 commit

  • Cgroup2 currently doesn't have a per-cgroup swappiness setting. We
    might want to add one later - that's a different discussion - but until
    we do, the cgroups should always follow the system setting. Otherwise
    it will be unchangeably set to whatever the ancestor inherited from the
    system setting at the time of cgroup creation.
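
    A minimal sketch of the intended behaviour (assuming the usual
    mem_cgroup_swappiness() helper; this is an illustration of the rule, not
    necessarily the exact shape of the fix):

    static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
    {
            /* cgroup2 (default hierarchy): always follow the system setting */
            if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
                    return vm_swappiness;

            /* legacy hierarchy: root follows the system, children their own */
            if (mem_cgroup_disabled() || !memcg->css.parent)
                    return vm_swappiness;

            return memcg->swappiness;
    }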

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Vladimir Davydov
    Cc: [4.5]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

05 Apr, 2016

2 commits

  • Mostly direct substitution with occasional adjustment or removing
    outdated comments.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
    ago with promise that one day it will be possible to implement page
    cache with bigger chunks than PAGE_SIZE.

    This promise never materialized. And unlikely will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion on whether a
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

    - << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code where coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

21 Jan, 2016

4 commits

  • Swap cache pages are freed aggressively if swap is nearly full (>50%
    currently), because otherwise we are likely to stop scanning anonymous
    pages when we near the swap limit even if there are plenty of freeable
    swap cache pages. We should follow the same trend in the case of a memory
    cgroup, which has its own swap limit.
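
    For illustration, the heuristic referred to above is essentially a "more
    than half of the swap space is used" check, extended to the cgroup's own
    limit. A sketch (the helper names memcg_swap_used()/memcg_swap_limit()
    are illustrative, not the real API):

    static bool swap_nearly_full(struct mem_cgroup *memcg)
    {
            /* global swap space more than half used? */
            if (atomic_long_read(&nr_swap_pages) * 2 < total_swap_pages)
                    return true;

            /* or the memcg's own swap allowance more than half used? */
            return memcg_swap_used(memcg) * 2 > memcg_swap_limit(memcg);
    }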

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We don't scan anonymous memory if we have run out of swap, and neither
    should we do it when the memcg swap limit is hit, because swap-out is
    impossible anyway.
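
    A sketch of the resulting check in the scan-balance decision (the helper
    mem_cgroup_get_nr_swap_pages() is what this series provides; the
    surrounding get_scan_count() context is simplified):

    /* don't bother scanning anon LRUs if nothing can be swapped out anyway */
    if (!sc->may_swap || mem_cgroup_get_nr_swap_pages(memcg) <= 0) {
            scan_balance = SCAN_FILE;
            goto out;
    }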

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • The following patches will add more functions to the memcg section of
    include/linux/swap.h. Some of them will need values defined below the
    current location of the section. So let's move the section to the end of
    the file. No functional changes intended.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • This patchset introduces swap accounting to cgroup2.

    This patch (of 7):

    In the legacy hierarchy we charge memsw, which is dubious, because:

    - memsw.limit must be >= memory.limit, so it is impossible to limit
    swap usage less than memory usage. Taking into account the fact that
    the primary limiting mechanism in the unified hierarchy is
    memory.high while memory.limit is either left unset or set to a very
    large value, moving memsw.limit knob to the unified hierarchy would
    effectively make it impossible to limit swap usage according to the
    user preference.

    - memsw.usage != memory.usage + swap.usage, because a page occupying
    both swap entry and a swap cache page is charged only once to memsw
    counter. As a result, it is possible to effectively eat up to
    memory.limit of memory pages *and* memsw.limit of swap entries, which
    looks unexpected.

    That said, we should provide a different swap limiting mechanism for
    cgroup2.

    This patch adds mem_cgroup->swap counter, which charges the actual number
    of swap entries used by a cgroup. It is only charged in the unified
    hierarchy, while the legacy hierarchy memsw logic is left intact.

    The swap usage can be monitored using new memory.swap.current file and
    limited using memory.swap.max.

    Note, to charge swap resource properly in the unified hierarchy, we have
    to make swap_entry_free uncharge swap only when ->usage reaches zero, not
    just ->count, i.e. when all references to a swap entry, including the one
    taken by swap cache, are gone. This is necessary, because otherwise
    swap-in could result in uncharging swap even if the page is still in swap
    cache and hence still occupies a swap entry. At the same time, this
    shouldn't break memsw counter logic, where a page is never charged twice
    for using both memory and swap, because in case of legacy hierarchy we
    uncharge swap on commit (see mem_cgroup_commit_charge).

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

16 Jan, 2016

2 commits

  • MADV_FREE is a hint that it's okay to discard pages if there is memory
    pressure, and we use reclaimers (i.e., kswapd and direct reclaim) to free
    them, so there is no value in keeping them on the active anonymous LRU;
    this patch moves them to the head of the inactive LRU list.
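
    Conceptually, the move looks like the sketch below (illustrative only;
    the helper name and call site are not the actual mm/swap.c code, though
    del_page_from_lru_list()/add_page_to_lru_list() are the standard LRU
    primitives):

    static void move_lazyfree_to_inactive(struct page *page,
                                          struct lruvec *lruvec)
    {
            if (PageLRU(page) && PageAnon(page) && PageActive(page) &&
                !PageUnevictable(page)) {
                    del_page_from_lru_list(page, lruvec, LRU_ACTIVE_ANON);
                    ClearPageActive(page);
                    ClearPageReferenced(page);
                    /* the head: reclaimed before active pages, but after
                     * colder pages already near the inactive tail */
                    add_page_to_lru_list(page, lruvec, LRU_INACTIVE_ANON);
            }
    }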

    This means that MADV_FREE-ed pages which were living on the inactive
    list are reclaimed first because they are more likely to be cold rather
    than recently active pages.

    An arguable issue for the approach would be whether we should put the
    page at the head or the tail of the inactive list. I chose the head
    because the kernel cannot be sure whether a page is really cold or warm
    for every MADV_FREE usecase, but at least we know it's not *hot*, so
    landing at the inactive head would be a compromise for various usecases.

    This fixes suboptimal behavior of MADV_FREE where pages living on the
    active list would sit there for a long time even under memory pressure
    while the inactive list is reclaimed heavily. That basically breaks the
    whole purpose of using MADV_FREE to help the system free memory which
    might not be used.

    Signed-off-by: Minchan Kim
    Acked-by: Hugh Dickins
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: "James E.J. Bottomley"
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Chen Gang
    Cc: Chris Zankel
    Cc: Daniel Micay
    Cc: Darrick J. Wong
    Cc: David S. Miller
    Cc: Helge Deller
    Cc: Ivan Kokshaysky
    Cc: Jason Evans
    Cc: KOSAKI Motohiro
    Cc: Kirill A. Shutemov
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Kerrisk
    Cc: Mika Penttilä
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: Roland Dreier
    Cc: Russell King
    Cc: Will Deacon
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • With the new refcounting we will be able to map the same compound page
    with PTEs and PMDs. It requires adjustment of the conditions under which
    we can reuse the page on a write-protection fault.

    For a PTE fault we can't reuse the page if it's part of a huge page.

    For PMD we can only reuse the page if nobody else maps the huge page or
    its part. We could do that by checking page_mapcount() on each sub-page,
    but it's expensive.

    The cheaper way is to check that page_count() is equal to 1: every
    mapcount takes a page reference, so this way we can guarantee that the
    PMD is the only mapping.

    This approach can give a false negative if somebody has pinned the page,
    but that doesn't affect correctness.
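
    A sketch of the resulting reuse conditions (illustrative; not the exact
    do_wp_page()/do_huge_pmd_wp_page() code):

    /* PTE fault: never reuse a sub-page of a compound (huge) page */
    bool pte_can_reuse = PageAnon(page) && !PageTransCompound(page) &&
                         page_mapcount(page) == 1;

    /* PMD fault: every mapcount holds a reference, so a page_count() of
     * exactly 1 guarantees the PMD is the only mapping and nobody pinned it */
    bool pmd_can_reuse = page_count(page) == 1;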

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Jan, 2016

1 commit

  • The dirty balance reserve that dirty throttling has to consider is
    merely memory not available to userspace allocations. There is nothing
    writeback-specific about it. Generalize the name so that it's reusable
    outside of that context.

    Signed-off-by: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

09 Sep, 2015

3 commits

  • zswap_get_swap_cache_page and read_swap_cache_async have pretty much the
    same code, with the only significant difference being the return value
    and the usage of swap_readpage.

    I have added a helper, __read_swap_cache_async(), with the common code.
    Behavior change: zswap_get_swap_cache_page will now use
    radix_tree_maybe_preload instead of radix_tree_preload. It looks like
    this had not been changed before only because of the code duplication.

    Signed-off-by: Dmitry Safonov
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Tejun Heo
    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Cc: David Herrmann
    Cc: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Safonov
     
  • mem_cgroup structure is defined in mm/memcontrol.c currently which means
    that the code outside of this file has to use external API even for
    trivial access stuff.

    This patch exports the mem_cgroup structure with its dependencies and
    makes some of the exported functions inlines. This even helps to reduce
    the code size a bit (make defconfig + CONFIG_MEMCG=y):

    text data bss dec hex filename
    12355346 1823792 1089536 15268674 e8fb42 vmlinux.before
    12354970 1823792 1089536 15268298 e8f9ca vmlinux.after

    This is not much (370B) but better than nothing.

    We also save a function call in some hot paths like callers of
    mem_cgroup_count_vm_event which is used for accounting.

    The patch doesn't introduce any functional changes.

    [vdavykov@parallels.com: inline memcg_kmem_is_active]
    [vdavykov@parallels.com: do not expose type outside of CONFIG_MEMCG]
    [akpm@linux-foundation.org: memcontrol.h needs eventfd.h for eventfd_ctx]
    [akpm@linux-foundation.org: export mem_cgroup_from_task() to modules]
    Signed-off-by: Michal Hocko
    Reviewed-by: Vladimir Davydov
    Suggested-by: Johannes Weiner
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • We want to know per-process workingset size for smart memory management
    on userland and we use swap(ex, zram) heavily to maximize memory
    efficiency so workingset includes swap as well as RSS.

    On such a system, if there are lots of shared anonymous pages, it's
    really hard to figure out exactly how much memory each process consumes
    (i.e., rss + swap), e.g., on Android.

    This patch introduces a SwapPss field in /proc/<pid>/smaps so we can get
    a more exact workingset size per process.
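
    The accounting is the swap analogue of Pss: each swapped-out page is
    divided by the number of users sharing the swap entry. A sketch of the
    per-PTE accumulation in the smaps walker (simplified; PSS_SHIFT is the
    fixed-point shift already used for Pss):

    if (is_swap_pte(ptent)) {
            swp_entry_t entry = pte_to_swp_entry(ptent);
            int mapcount = swp_swapcount(entry);    /* users sharing the entry */

            mss->swap += PAGE_SIZE;
            if (mapcount >= 2)
                    /* proportional share, in PSS fixed-point units */
                    mss->swap_pss += (u64)(PAGE_SIZE << PSS_SHIFT) / mapcount;
            else
                    mss->swap_pss += (u64)PAGE_SIZE << PSS_SHIFT;
    }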

    Bongkyu tested it. Result is below.

    1. 50M used swap
    SwapTotal: 461976 kB
    SwapFree: 411192 kB

    $ adb shell cat /proc/*/smaps | grep "SwapPss:" | awk '{sum += $2} END {print sum}';
    48236
    $ adb shell cat /proc/*/smaps | grep "Swap:" | awk '{sum += $2} END {print sum}';
    141184

    2. 240M used swap
    SwapTotal: 461976 kB
    SwapFree: 216808 kB

    $ adb shell cat /proc/*/smaps | grep "SwapPss:" | awk '{sum += $2} END {print sum}';
    230315
    $ adb shell cat /proc/*/smaps | grep "Swap:" | awk '{sum += $2} END {print sum}';
    1387744

    [akpm@linux-foundation.org: simplify kunmap_atomic() call]
    Signed-off-by: Minchan Kim
    Reported-by: Bongkyu Kim
    Tested-by: Bongkyu Kim
    Cc: Hugh Dickins
    Cc: Sergey Senozhatsky
    Cc: Jonathan Corbet
    Cc: Jerome Marchand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

29 Jul, 2015

1 commit

  • Currently we have two different ways to signal an I/O error on a BIO:

    (1) by clearing the BIO_UPTODATE flag
    (2) by returning a Linux errno value to the bi_end_io callback

    The first one has the drawback of only communicating a single possible
    error (-EIO), and the second one has the drawback of not being persistent
    when bios are queued up, and of not being passed along from child to parent
    bio in the ever more popular chaining scenario. Having both mechanisms
    available has the additional drawback of utterly confusing driver authors
    and introducing bugs where various I/O submitters only deal with one of
    them, and the others have to add boilerplate code to deal with both kinds
    of error returns.

    So add a new bi_error field to store an errno value directly in struct
    bio and remove the existing mechanisms to clean all this up.
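
    After this change the completion path records the errno on the bio
    itself and bi_end_io no longer takes an error argument (a sketch of the
    resulting pattern; the callback name is illustrative):

    static void my_end_io(struct bio *bio)
    {
            if (bio->bi_error)
                    pr_err("I/O failed: %d\n", bio->bi_error);
            bio_put(bio);
    }

    /* submitter/driver side: record the error, then complete the bio */
    bio->bi_error = -EIO;
    bio_endio(bio);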

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: NeilBrown
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

19 May, 2015

1 commit

  • Stop abusing struct page functionality and the swap end_io handler, and
    instead add a modified version of the blk-lib.c bio_batch helpers.

    Also move the block I/O code into swap.c as they are directly tied into
    each other.

    Signed-off-by: Christoph Hellwig
    Tested-by: Pavel Machek
    Tested-by: Ming Lin
    Acked-by: Pavel Machek
    Acked-by: Rafael J. Wysocki
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

16 Apr, 2015

1 commit

  • "deactivate_page" was created for file invalidation so it has too
    specific logic for file-backed pages. So, let's rename the function to a
    file-specific one and yield the generic name.

    Signed-off-by: Minchan Kim
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Wang, Yalin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

12 Feb, 2015

1 commit


14 Dec, 2014

1 commit

  • swp_entry_t being defined in include/linux/swap.h instead of
    include/linux/mm_types.h causes cyclic include dependency later when
    include/linux/page_cgroup.h is included from writeback path. Move the
    definition to include/linux/mm_types.h.

    While at it, reformat the comment above it.

    Signed-off-by: Tejun Heo
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

10 Oct, 2014

2 commits

  • In a memcg with even just moderate cache pressure, success rates for
    transparent huge page allocations drop to zero, wasting a lot of effort
    that the allocator puts into assembling these pages.

    The reason for this is that the memcg reclaim code was never designed for
    higher-order charges. It reclaims in small batches until there is room
    for at least one page. Huge page charges only succeed when these batches
    add up over a series of huge faults, which is unlikely under any
    significant load involving order-0 allocations in the group.

    Remove that loop on the memcg side in favor of passing the actual reclaim
    goal to direct reclaim, which is already set up and optimized to meet
    higher-order goals efficiently.
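
    Roughly, the charge path now asks reclaim for the whole shortfall in one
    call instead of looping until a single page fits (a sketch of the idea,
    simplified from the actual try_charge() retry loop):

    /* reclaim nr_pages worth of memory from the over-limit memcg at once */
    nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
                                                gfp_mask, may_swap);

    if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
            goto retry;     /* enough room for the whole charge now */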

    This brings memcg's THP policy in line with the system policy: if the
    allocator painstakingly assembles a hugepage, memcg will at least make an
    honest effort to charge it. As a result, transparent hugepage allocation
    rates amid cache activity are drastically improved:

    vanilla patched
    pgalloc 4717530.80 ( +0.00%) 4451376.40 ( -5.64%)
    pgfault 491370.60 ( +0.00%) 225477.40 ( -54.11%)
    pgmajfault 2.00 ( +0.00%) 1.80 ( -6.67%)
    thp_fault_alloc 0.00 ( +0.00%) 531.60 (+100.00%)
    thp_fault_fallback 749.00 ( +0.00%) 217.40 ( -70.88%)

    [ Note: this may in turn increase memory consumption from internal
    fragmentation, which is an inherent risk of transparent hugepages.
    Some setups may have to adjust the memcg limits accordingly to
    accommodate this - or, if the machine is already packed to capacity,
    disable the transparent huge page feature. ]

    Signed-off-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Dave Hansen
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The deprecation warnings for the scan_unevictable interface are triggered
    by scripts doing `sysctl -a | grep something else'. This is annoying and
    not helpful.

    The interface has been defunct since 264e56d8247e ("mm: disable user
    interface to manually rescue unevictable pages"), which was in 2011, and
    there haven't been any reports of usecases for it, only reports that the
    deprecation warnings are annoying. It's unlikely that anybody is using
    this interface specifically at this point, so remove it.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

09 Aug, 2014

2 commits

  • The memcg uncharging code that is involved towards the end of a page's
    lifetime - truncation, reclaim, swapout, migration - is impressively
    complicated and fragile.

    Because anonymous and file pages were always charged before they had their
    page->mapping established, uncharges had to happen when the page type
    could still be known from the context; as in unmap for anonymous, page
    cache removal for file and shmem pages, and swap cache truncation for swap
    pages. However, these operations happen well before the page is actually
    freed, and so a lot of synchronization is necessary:

    - Charging, uncharging, page migration, and charge migration all need
    to take a per-page bit spinlock as they could race with uncharging.

    - Swap cache truncation happens during both swap-in and swap-out, and
    possibly repeatedly before the page is actually freed. This means
    that the memcg swapout code is called from many contexts that make
    no sense and it has to figure out the direction from page state to
    make sure memory and memory+swap are always correctly charged.

    - On page migration, the old page might be unmapped but then reused,
    so memcg code has to prevent untimely uncharging in that case.
    Because this code - which should be a simple charge transfer - is so
    special-cased, it is not reusable for replace_page_cache().

    But now that charged pages always have a page->mapping, introduce
    mem_cgroup_uncharge(), which is called after the final put_page(), when we
    know for sure that nobody is looking at the page anymore.

    For page migration, introduce mem_cgroup_migrate(), which is called after
    the migration is successful and the new page is fully rmapped. Because
    the old page is no longer uncharged after migration, prevent double
    charges by decoupling the page's memcg association (PCG_USED and
    pc->mem_cgroup) from the page holding an actual charge. The new bits
    PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
    to the new page during migration.

    mem_cgroup_migrate() is suitable for replace_page_cache() as well,
    which gets rid of mem_cgroup_replace_page_cache(). However, care
    needs to be taken because both the source and the target page can
    already be charged and on the LRU when fuse is splicing: grab the page
    lock on the charge moving side to prevent changing pc->mem_cgroup of a
    page under migration. Also, the lruvecs of both pages change as we
    uncharge the old and charge the new during migration, and putback may
    race with us, so grab the lru lock and isolate the pages iff on LRU to
    prevent races and ensure the pages are on the right lruvec afterward.

    Swap accounting is massively simplified: because the page is no longer
    uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
    transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
    before the final put_page() in page reclaim.

    Finally, page_cgroup changes are now protected by whatever protection the
    page itself offers: anonymous pages are charged under the page table lock,
    whereas page cache insertions, swapin, and migration hold the page lock.
    Uncharging happens under full exclusion with no outstanding references.
    Charging and uncharging also ensure that the page is off-LRU, which
    serializes against charge migration. Remove the very costly page_cgroup
    lock and set pc->flags non-atomically.
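
    In terms of the API, the new lifetime hooks described above boil down to
    two calls (a sketch of where they conceptually sit; the real call sites
    are in the page freeing and migration paths):

    /* freeing: uncharge only after the final reference is gone and nobody
     * can look at the page anymore */
    mem_cgroup_uncharge(page);

    /* migration: transfer the charge once the new page is fully rmapped */
    mem_cgroup_migrate(oldpage, newpage, true);     /* lrucare */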

    [mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
    [vdavydov@parallels.com: fix flags definition]
    Signed-off-by: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Tested-by: Jet Chen
    Acked-by: Michal Hocko
    Tested-by: Felipe Balbi
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • These patches rework memcg charge lifetime to integrate more naturally
    with the lifetime of user pages. This drastically simplifies the code and
    reduces charging and uncharging overhead. The most expensive part of
    charging and uncharging is the page_cgroup bit spinlock, which is removed
    entirely after this series.

    Here are the top-10 profile entries of a stress test that reads a 128G
    sparse file on a freshly booted box, without even a dedicated cgroup (i.e.
    executing in the root memcg). Before:

    15.36% cat [kernel.kallsyms] [k] copy_user_generic_string
    13.31% cat [kernel.kallsyms] [k] memset
    11.48% cat [kernel.kallsyms] [k] do_mpage_readpage
    4.23% cat [kernel.kallsyms] [k] get_page_from_freelist
    2.38% cat [kernel.kallsyms] [k] put_page
    2.32% cat [kernel.kallsyms] [k] __mem_cgroup_commit_charge
    2.18% kswapd0 [kernel.kallsyms] [k] __mem_cgroup_uncharge_common
    1.92% kswapd0 [kernel.kallsyms] [k] shrink_page_list
    1.86% cat [kernel.kallsyms] [k] __radix_tree_lookup
    1.62% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn

    After:

    15.67% cat [kernel.kallsyms] [k] copy_user_generic_string
    13.48% cat [kernel.kallsyms] [k] memset
    11.42% cat [kernel.kallsyms] [k] do_mpage_readpage
    3.98% cat [kernel.kallsyms] [k] get_page_from_freelist
    2.46% cat [kernel.kallsyms] [k] put_page
    2.13% kswapd0 [kernel.kallsyms] [k] shrink_page_list
    1.88% cat [kernel.kallsyms] [k] __radix_tree_lookup
    1.67% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn
    1.39% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk
    1.30% cat [kernel.kallsyms] [k] kfree

    As you can see, the memcg footprint has shrunk quite a bit.

    text data bss dec hex filename
    37970 9892 400 48262 bc86 mm/memcontrol.o.old
    35239 9892 400 45531 b1db mm/memcontrol.o

    This patch (of 4):

    The memcg charge API charges pages before they are rmapped - i.e. have an
    actual "type" - and so every callsite needs its own set of charge and
    uncharge functions to know what type is being operated on. Worse,
    uncharge has to happen from a context that is still type-specific, rather
    than at the end of the page's lifetime with exclusive access, and so
    requires a lot of synchronization.

    Rewrite the charge API to provide a generic set of try_charge(),
    commit_charge() and cancel_charge() transaction operations, much like
    what's currently done for swap-in:

    mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
    pages from the memcg if necessary.

    mem_cgroup_commit_charge() commits the page to the charge once it
    has a valid page->mapping and PageAnon() reliably tells the type.

    mem_cgroup_cancel_charge() aborts the transaction.

    This reduces the charge API and enables subsequent patches to
    drastically simplify uncharging.

    As pages need to be committed after rmap is established but before they
    are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
    additions again. Revive lru_cache_add_active_or_unevictable().
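
    At a callsite such as an anonymous fault, the transactional pattern
    described above looks roughly like this (a sketch; error handling and
    locking are trimmed):

    struct mem_cgroup *memcg;

    if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg))
            goto oom;                               /* reservation failed */

    /* establish the rmap so the page has a "type", then commit */
    page_add_new_anon_rmap(page, vma, address);
    mem_cgroup_commit_charge(page, memcg, false);   /* page not on LRU yet */
    lru_cache_add_active_or_unevictable(page, vma);

    /* on any failure between try_charge and commit:
     *      mem_cgroup_cancel_charge(page, memcg); */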

    [hughd@google.com: fix shmem_unuse]
    [hughd@google.com: Add comments on the private use of -EAGAIN]
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Signed-off-by: Hugh Dickins
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

07 Aug, 2014

1 commit

  • Do we really need an exported alias for __SetPageReferenced()? Its
    callers better know what they're doing, in which case the page would not
    be already marked referenced. Kill init_page_accessed() and just use
    __SetPageReferenced() inline.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Dave Hansen
    Cc: Prabhakar Lad
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

05 Jun, 2014

6 commits

  • Currently, we use (zone->managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1)
    / KSWAPD_ZONE_BALANCE_GAP_RATIO to avoid a zero gap value. It's better to
    use the DIV_ROUND_UP macro for neater code and clearer meaning.

    Besides, the gap value is calculated against the per-zone "managed pages",
    not "present pages". This patch also corrects the comment and do some
    rephrasing.

    Signed-off-by: Jianyu Zhan
    Acked-by: Rik van Riel
    Acked-by: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     
  • aops->write_begin may allocate a new page and make it visible only to have
    mark_page_accessed called almost immediately after. Once the page is
    visible, the atomic operations are necessary, which is noticeable
    overhead when writing to an in-memory filesystem like tmpfs but should
    also be noticeable with fast storage. The objective of the patch is to
    initialise the accessed information with non-atomic operations before the
    page is visible.

    The bulk of filesystems directly or indirectly use
    grab_cache_page_write_begin or find_or_create_page for the initial
    allocation of a page cache page. This patch adds an init_page_accessed()
    helper which behaves like the first call to mark_page_accessed() but may
    be called before the page is visible and can be done non-atomically.

    The primary APIs of concern in this case are the following and are used
    by most filesystems.

    find_get_page
    find_lock_page
    find_or_create_page
    grab_cache_page_nowait
    grab_cache_page_write_begin

    All of them are very similar in detail, so the patch creates a core
    helper pagecache_get_page() which takes a flags parameter that affects
    its behaviour, such as whether the page should be marked accessed or not.
    The old API is preserved but is basically a thin wrapper around this core
    function.
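
    For example, the old lookup/create calls can be kept as thin wrappers
    over the new helper (a sketch; the FGP_* flag names follow the
    convention this patch introduces, but the exact signature and flag set
    shown here are illustrative):

    static inline struct page *find_or_create_page(struct address_space *mapping,
                                                   pgoff_t index, gfp_t gfp)
    {
            return pagecache_get_page(mapping, index,
                                      FGP_LOCK | FGP_ACCESSED | FGP_CREAT,
                                      gfp);
    }

    static inline struct page *find_get_page(struct address_space *mapping,
                                             pgoff_t offset)
    {
            return pagecache_get_page(mapping, offset, 0, 0);
    }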

    Each of the filesystems are then updated to avoid calling
    mark_page_accessed when it is known that the VM interfaces have already
    done the job. There is a slight snag in that the timing of the
    mark_page_accessed() has now changed so in rare cases it's possible a page
    gets to the end of the LRU as PageReferenced whereas previously it might
    have been repromoted. This is expected to be rare but it's worth the
    filesystem people thinking about it in case they see a problem with the
    timing change. It is also the case that some filesystems may be marking
    pages accessed that previously did not but it makes sense that filesystems
    have consistent behaviour in this regard.

    The test case used to evaluate this is a simple dd of a large file done
    multiple times with the file deleted on each iteration. The size of the
    file is 1/10th physical memory to avoid dirty page balancing. In the
    async case it will be possible that the workload completes without even
    hitting the disk and will have variable results but highlight the impact
    of mark_page_accessed for async IO. The sync results are expected to be
    more stable. The exception is tmpfs where the normal case is for the "IO"
    to not hit the disk.

    The test machine was single socket and UMA to avoid any scheduling or NUMA
    artifacts. Throughput and wall times are presented for sync IO, only wall
    times are shown for async as the granularity reported by dd and the
    variability is unsuitable for comparison. As async results were variable
    due to writeback timings, I'm only reporting the maximum figures. The sync
    results were stable enough to make the mean and stddev uninteresting.

    The performance results are reported based on a run with no profiling.
    Profile data is based on a separate run with oprofile running.

    async dd
    3.15.0-rc3 3.15.0-rc3
    vanilla accessed-v2
    ext3 Max elapsed 13.9900 ( 0.00%) 11.5900 ( 17.16%)
    tmpfs Max elapsed 0.5100 ( 0.00%) 0.4900 ( 3.92%)
    btrfs Max elapsed 12.8100 ( 0.00%) 12.7800 ( 0.23%)
    ext4 Max elapsed 18.6000 ( 0.00%) 13.3400 ( 28.28%)
    xfs Max elapsed 12.5600 ( 0.00%) 2.0900 ( 83.36%)

    The XFS figure is a bit strange as it managed to avoid a worst case by
    sheer luck but the average figures looked reasonable.

    samples percentage
    ext3 86107 0.9783 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    ext3 23833 0.2710 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    ext3 5036 0.0573 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    ext4 64566 0.8961 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    ext4 5322 0.0713 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    ext4 2869 0.0384 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    xfs 62126 1.7675 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    xfs 1904 0.0554 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    xfs 103 0.0030 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    btrfs 10655 0.1338 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    btrfs 2020 0.0273 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    btrfs 587 0.0079 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
    tmpfs 59562 3.2628 vmlinux-3.15.0-rc4-vanilla mark_page_accessed
    tmpfs 1210 0.0696 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
    tmpfs 94 0.0054 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed

    [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer]
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Tested-by: Prabhakar Lad
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • cold is a bool, make it one. Make the likely case the "if" part of the
    block instead of the else as according to the optimisation manual this is
    preferred.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Originally get_swap_page() started iterating through the singly-linked
    list of swap_info_structs using swap_list.next or highest_priority_index,
    which both were intended to point to the highest priority active swap
    target that was not full. The first patch in this series changed the
    singly-linked list to a doubly-linked list, and removed the logic to start
    at the highest priority non-full entry; it starts scanning at the highest
    priority entry each time, even if the entry is full.

    Replace the manually ordered swap_list_head with a plist, swap_active_head.
    Add a new plist, swap_avail_head. The original swap_active_head plist
    contains all active swap_info_structs, as before, while the new
    swap_avail_head plist contains only swap_info_structs that are active and
    available, i.e. not full. Add a new spinlock, swap_avail_lock, to protect
    the swap_avail_head list.

    Mel Gorman suggested using plists since they internally handle ordering
    the list entries based on priority, which is exactly what swap was doing
    manually. All the ordering code is now removed, and swap_info_struct
    entries are simply added to their corresponding plist and automatically
    ordered correctly.

    Using a new plist for available swap_info_structs simplifies and
    optimizes get_swap_page(), which no longer has to iterate over full
    swap_info_structs. Using a new spinlock for swap_avail_head plist
    allows each swap_info_struct to add or remove themselves from the
    plist when they become full or not-full; previously they could not
    do so because the swap_info_struct->lock is held when they change from
    full to not-full (and vice versa), and the swap_lock protecting the main
    swap_active_head must be ordered before any swap_info_struct->lock.
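
    With the new list, the fast path of get_swap_page() becomes roughly an
    iteration over only the usable devices (a sketch; locking details and
    the full retry logic are trimmed):

    spin_lock(&swap_avail_lock);
    plist_for_each_entry_safe(si, next, &swap_avail_head, avail_list) {
            /* rotate equal-priority entries so they are used round-robin */
            plist_requeue(&si->avail_list, &swap_avail_head);
            spin_unlock(&swap_avail_lock);

            offset = scan_swap_map(si, SWAP_HAS_CACHE);
            if (offset)
                    return swp_entry(si->type, offset);

            spin_lock(&swap_avail_lock);    /* device filled up meanwhile */
    }
    spin_unlock(&swap_avail_lock);
    return (swp_entry_t) {0};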

    Signed-off-by: Dan Streetman
    Acked-by: Mel Gorman
    Cc: Shaohua Li
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Dan Streetman
    Cc: Michal Hocko
    Cc: Christian Ehrhardt
    Cc: Weijie Yang
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Bob Liu
    Cc: Paul Gortmaker
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • The logic controlling the singly-linked list of swap_info_struct entries
    for all active, i.e. swapon'ed, swap targets is rather complex, because:

    - it stores the entries in priority order
    - there is a pointer to the highest priority entry
    - there is a pointer to the highest priority not-full entry
    - there is a highest_priority_index variable set outside the swap_lock
    - swap entries of equal priority should be used equally

    this complexity leads to bugs such as: https://lkml.org/lkml/2014/2/13/181
    where different priority swap targets are incorrectly used equally.

    That bug probably could be solved with the existing singly-linked lists,
    but I think it would only add more complexity to the already difficult to
    understand get_swap_page() swap_list iteration logic.

    The first patch changes from a singly-linked list to a doubly-linked list
    using list_heads; the highest_priority_index and related code are removed
    and get_swap_page() starts each iteration at the highest priority
    swap_info entry, even if it's full. While this does introduce unnecessary
    list iteration (i.e. Schlemiel the painter's algorithm) in the case where
    one or more of the highest priority entries are full, the iteration and
    manipulation code is much simpler and behaves correctly re: the above bug;
    and the fourth patch removes the unnecessary iteration.

    The second patch adds some minor plist helper functions; nothing new
    really, just functions to match existing regular list functions. These
    are used by the next two patches.

    The third patch adds plist_requeue(), which is used by get_swap_page() in
    the next patch - it performs the requeueing of same-priority entries
    (which moves the entry to the end of its priority in the plist), so that
    all equal-priority swap_info_structs get used equally.

    The fourth patch converts the main list into a plist, and adds a new plist
    that contains only swap_info entries that are both active and not full.
    As Mel suggested using plists allows removing all the ordering code from
    swap - plists handle ordering automatically. The list naming is also
    clarified now that there are two lists, with the original list changed
    from swap_list_head to swap_active_head and the new list named
    swap_avail_head. A new spinlock is also added for the new list, so
    swap_info entries can be added or removed from the new list immediately as
    they become full or not full.

    This patch (of 4):

    Replace the singly-linked list tracking active, i.e. swapon'ed,
    swap_info_struct entries with a doubly-linked list using struct
    list_heads. Simplify the logic iterating and manipulating the list of
    entries, especially get_swap_page(), by using standard list_head
    functions, and removing the highest priority iteration logic.

    The change fixes the bug:
    https://lkml.org/lkml/2014/2/13/181
    in which different priority swap entries after the highest priority entry
    are incorrectly used equally in pairs. The swap behavior is now as
    advertised, i.e. different priority swap entries are used in order, and
    equal priority swap targets are used concurrently.

    Signed-off-by: Dan Streetman
    Acked-by: Mel Gorman
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Dan Streetman
    Cc: Michal Hocko
    Cc: Christian Ehrhardt
    Cc: Weijie Yang
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Bob Liu
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Paul Gortmaker
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • In mm/swap.c, __lru_cache_add() is exported, but actually there are no
    users outside this file.

    This patch unexports __lru_cache_add(), and makes it static. It also
    exports lru_cache_add_file(), as it is used by cifs and fuse, which can
    be loaded as modules.

    Signed-off-by: Jianyu Zhan
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: Shaohua Li
    Cc: Bob Liu
    Cc: Seth Jennings
    Cc: Joonsoo Kim
    Cc: Rafael Aquini
    Cc: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Khalid Aziz
    Cc: Christoph Hellwig
    Reviewed-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     

04 Apr, 2014

2 commits

  • Previously, page cache radix tree nodes were freed after reclaim emptied
    out their page pointers. But now reclaim stores shadow entries in their
    place, which are only reclaimed when the inodes themselves are
    reclaimed. This is problematic for bigger files that are still in use
    after they have a significant amount of their cache reclaimed, without
    any of those pages actually refaulting. The shadow entries will just
    sit there and waste memory. In the worst case, the shadow entries will
    accumulate until the machine runs out of memory.

    To get this under control, the VM will track radix tree nodes
    exclusively containing shadow entries on a per-NUMA node list. Per-NUMA
    rather than global because we expect the radix tree nodes themselves to
    be allocated node-locally and we want to reduce cross-node references of
    otherwise independent cache workloads. A simple shrinker will then
    reclaim these nodes on memory pressure.

    A few things need to be stored in the radix tree node to implement the
    shadow node LRU and allow tree deletions coming from the list:

    1. There is no index available that would describe the reverse path
    from the node up to the tree root, which is needed to perform a
    deletion. To solve this, encode in each node its offset inside the
    parent. This can be stored in the unused upper bits of the same
    member that stores the node's height at no extra space cost.

    2. The number of shadow entries needs to be counted in addition to the
    regular entries, to quickly detect when the node is ready to go to
    the shadow node LRU list. The current entry count is an unsigned
    int but the maximum number of entries is 64, so a shadow counter
    can easily be stored in the unused upper bits.

    3. Tree modification needs tree lock and tree root, which are located
    in the address space, so store an address_space backpointer in the
    node. The parent pointer of the node is in a union with the 2-word
    rcu_head, so the backpointer comes at no extra cost as well.

    4. The node needs to be linked to an LRU list, which requires a list
    head inside the node. This does increase the size of the node, but
    it does not change the number of objects that fit into a slab page.
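
    Put together, the node gains roughly the following shape (a sketch of
    the fields implied by the four points above; names and exact packing are
    illustrative, not the verbatim lib/radix-tree.c definition):

    struct radix_tree_node {
            unsigned int    path;   /* height in the low bits, offset inside
                                       the parent in the unused upper bits */
            unsigned int    count;  /* regular entries; the shadow entry
                                       count is packed into the upper bits */
            union {
                    struct {
                            /* used while the node is in the tree */
                            struct radix_tree_node *parent;
                            void *private_data;     /* address_space backpointer */
                    };
                    /* used while the node is being freed */
                    struct rcu_head rcu_head;
            };
            /* linkage for the per-NUMA-node shadow node LRU */
            struct list_head private_list;
            void __rcu      *slots[RADIX_TREE_MAP_SIZE];
    };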

    [akpm@linux-foundation.org: export the right function]
    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Bob Liu
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The VM maintains cached filesystem pages on two types of lists. One
    list holds the pages recently faulted into the cache, the other list
    holds pages that have been referenced repeatedly on that first list.
    The idea is to prefer reclaiming young pages over those that have shown
    to benefit from caching in the past. We call the recently used list the
    "inactive list" and the frequently used list the "active list". In
    practice, however, this scheme was ultimately not significantly better
    than a FIFO policy and still thrashed cache based on eviction speed,
    rather than actual demand for cache.

    This patch solves one half of the problem by decoupling the ability to
    detect working set changes from the inactive list size. By maintaining
    a history of recently evicted file pages it can detect frequently used
    pages with an arbitrarily small inactive list size, and subsequently
    apply pressure on the active list based on actual demand for cache, not
    just overall eviction speed.

    Every zone maintains a counter that tracks inactive list aging speed.
    When a page is evicted, a snapshot of this counter is stored in the
    now-empty page cache radix tree slot. On refault, the minimum access
    distance of the page can be assessed, to evaluate whether the page
    should be part of the active list or not.
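
    The bookkeeping can be sketched as below (simplified; the shadow-entry
    packing and the real eviction/refault helpers are more involved, and
    pack_shadow()/unpack_eviction_age()/store_shadow() are illustrative
    names):

    /* eviction: remember how "old" the inactive list was at eviction time */
    shadow = pack_shadow(zone, atomic_long_read(&zone->inactive_age));
    store_shadow(mapping, index, shadow);

    /* refault: a page evicted less than one active-list-worth of aging ago
     * would have stayed resident with a bigger inactive list - activate it */
    refault_distance = atomic_long_read(&zone->inactive_age) -
                       unpack_eviction_age(shadow);
    if (refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE))
            SetPageActive(page);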

    This fixes the VM's blindness towards working set changes in excess of
    the inactive list. And it's the foundation to further improve the
    protection ability and reduce the minimum inactive list size of 50%.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Reviewed-by: Bob Liu
    Cc: Andrea Arcangeli
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: KOSAKI Motohiro
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Metin Doslu
    Cc: Michel Lespinasse
    Cc: Ozgun Erdogan
    Cc: Peter Zijlstra
    Cc: Roman Gushchin
    Cc: Ryan Mallon
    Cc: Tejun Heo
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner