13 Jan, 2012

17 commits

  • There is sometimes confusion between the global putback_lru_pages() in
    migrate.c and the static putback_lru_pages() in vmscan.c: rename the
    latter putback_inactive_pages(): it helps shrink_inactive_list() much as
    move_active_pages_to_lru() helps shrink_active_list().

    Remove unused scan_control arg from putback_inactive_pages() and from
    update_isolated_counts(). Move clear_active_flags() inside
    update_isolated_counts(). Move NR_ISOLATED accounting up into
    shrink_inactive_list() itself, so the balance is clearer.

    Do the spin_lock_irq() before calling putback_inactive_pages() and
    spin_unlock_irq() after return from it, so that it better matches
    update_isolated_counts() and move_active_pages_to_lru().

    Signed-off-by: Hugh Dickins
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The isolate_pages() level in vmscan.c offers little but indirection: merge
    it into isolate_lru_pages() as the compiler does, and use the names
    nr_to_scan and nr_scanned in each case.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Mostly we use "enum lru_list lru": change those few "l"s to "lru"s.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Replace pagevecs in putback_lru_pages() and move_active_pages_to_lru()
    by lists of pages_to_free: then apply Konstantin Khlebnikov's
    free_hot_cold_page_list() to them instead of pagevec_release().

    Which simplifies the flow (no need to drop and retake lock whenever
    pagevec fills up) and reduces stale addresses in stack backtraces
    (which often showed through the pagevecs); but more importantly,
    removes another 120 bytes from the deepest stacks in page reclaim.
    Although I've not recently seen an actual stack overflow here with
    a vanilla kernel, move_active_pages_to_lru() has often featured in
    deep backtraces.

    However, free_hot_cold_page_list() does not handle compound pages
    (nor need it: a Transparent HugePage would have been split by the
    time it reaches the call in shrink_page_list()), but it is possible
    for putback_lru_pages() or move_active_pages_to_lru() to be left
    holding the last reference on a THP, so they must exclude the unlikely
    compound case before putting a page on pages_to_free.
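
    A minimal sketch of the putback tail this describes; the local names
    (zone, lru, pages_to_free) and the cold argument passed to
    free_hot_cold_page_list() are assumptions for illustration, not quotes
    from the patch:

        if (put_page_testzero(page)) {
                __ClearPageLRU(page);
                __ClearPageActive(page);
                del_page_from_lru_list(zone, page, lru);

                if (unlikely(PageCompound(page))) {
                        /* rare THP case: drop the lock and use the
                         * compound destructor instead of the list */
                        spin_unlock_irq(&zone->lru_lock);
                        (*get_compound_page_dtor(page))(page);
                        spin_lock_irq(&zone->lru_lock);
                } else
                        list_add(&page->lru, &pages_to_free);
        }

        /* ... after the loop, with the lock dropped: */
        free_hot_cold_page_list(&pages_to_free, 1);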

    Remove pagevec_strip(), its work now done in move_active_pages_to_lru().
    The pagevec in scan_mapping_unevictable_pages() remains in mm/vmscan.c,
    but that is never on the reclaim path, and cannot be replaced by a list.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Konstantin Khlebnikov
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • If compaction can proceed for a given zone, shrink_zones() does not
    reclaim any more pages from it. After commit [e0c2327: vmscan: abort
    reclaim/compaction if compaction can proceed], do_try_to_free_pages()
    tries to finish as soon as possible once one zone can compact.

    This was intended to prevent slabs being shrunk unnecessarily but there
    are side-effects. One is that a small zone that is ready for compaction
    will abort reclaim even if the chances of successfully allocating a THP
    from that zone are small. It also means that reclaim can return too early
    even though sc->nr_to_reclaim pages were not reclaimed.

    This partially reverts the commit until it is proven that slabs are really
    being shrunk unnecessarily but preserves the check to return 1 to avoid
    OOM if reclaim was aborted prematurely.

    [aarcange@redhat.com: This patch replaces a revert from Andrea]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • In commit e0887c19 ("vmscan: limit direct reclaim for higher order
    allocations"), Rik noted that reclaim was too aggressive when THP was
    enabled. In his initial patch he used the number of free pages to decide
    if reclaim should abort for compaction. My feedback was that reclaim and
    compaction should be using the same logic when deciding if reclaim should
    be aborted.

    Unfortunately, this had the effect of reducing THP success rates when the
    workload included something like streaming reads that continually
    allocated pages. The window during which compaction could run and return
    a THP was too small.

    This patch combines Rik's two patches together. compaction_suitable() is
    still used to decide if reclaim should be aborted to allow compaction to
    proceed. However, it will also ensure that there is a reasonable buffer of
    free pages available. This improves upon the THP allocation success rates
    but bounds the number of pages that are freed for compaction.
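
    Roughly, the combined test can be pictured as below; the function name
    and the exact watermark arithmetic are assumptions made for
    illustration only:

        /* Sketch: stop reclaiming from this zone once compaction looks
         * plausible AND a buffer of free pages sits above the low
         * watermark (2UL << order extra pages in this illustration). */
        static bool ready_for_compaction(struct zone *zone, int order)
        {
                unsigned long watermark;

                if (compaction_suitable(zone, order) == COMPACT_SKIPPED)
                        return false;           /* keep reclaiming */

                watermark = low_wmark_pages(zone) + (2UL << order);
                return zone_watermark_ok(zone, 0, watermark, 0, 0);
        }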

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware")
    noted that compaction does not migrate dirty or writeback pages and that
    it was meaningless to pick the page and re-add it to the LRU list. This
    had to be partially reverted because some dirty pages can be migrated by
    compaction without blocking.

    This patch updates "mm: compaction: make isolate_lru_page" by skipping
    over pages that migration has no chance of migrating, to minimise LRU
    disruption.
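
    The skip amounts to something like the sketch below inside the LRU
    isolation check; the isolate-mode flag names and the mapping test are a
    best-effort reconstruction, not the literal diff:

        if (mode & (ISOLATE_CLEAN | ISOLATE_ASYNC_MIGRATE)) {
                /* writeback pages can never be migrated here */
                if (PageWriteback(page))
                        return ret;

                if (PageDirty(page)) {
                        struct address_space *mapping;

                        if (mode & ISOLATE_CLEAN)
                                return ret;     /* clean pages only */

                        /* async compaction: a dirty page is only worth
                         * isolating if it can be migrated without
                         * blocking, i.e. it has a migratepage method */
                        mapping = page_mapping(page);
                        if (!mapping || !mapping->a_ops->migratepage)
                                return ret;
                }
        }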

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Reviewed-by: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • During direct reclaim it is possible that reclaim will be aborted so that
    compaction can be attempted to satisfy a high-order allocation. If this
    decision is made before any pages are reclaimed, it is possible that 0 is
    returned to the page allocator potentially triggering an OOM. This has
    not been observed but it is a possibility so this patch addresses it.
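
    In other words, the tail of do_try_to_free_pages() gains a check along
    these lines (the flag name is illustrative):

        if (sc->nr_reclaimed)
                return sc->nr_reclaimed;

        /* Aborted reclaim in favour of compaction? Don't report
         * failure, or the allocator may declare OOM prematurely. */
        if (aborted_reclaim)
                return 1;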

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Properly take into account if we isolated a compound page during the lumpy
    scan in reclaim and skip over the tail pages when encountered. This
    corrects the values given to the tracepoint for number of lumpy pages
    isolated and will avoid breaking the loop early when compound pages
    smaller than the requested allocation size are encountered.
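
    Conceptually the fix is the accounting sketched below; the variable
    names stand in for the lumpy-scan loop's locals and are not quoted from
    the patch:

        if (PageCompound(cursor_page)) {
                /* count the whole compound page once and hop the scan
                 * cursor over its tail pages rather than treating each
                 * tail as a separate failure */
                unsigned int nr = 1U << compound_order(cursor_page);

                nr_lumpy_taken += nr;
                pfn += nr - 1;
        }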

    [mgorman@suse.de: Updated changelog]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • In trace_mm_vmscan_lru_isolate(), we don't output 'file' information to
    the trace event, and it is a bit inconvenient for the user to get the
    real information (like that pasted below):

    mm_vmscan_lru_isolate: isolate_mode=2 order=0 nr_requested=32
    nr_scanned=32 nr_taken=32 contig_taken=0 contig_dirty=0 contig_failed=0

    'active' can be obtained by analyzing mode (thanks go to Minchan and
    Mel), so this patch adds 'file' to the trace event. It now looks like:

    mm_vmscan_lru_isolate: isolate_mode=2 order=0 nr_requested=32
    nr_scanned=32 nr_taken=32 contig_taken=0 contig_dirty=0 contig_failed=0
    file=0

    Signed-off-by: Tao Ma
    Acked-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Reviewed-by: Minchan Kim
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tao Ma
     
  • Signed-off-by: Johannes Weiner
    Acked-by: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Balbir Singh
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Now that all code that operated on global per-zone LRU lists is
    converted to operate on per-memory cgroup LRU lists instead, there is no
    reason to keep the double-LRU scheme around any longer.

    The pc->lru member is removed and page->lru is linked directly to the
    per-memory cgroup LRU lists, which removes two pointers from a
    descriptor that exists for every page frame in the system.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Hugh Dickins
    Signed-off-by: Ying Han
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Having a unified structure with an LRU list set for both global zones and
    per-memcg zones lets the code that deals with LRU lists, and does not
    care about the container itself, stay simple.
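
    The unified container is essentially just the set of LRU list heads,
    along the lines of:

        struct lruvec {
                struct list_head lists[NR_LRU_LISTS];
        };

    Both struct zone and the per-memcg per-zone structure can then embed
    one of these and share the same list-manipulation helpers.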

    Once the per-memcg LRU lists directly link struct pages, the isolation
    function and all other list manipulations are shared between the memcg
    case and the global LRU case.

    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The global per-zone LRU lists are about to go away on memcg-enabled
    kernels, so global reclaim must be able to find its pages on the
    per-memcg LRU lists.

    Since the LRU pages of a zone are distributed over all existing memory
    cgroups, a scan target for a zone is complete when all memory cgroups
    are scanned for their proportional share of a zone's memory.
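
    A hedged sketch of the per-zone walk this implies; mem_cgroup_iter()
    here names the memcg hierarchy iterator, while the helper name, the
    scan_control field and the mem_cgroup_zone usage are assumptions for
    illustration:

        static void shrink_zone_sketch(int priority, struct zone *zone,
                                       struct scan_control *sc)
        {
                struct mem_cgroup *root = sc->target_mem_cgroup;
                struct mem_cgroup *memcg;

                /* visit every memcg for its share of this zone's LRUs */
                memcg = mem_cgroup_iter(root, NULL, NULL);
                do {
                        struct mem_cgroup_zone mz = {
                                .mem_cgroup = memcg,
                                .zone = zone,
                        };

                        shrink_mem_cgroup_zone(priority, &mz, sc);
                        memcg = mem_cgroup_iter(root, memcg, NULL);
                } while (memcg);
        }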

    The forced scanning of small scan targets from kswapd is limited to
    zones marked unreclaimable, otherwise kswapd can quickly overreclaim by
    force-scanning the LRU lists of multiple memory cgroups.

    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memory cgroup limit reclaim and traditional global pressure reclaim will
    soon share the same code to reclaim from a hierarchical tree of memory
    cgroups.

    In preparation of this, move the two right next to each other in
    shrink_zone().

    The mem_cgroup_hierarchical_reclaim() polymath is split into a soft
    limit reclaim function, which still does hierarchy walking on its own,
    and a limit (shrinking) reclaim function, which relies on generic
    reclaim code to walk the hierarchy.

    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memory cgroup hierarchies are currently handled completely outside of
    the traditional reclaim code, which is invoked with a single memory
    cgroup as an argument for the whole call stack.

    Subsequent patches will switch this code to do hierarchical reclaim, so
    there needs to be a distinction between a) the memory cgroup that is
    triggering reclaim due to hitting its limit and b) the memory cgroup
    that is being scanned as a child of a).

    This patch introduces a struct mem_cgroup_zone that contains the
    combination of the memory cgroup and the zone being scanned, which is
    then passed down the stack instead of the zone argument.
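
    The container itself is tiny; roughly (field names as one would expect
    from the description):

        struct mem_cgroup_zone {
                struct mem_cgroup *mem_cgroup;
                struct zone *zone;
        };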

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The traditional zone reclaim code is scanning the per-zone LRU lists
    during direct reclaim and kswapd, and the per-zone per-memory cgroup LRU
    lists when reclaiming on behalf of a memory cgroup limit.

    Subsequent patches will convert the traditional reclaim code to reclaim
    exclusively from the per-memory cgroup LRU lists. As a result, using
    the predicate for which LRU list is scanned will no longer be
    appropriate to tell global reclaim from limit reclaim.

    This patch adds a global_reclaim() predicate to tell direct/kswapd
    reclaim from memory cgroup limit reclaim and substitutes it in all
    places where currently scanning_global_lru() is used for that.
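
    A sketch of what such a predicate boils down to; the scan_control field
    name is an assumption of the sketch:

        static bool global_reclaim(struct scan_control *sc)
        {
        #ifdef CONFIG_CGROUP_MEM_RES_CTLR
                /* no target memcg means direct reclaim or kswapd */
                return !sc->target_mem_cgroup;
        #else
                return true;
        #endif
        }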

    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

11 Jan, 2012

7 commits

  • It is not the tag page but the cursor page that we should process, and it
    looks like a typo.

    Signed-off-by: Hillf Danton
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Hugh Dickins
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
  • Lumpy reclaim does well to stop at a PageAnon when there's no swap, but
    better is to stop at any PageSwapBacked, which includes shmem/tmpfs too.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • It's pointless to continue reclaiming when we have no swap space and lots
    of anon pages in the inactive list.

    Without this patch, it is possible when swap is disabled to continue
    trying to reclaim when there are only anonymous pages in the system even
    though that will not make any progress.
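
    Conceptually, the continue-reclaim decision stops counting inactive
    anon pages as candidates once swap is gone, along these lines (the
    helper is a vmscan-internal one whose exact signature is assumed here):

        inactive_lru_pages = zone_nr_lru_pages(zone, sc, LRU_INACTIVE_FILE);
        if (nr_swap_pages > 0)
                inactive_lru_pages += zone_nr_lru_pages(zone, sc,
                                                        LRU_INACTIVE_ANON);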

    Signed-off-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • If we need to know the use case, the caller's program name is critically
    important. Show it.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Johannes Weiner
    Acked-by: David Rientjes
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • This patch adds the helper free_hot_cold_page_list() to free a list of
    0-order pages. It frees pages directly from the list without a temporary
    page-vector. It also calls trace_mm_pagevec_free() to simulate
    pagevec_free() behaviour.
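
    Its shape is essentially a loop over the list; a sketch (the real
    version also emits the trace event mentioned above):

        void free_hot_cold_page_list(struct list_head *list, int cold)
        {
                struct page *page, *next;

                list_for_each_entry_safe(page, next, list, lru)
                        free_hot_cold_page(page, cold);
        }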

    bloat-o-meter:

    add/remove: 1/1 grow/shrink: 1/3 up/down: 267/-295 (-28)
    function                    old     new   delta
    free_hot_cold_page_list       -     264    +264
    get_page_from_freelist     2129    2132      +3
    __pagevec_free              243     239      -4
    split_free_page             380     373      -7
    release_pages               606     510     -96
    free_page_list              188       -    -188

    Signed-off-by: Konstantin Khlebnikov
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Acked-by: Minchan Kim
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Logic added in commit 8cab4754d24a0 ("vmscan: make mapped executable pages
    the first class citizen") was noticeably weakened in commit
    645747462435d84 ("vmscan: detect mapped file pages used only once").

    Currently these pages can become "first class citizens" only after their
    second usage. After this patch page_check_references() will activate them
    after the first usage, and executable code gets a yet better chance to
    stay in memory.
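
    In page_check_references() terms the change amounts to something like
    the following sketch; the surrounding context is reconstructed rather
    than quoted, and the referenced_ptes > 1 test belongs to the sibling
    patch further down:

        if (referenced_ptes) {
                if (PageAnon(page))
                        return PAGEREF_ACTIVATE;

                SetPageReferenced(page);

                if (referenced_page || referenced_ptes > 1)
                        return PAGEREF_ACTIVATE;

                /* the rule added here: activate executable file-backed
                 * pages on first use */
                if (vm_flags & VM_EXEC)
                        return PAGEREF_ACTIVATE;

                return PAGEREF_KEEP;
        }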

    Signed-off-by: Konstantin Khlebnikov
    Cc: Pekka Enberg
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Wu Fengguang
    Cc: Johannes Weiner
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Shaohua Li
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Commit 645747462435 ("vmscan: detect mapped file pages used only once")
    greatly decreases the lifetime of single-use mapped file pages.
    Unfortunately it also decreases the lifetime of all shared mapped file
    pages, because after commit bf3f3bc5e7347 ("mm: don't mark_page_accessed
    in fault path") the page-fault handler does not mark the page active or
    even referenced.

    Thus page_check_references() activates a file page only if it was used
    twice while it sits on the inactive list, whereas it activates anon pages
    after the first access. The inactive list can be small enough that the
    reclaimer accidentally throws away a widely used page if it wasn't used
    twice within a short period.

    After this patch page_check_references() also activates a mapped file
    page at the first inactive-list scan if the page is already used multiple
    times via several ptes.

    I found this while trying to fix a degradation in rhel6 (~2.6.32)
    relative to rhel5 (~2.6.18). There is a complete mess with >100
    web/mail/spam/ftp containers; they share all their files, but there are
    a lot of anonymous pages: ~500MB of shared file-mapped memory and
    15-20GB of non-shared anonymous memory. In this situation major page
    faults are very costly, because all containers share the same pages.
    Under my load the kernel created disproportionate pressure on the file
    memory compared with the anonymous; they were equal only if I raised
    swappiness up to 150 =)

    These patches didn't actually help a lot with my problem, but I saw a
    noticeable (10-20 times) reduction in the count and average time of
    major page faults in file-mapped areas.

    Actually both patches are fixes for commit v2.6.33-5448-g6457474, because
    it was aimed at one scenario (singly used pages) but breaks the logic in
    other scenarios (shared and/or executable pages).

    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Pekka Enberg
    Acked-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Wu Fengguang
    Cc: Johannes Weiner
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Shaohua Li
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

07 Jan, 2012

1 commit

  • This resolves the conflict in the arch/arm/mach-s3c64xx/s3c6400.c file,
    and it fixes the build error in the arch/x86/kernel/microcode_core.c
    file, that the merge did not catch.

    The microcode_core.c patch was provided by Stephen Rothwell
    who was invaluable in the merge issues involved
    with the large sysdev removal process in the driver-core tree.

    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

22 Dec, 2011

1 commit

  • This moves the 'memory sysdev_class' over to a regular 'memory' subsystem
    and converts the devices to regular devices. The sysdev drivers are
    implemented as subsystem interfaces now.

    After all sysdev classes are ported to regular driver core entities, the
    sysdev implementation will be entirely removed from the kernel.

    Signed-off-by: Kay Sievers
    Signed-off-by: Greg Kroah-Hartman

    Kay Sievers
     

09 Dec, 2011

2 commits

  • Use atomic-long operations instead of looping around cmpxchg().

    [akpm@linux-foundation.org: massage atomic.h inclusions]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • A shrinker function can return -1, meaning that it cannot do anything
    without a risk of deadlock. For example prune_super() does this if it
    cannot grab a superblock reference, even if nr_to_scan=0. Currently we
    interpret this -1 as a ULONG_MAX-sized shrinker and evaluate `total_scan'
    accordingly. So the next time around this shrinker can cause really big
    pressure. Let's skip such shrinkers instead.

    Also make total_scan signed, otherwise the check (total_scan < 0) below
    never works.
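
    Inside shrink_slab() the skip boils down to something like this sketch
    (do_shrinker_shrink() is the existing wrapper around shrinker->shrink()):

        long total_scan;        /* signed, so (total_scan < 0) can work */
        long max_pass = do_shrinker_shrink(shrinker, shrink, 0);

        /* -1 (or any negative value) means "can't do anything without
         * risking deadlock": skip this shrinker rather than treating
         * the value as a huge object count. */
        if (max_pass <= 0)
                continue;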

    Signed-off-by: Konstantin Khlebnikov
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

07 Nov, 2011

1 commit

  • * 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: Add a 'reason' to wb_writeback_work
    writeback: send work item to queue_io, move_expired_inodes
    writeback: trace event balance_dirty_pages
    writeback: trace event bdi_dirty_ratelimit
    writeback: fix ppc compile warnings on do_div(long long, unsigned long)
    writeback: per-bdi background threshold
    writeback: dirty position control - bdi reserve area
    writeback: control dirty pause time
    writeback: limit max dirty pause time
    writeback: IO-less balance_dirty_pages()
    writeback: per task dirty rate limit
    writeback: stabilize bdi->dirty_ratelimit
    writeback: dirty rate control
    writeback: add bg_threshold parameter to __bdi_update_bandwidth()
    writeback: dirty position control
    writeback: account per-bdi accumulated dirtied pages

    Linus Torvalds
     

03 Nov, 2011

1 commit

  • Reclaim decides to skip scanning an active list when the corresponding
    inactive list is above a certain size in comparison, so as to leave the
    assumed working set alone while there are still enough reclaim candidates
    around.

    The memcg implementation of comparing those lists instead reports whether
    the whole memcg is low on the requested type of inactive pages,
    considering all nodes and zones.

    This can lead to an oversized active list not being scanned because of the
    state of the other lists in the memcg, as well as an active list being
    scanned while its corresponding inactive list has enough pages.

    Not only is this wrong, it's also a scalability hazard, because the global
    memory state over all nodes and zones has to be gathered for each memcg
    and zone scanned.

    Make these calculations purely based on the size of the two LRU lists
    that are actually affected by the outcome of the decision.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Reviewed-by: Minchan Kim
    Reviewed-by: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

01 Nov, 2011

10 commits

  • If compaction can proceed, shrink_zones() stops doing any work but its
    callers still call shrink_slab() which raises the priority and potentially
    sleeps. This is unnecessary and wasteful so this patch aborts direct
    reclaim/compaction entirely if compaction can proceed.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Acked-by: Johannes Weiner
    Cc: Josh Boyer
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When suffering from memory fragmentation due to unfreeable pages, THP page
    faults will repeatedly try to compact memory. Due to the unfreeable
    pages, compaction fails.

    Needless to say, at that point page reclaim also fails to create free
    contiguous 2MB areas. However, that doesn't stop the current code from
    trying, over and over again, and freeing a minimum of 4MB (2UL <<
    sc->order pages) at every single invocation.

    This resulted in my 12GB system having 2-3GB free memory, a corresponding
    amount of used swap and very sluggish response times.

    This can be avoided by having the direct reclaim code not reclaim from
    zones that already have plenty of free memory available for compaction.

    If compaction still fails due to unmovable memory, doing additional
    reclaim will only hurt the system, not help.

    [jweiner@redhat.com: change comment to explain the order check]
    Signed-off-by: Rik van Riel
    Acked-by: Johannes Weiner
    Acked-by: Mel Gorman
    Cc: Andrea Arcangeli
    Reviewed-by: Minchan Kim
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • When a race between putback_lru_page() and shmem_lock() with lock=0
    happens, the program execution order is as follows, but the clear_bit on
    processor #1 could be reordered to right before the spin_unlock on
    processor #1. Then the page would be stranded on the unevictable list.

    Processor #0 (putback_lru_page)    Processor #1 (shmem_lock, lock=0)
    spin_lock
    SetPageLRU
    spin_unlock
                                       clear_bit(AS_UNEVICTABLE)
                                       spin_lock
                                       if PageLRU()
                                           if !test_bit(AS_UNEVICTABLE)
                                               move evictable list
    smp_mb
    if !test_bit(AS_UNEVICTABLE)
        move evictable list
                                       spin_unlock

    But pagevec_lookup() in scan_mapping_unevictable_pages() has
    rcu_read_[un]lock(), which keeps the clear_bit from being reordered past
    the test_bit(AS_UNEVICTABLE) on processor #1, so this problem never
    actually happens. But it's an unexpected side effect and we should solve
    this problem properly.

    This patch adds a barrier after mapping_clear_unevictable.
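
    That is, in the shmem_lock(lock=0) path, roughly (the exact barrier
    primitive is an assumption of this sketch):

        mapping_clear_unevictable(mapping);
        /* make the cleared AS_UNEVICTABLE bit visible before any
         * racing putback_lru_page() rechecks it */
        smp_mb();
        scan_mapping_unevictable_pages(mapping);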

    I didn't hit this problem myself but just found it during review.

    Signed-off-by: Minchan Kim
    Acked-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • At one point, anonymous pages were supposed to go on the unevictable list
    when no swap space was configured, and the idea was to manually rescue
    those pages after adding swap and making them evictable again. But
    nowadays, swap-backed pages on the anon LRU list are not scanned without
    available swap space anyway, so there is no point in moving them to a
    separate list anymore.

    The manual rescue could also be used in case pages were stranded on the
    unevictable list due to race conditions. But the code has been around for
    a while now and newly discovered bugs should be properly reported and
    dealt with instead of relying on such a manual fixup.

    In addition to the lack of a usecase, the sysfs interface to rescue pages
    from a specific NUMA node has been broken since its introduction, so it's
    unlikely that anybody ever relied on that.

    This patch removes the functionality behind the sysctl and the
    node-interface and emits a one-time warning when somebody tries to access
    either of them.

    Signed-off-by: Johannes Weiner
    Reported-by: Kautuk Consul
    Reviewed-by: Minchan Kim
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • write_scan_unevictable_node() checks the value req returned by
    strict_strtoul() and returns 1 if req is 0.

    However, when strict_strtoul() returns 0, it means successful conversion
    of buf to unsigned long.

    Due to this, the function was not proceeding to scan the zones for
    unevictable pages even though we write a valid value to the
    scan_unevictable_pages sys file.

    Change this check slightly to check both for an invalid value in buf and
    for a 0 value stored in res after a successful conversion via
    strict_strtoul(). In both cases, we do not perform the scanning of this
    node's zones.

    Signed-off-by: Kautuk Consul
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kautuk Consul
     
  • There are two places where kswapd reads the pgdat request: one on return
    from a successful balance, the other when it is woken from sleep. The
    new_order and new_classzone_idx represent the balance input order and
    classzone_idx.

    But currently new_order and new_classzone_idx are not assigned after
    kswapd_try_to_sleep(), which causes a bug in the following scenario.

    1: after a successful balance, kswapd goes to sleep, and new_order = 0;
    new_classzone_idx = __MAX_NR_ZONES - 1;

    2: kswapd is woken up with order = 3 and classzone_idx = ZONE_NORMAL

    3: while balance_pgdat() is running, a new balance wakeup happens with
    order = 5 and classzone_idx = ZONE_NORMAL

    4: the first wakeup (order = 3) finishes successfully and returns
    order = 3, but new_order is still 0, so this balancing is treated as a
    failed balance, and the second, tighter balancing is missed.

    So, to avoid the above problem, new_order and new_classzone_idx need to
    be assigned for the later successful comparison.

    Signed-off-by: Alex Shi
    Acked-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Tested-by: Pádraig Brady
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex,Shi
     
  • In commit 215ddd66 ("mm: vmscan: only read new_classzone_idx from pgdat
    when reclaiming successfully"), Mel Gorman said kswapd had better sleep
    after an unsuccessful balancing if a tighter reclaim request is pending
    during the balancing. But in the following scenario kswapd does something
    that does not match our expectation. This patch fixes the issue.

    1, a pgdat request A (classzone_idx, order = 3) is read
    2, balance_pgdat()
    3, during the balancing, a new pgdat request B (classzone_idx, order = 5)
    is placed
    4, balance_pgdat() returns, but is treated as failed since the returned
    order = 0
    5, the pgdat of request A is assigned to balance_pgdat() and balancing is
    done again, while the expected behaviour is that kswapd should try to
    sleep.

    Signed-off-by: Alex Shi
    Reviewed-by: Tim Chen
    Acked-by: Mel Gorman
    Tested-by: Pádraig Brady
    Cc: Rik van Riel
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex,Shi
     
  • It's possible that a zone's watermark is ok when entering the
    balance_pgdat() loop while the zone is within the requested
    classzone_idx. Count pages from such a zone into `balanced'. In this
    way we can avoid shrinking zones too much for high-order allocations.

    Signed-off-by: Shaohua Li
    Acked-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • When direct reclaim encounters a dirty page, it gets recycled around the
    LRU for another cycle. This patch marks the page PageReclaim similar to
    deactivate_page() so that the page gets reclaimed almost immediately after
    the page gets cleaned. This is to avoid reclaiming clean pages that are
    younger than a dirty page encountered at the end of the LRU that might
    have been something like a use-once page.
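
    Sketched against shrink_page_list(), the idea is roughly the following
    (the condition is simplified and keep_locked stands for the function's
    existing label):

        if (PageDirty(page) && !current_is_kswapd()) {
                /* don't write it from direct reclaim; tag it so the end
                 * of writeback reclaims it immediately instead of
                 * letting it cycle around the whole LRU again */
                SetPageReclaim(page);
                goto keep_locked;
        }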

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Jan Kara
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Alex Elder
    Cc: Theodore Ts'o
    Cc: Chris Mason
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Workloads that are allocating frequently and writing files place a large
    number of dirty pages on the LRU. With use-once logic, it is possible for
    them to reach the end of the LRU quickly requiring the reclaimer to scan
    more to find clean pages. Ordinarily, processes that are dirtying memory
    will get throttled by dirty balancing but this is a global heuristic and
    does not take into account that LRUs are maintained on a per-zone basis.
    This can lead to a situation whereby reclaim is scanning heavily, skipping
    over a large number of pages under writeback and recycling them around the
    LRU consuming CPU.

    This patch checks how many of the pages isolated from the LRU were dirty
    and under writeback. If a percentage of them are under writeback, the
    process will be throttled if a backing device or the zone is congested.
    Note that this applies whether it is anonymous or file-backed pages that
    are under writeback, meaning that swapping is potentially throttled.
    This is intentional because, if the swap device is congested, scanning
    more pages and dispatching more IO is not going to help matters.

    The percentage that must be under writeback depends on the priority. At
    default priority, all of them must be; at DEF_PRIORITY-1, 50% of them
    must be; at DEF_PRIORITY-2, 25%, etc. i.e. as pressure increases, the
    greater the likelihood that the process will get throttled, allowing the
    flusher threads to make some progress.
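
    The priority scaling corresponds to a threshold like the one sketched
    here; the congestion wait is a plausible way the throttling is applied,
    and the names are assumptions:

        /* throttle when nr_writeback reaches nr_taken scaled by
         * priority: 100% at DEF_PRIORITY, 50% at DEF_PRIORITY-1,
         * 25% at DEF_PRIORITY-2, and so on */
        if (nr_writeback && nr_writeback >=
                        (nr_taken >> (DEF_PRIORITY - priority)))
                wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);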

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Acked-by: Johannes Weiner
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Jan Kara
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Alex Elder
    Cc: Theodore Ts'o
    Cc: Chris Mason
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman