25 Mar, 2012

1 commit


23 Mar, 2012

1 commit

  • Adjusting cc715d99e529 "mm: vmscan: forcibly scan highmem if there are
    too many buffer_heads pinning highmem" for -stable reveals that it was
    slightly wrong once on top of fe2c2a106663 "vmscan: reclaim at order 0
    when compaction is enabled", which specifically adds testorder for the
    zone_watermark_ok_safe() test.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

22 Mar, 2012

10 commits

  • Reset the reclaim mode in shrink_active_list() to RECLAIM_MODE_SINGLE |
    RECLAIM_MODE_ASYNC. (The sync/async flag is used only in
    shrink_page_list() and does not affect shrink_active_list().)

    Currently shrink_active_list() sometimes works in lumpy-reclaim mode if
    RECLAIM_MODE_LUMPYRECLAIM is left over from an earlier
    shrink_inactive_list(). Meanwhile, in age_active_anon(),
    sc->reclaim_mode is simply zero. The current behaviour is therefore
    needlessly complex and confusing, and looks like a bug.

    In general, shrink_active_list() populates the inactive list for the
    next shrink_inactive_list(). Lumpy shrink_inactive_list() isolates
    pages around the chosen one from both the active and inactive lists.
    So there is no reason for lumpy isolation in shrink_active_list()
    (a minimal sketch of the reset follows this entry).

    See also: https://lkml.org/lkml/2012/3/15/583

    Signed-off-by: Konstantin Khlebnikov
    Proposed-by: Hugh Dickins
    Acked-by: Johannes Weiner
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
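
    A minimal sketch of the reset described above, using illustrative flag
    values and a pared-down scan_control; this is not the kernel's exact
    code:

    /* Illustrative flag values; the real RECLAIM_MODE_* bits live in
     * mm/vmscan.c of that era. */
    enum reclaim_mode_bits {
        RECLAIM_MODE_SINGLE       = 1u << 0,
        RECLAIM_MODE_ASYNC        = 1u << 1,
        RECLAIM_MODE_SYNC         = 1u << 2,
        RECLAIM_MODE_LUMPYRECLAIM = 1u << 3,
    };

    struct scan_control_model {
        unsigned int reclaim_mode;
    };

    static void reset_reclaim_mode(struct scan_control_model *sc)
    {
        sc->reclaim_mode = RECLAIM_MODE_SINGLE | RECLAIM_MODE_ASYNC;
    }

    static void shrink_active_list_model(struct scan_control_model *sc)
    {
        /* Clear any RECLAIM_MODE_LUMPYRECLAIM left over from an earlier
         * shrink_inactive_list(), so active-list scanning never isolates
         * pages in lumpy mode. */
        reset_reclaim_mode(sc);
        /* ... isolate pages, then deactivate or rotate them ... */
    }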
     
  • Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
    changing cpuset's mems") wins a super prize for the largest number of
    memory barriers entered into fast paths for one commit.

    [get|put]_mems_allowed is incredibly heavy, with pairs of full memory
    barriers inserted into a number of hot paths. This was detected while
    investigating a large page allocator slowdown introduced some time
    after 2.6.32. The largest portion of this overhead was shown by
    oprofile to be at an mfence introduced by this commit into the page
    allocator hot path.

    For extra style points, the commit introduced the use of yield() in an
    implementation of what looks like a spinning mutex.

    This patch replaces the full memory barriers on both read and write
    sides with a sequence counter, with just read barriers on the fast-path
    side. This is much cheaper on some architectures, including x86. The
    main bulk of the patch is the retry logic for when the nodemask changes
    in a manner that can cause a false failure (a simplified model of the
    read-side pattern follows this entry).

    While updating the nodemask, a check is made to see if a false failure
    is a risk. If it is, the sequence number gets bumped and parallel
    allocators will briefly stall while the nodemask update takes place.

    In a page fault test microbenchmark, oprofile samples from
    __alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The
    actual results were

                               3.3.0-rc3            3.3.0-rc3
                             rc3-vanilla       nobarrier-v2r1
    Clients 1 UserTime          0.07 (  0.00%)       0.08 (-14.19%)
    Clients 2 UserTime          0.07 (  0.00%)       0.07 (  2.72%)
    Clients 4 UserTime          0.08 (  0.00%)       0.07 (  3.29%)
    Clients 1 SysTime           0.70 (  0.00%)       0.65 (  6.65%)
    Clients 2 SysTime           0.85 (  0.00%)       0.82 (  3.65%)
    Clients 4 SysTime           1.41 (  0.00%)       1.41 (  0.32%)
    Clients 1 WallTime          0.77 (  0.00%)       0.74 (  4.19%)
    Clients 2 WallTime          0.47 (  0.00%)       0.45 (  3.73%)
    Clients 4 WallTime          0.38 (  0.00%)       0.37 (  1.58%)
    Clients 1 Flt/sec/cpu  497620.28 (  0.00%)  520294.53 (  4.56%)
    Clients 2 Flt/sec/cpu  414639.05 (  0.00%)  429882.01 (  3.68%)
    Clients 4 Flt/sec/cpu  257959.16 (  0.00%)  258761.48 (  0.31%)
    Clients 1 Flt/sec      495161.39 (  0.00%)  517292.87 (  4.47%)
    Clients 2 Flt/sec      820325.95 (  0.00%)  850289.77 (  3.65%)
    Clients 4 Flt/sec     1020068.93 (  0.00%) 1022674.06 (  0.26%)

    MMTests Statistics: duration
    Sys Time Running Test (seconds)           135.68    132.17
    User+Sys Time Running Test (seconds)      164.2     160.13
    Total Elapsed Time (seconds)              123.46    120.87

    The overall improvement is small but the System CPU time is much
    improved and roughly in correlation to what oprofile reported (these
    performance figures are without profiling so skew is expected). The
    actual number of page faults is noticeably improved.

    For benchmarks like kernel builds, the overall benefit is marginal but
    the system CPU time is slightly reduced.

    To test the actual bug the commit fixed I opened two terminals. The
    first ran within a cpuset and continually ran a small program that
    faulted 100M of anonymous data. In a second window, the nodemask of the
    cpuset was continually randomised in a loop.

    Without the commit, the program would fail every so often (usually
    within 10 seconds) and obviously with the commit everything worked fine.
    With this patch applied, it also worked fine so the fix should be
    functionally equivalent.

    Signed-off-by: Mel Gorman
    Cc: Miao Xie
    Cc: David Rientjes
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
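
    A simplified, single-file user-space model of the read-side retry
    pattern described above. All names are invented for this sketch, the
    protected data is a single word rather than a nodemask, and default
    (seq_cst) atomics are used for simplicity where the kernel gets away
    with lighter read barriers:

    #include <stdatomic.h>
    #include <stdio.h>

    static _Atomic unsigned int mems_seq;            /* even: stable, odd: update in flight */
    static _Atomic unsigned long mems_allowed = 0x3; /* stand-in for the nodemask */

    static unsigned int read_mems_begin(void)
    {
        unsigned int seq;
        while ((seq = atomic_load(&mems_seq)) & 1)
            ;                                   /* writer active: wait for even */
        return seq;
    }

    static int read_mems_retry(unsigned int seq)
    {
        return atomic_load(&mems_seq) != seq;   /* raced with an update: retry */
    }

    static void update_mems_allowed(unsigned long new_mask)
    {
        atomic_fetch_add(&mems_seq, 1);         /* odd: parallel readers stall briefly */
        atomic_store(&mems_allowed, new_mask);
        atomic_fetch_add(&mems_seq, 1);         /* even again: readers proceed */
    }

    int main(void)
    {
        unsigned int seq;
        unsigned long snapshot;

        do {                  /* allocator fast path: read, then retry on change */
            seq = read_mems_begin();
            snapshot = atomic_load(&mems_allowed);
        } while (read_mems_retry(seq));

        printf("allowed nodes: %#lx\n", snapshot);
        update_mems_allowed(0x1);
        return 0;
    }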
     
  • s/noticable/noticeable/

    Signed-off-by: Copot Alexandru
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Copot Alexandru
     
  • When shrinking the inactive lru list, isolated pages are queued on a
    locally private list, so the lock-hold time can be reduced by counting
    pages without lock protection.

    To achieve that, updating the reclaim stat is first delayed until the
    putback stage, after reacquiring the lru lock.

    Secondly, operations related to vm and zone stats are now protected
    with preemption disabled, as they are per-cpu operations.

    Signed-off-by: Hillf Danton
    Acked-by: Hugh Dickins
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
  • Stuart Foster reported on bugzilla that copying large amounts of data
    from NTFS caused an OOM kill on 32-bit x86 with 16G of memory. Andrew
    Morton correctly identified that the problem was that NTFS was using
    512-byte blocks, meaning each page had 8 buffer_heads in low memory
    pinning it.

    In the past, direct reclaim used to scan highmem even if the allocating
    process did not specify __GFP_HIGHMEM, but not any more. kswapd will
    also no longer reclaim from zones that are above the high watermark.
    The intention in both cases was to minimise unnecessary reclaim. The
    downside, on machines with large amounts of highmem, is that lowmem can
    be fully consumed by buffer_heads with nothing trying to free them.

    The following patch is based on a suggestion by Andrew Morton to extend
    the buffer_heads_over_limit case to force kswapd and direct reclaim to
    scan the highmem zone regardless of the allocation request or
    watermarks; a sketch of the resulting decision follows this entry.

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=42578

    [hughd@google.com: move buffer_heads_over_limit check up]
    [akpm@linux-foundation.org: buffer_heads_over_limit is unlikely]
    Reported-by: Stuart Foster
    Tested-by: Stuart Foster
    Signed-off-by: Mel Gorman
    Signed-off-by: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Christoph Lameter
    Cc: stable
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
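
    A sketch of the resulting zone-scan decision, with invented types and a
    crude watermark check; it only illustrates the shape of the logic
    described above, not the exact kernel code:

    #include <stdbool.h>

    struct zone_model {
        bool is_highmem;
        long free_pages;
        long high_wmark;
    };

    /* Set elsewhere when the system-wide buffer_head count passes its limit. */
    static bool buffer_heads_over_limit;

    static bool should_scan_zone(const struct zone_model *z, bool gfp_highmem)
    {
        /* Force highmem scanning so pages whose buffer_heads pin lowmem
         * can be reclaimed and their buffer_heads freed. */
        if (buffer_heads_over_limit && z->is_highmem)
            return true;

        /* Otherwise highmem is skipped for lowmem-only allocation requests. */
        if (z->is_highmem && !gfp_highmem)
            return false;

        return z->free_pages < z->high_wmark;
    }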
     
  • Currently a failed order-9 (transparent hugepage) compaction can lead
    to memory compaction being temporarily disabled for a memory zone, even
    if we only need compaction for an order-2 allocation, e.g. for jumbo
    frame networking.

    The fix is relatively straightforward: keep track of the highest order
    at which compaction is succeeding, and only defer compaction for orders
    at which compaction is failing (sketched after this entry).

    Signed-off-by: Rik van Riel
    Cc: Andrea Arcangeli
    Acked-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
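
    A sketch of the bookkeeping described above: remember the lowest order
    that is currently failing, and only defer compaction at or above it.
    Field and function names are illustrative, not the exact kernel
    interface:

    struct zone_compact_model {
        unsigned int compact_considered;
        unsigned int compact_defer_shift;
        int          compact_order_failed;    /* orders >= this are deferred */
    };

    /* Called when compaction succeeds at 'order'. */
    static void compaction_defer_reset(struct zone_compact_model *z, int order)
    {
        z->compact_considered = 0;
        z->compact_defer_shift = 0;
        if (order >= z->compact_order_failed)
            z->compact_order_failed = order + 1;  /* stop deferring up to here */
    }

    /* Called when compaction fails at 'order'. */
    static void defer_compaction(struct zone_compact_model *z, int order)
    {
        z->compact_considered = 0;
        if (z->compact_defer_shift < 6)
            z->compact_defer_shift++;
        if (order < z->compact_order_failed)
            z->compact_order_failed = order;      /* defer this order and above */
    }

    /* Should a compaction attempt at 'order' be skipped for now? */
    static int compaction_deferred(struct zone_compact_model *z, int order)
    {
        if (order < z->compact_order_failed)
            return 0;                             /* lower orders still allowed */
        return ++z->compact_considered <= (1u << z->compact_defer_shift);
    }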
     
  • With CONFIG_COMPACTION enabled, kswapd does not try to free contiguous
    free pages, even when it is woken for a higher order request.

    This could be bad for eg. jumbo frame network allocations, which are done
    from interrupt context and cannot compact memory themselves. Higher than
    before allocation failure rates in the network receive path have been
    observed in kernels with compaction enabled.

    Teach kswapd to defragment the memory zones in a node, but only if
    required and compaction is not deferred in a zone.

    [akpm@linux-foundation.org: reduce scope of zones_need_compaction]
    Signed-off-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • When built with CONFIG_COMPACTION, kswapd should not try to free
    contiguous pages, because it is not trying hard enough to have a real
    chance at being successful, but still disrupts the LRU enough to break
    other things.

    Do not do higher order page isolation unless we really are in lumpy
    reclaim mode.

    Stop reclaiming pages once we have enough free pages that compaction
    can deal with things, and we hit the normal order-0 watermarks used by
    kswapd (a simplified model of this check follows this entry).

    Also remove a line of code that increments "balanced" right before
    exiting the function.

    Signed-off-by: Rik van Riel
    Cc: Andrea Arcangeli
    Acked-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
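
    A simplified model of the stopping condition described above: once
    compaction is ready to take over, kswapd only needs the zone balanced
    at order 0. The watermark test here is crude and purely illustrative:

    #include <stdbool.h>

    static bool zone_watermark_ok_model(long free, long wmark, int order)
    {
        /* Crude stand-in: the real check also walks per-order free areas. */
        return free > wmark + (1L << order);
    }

    static bool kswapd_zone_balanced(long free, long high_wmark, int order,
                                     bool compaction_ready)
    {
        /* With CONFIG_COMPACTION, don't force kswapd to reclaim at high
         * order: if compaction can produce the higher-order pages, an
         * order-0 balanced zone is enough to stop reclaiming. */
        int testorder = (order && compaction_ready) ? 0 : order;

        return zone_watermark_ok_model(free, high_wmark, testorder);
    }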
     
  • The value of nr_reclaimed is the number of pages reclaimed in the
    current round of the loop, whereas nr_to_reclaim should be compared
    with the number of pages reclaimed across all rounds.

    In each round of the loop, the pages reclaimed are subtracted from the
    reclaim goal, and the loop stops once the goal is achieved (see the
    sketch after this entry).

    Signed-off-by: Hillf Danton
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
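
    A sketch of the loop after the fix described above: the goal is
    compared with the cumulative total, not with one round's count.
    reclaim_one_round() is a hypothetical stand-in for a single pass over
    the lru lists:

    static unsigned long reclaim_one_round(void)
    {
        /* hypothetical: one pass over the lru lists, returns pages freed */
        return 0;
    }

    static unsigned long shrink_until_goal(unsigned long nr_to_reclaim)
    {
        unsigned long nr_reclaimed = 0;

        for (;;) {
            unsigned long this_round = reclaim_one_round();

            nr_reclaimed += this_round;
            if (nr_reclaimed >= nr_to_reclaim)  /* goal met across all rounds */
                break;
            if (!this_round)                    /* no progress: give up */
                break;
        }
        return nr_reclaimed;
    }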
     
  • With many uses of reclaim_mode (defined as a field of struct
    scan_control) already in the file, it is clearer to rename the local
    reclaim_mode used when setting up the isolation mode.

    Signed-off-by: Hillf Danton
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: KOSAKI Motohiro
    Reviewed-by: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     

24 Jan, 2012

2 commits

  • Commit cc39c6a9bbde ("mm: account skipped entries to avoid looping in
    find_get_pages") correctly fixed an infinite loop; but left a problem
    that find_get_pages() on shmem would return 0 (appearing to callers to
    mean end of tree) when it meets a run of nr_pages swap entries.

    The only uses of find_get_pages() on shmem are via pagevec_lookup(),
    called from invalidate_mapping_pages(), and from shmctl SHM_UNLOCK's
    scan_mapping_unevictable_pages(). The first is already commented, and
    not worth worrying about; but the second can leave pages on the
    Unevictable list after an unusual sequence of swapping and locking.

    Fix that by using shmem_find_get_pages_and_swap() (then ignoring the
    swap) instead of pagevec_lookup().

    But I don't want to contaminate vmscan.c with shmem internals, nor
    shmem.c with LRU locking. So move scan_mapping_unevictable_pages() into
    shmem.c, renaming it shmem_unlock_mapping(); and rename
    check_move_unevictable_page() to check_move_unevictable_pages(), looping
    down an array of pages, oftentimes under the same lock.

    Leave out the "rotate unevictable list" block: that's a leftover from
    when this was used for /proc/sys/vm/scan_unevictable_pages, whose flawed
    handling involved looking at pages at tail of LRU.

    Was there significance to the sequence first ClearPageUnevictable, then
    test page_evictable, then SetPageUnevictable here? I think not, we're
    under LRU lock, and have no barriers between those.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Eric Dumazet
    Cc: Johannes Weiner
    Cc: Michel Lespinasse
    Cc: [back to 3.1 but will need respins]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • scan_mapping_unevictable_pages() is used to make SysV SHM_LOCKed pages
    evictable again once the shared memory is unlocked. It does this with
    pagevec_lookup()s across the whole object (which might occupy most of
    memory), and takes 300ms to unlock 7GB here. A cond_resched() every
    PAGEVEC_SIZE pages would be good.

    However, KOSAKI-san points out that this is called under shmem.c's
    info->lock, and it's also under shm.c's shm_lock(), both spinlocks.
    There is no strong reason for that: we need to take these pages off the
    unevictable list soonish, but those locks are not required for it.

    So move the call to scan_mapping_unevictable_pages() from shmem.c's
    unlock handling up to shm.c's unlock handling. Remove the recently
    added barrier, not needed now we have spin_unlock() before the scan.

    Use get_file(), with subsequent fput(), to make sure we have a reference
    to mapping throughout scan_mapping_unevictable_pages(): that's something
    that was previously guaranteed by the shm_lock().

    Remove shmctl's lru_add_drain_all(): we don't fault in pages at SHM_LOCK
    time, and we lazily discover them to be Unevictable later, so it serves
    no purpose for SHM_LOCK; and serves no purpose for SHM_UNLOCK, since
    pages still on pagevec are not marked Unevictable.

    The original code avoided redundant rescans by checking VM_LOCKED flag
    at its level: now avoid them by checking shp's SHM_LOCKED.

    The original code called scan_mapping_unevictable_pages() on a locked
    area at shm_destroy() time: perhaps we once had accounting cross-checks
    which required that, but not now, so skip the overhead and just let
    inode eviction deal with them.

    Put check_move_unevictable_page() and scan_mapping_unevictable_pages()
    under CONFIG_SHMEM (with stub for the TINY case when ramfs is used),
    more as comment than to save space; comment them used for SHM_UNLOCK.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Eric Dumazet
    Cc: Johannes Weiner
    Cc: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

13 Jan, 2012

17 commits

  • There is sometimes confusion between the global putback_lru_pages() in
    migrate.c and the static putback_lru_pages() in vmscan.c: rename the
    latter putback_inactive_pages(): it helps shrink_inactive_list() rather as
    move_active_pages_to_lru() helps shrink_active_list().

    Remove unused scan_control arg from putback_inactive_pages() and from
    update_isolated_counts(). Move clear_active_flags() inside
    update_isolated_counts(). Move NR_ISOLATED accounting up into
    shrink_inactive_list() itself, so the balance is clearer.

    Do the spin_lock_irq() before calling putback_inactive_pages() and
    spin_unlock_irq() after return from it, so that it better matches
    update_isolated_counts() and move_active_pages_to_lru().

    Signed-off-by: Hugh Dickins
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The isolate_pages() level in vmscan.c offers little but indirection: merge
    it into isolate_lru_pages() as the compiler does, and use the names
    nr_to_scan and nr_scanned in each case.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Mostly we use "enum lru_list lru": change those few "l"s to "lru"s.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Replace pagevecs in putback_lru_pages() and move_active_pages_to_lru()
    by lists of pages_to_free: then apply Konstantin Khlebnikov's
    free_hot_cold_page_list() to them instead of pagevec_release().

    Which simplifies the flow (no need to drop and retake lock whenever
    pagevec fills up) and reduces stale addresses in stack backtraces
    (which often showed through the pagevecs); but more importantly,
    removes another 120 bytes from the deepest stacks in page reclaim.
    Although I've not recently seen an actual stack overflow here with
    a vanilla kernel, move_active_pages_to_lru() has often featured in
    deep backtraces.

    However, free_hot_cold_page_list() does not handle compound pages
    (nor need it: a Transparent HugePage would have been split by the
    time it reaches the call in shrink_page_list()), but it is possible
    for putback_lru_pages() or move_active_pages_to_lru() to be left
    holding the last reference on a THP, so must exclude the unlikely
    compound case before putting on pages_to_free.

    Remove pagevec_strip(), its work now done in move_active_pages_to_lru().
    The pagevec in scan_mapping_unevictable_pages() remains in mm/vmscan.c,
    but that is never on the reclaim path, and cannot be replaced by a list.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Konstantin Khlebnikov
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • If compaction can proceed for a given zone, shrink_zones() does not
    reclaim any more pages from it. After commit [e0c2327: vmscan: abort
    reclaim/compaction if compaction can proceed], do_try_to_free_pages()
    tries to finish as soon as possible once one zone can compact.

    This was intended to prevent slabs being shrunk unnecessarily but there
    are side-effects. One is that a small zone that is ready for compaction
    will abort reclaim even if the chances of successfully allocating a THP
    from that zone is small. It also means that reclaim can return too early
    even though sc->nr_to_reclaim pages were not reclaimed.

    This partially reverts the commit until it is proven that slabs are really
    being shrunk unnecessarily but preserves the check to return 1 to avoid
    OOM if reclaim was aborted prematurely.

    [aarcange@redhat.com: This patch replaces a revert from Andrea]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • In commit e0887c19 ("vmscan: limit direct reclaim for higher order
    allocations"), Rik noted that reclaim was too aggressive when THP was
    enabled. In his initial patch he used the number of free pages to decide
    if reclaim should abort for compaction. My feedback was that reclaim and
    compaction should be using the same logic when deciding if reclaim should
    be aborted.

    Unfortunately, this had the effect of reducing THP success rates when the
    workload included something like streaming reads that continually
    allocated pages. The window during which compaction could run and return
    a THP was too small.

    This patch combines Rik's two patches together. compaction_suitable()
    is still used to decide if reclaim should be aborted to allow
    compaction. However, it will also ensure that there is a reasonable
    buffer of free pages available. This improves the THP allocation
    success rates but bounds the number of pages that are freed for
    compaction (a simplified sketch of the check follows this entry).

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
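
    A simplified sketch of the combined check: compaction must consider the
    zone suitable, and a buffer of free pages, roughly what compaction
    itself will consume, must already exist. The formula and names are
    illustrative, not the exact kernel calculation:

    #include <stdbool.h>

    static bool reclaim_can_abort_for_compaction(long free_pages, long low_wmark,
                                                 int order, bool suitable)
    {
        /* Keep reclaiming until there is headroom of about (2 << order)
         * pages above the low watermark for compaction to work with. */
        long needed = low_wmark + (2L << order);

        return suitable && free_pages >= needed;
    }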
     
  • Commit 39deaf85 ("mm: compaction: make isolate_lru_page()
    filter-aware") noted that compaction does not migrate dirty or
    writeback pages, and that it was meaningless to pick such a page and
    re-add it to the LRU list. This had to be partially reverted because
    some dirty pages can be migrated by compaction without blocking.

    This patch updates "mm: compaction: make isolate_lru_page()
    filter-aware" to skip over pages that migration has no possibility of
    migrating, minimising LRU disruption (a sketch of the filter follows
    this entry).

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Reviewed-by: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
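
    A sketch of the filter, with invented flags and page fields: when the
    caller cannot block (asynchronous compaction), skip pages whose
    migration would have to wait on writeback or issue writeback itself:

    #include <stdbool.h>

    enum isolate_flags {
        ISOLATE_CLEAN         = 1u << 0,  /* caller wants only clean pages    */
        ISOLATE_ASYNC_MIGRATE = 1u << 1,  /* caller cannot block on migration */
    };

    struct page_model {
        bool dirty;
        bool writeback;
        bool mapping_can_migrate_dirty;   /* e.g. has a migratepage method */
    };

    static bool can_isolate(const struct page_model *p, unsigned int mode)
    {
        if ((mode & (ISOLATE_CLEAN | ISOLATE_ASYNC_MIGRATE)) && p->writeback)
            return false;                 /* under writeback: always skip */

        if ((mode & ISOLATE_CLEAN) && p->dirty)
            return false;                 /* clean-only caller: skip dirty */

        if ((mode & ISOLATE_ASYNC_MIGRATE) && p->dirty &&
            !p->mapping_can_migrate_dirty)
            return false;                 /* migration would need writeback */

        return true;
    }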
     
  • During direct reclaim it is possible that reclaim will be aborted so that
    compaction can be attempted to satisfy a high-order allocation. If this
    decision is made before any pages are reclaimed, it is possible that 0 is
    returned to the page allocator potentially triggering an OOM. This has
    not been observed but it is a possibility so this patch addresses it.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Properly take into account if we isolated a compound page during the lumpy
    scan in reclaim and skip over the tail pages when encountered. This
    corrects the values given to the tracepoint for number of lumpy pages
    isolated and will avoid breaking the loop early if compound pages smaller
    than the requested allocation size are requested.

    [mgorman@suse.de: Updated changelog]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • In trace_mm_vmscan_lru_isolate(), we don't output 'file' information
    to the trace event, which makes it a bit inconvenient for the user to
    get the real information, like that pasted below:

    mm_vmscan_lru_isolate: isolate_mode=2 order=0 nr_requested=32
    nr_scanned=32 nr_taken=32 contig_taken=0 contig_dirty=0 contig_failed=0

    'active' can be obtained by analyzing mode (thanks go to Minchan and
    Mel), so this patch adds 'file' to the trace event, and it now looks
    like:

    mm_vmscan_lru_isolate: isolate_mode=2 order=0 nr_requested=32
    nr_scanned=32 nr_taken=32 contig_taken=0 contig_dirty=0 contig_failed=0
    file=0

    Signed-off-by: Tao Ma
    Acked-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Reviewed-by: Minchan Kim
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tao Ma
     
  • Signed-off-by: Johannes Weiner
    Acked-by: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Balbir Singh
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Now that all code that operated on global per-zone LRU lists is
    converted to operate on per-memory cgroup LRU lists instead, there is no
    reason to keep the double-LRU scheme around any longer.

    The pc->lru member is removed and page->lru is linked directly to the
    per-memory cgroup LRU lists, which removes two pointers from a
    descriptor that exists for every page frame in the system.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Hugh Dickins
    Signed-off-by: Ying Han
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Having a unified structure with an LRU list set for both global zones
    and per-memcg zones makes it possible to keep the code that deals with
    LRU lists simple, without caring about the container itself.

    Once the per-memcg LRU lists directly link struct pages, the isolation
    function and all other list manipulations are shared between the memcg
    case and the global LRU case.

    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The global per-zone LRU lists are about to go away on memcg-enabled
    kernels, so global reclaim must be able to find its pages on the
    per-memcg LRU lists.

    Since the LRU pages of a zone are distributed over all existing memory
    cgroups, a scan target for a zone is complete when all memory cgroups
    are scanned for their proportional share of a zone's memory.

    The forced scanning of small scan targets from kswapd is limited to
    zones marked unreclaimable, otherwise kswapd can quickly overreclaim by
    force-scanning the LRU lists of multiple memory cgroups.

    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memory cgroup limit reclaim and traditional global pressure reclaim will
    soon share the same code to reclaim from a hierarchical tree of memory
    cgroups.

    In preparation of this, move the two right next to each other in
    shrink_zone().

    The mem_cgroup_hierarchical_reclaim() polymath is split into a soft
    limit reclaim function, which still does hierarchy walking on its own,
    and a limit (shrinking) reclaim function, which relies on generic
    reclaim code to walk the hierarchy.

    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memory cgroup hierarchies are currently handled completely outside of
    the traditional reclaim code, which is invoked with a single memory
    cgroup as an argument for the whole call stack.

    Subsequent patches will switch this code to do hierarchical reclaim, so
    there needs to be a distinction between a) the memory cgroup that is
    triggering reclaim due to hitting its limit and b) the memory cgroup
    that is being scanned as a child of a).

    This patch introduces a struct mem_cgroup_zone that contains the
    combination of the memory cgroup and the zone being scanned, which is
    then passed down the stack instead of the zone argument (sketched after
    this entry).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
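
    Sketched, the pairing looks roughly like this (types left opaque,
    layout illustrative only):

    struct mem_cgroup;   /* opaque here */
    struct zone;         /* opaque here */

    struct mem_cgroup_zone {
        struct mem_cgroup *mem_cgroup;   /* the memcg being scanned */
        struct zone       *zone;         /* the zone being scanned  */
    };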
     
  • The traditional zone reclaim code is scanning the per-zone LRU lists
    during direct reclaim and kswapd, and the per-zone per-memory cgroup LRU
    lists when reclaiming on behalf of a memory cgroup limit.

    Subsequent patches will convert the traditional reclaim code to reclaim
    exclusively from the per-memory cgroup LRU lists. As a result, using
    the predicate for which LRU list is scanned will no longer be
    appropriate to tell global reclaim from limit reclaim.

    This patch adds a global_reclaim() predicate (sketched after this
    entry) to tell direct/kswapd reclaim from memory cgroup limit reclaim,
    and substitutes it in all places where scanning_global_lru() is
    currently used for that.

    Signed-off-by: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Reviewed-by: Kirill A. Shutemov
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
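
    A sketch of the predicate, assuming a scan_control that records the
    memcg whose limit triggered reclaim (NULL for direct/kswapd reclaim);
    names are illustrative:

    #include <stdbool.h>
    #include <stddef.h>

    struct mem_cgroup;                        /* opaque here */

    struct scan_control_model {
        struct mem_cgroup *target_mem_cgroup; /* NULL: direct/kswapd reclaim */
    };

    static bool global_reclaim(const struct scan_control_model *sc)
    {
        return sc->target_mem_cgroup == NULL;
    }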
     

11 Jan, 2012

7 commits

  • It is not the tag page but the cursor page that we should process, and
    it looks like a typo.

    Signed-off-by: Hillf Danton
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Hugh Dickins
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
  • Lumpy reclaim does well to stop at a PageAnon when there's no swap, but
    better is to stop at any PageSwapBacked, which includes shmem/tmpfs too.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • It's pointless to continue reclaiming when we have no swap space and
    lots of anon pages on the inactive list.

    Without this patch, when swap is disabled it is possible to keep trying
    to reclaim even though there are only anonymous pages in the system and
    no progress can be made (see the sketch after this entry).

    Signed-off-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
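
    A sketch of the underlying check: with no swap space, anonymous pages
    cannot be reclaimed, so only file pages count toward "reclaimable".
    Names are illustrative:

    static int nothing_left_to_reclaim(unsigned long nr_file_lru,
                                       unsigned long nr_anon_lru,
                                       unsigned long nr_swap_pages)
    {
        unsigned long reclaimable = nr_file_lru;

        if (nr_swap_pages > 0)
            reclaimable += nr_anon_lru;   /* anon only counts if it can swap */

        return reclaimable == 0;
    }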
     
  • If we need to know a use case, the caller's program name is critically
    important. Show it.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Johannes Weiner
    David Rientjes
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • This patch adds a helper, free_hot_cold_page_list(), to free a list of
    0-order pages. It frees pages directly from the list, without a
    temporary pagevec, and calls trace_mm_pagevec_free() to simulate
    pagevec_free() behaviour (a user-space model follows this entry).

    bloat-o-meter:

    add/remove: 1/1 grow/shrink: 1/3 up/down: 267/-295 (-28)
    function                      old      new    delta
    free_hot_cold_page_list         -      264     +264
    get_page_from_freelist       2129     2132       +3
    __pagevec_free                243      239       -4
    split_free_page               380      373       -7
    release_pages                 606      510      -96
    free_page_list                188        -     -188

    Signed-off-by: Konstantin Khlebnikov
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Acked-by: Minchan Kim
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
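
    A minimal user-space model of the helper's shape: walk the list and
    free each 0-order page directly, with no intermediate pagevec. The list
    type and free routine here are stand-ins, not kernel code:

    #include <stdlib.h>

    struct page_model {
        struct page_model *next;
        /* ... page contents ... */
    };

    static void free_hot_cold_page_list_model(struct page_model *list, int cold)
    {
        (void)cold;                       /* hot/cold hint ignored in this model */

        while (list) {
            struct page_model *page = list;

            list = list->next;
            free(page);                   /* stands in for free_hot_cold_page() */
        }
    }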
     
  • Logic added in commit 8cab4754d24a0 ("vmscan: make mapped executable
    pages the first class citizen") was noticeably weakened in commit
    645747462435d84 ("vmscan: detect mapped file pages used only once").

    Currently these pages can become "first class citizens" only after a
    second usage. After this patch, page_check_references() will activate
    them after the first usage, and executable code gets an even better
    chance to stay in memory.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Pekka Enberg
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Wu Fengguang
    Cc: Johannes Weiner
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Shaohua Li
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Commit 645747462435 ("vmscan: detect mapped file pages used only once")
    greatly decreases the lifetime of singly-used mapped file pages.
    Unfortunately it also decreases the lifetime of all shared mapped file
    pages, because since commit bf3f3bc5e7347 ("mm: don't
    mark_page_accessed in fault path") the page-fault handler does not mark
    the page active or even referenced.

    Thus page_check_references() activates a file page only if it was used
    twice while on the inactive list, whereas it activates anon pages after
    the first access. The inactive list can be small enough that the
    reclaimer accidentally throws away a widely used page if it wasn't used
    twice in a short period.

    After this patch, page_check_references() also activates a file-mapped
    page at the first inactive-list scan if the page is already mapped
    multiple times via several ptes (see the sketch after this entry).

    I found this while trying to fix a degradation in rhel6 (~2.6.32)
    relative to rhel5 (~2.6.18). There is a complete mess with >100
    web/mail/spam/ftp containers: they share all their files, but there are
    a lot of anonymous pages: ~500mb of shared file-mapped memory and
    15-20Gb of non-shared anonymous memory. In this situation major
    pagefaults are very costly, because all containers share the same page.
    Under my load the kernel created disproportionate pressure on the file
    memory compared with the anonymous; they equalled only if I raised
    swappiness up to 150 =)

    These patches actually didn't help a lot with my problem, but I saw a
    noticeable (10-20 times) reduction in the count and average time of
    major pagefaults in file-mapped areas.

    Actually both patches are fixes for commit v2.6.33-5448-g6457474,
    because it was aimed at one scenario (singly-used pages) but breaks the
    logic in other scenarios (shared and/or executable pages).

    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Pekka Enberg
    Acked-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Wu Fengguang
    Cc: Johannes Weiner
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Shaohua Li
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
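
    A sketch combining the decisions described in the two entries above: on
    its first inactive-list scan a mapped file page is activated if it was
    already referenced, is mapped by several ptes, or backs executable
    code. Enum and field names are illustrative, not the exact kernel
    definitions:

    enum page_references {
        PAGEREF_RECLAIM,
        PAGEREF_KEEP,
        PAGEREF_ACTIVATE,
    };

    struct page_refs_model {
        int referenced_ptes;   /* ptes that referenced the page this scan   */
        int referenced_page;   /* PG_referenced left over from a prior scan */
        int vm_exec;           /* mapped into at least one VM_EXEC vma      */
    };

    static enum page_references
    page_check_references_model(const struct page_refs_model *p)
    {
        if (!p->referenced_ptes)
            return PAGEREF_RECLAIM;

        if (p->referenced_page ||      /* second use while on the inactive list */
            p->referenced_ptes > 1 ||  /* shared: mapped via several ptes */
            p->vm_exec)                /* executable mapping */
            return PAGEREF_ACTIVATE;

        return PAGEREF_KEEP;           /* give it another trip round the list */
    }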
     

07 Jan, 2012

1 commit

  • This resolves the conflict in the arch/arm/mach-s3c64xx/s3c6400.c file,
    and it fixes the build error in the arch/x86/kernel/microcode_core.c
    file, that the merge did not catch.

    The microcode_core.c patch was provided by Stephen Rothwell
    who was invaluable in the merge issues involved
    with the large sysdev removal process in the driver-core tree.

    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

22 Dec, 2011

1 commit

  • This moves the 'memory sysdev_class' over to a regular 'memory' subsystem
    and converts the devices to regular devices. The sysdev drivers are
    implemented as subsystem interfaces now.

    After all sysdev classes are ported to regular driver core entities, the
    sysdev implementation will be entirely removed from the kernel.

    Signed-off-by: Kay Sievers
    Signed-off-by: Greg Kroah-Hartman

    Kay Sievers