25 May, 2010

6 commits

  • For now we have only global isolation versus memory control group
    isolation; do not allow the reclaim entry function to set an arbitrary
    page isolation callback, as we do not need that flexibility.

    And since we already pass around the group descriptor for the memory
    control group isolation case, just use it to decide which one of the two
    isolator functions to use.

    The decisions can be merged into nearby branches, so no extra cost there.
    In fact, we save the indirect calls.
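
    A minimal standalone sketch of the idea, with illustrative names rather
    than the actual kernel interfaces: instead of letting callers install an
    arbitrary isolation callback in the scan control, branch on the presence
    of the memory control group descriptor and call one of the two known
    isolators directly.

      struct mem_cgroup;                        /* opaque group descriptor */
      struct list_head;                         /* destination page list */

      static unsigned long isolate_pages_global(unsigned long nr,
                                                struct list_head *dst)
      {
              return nr;                        /* stands in for the global isolator */
      }

      static unsigned long isolate_pages_memcg(unsigned long nr,
                                               struct list_head *dst,
                                               struct mem_cgroup *memcg)
      {
              return nr;                        /* stands in for the memcg isolator */
      }

      static unsigned long isolate_pages(unsigned long nr, struct list_head *dst,
                                         struct mem_cgroup *memcg)
      {
              /* one predictable branch instead of an indirect call */
              if (memcg)
                      return isolate_pages_memcg(nr, dst, memcg);
              return isolate_pages_global(nr, dst);
      }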

    Signed-off-by: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • This scan control is abused to communicate a return value from
    shrink_zones(). Write this idiomatically and remove the knob.
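
    A tiny sketch of what "idiomatically" means here (illustrative, not the
    actual diff): the result is returned from the function instead of being
    smuggled out through a field of the scan control structure.

      #include <stdbool.h>

      /* before: a knob in the scan control carried the answer back */
      struct scan_control { bool all_unreclaimable; };

      /* after: the answer is simply the return value */
      static bool shrink_zones(void)
      {
              bool all_unreclaimable = true;
              /* ... scan zones; clear the flag when any zone makes progress ... */
              return all_unreclaimable;
      }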

    Signed-off-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When vmscan is in lumpy reclaim mode, it has to ignore the referenced bit
    in order to create contiguous free pages, but the current
    page_check_references() doesn't.

    Fix it.
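
    A standalone sketch of the intended behaviour (names are illustrative,
    not the exact kernel code): when reclaim runs in lumpy mode to build
    contiguous free ranges, the reference check must not rescue a page merely
    because it was recently referenced.

      enum page_references { PAGEREF_RECLAIM, PAGEREF_KEEP, PAGEREF_ACTIVATE };

      struct scan_control { int lumpy_reclaim_mode; };

      static enum page_references
      page_check_references(const struct scan_control *sc, int referenced)
      {
              /* lumpy reclaim: ignore the referenced bit entirely */
              if (sc->lumpy_reclaim_mode)
                      return PAGEREF_RECLAIM;

              return referenced ? PAGEREF_ACTIVATE : PAGEREF_RECLAIM;
      }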

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • get_scan_ratio() calculates a percentage, and if that percentage is < 1%
    it rounds it down to 0%, causing us to completely skip scanning anon/file
    pages for reclaim even when the total number of anon/file pages is very
    large.

    To avoid the underflow, we no longer use a percentage; instead we directly
    calculate how many pages should be scanned. This way we still get several
    pages scanned for ratios below 1%.

    This has some benefits:

    1. It increases our calculation precision.

    2. It makes scanning smoother. Without this, if percent[x] underflows,
    shrink_zone() doesn't scan any pages and then suddenly scans all pages
    once priority reaches zero. With this, shrink_zone() gets a chance to
    scan some pages even when priority isn't zero.

    Note, this patch doesn't really change the logic, it just increases
    precision. For a system with a lot of memory this might slightly change
    behavior. For example, in a sequential file read workload we don't swap
    any anon pages without the patch; with it, if the anon memory size is
    bigger than 16G, we will see one anon page swapped. The 16G is calculated
    as PAGE_SIZE * priority(4096) * (fp/ap), where fp/ap is assumed to be
    1024, which is common in this workload. So the impact does not sound like
    a big deal (see the quick check below).
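
    A quick standalone check of the 16G figure (not kernel code; 4K pages and
    the quoted values are assumed):

      #include <stdio.h>

      int main(void)
      {
              unsigned long long page_size = 4096;  /* assuming 4K pages */
              unsigned long long priority  = 4096;  /* value quoted above */
              unsigned long long fp_per_ap = 1024;  /* fp/ap assumed for the workload */

              /* PAGE_SIZE * priority * (fp/ap) */
              unsigned long long bytes = page_size * priority * fp_per_ap;
              printf("%llu GiB\n", bytes >> 30);    /* prints 16 */
              return 0;
      }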

    Signed-off-by: Shaohua Li
    Cc: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Currently, vmscan.c defines the isolation modes for __isolate_lru_page().
    Memory compaction needs access to these modes for isolating pages for
    migration. This patch exports them.
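
    A sketch of what "exporting" amounts to (names and values are
    illustrative of the era, not quoted from the patch): the mode constants
    move from mm/vmscan.c into a shared header so the compaction code can
    pass them to __isolate_lru_page().

      /* previously private to mm/vmscan.c, now visible to compaction */
      #define ISOLATE_INACTIVE 0    /* isolate inactive pages only */
      #define ISOLATE_ACTIVE   1    /* isolate active pages only */
      #define ISOLATE_BOTH     2    /* isolate both active and inactive pages */

      struct page;
      int __isolate_lru_page(struct page *page, int mode, int file);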

    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Before applying this patch, cpuset updates task->mems_allowed and
    mempolicy by setting all new bits in the nodemask first, and clearing all
    old disallowed bits later. But along the way, the allocator may find that
    there is no node left to allocate memory from.

    The reason is that while cpuset rebinds the task's mempolicy, it clears
    the nodes which the allocator can allocate pages on, for example:

    (mpol: mempolicy)

    task1                      task1's mpol    task2
    alloc page                 1
      alloc on node0? NO       1
                               1               change mems from 1 to 0
                               1               rebind task1's mpol
                               0-1               set new bits
                               0                 clear disallowed bits
      alloc on node1? NO       0
      ...
    can't alloc page
      goto oom

    This patch fixes the problem by expanding the node range first (setting
    the newly allowed bits) and shrinking it lazily (clearing the newly
    disallowed bits). A variable tells the write-side task that a read-side
    task is reading the nodemask, and the write-side task clears the newly
    disallowed nodes only after the read-side task finishes its current
    memory allocation; see the sketch below.
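
    A conceptual standalone sketch of the grow-then-shrink ordering
    (illustrative names and C11 atomics, not the kernel's actual interfaces):
    the writer first ORs in the newly allowed nodes, waits until no allocator
    is in the middle of an allocation, and only then clears the no-longer-
    allowed nodes, so a concurrent reader never observes an empty mask.

      #include <stdatomic.h>

      static atomic_int   readers_in_alloc;      /* read side: "I am allocating" */
      static atomic_ulong mems_allowed = 2;      /* one bit per node; node 1 allowed */

      static void alloc_begin(void) { atomic_fetch_add(&readers_in_alloc, 1); }
      static void alloc_end(void)   { atomic_fetch_sub(&readers_in_alloc, 1); }

      static void rebind_mems(unsigned long new_mask)
      {
              /* step 1: expand -- old|new is never empty for concurrent readers */
              atomic_fetch_or(&mems_allowed, new_mask);

              /* step 2: shrink lazily, once current allocations have finished */
              while (atomic_load(&readers_in_alloc))
                      ;                          /* real code sleeps/retries */
              atomic_store(&mems_allowed, new_mask);
      }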

    [akpm@linux-foundation.org: fix spello]
    Signed-off-by: Miao Xie
    Cc: David Rientjes
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     

07 Apr, 2010

1 commit

  • Shaohua Li reported that his tmpfs streaming I/O test can lead to OOM.
    The test uses a 6G tmpfs on a system with 3G of memory. The tmpfs holds 6
    copies of the kernel source, and the test runs kbuild against each copy.
    His investigation shows that the test produces a lot of rotated anon
    pages and very few file pages, so get_scan_ratio calculates percent[0]
    (i.e. the scanning percentage for anon) to be zero. Actually percent[0]
    should be a big value, but our calculation rounds it down to zero.

    We had the same problem even before commit 84b18490 ("vmscan:
    get_scan_ratio() cleanup"), but the old logic could rescue the
    percent[0]==0 case when priority==0, which hid the real issue. I didn't
    think mere streaming I/O could produce a percent[0]==0 && priority==0
    situation, but I was wrong.

    So we definitely have to fix the tmpfs streaming I/O issue, but for now
    revert the regression commit first.

    This reverts commit 84b18490d1f1bc7ed5095c929f78bc002eb70f26.

    Signed-off-by: KOSAKI Motohiro
    Reported-by: Shaohua Li
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    The percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include those
    headers directly instead of assuming their availability. As this
    conversion needs to touch a large number of source files, the following
    script is used as the basis of the conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there, i.e. if only gfp is used,
    gfp.h; if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of the
    specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

07 Mar, 2010

7 commits

  • The VM currently assumes that an inactive, mapped and referenced file page
    is in use and promotes it to the active list.

    However, every mapped file page starts out like this and thus a problem
    arises when workloads create a stream of such pages that are used only for
    a short time. By flooding the active list with those pages, the VM
    quickly gets into trouble finding eligible reclaim candidates. The result
    is long allocation latencies and eviction of the wrong pages.

    This patch reuses the PG_referenced page flag (used for unmapped file
    pages) to implement a usage detection that scales with the speed of LRU
    list cycling (i.e. memory pressure).

    If the scanner encounters such a page, the flag is set and the page is
    cycled once more on the inactive list. Only if it comes back with another
    page table reference is it activated; otherwise it is reclaimed as 'not
    recently used cache'.

    This effectively changes the minimum lifetime of a used-once mapped file
    page from a full memory cycle to an inactive list cycle, which allows it
    to occur in linear streams without affecting the stable working set of the
    system.
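
    A standalone sketch of the described decision (illustrative names, not
    the kernel function): a mapped file page found on the inactive list with
    a fresh page table reference is only marked on the first encounter; it is
    activated on the second, and reclaimed if it shows up with no new
    reference at all.

      #include <stdbool.h>

      enum verdict { RECLAIM, KEEP_ON_INACTIVE, ACTIVATE };

      struct page_state {
              bool referenced_pte;    /* young bit seen in a page table */
              bool referenced_flag;   /* PG_referenced set on a previous pass */
      };

      static enum verdict check_mapped_file_page(struct page_state *p)
      {
              if (!p->referenced_pte)
                      return RECLAIM;            /* no new reference: evict */

              if (p->referenced_flag)
                      return ACTIVATE;           /* second reference: really in use */

              p->referenced_flag = true;         /* first reference: mark it ...   */
              return KEEP_ON_INACTIVE;           /* ... and cycle it once more     */
      }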

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • page_mapping_inuse() is a historic predicate function for pages that are
    about to be reclaimed or deactivated.

    According to it, a page is in use when it is mapped into page tables OR
    part of swap cache OR backing an mmapped file.

    This function is used in combination with page_referenced(), which checks
    for young bits in ptes and the page descriptor itself for the
    PG_referenced bit. Thus, checking for unmapped swap cache pages is
    meaningless as PG_referenced is not set for anonymous pages and unmapped
    pages do not have young ptes. The test makes no difference.

    Protecting file pages that are not by themselves mapped but are part of a
    mapped file is also a historic leftover for short-lived things like the
    exec() code in libc. However, the VM now does reference accounting and
    activation of pages at unmap time and thus the special treatment on
    reclaim is obsolete.

    This patch drops page_mapping_inuse() and switches the two callsites to
    use page_mapped() directly.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The used-once mapped file page detection patchset.

    It is meant to help workloads with large amounts of shortly used file
    mappings, like rtorrent hashing a file or git when dealing with loose
    objects (git gc on a bigger site?).

    Right now, the VM activates referenced mapped file pages on first
    encounter on the inactive list and it takes a full memory cycle to
    reclaim them again. When those pages dominate memory, the system
    no longer has a meaningful notion of 'working set' and is required
    to give up the active list to make reclaim progress. Obviously,
    this results in rather bad scanning latencies and the wrong pages
    being reclaimed.

    This patch makes the VM be more careful about activating mapped file
    pages in the first place. The minimum granted lifetime without
    another memory access becomes an inactive list cycle instead of the
    full memory cycle, which is more natural given the mentioned loads.

    This test resembles a hashing rtorrent process. Sequentially, 32MB
    chunks of a file are mapped into memory, hashed (sha1) and unmapped
    again. While this happens, every 5 seconds a process is launched and
    its execution time taken:

    python2.4 -c 'import pydoc'
    old: max=2.31s mean=1.26s (0.34)
    new: max=1.25s mean=0.32s (0.32)

    find /etc -type f
    old: max=2.52s mean=1.44s (0.43)
    new: max=1.92s mean=0.12s (0.17)

    vim -c ':quit'
    old: max=6.14s mean=4.03s (0.49)
    new: max=3.48s mean=2.41s (0.25)

    mplayer --help
    old: max=8.08s mean=5.74s (1.02)
    new: max=3.79s mean=1.32s (0.81)

    overall hash time (stdev):
    old: time=1192.30 (12.85) thruput=25.78mb/s (0.27)
    new: time=1060.27 (32.58) thruput=29.02mb/s (0.88) (-11%)

    I also tested kernbench with regular IO streaming in the background to
    see whether the delayed activation of frequently used mapped file
    pages had a negative impact on performance in the presence of pressure
    on the inactive list. The patch made no significant difference in
    timing, neither for kernbench nor for the streaming IO throughput.

    The first patch submission raised concerns about the cost of the extra
    faults for actually activated pages on machines that have no hardware
    support for young page table entries.

    I created an artificial worst case scenario on an ARM machine with
    around 300MHz and 64MB of memory to figure out the dimensions
    involved. The test would mmap a file of 20MB, then

    1. touch all its pages to fault them in
    2. force one full scan cycle on the inactive file LRU
    -- old: mapping pages activated
    -- new: mapping pages inactive
    3. touch the mapping pages again
    -- old and new: fault exceptions to set the young bits
    4. force another full scan cycle on the inactive file LRU
    5. touch the mapping pages one last time
    -- new: fault exceptions to set the young bits

    The test showed an overall increase of 6% in time over 100 iterations
    of the above (old: ~212sec, new: ~225sec). 13 secs total overhead /
    (100 * 5k pages), ignoring the execution time of the test itself,
    makes for about 25us overhead for every page that gets actually
    activated. Note:

    1. File mapping the size of one third of main memory, _completely_
    in active use across memory pressure - i.e., most pages referenced
    within one LRU cycle. This should be rare to non-existent,
    especially on such embedded setups.

    2. Many huge activation batches. Those batches only occur when the
    working set fluctuates. If it changes completely between every full
    LRU cycle, you have problematic reclaim overhead anyway.

    3. Access of activated pages at maximum speed: sequential loads from
    every single page without doing anything in between. In reality,
    the extra faults will get distributed between actual operations on
    the data.

    So even if a workload manages to get the VM into the situation of
    activating a third of memory in one go on such a setup, it will take
    2.2 seconds instead of 2.1 without the patch.

    Comparing the numbers (and my user-experience over several months),
    I think this change is an overall improvement to the VM.

    Patch 1 is only refactoring to break up that ugly compound conditional
    in shrink_page_list() and make it easy to document and add new checks
    in a readable fashion.

    Patch 2 gets rid of the obsolete page_mapping_inuse(). It's not
    strictly related to #3, but it was in the original submission and is a
    net simplification, so I kept it.

    Patch 3 implements used-once detection of mapped file pages.

    This patch:

    Moving the big conditional into its own predicate function makes the code
    a bit easier to read and allows for better commenting on the checks
    one-by-one.

    This is just cleaning up, no semantics should have been changed.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit e815af95 ("change all_unreclaimable zone member to flags") changed
    the all_unreclaimable member to a bit flag, but that had an undesirable
    side effect: free_one_page() is one of the hottest paths in the Linux
    kernel, and increasing the number of atomic ops in it can reduce kernel
    performance a bit.

    Thus, this patch partially reverts that commit; at the least,
    all_unreclaimable shouldn't share a memory word with the other zone
    flags.

    [akpm@linux-foundation.org: fix patch interaction]
    Signed-off-by: KOSAKI Motohiro
    Cc: David Rientjes
    Cc: Wu Fengguang
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Huang Shijie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Commit cf40bd16fd ("lockdep: annotate reclaim context") introduced reclaim
    context annotation, but it didn't annotate zone reclaim. This patch does
    it.

    The point is that commit cf40bd16fd annotated __alloc_pages_direct_reclaim,
    but zone reclaim doesn't use __alloc_pages_direct_reclaim.

    The current call graph is

      __alloc_pages_nodemask
         get_page_from_freelist
             zone_reclaim()
         __alloc_pages_slowpath
             __alloc_pages_direct_reclaim
                 try_to_free_pages

    Actually, if zone_reclaim_mode=1, the VM never calls
    __alloc_pages_direct_reclaim under usual VM pressure.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Acked-by: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • get_scan_ratio() should contain all scan-ratio related calculations.
    Thus, this patch moves some of those calculations into get_scan_ratio().

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Rik van Riel
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Kswapd checks that zone has sufficient pages free via zone_watermark_ok().

    If any zone doesn't have enough pages, we set all_zones_ok to zero.
    !all_zones_ok makes kswapd retry rather than sleep.

    I think the watermark check before shrink_zone() is pointless. Only after
    kswapd has tried to shrink the zone is the check meaningful.

    Move the check to after the call to shrink_zone().

    [akpm@linux-foundation.org: fix comment, layout]
    Signed-off-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Reviewed-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

17 Jan, 2010

1 commit

  • Commit f50de2d3 (vmscan: have kswapd sleep for a short interval and double
    check it should be asleep) can cause kswapd to enter an infinite loop if
    running on a single-CPU system. If all zones are unreclaimable,
    sleeping_prematurely() returns 1 and kswapd calls balance_pgdat() again,
    but that is totally meaningless: balance_pgdat() can't do anything about
    unreclaimable zones!

    Signed-off-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Reported-by: Will Newton
    Reviewed-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Tested-by: Will Newton
    Reviewed-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

16 Dec, 2009

11 commits

  • Simplify the code for shrink_inactive_list().

    Signed-off-by: Huang Shijie
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
     
  • In AIM7 runs, recent kernels start swapping out anonymous pages well
    before they should. This is due to shrink_list falling through to
    shrink_inactive_list if !inactive_anon_is_low(zone, sc), when all we
    really wanted to do is pre-age some anonymous pages to give them extra
    time to be referenced while on the inactive list.

    The obvious fix is to make sure that shrink_list does not fall through to
    scanning/reclaiming inactive pages when we called it to scan one of the
    active lists.

    This change should be safe because the loop in shrink_zone ensures that we
    will still shrink the anon and file inactive lists whenever we should.

    [kosaki.motohiro@jp.fujitsu.com: inactive_file_is_low() should be inactive_anon_is_low()]
    Reported-by: Larry Woodman
    Signed-off-by: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Tomasz Chmielewski
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Fix a small inconsistency between ">" and ">=".

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Now all callers of reclaim use swap_cluster_max as SWAP_CLUSTER_MAX, so we
    can remove it entirely.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • In the old days we didn't have sc.nr_to_reclaim, which led to misuse of
    sc.swap_cluster_max.

    A huge sc.swap_cluster_max brings unnecessary OOM risk and no performance
    benefit.

    Now we can stop this insanity.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • shrink_all_zones() was introduced by commit d6277db4ab (swsusp: rework
    memory shrinker) as a hibernation performance improvement, and
    sc.swap_cluster_max was introduced by commit a06fe4d307 (Speed freeing
    memory for suspend).

    Commit a06fe4d307 said:

    Without the patch:
    Freed 14600 pages in 1749 jiffies = 32.61 MB/s (Anomolous!)
    Freed 88563 pages in 14719 jiffies = 23.50 MB/s
    Freed 205734 pages in 32389 jiffies = 24.81 MB/s

    With the patch:
    Freed 68252 pages in 496 jiffies = 537.52 MB/s
    Freed 116464 pages in 569 jiffies = 798.54 MB/s
    Freed 209699 pages in 705 jiffies = 1161.89 MB/s

    At that time, the patch was well worth it. However, modern hardware
    trends and recent VM improvements have eroded its value. For several
    reasons, I think we should remove shrink_all_zones() altogether.

    In detail:

    1) In the old days, shrink_zone()'s slowness was mainly caused by stupid
    io-throttling when there was no I/O congestion, but the current
    shrink_zone() is sane and not slow.

    2) shrink_all_zones() tries to shrink all pages at once, but that doesn't
    work well on NUMA systems.
    example)
    The system has 4GB of memory, each node has 2GB, and hibernation needs 1GB.

    optimal)
    steal 500MB from each node.
    shrink_all_zones)
    steal 1GB from node-0.

    Oh, the cache balancing logic is broken. ;)
    Unfortunately, desktop systems have moved to NUMA nowadays.
    (Side note: if hibernation required 2GB, shrink_all_zones() could never
    succeed on the above machine.)

    3) If a node has several pages with I/O in flight, shrink_all_zones()
    produces pretty bad results.

    scenario) hibernation needs 1GB

    1) shrink_all_zones() tries to reclaim 1GB from node-0
    2) but it only reclaims 990MB
    3) stupidly, shrink_all_zones() then tries to reclaim 1GB from node-1
    4) it reclaims 990MB

    Oh well, it reclaimed nearly twice as much as required. Current
    shrink_zone(), on the other hand, has sane bail-out logic, so it doesn't
    do overkill reclaim, and we lose that risk of shrink_all_zones().

    4) The SplitLRU VM always maintains the active/inactive ratio very
    carefully. Shrinking only the inactive lists breaks that assumption and
    creates unnecessary OOM risk; it is obviously suboptimal.

    Now shrink_all_memory() is only a wrapper around do_try_to_free_pages(),
    which brings good reviewability and debuggability and solves the above
    problems; a sketch of the resulting shape follows below.

    Side note: unifying the reclaim logic has two good side effects.
    - It fixes a recursive-reclaim bug in shrink_all_memory(): it forgot to
    use PF_MEMALLOC, which meant the system could get stuck in a deadlock.
    - shrink_all_memory() now has lockdep awareness, which brings good
    debuggability.
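
    An illustrative sketch of the resulting shape (not the literal kernel
    code; the structure fields shown are assumptions): shrink_all_memory()
    just fills in a scan control and delegates to the common reclaim entry
    point.

      struct scan_control {
              unsigned long nr_to_reclaim;
              int hibernation_mode;
      };

      /* stands in for the common reclaim path shared with direct reclaim */
      static unsigned long do_try_to_free_pages(struct scan_control *sc)
      {
              return sc->nr_to_reclaim;
      }

      unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
      {
              struct scan_control sc = {
                      .nr_to_reclaim    = nr_to_reclaim,
                      .hibernation_mode = 1,
              };
              /* the real code also sets PF_MEMALLOC around the call to avoid
                 recursing into reclaim from within reclaim */
              return do_try_to_free_pages(&sc);
      }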

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Rik van Riel
    Acked-by: Rafael J. Wysocki
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Currently, sc.swap_cluster_max has two meanings:

    1) the reclaim batch size, as isolate_lru_pages()'s argument
    2) the reclaim bail-out threshold

    The two meanings are pretty unrelated, so let's separate them. This patch
    doesn't change any behavior.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • If reclaim fails to make sufficient progress, the priority is raised.
    Once the priority is higher, kswapd starts waiting on congestion.
    However, if the zone is below the min watermark then kswapd needs to
    continue working without delay as there is a danger of an increased rate
    of GFP_ATOMIC allocation failure.

    This patch changes the conditions under which kswapd waits on congestion
    by only going to sleep if the min watermarks are being met.

    [mel@csn.ul.ie: add stats to track how relevant the logic is]
    [mel@csn.ul.ie: make kswapd only check its own zones and rename the relevant counters]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • After kswapd balances all zones in a pgdat, it goes to sleep. In the
    event of no IO congestion, kswapd can go to sleep very shortly after the
    high watermark was reached. If there is a constant stream of allocations
    from parallel processes, it can mean that kswapd went to sleep too quickly
    and the high watermark is not being maintained for a sufficient length of
    time.

    This patch makes kswapd go to sleep as a two-stage process. It first
    tries to sleep for HZ/10. If it is woken up by another process or the
    high watermark is no longer met, it's considered a premature sleep and
    kswapd continues work. Otherwise it goes fully to sleep.

    This adds more counters to distinguish between fast and slow breaches of
    watermarks. A "fast" premature sleep is one where the low watermark was
    hit in a very short time after kswapd going to sleep. A "slow" premature
    sleep indicates that the high watermark was breached after a very short
    interval.
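
    An illustrative sketch of the two-stage sleep (stubbed and standalone;
    names roughly follow the text, not necessarily the kernel):

      #include <stdbool.h>

      #define HZ 100

      static bool sleeping_prematurely(void) { return false; } /* watermark re-check, stubbed */
      static void sleep_for(int ticks)       { (void)ticks; }  /* schedule_timeout() stand-in */
      static void sleep_until_woken(void)    { }

      static void kswapd_try_to_sleep(void)
      {
              sleep_for(HZ / 10);              /* stage 1: short nap */

              if (sleeping_prematurely())
                      return;                  /* woken or watermark breached: keep working */

              sleep_until_woken();             /* stage 2: commit to a full sleep */
      }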

    Signed-off-by: Mel Gorman
    Cc: Frans Pop
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit 543ade1fc9 ("Streamline generic_file_* interfaces and filemap
    cleanups") removed generic_file_write() in filemap. Change the comment in
    vmscan pageout() to __generic_file_aio_write().

    Signed-off-by: Vincent Li
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vincent Li
     
  • When memory is hot-removed, its node must be cleared in N_HIGH_MEMORY if
    there are no present pages left.

    In such a situation, kswapd must also be stopped since it has nothing left
    to do.

    Signed-off-by: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: Yasunori Goto
    Cc: Mel Gorman
    Cc: Rafael J. Wysocki
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Mel Gorman
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: Andi Kleen
    Cc: David Rientjes
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

29 Oct, 2009

3 commits

  • Isolators putting a page back to the LRU do not hold the page lock, and if
    the page is mlocked, another thread might munlock it concurrently.

    Expecting this, the putback code re-checks the evictability of a page when
    it just moved it to the unevictable list in order to correct its decision.

    The problem, however, is that ordering is not guaranteed between setting
    PG_lru when moving the page to the list and checking PG_mlocked
    afterwards:

    #0:                                   #1:

    spin_lock()
                                          if (TestClearPageMlocked())
                                            if (PageLRU())
                                              move to evictable list
    SetPageLRU()
    spin_unlock()
    if (!PageMlocked())
      move to evictable list

    The PageMlocked() check may get reordered before SetPageLRU() in #0,
    resulting in #0 not moving the still mlocked page, and in #1 failing to
    isolate and move the page as well. The page is now stranded on the
    unevictable list.

    The race condition is very unlikely. The consequence currently is one
    page falling off the reclaim grid and eventually getting freed with
    PG_unevictable set, which triggers a warning in the page allocator.

    TestClearPageMlocked() in #1 already provides full memory barrier
    semantics.

    This patch adds an explicit full barrier to force ordering between
    SetPageLRU() and PageMlocked() so that either one of the competitors
    rescues the page.
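
    A standalone C11 model of the fix (not the kernel code; PG_lru and
    PG_mlocked are modelled as atomics): the full fence between publishing
    PG_lru and re-reading PG_mlocked guarantees that at least one of the two
    threads observes the other's update and rescues the page.

      #include <stdatomic.h>
      #include <stdbool.h>

      static atomic_bool pg_lru;                 /* page starts off the LRU */
      static atomic_bool pg_mlocked = true;      /* page starts out mlocked */
      static atomic_int  moved_to_evictable;

      static void putback_path(void)             /* #0 */
      {
              atomic_store(&pg_lru, true);               /* SetPageLRU()          */
              atomic_thread_fence(memory_order_seq_cst); /* the added full barrier */
              if (!atomic_load(&pg_mlocked))
                      atomic_fetch_add(&moved_to_evictable, 1);
      }

      static void munlock_path(void)             /* #1 */
      {
              /* TestClearPageMlocked(): an atomic RMW, already a full barrier */
              if (atomic_exchange(&pg_mlocked, false) && atomic_load(&pg_lru))
                      atomic_fetch_add(&moved_to_evictable, 1);
      }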

    Signed-off-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Lee Schermerhorn
    Cc: Peter Zijlstra
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • It is possible to have pages that are !Anon but SwapBacked, and some apps
    could create a huge number of such pages with MAP_SHARED|MAP_ANONYMOUS.
    These pages go onto the ANON LRU list and hence shall not be protected: we
    only care about mapped executable files. Failing to do so may trigger
    OOM.

    Tested-by: Christian Borntraeger
    Reviewed-by: Rik van Riel
    Signed-off-by: Wu Fengguang
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Commit 8aa7e847d (Fix congestion_wait() sync/async vs read/write
    confusion) replaced WRITE with BLK_RW_ASYNC. Unfortunately, concurrent mm
    development accidentally left one spot unchanged.

    This patch fixes it too.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Jens Axboe
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

26 Sep, 2009

2 commits

  • * 'writeback' of git://git.kernel.dk/linux-2.6-block:
    writeback: writeback_inodes_sb() should use bdi_start_writeback()
    writeback: don't delay inodes redirtied by a fast dirtier
    writeback: make the super_block pinning more efficient
    writeback: don't resort for a single super_block in move_expired_inodes()
    writeback: move inodes from one super_block together
    writeback: get rid to incorrect references to pdflush in comments
    writeback: improve readability of the wb_writeback() continue/break logic
    writeback: cleanup writeback_single_inode()
    writeback: kupdate writeback shall not stop when more io is possible
    writeback: stop background writeback when below background threshold
    writeback: balance_dirty_pages() shall write more than dirtied pages
    fs: Fix busyloop in wb_writeback()

    Linus Torvalds
     
  • Signed-off-by: Jens Axboe

    Jens Axboe
     

24 Sep, 2009

3 commits

  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
    HWPOISON: Enable error_remove_page on btrfs
    HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
    HWPOISON: Add madvise() based injector for hardware poisoned pages v4
    HWPOISON: Enable error_remove_page for NFS
    HWPOISON: Enable .remove_error_page for migration aware file systems
    HWPOISON: The high level memory error handler in the VM v7
    HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
    HWPOISON: shmem: call set_page_dirty() with locked page
    HWPOISON: Define a new error_remove_page address space op for async truncation
    HWPOISON: Add invalidate_inode_page
    HWPOISON: Refactor truncate to allow direct truncating of page v2
    HWPOISON: check and isolate corrupted free pages v2
    HWPOISON: Handle hardware poisoned pages in try_to_unmap
    HWPOISON: Use bitmask/action code for try_to_unmap behaviour
    HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
    HWPOISON: Add poison check to page fault handling
    HWPOISON: Add basic support for poisoned pages in fault handler v3
    HWPOISON: Add new SIGBUS error codes for hardware poison signals
    HWPOISON: Add support for poison swap entries v2
    HWPOISON: Export some rmap vma locking to outside world
    ...

    Linus Torvalds
     
  • It's unused.

    It isn't needed -- read or write flag is already passed and sysctl
    shouldn't care about the rest.

    It _was_ used in two places at arch/frv for some reason.

    Signed-off-by: Alexey Dobriyan
    Cc: David Howells
    Cc: "Eric W. Biederman"
    Cc: Al Viro
    Cc: Ralf Baechle
    Cc: Martin Schwidefsky
    Cc: Ingo Molnar
    Cc: "David S. Miller"
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Implement reclaim from groups over their soft limit

    Permit reclaim from memory cgroups on contention (via the direct reclaim
    path).

    Memory cgroup soft limit reclaim finds the group that exceeds its soft
    limit by the largest number of pages, reclaims pages from it, and then
    reinserts the cgroup into its correct place in the rbtree.

    Add additional checks to mem_cgroup_hierarchical_reclaim() to detect long
    loops in case all swap is turned off. The code has been refactored and
    the loop check (loop < 2) has been enhanced for soft limits. For soft
    limits, we try to do more targeted reclaim. Instead of bailing out after
    two loops, the routine now reclaims memory proportional to the amount by
    which the soft limit is exceeded. The proportion has been empirically
    determined; a sketch of the victim selection is given below.
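
    A minimal sketch of the selection policy described above (standalone and
    illustrative: the kernel keeps the groups in an rbtree ordered by excess,
    while this sketch simply scans an array):

      struct group { unsigned long usage, soft_limit; };

      static long excess(const struct group *g)
      {
              return (long)(g->usage - g->soft_limit);
      }

      /* pick the group exceeding its soft limit by the largest number of pages */
      static struct group *pick_victim(struct group *groups, int n)
      {
              struct group *victim = 0;
              int i;

              for (i = 0; i < n; i++)
                      if (excess(&groups[i]) > 0 &&
                          (!victim || excess(&groups[i]) > excess(victim)))
                              victim = &groups[i];

              return victim;   /* reclaim from it, then reinsert at its new rank */
      }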

    [akpm@linux-foundation.org: build fix]
    [kamezawa.hiroyu@jp.fujitsu.com: fix softlimit css refcnt handling]
    [nishimura@mxp.nes.nec.co.jp: refcount of the "victim" should be decremented before exiting the loop]
    Signed-off-by: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Acked-by: KOSAKI Motohiro
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     

22 Sep, 2009

5 commits

  • Commit 084f71ae5c (kill page_queue_congested()) removed
    page_queue_congested(). Remove the page_queue_congested() comment in
    vmscan pageout() too.

    Signed-off-by: Vincent Li
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vincent Li
     
  • For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1, in
    which case shrink_list() _still_ calls isolate_pages() with the much
    larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list scan
    rate by up to 32 times.

    For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
    So when shrink_zone() expects to scan 4 pages in the active/inactive list,
    the active list will be scanned 4 pages, while the inactive list will be
    (over) scanned SWAP_CLUSTER_MAX=32 pages in effect. And that could break
    the balance between the two lists.

    It can further impact the scan of anon active list, due to the anon
    active/inactive ratio rebalance logic in balance_pgdat()/shrink_zone():

    inactive anon list over scanned => inactive_anon_is_low() == TRUE
    => shrink_active_list()
    => active anon list over scanned

    So the end result may be

    - anon inactive => over scanned
    - anon active => over scanned (maybe not as much)
    - file inactive => over scanned
    - file active => under scanned (relatively)

    The accesses to nr_saved_scan are not lock protected and so not 100%
    accurate; however, we can tolerate small errors and the resulting small
    imbalance in scan rates between zones.
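
    A sketch of the batching involved (reconstructed and illustrative, not
    quoted from the patch): small per-call scan requests accumulate in
    nr_saved_scan and are only flushed as a full SWAP_CLUSTER_MAX batch,
    instead of each tiny request being rounded up to a full batch.

      #define SWAP_CLUSTER_MAX 32UL

      static unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
                                             unsigned long *nr_saved_scan)
      {
              unsigned long nr = *nr_saved_scan + nr_to_scan;

              if (nr >= SWAP_CLUSTER_MAX) {
                      *nr_saved_scan = 0;      /* flush a full batch */
                      return nr;
              }

              *nr_saved_scan = nr;             /* keep saving up */
              return 0;                        /* scan nothing this time */
      }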

    Cc: Rik van Riel
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Balbir Singh
    Reviewed-by: Minchan Kim
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • The name `zone_nr_pages' can be misread as the zone's (total) number of
    pages, but it actually returns the number of pages on one of the zone's
    LRU lists.

    Signed-off-by: Vincent Li
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vincent Li
     
  • Enlighten the reader of this code about what reference count makes a page
    cache page freeable.

    Signed-off-by: Johannes Weiner
    Cc: Christoph Lameter
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Make page_has_private() return a true boolean value and remove the double
    negations from the two callsites using it for arithmetic.

    Signed-off-by: Johannes Weiner
    Cc: Christoph Lameter
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner