27 Jul, 2011

4 commits

  • The commit log of 0ae5e89c60c9 ("memcg: count the soft_limit reclaim
    in...") says it adds scanning stats to the memory.stat file, but it
    doesn't, because we felt we first needed to reach a consensus on such
    new APIs.

    This patch is a trial to add memory.scan_stat. It shows
    - the number of scanned pages (total, anon, file)
    - the number of rotated pages (total, anon, file)
    - the number of freed pages (total, anon, file)
    - the elapsed time (including sleep/pause time)

    for both direct and soft reclaim.

    The biggest difference from Ying's original version is that this file
    can be reset by a write, as in:

    # echo 0 > ...../memory.scan_stat

    An example of the output is below. This is the result after a
    "make -j 6" kernel build under a 300M limit.

    [kamezawa@bluextal ~]$ cat /cgroup/memory/A/memory.scan_stat
    scanned_pages_by_limit 9471864
    scanned_anon_pages_by_limit 6640629
    scanned_file_pages_by_limit 2831235
    rotated_pages_by_limit 4243974
    rotated_anon_pages_by_limit 3971968
    rotated_file_pages_by_limit 272006
    freed_pages_by_limit 2318492
    freed_anon_pages_by_limit 962052
    freed_file_pages_by_limit 1356440
    elapsed_ns_by_limit 351386416101
    scanned_pages_by_system 0
    scanned_anon_pages_by_system 0
    scanned_file_pages_by_system 0
    rotated_pages_by_system 0
    rotated_anon_pages_by_system 0
    rotated_file_pages_by_system 0
    freed_pages_by_system 0
    freed_anon_pages_by_system 0
    freed_file_pages_by_system 0
    elapsed_ns_by_system 0
    scanned_pages_by_limit_under_hierarchy 9471864
    scanned_anon_pages_by_limit_under_hierarchy 6640629
    scanned_file_pages_by_limit_under_hierarchy 2831235
    rotated_pages_by_limit_under_hierarchy 4243974
    rotated_anon_pages_by_limit_under_hierarchy 3971968
    rotated_file_pages_by_limit_under_hierarchy 272006
    freed_pages_by_limit_under_hierarchy 2318492
    freed_anon_pages_by_limit_under_hierarchy 962052
    freed_file_pages_by_limit_under_hierarchy 1356440
    elapsed_ns_by_limit_under_hierarchy 351386416101
    scanned_pages_by_system_under_hierarchy 0
    scanned_anon_pages_by_system_under_hierarchy 0
    scanned_file_pages_by_system_under_hierarchy 0
    rotated_pages_by_system_under_hierarchy 0
    rotated_anon_pages_by_system_under_hierarchy 0
    rotated_file_pages_by_system_under_hierarchy 0
    freed_pages_by_system_under_hierarchy 0
    freed_anon_pages_by_system_under_hierarchy 0
    freed_file_pages_by_system_under_hierarchy 0
    elapsed_ns_by_system_under_hierarchy 0

    The ..._under_hierarchy entries are for hierarchy management.

    This will be useful for further memcg development and needs to be in
    place before we do any complicated rework of LRU/softlimit
    management.

    This patch adds a new struct memcg_scanrecord to the scan_control
    struct. sc->nr_scanned et al. are not designed for exporting
    information; for example, nr_scanned is reset frequently and is
    incremented by 2 when scanning mapped pages.

    To avoid complexity, I added a new parameter to scan_control which is
    dedicated to exporting the scanning statistics.
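
    As an illustration, here is a simplified sketch of what such a record
    hanging off scan_control could look like. The struct layout and field
    names below are assumptions made for the example, not copied from the
    patch.

    /* Sketch only: illustrative names, not the kernel's definitions. */
    enum { SCAN_BY_LIMIT, SCAN_BY_SYSTEM };

    struct mem_cgroup;                      /* opaque here */

    struct memcg_scanrecord {
        struct mem_cgroup *mem;             /* cgroup being scanned */
        struct mem_cgroup *root;            /* root of the scanned hierarchy */
        int context;                        /* SCAN_BY_LIMIT or SCAN_BY_SYSTEM */
        unsigned long nr_scanned[2];        /* [0] anon, [1] file */
        unsigned long nr_rotated[2];
        unsigned long nr_freed[2];
        unsigned long elapsed;              /* ns, incl. sleep/pause time */
    };

    struct scan_control_sketch {
        unsigned long nr_scanned;           /* existing, frequently reset */
        struct memcg_scanrecord *record;    /* new: stats meant for export */
    };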

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Michal Hocko
    Cc: Ying Han
    Cc: Andrew Bresticker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Commit 246e87a93934 ("memcg: fix get_scan_count() for small targets")
    fixes the memcg/kswapd behaviour against small targets and prevents the
    vmscan priority from climbing too high.

    But the implementation is too naive and adds another problem for small
    memcgs. It always forces a scan of 32 file/anon pages and doesn't handle
    swappiness or other rotation information. This makes vmscan scan the anon
    LRU regardless of swappiness and degrades reclaim. This patch fixes it by
    adjusting the scan count with regard to swappiness et al.
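
    A minimal sketch of the idea in plain user-space C: when a small target
    forces a minimum scan batch, split the batch between anon and file
    according to swappiness-derived weights rather than forcing both lists.
    The weights and helper below are illustrative assumptions, not the
    actual get_scan_count() code.

    #include <stdio.h>

    #define SWAP_CLUSTER_MAX 32UL

    /* Split a forced minimum scan batch by the anon/file weights. */
    static void forced_scan(unsigned long anon_weight, unsigned long file_weight,
                            unsigned long scan[2])
    {
        unsigned long total = anon_weight + file_weight;

        if (!total)
            total = 1;                          /* avoid a zero divisor */
        scan[0] = SWAP_CLUSTER_MAX * anon_weight / total;   /* anon */
        scan[1] = SWAP_CLUSTER_MAX - scan[0];               /* file */
    }

    int main(void)
    {
        unsigned long scan[2];

        forced_scan(20, 180, scan);             /* low-swappiness weighting */
        printf("anon %lu file %lu\n", scan[0], scan[1]);
        return 0;
    }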

    In a test of "cat a 1G file under a 300M limit" (swappiness=20):
    before the patch:
    scanned_pages_by_limit 360919
    scanned_anon_pages_by_limit 180469
    scanned_file_pages_by_limit 180450
    rotated_pages_by_limit 31
    rotated_anon_pages_by_limit 25
    rotated_file_pages_by_limit 6
    freed_pages_by_limit 180458
    freed_anon_pages_by_limit 19
    freed_file_pages_by_limit 180439
    elapsed_ns_by_limit 429758872
    after the patch:
    scanned_pages_by_limit 180674
    scanned_anon_pages_by_limit 24
    scanned_file_pages_by_limit 180650
    rotated_pages_by_limit 35
    rotated_anon_pages_by_limit 24
    rotated_file_pages_by_limit 11
    freed_pages_by_limit 180634
    freed_anon_pages_by_limit 0
    freed_file_pages_by_limit 180634
    elapsed_ns_by_limit 367119089
    scanned_pages_by_system 0

    The number of anon pages scanned is decreased (as expected), and the
    elapsed time is reduced. With this patch, small memcgs will work better.
    (*) Because the amount of file cache is much bigger than anon,
    reclaim_stat's rotate/scan counters bias the scan towards files.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Michal Hocko
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • In mm/memcontrol.c, there are many LRU stat functions, such as:

    mem_cgroup_zone_nr_lru_pages
    mem_cgroup_node_nr_file_lru_pages
    mem_cgroup_nr_file_lru_pages
    mem_cgroup_node_nr_anon_lru_pages
    mem_cgroup_nr_anon_lru_pages
    mem_cgroup_node_nr_unevictable_lru_pages
    mem_cgroup_nr_unevictable_lru_pages
    mem_cgroup_node_nr_lru_pages
    mem_cgroup_nr_lru_pages
    mem_cgroup_get_local_zonestat

    Some of them are under #if MAX_NUMNODES > 1 and others are not.
    This seems bad. This patch consolidates all of them into:

    mem_cgroup_zone_nr_lru_pages()
    mem_cgroup_node_nr_lru_pages()
    mem_cgroup_nr_lru_pages()

    For these functions, "which LRU?" information is passed by a mask.

    example:
    mem_cgroup_nr_lru_pages(mem, BIT(LRU_ACTIVE_ANON))

    I also added macros: ALL_LRU, ALL_LRU_FILE and ALL_LRU_ANON.

    example:
    mem_cgroup_nr_lru_pages(mem, ALL_LRU)

    BTW, considering the NUMA placement of the counters in memory, this patch
    also seems better.

    Currently, when we gather all LRU information, we scan in the following
    order:
    for_each_lru -> for_each_node -> for_each_zone.

    This means we touch cache lines on different nodes in turn.

    After the patch, we scan:
    for_each_node -> for_each_zone -> for_each_lru(mask)

    so we gather the information in the same cache lines at once.
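
    A self-contained sketch of the mask-based accessor and the new loop
    nesting is below. The array layout, node/zone counts and helper names
    are assumptions for illustration; only the ALL_LRU-style macros and the
    mask idea come from the description above.

    #include <stdio.h>

    enum lru_list { LRU_INACTIVE_ANON, LRU_ACTIVE_ANON,
                    LRU_INACTIVE_FILE, LRU_ACTIVE_FILE,
                    LRU_UNEVICTABLE, NR_LRU_LISTS };

    #define BIT(nr)      (1UL << (nr))
    #define ALL_LRU_ANON (BIT(LRU_INACTIVE_ANON) | BIT(LRU_ACTIVE_ANON))
    #define ALL_LRU_FILE (BIT(LRU_INACTIVE_FILE) | BIT(LRU_ACTIVE_FILE))
    #define ALL_LRU      (ALL_LRU_ANON | ALL_LRU_FILE | BIT(LRU_UNEVICTABLE))

    #define NR_NODES 2
    #define NR_ZONES 3

    /* counters[node][zone][lru]: walking node -> zone -> lru(mask) keeps
     * the accesses within one node's cache lines at a time. */
    static unsigned long counters[NR_NODES][NR_ZONES][NR_LRU_LISTS];

    static unsigned long nr_lru_pages(unsigned long lru_mask)
    {
        unsigned long total = 0;
        int node, zone, lru;

        for (node = 0; node < NR_NODES; node++)
            for (zone = 0; zone < NR_ZONES; zone++)
                for (lru = 0; lru < NR_LRU_LISTS; lru++)
                    if (lru_mask & BIT(lru))
                        total += counters[node][zone][lru];
        return total;
    }

    int main(void)
    {
        counters[0][1][LRU_ACTIVE_ANON] = 100;
        counters[1][2][LRU_ACTIVE_FILE] = 40;
        printf("active anon: %lu, all: %lu\n",
               nr_lru_pages(BIT(LRU_ACTIVE_ANON)), nr_lru_pages(ALL_LRU));
        return 0;
    }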

    [akpm@linux-foundation.org: fix warnigns, build error]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Michal Hocko
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Each memory cgroup has a 'swappiness' value which can be accessed by
    get_swappiness(memcg). The major user is try_to_free_mem_cgroup_pages(),
    where swappiness is passed as an argument and propagated via scan_control.

    get_swappiness() is a static function, but some planned updates will need
    to get swappiness from files other than memcontrol.c. This patch exports
    get_swappiness() as mem_cgroup_swappiness(). With this, we can remove the
    swappiness argument from try_to_free... and drop swappiness from
    scan_control; only memcg uses it.
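
    For illustration, a minimal sketch of what the exported accessor amounts
    to; the struct below is a stand-in, and the root-cgroup fallback of the
    real helper is only hinted at in a comment.

    /* Sketch only: simplified stand-in for struct mem_cgroup. */
    struct mem_cgroup_sketch {
        unsigned int swappiness;
    };

    unsigned int mem_cgroup_swappiness(struct mem_cgroup_sketch *memcg)
    {
        /* the real helper falls back to the global vm_swappiness for the
         * root cgroup; omitted in this sketch */
        return memcg->swappiness;
    }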

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Michal Hocko
    Cc: Ying Han
    Cc: Shaohua Li
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

20 Jul, 2011

5 commits

  • For shrinkers that have their own cond_resched* calls, having
    shrink_slab break the work down into small batches is not
    particularly efficient. Add a custom batch-size field to the struct
    shrinker so that shrinkers can use a larger batch size if they
    desire.

    A value of zero (uninitialised) means "use the default", so
    behaviour is unchanged by this patch.
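
    A sketch of the shape this takes, with simplified types; the field and
    constant names are illustrative rather than the kernel's exact ones.

    /* Sketch only. */
    #define SHRINK_BATCH 128        /* the historical default batch size */

    struct shrinker_sketch {
        long (*shrink)(void *ctx, unsigned long nr_to_scan);
        int seeks;
        long batch;                 /* new: 0 (uninitialised) => default */
    };

    static long effective_batch(const struct shrinker_sketch *s)
    {
        return s->batch ? s->batch : SHRINK_BATCH;
    }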

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • When a shrinker returns -1 to shrink_slab() to indicate it cannot do
    any work given the current memory reclaim requirements, it adds the
    entire total_scan count to shrinker->nr. The idea behind this is that
    when the shrinker is next called and can do work, it will do the work
    of the previously aborted shrinker call as well.

    However, if a filesystem is doing lots of allocation with GFP_NOFS
    set, then we get many, many more aborts from the shrinkers than we
    do successful calls. The result is that shrinker->nr winds up to
    its maximum permissible value (twice the current cache size) and
    then when the next shrinker call that can do work is issued, it
    has enough scan count built up to free the entire cache twice over.

    This manifests itself in the cache going from full to empty in a
    matter of seconds, even when only a small part of the cache is
    needed to be emptied to free sufficient memory.

    Under metadata intensive workloads on ext4 and XFS, I'm seeing the
    VFS caches increase memory consumption up to 75% of memory (no page
    cache pressure) over a period of 30-60s, and then the shrinker
    empties them down to zero in the space of 2-3s. This cycle repeats
    over and over again, with the shrinker completely trashing the inode
    and dentry caches every minute or so for as long as the workload continues.

    This behaviour was made obvious by the shrink_slab tracepoints added
    earlier in the series, and made worse by the patch that corrected
    the concurrent accounting of shrinker->nr.

    To avoid this problem, stop repeated small increments of the total
    scan value from winding shrinker->nr up to a value that can cause
    the entire cache to be freed. We still need to allow it to wind up,
    so use the delta as the "large scan" threshold check - if the delta
    is more than a quarter of the entire cache size, then it is a large
    scan and allowed to cause lots of windup because we are clearly
    needing to free lots of memory.

    If it isn't a large scan then limit the total scan to half the size
    of the cache so that windup never increases to consume the whole
    cache. Reducing the total scan limit further does not allow enough
    wind-up to maintain the current levels of performance, whilst a
    higher threshold does not prevent the windup from freeing the entire
    cache under sustained workloads.
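
    A user-space sketch of the wind-up limiting described above; the
    thresholds mirror the description (a quarter and half of the cache
    size), but the function itself is illustrative, not the actual
    shrink_slab() code.

    #include <stdio.h>

    /*
     * nr       - deferred work carried over from earlier aborted calls
     * delta    - newly computed scan work for this call
     * max_pass - current size of the cache as reported by the shrinker
     */
    static unsigned long limited_total_scan(unsigned long nr, unsigned long delta,
                                            unsigned long max_pass)
    {
        unsigned long total_scan = nr + delta;

        /*
         * Only a genuinely large scan (delta > max_pass / 4) may wind the
         * work up freely; small scans are capped at half the cache size so
         * repeated GFP_NOFS aborts can never queue a full-cache purge.
         */
        if (delta < max_pass / 4 && total_scan > max_pass / 2)
            total_scan = max_pass / 2;

        return total_scan;
    }

    int main(void)
    {
        /* heavily wound-up deferred work plus a tiny new delta: capped at half */
        printf("%lu\n", limited_total_scan(90000, 100, 100000)); /* 50000 */
        return 0;
    }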

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • shrink_slab() allows shrinkers to be called in parallel, so the
    struct shrinker can be updated concurrently. It does not provide any
    exclusion for such updates, so we can get the shrinker->nr value
    increasing or decreasing incorrectly.

    As a result, when a shrinker repeatedly returns a value of -1 (e.g.
    a VFS shrinker called w/ GFP_NOFS), the shrinker->nr goes haywire,
    sometimes updating with the scan count that wasn't used, sometimes
    losing it altogether. Worse is when a shrinker does work and that
    update is lost due to racy updates, which means the shrinker will do
    the work again!

    Fix this by making the total_scan calculations independent of
    shrinker->nr, and making the shrinker->nr updates atomic w.r.t. to
    other updates via cmpxchg loops.
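
    A sketch of the race-free update, using C11 atomics as a stand-in for
    the kernel's atomic_long/cmpxchg primitives; the helper names are
    invented for the example.

    #include <stdatomic.h>

    static _Atomic long deferred_nr;    /* stands in for shrinker->nr */

    /* Atomically claim the whole deferred count for this shrink_slab() call. */
    static long take_deferred_work(void)
    {
        return atomic_exchange(&deferred_nr, 0);
    }

    /*
     * Fold unused work back in with a compare-and-swap loop so concurrent
     * callers can neither lose an update nor double-count it.
     */
    static void return_unused_work(long unused)
    {
        long old = atomic_load(&deferred_nr);

        while (!atomic_compare_exchange_weak(&deferred_nr, &old, old + unused))
            ;   /* 'old' has been refreshed by the failed exchange; retry */
    }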

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • It is impossible to understand what the shrinkers are actually doing
    without instrumenting the code, so add some tracepoints to allow
    insight to be gained.

    Signed-off-by: Dave Chinner
    Signed-off-by: Al Viro

    Dave Chinner
     
  • I'm running a workload which triggers a lot of swap on a machine with 4
    nodes. After I kill the workload, I find a kswapd livelock. Sometimes
    kswapd3 or kswapd2 keep running and I can't access the filesystem, even
    though most memory is free.

    This looks like a regression since commit 08951e545918c159 ("mm: vmscan:
    correct check for kswapd sleeping in sleeping_prematurely").

    Nodes 2 and 3 have only ZONE_NORMAL, but balance_pgdat() will return 0
    for classzone_idx. The reason is that end_zone in balance_pgdat() is 0 by
    default; if all zones have their watermarks ok, end_zone stays 0.

    Later, sleeping_prematurely() always returns true, because this is an
    order-3 wakeup and, with classzone_idx 0, both balanced_pages and
    present_pages in pgdat_balanced() are 0. We add a special case here:
    if a zone has no pages, we consider it balanced. This fixes the livelock.
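
    A simplified model of the balance check with that special case; the
    threshold and struct layout are illustrative, and only the "an empty
    zone counts as balanced" rule comes from the fix.

    struct zone_model {
        unsigned long present_pages;
        int watermark_ok;
    };

    static int pgdat_balanced_model(struct zone_model *zones, int classzone_idx)
    {
        unsigned long present = 0, balanced = 0;
        int i;

        for (i = 0; i <= classzone_idx; i++) {
            present += zones[i].present_pages;
            if (zones[i].watermark_ok)
                balanced += zones[i].present_pages;
        }

        /* special case: a node with no pages up to classzone_idx has
         * nothing to balance, so report it as balanced (this is what
         * fixes the livelock) */
        if (!present)
            return 1;

        return balanced > present / 4;   /* threshold illustrative */
    }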

    Signed-off-by: Shaohua Li
    Acked-by: Mel Gorman
    Cc: Minchan Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

09 Jul, 2011

4 commits

  • During allocator-intensive workloads, kswapd will be woken frequently
    causing free memory to oscillate between the high and min watermark. This
    is expected behaviour. Unfortunately, if the highest zone is small, a
    problem occurs.

    When balance_pgdat() returns, it may be at a lower classzone_idx than it
    started because the highest zone was unreclaimable. Before checking if it
    should go to sleep though, it checks pgdat->classzone_idx which when there
    is no other activity will be MAX_NR_ZONES-1. It interprets this as it has
    been woken up while reclaiming, skips scheduling and reclaims again. As
    there is no useful reclaim work to do, it enters into a loop of shrinking
    slab consuming loads of CPU until the highest zone becomes reclaimable for
    a long period of time.

    There are two problems here. 1) If the returned classzone or order is
    lower, it'll continue reclaiming without scheduling. 2) if the highest
    zone was marked unreclaimable but balance_pgdat() returns immediately at
    DEF_PRIORITY, the new lower classzone is not communicated back to kswapd()
    for sleeping.

    This patch does two things that are related. If the end_zone is
    unreclaimable, this information is communicated back. Second, if the
    classzone or order was reduced due to failing to reclaim, new information
    is not read from pgdat and instead an attempt is made to go to sleep. Due
    to this, it is also necessary that pgdat->classzone_idx be initialised
    each time to pgdat->nr_zones - 1 to avoid re-reads being interpreted as
    wakeups.

    Signed-off-by: Mel Gorman
    Reported-by: Pádraig Brady
    Tested-by: Pádraig Brady
    Tested-by: Andrew Lutomirski
    Acked-by: Rik van Riel
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When deciding if kswapd is sleeping prematurely, the classzone is taken
    into account but this is different to what balance_pgdat() and the
    allocator are doing. Specifically, the DMA zone will be checked based on
    the classzone used when waking kswapd which could be for a GFP_KERNEL or
    GFP_HIGHMEM request. The lowmem reserve limit kicks in, the watermark is
    not met, and kswapd thinks it is sleeping prematurely, which keeps kswapd
    awake in error.

    Signed-off-by: Mel Gorman
    Reported-by: Pádraig Brady
    Tested-by: Pádraig Brady
    Tested-by: Andrew Lutomirski
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • During allocator-intensive workloads, kswapd will be woken frequently
    causing free memory to oscillate between the high and min watermark. This
    is expected behaviour.

    When kswapd applies pressure to zones during node balancing, it checks if
    the zone is above a high+balance_gap threshold. If it is, it does not
    apply pressure but it unconditionally shrinks slab on a global basis which
    is excessive. In the event kswapd is being kept awake due to a high small
    unreclaimable zone, it skips zone shrinking but still calls shrink_slab().

    Once pressure has been applied, the check for the zone being unreclaimable
    is made before the check of whether all_unreclaimable should be set.
    Missing an unreclaimable zone here can cause has_under_min_watermark_zone
    to be set due to an unreclaimable zone, preventing kswapd from backing off
    in congestion_wait().

    Signed-off-by: Mel Gorman
    Reported-by: Pádraig Brady
    Tested-by: Pádraig Brady
    Tested-by: Andrew Lutomirski
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • During allocator-intensive workloads, kswapd will be woken frequently
    causing free memory to oscillate between the high and min watermark. This
    is expected behaviour. Unfortunately, if the highest zone is small, a
    problem occurs.

    This seems to happen most with recent sandybridge laptops, but it's
    probably a coincidence as some of these laptops just happen to have a
    small Normal zone. The reproduction case is almost always while copying
    large files, during which kswapd pegs at 100% CPU until the file is
    deleted or the cache is dropped.

    The problem is mostly down to sleeping_prematurely() keeping kswapd awake
    when the highest zone is small and unreclaimable and compounded by the
    fact we shrink slabs even when not shrinking zones causing a lot of time
    to be spent in shrinkers and a lot of memory to be reclaimed.

    Patch 1 corrects sleeping_prematurely to check the zones matching
    the classzone_idx instead of all zones.

    Patch 2 avoids shrinking slab when we are not shrinking a zone.

    Patch 3 notes that sleeping_prematurely is checking lower zones against
    a high classzone, which is not what allocators or balance_pgdat()
    are doing, leading to an artificial belief that kswapd should
    still be awake.

    Patch 4 notes that when balance_pgdat() gives up on a high zone, the
    decision is not communicated to sleeping_prematurely().

    This problem affects 2.6.38.8 for certain and is expected to affect 2.6.39
    and 3.0-rc4 as well. If accepted, they need to go to -stable to be picked
    up by distros and this series is against 3.0-rc4. I've cc'd people that
    reported similar problems recently to see if they still suffer from the
    problem and if this fixes it.

    This patch: correct the check for kswapd sleeping in sleeping_prematurely()

    During allocator-intensive workloads, kswapd will be woken frequently
    causing free memory to oscillate between the high and min watermark. This
    is expected behaviour.

    A problem occurs if the highest zone is small. balance_pgdat() only
    considers unreclaimable zones when priority is DEF_PRIORITY but
    sleeping_prematurely considers all zones. It's possible for this sequence
    to occur

    1. kswapd wakes up and enters balance_pgdat()
    2. At DEF_PRIORITY, marks highest zone unreclaimable
    3. At DEF_PRIORITY-1, ignores highest zone setting end_zone
    4. At DEF_PRIORITY-1, calls shrink_slab freeing memory from
    highest zone, clearing all_unreclaimable. Highest zone
    is still unbalanced
    5. kswapd returns and calls sleeping_prematurely
    6. sleeping_prematurely looks at *all* zones, not just the ones
    being considered by balance_pgdat. The highest small zone
    has all_unreclaimable cleared but the zone is not
    balanced. all_zones_ok is false so kswapd stays awake

    This patch corrects the behaviour of sleeping_prematurely to check the
    zones balance_pgdat() checked.

    Signed-off-by: Mel Gorman
    Reported-by: Pádraig Brady
    Tested-by: Pádraig Brady
    Tested-by: Andrew Lutomirski
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

28 Jun, 2011

1 commit

  • Commit d149e3b25d7c ("memcg: add the soft_limit reclaim in global direct
    reclaim") adds a softlimit hook to shrink_zones(). With this, soft limit
    reclaim is called as

    try_to_free_pages()
    do_try_to_free_pages()
    shrink_zones()
    mem_cgroup_soft_limit_reclaim()

    so direct reclaim is now aware of memcg softlimit hints.

    But the memory cgroup's "limit" path can also call the softlimit shrinker:

    try_to_free_mem_cgroup_pages()
    do_try_to_free_pages()
    shrink_zones()
    mem_cgroup_soft_limit_reclaim()

    This causes a global reclaim when a memcg hits its limit.

    This is a bug. soft_limit_reclaim() should only be called when
    scanning_global_lru(sc) == true.

    The commit also adds a variable "total_scanned" for counting
    softlimit-scanned pages... but it's not a "total". This patch removes the
    variable and updates sc->nr_scanned instead. This will affect
    shrink_slab()'s scan condition, but the global LRU is what softlimit
    reclaim scans, so I think this change makes sense.

    TODO: avoid too much scanning of a zone when softlimit did enough work.
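
    A sketch of the intended call site with stand-in types and stub helpers;
    the names ending in _model/_stub are invented for the example, and only
    the scanning_global_lru() guard and the nr_scanned accounting reflect
    the description above.

    struct zone_stub { int nid; };

    struct scan_control_model {
        unsigned long nr_scanned, nr_reclaimed;
        void *mem_cgroup;                   /* NULL => global reclaim */
        unsigned int gfp_mask;
    };

    static int scanning_global_lru_model(struct scan_control_model *sc)
    {
        return sc->mem_cgroup == 0;
    }

    /* stub for the soft limit shrinker: returns pages reclaimed and
     * reports pages scanned through *scanned */
    static unsigned long soft_limit_reclaim_stub(struct zone_stub *z,
                                                 unsigned int gfp,
                                                 unsigned long *scanned)
    {
        (void)z; (void)gfp;
        *scanned = 0;
        return 0;
    }

    static void shrink_zones_model(struct zone_stub *zone,
                                   struct scan_control_model *sc)
    {
        if (scanning_global_lru_model(sc)) {    /* not the memcg limit path */
            unsigned long soft_scanned;

            sc->nr_reclaimed += soft_limit_reclaim_stub(zone, sc->gfp_mask,
                                                        &soft_scanned);
            sc->nr_scanned += soft_scanned;     /* no separate "total_scanned" */
        }
        /* ... then the regular per-zone shrinking ... */
    }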

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Ying Han
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

16 Jun, 2011

2 commits

  • It is unsafe to run page_count during the physical pfn scan because
    compound_head could trip on a dangling pointer when reading
    page->first_page if the compound page is being freed by another CPU.

    [mgorman@suse.de: split out patch]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Reviewed-by: Minchan Kim

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Currently, memcg reclaim can disable the swap token even if the swap token
    mm doesn't belong to its memory cgroup. That's slightly risky: if an admin
    creates a very small mem-cgroup and somebody runs a contentious, heavy
    memory-pressure workload in it, every task is going to lose its swap token
    and the system may become unresponsive. That's bad.

    This patch adds a 'memcg' parameter to disable_swap_token(); if the
    parameter doesn't match the swap token's memcg, the VM doesn't disable it.
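
    A small sketch of that guard with simplified types; the structs and the
    owner lookup are stand-ins, and only the "NULL memcg means global
    reclaim may always disable the token" rule comes from the description.

    struct memcg_stub { int id; };
    struct mm_stub    { struct memcg_stub *owner_memcg; };

    static struct mm_stub *swap_token_mm;   /* current token holder, if any */

    static void disable_swap_token(struct memcg_stub *memcg)
    {
        struct mm_stub *mm = swap_token_mm;

        if (!mm)
            return;
        /* global reclaim (memcg == NULL) or a matching memcg may disable it */
        if (!memcg || mm->owner_memcg == memcg)
            swap_token_mm = 0;              /* put_swap_token() in real code */
    }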

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

27 May, 2011

5 commits

  • The caller of the function has been renamed to zone_nr_lru_pages(), and
    this just fixes up the memcg code to match. The current name is easily
    misread as the zone's total number of pages.

    Signed-off-by: Ying Han
    Acked-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • During memory reclaim we determine the number of pages to be scanned per
    zone as

    (anon + file) >> priority.
    Assume
    scan = (anon + file) >> priority.

    If scan < SWAP_CLUSTER_MAX, the scan is skipped this time and the
    priority is raised. This has some problems.

    1. This raises the priority by 1 without doing any scan.
    To do a scan at this priority, the amount of pages should be larger than 512M.
    If pages>>priority < SWAP_CLUSTER_MAX, it's recorded and the scan will be
    batched later. (But we lose 1 priority.)
    If the memory size is below 16M, pages >> priority is 0 and there is no scan
    at DEF_PRIORITY forever.

    2. If zone->all_unreclaimable==true, it's scanned only when priority==0.
    So, x86's ZONE_DMA will never be recovered until the user of the pages
    frees memory by itself.

    3. With memcg, the limit of memory can be small. When using a small memcg,
    it gets priority < DEF_PRIORITY-2 very easily and needs to call
    wait_iff_congested().
    For a scan to happen before priority=9, 64MB of memory would have to be used.

    So this patch tries to forcibly scan SWAP_CLUSTER_MAX pages when

    1. the target is small enough, and
    2. it's kswapd or memcg reclaim.

    Then we can avoid a rapid priority drop and may be able to recover
    all_unreclaimable in small zones. And this patch removes nr_saved_scan.
    This will allow scanning at this priority even when pages >> priority is
    very small.
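
    An illustrative sketch of the per-list scan target with that forced
    minimum; the helper and its arguments are assumptions, not the actual
    get_scan_count() code.

    #define SWAP_CLUSTER_MAX 32UL

    static unsigned long scan_target(unsigned long lru_pages, int priority,
                                     int force_scan /* kswapd or memcg */)
    {
        unsigned long scan = lru_pages >> priority;

        /* small zones/memcgs would compute 0 here, drop priority rapidly
         * and possibly never scan; force a minimum batch instead */
        if (scan < SWAP_CLUSTER_MAX && force_scan)
            scan = SWAP_CLUSTER_MAX;

        return scan;
    }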

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Ying Han
    Cc: Balbir Singh
    Cc: KOSAKI Motohiro
    Cc: Daisuke Nishimura
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Presently, a memory cgroup's direct reclaim frees memory from the current
    node. But this has some problems. Usually, when a set of threads works in
    a cooperative way, they tend to operate on the same node. So if they hit
    their limits under memcg, they will reclaim memory from themselves,
    damaging the active working set.

    For example, assume a 2-node system which has Node 0 and Node 1 and a
    memcg which has a 1G limit. After some work, file cache remains and the
    usages are

    Node 0: 1M
    Node 1: 998M.

    If we then run an application on Node 0, it will eat its own working set
    before freeing the unnecessary file caches.

    This patch adds round-robin node selection for NUMA and applies equal
    pressure to each node. When using cpuset's memory spread feature, this
    will work very well.

    But yes, a better algorithm is needed.
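
    A sketch of round-robin victim-node selection, stripped of node masks
    and timing; the struct and helper names are illustrative.

    #define NR_NODES 4

    struct memcg_numa_model {
        int last_scanned_node;      /* -1 before the first reclaim pass */
    };

    static int select_victim_node(struct memcg_numa_model *memcg)
    {
        int node = (memcg->last_scanned_node + 1) % NR_NODES;

        memcg->last_scanned_node = node;
        return node;                /* reclaim starts from this node */
    }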

    [akpm@linux-foundation.org: comment editing]
    [kamezawa.hiroyu@jp.fujitsu.com: fix time comparisons]
    Signed-off-by: Ying Han
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: KOSAKI Motohiro
    Cc: Daisuke Nishimura
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • We recently added a change to global background reclaim which counts the
    return value of soft_limit reclaim. Now this patch adds similar logic
    to global direct reclaim.

    We should skip scanning the global LRU in shrink_zone() if soft_limit
    reclaim does enough work. This is the first step, where we start by
    counting the nr_scanned and nr_reclaimed from soft_limit reclaim into the
    global scan_control.

    Signed-off-by: Ying Han
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • The global kswapd scans per-zone LRU and reclaims pages regardless of the
    cgroup. It breaks memory isolation since one cgroup can end up reclaiming
    pages from another cgroup. Instead we should rely on memcg-aware target
    reclaim including per-memcg kswapd and soft_limit hierarchical reclaim under
    memory pressure.

    In the global background reclaim, we do soft reclaim before scanning the
    per-zone LRU. However, the return value is ignored. This patch is the first
    step to skip shrink_zone() if soft_limit reclaim does enough work.

    This is part of the effort to reduce reclaiming pages through the global
    LRU when memcg is in use. The per-memcg background reclaim patchset
    further enhances per-cgroup targeted reclaim; I should have V4 of it
    posted shortly.

    Try running multiple memory-intensive workloads within separate memcgs and
    watch the counters of soft_steal in memory.stat.

    $ cat /dev/cgroup/A/memory.stat | grep 'soft'
    soft_steal 240000
    soft_scan 240000
    total_soft_steal 240000
    total_soft_scan 240000

    This patch:

    In the global background reclaim, we do soft reclaim before scanning the
    per-zone LRU. However, the return value is ignored.

    We would like to skip shrink_zone() if soft_limit reclaim does enough
    work. Also, we need to make the memory pressure balanced across per-memcg
    zones, like the logic in the VM core. This patch is the first step, where we
    start with counting the nr_scanned and nr_reclaimed from soft_limit
    reclaim into the global scan_control.

    Signed-off-by: Ying Han
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Acked-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     

25 May, 2011

5 commits

  • Change each shrinker's API by consolidating the existing parameters into a
    shrink_control struct. This will simplify adding further features without
    touching every shrinker's file.
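
    A sketch of the consolidated argument block with simplified types; the
    typedef and struct names are stand-ins for the kernel's gfp_t and
    shrinker types.

    typedef unsigned int gfp_flags_model;   /* stand-in for gfp_t */

    struct shrink_control_model {
        gfp_flags_model gfp_mask;
        unsigned long nr_to_scan;           /* objects to scan this call */
    };

    struct shrinker_model {
        /* one struct argument instead of a growing parameter list */
        long (*shrink)(struct shrinker_model *s,
                       struct shrink_control_model *sc);
        int seeks;
    };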

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: fix warning]
    [kosaki.motohiro@jp.fujitsu.com: fix up new shrinker API]
    [akpm@linux-foundation.org: fix xfs warning]
    [akpm@linux-foundation.org: update gfs2]
    Signed-off-by: Ying Han
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Acked-by: Pavel Emelyanov
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Steven Whitehouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • Consolidate the existing parameters to shrink_slab() into a new
    shrink_control struct. This is needed later to pass the same struct to
    shrinkers.

    Signed-off-by: Ying Han
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Acked-by: Pavel Emelyanov
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • isolate_lru_page() must be called only with a stable reference to the page;
    this is what the comment above it says, and it is reasonable.

    Current isolate_lru_page() users and the sources of their extra page
    references:

    mm/huge_memory.c:
    __collapse_huge_page_isolate() - reference from pte

    mm/memcontrol.c:
    mem_cgroup_move_parent() - get_page_unless_zero()
    mem_cgroup_move_charge_pte_range() - reference from pte

    mm/memory-failure.c:
    soft_offline_page() - fixed, reference from get_any_page()
    delete_from_lru_cache() - reference from caller or get_page_unless_zero()
    [ this looks like a bug, because __memory_failure() can call
    page_action() for a hugepage tail, but it is ok for
    isolate_lru_page(): the tail is pinned and not on the LRU ]

    mm/memory_hotplug.c:
    do_migrate_range() - fixed, get_page_unless_zero()

    mm/mempolicy.c:
    migrate_page_add() - reference from pte

    mm/migrate.c:
    do_move_page_to_node_array() - reference from follow_page()

    mlock.c: - various external references

    mm/vmscan.c:
    putback_lru_page() - reference from isolate_lru_page()

    It seems that all isolate_lru_page() users are now ready for this
    restriction. So, let's replace the redundant get_page_unless_zero() with
    get_page() and add a check of the page's initial reference count with
    VM_BUG_ON().
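
    A user-space model of the strengthened contract; the struct and field
    names are invented for the example.

    #include <assert.h>

    struct page_model { int refcount; int on_lru; };

    static void isolate_lru_page_model(struct page_model *page)
    {
        /* VM_BUG_ON(!page_count(page)): callers must already hold a ref */
        assert(page->refcount > 0);

        if (page->on_lru) {
            page->refcount++;       /* plain get_page(), no unless_zero dance */
            page->on_lru = 0;       /* pull the page off the LRU */
        }
    }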

    Signed-off-by: Konstantin Khlebnikov
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Lee Schermerhorn
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • It has been reported on some laptops that kswapd is consuming large
    amounts of CPU and not being scheduled when SLUB is enabled during large
    amounts of file copying. It is expected that this is due to kswapd
    missing every cond_resched() point because:

    shrink_page_list() calls cond_resched() if inactive pages were isolated,
    which in turn may not happen if all_unreclaimable is set in
    shrink_zones(). If, for whatever reason, all_unreclaimable is
    set on all zones, we can miss calling cond_resched().

    balance_pgdat() only calls cond_resched if the zones are not
    balanced. For a high-order allocation that is balanced, it
    checks order-0 again. During that window, order-0 might have
    become unbalanced so it loops again for order-0 and returns
    that it was reclaiming for order-0 to kswapd(). It can then
    find that a caller has rewoken kswapd for a high-order and
    re-enters balance_pgdat() without ever calling cond_resched().

    shrink_slab only calls cond_resched() if we are reclaiming slab
    pages. If there are a large number of direct reclaimers, the
    shrinker_rwsem can be contended and prevent kswapd calling
    cond_resched().

    This patch modifies the shrink_slab() case. If the semaphore is
    contended, the caller will still check cond_resched(). After each
    successful call into a shrinker, the check for cond_resched() remains in
    case one shrinker is particularly slow.
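
    A sketch of that control flow with stub helpers standing in for the
    rwsem, the shrinker list and cond_resched(); the _stub/_model names are
    invented for the example.

    static int  down_read_trylock_stub(void *sem) { (void)sem; return 1; }
    static void up_read_stub(void *sem)           { (void)sem; }
    static void cond_resched_stub(void)           { /* yield if needed */ }
    static void run_one_shrinker_stub(int i)      { (void)i; }

    static unsigned long shrink_slab_model(void *shrinker_rwsem, int nr_shrinkers)
    {
        int i;

        if (!down_read_trylock_stub(shrinker_rwsem)) {
            /* contended: don't just bail out, give the scheduler a chance */
            cond_resched_stub();
            return 1;
        }

        for (i = 0; i < nr_shrinkers; i++) {
            run_one_shrinker_stub(i);
            cond_resched_stub();    /* kept after each call, in case one
                                       shrinker is particularly slow */
        }

        up_read_stub(shrinker_rwsem);
        return 0;
    }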

    [mgorman@suse.de: preserve call to cond_resched after each call into shrinker]
    Signed-off-by: Mel Gorman
    Signed-off-by: Minchan Kim
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Wu Fengguang
    Cc: James Bottomley
    Tested-by: Colin King
    Cc: Raghavendra D Prabhu
    Cc: Jan Kara
    Cc: Chris Mason
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: [2.6.38+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • There are a few reports of people experiencing hangs when copying large
    amounts of data with kswapd using a large amount of CPU which appear to be
    due to recent reclaim changes. SLUB using high orders is the trigger but
    not the root cause as SLUB has been using high orders for a while. The
    root cause was bugs introduced into reclaim which are addressed by the
    following two patches.

    Patch 1 corrects logic introduced by commit 1741c877 ("mm: kswapd:
    keep kswapd awake for high-order allocations until a percentage of
    the node is balanced") to allow kswapd to go to sleep when
    balanced for high orders.

    Patch 2 notes that it is possible for kswapd to miss every
    cond_resched() and updates shrink_slab() so it'll at least reach
    that scheduling point.

    Chris Wood reports that these two patches in isolation are sufficient to
    prevent the system hanging. AFAIK, they should also resolve similar hangs
    experienced by James Bottomley.

    This patch:

    Johannes Weiner pointed out that the logic in commit 1741c877 ("mm: kswapd:
    keep kswapd awake for high-order allocations until a percentage of the
    node is balanced") is backwards. Instead of allowing kswapd to go to
    sleep when balanced for high-order allocations, it keeps kswapd
    running uselessly.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Signed-off-by: Johannes Weiner
    Reviewed-by: Wu Fengguang
    Cc: James Bottomley
    Tested-by: Colin King
    Cc: Raghavendra D Prabhu
    Cc: Jan Kara
    Cc: Chris Mason
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Reviewed-by: Minchan Kim
    Reviewed-by: Wu Fengguang
    Cc: [2.6.38+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

21 May, 2011

1 commit

  • Commit e66eed651fd1 ("list: remove prefetching from regular list
    iterators") removed the include of prefetch.h from list.h, which
    uncovered several cases that had apparently relied on that rather
    obscure header file dependency.

    So this fixes things up a bit, using

    grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]')
    grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]')

    to guide us in finding files that either need
    inclusion, or have it despite not needing it.

    There are more of them around (mostly network drivers), but this gets
    many core ones.

    Reported-by: Stephen Rothwell
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

18 May, 2011

1 commit

  • ZONE_CONGESTED should be a state of global memory reclaim. If it is not, a
    busy memcg sets it and causes unnecessary throttling in
    wait_iff_congested() for memory reclaim in other contexts. This makes
    system performance bad.

    I'll think later about whether a "memcg is congested!" flag is required or
    not. But this fix is required first.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Daisuke Nishimura
    Acked-by: Ying Han
    Cc: Balbir Singh
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

15 Apr, 2011

1 commit

  • The all_unreclaimable check in direct reclaim was introduced in 2.6.19
    by the following commit.

    2006 Sep 25; commit 408d8544; oom: use unreclaimable info

    It then went through a strange history. First, the following commit broke
    the logic unintentionally.

    2008 Apr 29; commit a41f24ea; page allocator: smarter retry of
    costly-order allocations

    Two years later, I found an obviously meaningless code fragment and
    restored the original intention with the following commit.

    2010 Jun 04; commit bb21c7ce; vmscan: fix do_try_to_free_pages()
    return value when priority==0

    But the logic didn't work when a 32-bit highmem system went into
    hibernation, and Minchan slightly changed the algorithm and fixed it.

    2010 Sep 22: commit d1908362: vmscan: check all_unreclaimable
    in direct reclaim path

    But recently Andrey Vagin found a new corner case. Look:

    struct zone {
    ..
    int all_unreclaimable;
    ..
    unsigned long pages_scanned;
    ..
    }

    zone->all_unreclaimable and zone->pages_scanned are neither atomic
    variables nor protected by a lock. Therefore a zone can end up in the
    state zone->pages_scanned=0 and zone->all_unreclaimable=1. In this case,
    the current all_unreclaimable() returns false even though
    zone->all_unreclaimable=1.

    This resulted in the kernel hanging up when executing a loop of the form

    1. fork
    2. mmap
    3. touch memory
    4. read memory
    5. munmap

    as described in
    http://www.gossamer-threads.com/lists/linux/kernel/1348725#1348725

    Is this an ignorable minor issue? No. Unfortunately, x86 has a very small
    DMA zone and it reaches zone->all_unreclaimable=1 easily, and once it
    becomes all_unreclaimable=1, it never goes back to all_unreclaimable=0.
    Why? With all_unreclaimable=1, vmscan only tries DEF_PRIORITY reclaim, and
    a-few-lru-pages>>DEF_PRIORITY is always 0. That means no page scan at
    all!

    Eventually, the oom-killer never works on such systems. So we can't
    use zone->pages_scanned for this purpose. This patch restores
    all_unreclaimable() to using zone->all_unreclaimable as before, and in
    addition adds an oom_killer_disabled check to avoid reintroducing the
    issue of commit d1908362 ("vmscan: check all_unreclaimable in direct
    reclaim path").
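
    A simplified model of the restored check with the extra guard; the types
    are stand-ins and the structure is illustrative, not the patch's code.

    struct zone_flags_model { int populated; int all_unreclaimable; };

    static int oom_killer_disabled_flag;    /* e.g. set during hibernation */

    static int zones_all_unreclaimable(struct zone_flags_model *zones, int n)
    {
        int i;

        /* never report "hopeless" while the OOM killer is disabled */
        if (oom_killer_disabled_flag)
            return 0;

        for (i = 0; i < n; i++) {
            if (!zones[i].populated)
                continue;
            if (!zones[i].all_unreclaimable)
                return 0;       /* trust the per-zone flag, not pages_scanned */
        }
        return 1;
    }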

    Reported-by: Andrey Vagin
    Signed-off-by: KOSAKI Motohiro
    Cc: Nick Piggin
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

25 Mar, 2011

1 commit

  • * 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
    Documentation/iostats.txt: bit-size reference etc.
    cfq-iosched: removing unnecessary think time checking
    cfq-iosched: Don't clear queue stats when preempt.
    blk-throttle: Reset group slice when limits are changed
    blk-cgroup: Only give unaccounted_time under debug
    cfq-iosched: Don't set active queue in preempt
    block: fix non-atomic access to genhd inflight structures
    block: attempt to merge with existing requests on plug flush
    block: NULL dereference on error path in __blkdev_get()
    cfq-iosched: Don't update group weights when on service tree
    fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
    block: Require subsystems to explicitly allocate bio_set integrity mempool
    jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    fs: make fsync_buffers_list() plug
    mm: make generic_writepages() use plugging
    blk-cgroup: Add unaccounted time to timeslice_used.
    block: fixup plugging stubs for !CONFIG_BLOCK
    block: remove obsolete comments for blkdev_issue_zeroout.
    blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
    ...

    Fix up conflicts in fs/{aio.c,super.c}

    Linus Torvalds
     

23 Mar, 2011

3 commits

  • When reclaiming for order-0 pages, kswapd requires that all zones be
    balanced. Each cycle through balance_pgdat() does background ageing on
    all zones if necessary and applies equal pressure on the inactive zone
    unless a lot of pages are free already.

    A "lot of free pages" is defined as a "balance gap" above the high
    watermark which is currently 7*high_watermark. Historically this was
    reasonable as min_free_kbytes was small. However, on systems using huge
    pages, it is recommended that min_free_kbytes is higher and it is tuned
    with hugeadm --set-recommended-min_free_kbytes. With the introduction of
    transparent huge page support, this recommended value is also applied. On
    X86-64 with 4G of memory, min_free_kbytes becomes 67584 so one would
    expect around 68M of memory to be free. The Normal zone is approximately
    35000 pages so under even normal memory pressure such as copying a large
    file, it gets exhausted quickly. As it is getting exhausted, kswapd
    applies pressure equally to all zones, including the DMA32 zone. DMA32 is
    approximately 700,000 pages with a high watermark of around 23,000 pages.
    In this situation, kswapd will reclaim around (23000*8, where 8 is the high
    watermark plus the balance gap of 7 * high watermark) pages, or 718M of
    pages, before the zone is ignored. What the user sees is free memory far
    higher than it should be.

    To avoid an excessive number of pages being reclaimed from the larger
    zones, explicitly define the "balance gap" to be either 1% of the zone
    or the low watermark for the zone, whichever is smaller. While kswapd
    will check all zones to apply pressure, it'll ignore zones that meet the
    (high_wmark + balance_gap) watermark.
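
    A small sketch of that calculation; the ratio macro name and the low
    watermark value in the example are assumptions, and the zone and high
    watermark sizes reuse the DMA32 numbers quoted above.

    #include <stdio.h>

    #define ZONE_BALANCE_GAP_RATIO 100      /* gap ~ 1/100th of the zone */

    static unsigned long kswapd_target_free(unsigned long zone_pages,
                                            unsigned long low_wmark,
                                            unsigned long high_wmark)
    {
        unsigned long gap = (zone_pages + ZONE_BALANCE_GAP_RATIO - 1) /
                            ZONE_BALANCE_GAP_RATIO;

        if (gap > low_wmark)                /* whichever is smaller */
            gap = low_wmark;

        /* kswapd ignores the zone once free pages reach high_wmark + gap */
        return high_wmark + gap;
    }

    int main(void)
    {
        /* ~700,000-page DMA32 zone with a ~23,000-page high watermark */
        printf("%lu pages\n", kswapd_target_free(700000, 17000, 23000));
        return 0;
    }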

    To test this, 80G were copied from a partition and the amount of memory
    being used was recorded. A comparison of a patched and unpatched kernel can
    be seen at
    http://www.csn.ul.ie/~mel/postings/minfree-20110222/memory-usage-hydra.ps
    and shows that kswapd is not reclaiming as much memory with the patch
    applied.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Shaohua Li
    Cc: "Chen, Tim C"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Now we have renamed remove_from_page_cache to delete_from_page_cache. For
    consistency, as with __remove_from_swap_cache and remove_from_swap_cache,
    change the name of the internal page cache handling function, too.

    Signed-off-by: Minchan Kim
    Cc: Christoph Hellwig
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • This patch reverts 5a03b051 ("thp: use compaction in kswapd for GFP_ATOMIC
    order > 0") due to reports stating that kswapd CPU usage was higher and
    IRQs were being disabled more frequently. This was reported at
    http://www.spinics.net/linux/fedora/alsa-user/msg09885.html.

    Without this patch applied, CPU usage by kswapd hovers around the 20% mark
    according to the tester (Arthur Marsh:
    http://www.spinics.net/linux/fedora/alsa-user/msg09899.html). With this
    patch applied, it's around 2%.

    The problem is not related to THP which specifies __GFP_NO_KSWAPD but is
    triggered by high-order allocations hitting the low watermark for their
    order and waking kswapd on kernels with CONFIG_COMPACTION set. The most
    common trigger for this is network cards configured for jumbo frames but
    it's also possible it'll be triggered by fork-heavy workloads (order-1)
    and some wireless cards which depend on order-1 allocations.

    The symptoms for the user will be high CPU usage by kswapd in low-memory
    situations which could be confused with another writeback problem. While
    a patch like 5a03b051 may be reintroduced in the future, this patch plays
    it safe for now and reverts it.

    [mel@csn.ul.ie: Beefed up the changelog]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Mel Gorman
    Reported-by: Arthur Marsh
    Tested-by: Arthur Marsh
    Cc: [2.6.38.1]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

26 Feb, 2011

1 commit

  • should_continue_reclaim() for reclaim/compaction allows scanning to
    continue even if pages are not being reclaimed until the full list is
    scanned. In terms of allocation success, this makes sense but potentially
    it introduces unwanted latency for high-order allocations such as
    transparent hugepages and network jumbo frames that would prefer to fail
    the allocation attempt and fallback to order-0 pages. Worse, there is a
    potential that the full LRU scan will clear all the young bits, distort
    page aging information and potentially push pages into swap that would
    have otherwise remained resident.

    This patch will stop reclaim/compaction if no pages were reclaimed in the
    last SWAP_CLUSTER_MAX pages that were considered. For allocations such as
    hugetlbfs that use __GFP_REPEAT and have fewer fallback options, the full
    LRU list may still be scanned.

    Order-0 allocation should not be affected because RECLAIM_MODE_COMPACTION
    is not set so the following avoids the gfp_mask being examined:

    if (!(sc->reclaim_mode & RECLAIM_MODE_COMPACTION))
    return false;

    A tool was developed based on ftrace that tracked the latency of
    high-order allocations while transparent hugepage support was enabled and
    three benchmarks were run. The "fix-infinite" figures are 2.6.38-rc4 with
    Johannes's patch "vmscan: fix zone shrinking exit when scan work is done"
    applied.

    STREAM Highorder Allocation Latency Statistics
    fix-infinite break-early
    1 :: Count 10298 10229
    1 :: Min 0.4560 0.4640
    1 :: Mean 1.0589 1.0183
    1 :: Max 14.5990 11.7510
    1 :: Stddev 0.5208 0.4719
    2 :: Count 2 1
    2 :: Min 1.8610 3.7240
    2 :: Mean 3.4325 3.7240
    2 :: Max 5.0040 3.7240
    2 :: Stddev 1.5715 0.0000
    9 :: Count 111696 111694
    9 :: Min 0.5230 0.4110
    9 :: Mean 10.5831 10.5718
    9 :: Max 38.4480 43.2900
    9 :: Stddev 1.1147 1.1325

    Mean time for order-1 allocations is reduced. order-2 looks increased but
    with so few allocations, it's not particularly significant. THP mean
    allocation latency is also reduced. That said, allocation time varies so
    significantly that the reductions are within noise.

    Max allocation time is reduced by a significant amount for low-order
    allocations but reduced for THP allocations which presumably are now
    breaking before reclaim has done enough work.

    SysBench Highorder Allocation Latency Statistics
    fix-infinite break-early
    1 :: Count 15745 15677
    1 :: Min 0.4250 0.4550
    1 :: Mean 1.1023 1.0810
    1 :: Max 14.4590 10.8220
    1 :: Stddev 0.5117 0.5100
    2 :: Count 1 1
    2 :: Min 3.0040 2.1530
    2 :: Mean 3.0040 2.1530
    2 :: Max 3.0040 2.1530
    2 :: Stddev 0.0000 0.0000
    9 :: Count 2017 1931
    9 :: Min 0.4980 0.7480
    9 :: Mean 10.4717 10.3840
    9 :: Max 24.9460 26.2500
    9 :: Stddev 1.1726 1.1966

    Again, mean time for order-1 allocations is reduced while order-2
    allocations are too few to draw conclusions from. The mean time for THP
    allocations is also slightly reduced, albeit the reductions are within
    the variance.

    Once again, our maximum allocation time is significantly reduced for
    low-order allocations and slightly increased for THP allocations.

    Anon stream mmap reference Highorder Allocation Latency Statistics
    1 :: Count 1376 1790
    1 :: Min 0.4940 0.5010
    1 :: Mean 1.0289 0.9732
    1 :: Max 6.2670 4.2540
    1 :: Stddev 0.4142 0.2785
    2 :: Count 1 -
    2 :: Min 1.9060 -
    2 :: Mean 1.9060 -
    2 :: Max 1.9060 -
    2 :: Stddev 0.0000 -
    9 :: Count 11266 11257
    9 :: Min 0.4990 0.4940
    9 :: Mean 27250.4669 24256.1919
    9 :: Max 11439211.0000 6008885.0000
    9 :: Stddev 226427.4624 186298.1430

    This benchmark creates one thread per CPU which references an amount of
    anonymous memory 1.5 times the size of physical RAM. This pounds swap
    quite heavily and is intended to exercise THP a bit.

    Mean allocation time for order-1 is reduced as before. It's also reduced
    for THP allocations but the variations here are pretty massive due to
    swap. As before, maximum allocation times are significantly reduced.

    Overall, the patch reduces the mean and maximum allocation latencies for
    the smaller high-order allocations. This was with Slab configured so it
    would be expected to be more significant with Slub which uses these size
    allocations more aggressively.

    The mean allocation times for THP allocations are also slightly reduced.
    The maximum latency was slightly increased as predicted by the comments
    due to reclaim/compaction breaking early. However, workloads care more
    about the latency of lower-order allocations than THP so it's an
    acceptable trade-off.

    Signed-off-by: Mel Gorman
    Acked-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Reviewed-by: Minchan Kim
    Acked-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Cc: Michal Hocko
    Cc: Kent Overstreet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

12 Feb, 2011

1 commit

  • Commit 3e7d34497067 ("mm: vmscan: reclaim order-0 and use compaction
    instead of lumpy reclaim") introduced an indefinite loop in
    shrink_zone().

    It meant to break out of this loop when no pages had been reclaimed and
    not a single page was even scanned. The way it would detect the latter
    is by taking a snapshot of sc->nr_scanned at the beginning of the
    function and comparing it against the new sc->nr_scanned after the scan
    loop. But it would re-iterate without updating that snapshot, looping
    forever if sc->nr_scanned changed at least once since shrink_zone() was
    invoked.

    This is not the sole condition that would exit that loop, but it
    requires other processes to change the zone state, as the reclaimer that
    is stuck obviously can not anymore.

    This is only happening for higher-order allocations, where reclaim is
    run back to back with compaction.
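
    A minimal model of the fix: take the nr_scanned snapshot at the top of
    every pass rather than once on entry. Everything below (types, helper
    names, the simplistic continue condition) is illustrative; the real
    should_continue_reclaim() has several more exit conditions.

    struct sc_model { unsigned long nr_scanned, nr_reclaimed; };

    static void one_reclaim_pass(struct sc_model *sc) { (void)sc; }

    static void shrink_zone_model(struct sc_model *sc)
    {
        unsigned long scanned, reclaimed;

        do {
            scanned   = sc->nr_scanned;     /* snapshot refreshed per pass */
            reclaimed = sc->nr_reclaimed;
            one_reclaim_pass(sc);
            /* continue only if this pass reclaimed or at least scanned
             * something (simplified stand-in for should_continue_reclaim) */
        } while (sc->nr_reclaimed - reclaimed ||
                 sc->nr_scanned - scanned);
    }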

    Signed-off-by: Johannes Weiner
    Reported-by: Michal Hocko
    Tested-by: Kent Overstreet
    Reported-by: Kent Overstreet
    Acked-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

26 Jan, 2011

1 commit

  • Before 0e093d99763e ("writeback: do not sleep on the congestion queue if
    there are no congested BDIs or if significant congestion is not being
    encountered in the current zone"), preferred_zone was only used for NUMA
    statistics, to determine the zoneidx from which to allocate from given
    the type requested, and whether to utilize memory compaction.

    wait_iff_congested(), though, uses preferred_zone to determine if the
    congestion wait should be deferred because its dirty pages are backed by
    a congested bdi. This incorrectly defers the timeout and busy loops in
    the page allocator with various cond_resched() calls if preferred_zone
    is not allowed in the current context, usually consuming 100% of a cpu.

    This patch ensures preferred_zone is an allowed zone in the fastpath
    depending on whether current is constrained by its cpuset or nodes in
    its mempolicy (when the nodemask passed is non-NULL). This is correct
    since the fastpath allocation always passes ALLOC_CPUSET when trying to
    allocate memory. In the slowpath, this patch resets preferred_zone to
    the first zone of the allowed type when the allocation is not
    constrained by current's cpuset, i.e. it does not pass ALLOC_CPUSET.

    This patch also ensures preferred_zone is from the set of allowed nodes
    when called from within direct reclaim since allocations are always
    constrained by cpusets in this context (it is blockable).

    Both of these uses of cpuset_current_mems_allowed are protected by
    get_mems_allowed().

    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Wu Fengguang
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

21 Jan, 2011

1 commit