24 Feb, 2013

16 commits

  • This variable is calculated from nr_free_pagecache_pages so
    change its type to unsigned long.

    Signed-off-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • Recently, Luigi reported that there is lots of free swap space when OOM
    happens. It's easily reproduced on zram-over-swap, where many instances
    of memory hogs are running and laptop_mode is enabled. He said there was
    no problem when he disabled laptop_mode. The problem, as I investigated
    it, is as follows.

    Assumption for easy explanation: there are no page cache pages in the
    system, because they have all already been reclaimed.

    1. try_to_free_pages disables may_writepage when laptop_mode is enabled.
    2. shrink_inactive_list isolates victim pages from the inactive anon lru list.
    3. shrink_page_list adds them to the swapcache via add_to_swap but doesn't
    page them out, because sc->may_writepage is 0, so the pages are rotated back
    into the inactive anon lru list. add_to_swap made the pages dirty via
    SetPageDirty.
    4. Step 3 couldn't reclaim any pages, so do_try_to_free_pages increases the
    priority and retries reclaim with higher priority.
    5. shrink_inactive_list tries to isolate victim pages from the inactive anon
    lru list but fails, because it isolates pages in ISOLATE_CLEAN mode while
    the inactive anon lru list is full of dirty pages from step 3, so it just
    returns without any reclaim progress.
    6. do_try_to_free_pages doesn't set may_writepage due to a zero
    total_scanned: sc->nr_scanned is only increased by shrink_page_list, which
    we never call in step 5 due to the shortage of isolated pages.

    The above loop continues until OOM happens.

    The problem didn't happen before [1] was merged, because the old logic's
    isolation in shrink_inactive_list succeeded and shrink_page_list was
    called to page the pages out. That still failed because of may_writepage,
    but the important point is that sc->nr_scanned was increased even though
    we couldn't swap the pages out, so do_try_to_free_pages could set
    may_writepage.

    Since commit f80c0673610e ("mm: zone_reclaim: make isolate_lru_page()
    filter-aware") was introduced, it's no longer a good idea to depend only
    on the number of scanned pages for setting may_writepage. So this patch
    adds a new trigger point for setting may_writepage: DEF_PRIORITY - 2,
    which indicates significant memory pressure in the VM, so it fits our
    purpose well; it's better to lose power saving or responsiveness than to
    end up OOM killing.

    Signed-off-by: Minchan Kim
    Reported-by: Luigi Semenzato
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • An inactive file list is considered low when its active counterpart is
    bigger, regardless of whether it is a global zone LRU list or a memcg
    zone LRU list. The only difference is in how the LRU size is assessed.

    get_lru_size() does the right thing for both global and memcg reclaim
    situations.

    Get rid of inactive_file_is_low_global() and
    mem_cgroup_inactive_file_is_low() by using get_lru_size() and compare
    the numbers in common code.
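    As a rough sketch of the unified check (the lruvec layout and the
    get_lru_size() stub here are toy stand-ins; only the comparison itself
    follows the commit):

```c
#include <assert.h>

/* Hypothetical LRU indices and a toy lruvec; in the kernel these come
 * from enum lru_list and the real get_lru_size(). */
enum { LRU_INACTIVE_FILE, LRU_ACTIVE_FILE, NR_LRU };

struct lruvec { unsigned long size[NR_LRU]; };

static unsigned long get_lru_size(struct lruvec *l, int lru)
{
    return l->size[lru];
}

/* The common check: the inactive file list is low when its active
 * counterpart is bigger, regardless of global vs. memcg reclaim. */
static int inactive_file_is_low(struct lruvec *lruvec)
{
    unsigned long inactive = get_lru_size(lruvec, LRU_INACTIVE_FILE);
    unsigned long active  = get_lru_size(lruvec, LRU_ACTIVE_FILE);

    return active > inactive;
}
```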

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • swap_lock is heavily contended when I test swapping to 3 fast SSDs (it
    is even slightly slower than swapping to 2 such SSDs). The main
    contention comes from swap_info_get(). This patch tries to close the gap
    by adding a new per-partition lock.

    Global data like nr_swapfiles, total_swap_pages, least_priority and
    swap_list are still protected by swap_lock.

    nr_swap_pages is an atomic now, so it can be changed without swap_lock.
    In theory it's possible that get_swap_page() finds no swap pages even
    though free swap pages exist, but that doesn't sound like a big problem.

    Accessing partition specific data (like scan_swap_map and so on) is only
    protected by swap_info_struct.lock.

    Changing swap_info_struct.flags requires holding both swap_lock and
    swap_info_struct.lock, because scan_swap_map() will check it. Reading
    the flags is fine with either lock held.

    If both swap_lock and swap_info_struct.lock must be held, we always take
    the former first to avoid deadlock.
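    The lock-ordering rule can be illustrated with a toy, single-threaded
    model; the lock type, helpers and flag value below are hypothetical
    stand-ins for the real spinlocks and SWAP_WRITEOK:

```c
#include <assert.h>

/* Toy single-threaded model of the lock hierarchy: the global
 * swap_lock, then the per-partition swap_info_struct.lock. */
struct lock { int held; };

static struct lock swap_lock;

struct swap_info_struct {
    struct lock lock;
    unsigned int flags;
};

static void acquire(struct lock *l) { assert(!l->held); l->held = 1; }
static void release(struct lock *l) { assert(l->held);  l->held = 0; }

/* Changing flags takes both locks, global first: every path using the
 * same order is what rules out an ABBA deadlock. */
static void change_swap_flags(struct swap_info_struct *si, unsigned int flags)
{
    acquire(&swap_lock);
    acquire(&si->lock);
    si->flags = flags;
    release(&si->lock);
    release(&swap_lock);
}
```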

    swap_entry_free() can change swap_list. To delete that code, we add a
    new highest_priority_index. Whenever get_swap_page() is called, we
    check it. If it's valid, we use it.

    It's a pity get_swap_page() still holds swap_lock. But in practice,
    swap_lock isn't heavily contended in my tests with this patch (or, put
    differently, there are other much heavier bottlenecks like TLB flush).
    And by the way, get_swap_page() doesn't look like it really needs the
    lock: we never free swap_info[] and we check the SWAP_WRITEOK flag. The
    only risk without the lock is that we could swap out to some low-priority
    swap, but we can quickly recover after several rounds of swap, so it
    doesn't sound like a big deal to me. Still, I'd prefer to fix this if
    it's a real problem.

    "swap: make each swap partition have one address_space" improved the
    swapout speed from 1.7G/s to 2G/s. This patch further improves the
    speed to 2.3G/s, so around 15% improvement. It's a multi-process test,
    so TLB flush isn't the biggest bottleneck before the patches.

    [arnd@arndb.de: fix it for nommu]
    [hughd@google.com: add missing unlock]
    [minchan@kernel.org: get rid of lockdep whinge on sys_swapon]
    Signed-off-by: Shaohua Li
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Greg Kroah-Hartman
    Cc: Seth Jennings
    Cc: Konrad Rzeszutek Wilk
    Cc: Xiao Guangrong
    Cc: Dan Magenheimer
    Cc: Stephen Rothwell
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Hugh Dickins
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • This patch introduces PF_MEMALLOC_NOIO in the process flags (the 'flags'
    field of 'struct task_struct'), so that a task can set the flag to avoid
    doing I/O inside memory allocations made in that task's context.

    The patch tries to solve a deadlock problem caused by block devices; the
    problem may happen at least in the situations below:

    - during block device runtime resume: if a memory allocation with
    GFP_KERNEL is performed inside the runtime resume callback of any of the
    device's ancestors (or of the block device itself), a deadlock may be
    triggered inside the allocation, since it might not complete until the
    block device becomes active and the involved page I/O finishes. This
    situation was first pointed out by Alan Stern. It is not a good approach
    to convert all GFP_KERNEL[1] allocations in the path into GFP_NOIO,
    because several subsystems may be involved (for example, PCI, USB and
    SCSI may be involved for a USB mass storage device, and network devices
    are involved too in the iSCSI case)

    - during block device runtime suspend: because runtime resume needs to
    wait for the completion of a concurrent runtime suspend.

    - during error handling of a USB mass storage device: a USB bus reset
    will be performed on the device, so there shouldn't be any memory
    allocation with GFP_KERNEL during the bus reset, otherwise a deadlock
    similar to the above may be triggered. Unfortunately, any USB device may
    in theory include a mass storage interface, so this would require all USB
    interface drivers to handle the situation. In fact, most USB drivers
    don't know how to handle a bus reset on their device and don't provide
    .pre_reset() and .post_reset() callbacks at all, so the USB core has to
    unbind and rebind the driver for these devices. So it is still not
    practical to resort to GFP_NOIO to solve the problem.

    The introduced solution can also be used by the block subsystem or by
    block drivers, for example by setting the PF_MEMALLOC_NOIO flag before
    doing the actual I/O transfer.

    It is not a good idea to convert all the GFP_KERNEL allocations in the
    affected path into GFP_NOIO, because the functions doing them may be
    implemented as library code and called in many other contexts.

    In fact, memalloc_noio_flags() allows some currently static GFP_NOIO
    allocations to be converted back into GFP_KERNEL in other, non-affected
    contexts; at least almost all GFP_NOIO allocations in the USB subsystem
    can become GFP_KERNEL after applying this approach, generally making
    GFP_NOIO allocations happen only in runtime resume, bus reset and block
    I/O transfer contexts.

    [1], several GFP_KERNEL allocation examples in runtime resume path

    - pci subsystem
    acpi_os_allocate
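    A minimal sketch of the conversion helper, with toy flag values and a
    plain variable standing in for current->flags; only the idea of masking
    out __GFP_IO and __GFP_FS is taken from the patch:

```c
#include <assert.h>

#define __GFP_IO          0x40u  /* toy values for illustration only */
#define __GFP_FS          0x80u
#define GFP_NOIO          0u
#define GFP_KERNEL        (__GFP_IO | __GFP_FS)
#define PF_MEMALLOC_NOIO  0x00080000u

/* Toy stand-in for current->flags. */
static unsigned int current_flags;

/* Sketch of memalloc_noio_flags(): when the task has set
 * PF_MEMALLOC_NOIO, strip the I/O and FS bits so a GFP_KERNEL
 * allocation behaves like GFP_NOIO in this context. */
static unsigned int memalloc_noio_flags(unsigned int gfp)
{
    if (current_flags & PF_MEMALLOC_NOIO)
        gfp &= ~(__GFP_IO | __GFP_FS);
    return gfp;
}
```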

    Signed-off-by: Minchan Kim
    Cc: Alan Stern
    Cc: Oliver Neukum
    Cc: Jiri Kosina
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Greg KH
    Cc: Jens Axboe
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: David Decotigny
    Cc: Tom Herbert
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ming Lei
     
  • From: Zlatko Calusic

    Commit 92df3a723f84 ("mm: vmscan: throttle reclaim if encountering too
    many dirty pages under writeback") introduced waiting on congested zones
    based on a sane algorithm in shrink_inactive_list().

    What this means is that there's no more need for throttling and
    additional heuristics in balance_pgdat(). So, let's remove it and tidy
    up the code.

    Signed-off-by: Zlatko Calusic
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zlatko Calusic
     
  • Now we have zone->managed_pages for "pages managed by the buddy system in
    the zone", so replace zone->present_pages with zone->managed_pages where
    what the user really wants is the number of allocatable pages.

    Signed-off-by: Jiang Liu
    Cc: Wen Congyang
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Maciej Rutecki
    Cc: Chris Clayton
    Cc: "Rafael J . Wysocki"
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Jianguo Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Now that balance_pgdat() is slightly tidied up, thanks to the more
    capable pgdat_balanced(), it's become obvious that pgdat_balanced() is
    called to check the status and break the loop if the pgdat is balanced,
    only to be immediately called again. The second call is completely
    unnecessary, of course.

    The patch introduces pgdat_is_balanced boolean, which helps resolve the
    above suboptimal behavior, with the added benefit of slightly better
    documenting one other place in the function where we jump and skip lots
    of code.

    Signed-off-by: Zlatko Calusic
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zlatko Calusic
     
  • Targeted (hard or soft limit) reclaim has traditionally tried to scan one
    group with decreasing priority until nr_to_reclaim (SWAP_CLUSTER_MAX
    pages) is reclaimed or all priorities are exhausted. The reclaim is then
    retried until the limit is met.

    This approach, however, doesn't work well with deeper hierarchies, where
    groups higher in the hierarchy have no pages or only very few (this
    usually happens when those groups have no tasks and hold only pages
    re-parented after some of their children were removed). Those groups are
    reclaimed with decreasing priority pointlessly, as there is nothing to
    reclaim from them.

    The easiest fix is to break out of the memcg iteration loop in
    shrink_zone only if the whole hierarchy has been visited or sufficient
    pages have been reclaimed. This is also more natural because the
    reclaimer expects the hierarchy under the given root to be reclaimed. As
    a result we can simplify the soft limit reclaim, which does its own
    iteration.

    [yinghan@google.com: break out of the hierarchy loop only if nr_reclaimed exceeded nr_to_reclaim]
    [akpm@linux-foundation.org: use conventional comparison order]
    Signed-off-by: Michal Hocko
    Reported-by: Ying Han
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: Glauber Costa
    Cc: Li Zefan
    Signed-off-by: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • "mm: vmscan: save work scanning (almost) empty LRU lists" made
    SWAP_CLUSTER_MAX an unsigned long.

    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Satoru Moriya
    Cc: Simon Jeons
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The restart logic for when reclaim operates back to back with compaction
    is currently applied on the lruvec level. But this does not make sense,
    because the container of interest for compaction is a zone as a whole,
    not the zone pages that are part of a certain memory cgroup.

    The negative impact is bounded. For one, the code checks that the lruvec
    has enough reclaim candidates, so it does not risk getting stuck on a
    condition that cannot be fulfilled. And the unfairness of hammering on
    one particular memory cgroup to make progress in a zone is amortized by
    the round-robin manner in which reclaim goes through the memory cgroups.
    Still, this can lead to unnecessary allocation latencies when the code
    elects to restart on a hard-to-reclaim or small group when there are
    other, more reclaimable groups in the zone.

    Move this logic to the zone level and restart reclaim for all memory
    cgroups in a zone when compaction requires more free pages from it.

    [akpm@linux-foundation.org: no need for min_t]
    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Satoru Moriya
    Cc: Simon Jeons
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Reclaim pressure balance between anon and file pages is calculated
    through a tuple of numerators and a shared denominator.

    Exceptional cases that want to force-scan anon or file pages configure
    the numerators and denominator such that one list is preferred, which is
    not necessarily the most obvious way:

    fraction[0] = 1;
    fraction[1] = 0;
    denominator = 1;
    goto out;

    Make this easier by making the force-scan cases explicit, and use the
    fractions only when they are calculated from reclaim history.
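    A standalone sketch of what the explicit force-scan cases could look
    like; the enum names follow the patch's spirit, but the helper and its
    signature are hypothetical:

```c
#include <assert.h>

/* Explicit scan targets replacing the fraction[0]=1, fraction[1]=0,
 * denominator=1 encoding. */
enum scan_balance {
    SCAN_EQUAL,  /* scan both lists relative to their size */
    SCAN_FRACT,  /* scan by reclaim-history fractions */
    SCAN_ANON,   /* force-scan anon only */
    SCAN_FILE,   /* force-scan file only */
};

/* Compute the per-list scan target from an explicit case instead of
 * abusing the fraction/denominator tuple (is_file: 0 = anon list). */
static unsigned long scan_target(enum scan_balance sb, int is_file,
                                 unsigned long size,
                                 unsigned long fraction[2],
                                 unsigned long denominator)
{
    switch (sb) {
    case SCAN_EQUAL:
        return size;
    case SCAN_FRACT:
        return size * fraction[is_file] / denominator;
    case SCAN_ANON:
        return is_file ? 0 : size;
    case SCAN_FILE:
        return is_file ? size : 0;
    }
    return 0;
}
```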

    [akpm@linux-foundation.org: avoid using unintialized_var()]
    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Satoru Moriya
    Cc: Simon Jeons
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Fix comment style and elaborate on why anonymous memory is force-scanned
    when file cache runs low.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Satoru Moriya
    Cc: Simon Jeons
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • A swappiness of 0 has a slightly different meaning for global reclaim
    (may swap if file cache really low) and memory cgroup reclaim (never
    swap, ever).

    In addition, global reclaim at highest priority will scan all LRU lists
    equal to their size and ignore other balancing heuristics. UNLESS
    swappiness forbids swapping, then the lists are balanced based on recent
    reclaim effectiveness. UNLESS file cache is running low, then anonymous
    pages are force-scanned.

    This (total mess of a) behaviour is implicit and not obvious from the way
    the code is organized. At least make it apparent in the code flow and
    document the conditions. This will make it easier to come up with sane
    semantics later.
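    The implicit chain of conditions can be modelled as a small decision
    function; this is entirely a hypothetical sketch, with illustrative names
    and ordering:

```c
#include <assert.h>

enum scan_balance { SCAN_EQUAL, SCAN_FRACT, SCAN_ANON, SCAN_FILE };

/* Standalone sketch of the decision chain the patch documents:
 * memcg reclaim with swappiness 0 never swaps; global reclaim with
 * file cache really low force-scans anon; highest priority scans
 * lists equally unless swappiness forbids swap; otherwise balance
 * by recent reclaim effectiveness. */
static enum scan_balance pick_balance(int global_reclaim, int swappiness,
                                      int highest_priority,
                                      int file_cache_low)
{
    if (!global_reclaim && !swappiness)
        return SCAN_FILE;        /* memcg swappiness 0: never swap, ever */
    if (global_reclaim && file_cache_low)
        return SCAN_ANON;        /* file cache really low: may swap */
    if (highest_priority && swappiness)
        return SCAN_EQUAL;       /* desperation: ignore the heuristics */
    return SCAN_FRACT;           /* balance by reclaim history */
}
```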

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Satoru Moriya
    Reviewed-by: Michal Hocko
    Acked-by: Mel Gorman
    Cc: Hugh Dickins
    Cc: Simon Jeons
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • In certain cases (kswapd reclaim, memcg target reclaim), a fixed minimum
    amount of pages is scanned from the LRU lists on each iteration, to make
    progress.

    Do not make this minimum bigger than the respective LRU list size,
    however, and save some busy work trying to isolate and reclaim pages
    that are not there.

    Empty LRU lists are quite common with memory cgroups in NUMA
    environments because there exists a set of LRU lists for each zone for
    each memory cgroup, while the memory of a single cgroup is expected to
    stay on just one node. The number of expected empty LRU lists is thus

    memcgs * (nodes - 1) * lru types

    Each attempt to reclaim from an empty LRU list does expensive size
    comparisons between lists, acquires the zone's lru lock etc. Avoid
    that.
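    The clamp can be sketched as follows; the SWAP_CLUSTER_MAX value and the
    helper's shape are illustrative:

```c
#include <assert.h>

#define SWAP_CLUSTER_MAX 32UL  /* illustrative value */

static unsigned long min_ul(unsigned long a, unsigned long b)
{
    return a < b ? a : b;
}

/* Sketch of the fix: when the priority-scaled scan count rounds to
 * zero but a minimum is forced, clamp the minimum to the list size,
 * so an empty LRU list yields zero work instead of a futile isolation
 * attempt under the zone's lru lock. */
static unsigned long scan_count(unsigned long size, int priority,
                                int force_scan)
{
    unsigned long scan = size >> priority;

    if (!scan && force_scan)
        scan = min_ul(size, SWAP_CLUSTER_MAX);
    return scan;
}
```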

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Satoru Moriya
    Cc: Simon Jeons
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit e9868505987a ("mm, vmscan: only evict file pages when we have
    plenty") makes a point of not going for anonymous memory while there is
    still enough inactive cache around.

    The check was added only for global reclaim, but it is just as useful to
    reduce swapping in memory cgroup reclaim:

    200M-memcg-defconfig-j2

                        vanilla              patched
    Real time            454.06 ( +0.00%)     453.71 ( -0.08%)
    User time            668.57 ( +0.00%)     668.73 ( +0.02%)
    System time          128.92 ( +0.00%)     129.53 ( +0.46%)
    Swap in             1246.80 ( +0.00%)     814.40 ( -34.65%)
    Swap out            1198.90 ( +0.00%)     827.00 ( -30.99%)
    Pages allocated 16431288.10 ( +0.00%) 16434035.30 ( +0.02%)
    Major faults         681.50 ( +0.00%)     593.70 ( -12.86%)
    THP faults           237.20 ( +0.00%)     242.40 ( +2.18%)
    THP collapse         241.20 ( +0.00%)     248.50 ( +3.01%)
    THP splits           157.30 ( +0.00%)     161.40 ( +2.59%)

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc: Hugh Dickins
    Cc: Satoru Moriya
    Cc: Simon Jeons
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

04 Jan, 2013

1 commit

  • CONFIG_HOTPLUG is going away as an option. As a result, the __dev*
    markings need to be removed.

    This change removes the use of __devinit from the file.

    Based on patches originally written by Bill Pemberton, but redone by me
    in order to handle some of the coding style issues better, by hand.

    Cc: Bill Pemberton
    Cc: Andrew Morton
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Konstantin Khlebnikov
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

29 Dec, 2012

1 commit

  • An unintended consequence of commit 4ae0a48b5efc ("mm: modify
    pgdat_balanced() so that it also handles order-0") is that
    wait_iff_congested() can now be called with NULL 'struct zone *'
    producing kernel oops like this:

    BUG: unable to handle kernel NULL pointer dereference
    IP: [] wait_iff_congested+0x59/0x140

    This trivial patch fixes it.

    Reported-by: Zhouping Liu
    Reported-and-tested-by: Sedat Dilek
    Cc: Andrew Morton
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Zlatko Calusic
    Signed-off-by: Linus Torvalds

    Zlatko Calusic
     

24 Dec, 2012

1 commit


20 Dec, 2012

1 commit

  • On a 4GB RAM machine, where Normal zone is much smaller than DMA32 zone,
    the Normal zone gets fragmented in time. This requires relatively more
    pressure in balance_pgdat to get the zone above the required watermark.
    Unfortunately, the congestion_wait() call in there slows it down for a
    completely wrong reason, expecting that there's a lot of
    writeback/swapout even when there's none (the much more common case).
    After a few days, as fragmentation progresses, this flawed logic
    translates into very high CPU iowait times, even though there's no I/O
    congestion at all. If THP is enabled, the problem occurs sooner, but I
    was able to see it even on !THP kernels, just by giving it a bit more
    time to occur.

    The proper way to deal with this is to not wait, unless there's
    congestion. Thanks to Mel Gorman, we already have the function that
    perfectly fits the job. The patch was tested on a machine which nicely
    revealed the problem after only 1 day of uptime, and it's been working
    great.

    Signed-off-by: Zlatko Calusic
    Acked-by: Mel Gorman
    Signed-off-by: Linus Torvalds

    Zlatko Calusic
     

19 Dec, 2012

2 commits

  • Neil found that if too_many_isolated() returns true while performing
    direct reclaim we can end up waiting for other threads to complete their
    direct reclaim. If those threads are allowed to enter the FS or IO to
    free memory, but this thread is not, then it is possible that those
    threads will be waiting on this thread and so we get a circular deadlock.

    some task enters direct reclaim with GFP_KERNEL
    => too_many_isolated() false
    => vmscan and run into dirty pages
    => pageout()
    => take some FS lock
    => fs/block code does GFP_NOIO allocation
    => enter direct reclaim again
    => too_many_isolated() true
    => waiting for others to progress, however the other
    tasks may be circular waiting for the FS lock..

    The fix is to let !__GFP_IO and !__GFP_FS direct reclaims enjoy higher
    priority than normal ones, by lowering the throttle threshold for the
    latter.

    Allowing ~1/8 of the inactive list as isolated pages for normal reclaims
    is large enough. For example, for a 1GB LRU list, that's ~128MB of
    isolated pages, or 1k blocked tasks (each isolating 32 4KB pages), or 64
    blocked tasks per logical CPU (assuming 16 logical CPUs per NUMA node).
    So it's not likely that some CPU goes idle waiting (when it could make
    progress) because of this limit: there are many more sleeping reclaim
    tasks than CPUs, so a task may well be blocked by some low-level
    queue/lock anyway.

    Now !GFP_IOFS reclaims won't wait for GFP_IOFS reclaims to progress.
    They will be blocked only when there are too many concurrent !GFP_IOFS
    reclaims, but that's very unlikely because IO-less direct reclaim is able
    to progress much faster, and such reclaims won't deadlock each other.
    The threshold is raised high enough for them, so that there can be
    sufficient parallel progress of !GFP_IOFS reclaims.
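    The gfp-aware throttle threshold might be sketched like this; the flag
    values and the exact 1/8 shift are illustrative, the point being only
    that __GFP_IO/__GFP_FS reclaimers get the lower threshold:

```c
#include <assert.h>

#define __GFP_IO 0x40u  /* toy values for illustration only */
#define __GFP_FS 0x80u

/* Sketch of the fix: GFP_IOFS reclaimers are throttled once isolated
 * pages exceed ~1/8 of the inactive list, while !__GFP_IO/!__GFP_FS
 * reclaimers keep the full threshold and so stay ahead of the tasks
 * that may be holding the FS locks they depend on. */
static int too_many_isolated(unsigned long inactive, unsigned long isolated,
                             unsigned int gfp_mask)
{
    if ((gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
        inactive >>= 3;
    return isolated > inactive;
}
```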

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: Wu Fengguang
    Cc: Torsten Kaiser
    Tested-by: NeilBrown
    Reviewed-by: Minchan Kim
    Acked-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Comment "Why it's doing so" rather than "What it does" as proposed by
    Andrew Morton.

    Signed-off-by: Wu Fengguang
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     

13 Dec, 2012

1 commit

  • N_HIGH_MEMORY stands for nodes that have normal or high memory.
    N_MEMORY stands for nodes that have any memory.

    The code here needs to handle the nodes which have memory, so we should
    use N_MEMORY instead.

    Signed-off-by: Lai Jiangshan
    Acked-by: Hillf Danton
    Signed-off-by: Wen Congyang
    Cc: Christoph Lameter
    Cc: Hillf Danton
    Cc: Lin Feng
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     

12 Dec, 2012

3 commits

  • kswapd()->try_to_freeze() is defined to return a boolean, so it's better
    to use a bool to hold its return value.

    Signed-off-by: Jie Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Liu
     
  • If we have more inactive file pages than active file pages, we skip
    scanning the active file pages altogether, with the idea that we do not
    want to evict the working set when there is plenty of streaming IO in the
    cache.

    However, the code forgot to also skip scanning anonymous pages in that
    situation. That leads to the curious situation of keeping the active file
    pages protected from being paged out when there are lots of inactive file
    pages, while still scanning and evicting anonymous pages.

    This patch fixes that situation, by only evicting file pages when we have
    plenty of them and most are inactive.
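    The condition can be sketched in isolation; the names and the
    SWAP_CLUSTER_MAX guard are illustrative:

```c
#include <assert.h>

#define SWAP_CLUSTER_MAX 32UL  /* illustrative value */

/* Sketch of the check: with plenty of file pages, most of them
 * inactive (streaming I/O), skip anon entirely instead of swapping
 * while the active file working set is being protected anyway. */
static int scan_file_only(unsigned long inactive_file,
                          unsigned long active_file)
{
    int inactive_is_low = active_file > inactive_file;

    return !inactive_is_low && inactive_file > SWAP_CLUSTER_MAX;
}
```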

    [akpm@linux-foundation.org: adjust comment layout]
    Signed-off-by: Rik van Riel
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • We don't need custom COMPACTION_BUILD anymore, since we have handy
    IS_ENABLED().

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Minchan Kim
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

09 Dec, 2012

1 commit

  • commit c702418f8a2f ("mm: vmscan: do not keep kswapd looping forever due
    to individual uncompactable zones") removed zone watermark checks from
    the compaction code in kswapd but left in the zone congestion clearing,
    which now happens unconditionally on higher order reclaim.

    This messes up the reclaim throttling logic for zones with
    dirty/writeback pages, where zones should only lose their congestion
    status when their watermarks have been restored.

    Remove the clearing from the zone compaction section entirely. The
    preliminary zone check and the reclaim loop in kswapd will clear it if
    the zone is considered balanced.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

07 Dec, 2012

1 commit

  • When a zone meets its high watermark and is compactable in case of
    higher order allocations, it contributes to the percentage of the node's
    memory that is considered balanced.

    This requirement, that a node be only partially balanced, came about when
    kswapd was desperately trying to balance tiny zones when all bigger zones
    in the node had plenty of free memory. Arguably, the same should apply
    to compaction: if a significant part of the node is balanced enough to
    run compaction, do not get hung up on that tiny zone that might never get
    into shape.

    When the compaction logic in kswapd is reached, we know that at least
    25% of the node's memory is balanced properly for compaction (see
    zone_balanced and pgdat_balanced). Remove the individual zone checks
    that restart the kswapd cycle.

    Otherwise, we may observe more endless looping in kswapd where the
    compaction code loops back to reclaim because of a single zone and
    reclaim does nothing because the node is considered balanced overall.

    See for example

    https://bugzilla.redhat.com/show_bug.cgi?id=866988

    Signed-off-by: Johannes Weiner
    Reported-and-tested-by: Thorsten Leemhuis
    Reported-by: Jiri Slaby
    Tested-by: John Ellson
    Tested-by: Zdenek Kabelac
    Tested-by: Bruno Wolff III
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

01 Dec, 2012

1 commit

  • Kswapd does not in all places have the same criteria for a balanced
    zone. Zones are only being reclaimed when their high watermark is
    breached, but compaction checks loop over the zonelist again when the
    zone does not meet the low watermark plus two times the size of the
    allocation. This gets kswapd stuck in an endless loop over a small
    zone, like the DMA zone, where the high watermark is smaller than the
    compaction requirement.

    Add a function, zone_balanced(), that checks the watermark, and, for
    higher order allocations, if compaction has enough free memory. Then
    use it uniformly to check for balanced zones.

    This makes sure that when the compaction watermark is not met, at least
    reclaim happens and progress is made - or the zone is declared
    unreclaimable at some point and skipped entirely.
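    A toy sketch of such a zone_balanced() check; the helpers stand in for
    zone_watermark_ok() and compaction_suitable(), and the numbers are
    illustrative:

```c
#include <assert.h>

/* Stand-in for zone_watermark_ok(). */
static int watermark_ok(unsigned long free, unsigned long mark)
{
    return free >= mark;
}

/* Stand-in for the compaction check: the low watermark plus twice the
 * allocation size, as described above. */
static int compaction_suitable(unsigned long free, unsigned long low_mark,
                               int order)
{
    return free >= low_mark + (2UL << order);
}

/* One uniform balance check: the watermark, and for order > 0 the
 * compaction requirement that previously made kswapd loop on tiny
 * zones like DMA. */
static int zone_balanced(unsigned long free, unsigned long high_mark,
                         unsigned long low_mark, int order)
{
    if (!watermark_ok(free, high_mark))
        return 0;
    if (order && !compaction_suitable(free, low_mark, order))
        return 0;
    return 1;
}
```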

    Signed-off-by: Johannes Weiner
    Reported-by: George Spelvin
    Reported-by: Johannes Hirte
    Reported-by: Tomas Racek
    Tested-by: Johannes Hirte
    Reviewed-by: Rik van Riel
    Cc: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

27 Nov, 2012

1 commit

  • Commit 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC
    reserves are low and swap is backed by network storage") introduced a
    check for fatal signals after a process gets throttled for network
    storage. The intention was that if a process was throttled and got
    killed that it should not trigger the OOM killer. As pointed out by
    Minchan Kim and David Rientjes, this check is in the wrong place and too
    broad. If a system is in an OOM situation and a process is exiting, it
    can loop in __alloc_pages_slowpath(), calling direct reclaim in a loop.
    As the fatal signal is pending, it returns 1 as if it were making forward
    progress, and can effectively deadlock.

    This patch moves the fatal_signal_pending() check after throttling to
    throttle_direct_reclaim() where it belongs. If the process is killed
    while throttled, it will return immediately without direct reclaim
    except now it will have TIF_MEMDIE set and will use the PFMEMALLOC
    reserves.

    Minchan pointed out that it may be better to direct reclaim before
    returning, to avoid using the reserves, because there may be pages that
    can easily be reclaimed and that would avoid dipping into the reserves.
    However, we do no such targeted reclaim, and there is no guarantee that
    suitable pages are available. As this throttling is expected to happen
    when swap-over-NFS is used, there is a possibility that the process will
    instead swap, which may allocate network buffers from the PFMEMALLOC
    reserves. Hence, in the swap-over-NFS case, a process that can be
    throttled and killed can use the reserves to exit, or it can potentially
    use the reserves to swap a few pages and then exit. This patch takes the
    option of using the reserves if necessary to allow the process to exit
    quickly.

    If this patch passes review it should be considered a -stable candidate
    for 3.6.

    Signed-off-by: Mel Gorman
    Cc: David Rientjes
    Cc: Luigi Semenzato
    Cc: Dan Magenheimer
    Cc: KOSAKI Motohiro
    Cc: Sonny Rao
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

17 Nov, 2012

1 commit

  • Jiri Slaby reported the following:

    (It's an effective revert of "mm: vmscan: scale number of pages
    reclaimed by reclaim/compaction based on failures".) Given kswapd
    had hours of runtime in ps/top output yesterday in the morning
    and after the revert it's now 2 minutes in sum for the last 24h,
    I would say, it's gone.

    The intention of the patch in question was to compensate for the loss of
    lumpy reclaim. Part of the reason lumpy reclaim worked is because it
    aggressively reclaimed pages and this patch was meant to be a sane
    compromise.

    When compaction fails, it gets deferred and both compaction and
    reclaim/compaction are deferred to avoid excessive reclaim. However,
    since commit c654345924f7 ("mm: remove __GFP_NO_KSWAPD"), kswapd is
    woken up each time and continues reclaiming, which was not taken into
    account when the patch was developed.

    Attempts to address the problem ended up just changing the shape of the
    problem instead of fixing it. The release window gets closer and while
    a THP allocation failing is not a major problem, kswapd chewing up a lot
    of CPU is.

    This patch reverts commit 83fde0f22872 ("mm: vmscan: scale number of
    pages reclaimed by reclaim/compaction based on failures") and will be
    revisited in the future.

    Signed-off-by: Mel Gorman
    Cc: Zdenek Kabelac
    Tested-by: Valdis Kletnieks
    Cc: Jiri Slaby
    Cc: Rik van Riel
    Cc: Jiri Slaby
    Cc: Johannes Hirte
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

09 Nov, 2012

1 commit

  • In kswapd(), set current->reclaim_state to NULL before returning, as
    current->reclaim_state holds a reference to a variable on kswapd()'s
    stack.

    In rare cases, while returning from kswapd() during memory offlining,
    __free_slab() and freepages() can access the dangling pointer in
    current->reclaim_state.

    Signed-off-by: Takamori Yamaguchi
    Signed-off-by: Aaditya Kumar
    Acked-by: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Takamori Yamaguchi
     

09 Oct, 2012

6 commits

  • Presently CMA cannot migrate mlocked pages so it ends up failing to allocate
    contiguous memory space.

    This patch allows mlocked pages to be migrated out. Of course, it can
    affect realtime processes, but in the CMA use case a failed contiguous
    memory allocation is far worse than variable access latency to an
    mlocked page while CMA is running. Anyone who wants a realtime system
    shouldn't enable CMA anyway, because stalls can still happen at random
    times.

    [akpm@linux-foundation.org: tweak comment text, per Mel]
    Signed-off-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Marek Szyprowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • page_evictable(page, vma) is an irritant: almost all its callers pass
    NULL for vma. Remove the vma arg and use mlocked_vma_newpage(vma, page)
    explicitly in the couple of places it's needed. But in those places we
    don't even need page_evictable() itself! They're dealing with a freshly
    allocated anonymous page, which has no "mapping" and cannot be mlocked yet.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Michel Lespinasse
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Compaction caches if a pageblock was scanned and no pages were isolated so
    that the pageblocks can be skipped in the future to reduce scanning. This
    information is not cleared by the page allocator based on activity due to
    the impact it would have to the page allocator fast paths. Hence there is
    a requirement that something clear the cache or pageblocks will be skipped
    forever. Currently the cache is cleared if there were a number of recent
    allocation failures and it has not been cleared within the last 5 seconds.
    Time-based decisions like this are terrible as they have no relationship
    to VM activity and are basically a big hammer.

    Unfortunately, accurate heuristics would add cost to some hot paths so
    this patch implements a rough heuristic. There are two cases where the
    cache is cleared.

    1. If a !kswapd process completes a compaction cycle (migrate and free
    scanner meet), the zone is marked compact_blockskip_flush. When kswapd
    goes to sleep, it will clear the cache. This is expected to be the
    common case where the cache is cleared. It does not really matter if
    kswapd happens to be asleep or going to sleep when the flag is set as
    it will be woken on the next allocation request.

    2. If there have been multiple failures recently and compaction just
    finished being deferred then a process will clear the cache and start a
    full scan. This situation happens if there are multiple high-order
    allocation requests under heavy memory pressure.

    The clearing of the PG_migrate_skip bits and other scans is inherently
    racy but the race is harmless. For allocations that can fail such as THP,
    they will simply fail. For requests that cannot fail, they will retry the
    allocation. Tests indicated that scanning rates were roughly similar to
    when the time-based heuristic was used and the allocation success rates
    were similar.

    Signed-off-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Richard Davies
    Cc: Shaohua Li
    Cc: Avi Kivity
    Cc: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Drop clean cache pages instead of migrating them during
    alloc_contig_range(), to minimise allocation latency by reducing the
    amount of migration that is necessary. This is useful for CMA because
    the latency of migration matters more than evicting the background
    process's working set. In addition, as pages are reclaimed, fewer free
    pages are required as migration targets, so reclaiming memory just to
    get free pages is avoided; that reclaim is a contributory factor to
    increased latency.

    I measured the elapsed time of __alloc_contig_migrate_range(), which
    migrates 10M in a 40M movable zone, on a QEMU machine.

    Before - 146ms, After - 7ms

    [akpm@linux-foundation.org: fix nommu build]
    Signed-off-by: Mel Gorman
    Signed-off-by: Minchan Kim
    Reviewed-by: Mel Gorman
    Cc: Marek Szyprowski
    Acked-by: Michal Nazarewicz
    Cc: Rik van Riel
    Tested-by: Kyungmin Park
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Fix the return value when creating the kswapd kernel thread fails.
    Also, the error message is now printed at KERN_ERR priority.

    Signed-off-by: Gavin Shan
    Signed-off-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan
     
  • If allocation fails after compaction then compaction may be deferred for
    a number of allocation attempts. If there are subsequent failures,
    compact_defer_shift is increased to defer for longer periods. This
    patch uses that information to scale the number of pages reclaimed with
    compact_defer_shift until allocations succeed again. The rationale is
    that reclaiming the normal number of pages still allowed compaction to
    fail, and its success depends on the number of free pages. If it's
    still failing, reclaim more pages until it succeeds again.

    Note that this does not imply that VM reclaim is not reclaiming enough
    pages or that its logic is broken. try_to_free_pages() always asks for
    SWAP_CLUSTER_MAX pages to be reclaimed regardless of order and that is
    what it does. Direct reclaim normally stops with this check:

        if (sc->nr_reclaimed >= sc->nr_to_reclaim)
                goto out;

    should_continue_reclaim delays when that check is made until a minimum
    number of pages for reclaim/compaction are reclaimed. It is possible
    that this patch could instead set nr_to_reclaim in try_to_free_pages()
    and drive it from there, but that behaves differently and not
    necessarily for the better. If driven from do_try_to_free_pages(), it
    is also possible that priorities will rise.

    When they reach DEF_PRIORITY-2, reclaim will also start stalling and
    setting pages for immediate reclaim, which is more disruptive than
    desirable in this case. That is a more wide-reaching change that could
    cause another regression related to THP requests causing interactive
    jitter.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

18 Sep, 2012

1 commit

  • If kthread_run() fails, pgdat->kswapd contains an errno value. When we
    stop this thread, we only check whether pgdat->kswapd is NULL before
    accessing it. If it contains an errno value, dereferencing it will
    cause a page fault. Resetting pgdat->kswapd to NULL when kernel thread
    creation fails avoids this problem.

    Signed-off-by: Wen Congyang
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     

01 Aug, 2012

1 commit

  • I noticed in a multi-process parallel file reading benchmark I ran on
    an 8-socket machine that throughput slowed down by a factor of 8 when I
    ran the benchmark within a cgroup container. I traced the problem to
    the following code path (see below), hit when we are trying to reclaim
    memory from the file cache. The res_counter_uncharge function is
    called on every page that's reclaimed and creates heavy lock
    contention. The patch below allows the reclaimed pages to be uncharged
    from the resource counter in batch, and recovers the regression.

    Tim

    40.67% usemem [kernel.kallsyms] [k] _raw_spin_lock
    |
    --- _raw_spin_lock
    |
    |--92.61%-- res_counter_uncharge
    | |
    | |--100.00%-- __mem_cgroup_uncharge_common
    | | |
    | | |--100.00%-- mem_cgroup_uncharge_cache_page
    | | | __remove_mapping
    | | | shrink_page_list
    | | | shrink_inactive_list
    | | | shrink_mem_cgroup_zone
    | | | shrink_zone
    | | | do_try_to_free_pages
    | | | try_to_free_pages
    | | | __alloc_pages_nodemask
    | | | alloc_pages_current

    Signed-off-by: Tim Chen
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Johannes Weiner
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tim Chen