14 Oct, 2013

1 commit

  • commit 117aad1e9e4d97448d1df3f84b08bd65811e6d6a upstream.

    Isolated balloon pages can wrongly end up in LRU lists when
    migrate_pages() finishes its round without draining all the isolated
    page list.

    The same issue can happen when reclaim_clean_pages_from_list() tries to
    reclaim pages from an isolated page list, before migration, in the CMA
    path. Such a balloon page leak opens a race window against the LRU list
    shrinkers and leads to the following kernel panic:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
    IP: [] shrink_page_list+0x24e/0x897
    PGD 3cda2067 PUD 3d713067 PMD 0
    Oops: 0000 [#1] SMP
    CPU: 0 PID: 340 Comm: kswapd0 Not tainted 3.12.0-rc1-22626-g4367597 #87
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    RIP: shrink_page_list+0x24e/0x897
    RSP: 0000:ffff88003da499b8 EFLAGS: 00010286
    RAX: 0000000000000000 RBX: ffff88003e82bd60 RCX: 00000000000657d5
    RDX: 0000000000000000 RSI: 000000000000031f RDI: ffff88003e82bd40
    RBP: ffff88003da49ab0 R08: 0000000000000001 R09: 0000000081121a45
    R10: ffffffff81121a45 R11: ffff88003c4a9a28 R12: ffff88003e82bd40
    R13: ffff88003da0e800 R14: 0000000000000001 R15: ffff88003da49d58
    FS: 0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000067d9000 CR3: 000000003ace5000 CR4: 00000000000407b0
    Call Trace:
    shrink_inactive_list+0x240/0x3de
    shrink_lruvec+0x3e0/0x566
    __shrink_zone+0x94/0x178
    shrink_zone+0x3a/0x82
    balance_pgdat+0x32a/0x4c2
    kswapd+0x2f0/0x372
    kthread+0xa2/0xaa
    ret_from_fork+0x7c/0xb0
    Code: 80 7d 8f 01 48 83 95 68 ff ff ff 00 4c 89 e7 e8 5a 7b 00 00 48 85 c0 49 89 c5 75 08 80 7d 8f 00 74 3e eb 31 48 8b 80 18 01 00 00 8b 74 0d 48 8b 78 30 be 02 00 00 00 ff d2 eb
    RIP [] shrink_page_list+0x24e/0x897
    RSP
    CR2: 0000000000000028
    ---[ end trace 703d2451af6ffbfd ]---
    Kernel panic - not syncing: Fatal exception

    This patch fixes the issue by ensuring the proper tests are made at
    putback_movable_pages() and reclaim_clean_pages_from_list(), so that
    isolated balloon pages are not wrongly reinserted into LRU lists.
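
    A minimal sketch of the kind of check this implies in putback_movable_pages()
    (hedged: the helper names follow the balloon-compaction API of that era and
    may not match the upstream patch exactly):

        list_for_each_entry_safe(page, page2, page_list, lru) {
            list_del(&page->lru);
            dec_zone_page_state(page, NR_ISOLATED_ANON +
                                page_is_file_cache(page));
            if (unlikely(isolated_balloon_page(page)))
                balloon_page_putback(page);   /* never back onto an LRU list */
            else
                putback_lru_page(page);
        }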

    [akpm@linux-foundation.org: clarify awkward comment text]
    Signed-off-by: Rafael Aquini
    Reported-by: Luiz Capitulino
    Tested-by: Luiz Capitulino
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Rafael Aquini
     

30 Apr, 2013

3 commits

  • In page reclaim, a huge page is split and split_huge_page() adds the tail
    pages to the LRU list. Since we are reclaiming a huge page, it's better to
    reclaim all subpages of the huge page instead of just the head page. This
    patch adds the split tail pages to the shrink page list so that the tail
    pages can be reclaimed soon.

    Before this patch, run a swap workload:
    thp_fault_alloc 3492
    thp_fault_fallback 608
    thp_collapse_alloc 6
    thp_collapse_alloc_failed 0
    thp_split 916

    With this patch:
    thp_fault_alloc 4085
    thp_fault_fallback 16
    thp_collapse_alloc 90
    thp_collapse_alloc_failed 0
    thp_split 1272

    fallback allocation is reduced a lot.
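
    A hedged sketch of the idea in shrink_page_list(): when anonymous pages are
    added to swap, a THP may get split, and the tail pages are handed straight
    to the shrink list rather than the LRU (add_to_swap() growing a list
    parameter follows the description above; details may differ):

        if (PageAnon(page) && !PageSwapCache(page)) {
            if (!(sc->gfp_mask & __GFP_IO))
                goto keep_locked;
            /* may split a THP; the split tail pages land on page_list */
            if (!add_to_swap(page, page_list))
                goto activate_locked;
            may_enter_fs = 1;
        }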

    [akpm@linux-foundation.org: fix CONFIG_SWAP=n build]
    Signed-off-by: Shaohua Li
    Acked-by: Rik van Riel
    Acked-by: Minchan Kim
    Acked-by: Johannes Weiner
    Reviewed-by: Wanpeng Li
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • With this patch, userland applications that want to maintain the
    interactivity/memory allocation cost can use the pressure level
    notifications. The levels are defined like this:

    The "low" level means that the system is reclaiming memory for new
    allocations. Monitoring this reclaiming activity might be useful for
    maintaining cache levels. Upon notification, the program (typically an
    "Activity Manager") might analyze vmstat and act in advance (e.g.
    prematurely shut down unimportant services).

    The "medium" level means that the system is experiencing medium memory
    pressure: the system might be swapping, paging out active file caches,
    etc. Upon this event, applications may decide to further analyze
    vmstat/zoneinfo/memcg or internal memory usage statistics and free any
    resources that can be easily reconstructed or re-read from disk.

    The "critical" level means that the system is actively thrashing, is
    about to run out of memory (OOM), or the in-kernel OOM killer is already
    on its way to trigger. Applications should do whatever they can to help
    the system. It might be too late to consult vmstat or any other
    statistics, so it's advisable to take immediate action.
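
    For reference, a hedged userspace sketch of registering for these
    notifications through the eventfd-based cgroup interface (the cgroup mount
    path is only an example, and error handling is omitted):

        #include <sys/eventfd.h>
        #include <fcntl.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        int main(void)
        {
            int efd = eventfd(0, 0);
            int pfd = open("/sys/fs/cgroup/memory/memory.pressure_level", O_RDONLY);
            int cfd = open("/sys/fs/cgroup/memory/cgroup.event_control", O_WRONLY);
            char buf[64];
            uint64_t cnt;

            /* "<event_fd> <pressure_level_fd> <level>" registers the listener */
            snprintf(buf, sizeof(buf), "%d %d low", efd, pfd);
            write(cfd, buf, strlen(buf));

            read(efd, &cnt, sizeof(cnt));   /* blocks until a "low" event fires */
            printf("memory pressure event(s): %llu\n", (unsigned long long)cnt);
            return 0;
        }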

    The events are propagated upward until the event is handled, i.e. the
    events are not pass-through. Here is what this means: suppose you have
    three cgroups, A->B->C, and you set up an event listener on cgroups A, B
    and C; now suppose group C experiences some pressure. In this situation,
    only group C will receive the notification, i.e. groups A and B will not
    receive it. This is done to avoid excessive "broadcasting" of messages,
    which disturbs the system and which is especially bad if we are low on
    memory or thrashing. So, organize the cgroups wisely, or propagate the
    events manually (or ask us to implement pass-through events, explaining
    why you would need them).

    Performance-wise, the memory pressure notification feature itself is
    lightweight and does not require much bookkeeping, in contrast to the
    rest of memcg's features. Unfortunately, in the current memcg
    implementation, page accounting is an inseparable part and cannot be
    turned off. The good news is that there are some efforts[1] to improve
    the situation; plus, implementing the same, fully API-compatible[2]
    interface for the CONFIG_MEMCG=n case (e.g. embedded) is also a viable
    option, so it will not require any changes on the userland side.

    [1] http://permalink.gmane.org/gmane.linux.kernel.cgroups/6291
    [2] http://lkml.org/lkml/2013/2/21/454

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix CONFIG_CGROUPS=n warnings]
    Signed-off-by: Anton Vorontsov
    Acked-by: Kirill A. Shutemov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Tejun Heo
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Mel Gorman
    Cc: Glauber Costa
    Cc: Michal Hocko
    Cc: Luiz Capitulino
    Cc: Greg Thelen
    Cc: Leonid Moiseichuk
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: Bartlomiej Zolnierkiewicz
    Cc: John Stultz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Vorontsov
     
  • Local variable total_scanned is no longer used.

    Signed-off-by: Hillf Danton
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     

18 Apr, 2013

1 commit

  • Fix the error return value in kswapd_run(). The bug was introduced by
    commit d5dc0ad928fb ("mm/vmscan: fix error number for failed kthread").

    Signed-off-by: Xishi Qiu
    Reviewed-by: Wanpeng Li
    Reviewed-by: Rik van Riel
    Reported-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     

24 Feb, 2013

16 commits

  • This variable is calculated from nr_free_pagecache_pages so
    change its type to unsigned long.

    Signed-off-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • Recently, Luigi reported that there is lots of free swap space when OOM
    happens. It's easily reproduced on zram-over-swap, where many instances
    of memory hogs are running and laptop_mode is enabled. He said there
    was no problem when he disabled laptop_mode. My investigation found
    that the problem unfolds as follows.

    Assumption for easy explanation: there are no page cache pages in the
    system because they have all already been reclaimed.

    1. try_to_free_pages disables may_writepage when laptop_mode is enabled.
    2. shrink_inactive_list isolates victim pages from the inactive anon LRU list.
    3. shrink_page_list adds them to the swapcache via add_to_swap, but it doesn't
    page them out because sc->may_writepage is 0, so the pages are rotated back
    onto the inactive anon LRU list. add_to_swap marked the pages dirty via
    SetPageDirty.
    4. Step 3 couldn't reclaim any pages, so do_try_to_free_pages increases the
    priority and retries reclaim with a higher priority.
    5. shrink_inactive_list tries to isolate victim pages from the inactive anon
    LRU list but fails, because it isolates pages in ISOLATE_CLEAN mode while the
    inactive anon LRU list is full of dirty pages from step 3, so it just returns
    without any reclaim progress.
    6. do_try_to_free_pages doesn't set may_writepage due to zero total_scanned,
    because sc->nr_scanned is increased by shrink_page_list but we don't call
    shrink_page_list in step 5 due to the shortage of isolated pages.

    The above loop continues until OOM happens.

    The problem didn't happen before [1] was merged because the old logic's
    isolation in shrink_inactive_list succeeded and shrink_page_list was then
    called to page the pages out; that still failed to page them out because
    of may_writepage, but the important point is that sc->nr_scanned was
    increased even though we couldn't swap them out, so do_try_to_free_pages
    could set may_writepage.

    Since commit f80c0673610e ("mm: zone_reclaim: make isolate_lru_page()
    filter-aware") was introduced, it's no longer a good idea to depend only
    on the number of scanned pages for setting may_writepage. So this patch
    adds a new trigger point for setting may_writepage: once the priority
    drops below DEF_PRIORITY - 2, which indicates significant memory pressure
    in the VM, it is a better fit for our purpose to lose power saving (or a
    quiet disk) than to OOM kill.
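
    A minimal sketch of that trigger in do_try_to_free_pages() (hedged: shown as
    the shape of the change, not the verbatim patch):

        do {
            shrink_zones(zonelist, sc);

            /*
             * If we're getting into trouble reclaiming, start doing
             * writepage even in laptop mode.
             */
            if (sc->priority < DEF_PRIORITY - 2)
                sc->may_writepage = 1;

            /* ... remaining per-priority work elided ... */
        } while (--sc->priority >= 0);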

    Signed-off-by: Minchan Kim
    Reported-by: Luigi Semenzato
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • An inactive file list is considered low when its active counterpart is
    bigger, regardless of whether it is a global zone LRU list or a memcg
    zone LRU list. The only difference is in how the LRU size is assessed.

    get_lru_size() does the right thing for both global and memcg reclaim
    situations.

    Get rid of inactive_file_is_low_global() and
    mem_cgroup_inactive_file_is_low() by using get_lru_size() and comparing
    the numbers in common code.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • swap_lock is heavily contended when I test swapping to 3 fast SSDs (it is
    even slightly slower than swapping to 2 such SSDs). The main contention
    comes from swap_info_get(). This patch tries to close the gap by adding a
    new per-partition lock.

    Global data like nr_swapfiles, total_swap_pages, least_priority and
    swap_list are still protected by swap_lock.

    nr_swap_pages is an atomic now, so it can be changed without swap_lock.
    In theory, it's possible that get_swap_page() finds no swap pages while
    there actually are free swap pages, but that doesn't sound like a big
    problem.

    Accessing partition specific data (like scan_swap_map and so on) is only
    protected by swap_info_struct.lock.

    Changing swap_info_struct.flags needs to hold both swap_lock and
    swap_info_struct.lock, because scan_swap_map() will check it. Reading the
    flags is OK with either lock held.

    If both swap_lock and swap_info_struct.lock must be held, we always take
    the former first to avoid deadlock.
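
    A hedged sketch of the resulting lock ordering when both locks are needed
    ("si" being a struct swap_info_struct pointer; the per-partition lock is the
    new addition):

        spin_lock(&swap_lock);      /* global: swap_list, priorities, ... */
        spin_lock(&si->lock);       /* per-partition: scan_swap_map(), flags */
        si->flags |= SWP_WRITEOK;   /* e.g. flag changes need both locks */
        spin_unlock(&si->lock);
        spin_unlock(&swap_lock);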

    swap_entry_free() can change swap_list. To delete that code, we add a
    new highest_priority_index. Whenever get_swap_page() is called, we check
    it; if it's valid, we use it.

    It's a pity that get_swap_page() still holds swap_lock. But in practice,
    swap_lock isn't heavily contended in my test with this patch (or I can
    say there are other, much heavier bottlenecks like TLB flush). And BTW,
    it looks like get_swap_page() doesn't really need the lock: we never free
    swap_info[] and we check the SWAP_WRITEOK flag. The only risk without the
    lock is that we could swap out to some low-priority swap device, but we
    can quickly recover after several rounds of swap, so it doesn't sound
    like a big deal to me. But I'd prefer to fix this if it's a real problem.

    "swap: make each swap partition have one address_space" improved the
    swapout speed from 1.7GB/s to 2GB/s. This patch further improves the
    speed to 2.3GB/s, around a 15% improvement. It's a multi-process test,
    so TLB flush isn't the biggest bottleneck before the patches.

    [arnd@arndb.de: fix it for nommu]
    [hughd@google.com: add missing unlock]
    [minchan@kernel.org: get rid of lockdep whinge on sys_swapon]
    Signed-off-by: Shaohua Li
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Greg Kroah-Hartman
    Cc: Seth Jennings
    Cc: Konrad Rzeszutek Wilk
    Cc: Xiao Guangrong
    Cc: Dan Magenheimer
    Cc: Stephen Rothwell
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Hugh Dickins
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • This patch introduces PF_MEMALLOC_NOIO as a process flag (in the 'flags'
    field of 'struct task_struct'), so that the flag can be set by a task to
    avoid doing I/O inside memory allocation in that task's context.

    The patch tries to solve a deadlock problem caused by block devices; the
    problem may happen at least in the following situations:

    - during block device runtime resume: if a memory allocation with
    GFP_KERNEL is made inside the runtime resume callback of any of the
    device's ancestors (or the block device itself), the deadlock may be
    triggered inside the memory allocation, since it might not complete until
    the block device becomes active and the involved page I/O finishes. The
    situation was first pointed out by Alan Stern. It is not a good approach
    to convert every GFP_KERNEL[1] allocation in the path into GFP_NOIO,
    because several subsystems may be involved (for example, PCI, USB and
    SCSI may be involved for a USB mass storage device, and network devices
    are involved too in the iSCSI case).

    - during block device runtime suspend, because runtime resume needs to
    wait for the completion of a concurrent runtime suspend.

    - during error handling of a USB mass storage device, a USB bus reset
    will be performed on the device, so there shouldn't be any memory
    allocation with GFP_KERNEL during the USB bus reset, otherwise a deadlock
    similar to the above may be triggered. Unfortunately, any USB device may
    in theory include a mass storage interface, so this would require all USB
    interface drivers to handle the situation. In fact, most USB drivers
    don't know how to handle a bus reset on the device and don't provide
    .pre_reset() and .post_reset() callbacks at all, so the USB core has to
    unbind and rebind the driver for these devices. So it is still not
    practical to resort to GFP_NOIO for solving the problem.

    The introduced solution can also be used by the block subsystem or block
    drivers, for example by setting the PF_MEMALLOC_NOIO flag before doing
    the actual I/O transfer.

    It is not a good idea to convert all these GFP_KERNEL allocations in the
    affected path into GFP_NOIO, because the functions doing the allocations
    may be implemented as library code and called from many other contexts.

    In fact, memalloc_noio_flags() can convert some of the current static
    GFP_NOIO allocations back into GFP_KERNEL in other, non-affected
    contexts; at least almost all GFP_NOIO allocations in the USB subsystem
    can be converted into GFP_KERNEL after applying this approach, making
    GFP_NOIO allocations generally happen only in runtime resume, bus reset
    and block I/O transfer contexts.
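
    For reference, a hedged sketch of the helpers this approach relies on
    (modelled on the sched.h additions; exact placement may differ):

        static inline unsigned int memalloc_noio_flags(unsigned int flags)
        {
            if (unlikely(current->flags & PF_MEMALLOC_NOIO))
                flags &= ~(__GFP_IO | __GFP_FS);
            return flags;
        }

        static inline unsigned int memalloc_noio_save(void)
        {
            unsigned int flags = current->flags & PF_MEMALLOC_NOIO;
            current->flags |= PF_MEMALLOC_NOIO;
            return flags;
        }

        static inline void memalloc_noio_restore(unsigned int flags)
        {
            current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
        }

    A driver would then bracket its runtime resume path with
    memalloc_noio_save()/memalloc_noio_restore(), and the allocator side would
    filter gfp_mask through memalloc_noio_flags().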

    [1] Several GFP_KERNEL allocation examples in the runtime resume path:

    - pci subsystem
    acpi_os_allocate

    Signed-off-by: Ming Lei
    Cc: Alan Stern
    Cc: Oliver Neukum
    Cc: Jiri Kosina
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "Rafael J. Wysocki"
    Cc: Greg KH
    Cc: Jens Axboe
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: David Decotigny
    Cc: Tom Herbert
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ming Lei
     
  • From: Zlatko Calusic

    Commit 92df3a723f84 ("mm: vmscan: throttle reclaim if encountering too
    many dirty pages under writeback") introduced waiting on congested zones
    based on a sane algorithm in shrink_inactive_list().

    What this means is that there's no more need for throttling and
    additional heuristics in balance_pgdat(). So, let's remove it and tidy
    up the code.

    Signed-off-by: Zlatko Calusic
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zlatko Calusic
     
  • Now we have zone->managed_pages for "pages managed by the buddy system
    in the zone", so replace zone->present_pages with zone->managed_pages
    where what the user really wants is the number of allocatable pages.

    Signed-off-by: Jiang Liu
    Cc: Wen Congyang
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Maciej Rutecki
    Cc: Chris Clayton
    Cc: "Rafael J . Wysocki"
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Jianguo Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • Now that balance_pgdat() is slightly tidied up, thanks to the more
    capable pgdat_balanced(), it's become obvious that pgdat_balanced() is
    called to check the status, then the loop is broken if pgdat is balanced,
    just for pgdat_balanced() to be immediately called again. The second call
    is completely unnecessary, of course.

    The patch introduces a pgdat_is_balanced boolean, which helps resolve the
    above suboptimal behavior, with the added benefit of slightly better
    documenting one other place in the function where we jump and skip lots
    of code.

    Signed-off-by: Zlatko Calusic
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zlatko Calusic
     
  • Targeted (hard or soft limit) reclaim has traditionally tried to scan one
    group with decreasing priority until nr_to_reclaim (SWAP_CLUSTER_MAX
    pages) is reclaimed or all priorities are exhausted. The reclaim is
    then retried until the limit is met.

    This approach, however, doesn't work well with deeper hierarchies where
    groups higher in the hierarchy do not have any, or have only very few,
    pages (this usually happens if those groups do not have any tasks and
    they have only re-parented pages after some of their children were
    removed). Those groups are pointlessly reclaimed with decreasing
    priority, as there is nothing to reclaim from them.

    The easiest fix is to break out of the memcg iteration loop in
    shrink_zone only if the whole hierarchy has been visited or sufficient
    pages have been reclaimed. This is also more natural because the
    reclaimer expects that the hierarchy under the given root is reclaimed.
    As a result we can simplify the soft limit reclaim which does its own
    iteration.
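
    A hedged sketch of the resulting loop shape in shrink_zone(), using the
    mem_cgroup_iter() API of that era (the break condition follows the
    changelog; details may differ):

        memcg = mem_cgroup_iter(root, NULL, &reclaim);
        do {
            struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);

            shrink_lruvec(lruvec, sc);

            /*
             * Limit/direct reclaim can stop once enough pages were
             * reclaimed; global reclaim keeps going so the whole
             * hierarchy is aged equally.
             */
            if (!global_reclaim(sc) &&
                sc->nr_reclaimed >= sc->nr_to_reclaim) {
                mem_cgroup_iter_break(root, memcg);
                break;
            }
            memcg = mem_cgroup_iter(root, memcg, &reclaim);
        } while (memcg);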

    [yinghan@google.com: break out of the hierarchy loop only if nr_reclaimed exceeded nr_to_reclaim]
    [akpm@linux-foundation.org: use conventional comparison order]
    Signed-off-by: Michal Hocko
    Reported-by: Ying Han
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: Glauber Costa
    Cc: Li Zefan
    Signed-off-by: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • "mm: vmscan: save work scanning (almost) empty LRU lists" made
    SWAP_CLUSTER_MAX an unsigned long.

    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Satoru Moriya
    Cc: Simon Jeons
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The restart logic for when reclaim operates back to back with compaction
    is currently applied on the lruvec level. But this does not make sense,
    because the container of interest for compaction is a zone as a whole,
    not the zone pages that are part of a certain memory cgroup.

    Negative impact is bounded. For one, the code checks that the lruvec
    has enough reclaim candidates, so it does not risk getting stuck on a
    condition that cannot be fulfilled. And the unfairness of hammering on
    one particular memory cgroup to make progress in a zone will be
    amortized by the round robin manner in which reclaim goes through the
    memory cgroups. Still, this can lead to unnecessary allocation
    latencies when the code elects to restart on a hard to reclaim or small
    group when there are other, more reclaimable groups in the zone.

    Move this logic to the zone level and restart reclaim for all memory
    cgroups in a zone when compaction requires more free pages from it.

    [akpm@linux-foundation.org: no need for min_t]
    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Satoru Moriya
    Cc: Simon Jeons
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Reclaim pressure balance between anon and file pages is calculated
    through a tuple of numerators and a shared denominator.

    Exceptional cases that want to force-scan anon or file pages configure
    the numerators and denominator such that one list is preferred, which is
    not necessarily the most obvious way:

    fraction[0] = 1;
    fraction[1] = 0;
    denominator = 1;
    goto out;

    Make this easier by making the force-scan cases explicit and using the
    fractions only when they are calculated from reclaim history.
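
    A hedged sketch of what making the cases explicit can look like: an enum of
    scan targets consulted at the end of get_scan_count() (naming is
    illustrative):

        enum scan_balance {
            SCAN_EQUAL,   /* scan both lists relative to their size */
            SCAN_FRACT,   /* apply the computed fraction[]/denominator */
            SCAN_ANON,    /* force-scan anonymous pages only */
            SCAN_FILE,    /* force-scan file pages only */
        };

        /* e.g. instead of fraction[0] = 1; fraction[1] = 0; denominator = 1; */
        scan_balance = SCAN_ANON;
        goto out;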

    [akpm@linux-foundation.org: avoid using uninitialized_var()]
    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Satoru Moriya
    Cc: Simon Jeons
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Fix comment style and elaborate on why anonymous memory is force-scanned
    when file cache runs low.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Satoru Moriya
    Cc: Simon Jeons
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • A swappiness of 0 has a slightly different meaning for global reclaim
    (it may swap if the file cache is really low) and memory cgroup reclaim
    (never swap, ever).

    In addition, global reclaim at highest priority will scan all LRU lists
    equal to their size and ignore other balancing heuristics. UNLESS
    swappiness forbids swapping, then the lists are balanced based on recent
    reclaim effectiveness. UNLESS file cache is running low, then anonymous
    pages are force-scanned.

    This (total mess of a) behaviour is implicit and not obvious from the
    way the code is organized. At least make it apparent in the code flow
    and document the conditions. That will make it easier to come up with
    sane semantics later.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Satoru Moriya
    Reviewed-by: Michal Hocko
    Acked-by: Mel Gorman
    Cc: Hugh Dickins
    Cc: Simon Jeons
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • In certain cases (kswapd reclaim, memcg target reclaim), a fixed minimum
    amount of pages is scanned from the LRU lists on each iteration, to make
    progress.

    Do not make this minimum bigger than the respective LRU list size,
    however, and save some busy work trying to isolate and reclaim pages
    that are not there.

    Empty LRU lists are quite common with memory cgroups in NUMA
    environments because there exists a set of LRU lists for each zone for
    each memory cgroup, while the memory of a single cgroup is expected to
    stay on just one node. The number of expected empty LRU lists is thus

    memcgs * (nodes - 1) * lru types

    Each attempt to reclaim from an empty LRU list does expensive size
    comparisons between lists, acquires the zone's lru lock etc. Avoid
    that.
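
    A minimal sketch of the clamping described above, inside get_scan_count()
    (hedged; close to the shape of the change):

        size = get_lru_size(lruvec, lru);
        scan = size >> sc->priority;
        if (!scan && force_scan)
            scan = min(size, SWAP_CLUSTER_MAX);   /* never more than the list holds */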

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Satoru Moriya
    Cc: Simon Jeons
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit e9868505987a ("mm, vmscan: only evict file pages when we have
    plenty") makes a point of not going for anonymous memory while there is
    still enough inactive cache around.

    The check was added only for global reclaim, but it is just as useful to
    reduce swapping in memory cgroup reclaim:

    200M-memcg-defconfig-j2

    vanilla patched
    Real time 454.06 ( +0.00%) 453.71 ( -0.08%)
    User time 668.57 ( +0.00%) 668.73 ( +0.02%)
    System time 128.92 ( +0.00%) 129.53 ( +0.46%)
    Swap in 1246.80 ( +0.00%) 814.40 ( -34.65%)
    Swap out 1198.90 ( +0.00%) 827.00 ( -30.99%)
    Pages allocated 16431288.10 ( +0.00%) 16434035.30 ( +0.02%)
    Major faults 681.50 ( +0.00%) 593.70 ( -12.86%)
    THP faults 237.20 ( +0.00%) 242.40 ( +2.18%)
    THP collapse 241.20 ( +0.00%) 248.50 ( +3.01%)
    THP splits 157.30 ( +0.00%) 161.40 ( +2.59%)

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc: Hugh Dickins
    Cc: Satoru Moriya
    Cc: Simon Jeons
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

04 Jan, 2013

1 commit

  • CONFIG_HOTPLUG is going away as an option. As a result, the __dev*
    markings need to be removed.

    This change removes the use of __devinit from the file.

    Based on patches originally written by Bill Pemberton, but redone by me
    in order to handle some of the coding style issues better, by hand.

    Cc: Bill Pemberton
    Cc: Andrew Morton
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Konstantin Khlebnikov
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

29 Dec, 2012

1 commit

  • An unintended consequence of commit 4ae0a48b5efc ("mm: modify
    pgdat_balanced() so that it also handles order-0") is that
    wait_iff_congested() can now be called with a NULL 'struct zone *',
    producing a kernel oops like this:

    BUG: unable to handle kernel NULL pointer dereference
    IP: [] wait_iff_congested+0x59/0x140

    This trivial patch fixes it.

    Reported-by: Zhouping Liu
    Reported-and-tested-by: Sedat Dilek
    Cc: Andrew Morton
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Zlatko Calusic
    Signed-off-by: Linus Torvalds

    Zlatko Calusic
     

24 Dec, 2012

1 commit


20 Dec, 2012

1 commit

  • On a 4GB RAM machine, where Normal zone is much smaller than DMA32 zone,
    the Normal zone gets fragmented in time. This requires relatively more
    pressure in balance_pgdat to get the zone above the required watermark.
    Unfortunately, the congestion_wait() call in there slows it down for a
    completely wrong reason, expecting that there's a lot of
    writeback/swapout, even when there's none (much more common). After a
    few days, when fragmentation progresses, this flawed logic translates to
    very high CPU iowait times, even though there's no I/O congestion at
    all. If THP is enabled, the problem occurs sooner, but I was able to
    see it even on !THP kernels, just by giving it a bit more time to occur.

    The proper way to deal with this is to not wait, unless there's
    congestion. Thanks to Mel Gorman, we already have the function that
    perfectly fits the job. The patch was tested on a machine which nicely
    revealed the problem after only 1 day of uptime, and it's been working
    great.
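
    A hedged sketch of the substitution in balance_pgdat() (wait_iff_congested()
    is the existing helper referred to above; arguments shown per its signature
    at the time):

        /* before: sleeps unconditionally, inflating iowait with no I/O pending */
        congestion_wait(BLK_RW_ASYNC, HZ/10);

        /* after: sleeps only if the zone's backing devices are congested */
        wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);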

    Signed-off-by: Zlatko Calusic
    Acked-by: Mel Gorman
    Signed-off-by: Linus Torvalds

    Zlatko Calusic
     

19 Dec, 2012

2 commits

  • Neil found that if too_many_isolated() returns true while performing
    direct reclaim we can end up waiting for other threads to complete their
    direct reclaim. If those threads are allowed to enter the FS or IO to
    free memory, but this thread is not, then it is possible that those
    threads will be waiting on this thread and so we get a circular deadlock.

    some task enters direct reclaim with GFP_KERNEL
    => too_many_isolated() false
    => vmscan and run into dirty pages
    => pageout()
    => take some FS lock
    => fs/block code does GFP_NOIO allocation
    => enter direct reclaim again
    => too_many_isolated() true
    => waiting for others to progress, however the other
    tasks may be circular waiting for the FS lock..

    The fix is to let !__GFP_IO and !__GFP_FS direct reclaims enjoy higher
    priority than normal ones, by lowering the throttle threshold for the
    latter.

    Allowing ~1/8 of the pages to be isolated for normal reclaims is large
    enough. For example, for a 1GB LRU list, that's ~128MB of isolated pages,
    or 1k blocked tasks (each isolating 32 4KB pages), or 64 blocked tasks per
    logical CPU (assuming 16 logical CPUs per NUMA node). So it's not likely
    that some CPU goes idle waiting (when it could make progress) because of
    this limit: there are many more sleeping reclaim tasks than the number of
    CPUs, so a task may well be blocked by some low-level queue/lock anyway.

    Now !GFP_IOFS reclaims won't be waiting for GFP_IOFS reclaims to progress.
    They will be blocked only when there are too many concurrent !GFP_IOFS
    reclaims, but that's very unlikely because the IO-less direct reclaims are
    able to progress much faster, and they won't deadlock with each other.
    The threshold is raised high enough for them, so that there can be
    sufficient parallel progress of !GFP_IOFS reclaims.
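
    A hedged sketch of the resulting check in too_many_isolated() (close to the
    shape of the fix; GFP_IOFS is __GFP_IO | __GFP_FS):

        static int too_many_isolated(struct zone *zone, int file,
                                     struct scan_control *sc)
        {
            unsigned long inactive, isolated;

            if (current_is_kswapd())
                return 0;

            if (!global_reclaim(sc))
                return 0;

            if (file) {
                inactive = zone_page_state(zone, NR_INACTIVE_FILE);
                isolated = zone_page_state(zone, NR_ISOLATED_FILE);
            } else {
                inactive = zone_page_state(zone, NR_INACTIVE_ANON);
                isolated = zone_page_state(zone, NR_ISOLATED_ANON);
            }

            /*
             * GFP_NOIO/GFP_NOFS callers are allowed to isolate more
             * pages, so they won't get blocked by normal direct
             * reclaimers, forming the circular deadlock above.
             */
            if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS)
                inactive >>= 3;

            return isolated > inactive;
        }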

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: Wu Fengguang
    Cc: Torsten Kaiser
    Tested-by: NeilBrown
    Reviewed-by: Minchan Kim
    Acked-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     
  • Comment "Why it's doing so" rather than "What it does" as proposed by
    Andrew Morton.

    Signed-off-by: Wu Fengguang
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fengguang Wu
     

13 Dec, 2012

1 commit

  • N_HIGH_MEMORY stands for the nodes that have normal or high memory.
    N_MEMORY stands for the nodes that have any memory.

    The code here needs to handle the nodes which have memory, so we should
    use N_MEMORY instead.
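
    A one-line illustration of the substitution (hedged; do_something_with() is
    a placeholder for the per-node work):

        for_each_node_state(nid, N_MEMORY)   /* was: N_HIGH_MEMORY */
            do_something_with(nid);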

    Signed-off-by: Lai Jiangshan
    Acked-by: Hillf Danton
    Signed-off-by: Wen Congyang
    Cc: Christoph Lameter
    Cc: Hillf Danton
    Cc: Lin Feng
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     

12 Dec, 2012

3 commits

  • kswapd()->try_to_freeze() is defined to return a boolean, so it's better
    to use a bool to hold its return value.

    Signed-off-by: Jie Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeff Liu
     
  • If we have more inactive file pages than active file pages, we skip
    scanning the active file pages altogether, with the idea that we do not
    want to evict the working set when there is plenty of streaming IO in the
    cache.

    However, the code forgot to also skip scanning anonymous pages in that
    situation. That leads to the curious situation of keeping the active file
    pages protected from being paged out when there are lots of inactive file
    pages, while still scanning and evicting anonymous pages.

    This patch fixes that situation, by only evicting file pages when we have
    plenty of them and most are inactive.
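
    A rough sketch of the added shortcut in get_scan_count(), written in the
    fraction[]/denominator idiom quoted elsewhere in this log (hedged: this is
    the gist, not the verbatim patch):

        /*
         * There is enough inactive page cache: do not reclaim anything
         * from the anonymous working set right now.
         */
        if (global_reclaim(sc) && !inactive_file_is_low(lruvec)) {
            fraction[0] = 0;    /* anon */
            fraction[1] = 1;    /* file */
            denominator = 1;
            goto out;
        }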

    [akpm@linux-foundation.org: adjust comment layout]
    Signed-off-by: Rik van Riel
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • We don't need custom COMPACTION_BUILD anymore, since we have handy
    IS_ENABLED().
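
    For example (a hedged one-liner, call site illustrative):

        if (IS_ENABLED(CONFIG_COMPACTION) && order)   /* was: COMPACTION_BUILD && order */
            compact_pgdat(pgdat, order);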

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Minchan Kim
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

09 Dec, 2012

1 commit

  • commit c702418f8a2f ("mm: vmscan: do not keep kswapd looping forever due
    to individual uncompactable zones") removed zone watermark checks from
    the compaction code in kswapd but left in the zone congestion clearing,
    which now happens unconditionally on higher order reclaim.

    This messes up the reclaim throttling logic for zones with
    dirty/writeback pages, where zones should only lose their congestion
    status when their watermarks have been restored.

    Remove the clearing from the zone compaction section entirely. The
    preliminary zone check and the reclaim loop in kswapd will clear it if
    the zone is considered balanced.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

07 Dec, 2012

1 commit

  • When a zone meets its high watermark and is compactable in case of
    higher order allocations, it contributes to the percentage of the node's
    memory that is considered balanced.

    This requirement, that a node be only partially balanced, came about
    when kswapd was desperately trying to balance tiny zones when all bigger
    zones in the node had plenty of free memory. Arguably, the same should
    apply to compaction: if a significant part of the node is balanced
    enough to run compaction, do not get hung up on that tiny zone that
    might never get in shape.

    When the compaction logic in kswapd is reached, we know that at least
    25% of the node's memory is balanced properly for compaction (see
    zone_balanced and pgdat_balanced). Remove the individual zone checks
    that restart the kswapd cycle.

    Otherwise, we may observe more endless looping in kswapd where the
    compaction code loops back to reclaim because of a single zone and
    reclaim does nothing because the node is considered balanced overall.

    See for example

    https://bugzilla.redhat.com/show_bug.cgi?id=866988

    Signed-off-by: Johannes Weiner
    Reported-and-tested-by: Thorsten Leemhuis
    Reported-by: Jiri Slaby
    Tested-by: John Ellson
    Tested-by: Zdenek Kabelac
    Tested-by: Bruno Wolff III
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

01 Dec, 2012

1 commit

  • Kswapd does not in all places have the same criteria for a balanced
    zone. Zones are only being reclaimed when their high watermark is
    breached, but compaction checks loop over the zonelist again when the
    zone does not meet the low watermark plus two times the size of the
    allocation. This gets kswapd stuck in an endless loop over a small
    zone, like the DMA zone, where the high watermark is smaller than the
    compaction requirement.

    Add a function, zone_balanced(), that checks the watermark, and, for
    higher order allocations, if compaction has enough free memory. Then
    use it uniformly to check for balanced zones.

    This makes sure that when the compaction watermark is not met, at least
    reclaim happens and progress is made - or the zone is declared
    unreclaimable at some point and skipped entirely.
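
    A hedged sketch of a zone_balanced() helper along the lines described (a
    watermark check plus, for order > 0, a compaction-suitability check; not
    necessarily identical to the upstream version):

        static bool zone_balanced(struct zone *zone, int order,
                                  unsigned long balance_gap, int classzone_idx)
        {
            if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone) +
                                        balance_gap, classzone_idx, 0))
                return false;

            if (IS_ENABLED(CONFIG_COMPACTION) && order &&
                compaction_suitable(zone, order) == COMPACT_SKIPPED)
                return false;

            return true;
        }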

    Signed-off-by: Johannes Weiner
    Reported-by: George Spelvin
    Reported-by: Johannes Hirte
    Reported-by: Tomas Racek
    Tested-by: Johannes Hirte
    Reviewed-by: Rik van Riel
    Cc: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

27 Nov, 2012

1 commit

  • Commit 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC
    reserves are low and swap is backed by network storage") introduced a
    check for fatal signals after a process gets throttled for network
    storage. The intention was that if a process was throttled and got
    killed that it should not trigger the OOM killer. As pointed out by
    Minchan Kim and David Rientjes, this check is in the wrong place and too
    broad. If a system is in an OOM situation and a process is exiting, it
    can loop in __alloc_pages_slowpath(), calling direct reclaim in a loop.
    As the fatal signal is pending, it returns 1 as if it were making forward
    progress, and can effectively deadlock.

    This patch moves the fatal_signal_pending() check after throttling to
    throttle_direct_reclaim() where it belongs. If the process is killed
    while throttled, it will return immediately without direct reclaim
    except now it will have TIF_MEMDIE set and will use the PFMEMALLOC
    reserves.

    Minchan pointed out that it may be better to direct reclaim before
    returning, to avoid using the reserves, because there may be pages that
    can easily be reclaimed and that would avoid dipping into the reserves.
    However, we do no such targeted reclaim, and there is no guarantee that
    suitable pages are available. As this throttling is expected to happen
    when swap-over-NFS is used, there is a possibility that the process will
    instead swap, which may allocate network buffers from the PFMEMALLOC
    reserves. Hence, in the swap-over-NFS case where a process can be
    throttled and killed, it can either use the reserves to exit or
    potentially use the reserves to swap a few pages and then exit. This
    patch takes the option of using the reserves if necessary to allow the
    process to exit quickly.
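
    A hedged sketch of where the check ends up: throttle_direct_reclaim()
    returns true when the caller should skip direct reclaim and use its
    reserves to exit (the throttling details are elided):

        static bool throttle_direct_reclaim(gfp_t gfp_mask,
                                            struct zonelist *zonelist,
                                            nodemask_t *nodemask)
        {
            /* ... pick pgdat, throttle on pfmemalloc_wait ... */
            wait_event_killable(pgdat->pfmemalloc_wait,
                                pfmemalloc_watermark_ok(pgdat));

            /* killed while throttled: bail out, exit on the reserves */
            if (fatal_signal_pending(current))
                return true;

            return false;
        }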

    If this patch passes review it should be considered a -stable candidate
    for 3.6.

    Signed-off-by: Mel Gorman
    Cc: David Rientjes
    Cc: Luigi Semenzato
    Cc: Dan Magenheimer
    Cc: KOSAKI Motohiro
    Cc: Sonny Rao
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

17 Nov, 2012

1 commit

  • Jiri Slaby reported the following:

    (It's an effective revert of "mm: vmscan: scale number of pages
    reclaimed by reclaim/compaction based on failures".) Given kswapd
    had hours of runtime in ps/top output yesterday in the morning
    and after the revert it's now 2 minutes in sum for the last 24h,
    I would say, it's gone.

    The intention of the patch in question was to compensate for the loss of
    lumpy reclaim. Part of the reason lumpy reclaim worked is because it
    aggressively reclaimed pages and this patch was meant to be a sane
    compromise.

    When compaction fails, it gets deferred, and both compaction and
    reclaim/compaction are deferred to avoid excessive reclaim. However,
    since commit c654345924f7 ("mm: remove __GFP_NO_KSWAPD"), kswapd is woken
    up each time and continues reclaiming, which was not taken into account
    when the patch was developed.

    Attempts to address the problem ended up just changing the shape of the
    problem instead of fixing it. The release window gets closer and while
    a THP allocation failing is not a major problem, kswapd chewing up a lot
    of CPU is.

    This patch reverts commit 83fde0f22872 ("mm: vmscan: scale number of
    pages reclaimed by reclaim/compaction based on failures") and will be
    revisited in the future.

    Signed-off-by: Mel Gorman
    Cc: Zdenek Kabelac
    Tested-by: Valdis Kletnieks
    Cc: Jiri Slaby
    Cc: Rik van Riel
    Cc: Jiri Slaby
    Cc: Johannes Hirte
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

09 Nov, 2012

1 commit

  • In kswapd(), set current->reclaim_state to NULL before returning, as
    current->reclaim_state holds reference to variable on kswapd()'s stack.

    In rare cases, while returning from kswapd() during memory offlining,
    __free_slab() and freepages() can access the dangling pointer of
    current->reclaim_state.

    Signed-off-by: Takamori Yamaguchi
    Signed-off-by: Aaditya Kumar
    Acked-by: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Takamori Yamaguchi
     

09 Oct, 2012

3 commits

  • Presently CMA cannot migrate mlocked pages, so it ends up failing to
    allocate contiguous memory space.

    This patch makes mlocked pages migratable. Of course, it can affect
    realtime processes, but in the CMA use case, failing a contiguous memory
    allocation is far worse than variable access latency to an mlocked page
    while CMA is running. If someone wants to make the system realtime, they
    shouldn't enable CMA, because stalls can still happen at random times.
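
    A hedged sketch of the mechanism: CMA's isolation passes an extra isolation
    mode bit so that __isolate_lru_page() no longer refuses unevictable
    (mlocked) pages (flag name per the change of that time; details may
    differ):

        /* CMA caller opts in: */
        isolate_mode_t mode = ISOLATE_UNEVICTABLE;

        /* in __isolate_lru_page(): */
        if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE))
            return -EBUSY;   /* skip mlocked pages unless the caller opted in */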

    [akpm@linux-foundation.org: tweak comment text, per Mel]
    Signed-off-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Marek Szyprowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • page_evictable(page, vma) is an irritant: almost all its callers pass
    NULL for vma. Remove the vma arg and use mlocked_vma_newpage(vma, page)
    explicitly in the couple of places it's needed. But in those places we
    don't even need page_evictable() itself! They're dealing with a freshly
    allocated anonymous page, which has no "mapping" and cannot be mlocked yet.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Michel Lespinasse
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Compaction caches whether a pageblock was scanned and no pages were
    isolated, so that such pageblocks can be skipped in the future to reduce
    scanning. This information is not cleared by the page allocator based on
    activity, due to the impact it would have on the page allocator fast
    paths. Hence there is a requirement that something clear the cache, or
    pageblocks will be skipped forever. Currently the cache is cleared if
    there were a number of recent allocation failures and it has not been
    cleared within the last 5 seconds. Time-based decisions like this are
    terrible as they have no relationship to VM activity and are basically a
    big hammer.

    Unfortunately, accurate heuristics would add cost to some hot paths so
    this patch implements a rough heuristic. There are two cases where the
    cache is cleared.

    1. If a !kswapd process completes a compaction cycle (the migrate and
    free scanners meet), the zone is marked compact_blockskip_flush. When
    kswapd goes to sleep, it will clear the cache. This is expected to be
    the common case where the cache is cleared. It does not really matter
    whether kswapd happens to be asleep or going to sleep when the flag is
    set, as it will be woken on the next allocation request.

    2. If there have been multiple failures recently and compaction just
    finished being deferred then a process will clear the cache and start a
    full scan. This situation happens if there are multiple high-order
    allocation requests under heavy memory pressure.

    The clearing of the PG_migrate_skip bits and other scans is inherently
    racy, but the race is harmless. Allocations that can fail, such as THP,
    will simply fail. Requests that cannot fail will retry the allocation.
    Tests indicated that scanning rates were roughly similar to when the
    time-based heuristic was used, and the allocation success rates were
    similar.
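
    A hedged sketch of case 1, in kswapd's go-to-sleep path (the helper is the
    compaction-side reset routine; the exact call site may differ):

        /*
         * Compaction records pageblocks it failed to isolate pages from
         * and skips them in future scans.  Since kswapd is going to sleep,
         * assume allocations and compaction may now succeed and reset the
         * per-zone cache for the whole node.
         */
        reset_isolation_suitable(pgdat);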

    Signed-off-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Richard Davies
    Cc: Shaohua Li
    Cc: Avi Kivity
    Cc: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman