01 Aug, 2012

40 commits

  • With CONFIG_SPARSEMEM_EXTREME, the two levels of memory section
    descriptors are allocated from slab or bootmem. When allocating from
    slab, let the slab/bootmem allocator clear the memory chunk; we needn't
    clear it explicitly.

    Signed-off-by: Gavin Shan
    Reviewed-by: Michal Hocko
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan
     
  • Add a mem_cgroup_from_css() helper to replace open-coded invocations of
    container_of(), clarifying the code and adding a little more type safety.
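
    A minimal sketch of such a helper (assuming the usual layout where the
    css is embedded in struct mem_cgroup; the exact body in the patch may
    differ):

        static inline struct mem_cgroup *
        mem_cgroup_from_css(struct cgroup_subsys_state *s)
        {
                return container_of(s, struct mem_cgroup, css);
        }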

    [akpm@linux-foundation.org: fix extensive breakage]
    Signed-off-by: Wanpeng Li
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Gavin Shan
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • The may_enter_fs test turns out to be too restrictive: though I saw no
    problem with it when testing on 3.5-rc6, it very soon OOMed when I tested
    on 3.5-rc6-mm1. I don't know what the difference there is, perhaps I just
    slightly changed the way I started off the testing: dd if=/dev/zero
    of=/mnt/temp bs=1M count=1024; rm -f /mnt/temp; sync, repeated in a 20M
    memory.limit_in_bytes cgroup, writing to ext4 on a USB stick.

    ext4 (and gfs2 and xfs) turn out to allocate new pages for writing with
    AOP_FLAG_NOFS: that seems a little worrying, and it's unclear to me why
    the transaction needs to be started even before allocating pagecache
    memory. But it may not be worth worrying about these days: if direct
    reclaim avoids FS writeback, does __GFP_FS now mean anything?

    Anyway, we insisted on the may_enter_fs test to avoid hangs with the loop
    device; but since that also masks off __GFP_IO, we can test for __GFP_IO
    directly, ignoring may_enter_fs and __GFP_FS.

    But even so, the test still OOMs sometimes: when originally testing on
    3.5-rc6, it OOMed about one time in five or ten; when testing just now on
    3.5-rc6-mm1, it OOMed on the first iteration.

    This residual problem comes from an accumulation of pages under ordinary
    writeback, not marked PageReclaim, so rightly not causing the memcg check
    to wait on their writeback: these too can prevent shrink_page_list() from
    freeing any pages, so many times that memcg reclaim fails and OOMs.

    Deal with these in the same way as direct reclaim now deals with dirty FS
    pages: mark them PageReclaim. It is appropriate to rotate these to the
    tail of the list when writepage completes, but more importantly, the
    PageReclaim flag makes memcg reclaim wait on them if encountered again.
    Increment NR_VMSCAN_IMMEDIATE? That's arguable: I chose not to.

    Setting PageReclaim here may occasionally race with end_page_writeback()
    clearing it: lru_deactivate_fn() already faced the same race, and
    correctly concluded that the window is small and the issue non-critical.

    With these changes, the test runs indefinitely without OOMing on ext4,
    ext3 and ext2: I'll move on to test with other filesystems later.

    Trivia: invert conditions for a clearer block without an else, and goto
    keep_locked to do the unlock_page.
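
    For reference, a condensed sketch of the resulting writeback test (not
    the exact code, just the shape described above):

        if (PageWriteback(page)) {
                if (global_reclaim(sc) || !PageReclaim(page) ||
                    !(sc->gfp_mask & __GFP_IO)) {
                        /* may race with end_page_writeback(); harmless */
                        SetPageReclaim(page);
                        nr_writeback++;
                        goto keep_locked;
                }
                wait_on_page_writeback(page);
        }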

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Fengguang Wu
    Acked-by: Michal Hocko
    Cc: Dave Chinner
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The current implementation of dirty page throttling is not memcg aware,
    which makes it easy to have memcg LRUs full of dirty pages. Without
    throttling, these LRUs can be scanned faster than the rate of writeback,
    leading to memcg OOM conditions when the hard limit is small.

    This patch fixes the problem by throttling the allocating process
    (possibly a writer) during hard limit reclaim, by waiting on PageReclaim
    pages. We wait only for PageReclaim pages because those are the pages
    that have made one full round over the LRU, which means that writeback is
    much slower than scanning.

    The solution is far from ideal - the long-term solution is memcg-aware
    dirty throttling - but it is meant as a band-aid until we have a real
    fix. We are seeing this happen during nightly backups which are placed
    into containers to prevent eviction of the real working set.

    The change affects only memcg reclaim, and only when we encounter
    PageReclaim pages, which is a signal that reclaim is not keeping up with
    the writers and somebody should be throttled. This could potentially be
    unfair because somebody else from the group may get throttled on behalf
    of the writer, but as writers need to allocate as well, and allocate at a
    higher rate, the probability that only innocent processes would be
    penalized is not that high.
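
    A minimal sketch of the throttling point described above (memcg hard
    limit reclaim only; simplified from the actual shrink_page_list() hunk):

        if (PageWriteback(page)) {
                if (!global_reclaim(sc) && PageReclaim(page) && may_enter_fs)
                        wait_on_page_writeback(page);   /* throttle here */
                else
                        goto keep_locked;
        }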

    I have tested this change with a simple dd copying /dev/zero to tmpfs or
    ext3 running under a small memcg (a 1G copy under 5M, 60M, 300M and 2G
    containers), and dd got killed by the OOM killer every time. With the
    patch I could run a dd of the same size under the 5M controller without
    any OOM. The issue is more visible with slower output devices.

    * With the patch
    ================
    * tmpfs size=2G
    ---------------
    $ vim cgroup_cache_oom_test.sh
    $ ./cgroup_cache_oom_test.sh 5M
    using Limit 5M for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 30.4049 s, 34.5 MB/s
    $ ./cgroup_cache_oom_test.sh 60M
    using Limit 60M for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 31.4561 s, 33.3 MB/s
    $ ./cgroup_cache_oom_test.sh 300M
    using Limit 300M for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 20.4618 s, 51.2 MB/s
    $ ./cgroup_cache_oom_test.sh 2G
    using Limit 2G for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 1.42172 s, 738 MB/s

    * ext3
    ------
    $ ./cgroup_cache_oom_test.sh 5M
    using Limit 5M for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 27.9547 s, 37.5 MB/s
    $ ./cgroup_cache_oom_test.sh 60M
    using Limit 60M for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 30.3221 s, 34.6 MB/s
    $ ./cgroup_cache_oom_test.sh 300M
    using Limit 300M for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 24.5764 s, 42.7 MB/s
    $ ./cgroup_cache_oom_test.sh 2G
    using Limit 2G for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 3.35828 s, 312 MB/s

    * Without the patch
    ===================
    * tmpfs size=2G
    ---------------
    $ ./cgroup_cache_oom_test.sh 5M
    using Limit 5M for group
    ./cgroup_cache_oom_test.sh: line 46: 4668 Killed dd if=/dev/zero of=$OUT/zero bs=1M count=$count
    $ ./cgroup_cache_oom_test.sh 60M
    using Limit 60M for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 25.4989 s, 41.1 MB/s
    $ ./cgroup_cache_oom_test.sh 300M
    using Limit 300M for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 24.3928 s, 43.0 MB/s
    $ ./cgroup_cache_oom_test.sh 2G
    using Limit 2G for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 1.49797 s, 700 MB/s

    * ext3
    ------
    $ ./cgroup_cache_oom_test.sh 5M
    using Limit 5M for group
    ./cgroup_cache_oom_test.sh: line 46: 4689 Killed dd if=/dev/zero of=$OUT/zero bs=1M count=$count
    $ ./cgroup_cache_oom_test.sh 60M
    using Limit 60M for group
    ./cgroup_cache_oom_test.sh: line 46: 4692 Killed dd if=/dev/zero of=$OUT/zero bs=1M count=$count
    $ ./cgroup_cache_oom_test.sh 300M
    using Limit 300M for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 20.248 s, 51.8 MB/s
    $ ./cgroup_cache_oom_test.sh 2G
    using Limit 2G for group
    1000+0 records in
    1000+0 records out
    1048576000 bytes (1.0 GB) copied, 2.85201 s, 368 MB/s

    [akpm@linux-foundation.org: tweak changelog, reordered the test to optimize for CONFIG_CGROUP_MEM_RES_CTLR=n]
    [hughd@google.com: fix deadlock with loop driver]
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Ying Han
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Reviewed-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Fengguang Wu
    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • mmu_notifier_release() is called when the process is exiting. It will
    delete all the mmu notifiers. But at this time the page belonging to the
    process is still present in the page tables and on the LRU list, so this
    race can happen:

    CPU 0                                CPU 1
    mmu_notifier_release:                try_to_unmap:
      hlist_del_init_rcu(&mn->hlist);
                                           ptep_clear_flush_notify:
                                             mmu notifier not found
                                           free page  !!!!!!
                                           /*
                                            * At this point, the page has been
                                            * freed, but it is still mapped in
                                            * the secondary MMU.
                                            */
      mn->ops->release(mn, mm);

    Then the box is not stable and sometimes we can get this bug:

    [ 738.075923] BUG: Bad page state in process migrate-perf pfn:03bec
    [ 738.075931] page:ffffea00000efb00 count:0 mapcount:0 mapping: (null) index:0x8076
    [ 738.075936] page flags: 0x20000000000014(referenced|dirty)

    The same issue is present in mmu_notifier_unregister().

    We can call ->release before deleting the notifier to ensure the page has
    been unmapped from the secondary MMU before it is freed.
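
    The essence of the ordering change, as a sketch (simplified; the real
    function also has to cope with notifiers unregistering themselves):

        /* invoke ->release while the notifier is still on the list ... */
        if (mn->ops->release)
                mn->ops->release(mn, mm);
        /* ... and only then unhook it */
        hlist_del_init_rcu(&mn->hlist);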

    Signed-off-by: Xiao Guangrong
    Cc: Avi Kivity
    Cc: Marcelo Tosatti
    Cc: Paul Gortmaker
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • shmem knows for sure that the page is in swap cache when attempting to
    charge a page, because the cache charge entry function has a check for it.
    Only anon pages may be removed from swap cache already when trying to
    charge their swapin.

    Adjust the comment, though: '4969c11 mm: fix swapin race condition' added
    a stable PageSwapCache check under the page lock in do_swap_page() before
    calling the memory controller, so it's unuse_pte()'s pte_same() that may
    fail.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Only anon and shmem pages in the swap cache may be charged multiple
    times, from every swap pte fault or from shmem_unuse(). No other pages
    require the PageCgroupUsed() check.

    Charging pages in the swap cache is also serialized by the page lock, and
    since both the try_charge and commit_charge are called under the same page
    lock section, the PageCgroupUsed() check might as well happen before the
    counter charging, let alone reclaim.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When shmem is charged upon swapin, it does not need to check twice whether
    the memory controller is enabled.

    Also, shmem pages do not have to be checked for everything that regular
    anon pages have to be checked for, so let shmem use the internal version
    directly and allow future patches to move around checks that are only
    required when swapping in anon pages.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • It does not matter to __mem_cgroup_try_charge() if the passed mm is NULL
    or init_mm, it will charge the root memcg in either case.

    Also fix up the comment in __mem_cgroup_try_charge() that claimed the
    init_mm would be charged when no mm was passed. It's not really
    incorrect, but confusing. Clarify that the root memcg is charged in this
    case.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • shmem page charges have not needed a separate charge type to tell them
    from regular file pages since 08e552c ("memcg: synchronized LRU").

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Charging cache pages may require swapin in the shmem case. Save the
    forward declaration and just move the swapin functions above the cache
    charging functions.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Only anon pages that are uncharged at the time of the last page table
    mapping vanishing may be in swapcache.

    When shmem pages, file pages, swap-freed anon pages, or just migrated
    pages are uncharged, they are known for sure to be not in swapcache.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Not all uncharge paths need to check if the page is swapcache, some of
    them can know for sure.

    Push down the check into all callsites of uncharge_common() so that the
    patch that removes some of them is more obvious.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The conditional mem_cgroup_cancel_charge_swapin() is a leftover from when
    the function would continue to reestablish the page even after
    mem_cgroup_try_charge_swapin() failed. After 85d9fc8 "memcg: fix refcnt
    handling at swapoff", the condition is always true when this code is
    reached.

    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Compaction (and page migration in general) can currently be hindered
    through pages being owned by memory cgroups that are at their limits and
    unreclaimable.

    The reason is that the replacement page is being charged against the limit
    while the page being replaced is also still charged. But this seems
    unnecessary, given that only one of the two pages will still be in use
    after migration finishes.

    This patch changes the memcg migration sequence so that the replacement
    page is not charged. Whatever page is still in use after successful or
    failed migration gets to keep the charge of the page that was going to be
    replaced.

    The replacement page will still show up temporarily in the rss/cache
    statistics; this can be fixed in a later patch as it's less urgent.

    Reported-by: David Rientjes
    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit b3a27d ("swap: Add swap slot free callback to
    block_device_operations") dereferences p->bdev->bd_disk but this is a NULL
    dereference if using swap-over-NFS. This patch checks SWP_BLKDEV on the
    swap_info_struct before dereferencing.

    With reference to this callback, Christoph Hellwig stated "Please just
    remove the callback entirely. It has no user outside the staging tree and
    was added clearly against the rules for that staging tree". This would
    also be my preference but there was not an obvious way of keeping zram in
    staging/ happy.
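
    A sketch of the guarded callback invocation described above (assuming the
    existing swap_slot_free_notify hook in block_device_operations):

        if (p->flags & SWP_BLKDEV) {
                struct gendisk *disk = p->bdev->bd_disk;

                if (disk->fops->swap_slot_free_notify)
                        disk->fops->swap_slot_free_notify(p->bdev, offset);
        }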

    Signed-off-by: Xiaotian Feng
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Christoph Hellwig
    Cc: David S. Miller
    Cc: Eric B Munson
    Cc: Eric Paris
    Cc: James Morris
    Cc: Mel Gorman
    Cc: Mike Christie
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: Trond Myklebust
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The patch "mm: add support for a filesystem to activate swap files and use
    direct_IO for writing swap pages" added support for using direct_IO to
    write swap pages but it is insufficient for highmem pages.

    To support highmem pages, this patch kmap()s the page before calling the
    direct_IO() handler. As direct_IO deals with virtual addresses, an
    additional helper is necessary for get_kernel_pages() to look up the
    struct page for a kmap virtual address.
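
    Roughly, the write path becomes the following (a sketch only; the
    KERNEL_WRITE rw flag and the kiocb setup are approximations of what the
    series uses, not verbatim code):

        struct iovec iov = {
                .iov_base = kmap(page),         /* highmem-safe address */
                .iov_len  = PAGE_SIZE,
        };

        ret = mapping->a_ops->direct_IO(KERNEL_WRITE, &kiocb, &iov,
                                        page_file_offset(page), 1);
        kunmap(page);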

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Christoph Hellwig
    Cc: David S. Miller
    Cc: Eric B Munson
    Cc: Eric Paris
    Cc: James Morris
    Cc: Mel Gorman
    Cc: Mike Christie
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: Trond Myklebust
    Cc: Xiaotian Feng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The version of swap_activate introduced is sufficient for swap-over-NFS
    but would not provide enough information to implement a generic handler.
    This patch shuffles things slightly to ensure the same information is
    available for aops->swap_activate() as is available to the core.

    No functionality change.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Christoph Hellwig
    Cc: David S. Miller
    Cc: Eric B Munson
    Cc: Eric Paris
    Cc: James Morris
    Cc: Mel Gorman
    Cc: Mike Christie
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: Trond Myklebust
    Cc: Xiaotian Feng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently swapfiles are managed entirely by the core VM by using ->bmap to
    allocate space and write to the blocks directly. This effectively ensures
    that the underlying blocks are allocated and avoids the need for the swap
    subsystem to locate what physical blocks store offsets within a file.

    If the swap subsystem is to use the filesystem information to locate the
    blocks, it is critical that information such as block groups, block
    bitmaps and the block descriptor table that map the swap file be resident
    in memory. This patch adds address_space_operations that the VM can call
    when activating or deactivating swap backed by a file.

    int swap_activate(struct file *);
    int swap_deactivate(struct file *);

    The ->swap_activate() method is used to communicate to the file that the
    VM relies on it, and the address_space should take adequate measures such
    as reserving space in the underlying device, reserving memory for mempools
    and pinning information such as the block descriptor table in memory. The
    ->swap_deactivate() method is called on sys_swapoff() if ->swap_activate()
    returned success.

    After a successful ->swap_activate(), the swapfile is marked SWP_FILE and
    swapper_space.a_ops will proxy to sis->swap_file->f_mapping->a_ops, using
    ->direct_IO to write swapcache pages and ->readpage to read them.

    It is perfectly possible that direct_IO could be used to read the swap
    pages as well, but that is an unnecessary complication. Similarly,
    ->writepage could be used instead of direct_IO to write the pages, but
    filesystem developers have stated that calling writepage from the VM is
    undesirable for a variety of reasons, and using direct_IO opens up the
    possibility of writing back batches of swap pages in the future.
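
    A hypothetical filesystem would wire this up roughly as follows, using
    the prototypes quoted above (the myfs_* names are made up for
    illustration; the final signatures may differ):

        static int myfs_swap_activate(struct file *file)
        {
                /* pin block mapping metadata, reserve mempools, etc. */
                return 0;
        }

        static int myfs_swap_deactivate(struct file *file)
        {
                /* undo whatever swap_activate reserved */
                return 0;
        }

        static const struct address_space_operations myfs_aops = {
                /* ... .readpage, .direct_IO, ... */
                .swap_activate   = myfs_swap_activate,
                .swap_deactivate = myfs_swap_deactivate,
        };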

    [a.p.zijlstra@chello.nl: Original patch]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Christoph Hellwig
    Cc: David S. Miller
    Cc: Eric B Munson
    Cc: Eric Paris
    Cc: James Morris
    Cc: Mel Gorman
    Cc: Mike Christie
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: Trond Myklebust
    Cc: Xiaotian Feng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patch adds two new APIs get_kernel_pages() and get_kernel_page() that
    may be used to pin a vector of kernel addresses for IO. The initial user
    is expected to be NFS for allowing pages to be written to swap using
    aops->direct_IO(). Strictly speaking, swap-over-NFS only needs to pin one
    page for IO but it makes sense to express the API in terms of a vector and
    add a helper for pinning single pages.
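
    Expected usage looks roughly like this (a sketch; 'buffer' is a
    hypothetical kernel buffer and error handling is trimmed):

        struct kvec kiov = {
                .iov_base = buffer,             /* a kernel virtual address */
                .iov_len  = PAGE_SIZE,
        };
        struct page *page;

        if (get_kernel_pages(&kiov, 1, 0, &page) == 1) {
                /* ... submit IO against 'page' ... */
                put_page(page);
        }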

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Christoph Hellwig
    Cc: David S. Miller
    Cc: Eric B Munson
    Cc: Eric Paris
    Cc: James Morris
    Cc: Mel Gorman
    Cc: Mike Christie
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: Trond Myklebust
    Cc: Xiaotian Feng
    Cc: Mark Salter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • In order to teach filesystems to handle swap cache pages, three new page
    functions are introduced:

    pgoff_t page_file_index(struct page *);
    loff_t page_file_offset(struct page *);
    struct address_space *page_file_mapping(struct page *);

    page_file_index() - gives the offset of this page in the file in
    PAGE_CACHE_SIZE blocks. Like page->index for mapped pages, this function
    also gives the correct index for PG_swapcache pages.

    page_file_offset() - uses page_file_index(), so that it will give the
    expected result, even for PG_swapcache pages.

    page_file_mapping() - gives the mapping backing the actual page; that is
    for swap cache pages it will give swap_file->f_mapping.
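
    Sketched implementations of two of them (hedged: the __page_file_mapping()
    and __page_file_index() helpers that resolve the swap mapping and offset
    are assumed to be provided by the swap code):

        static inline struct address_space *page_file_mapping(struct page *page)
        {
                if (unlikely(PageSwapCache(page)))
                        return __page_file_mapping(page);
                return page->mapping;
        }

        static inline pgoff_t page_file_index(struct page *page)
        {
                if (unlikely(PageSwapCache(page)))
                        return __page_file_index(page);
                return page->index;
        }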

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Christoph Hellwig
    Cc: David S. Miller
    Cc: Eric B Munson
    Cc: Eric Paris
    Cc: James Morris
    Cc: Mel Gorman
    Cc: Mike Christie
    Cc: Neil Brown
    Cc: Sebastian Andrzej Siewior
    Cc: Trond Myklebust
    Cc: Xiaotian Feng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Under significant pressure when writing back to network-backed storage,
    direct reclaimers may get throttled. This is expected to be a short-lived
    event and the processes get woken up again but processes do get stalled.
    This patch counts how many times such stalling occurs. It's up to the
    administrator whether to reduce these stalls by increasing
    min_free_kbytes.

    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • If swap is backed by network storage such as NBD, there is a risk that a
    large number of reclaimers can hang the system by consuming all
    PF_MEMALLOC reserves. To avoid these hangs, the administrator must tune
    min_free_kbytes in advance which is a bit fragile.

    This patch throttles direct reclaimers if half the PF_MEMALLOC reserves
    are in use. If the system is routinely getting throttled the system
    administrator can increase min_free_kbytes so degradation is smoother but
    the system will keep running.
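
    The throttling point is roughly the following (a sketch with approximate
    names; the real code also handles fatal signals and the exact wait
    primitive may differ):

        static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
        {
                unsigned long free = 0, reserve = 0;
                int i;

                for (i = 0; i <= ZONE_NORMAL; i++) {
                        struct zone *zone = &pgdat->node_zones[i];

                        free += zone_page_state(zone, NR_FREE_PAGES);
                        reserve += min_wmark_pages(zone);
                }
                return free > reserve / 2;      /* over half still free */
        }

        /* in the direct reclaim path, before reclaiming: */
        if (!pfmemalloc_watermark_ok(pgdat))
                wait_event_killable(pgdat->pfmemalloc_wait,
                                    pfmemalloc_watermark_ok(pgdat));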

    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Getting and putting objects in SLAB currently requires a function call but
    the bulk of the work is related to PFMEMALLOC reserves which are only
    consumed when network-backed storage is critical. Use an inline function
    to determine if the function call is required.

    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Change the skb allocation API to indicate RX usage and use this to fall
    back to the PFMEMALLOC reserve when needed. SKBs allocated from the
    reserve are tagged in skb->pfmemalloc. If an SKB is allocated from the
    reserve and the socket is later found to be unrelated to page reclaim, the
    packet is dropped so that the memory remains available for page reclaim.
    Network protocols are expected to recover from this packet loss.
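
    The receive-side check described above looks roughly like this (a sketch
    of the test performed before delivering the skb, not necessarily
    verbatim):

        /*
         * If the skb was allocated from pfmemalloc reserves, only allow
         * SOCK_MEMALLOC sockets to use it, as those are helping to free
         * memory.
         */
        if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
                return -ENOMEM;                 /* drop the packet */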

    [a.p.zijlstra@chello.nl: Ideas taken from various patches]
    [davem@davemloft.net: Use static branches, coding style corrections]
    [sebastian@breakpoint.cc: Avoid unnecessary cast, fix !CONFIG_NET build]
    Signed-off-by: Mel Gorman
    Acked-by: David S. Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The reserve is proportionally distributed over all !highmem zones in the
    system. So we need to allow an emergency allocation access to all zones.
    In order to do that we need to break out of any mempolicy boundaries we
    might have.

    In my opinion that does not break mempolicies as those are user oriented
    and not system oriented. That is, system allocations are not guaranteed
    to be within mempolicy boundaries. For instance IRQs do not even have a
    mempolicy.

    So breaking out of mempolicy boundaries for 'rare' emergency allocations,
    which are always system allocations (as opposed to user) is ok.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Neil Brown
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • __alloc_pages_slowpath() is called when the number of free pages is below
    the low watermark. If the caller is entitled to use ALLOC_NO_WATERMARKS
    then the page will be marked page->pfmemalloc. This protects more pages
    than are strictly necessary as we only need to protect pages allocated
    below the min watermark (the pfmemalloc reserves).

    This patch only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was
    required to allocate the page.

    [rientjes@google.com: David noticed the problem during review]
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This is needed to allow network softirq packet processing to make use of
    PF_MEMALLOC.

    Currently softirq context cannot use PF_MEMALLOC because it is not
    associated with a task and therefore has no task flags to fiddle with -
    thus the gfp-to-alloc-flags mapping ignores the task flags when in
    interrupt (hard or soft) context.

    Allowing softirqs to make use of PF_MEMALLOC therefore requires some
    trickery. This patch borrows the task flags from whatever process happens
    to be preempted by the softirq. It then modifies the gfp-to-alloc-flags
    mapping to not exclude task flags in softirq context, and modifies the
    softirq code to save, clear and restore the PF_MEMALLOC flag.

    The save and clear ensure the preempted task's PF_MEMALLOC flag doesn't
    leak into the softirq. The restore ensures a softirq's PF_MEMALLOC flag
    cannot leak back into the preempted process. This should be safe for the
    following reasons:

    Softirqs can run on multiple CPUs, but the same task should not be
    executing the same softirq code, nor should the softirq handler be
    preempted by any other softirq handler, so the flags should not leak to
    an unrelated softirq.

    Softirqs re-enable hardware interrupts in __do_softirq() and so can be
    preempted by hardware interrupts, so PF_MEMALLOC is inherited by the hard
    IRQ. However, this is similar to a process in reclaim being preempted by
    a hardirq. While PF_MEMALLOC is set, gfp_to_alloc_flags() distinguishes
    between hard and soft irqs and avoids giving a hardirq the
    ALLOC_NO_WATERMARKS flag.

    If the softirq is deferred to ksoftirqd then its flags may be used
    instead of a normal task's, but as the softirq cannot be preempted, the
    PF_MEMALLOC flag does not leak to other code by accident.
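
    The save/clear/restore in __do_softirq() is essentially (a sketch):

        unsigned long old_flags = current->flags;

        /* don't inherit the preempted task's PF_MEMALLOC */
        current->flags &= ~PF_MEMALLOC;

        /* ... run the softirq handlers, which may set PF_MEMALLOC ... */

        /* restore the flag exactly as the preempted task had it */
        tsk_restore_flags(current, old_flags, PF_MEMALLOC);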

    [davem@davemloft.net: Document why PF_MEMALLOC is safe]
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Neil Brown
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • __GFP_MEMALLOC will allow the allocation to disregard the watermarks, much
    like PF_MEMALLOC. It allows one to pass along the memalloc state in
    object related allocation flags as opposed to task related flags, such as
    sk->sk_allocation. This removes the need for ALLOC_PFMEMALLOC, as callers
    using __GFP_MEMALLOC can get the ALLOC_NO_WATERMARKS flag, which is now
    enough to identify allocations related to page reclaim.
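
    For example, a socket doing IO on behalf of memory reclaim can carry the
    flag in its allocation mask (a sketch of the intended usage; 'size' is
    whatever the caller needs):

        sk->sk_allocation |= __GFP_MEMALLOC;    /* reclaim-critical socket */

        /* later allocations on behalf of this socket may dip into reserves */
        skb = alloc_skb(size, sk->sk_allocation);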

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Neil Brown
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patch removes the check for pfmemalloc from the alloc hotpath and
    puts the logic after the election of a new per cpu slab. For a pfmemalloc
    page we do not use the fast path but force the use of the slow path which
    is also used for the debug case.

    This has the side-effect of weakening pfmemalloc processing in the
    following way:

    1. A process that is allocating for network swap calls __slab_alloc.
    pfmemalloc_match is true so the freelist is loaded and c->freelist is
    now pointing to a pfmemalloc page.

    2. A process that is attempting normal allocations calls slab_alloc,
    finds the pfmemalloc page on the freelist and uses it because it did
    not check pfmemalloc_match()

    The patch allows non-pfmemalloc allocations to use pfmemalloc pages, with
    the kmalloc slabs being the most vulnerable caches on the grounds that
    they are most likely to have a mix of pfmemalloc and !pfmemalloc
    requests. A later patch will still protect the system, as processes will
    get throttled if the pfmemalloc reserves get depleted, but performance
    will not degrade as smoothly.
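
    The slow-path check amounts to the following (a sketch close to the
    helper this series adds):

        static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags)
        {
                if (unlikely(PageSlabPfmemalloc(page)))
                        return gfp_pfmemalloc_allowed(gfpflags);

                return true;
        }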

    [mgorman@suse.de: Expanded changelog]
    Signed-off-by: Christoph Lameter
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • When a user or administrator requires swap for their application, they
    create a swap partition or file, format it with mkswap and activate it
    with swapon. Swap over the network is considered an option in diskless
    systems. The two likely scenarios are blade servers used as part of a
    cluster, where the form factor or maintenance costs do not allow the use
    of disks, and thin clients.

    The Linux Terminal Server Project recommends the use of the Network Block
    Device (NBD) for swap according to the manual at
    https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download
    There is also documentation and tutorials on how to setup swap over NBD at
    places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The
    nbd-client also documents the use of NBD as swap. Despite this, the fact
    is that a machine using NBD for swap can deadlock within minutes if swap
    is used intensively. This patch series addresses the problem.

    The core issue is that network block devices do not use mempools like
    normal block devices do. As the host cannot control where it receives
    packets from, it cannot reliably work out in advance how much memory it
    might need. Some years ago, Peter Zijlstra developed a series of patches
    that supported swap over NFS which at least one distribution is carrying
    within their kernels. This patch series borrows very heavily from Peter's
    work to support swapping over NBD as a prerequisite to supporting
    swap-over-NFS. The bulk of the complexity is concerned with preserving
    memory that is allocated from the PFMEMALLOC reserves for use by the
    network layer, which is needed for both NBD and NFS.

    Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to
    preserve access to pages allocated under low memory situations
    to callers that are freeing memory.

    Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks

    Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC
    reserves without setting PFMEMALLOC.

    Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves
    for later use by network packet processing.

    Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required

    Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set.

    Patches 7-12 allows network processing to use PFMEMALLOC reserves when
    the socket has been marked as being used by the VM to clean pages. If
    packets are received and stored in pages that were allocated under
    low-memory situations and are unrelated to the VM, the packets
    are dropped.

    Patch 11 reintroduces __skb_alloc_page which the networking
    folk may object to but is needed in some cases to propagate
    pfmemalloc from a newly allocated page to an skb. If there is a
    strong objection, this patch can be dropped with the impact being
    that swap-over-network will be slower in some cases but it should
    not fail.

    Patch 13 is a micro-optimisation to avoid a function call in the
    common case.

    Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use
    PFMEMALLOC if necessary.

    Patch 15 notes that it is still possible for the PFMEMALLOC reserve
    to be depleted. To prevent this, direct reclaimers get throttled on
    a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is
    expected that kswapd and the direct reclaimers already running
    will clean enough pages for the low watermark to be reached and
    the throttled processes are woken up.

    Patch 16 adds a statistic to track how often processes get throttled

    Some basic performance testing was run using kernel builds, netperf on
    loopback for UDP and TCP, hackbench (pipes and sockets), iozone and
    sysbench. Each of them was expected to use the sl*b allocators reasonably
    heavily, but there did not appear to be significant performance
    variances.

    For testing swap-over-NBD, a machine was booted with 2G of RAM with a
    swapfile backed by NBD. 8*NUM_CPU processes were started that create
    anonymous memory mappings and read them linearly in a loop. The total
    size of the mappings was 4*PHYSICAL_MEMORY to use swap heavily under
    memory pressure.

    Without the patches and using SLUB, the machine locks up within minutes;
    with them applied it runs to completion. With SLAB, the story is
    different, as an unpatched kernel runs to completion. However, the
    patched kernel completed the test 45% faster.

    MICRO
                                             3.5.0-rc2   3.5.0-rc2
                                               vanilla     swapnbd
    Unrecognised test vmscan-anon-mmap-write
    MMTests Statistics: duration
    Sys Time Running Test (seconds)             197.80      173.07
    User+Sys Time Running Test (seconds)        206.96      182.03
    Total Elapsed Time (seconds)               3240.70     1762.09

    This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages

    Allocations of pages below the min watermark run a risk of the machine
    hanging due to a lack of memory. To prevent this, only callers who have
    PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are
    allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to
    a slab though, nothing prevents other callers consuming free objects
    within those slabs. This patch limits access to slab pages that were
    allocated from the PFMEMALLOC reserves.

    When this patch is applied, pages allocated from below the low watermark
    are returned with page->pfmemalloc set and it is up to the caller to
    determine how the page should be protected. SLAB restricts access to any
    page with page->pfmemalloc set to callers which are known to be able to
    access the PFMEMALLOC reserve. If one is not available, an attempt is
    made to allocate a new page rather than use a reserve. SLUB is a bit more
    relaxed in that it only records if the current per-CPU page was allocated
    from PFMEMALLOC reserve and uses another partial slab if the caller does
    not have the necessary GFP or process flags. This was found to be
    sufficient in tests to avoid hangs due to SLUB generally maintaining
    smaller lists than SLAB.

    In low-memory conditions it does mean that !PFMEMALLOC allocators can fail
    a slab allocation even though free objects are available because they are
    being preserved for callers that are freeing pages.

    [a.p.zijlstra@chello.nl: Original implementation]
    [sebastian@breakpoint.cc: Correct order of page flag clearing]
    Signed-off-by: Mel Gorman
    Cc: David Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When hotplug offlining happens on zone A, it starts to mark freed pages
    as MIGRATE_ISOLATE type in buddy to prevent further allocation.
    (MIGRATE_ISOLATE is an ironic type because the pages are apparently in
    buddy but we can't allocate them.)

    When a memory shortage happens during hotplug offlining, the current task
    starts to reclaim and then wakes up kswapd. Kswapd checks the watermark
    and then goes to sleep, because the current zone_watermark_ok_safe()
    doesn't consider MIGRATE_ISOLATE freed pages. The current task continues
    to reclaim in the direct reclaim path without kswapd's help. The problem
    is that zone->all_unreclaimable is set only by kswapd, so the current
    task would loop forever like below:

    __alloc_pages_slowpath
    restart:
        wake_all_kswapd
    rebalance:
        __alloc_pages_direct_reclaim
            do_try_to_free_pages
                if global_reclaim && !all_unreclaimable
                    return 1; /* it means we did did_some_progress */
        skip __alloc_pages_may_oom
        should_alloc_retry
            goto rebalance;

    If we apply KOSAKI's patch [1], which doesn't depend on kswapd for
    setting zone->all_unreclaimable, we can solve this problem by killing
    some task in the direct reclaim path. But it still doesn't wake up
    kswapd. That could still be a problem if another subsystem needs a
    GFP_ATOMIC request. So kswapd should consider MIGRATE_ISOLATE when it
    calculates free pages BEFORE going to sleep.

    This patch counts the number of MIGRATE_ISOLATE pageblocks and
    zone_watermark_ok_safe() will consider them if the system has such blocks
    (fortunately, this is very rare, so there is no overhead problem, and
    kswapd is never a hotpath).

    Copy/modify from Mel's quote
    "
    Ideal solution would be "allocating" the pageblock.
    It would keep the free space accounting as it is but historically,
    memory hotplug didn't allocate pages because it would be difficult to
    detect if a pageblock was isolated or if part of some balloon.
    Allocating just full pageblocks would work around this, However,
    it would play very badly with CMA.
    "

    [1] http://lkml.org/lkml/2012/6/14/74
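
    The accounting amounts to something like the following (names such as
    nr_pageblock_isolate are approximate and only illustrate the idea; the
    safe variant also folds in per-cpu drift, omitted here):

        /* discount free pages sitting in MIGRATE_ISOLATE pageblocks */
        static unsigned long nr_zone_isolate_freepages(struct zone *zone)
        {
                return zone->nr_pageblock_isolate * pageblock_nr_pages;
        }

        bool zone_watermark_ok_safe(struct zone *z, int order,
                                    unsigned long mark, int classzone_idx,
                                    int alloc_flags)
        {
                long free_pages = zone_page_state(z, NR_FREE_PAGES);

                free_pages -= nr_zone_isolate_freepages(z);
                return __zone_watermark_ok(z, order, mark, classzone_idx,
                                           alloc_flags, free_pages);
        }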

    [akpm@linux-foundation.org: simplify nr_zone_isolate_freepages(), rework zone_watermark_ok_safe() comment, simplify set_pageblock_isolate() and restore_pageblock_isolate()]
    [akpm@linux-foundation.org: fix CONFIG_MEMORY_ISOLATION=n build]
    Signed-off-by: Minchan Kim
    Suggested-by: KOSAKI Motohiro
    Tested-by: Aaditya Kumar
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • __zone_watermark_ok() currently compares free_pages, which is a signed
    type, with z->lowmem_reserve[classzone_idx], which is unsigned. This
    might lead to sign overflow if free_pages doesn't satisfy the given order
    (or came in negative already), and then we rely on the following order
    loop to fix it up (which doesn't work for order-0). Let's fix the type
    conversion and not rely on the given value of free_pages or on follow-up
    fixups.

    This patch fixes it because "memory-hotplug: fix kswapd looping forever
    problem" depends on this.

    As a benefit, the code no longer relies on the loop to exit
    __zone_watermark_ok() for the high-order check, and the first test (the
    initial free_pages comparison) becomes effective.
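
    The shape of the fix, as a sketch (not the exact hunk):

        /* keep the arithmetic signed: free_pages may go negative */
        long min = mark;
        long free = free_pages;

        free -= (1 << order) - 1;
        if (free <= min + (long)z->lowmem_reserve[classzone_idx])
                return false;
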
    Tested-by: Aaditya Kumar
    Signed-off-by: Aaditya Kumar
    Signed-off-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • mm/page_alloc.c has some memory isolation functions, but they are used
    only when we enable CONFIG_{CMA|MEMORY_HOTPLUG|MEMORY_FAILURE}. So let's
    make them configurable via a new CONFIG_MEMORY_ISOLATION option, which
    reduces binary size and lets us simply check CONFIG_MEMORY_ISOLATION
    instead of defined CONFIG_{CMA|MEMORY_HOTPLUG|MEMORY_FAILURE}.

    Signed-off-by: Minchan Kim
    Cc: Andi Kleen
    Cc: Marek Szyprowski
    Acked-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • By globally defining check_panic_on_oom(), the memcg oom handler can be
    moved entirely to mm/memcontrol.c. This removes the ugly #ifdef in the
    oom killer and cleans up the code.

    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Since exiting tasks require write_lock_irq(&tasklist_lock) several times,
    try to reduce the amount of time the readside is held for oom kills. This
    makes the interface with the memcg oom handler more consistent since it
    now never needs to take tasklist_lock unnecessarily.

    The only time the oom killer now takes tasklist_lock is when iterating the
    children of the selected task, everything else is protected by
    rcu_read_lock().

    This requires that a reference to the selected process, p, is grabbed
    before calling oom_kill_process(). It may release it and grab a reference
    on another one of p's threads if !p->mm, but it also guarantees that it
    will release the reference before returning.

    [hughd@google.com: fix duplicate put_task_struct()]
    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Cc: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The global oom killer is serialized by the per-zonelist
    try_set_zonelist_oom() which is used in the page allocator. Concurrent
    oom kills are thus a rare event and only occur in systems using
    mempolicies and with a large number of nodes.

    Memory controller oom kills, however, can frequently be concurrent since
    there is no serialization once the oom killer is called for oom conditions
    in several different memcgs in parallel.

    This creates a massive contention on tasklist_lock since the oom killer
    requires the readside for the tasklist iteration. If several memcgs are
    calling the oom killer, this lock can be held for a substantial amount of
    time, especially if threads continue to enter it as other threads are
    exiting.

    Since the exit path grabs the writeside of the lock with irqs disabled in
    a few different places, this can cause a soft lockup on cpus as a result
    of tasklist_lock starvation.

    The kernel lacks unfair writelocks, and successful calls to the oom killer
    usually result in at least one thread entering the exit path, so an
    alternative solution is needed.

    This patch introduces a separate oom handler for memcgs so that they do
    not require tasklist_lock for as much time. Instead, it iterates only
    over the threads attached to the oom memcg and grabs a reference to the
    selected thread before calling oom_kill_process() to ensure it doesn't
    prematurely exit.

    This still requires tasklist_lock for the tasklist dump, iterating
    children of the selected process, and killing all other threads on the
    system sharing the same memory as the selected victim. So while this
    isn't a complete solution to tasklist_lock starvation, it significantly
    reduces the amount of time that it is held.

    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Signed-off-by: David Rientjes
    Cc: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Reviewed-by: Sha Zhengju
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • This patch introduces a helper function to process each thread during the
    iteration over the tasklist. A new return type, enum oom_scan_t, is
    defined to determine the future behavior of the iteration:

    - OOM_SCAN_OK: continue scanning the thread and find its badness,

    - OOM_SCAN_CONTINUE: do not consider this thread for oom kill, it's
    ineligible,

    - OOM_SCAN_ABORT: abort the iteration and return, or

    - OOM_SCAN_SELECT: always select this thread with the highest badness
    possible.

    There is no functional change with this patch. This new helper function
    will be used in the next patch in the memory controller.
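
    The enum itself is small; reproduced here as a sketch for reference:

        enum oom_scan_t {
                OOM_SCAN_OK,            /* scan thread and find its badness */
                OOM_SCAN_CONTINUE,      /* ineligible, skip this thread */
                OOM_SCAN_ABORT,         /* abort the iteration and return */
                OOM_SCAN_SELECT,        /* always select this thread */
        };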

    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: KOSAKI Motohiro
    Reviewed-by: Michal Hocko
    Signed-off-by: David Rientjes
    Cc: Oleg Nesterov
    Reviewed-by: Sha Zhengju
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Mark functions used by both boot and memory hotplug as __meminit to reduce
    memory footprint when memory hotplug is disabled.

    Also guard zone_pcp_update() with CONFIG_MEMORY_HOTPLUG because it's only
    used by memory hotplug code.

    Signed-off-by: Jiang Liu
    Cc: Wei Wang
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rusty Russell
    Cc: Yinghai Lu
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: David Rientjes
    Cc: Keping Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu
     
  • When a zone becomes empty after memory offlining, free zone->pageset.
    Otherwise it will cause a memory leak when adding memory to the empty
    zone again, because build_all_zonelists() will allocate zone->pageset for
    an empty zone.

    Signed-off-by: Jiang Liu
    Signed-off-by: Wei Wang
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rusty Russell
    Cc: Yinghai Lu
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: David Rientjes
    Cc: Keping Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Liu