13 Jun, 2013

1 commit

  • read_swap_cache_async() can race against get_swap_page(), and stumble
    across a SWAP_HAS_CACHE entry in the swap map whose page wasn't brought
    into the swapcache yet.

    This swap_map state is expected to be transient, but the actual
    placement of discard at scan_swap_map() inserts a wait for I/O
    completion, which makes the thread at read_swap_cache_async() loop
    around its -EEXIST case while the other end at get_swap_page() is
    scheduled away at scan_swap_map(). This can leave the system deadlocked
    if the I/O completion happens to be waiting on the CPU waitqueue where
    read_swap_cache_async() is busy looping and the kernel is built with
    !CONFIG_PREEMPT.

    This patch introduces a cond_resched() call so that the aforementioned
    busy loop in read_swap_cache_async() can bail out when necessary, thus
    avoiding the subtle race window.
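    A minimal sketch of the affected retry loop in read_swap_cache_async(),
    with the added rescheduling point (details trimmed from the real
    function):

    err = swapcache_prepare(entry);
    if (err == -EEXIST) {
            /*
             * Another context set SWAP_HAS_CACHE but has not added the
             * page to the swapcache yet; let it run instead of spinning
             * here forever on a !CONFIG_PREEMPT kernel.
             */
            cond_resched();
            continue;
    }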

    Signed-off-by: Rafael Aquini
    Acked-by: Johannes Weiner
    Acked-by: KOSAKI Motohiro
    Acked-by: Hugh Dickins
    Cc: Shaohua Li
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rafael Aquini
     

30 Apr, 2013

2 commits

  • In page reclaim, a huge page is split; split_huge_page() adds the tail
    pages to the LRU list. Since we are reclaiming a huge page, it is
    better to reclaim all subpages of the huge page instead of just the
    head page. This patch adds the split tail pages to the shrink page
    list so that they can be reclaimed soon.

    Before this patch, running a swap workload:
    thp_fault_alloc 3492
    thp_fault_fallback 608
    thp_collapse_alloc 6
    thp_collapse_alloc_failed 0
    thp_split 916

    With this patch:
    thp_fault_alloc 4085
    thp_fault_fallback 16
    thp_collapse_alloc 90
    thp_collapse_alloc_failed 0
    thp_split 1272

    Fallback allocation is reduced a lot (from 608 to 16 in the runs above).
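    The mechanism amounts to a few lines in add_to_swap(), roughly (a
    sketch; 'list' is the page list passed down from shrink_page_list()):

    /* in add_to_swap(page, list), after a swap entry has been allocated */
    if (unlikely(PageTransHuge(page)))
            if (unlikely(split_huge_page_to_list(page, list))) {
                    swapcache_free(entry, NULL);    /* split failed, release the slot */
                    return 0;
            }
    /* the tail pages are now on 'list' and will be reclaimed as well */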

    [akpm@linux-foundation.org: fix CONFIG_SWAP=n build]
    Signed-off-by: Shaohua Li
    Acked-by: Rik van Riel
    Acked-by: Minchan Kim
    Acked-by: Johannes Weiner
    Reviewed-by: Wanpeng Li
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • swap_writepage() is currently where frontswap hooks into the swap write
    path to capture pages with the frontswap_store() function. However, if
    a frontswap backend wants to "resume" the writeback of a page to the
    swap device, it can't call swap_writepage() as the page will simply
    reenter the backend.

    This patch separates swap_writepage() into a top and a bottom half; the
    bottom half, named __swap_writepage(), allows a frontswap backend such
    as zswap to resume writeback beyond the frontswap_store() hook.

    __add_to_swap_cache() is also made non-static so that the page for which
    writeback is to be resumed can be added to the swap cache.
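    A simplified sketch of the resulting split (error handling trimmed; the
    exact signatures may differ slightly):

    int swap_writepage(struct page *page, struct writeback_control *wbc)
    {
            if (try_to_free_swap(page)) {           /* swap slot no longer needed */
                    unlock_page(page);
                    return 0;
            }
            if (frontswap_store(page) == 0) {       /* captured by the backend */
                    set_page_dirty(page);
                    unlock_page(page);
                    return 0;
            }
            return __swap_writepage(page, wbc);     /* bottom half: submit the I/O */
    }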

    Signed-off-by: Seth Jennings
    Signed-off-by: Bob Liu
    Acked-by: Minchan Kim
    Reviewed-by: Dan Magenheimer
    Cc: Konrad Rzeszutek Wilk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Seth Jennings
     

24 Feb, 2013

2 commits

  • swap_lock is heavily contended when I test swapping to 3 fast SSDs
    (it is even slightly slower than swapping to 2 such SSDs). The main
    contention comes from swap_info_get(). This patch tries to close that
    gap by adding a new per-partition lock.

    Global data like nr_swapfiles, total_swap_pages, least_priority and
    swap_list are still protected by swap_lock.

    nr_swap_pages is now atomic, so it can be changed without holding
    swap_lock. In theory it is possible for get_swap_page() to find no
    swap pages even though free swap pages exist, but that does not sound
    like a big problem.

    Accessing partition-specific data (like scan_swap_map() and so on) is
    protected only by swap_info_struct.lock.

    Changing swap_info_struct.flags requires holding both swap_lock and
    swap_info_struct.lock, because scan_swap_map() checks it. Reading the
    flags is OK with either lock held.

    If both swap_lock and swap_info_struct.lock must be held, we always
    take the former first to avoid deadlock.

    swap_entry_free() can change swap_list. To delete that code, we add a
    new highest_priority_index; whenever get_swap_page() is called, we
    check it and, if it is valid, use it.

    It is a pity that get_swap_page() still takes swap_lock. But in
    practice swap_lock isn't heavily contended in my test with this patch
    (or rather, there are other much heavier bottlenecks, such as TLB
    flush). And by the way, get_swap_page() doesn't look like it really
    needs the lock: we never free swap_info[] and we check the SWAP_WRITEOK
    flag. The only risk without the lock is that we could swap out to a
    low-priority swap device, but we recover quickly after several rounds
    of swap, so it does not seem like a big deal. I'd still prefer to fix
    this if it turns out to be a real problem.

    "swap: make each swap partition have one address_space" improved the
    swapout speed from 1.7G/s to 2G/s. This patch further improves the
    speed to 2.3G/s, so around 15% improvement. It's a multi-process test,
    so TLB flush isn't the biggest bottleneck before the patches.
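    The resulting locking rules, in short (per-partition lock nested inside
    the global one; a sketch, not the literal patch):

    spin_lock(&swap_lock);          /* global: swap_list, priorities, nr_swapfiles */
    spin_lock(&p->lock);            /* per-partition: swap_map, scan state */
    p->flags |= SWP_WRITEOK;        /* flag changes need both locks held */
    spin_unlock(&p->lock);
    spin_unlock(&swap_lock);        /* flag reads are fine with either lock */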

    [arnd@arndb.de: fix it for nommu]
    [hughd@google.com: add missing unlock]
    [minchan@kernel.org: get rid of lockdep whinge on sys_swapon]
    Signed-off-by: Shaohua Li
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Greg Kroah-Hartman
    Cc: Seth Jennings
    Cc: Konrad Rzeszutek Wilk
    Cc: Xiao Guangrong
    Cc: Dan Magenheimer
    Cc: Stephen Rothwell
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Hugh Dickins
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • When I use several fast SSDs for swap, swapper_space.tree_lock is
    heavily contended. This patch gives each swap partition its own
    address_space to reduce the lock contention: there is an array of
    address_spaces for swap, and the swap entry type is the index into the
    array.

    In my test with 3 SSDs, this increases the swapout throughput by 20%.
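    The lookup boils down to indexing that array by swap type (sketch):

    struct address_space swapper_spaces[MAX_SWAPFILES];

    static inline struct address_space *swap_address_space(swp_entry_t entry)
    {
            return &swapper_spaces[swp_type(entry)];
    }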

    [akpm@linux-foundation.org: revert unneeded change to __add_to_swap_cache]
    Signed-off-by: Shaohua Li
    Cc: Hugh Dickins
    Acked-by: Rik van Riel
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

01 Aug, 2012

2 commits

  • Currently swapfiles are managed entirely by the core VM by using ->bmap to
    allocate space and write to the blocks directly. This effectively ensures
    that the underlying blocks are allocated and avoids the need for the swap
    subsystem to locate what physical blocks store offsets within a file.

    If the swap subsystem is to use the filesystem information to locate the
    blocks, it is critical that information such as block groups, block
    bitmaps and the block descriptor table that map the swap file be
    resident in memory. This patch adds address_space_operations that the VM
    can call when activating or deactivating swap backed by a file.

    int swap_activate(struct file *);
    int swap_deactivate(struct file *);

    The ->swap_activate() method is used to communicate to the file that the
    VM relies on it, and the address_space should take adequate measures such
    as reserving space in the underlying device, reserving memory for mempools
    and pinning information such as the block descriptor table in memory. The
    ->swap_deactivate() method is called on sys_swapoff() if ->swap_activate()
    returned success.

    After a successful swapfile ->swap_activate, the swapfile is marked
    SWP_FILE and swapper_space.a_ops will proxy to
    sis->swap_file->f_mapping->a_ops, using ->direct_IO to write swapcache
    pages and ->readpage to read them.

    It is perfectly possible that ->direct_IO be used to read the swap
    pages as well, but it is an unnecessary complication. Similarly, it is
    possible that ->writepage be used instead of ->direct_IO to write the
    pages, but filesystem developers have stated that calling writepage
    from the VM is undesirable for a variety of reasons, and using
    ->direct_IO opens up the possibility of writing back batches of swap
    pages in the future.
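    A filesystem opts in by wiring the new methods into its
    address_space_operations, along these lines (the myfs_* names are
    hypothetical):

    static const struct address_space_operations myfs_aops = {
            .readpage        = myfs_readpage,
            .direct_IO       = myfs_direct_IO,
            .swap_activate   = myfs_swap_activate,   /* pin metadata, reserve space */
            .swap_deactivate = myfs_swap_deactivate,
    };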

    [a.p.zijlstra@chello.nl: Original patch]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Christoph Hellwig
    Cc: David S. Miller
    Cc: Eric B Munson
    Cc: Eric Paris
    Cc: James Morris
    Cc: Mel Gorman
    Cc: Mike Christie
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: Trond Myklebust
    Cc: Xiaotian Feng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Swap readahead works fine, but the I/O to disk is almost always done in
    page size requests, despite the fact that readahead submits
    1<<page_cluster pages at a time.
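    One way to let those page-sized reads merge at the block layer is to
    wrap the readahead loop in an on-stack plug, roughly as follows (a
    sketch of the approach, not the verbatim patch):

    struct blk_plug plug;
    struct page *page;

    blk_start_plug(&plug);
    for (offset = start_offset; offset <= end_offset; offset++) {
            /* queue the cluster's reads; they can merge while plugged */
            page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
                                         gfp_mask, vma, addr);
            if (page)
                    page_cache_release(page);
    }
    blk_finish_plug(&plug);         /* flush and submit the merged I/O */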
    Acked-by: Rik van Riel
    Acked-by: Jens Axboe
    Reviewed-by: Minchan Kim
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christian Ehrhardt
     

24 Apr, 2012

1 commit

  • Mel reports a BUG_ON(slot == NULL) in radix_tree_tag_set() on s390
    3.0.13: called from __set_page_dirty_nobuffers() when page_remove_rmap()
    tries to transfer dirty flag from s390 storage key to struct page and
    radix_tree.

    That would be because of reclaim's shrink_page_list() calling
    add_to_swap() on this page at the same time: first PageSwapCache is set
    (causing page_mapping(page) to appear as &swapper_space), then
    page->private set, then tree_lock taken, then page inserted into
    radix_tree - so there's an interval before taking the lock when the
    radix_tree slot is empty.

    We could fix this by moving __add_to_swap_cache()'s spin_lock_irq up
    before the SetPageSwapCache. But a better fix is simply to do what's
    five years overdue: Ken Chen introduced __set_page_dirty_no_writeback()
    (if !PageDirty TestSetPageDirty) for tmpfs to skip all the radix_tree
    overhead, and swap is just the same - it ignores the radix_tree tag, and
    does not participate in dirty page accounting, so should be using
    __set_page_dirty_no_writeback() too.

    s390 testing now confirms that this does indeed fix the problem.
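    In effect the swap address_space just switches its dirty handler; a
    sketch of the swap_aops change:

    static const struct address_space_operations swap_aops = {
            .writepage      = swap_writepage,
            .set_page_dirty = __set_page_dirty_no_writeback, /* no tag, no accounting */
    };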

    Reported-by: Mel Gorman
    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Andrew Morton
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Rik van Riel
    Cc: Ken Chen
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

22 Mar, 2012

1 commit

  • Ever since abandoning the virtual scan of processes, for scalability
    reasons, swap space has been a little more fragmented than before. This
    can lead to the situation where a large memory user is killed, swap space
    ends up full of "holes" and swapin readahead is totally ineffective.

    On my home system, after killing a leaky firefox it took over an hour to
    page just under 2GB of memory back in, slowing the virtual machines down
    to a crawl.

    This patch makes swapin readahead simply skip over holes, instead of
    stopping at them. This allows the system to swap things back in at rates
    of several MB/second, instead of a few hundred kB/second.

    The checks done in valid_swaphandles are already done in
    read_swap_cache_async as well, allowing us to remove a fair amount of
    code.
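    The readahead loop then looks roughly like this (simplified from the
    merged swapin_readahead()):

    /* read an aligned cluster of 1<<page_cluster pages around 'offset' */
    for (offset = start_offset; offset <= end_offset; offset++) {
            page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
                                         gfp_mask, vma, addr);
            if (!page)              /* hole or bad entry: skip it, keep going */
                    continue;
            page_cache_release(page);
    }
    lru_add_drain();                /* push the new pages onto the LRU */
    return read_swap_cache_async(entry, gfp_mask, vma, addr);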

    [akpm@linux-foundation.org: fix it for page_cluster >= 32]
    Signed-off-by: Rik van Riel
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Acked-by: Johannes Weiner
    Acked-by: Mel Gorman
    Cc: Adrian Drzewiecki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

06 Mar, 2012

1 commit

  • When moving tasks from an old memcg (with move_charge_at_immigrate on
    the new memcg), followed by removal of the old memcg, we hit a General
    Protection Fault in mem_cgroup_lru_del_list() (called from release_pages
    called from free_pages_and_swap_cache from tlb_flush_mmu from
    tlb_finish_mmu from exit_mmap from mmput from exit_mm from do_exit).

    Somewhat reproducible, takes a few hours: the old struct mem_cgroup has
    been freed and poisoned by SLAB_DEBUG, but mem_cgroup_lru_del_list() is
    still trying to update its stats and take the page off the lru before
    freeing.

    A task, or a charge, or a page on lru: each secures a memcg against
    removal. In this case, the last task has been moved out of the old memcg,
    and it is exiting: anonymous pages are uncharged one by one from the
    memcg, as they are zapped from its pagetables, so the charge gets down to
    0; but the pages themselves are queued in an mmu_gather for freeing.

    Most of those pages will be on lru (and force_empty is careful to
    lru_add_drain_all, to add pages from pagevec to lru first), but not
    necessarily all: perhaps some have been isolated for page reclaim, perhaps
    some isolated for other reasons. So, force_empty may find no task, no
    charge and no page on lru, and let the removal proceed.

    There would still be no problem if these pages were immediately freed; but
    typically (and the put_page_testzero protocol demands it) they have to be
    added back to lru before they are found freeable, then removed from lru
    and freed. We don't see the issue when adding, because the
    mem_cgroup_iter() loops keep their own reference to the memcg being
    scanned; the problem shows up when it comes to mem_cgroup_lru_del_list().

    I believe this was not an issue in v3.2: there, PageCgroupAcctLRU and
    PageCgroupUsed flags were used (like a trick with mirrors) to deflect view
    of pc->mem_cgroup to the stable root_mem_cgroup when neither set.
    38c5d72f3ebe ("memcg: simplify LRU handling by new rule") mercifully
    removed those convolutions, but left this General Protection Fault.

    But it's surprisingly easy to restore the old behaviour: just check
    PageCgroupUsed in mem_cgroup_lru_add_list() (which decides on which lruvec
    to add), and reset pc to root_mem_cgroup if page is uncharged. A risky
    change? just going back to how it worked before; testing, and an audit of
    uses of pc->mem_cgroup, show no problem.
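    The check amounts to a few lines in mem_cgroup_lru_add_list() (a sketch;
    the merged fix may differ in detail):

    struct page_cgroup *pc = lookup_page_cgroup(page);
    struct mem_cgroup *memcg = pc->mem_cgroup;

    /* an uncharged page must go to the root lruvec, never via a stale memcg */
    if (!PageCgroupUsed(pc) && memcg != root_mem_cgroup)
            pc->mem_cgroup = memcg = root_mem_cgroup;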

    And there's a nice bonus: with mem_cgroup_lru_add_list() itself making
    sure that an uncharged page goes to root lru, mem_cgroup_reset_owner() no
    longer has any purpose, and we can safely revert 4e5f01c2b9b9 ("memcg:
    clear pc->mem_cgroup if necessary").

    Calling update_page_reclaim_stat() after add_page_to_lru_list() in swap.c
    is not strictly necessary: the lru_lock there, with RCU before memcg
    structures are freed, makes mem_cgroup_get_reclaim_stat_from_page safe
    without that; but it seems cleaner to rely on one dependency less.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

13 Jan, 2012

1 commit

  • This is a preparation before removing a flag PCG_ACCT_LRU in page_cgroup
    and reducing atomic ops/complexity in memcg LRU handling.

    In some cases, pages are added to the lru before being charged to a
    memcg, and so are not classified to a memory cgroup at lru addition.
    Now, the lru where the page should be added is determined by a bit in
    page_cgroup->flags and by pc->mem_cgroup. I'd like to remove the check
    of the flag.

    To handle that case, where pc->mem_cgroup may contain a stale pointer
    if pages are added to the LRU before classification, this patch resets
    pc->mem_cgroup to root_mem_cgroup before lru additions.

    [akpm@linux-foundation.org: fix CONFIG_CGROUP_MEM_CONT=n build]
    [hughd@google.com: fix CONFIG_CGROUP_MEM_RES_CTLR=y CONFIG_CGROUP_MEM_RES_CTLR_SWAP=n build]
    [akpm@linux-foundation.org: ksm.c needs memcontrol.h, per Michal]
    [hughd@google.com: stop oops in mem_cgroup_reset_owner()]
    [hughd@google.com: fix page migration to reset_owner]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Miklos Szeredi
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Ying Han
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

04 Jan, 2012

1 commit

  • Move invalidate_bdev, block_sync_page into fs/block_dev.c. Export
    kill_bdev as well, so brd doesn't have to open code it. Reduce
    buffer_head.h requirement accordingly.

    Removed a rather large comment from invalidate_bdev, as it looked a bit
    obsolete to bother moving. The small comment replacing it says enough.

    Signed-off-by: Nick Piggin
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Al Viro

    Al Viro
     

31 Oct, 2011

1 commit


10 Mar, 2011

1 commit

  • Code has been converted over to the new explicit on-stack plugging,
    and delay users have been converted to use the new API for that.
    So let's kill off the old plugging along with aops->sync_page().

    Signed-off-by: Jens Axboe

    Jens Axboe
     

14 Jan, 2011

1 commit

  • Add paging logic that splits the page before it is unmapped and added
    to swap, to ensure backwards compatibility with the legacy swap code.
    Eventually swap should natively page out hugepages to increase
    performance and decrease seeking and fragmentation of swap space.
    swapoff can just skip over huge pmds as they cannot be part of swap
    yet. In add_to_swap(), be careful to split the page only if we got a
    valid swap entry, so we don't split hugepages when swap is full.
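    In outline, add_to_swap() does (sketch):

    entry = get_swap_page();
    if (!entry.val)
            return 0;                       /* no swap slot: leave the hugepage intact */

    if (unlikely(PageTransHuge(page)))
            if (unlikely(split_huge_page(page))) {
                    swapcache_free(entry, NULL);    /* split failed, give the slot back */
                    return 0;
            }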

    In theory we could split pages before isolating them during the lru scan,
    but for khugepaged to be safe, I'm relying on either mmap_sem write mode,
    or PG_lock taken, so split_huge_page has to run either with mmap_sem
    read/write mode or PG_lock taken. Calling it from isolate_lru_page would
    make locking more complicated, in addition to that split_huge_page would
    deadlock if called by __isolate_lru_page because it has to take the lru
    lock to add the tail pages.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include
    those headers directly instead of assuming their availability. As this
    conversion needs to touch a large number of source files, the following
    script is used as the basis of the conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree, or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers, which should be easily discoverable on most builds of the
    specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

22 Sep, 2009

2 commits

  • After commit 355cfa73 ("mm: modify swap_map and add SWAP_HAS_CACHE flag"),
    only the context which has set the SWAP_HAS_CACHE flag via
    swapcache_prepare() or get_swap_page() will call add_to_swap_cache().
    So add_to_swap_cache() doesn't return -EEXIST any more.

    Even though it doesn't return -EEXIST, it's not good behavior conceptually
    to call swapcache_prepare() in the -EEXIST case, because it means clearing
    SWAP_HAS_CACHE flag while the entry is on swap cache.

    This patch removes redundant code and comments from its callers, and
    adds a VM_BUG_ON() in the error path of add_to_swap_cache() along with
    some comments.

    Signed-off-by: Daisuke Nishimura
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • After commit 355cfa73 ("mm: modify swap_map and add SWAP_HAS_CACHE flag"),
    read_swap_cache_async() will busy-wait while an entry doesn't exist in
    the swap cache but has the SWAP_HAS_CACHE flag set.

    Such entries can exist on the add/delete paths of the swap cache. On
    the add path, add_to_swap_cache() is called soon after the
    SWAP_HAS_CACHE flag is set, and on the delete path, swapcache_free()
    (which clears the SWAP_HAS_CACHE flag) is called soon after
    __delete_from_swap_cache(). So the busy-wait works well in most cases.

    But this mechanism can cause soft lockup if add_to_swap_cache() sleeps and
    read_swap_cache_async() tries to swap-in the same entry on the same cpu.

    This patch calls radix_tree_preload() before swapcache_prepare() and
    divides add_to_swap_cache() into two parts: the radix_tree_preload()
    part and the radix_tree_insert() part (defined as __add_to_swap_cache()).
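    After the split, the sleeping part is done up front and the wrapper is
    essentially:

    int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask)
    {
            int error;

            error = radix_tree_preload(gfp_mask);   /* may sleep; done first */
            if (!error) {
                    error = __add_to_swap_cache(page, entry);
                    radix_tree_preload_end();
            }
            return error;
    }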

    Signed-off-by: Daisuke Nishimura
    Cc: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     

11 Sep, 2009

1 commit


17 Jun, 2009

4 commits

  • The file argument resulted from address_space's readpage a long time
    ago.

    We don't use it any more, so let's remove the unnecessary argument.

    Signed-off-by: Minchan Kim
    Acked-by: Hugh Dickins
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Hugh removed add_to_swap's gfp_mask argument ("mm: remove gfp_mask from
    add_to_swap"), so we have to remove the gfp_mask annotation from that
    function as well.

    Signed-off-by: Minchan Kim
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • This is part of the patches for fixing memcg's swap accounting leak.
    But, IMHO, it is not a bad patch even without memcg.

    There are 2 kinds of references to swap.
    - reference from swap entry
    - reference from swap cache

    Then,

    - If there is swap cache && swap's refcnt is 1, there is only swap cache.
    (*) swapcount(entry) == 1 && find_get_page(swapper_space, entry) != NULL

    This counting logic has worked well for a long time. But considering
    that we cannot tell from swap_map[] whether a reference is a _real_ one
    or not, the current usage of the counter is not very good.

    This patch adds a SWAP_HAS_CACHE flag and records whether a swap entry
    has a cache or not. This will remove the -1 magic used in swapfile.c
    and help avoid unnecessary find_get_page() calls.

    Signed-off-by: KAMEZAWA Hiroyuki
    Tested-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Li Zefan
    Cc: Dhaval Giani
    Cc: YAMAMOTO Takashi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • In a following patch, the usage of swap cache is recorded into swap_map.
    This patch is for necessary interface changes to do that.

    2 interfaces:

    - swapcache_prepare()
    - swapcache_free()

    are added for allocating/freeing a refcount from the swap cache onto
    existing swap entries, but the implementation itself is not changed by
    this patch. While adding swapcache_free(), memcg's hook code is moved
    under swapcache_free(); this is better than using scattered hooks.
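    The two interfaces as described above (signatures of this era are
    assumed; the page argument to swapcache_free() is what the memcg hook
    uses):

    int  swapcache_prepare(swp_entry_t entry);                  /* take a swap-cache reference */
    void swapcache_free(swp_entry_t entry, struct page *page);  /* drop it; memcg hook lives here */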

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Acked-by: Balbir Singh
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Li Zefan
    Cc: Dhaval Giani
    Cc: YAMAMOTO Takashi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

29 May, 2009

1 commit

  • mapping->tree_lock can be acquired from interrupt context. Then the
    following deadlock can occur.

    Assume "A" as a page.

    CPU0:
    lock_page_cgroup(A)
    interrupted
    -> take mapping->tree_lock.
    CPU1:
    take mapping->tree_lock
    -> lock_page_cgroup(A)

    This patch tries to fix the above deadlock by moving memcg's hook out
    of mapping->tree_lock. Charge/uncharge of pagecache/swapcache is
    protected by the page lock, not tree_lock.

    After this patch, lock_page_cgroup() is not called under mapping->tree_lock.

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     

09 Jan, 2009

2 commits

  • This patch implements a per-cgroup limit on the usage of memory+swap.
    For SwapCache, double counting of swap-cache and swap-entry is avoided.

    The mem+swap controller works as follows.
    - memory usage is limited by memory.limit_in_bytes.
    - memory + swap usage is limited by memory.memsw_limit_in_bytes.

    This has the following benefits.
    - A user can limit the total resource usage of mem+swap.

    Without this, because the memory resource controller doesn't take care
    of swap usage, a process can exhaust all the swap (e.g. by a memory
    leak). We can avoid this case.

    And swap is a shared resource, but it cannot be reclaimed (returned to
    memory) until it's used. This characteristic can be trouble when the
    memory is divided into parts by cpuset or memcg.
    Assume group A and group B.
    After some applications execute, the system can be:

    Group A -- very large free memory space but occupies 99% of swap.
    Group B -- under memory shortage but cannot use swap... it's nearly full.

    The ability to set an appropriate swap limit for each group is required.

    Maybe someone wonders "why not just swap, but mem+swap?"

    - The global LRU (kswapd) can swap out arbitrary pages. Swap-out means
    moving the account from memory to swap... there is no change in the
    usage of mem+swap.

    In other words, when we want to limit the usage of swap without affecting
    global LRU, mem+swap limit is better than just limiting swap.

    Accounting target information is stored in swap_cgroup, which is a
    per-swap-entry record.

    Charging is done as follows.
    map
    - charge page and memsw.

    unmap
    - uncharge page/memsw if not SwapCache.

    swap-out (__delete_from_swap_cache)
    - uncharge page
    - record mem_cgroup information to swap_cgroup.

    swap-in (do_swap_page)
    - charged as page and memsw.
    record in swap_cgroup is cleared.
    memsw accounting is decremented.

    swap-free (swap_free())
    - if swap entry is freed, memsw is uncharged by PAGE_SIZE.

    There are people who work in never-swap environments and consider swap
    to be something bad. For such people, this mem+swap controller
    extension is just overhead, and that overhead can be avoided by a
    config or boot option. (See Kconfig; the details are not in this patch.)

    TODO:
    - maybe more optimization can be done in the swap-in path (but it is
    not very safe). We just do simple accounting at this stage.

    [nishimura@mxp.nes.nec.co.jp: make resize limit hold mutex]
    [hugh@veritas.com: memswap controller core swapcache fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • SwapCache support for memory resource controller (memcg)

    Before the mem+swap controller, memcg itself should handle SwapCache in
    a proper way. This is cut out from it.

    In the current memcg, SwapCache is just leaked and the user can create
    tons of SwapCache. This is an accounting leak and should be handled.

    SwapCache accounting is done as follows.

    charge (anon)
    - charged when it's mapped.
    (because of readahead, charge at add_to_swap_cache() is not sane)
    uncharge (anon)
    - uncharged when it's dropped from swapcache and fully unmapped.
    means it's not uncharged at unmap.
    Note: delete from swap cache at swap-in is done after rmap information
    is established.
    charge (shmem)
    - charged at swap-in. this prevents charge at add_to_page_cache().

    uncharge (shmem)
    - uncharged when it's dropped from swapcache and not on shmem's
    radix-tree.

    At migration, the check against the 'old page' is modified to handle
    shmem.

    Compared to the old version discussed (which caused trouble), we have
    the advantages of
    - the PCG_USED bit.
    - simple migration handling.

    So the situation is much easier than several months ago, maybe.

    [hugh@veritas.com: memcg: handle swap caches build fix]
    Reviewed-by: Daisuke Nishimura
    Tested-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

07 Jan, 2009

3 commits

  • Remove gfp_mask argument from add_to_swap(): it's misleading because its
    only caller, shrink_page_list(), is not atomic at that point; and in due
    course (implementing discard) we'll sometimes want to allocate some memory
    with GFP_NOIO (as is used in swap_writepage) when allocating swap.

    No change to the gfp_mask passed down to add_to_swap_cache(): still use
    __GFP_HIGH without __GFP_WAIT (with nomemalloc and nowarn as before):
    though it's not obvious if that's the best combination to ask for here.

    Signed-off-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Cc: Rik van Riel
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • remove_exclusive_swap_page(): its problem is in living up to its name.

    It doesn't matter if someone else has a reference to the page (raised
    page_count); it doesn't matter if the page is mapped into userspace
    (raised page_mapcount - though that hints it may be worth keeping the
    swap): all that matters is that there be no more references to the swap
    (and no writeback in progress).

    swapoff (try_to_unuse) has been removing pages from swapcache for years,
    with no concern for page count or page mapcount, and we used to have a
    comment in lookup_swap_cache() recognizing that: if you go for a page of
    swapcache, you'll get the right page, but it could have been removed from
    swapcache by the time you get page lock.

    So, give up asking for exclusivity: get rid of
    remove_exclusive_swap_page(), and remove_exclusive_swap_page_ref() and
    remove_exclusive_swap_page_count() which were spawned for the recent LRU
    work: replace them by the simpler try_to_free_swap() which just checks
    page_swapcount().

    Similarly, remove the page_count limitation from free_swap_and_cache(),
    but assume that it's worth holding on to the swap if the page is mapped
    and swap is nowhere near full. Add a vm_swap_full() test in
    free_swap_cache()? It would be consistent, but I think we probably have
    enough for now.
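    The replacement helper is small enough to show in outline (a sketch of
    try_to_free_swap()):

    int try_to_free_swap(struct page *page)
    {
            VM_BUG_ON(!PageLocked(page));

            if (!PageSwapCache(page))
                    return 0;
            if (PageWriteback(page))
                    return 0;
            if (page_swapcount(page))       /* swap still referenced elsewhere */
                    return 0;

            delete_from_swap_cache(page);
            SetPageDirty(page);             /* data now lives only in this page */
            return 1;
    }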

    Signed-off-by: Hugh Dickins
    Cc: Lee Schermerhorn
    Cc: Rik van Riel
    Cc: Nick Piggin
    Cc: KAMEZAWA Hiroyuki
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The swap code is over-provisioned with BUG_ONs on assorted page flags,
    mostly dating back to 2.3. They're good documentation, and guard against
    developer error, but a waste of space on most systems: change them to
    VM_BUG_ONs, conditional on CONFIG_DEBUG_VM. Just delete the PagePrivate
    ones: they're later, from 2.5.69, but even less interesting now.

    Signed-off-by: Hugh Dickins
    Reviewed-by: Christoph Lameter
    Cc: Nick Piggin
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

20 Oct, 2008

4 commits

  • Setting and clearing the page locked when inserting it into swapcache /
    pagecache when it has no other references can use non-atomic page flags
    operations because no other CPU may be operating on it at this time.

    This saves one atomic operation when inserting a page into pagecache.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Swapin_readahead can read in a lot of data that the processes in memory
    never need. Adding swap cache pages to the inactive list prevents them
    from putting too much pressure on the working set.

    This has the potential to help the programs that are already in memory,
    but it could also be a disadvantage to processes that are trying to get
    swapped in.

    Signed-off-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Split the LRU lists in two, one set for pages that are backed by real file
    systems ("file") and one for pages that are backed by memory and swap
    ("anon"). The latter includes tmpfs.

    The advantage of doing this is that the VM will not have to scan over lots
    of anonymous pages (which we generally do not want to swap out), just to
    find the page cache pages that it should evict.

    This patch has the infrastructure and a basic policy to balance how much
    we scan the anon lists and how much we scan the file lists. The big
    policy changes are in separate patches.

    [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
    [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
    [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
    [hugh@veritas.com: memcg swapbacked pages active]
    [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
    [akpm@linux-foundation.org: fix /proc/vmstat units]
    [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
    [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
    [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Define page_file_cache() function to answer the question:
    is page backed by a file?

    Originally part of Rik van Riel's split-lru patch. Extracted to make
    available for other, independent reclaim patches.

    Moved inline function to linux/mm_inline.h where it will be needed by
    subsequent "split LRU" and "noreclaim" patches.

    Unfortunately this needs to use a page flag, since the PG_swapbacked state
    needs to be preserved all the way to the point where the page is last
    removed from the LRU. Trying to derive the status from other info in the
    page resulted in wrong VM statistics in earlier split VM patchsets.

    The total number of page flags in use on a 32 bit machine after this patch
    is 19.

    [akpm@linux-foundation.org: fix up out-of-order merge fallout]
    [hugh@veritas.com: splitlru: shmem_getpage SetPageSwapBacked sooner]
    Signed-off-by: Rik van Riel
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: MinChan Kim
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

21 Aug, 2008

1 commit

  • Adjust show_swap_cache_info() to show "Free swap" as a signed long: the
    signed format is preferable, because during swapoff nr_swap_pages can
    legitimately go negative, so it makes more sense this way (it used to
    be shown redundantly, once as signed and once as unsigned).

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

05 Aug, 2008

1 commit

  • Converting page lock to new locking bitops requires a change of page flag
    operation naming, so we might as well convert it to something nicer
    (!TestSetPageLocked_Lock => trylock_page, SetPageLocked => set_page_locked).

    This also facilitates lockdeping of page lock.

    Signed-off-by: Nick Piggin
    Acked-by: KOSAKI Motohiro
    Acked-by: Peter Zijlstra
    Acked-by: Andrew Morton
    Acked-by: Benjamin Herrenschmidt
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

27 Jul, 2008

3 commits

  • Every arch implements its own show_mem() function. Most of them share
    quite some code, some of them are completely identical.

    This series implements a generic version of this function and migrates
    almost all architectures to it.

    This patch:

    Most show_mem() implementations calculate the amount of pages within
    the swapcache every time. Move the output to a more appropriate place
    and use the already-available total_swapcache_pages variable.

    Signed-off-by: Johannes Weiner
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Haavard Skinnemoen
    Cc: Bryan Wu
    Cc: Chris Zankel
    Cc: Ingo Molnar
    Cc: Jeff Dike
    Cc: David S. Miller
    Cc: Paul Mundt
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: David Howells
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Yoshinori Sato
    Cc: Ralf Baechle
    Cc: Greg Ungerer
    Cc: Geert Uytterhoeven
    Cc: Roman Zippel
    Cc: Hirokazu Takata
    Cc: Mikael Starvik
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • mapping->tree_lock has no read lockers, so convert the lock from an
    rwlock to a spinlock.

    Signed-off-by: Nick Piggin
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Hugh Dickins
    Cc: "Paul E. McKenney"
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • If we can be sure that elevating the page_count on a pagecache page will
    pin it, we can speculatively run this operation, and subsequently check to
    see if we hit the right page rather than relying on holding a lock or
    otherwise pinning a reference to the page.

    This can be done if get_page/put_page behaves consistently throughout the
    whole tree (ie. if we "get" the page after it has been used for something
    else, we must be able to free it with a put_page).

    Actually, there is a period where the count behaves differently: when the
    page is free or if it is a constituent page of a compound page. We need
    an atomic_inc_not_zero operation to ensure we don't try to grab the page
    in either case.

    This patch introduces the core locking protocol to the pagecache (ie.
    adds page_cache_get_speculative, and tweaks some update-side code to make
    it work).

    Thanks to Hugh for pointing out an improvement to the algorithm setting
    page_count to zero when we have control of all references, in order to
    hold off speculative getters.
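    The lookup side of the protocol, in outline (simplified; the real code
    also has to handle radix-tree retry entries):

    rcu_read_lock();
    repeat:
            page = radix_tree_lookup(&mapping->page_tree, offset);
            if (page && !page_cache_get_speculative(page))
                    goto repeat;            /* refcount was zero: page is being freed */
            /* re-check: the slot may have been reused while we took the ref */
            if (page && page != radix_tree_lookup(&mapping->page_tree, offset)) {
                    page_cache_release(page);
                    goto repeat;
            }
    rcu_read_unlock();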

    [kamezawa.hiroyu@jp.fujitsu.com: fix migration_entry_wait()]
    [hugh@veritas.com: fix add_to_page_cache]
    [akpm@linux-foundation.org: repair a comment]
    Signed-off-by: Nick Piggin
    Cc: Jeff Garzik
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Hugh Dickins
    Cc: "Paul E. McKenney"
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

30 Apr, 2008

1 commit

  • Add a new BDI capability flag: BDI_CAP_NO_ACCT_WB. If this flag is
    set, then don't update the per-bdi writeback stats from
    test_set_page_writeback() and test_clear_page_writeback().

    Misc cleanups:

    - convert bdi_cap_writeback_dirty() and friends to static inline functions
    - create a flag that includes all three dirty/writeback related flags,
    since almost all users will want to have them together

    Signed-off-by: Miklos Szeredi
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

20 Mar, 2008

1 commit

  • Fix various kernel-doc notation in mm/:

    filemap.c: add function short description; convert 2 to kernel-doc
    fremap.c: change parameter 'prot' to @prot
    pagewalk.c: change "-" in function parameters to ":"
    slab.c: fix short description of kmem_ptr_validate()
    swap.c: fix description & parameters of put_pages_list()
    swap_state.c: fix function parameters
    vmalloc.c: change "@returns" to "Returns:" since that is not a parameter

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap