09 Dec, 2011

1 commit


07 Nov, 2011

1 commit

  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of
    net: inet_timewait_sock doesnt need
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     

01 Nov, 2011

1 commit

  • unmap_and_move() is one big, messy function. Clean it up.

    Signed-off-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

31 Oct, 2011

1 commit


20 Oct, 2011

1 commit

  • I don't usually pay much attention to the stale "? " addresses in
    stack backtraces, but this lucky report from Pawel Sikora hints that
    mremap's move_ptes() has inadequate locking against page migration.

    3.0 BUG_ON(!PageLocked(p)) in migration_entry_to_page():
    kernel BUG at include/linux/swapops.h:105!
    RIP: 0010:[] []
    migration_entry_wait+0x156/0x160
    [] handle_pte_fault+0xae1/0xaf0
    [] ? __pte_alloc+0x42/0x120
    [] ? do_huge_pmd_anonymous_page+0xab/0x310
    [] handle_mm_fault+0x181/0x310
    [] ? vma_adjust+0x537/0x570
    [] do_page_fault+0x11d/0x4e0
    [] ? do_mremap+0x2d5/0x570
    [] page_fault+0x1f/0x30

    mremap's down_write of mmap_sem, together with i_mmap_mutex or lock,
    and pagetable locks, were good enough before page migration (with its
    requirement that every migration entry be found) came in, and enough
    while migration always held mmap_sem; but not enough nowadays, when
    there's memory hotremove and compaction.

    The danger is that move_ptes() lets a migration entry dodge around
    behind remove_migration_pte()'s back, so it's in the old location when
    looking at the new, then in the new location when looking at the old.

    Either mremap's move_ptes() must additionally take anon_vma lock(), or
    migration's remove_migration_pte() must stop peeking for is_swap_entry()
    before it takes pagetable lock.

    Consensus chooses the latter: we prefer to add overhead to migration
    than to mremapping, which gets used by JVMs and by exec stack setup.
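
    A sketch of the chosen fix (illustrative, not the literal hunk):
    remove_migration_pte() only inspects the pte after taking the page table
    lock, so a migration entry cannot slip past it.

    ptep = pte_offset_map(pmd, addr);
    ptl = pte_lockptr(mm, pmd);
    spin_lock(ptl);
    pte = *ptep;
    if (!is_swap_pte(pte))
        goto unlock;            /* nothing to restore here */
    entry = pte_to_swp_entry(pte);
    if (!is_migration_entry(entry) || migration_entry_to_page(entry) != old)
        goto unlock;
    /* ... install the pte for the new page under the same lock ... */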

    Reported-and-tested-by: Paweł Sikora
    Signed-off-by: Hugh Dickins
    Acked-by: Andrea Arcangeli
    Acked-by: Mel Gorman
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

17 Jun, 2011

1 commit

  • Swapcache reaches this code path in migrate_page_move_mapping(): it is
    accounted as NR_FILE_PAGES but it is not accounted as NR_SHMEM.

    Hugh pointed out we must use PageSwapCache instead of comparing
    mapping to &swapper_space, to avoid build failure with CONFIG_SWAP=n.
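
    A minimal sketch of the accounting fix in migrate_page_move_mapping()
    (assuming, as described, that the NR_SHMEM update sits in this path):

    /* swapcache is NR_FILE_PAGES but not NR_SHMEM: skip the shmem counters */
    if (!PageSwapCache(page) && PageSwapBacked(page)) {
        __dec_zone_page_state(page, NR_SHMEM);
        __inc_zone_page_state(newpage, NR_SHMEM);
    }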

    Signed-off-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

25 May, 2011

1 commit

  • Convert page_lock_anon_vma() over to use refcounts. This is done to
    prepare for the conversion of anon_vma from spinlock to mutex.

    Sadly this increases the cost of page_lock_anon_vma() from one to two
    atomics; a follow-up patch addresses this, so let's keep it simple for now.

    Signed-off-by: Peter Zijlstra
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: Mel Gorman
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

31 Mar, 2011

1 commit


24 Mar, 2011

1 commit

  • Remove the initialization of a variable in callers of the memory cgroup
    functions. It is effectively the return value of the memcg function, yet
    it is initialized by the caller.

    Some memory cgroup code uses the following style to carry the result of a
    start function over to the end function, to avoid races.

    mem_cgroup_start_A(&(*ptr))
    /* Something very complicated can happen here. */
    mem_cgroup_end_A(*ptr)

    In some calls, *ptr has to be initialized to NULL by the caller, which is
    ugly. This patch makes the _start function initialize *ptr instead.
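
    For illustration, continuing the hypothetical mem_cgroup_start_A() /
    mem_cgroup_end_A() pair from above, the change moves the NULL
    initialization out of every caller and into the start function:

    /* before: every caller had to remember this */
    struct mem_cgroup *memcg = NULL;
    mem_cgroup_start_A(&memcg);
    /* something very complicated can happen here */
    mem_cgroup_end_A(memcg);

    /* after: the start function initializes it */
    void mem_cgroup_start_A(struct mem_cgroup **ptr)
    {
        *ptr = NULL;
        /* ... rest of the start logic ... */
    }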

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Johannes Weiner
    Acked-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

23 Mar, 2011

3 commits

  • __GFP_NO_KSWAPD allocations are usually very expensive and not mandatory
    to succeed, as they have a graceful fallback. Waiting for I/O in them
    tends to be overkill in terms of latency, so we can reduce their latency
    by disabling synchronous migration.

    Unfortunately, even with async migration it's still possible for the
    process to be blocked waiting for a request slot (e.g. get_request_wait
    in the block layer) when ->writepage is called. To prevent
    __GFP_NO_KSWAPD allocations from blocking, this patch prevents ->writepage
    from being called on dirty page cache during asynchronous migration.

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=31142
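
    Conceptually (a sketch, not the exact hunk), the pageout step in migration
    now bails out in asynchronous mode instead of calling ->writepage on a
    dirty page:

    if (PageDirty(page)) {
        /* only write the page back in full synchronous migration */
        if (!sync)
            return -EBUSY;
        /* otherwise fall through to ->writepage as before */
    }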

    [mel@csn.ul.ie: Avoid writebacks for NFS, retry locked pages, use bool]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Mel Gorman
    Cc: Arthur Marsh
    Cc: Clemens Ladisch
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Reported-by: Alex Villacis Lasso
    Tested-by: Alex Villacis Lasso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • The normal code pattern used in the kernel is: get/put.

    Signed-off-by: Peter Zijlstra
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • This function basically does:

    remove_from_page_cache(old);
    page_cache_release(old);
    add_to_page_cache_locked(new);

    Except it does this atomically, so there's no possibility for the "add" to
    fail because of a race.

    If memory cgroups are enabled, then the memory cgroup charge is also moved
    from the old page to the new.

    This function is currently used by fuse to move pages into the page cache
    on read, instead of copying the page contents.
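
    A sketch of the caller's side (the fallback label and gfp choice here are
    assumptions, not the actual fuse code):

    /* atomically swap oldpage for newpage in the mapping's radix tree */
    err = replace_page_cache_page(oldpage, newpage, GFP_KERNEL);
    if (err)
        goto out_fallback;      /* e.g. copy the page contents instead */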

    [minchan.kim@gmail.com: add freepage() hook to replace_page_cache_page()]
    Signed-off-by: Miklos Szeredi
    Acked-by: Rik van Riel
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

26 Feb, 2011

1 commit

  • The move_pages() usage of find_task_by_vpid() requires rcu_read_lock() to
    prevent free_pid() from reclaiming the pid.
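
    For illustration, the resulting pattern is roughly the following
    (surrounding code in move_pages() simplified; the task pinning detail is
    an assumption):

    rcu_read_lock();
    task = pid ? find_task_by_vpid(pid) : current;
    if (!task) {
        rcu_read_unlock();
        return -ESRCH;
    }
    get_task_struct(task);      /* keep the task alive past the RCU section */
    rcu_read_unlock();
    /* ... use task, then put_task_struct(task) ... */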

    Without this patch, RCU warnings are printed in v2.6.38-rc4 move_pages()
    with:

    CONFIG_LOCKUP_DETECTOR=y
    CONFIG_PREEMPT=y
    CONFIG_LOCKDEP=y
    CONFIG_PROVE_LOCKING=y
    CONFIG_PROVE_RCU=y

    Previously, migrate_pages() went through a similar transformation
    replacing usage of tasklist_lock with rcu read lock:

    commit 55cfaa3cbdd29c4919ecb5fb8965c310f357e48c
    Author: Zeng Zhaoming
    Date: Thu Dec 2 14:31:13 2010 -0800

    mm/mempolicy.c: add rcu read lock to protect pid structure

    commit 1e50df39f6e2c3a4a3394df62baa8a213df16c54
    Author: KOSAKI Motohiro
    Date: Thu Jan 13 15:46:14 2011 -0800

    mempolicy: remove tasklist_lock from migrate_pages

    Signed-off-by: Greg Thelen
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: "Paul E. McKenney"
    Cc: Tetsuo Handa
    Cc: Sergey Senozhatsky
    Cc: Oleg Nesterov
    Cc: Zeng Zhaoming
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     

03 Feb, 2011

2 commits

  • If migrate_huge_page(), called by memory-failure, fails, it calls
    put_page() itself to drop the page reference, while the caller of
    migrate_huge_page() also calls putback_lru_pages(). The page can thus be
    freed twice, corrupting it for whoever still holds it.

    In addition, cleaning up the pages in the caller is consistent with the
    behaviour of migrate_pages() since cf608ac19c ("mm: compaction: fix
    COMPACTPAGEFAILED counting").

    Signed-off-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • In some cases migrate_pages() could return zero while still leaving a few
    pages on the pagelist (and some callers wouldn't notice they have to call
    putback_lru_pages() after commit cf608ac19c9 ("mm: compaction: fix
    COMPACTPAGEFAILED counting")).

    Add one missing putback_lru_pages() not added by commit cf608ac19c95 ("mm:
    compaction: fix COMPACTPAGEFAILED counting").

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Minchan Kim
    Reviewed-by: Minchan Kim
    Cc: Christoph Lameter
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

26 Jan, 2011

1 commit

  • Callers of migrate_pages() should call putback_lru_pages() to return
    isolated pages to the LRU or to the free list. The current comment is
    rather confusing: it says the caller always has to call it.

    It is clearer to point out that the caller only has to call it if
    migrate_pages()'s return value isn't zero.
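
    Sketched caller pattern (the allocation callback and the exact
    migrate_pages() argument list are illustrative and vary by kernel
    version):

    rc = migrate_pages(&pagelist, new_page, 0, false, true);
    if (rc)
        /* whatever failed to migrate is still on the list: put it back */
        putback_lru_pages(&pagelist);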

    Signed-off-by: Minchan Kim
    Cc: Christoph Lameter
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

14 Jan, 2011

8 commits

  • In the current implementation mem_cgroup_end_migration() decides whether
    the page migration has succeeded or not by checking "oldpage->mapping".

    But if we are trying to migrate a shmem swapcache page, its page->mapping
    is NULL from the beginning, so the check is invalid. As a result,
    mem_cgroup_end_migration() assumes the migration has succeeded even when
    it hasn't, so "newpage" is freed without being uncharged.

    This patch fixes it by passing mem_cgroup_end_migration() the result of
    the page migration.
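
    Sketch of the new calling convention (parameter names and order here are
    an assumption, not a quote of the patch):

    /* pass the migration result instead of guessing from oldpage->mapping */
    mem_cgroup_end_migration(mem, oldpage, newpage, migration_ok);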

    Signed-off-by: Daisuke Nishimura
    Reviewed-by: Minchan Kim
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Cc: Minchan Kim
    Reviewed-by: Johannes Weiner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • 2.6.37 added an unmap_and_move_huge_page() for memory failure recovery,
    but its anon_vma handling was still based around the 2.6.35 conventions.
    Update it to use page_lock_anon_vma, get_anon_vma, page_unlock_anon_vma,
    drop_anon_vma in the same way as we're now changing unmap_and_move().

    I don't particularly like to propose this for stable when I've not seen
    its problems in practice nor tested the solution: but it's clearly out of
    synch at present.

    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: "Jun'ichi Nomura"
    Cc: Andi Kleen
    Cc: [2.6.37, 2.6.36]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Increased usage of page migration in mmotm reveals that the anon_vma
    locking in unmap_and_move() has been deficient since 2.6.36 (or even
    earlier). Review at the time of f18194275c39835cb84563500995e0d503a32d9a
    ("mm: fix hang on anon_vma->root->lock") missed the issue here: the
    anon_vma to which we get a reference may already have been freed back to
    its slab (it is in use when we check page_mapped, but that can change),
    and so its anon_vma->root may be switched at any moment by reuse in
    anon_vma_prepare.

    Perhaps we could fix that with a get_anon_vma_unless_zero(), but let's
    not: just rely on page_lock_anon_vma() to do all the hard thinking for us,
    then we don't need any rcu read locking over here.

    In removing the rcu_unlock label: since PageAnon is a bit in
    page->mapping, it's impossible for a !page->mapping page to be anon; but
    insert VM_BUG_ON in case the implementation ever changes.
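
    A simplified sketch of the resulting anon_vma handling in
    unmap_and_move() (the swapcache special case and error labels are
    omitted):

    if (PageAnon(page)) {
        /*
         * page_lock_anon_vma() does the hard thinking: it only returns
         * an anon_vma that is still alive, so no rcu_read_lock() needed.
         */
        anon_vma = page_lock_anon_vma(page);
        if (anon_vma) {
            get_anon_vma(anon_vma);     /* hold it across the migration */
            page_unlock_anon_vma(anon_vma);
        }
    }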

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Hugh Dickins
    Reviewed-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: "Jun'ichi Nomura"
    Cc: Andi Kleen
    Cc: [2.6.37, 2.6.36]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • …lot during file page migration

    migrate_pages() -> unmap_and_move() only calls rcu_read_lock() for
    anonymous pages, as introduced by git commit
    989f89c57e6361e7d16fbd9572b5da7d313b073d ("fix rcu_read_lock() in page
    migraton"). The point of the RCU protection there is part of getting a
    stable reference to anon_vma and is only held for anon pages as file pages
    are locked which is sufficient protection against freeing.

    However, while a file page's mapping is being migrated, the radix tree is
    double checked to ensure it is the expected page. This uses
    radix_tree_deref_slot() -> rcu_dereference() without the RCU lock held,
    triggering the following warning.

    [ 173.674290] ===================================================
    [ 173.676016] [ INFO: suspicious rcu_dereference_check() usage. ]
    [ 173.676016] ---------------------------------------------------
    [ 173.676016] include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!
    [ 173.676016]
    [ 173.676016] other info that might help us debug this:
    [ 173.676016]
    [ 173.676016]
    [ 173.676016] rcu_scheduler_active = 1, debug_locks = 0
    [ 173.676016] 1 lock held by hugeadm/2899:
    [ 173.676016] #0: (&(&inode->i_data.tree_lock)->rlock){..-.-.}, at: [<c10e3d2b>] migrate_page_move_mapping+0x40/0x1ab
    [ 173.676016]
    [ 173.676016] stack backtrace:
    [ 173.676016] Pid: 2899, comm: hugeadm Not tainted 2.6.37-rc5-autobuild
    [ 173.676016] Call Trace:
    [ 173.676016] [<c128cc01>] ? printk+0x14/0x1b
    [ 173.676016] [<c1063502>] lockdep_rcu_dereference+0x7d/0x86
    [ 173.676016] [<c10e3db5>] migrate_page_move_mapping+0xca/0x1ab
    [ 173.676016] [<c10e41ad>] migrate_page+0x23/0x39
    [ 173.676016] [<c10e491b>] buffer_migrate_page+0x22/0x107
    [ 173.676016] [<c10e48f9>] ? buffer_migrate_page+0x0/0x107
    [ 173.676016] [<c10e425d>] move_to_new_page+0x9a/0x1ae
    [ 173.676016] [<c10e47e6>] migrate_pages+0x1e7/0x2fa

    This patch introduces radix_tree_deref_slot_protected() which calls
    rcu_dereference_protected(). Users of it must pass in the
    mapping->tree_lock that is protecting this dereference. Holding the tree
    lock protects against parallel updaters of the radix tree meaning that
    rcu_dereference_protected is allowable.
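
    Roughly how the double check in migrate_page_move_mapping() looks with
    the new helper (a sketch; expected_count comes from the surrounding
    function):

    spin_lock_irq(&mapping->tree_lock);

    pslot = radix_tree_lookup_slot(&mapping->page_tree, page_index(page));

    if (page_count(page) != expected_count ||
        radix_tree_deref_slot_protected(pslot, &mapping->tree_lock) != page) {
        spin_unlock_irq(&mapping->tree_lock);
        return -EAGAIN;
    }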

    [akpm@linux-foundation.org: remove unneeded casts]
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: Milton Miller <miltonm@bga.com>
    Cc: Nick Piggin <nickpiggin@yahoo.com.au>
    Cc: Wu Fengguang <fengguang.wu@intel.com>
    Cc: <stable@kernel.org> [2.6.37.early]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     
  • No pmd_trans_huge should ever materialize in migration ptes areas, because
    we split the hugepage before migration ptes are instantiated.
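
    One way to express that invariant in remove_migration_pte() (an
    illustrative guard; whether the actual patch bails out or asserts is not
    shown here):

    pmd = pmd_offset(pud, addr);
    if (!pmd_present(*pmd))
        goto out;
    VM_BUG_ON(pmd_trans_huge(*pmd));    /* the hugepage was split beforehand */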

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • With the introduction of the boolean sync parameter, the API looks a
    little inconsistent as offlining is still an int. Convert offlining to a
    bool for the sake of being tidy.

    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • …ompaction in the faster path

    Migration synchronously waits for writeback if the initial pass fails.
    Callers of memory compaction do not necessarily want this behaviour if the
    caller is latency sensitive or expects that synchronous migration is not
    going to have a significantly better success rate.

    This patch adds a sync parameter to migrate_pages() allowing the caller to
    indicate if wait_on_page_writeback() is allowed within migration or not.
    For reclaim/compaction, try_to_compact_pages() is first called
    asynchronously, direct reclaim runs and then try_to_compact_pages() is
    called synchronously as there is a greater expectation that it'll succeed.
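
    Inside migration the effect is roughly the following (sketch; the label
    name is illustrative):

    if (PageWriteback(page)) {
        if (!sync)
            goto uncharge;              /* async mode: don't wait for I/O */
        wait_on_page_writeback(page);
    }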

    [akpm@linux-foundation.org: build/merge fix]
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Rik van Riel <riel@redhat.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Andy Whitcroft <apw@shadowen.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     
  • Lumpy reclaim is disruptive. It reclaims a large number of pages and
    ignores the age of the pages it reclaims. This can incur significant
    stalls and potentially increase the number of major faults.

    Compaction has reached the point where it is considered reasonably stable
    (meaning it has passed a lot of testing) and is a potential candidate for
    displacing lumpy reclaim. This patch introduces an alternative to lumpy
    reclaim, called reclaim/compaction, for use when compaction is available.
    The basic operation is very simple - instead of selecting a contiguous
    range of pages to reclaim, a number of order-0 pages are reclaimed and
    compaction is later performed by either kswapd (compact_zone_order()) or
    direct compaction (__alloc_pages_direct_compact()).

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: use conventional task_struct naming]
    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

23 Dec, 2010

1 commit


27 Oct, 2010

3 commits

  • The vma returned by find_vma does not necessarily include the target
    address. If this happens, the code tries to follow a page outside of any
    vma and returns ENOENT instead of EFAULT.
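
    The fix amounts to the usual find_vma() range check, roughly (error
    propagation simplified):

    vma = find_vma(mm, addr);
    if (!vma || addr < vma->vm_start || !vma_migratable(vma)) {
        err = -EFAULT;          /* address is not covered by any vma */
        goto set_status;
    }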

    Signed-off-by: Gleb Natapov
    Acked-by: Christoph Lameter
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gleb Natapov
     
  • Presently update_nr_listpages() doesn't have a role: the list passed in
    is always empty just after calling migrate_pages(), because
    migrate_pages() cleans up the pages which failed to migrate before
    returning, as of aaa994b3.

    [PATCH] page migration: handle freeing of pages in migrate_pages()

    Do not leave pages on the lists passed to migrate_pages(). Seems that we will
    not need any postprocessing of pages. This will simplify the handling of
    pages by the callers of migrate_pages().

    At the time, we thought we wouldn't need any postprocessing of pages. But
    the situation has changed: compaction needs to know how many pages failed
    to migrate for the COMPACTPAGEFAILED stat.

    This patch makes it a new rule that callers of migrate_pages() call
    putback_lru_pages(), so the caller cleans up the lists and gets a chance
    to postprocess the pages. [suggested by Christoph Lameter]

    Signed-off-by: Minchan Kim
    Cc: Hugh Dickins
    Cc: Andi Kleen
    Reviewed-by: Mel Gorman
    Reviewed-by: Wu Fengguang
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • This removes more dead code that was somehow missed by commit 0d99519efef
    (writeback: remove unused nonblocking and congestion checks). There are
    no behavior changes except for the removal of two entries from one of the
    ext4 tracing interfaces.

    The nonblocking checks in ->writepages are no longer used because the
    flusher now prefers to block on get_request_wait() rather than skip inodes
    on IO congestion. The latter would lead to more seeky IO.

    The nonblocking checks in ->writepage are no longer used because it's
    redundant with the WB_SYNC_NONE check.

    We no longer set ->nonblocking in VM page-out and page migration, because
    a) it's effectively redundant with WB_SYNC_NONE in the current code, and
    b) its old semantic of "don't get stuck on request queues" is a
    misbehavior: it would skip some dirty inodes on congestion and page out
    others, which is unfair in terms of LRU age.

    Inspired by Christoph Hellwig. Thanks!

    Signed-off-by: Wu Fengguang
    Cc: Theodore Ts'o
    Cc: David Howells
    Cc: Sage Weil
    Cc: Steve French
    Cc: Chris Mason
    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

11 Oct, 2010

1 commit

  • 31bit s390 doesn't have huge pages and failed with:

    > mm/migrate.c: In function 'remove_migration_pte':
    > mm/migrate.c:143:3: error: implicit declaration of function 'pte_mkhuge'
    > mm/migrate.c:143:7: error: incompatible types when assigning to type 'pte_t' from type 'int'

    Put that code inside an ifdef.
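
    The shape of the fix, approximately:

    #ifdef CONFIG_HUGETLB_PAGE
        if (PageHuge(new))
            pte = pte_mkhuge(pte);
    #endif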

    Reported by Heiko Carstens

    Signed-off-by: Andi Kleen

    Andi Kleen
     

08 Oct, 2010

1 commit

  • This patch extends page migration code to support hugepage migration.
    One of the potential users of this feature is soft offlining, which is
    triggered by corrected memory errors (added by the next patch).

    Todo:
    - there are other users of page migration, such as memory policy,
    memory hotplug and memory compaction.
    They are not ready for hugepage support yet.

    ChangeLog since v4:
    - define migrate_huge_pages()
    - remove changes on isolation/putback_lru_page()

    ChangeLog since v2:
    - refactor isolate/putback_lru_page() to handle hugepage
    - add comment about race on unmap_and_move_huge_page()

    ChangeLog since v1:
    - divide migration code path for hugepage
    - define routine checking migration swap entry for hugetlb
    - replace "goto" with "if/else" in remove_migration_pte()

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Jun'ichi Nomura
    Acked-by: Mel Gorman
    Signed-off-by: Andi Kleen

    Naoya Horiguchi
     

10 Aug, 2010

3 commits

  • KSM reference counts can cause an anon_vma to exist after the process it
    belongs to has already exited. Because the anon_vma lock now lives in
    the root anon_vma, we need to ensure that the root anon_vma stays around
    until after all the "child" anon_vmas have been freed.

    The obvious way to do this is to have a "child" anon_vma take a reference
    to the root in anon_vma_fork. When the anon_vma is freed at munmap or
    process exit, we drop the refcount in anon_vma_unlink and possibly free
    the root anon_vma.

    The KSM anon_vma reference count function also needs to be modified to
    deal with the possibility of freeing 2 levels of anon_vma. The easiest
    way to do this is to break out the KSM magic and make it generic.

    When compiling without CONFIG_KSM, this code is compiled out.
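
    Conceptually (field and helper names here are hypothetical, not the exact
    patch):

    /* every "child" anon_vma pins the root it hangs off */
    child->root = parent->root;
    atomic_inc(&child->root->refcount);

    /* ... and dropping the last child reference may free the root too */
    if (atomic_dec_and_test(&child->root->refcount))
        anon_vma_free(child->root);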

    Signed-off-by: Rik van Riel
    Tested-by: Larry Woodman
    Acked-by: Larry Woodman
    Reviewed-by: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Acked-by: Linus Torvalds
    Tested-by: Dave Young
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Always (and only) lock the root (oldest) anon_vma whenever we do something
    in an anon_vma. The recently introduced anon_vma scalability is due to
    the rmap code scanning only the VMAs that need to be scanned. Many common
    operations still took the anon_vma lock on the root anon_vma, so always
    taking that lock is not expected to introduce any scalability issues.

    However, always taking the same lock does mean we only need to take one
    lock, which means rmap_walk on pages from any anon_vma in the vma is
    excluded from occurring during an munmap, expand_stack or other operation
    that needs to exclude rmap_walk and similar functions.

    Also add the proper locking to vma_adjust.

    Signed-off-by: Rik van Riel
    Tested-by: Larry Woodman
    Acked-by: Larry Woodman
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Substitute a direct call of spin_lock(anon_vma->lock) with an inline
    function doing exactly the same.
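
    The wrapper is essentially (a sketch matching the description; whether it
    also handles a NULL anon_vma is omitted):

    static inline void anon_vma_lock(struct vm_area_struct *vma)
    {
        spin_lock(&vma->anon_vma->lock);
    }

    static inline void anon_vma_unlock(struct vm_area_struct *vma)
    {
        spin_unlock(&vma->anon_vma->lock);
    }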

    This makes it easier to do the substitution to the root anon_vma lock in a
    following patch.

    We will deal with the handful of special locks (nested, dec_and_lock, etc)
    separately.

    Signed-off-by: Rik van Riel
    Acked-by: Mel Gorman
    Acked-by: KAMEZAWA Hiroyuki
    Tested-by: Larry Woodman
    Acked-by: Larry Woodman
    Reviewed-by: Minchan Kim
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

28 May, 2010

1 commit

  • FILE_MAPPED per memcg of migrated file cache is not properly updated,
    because our hook in page_add_file_rmap() can't know to which memcg
    FILE_MAPPED should be counted.

    Basically, this patch is for fixing the bug but includes some big changes
    to fix up other messes.

    Now, when migrating a mapped file page, events happen in the following sequence.

    1. allocate a new page.
    2. get memcg of an old page.
    3. charge against the new page before migration. But at this point,
    there are no changes to the new page's page_cgroup and no commit for
    the charge. (IOW, the PCG_USED bit is not set.)
    4. page migration replaces radix-tree, old-page and new-page.
    5. page migration remaps the new page if the old page was mapped.
    6. Here, the new page is unlocked.
    7. memcg commits the charge for newpage, Mark the new page's page_cgroup
    as PCG_USED.

    Because "commit" happens after page-remap, we can count FILE_MAPPED
    at "5", because we should avoid to trust page_cgroup->mem_cgroup.
    if PCG_USED bit is unset.
    (Note: memcg's LRU removal code does that but LRU-isolation logic is used
    for helping it. When we overwrite page_cgroup->mem_cgroup, page_cgroup is
    not on LRU or page_cgroup->mem_cgroup is NULL.)

    We can lose file_mapped accounting information at 5 because FILE_MAPPED
    is updated only when mapcount changes 0->1. So we should catch it.

    BTW, historically, the above implementation comes from the handling of
    migration failure for anonymous pages. Because we charge both the old
    page and the new page with mapcount=0, we can't catch
    - the page really being freed before remap.
    - migration failing but the page being freed before remap
    or other corner cases.

    New migration sequence with memcg is:

    1. allocate a new page.
    2. mark PageCgroupMigration to the old page.
    3. charge against a new page onto the old page's memcg. (here, new page's pc
    is marked as PageCgroupUsed.)
    4. page migration replaces radix-tree, page table, etc...
    5. At remapping, the new page's page_cgroup is now marked as "USED".
    We can catch the 0->1 event and FILE_MAPPED will be properly updated.

    And after this page is unlocked, a SWAPOUT event, or the page being freed
    by unmap(), can be caught.

    7. Clear PageCgroupMigration of the old page.

    So, FILE_MAPPED will be correctly updated.

    Then, what is the MIGRATION flag for?
    Without it, on migration failure we may have to charge the old page again
    because it may be fully unmapped. "Charging" means we have to dive into
    memory reclaim or something similarly complicated, so it's better to
    avoid charging it again. Before this patch, __commit_charge() handled
    both the old and the new page and fixed everything up, but this technique
    has some racy conditions around FILE_MAPPED, SWAPOUT, etc.
    Now the kernel uses the MIGRATION flag and doesn't uncharge the old page
    until the end of migration.

    I hope this change makes memcg's page migration much simpler. This page
    migration has caused several troubles; it is worth adding a flag for
    simplification.

    Reviewed-by: Daisuke Nishimura
    Tested-by: Daisuke Nishimura
    Reported-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    akpm@linux-foundation.org
     

25 May, 2010

6 commits

  • This patch is the core of a mechanism which compacts memory in a zone by
    relocating movable pages towards the end of the zone.

    A single compaction run involves a migration scanner and a free scanner.
    Both scanners operate on pageblock-sized areas in the zone. The migration
    scanner starts at the bottom of the zone and searches for all movable
    pages within each area, isolating them onto a private list called
    migratelist. The free scanner starts at the top of the zone and searches
    for suitable areas, consuming the free pages within them and making them
    available for the migration scanner. The pages isolated for migration are
    then migrated to the newly isolated free pages.
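
    In outline, a single compaction run looks roughly like this (simplified;
    helper names follow the mainline implementation of the time, not this
    exact patch):

    while (compact_finished(zone, cc) == COMPACT_CONTINUE) {
        /* migration scanner: fill cc->migratepages from the bottom up */
        if (!isolate_migratepages(zone, cc))
            continue;

        /* free scanner (via compaction_alloc) supplies pages from the top */
        migrate_pages(&cc->migratepages, compaction_alloc,
                      (unsigned long)cc, 0);

        /* anything that failed to migrate goes back to the LRU */
        putback_lru_pages(&cc->migratepages);
    }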

    [aarcange@redhat.com: Fix unsafe optimisation]
    [mel@csn.ul.ie: do not schedule work on other CPUs for compaction]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • PageAnon pages that are unmapped may or may not have an anon_vma, so they
    are not currently migrated. However, a swap cache page can be migrated
    and fits this description. This patch identifies page swap caches and
    allows them to be migrated, but ensures that no attempt is made to remap
    pages in a way that would potentially access an already freed anon_vma.

    Signed-off-by: Mel Gorman
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Rik van Riel
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • rmap_walk_anon() was triggering errors in memory compaction that look like
    use-after-free errors. The problem is that between the page being
    isolated from the LRU and rcu_read_lock() being taken, the mapcount of the
    page dropped to 0 and the anon_vma gets freed. This can happen during
    memory compaction if pages being migrated belong to a process that exits
    before migration completes. Hence, the use-after-free race looks like

    1. Page isolated for migration
    2. Process exits
    3. page_mapcount(page) drops to zero so the anon_vma is no longer reliable
    4. unmap_and_move() takes the rcu_lock but the anon_vma is already garbage
    5. try_to_unmap() is called, looks up the anon_vma and "locks" it but the
    lock is garbage.

    This patch checks the mapcount after the rcu lock is taken. If the
    mapcount is zero, the anon_vma is assumed to be freed and no further
    action is taken.
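
    Sketch of the added check (simplified from unmap_and_move(); the label is
    illustrative):

    if (PageAnon(page)) {
        rcu_read_lock();
        /*
         * If the page is no longer mapped, its anon_vma may already have
         * been freed: bail out rather than touch it.
         */
        if (!page_mapped(page))
            goto rcu_unlock;
        /* ... otherwise look up and pin the anon_vma as before ... */
    }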

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • For clarity of review, KSM and page migration have separate refcounts on
    the anon_vma. While clear, this is a waste of memory. This patch gets
    KSM and page migration to share their toys in a spirit of harmony.

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patchset is a memory compaction mechanism that reduces external
    memory fragmentation by moving GFP_MOVABLE pages into a smaller number of
    pageblocks. The term "compaction" was chosen because there are a number
    of mechanisms, not mutually exclusive, that can be used to defragment
    memory. For example, lumpy reclaim is a form of defragmentation, as was
    slub "defragmentation" (really a form of targeted reclaim). Hence, this
    is called "compaction" to distinguish it from other forms of
    defragmentation.

    In this implementation, a full compaction run involves two scanners
    operating within a zone - a migration and a free scanner. The migration
    scanner starts at the beginning of a zone and finds all movable pages
    within one pageblock_nr_pages-sized area and isolates them on a
    migratepages list. The free scanner begins at the end of the zone and
    searches on a per-area basis for enough free pages to migrate all the
    pages on the migratepages list. As each area is respectively migrated or
    exhausted of free pages, the scanners are advanced one area. A compaction
    run completes within a zone when the two scanners meet.

    This method is a bit primitive but is easy to understand and greater
    sophistication would require maintenance of counters on a per-pageblock
    basis. That would have a big impact on allocator fast-paths just to
    improve compaction, which is a poor trade-off.

    It also does not try to relocate virtually contiguous pages to be physically
    contiguous. However, assuming transparent hugepages were in use, a
    hypothetical khugepaged might reuse compaction code to isolate free pages,
    split them and relocate userspace pages for promotion.

    Memory compaction can be triggered in one of three ways. It may be
    triggered explicitly by writing any value to /proc/sys/vm/compact_memory,
    which compacts all of memory. It can be triggered on a per-node basis by
    writing any value to /sys/devices/system/node/nodeN/compact where N is the
    node ID to be compacted. When a process fails to allocate a high-order
    page, it may compact memory in an attempt to satisfy the allocation
    instead of entering direct reclaim. Explicit compaction does not finish
    until the two scanners meet and direct compaction ends if a suitable page
    becomes available that would meet watermarks.

    The series is in 14 patches. The first three are not "core" to the series
    but are important pre-requisites.

    Patch 1 reference counts anon_vma for rmap_walk_anon(). Without this
    patch, it's possible to use anon_vma after free if the caller is
    not holding a VMA or mmap_sem for the pages in question. While
    there should be no existing user that causes this problem,
    it's a requirement for memory compaction to be stable. The patch
    is at the start of the series for bisection reasons.
    Patch 2 merges the KSM and migrate counts. It could be merged with patch 1
    but would be slightly harder to review.
    Patch 3 skips over unmapped anon pages during migration as there are no
    guarantees about the anon_vma existing. There is a window between
    when a page was isolated and migration started during which anon_vma
    could disappear.
    Patch 4 notes that PageSwapCache pages can still be migrated even if they
    are unmapped.
    Patch 5 allows CONFIG_MIGRATION to be set without CONFIG_NUMA
    Patch 6 exports a "unusable free space index" via debugfs. It's
    a measure of external fragmentation that takes the size of the
    allocation request into account. It can also be calculated from
    userspace so can be dropped if requested
    Patch 7 exports a "fragmentation index" which only has meaning when an
    allocation request fails. It determines if an allocation failure
    would be due to a lack of memory or external fragmentation.
    Patch 8 moves the definition for LRU isolation modes for use by compaction
    Patch 9 is the compaction mechanism although it's unreachable at this point
    Patch 10 adds a means of compacting all of memory with a proc trigger
    Patch 11 adds a means of compacting a specific node with a sysfs trigger
    Patch 12 adds "direct compaction" before "direct reclaim" if it is
    determined there is a good chance of success.
    Patch 13 adds a sysctl that allows tuning of the threshold at which the
    kernel will compact or direct reclaim
    Patch 14 temporarily disables compaction if an allocation failure occurs
    after compaction.

    Testing of compaction was in three stages. For the test, debugging,
    preempt, the sleep watchdog and lockdep were all enabled but nothing nasty
    popped out. min_free_kbytes was tuned as recommended by hugeadm to help
    fragmentation avoidance and high-order allocations. It was tested on X86,
    X86-64 and PPC64.

    The first test represents one of the easiest cases that can be faced by
    lumpy reclaim or memory compaction.

    1. Machine freshly booted and configured for hugepage usage with
    a) hugeadm --create-global-mounts
    b) hugeadm --pool-pages-max DEFAULT:8G
    c) hugeadm --set-recommended-min_free_kbytes
    d) hugeadm --set-recommended-shmmax

    The min_free_kbytes here is important. Anti-fragmentation works best
    when pageblocks don't mix. hugeadm knows how to calculate a value that
    will significantly reduce the worst of external-fragmentation-related
    events as reported by the mm_page_alloc_extfrag tracepoint.

    2. Load up memory
    a) Start updatedb
    b) Create, in parallel, X files each of size pagesize*128. Wait
    until the files are created. By parallel, I mean that 4096 instances
    of dd were launched, one after the other using &. The crude
    objective being to mix filesystem metadata allocations with
    the buffer cache.
    c) Delete every second file so that pageblocks are likely to
    have holes
    d) kill updatedb if it's still running

    At this point, the system is quiet, memory is full but it's full with
    clean filesystem metadata and clean buffer cache that is unmapped.
    This is readily migrated or discarded so you'd expect lumpy reclaim
    to have no significant advantage over compaction but this is at
    the POC stage.

    3. In increments, attempt to allocate 5% of memory as hugepages.
    Measure how long it took, how successful it was, how many
    direct reclaims took place and how many compactions. Note
    the compaction figures might not fully add up as compactions
    can take place for orders other than the hugepage size

    X86 vanilla compaction
    Final page count 913 916 (attempted 1002)
    pages reclaimed 68296 9791

    X86-64 vanilla compaction
    Final page count: 901 902 (attempted 1002)
    Total pages reclaimed: 112599 53234

    PPC64 vanilla compaction
    Final page count: 93 94 (attempted 110)
    Total pages reclaimed: 103216 61838

    There was not a dramatic improvement in success rates but it wouldn't be
    expected in this case either. What was important is that fewer pages were
    reclaimed in all cases reducing the amount of IO required to satisfy a
    huge page allocation.

    The second tests were all performance related - kernbench, netperf, iozone
    and sysbench. None showed anything too remarkable.

    The last test was a high-order allocation stress test. Many kernel
    compiles are started to fill memory with a pressured mix of unmovable and
    movable allocations. During this, an attempt is made to allocate 90% of
    memory as huge pages - one at a time with small delays between attempts to
    avoid flooding the IO queue.

    vanilla compaction
    Percentage of request allocated X86 98 99
    Percentage of request allocated X86-64 95 98
    Percentage of request allocated PPC64 55 70

    This patch:

    rmap_walk_anon() does not use page_lock_anon_vma() for looking up and
    locking an anon_vma and it does not appear to have sufficient locking to
    ensure the anon_vma does not disappear from under it.

    This patch copies an approach used by KSM to take a reference on the
    anon_vma while pages are being migrated. This should prevent rmap_walk()
    running into nasty surprises later because anon_vma has been freed.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • putback_lru_page() can never fail, so counting "the number of pages put
    back" is pointless.

    In addition, users of this function don't use the return value.

    Let's remove the unnecessary code.

    Signed-off-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim