01 Aug, 2012

3 commits

  • Compaction (and page migration in general) can currently be hindered
    through pages being owned by memory cgroups that are at their limits and
    unreclaimable.

    The reason is that the replacement page is being charged against the limit
    while the page being replaced is also still charged. But this seems
    unnecessary, given that only one of the two pages will still be in use
    after migration finishes.

    This patch changes the memcg migration sequence so that the replacement
    page is not charged. Whatever page is still in use after successful or
    failed migration gets to keep the charge of the page that was going to be
    replaced.

    The replacement page will still show up temporarily in the rss/cache
    statistics, this can be fixed in a later patch as it's less urgent.

    Reported-by: David Rientjes
    Signed-off-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Wanpeng Li
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • With HugeTLB pages, hugetlb cgroup is uncharged in compound page
    destructor. Since we are holding a hugepage reference, we can be sure
    that old page won't get uncharged till the last put_page().

    Signed-off-by: Aneesh Kumar K.V
    Cc: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hillf Danton
    Cc: Michal Hocko
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Since we migrate only one hugepage, don't use linked list for passing the
    page around. Directly pass the page that need to be migrated as argument.
    This also removes the usage of page->lru in the migrate path.

    Signed-off-by: Aneesh Kumar K.V
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Cc: Hillf Danton
    Reviewed-by: Michal Hocko
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     

04 Jun, 2012

1 commit

  • New tmpfs use of !PageUptodate pages for fallocate() is triggering the
    WARNING: at mm/page-writeback.c:1990 when __set_page_dirty_nobuffers()
    is called from migrate_page_copy() for compaction.

    It is anomalous that migration should use __set_page_dirty_nobuffers()
    on an address_space that does not participate in dirty and writeback
    accounting; and this has also been observed to insert surprising dirty
    tags into a tmpfs radix_tree, despite tmpfs not using tags at all.

    We should probably give migrate_page_copy() a better way to preserve the
    tag and migrate accounting info, when mapping_cap_account_dirty(). But
    that needs some more work: so in the interim, avoid the warning by using
    a simple SetPageDirty on PageSwapBacked pages.

    Reported-and-tested-by: Dave Jones
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

24 May, 2012

1 commit

  • Pull user namespace enhancements from Eric Biederman:
    "This is a course correction for the user namespace, so that we can
    reach an inexpensive, maintainable, and reasonably complete
    implementation.

    Highlights:
    - Config guards make it impossible to enable the user namespace and
    code that has not been converted to be user namespace safe.

    - Use of the new kuid_t type ensures the if you somehow get past the
    config guards the kernel will encounter type errors if you enable
    user namespaces and attempt to compile in code whose permission
    checks have not been updated to be user namespace safe.

    - All uids from child user namespaces are mapped into the initial
    user namespace before they are processed. Removing the need to add
    an additional check to see if the user namespace of the compared
    uids remains the same.

    - With the user namespaces compiled out the performance is as good or
    better than it is today.

    - For most operations absolutely nothing changes performance or
    operationally with the user namespace enabled.

    - The worst case performance I could come up with was timing 1
    billion cache cold stat operations with the user namespace code
    enabled. This went from 156s to 164s on my laptop (or 156ns to
    164ns per stat operation).

    - (uid_t)-1 and (gid_t)-1 are reserved as an internal error value.
    Most uid/gid setting system calls treat these value specially
    anyway so attempting to use -1 as a uid would likely cause
    entertaining failures in userspace.

    - If setuid is called with a uid that can not be mapped setuid fails.
    I have looked at sendmail, login, ssh and every other program I
    could think of that would call setuid and they all check for and
    handle the case where setuid fails.

    - If stat or a similar system call is called from a context in which
    we can not map a uid we lie and return overflowuid. The LFS
    experience suggests not lying and returning an error code might be
    better, but the historical precedent with uids is different and I
    can not think of anything that would break by lying about a uid we
    can't map.

    - Capabilities are localized to the current user namespace making it
    safe to give the initial user in a user namespace all capabilities.

    My git tree covers all of the modifications needed to convert the core
    kernel and enough changes to make a system bootable to runlevel 1."

    Fix up trivial conflicts due to nearby independent changes in fs/stat.c

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits)
    userns: Silence silly gcc warning.
    cred: use correct cred accessor with regards to rcu read lock
    userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq
    userns: Convert cgroup permission checks to use uid_eq
    userns: Convert tmpfs to use kuid and kgid where appropriate
    userns: Convert sysfs to use kgid/kuid where appropriate
    userns: Convert sysctl permission checks to use kuid and kgids.
    userns: Convert proc to use kuid/kgid where appropriate
    userns: Convert ext4 to user kuid/kgid where appropriate
    userns: Convert ext3 to use kuid/kgid where appropriate
    userns: Convert ext2 to use kuid/kgid where appropriate.
    userns: Convert devpts to use kuid/kgid where appropriate
    userns: Convert binary formats to use kuid/kgid where appropriate
    userns: Add negative depends on entries to avoid building code that is userns unsafe
    userns: signal remove unnecessary map_cred_ns
    userns: Teach inode_capable to understand inodes whose uids map to other namespaces.
    userns: Fail exec for suid and sgid binaries with ids outside our user namespace.
    userns: Convert stat to return values mapped from kuids and kgids
    userns: Convert user specfied uids and gids in chown into kuids and kgid
    userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs
    ...

    Linus Torvalds
     

16 May, 2012

1 commit


26 Apr, 2012

1 commit

  • Commit 3268c63 ("mm: fix move/migrate_pages() race on task struct") has
    added an odd construct where 'mm' is checked for being NULL, and if it is,
    it would get dereferenced anyways by mput()ing it.

    Signed-off-by: Sasha Levin
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

22 Mar, 2012

1 commit

  • Migration functions perform the rcu_read_unlock too early. As a result
    the task pointed to may change from under us. This can result in an oops,
    as reported by Dave Hansen in https://lkml.org/lkml/2012/2/23/302.

    The following patch extend the period of the rcu_read_lock until after the
    permissions checks are done. We also take a refcount so that the task
    reference is stable when calling security check functions and performing
    cpuset node validation (which takes a mutex).

    The refcount is dropped before actual page migration occurs so there is no
    change to the refcounts held during page migration.

    Also move the determination of the mm of the task struct to immediately
    before the do_migrate*() calls so that it is clear that we switch from
    handling the task during permission checks to the mm for the actual
    migration. Since the determination is only done once and we then no
    longer use the task_struct we can be sure that we operate on a specific
    address space that will not change from under us.

    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Christoph Lameter
    Cc: "Eric W. Biederman"
    Reported-by: Dave Hansen
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

06 Mar, 2012

1 commit

  • When moving tasks from old memcg (with move_charge_at_immigrate on new
    memcg), followed by removal of old memcg, hit General Protection Fault in
    mem_cgroup_lru_del_list() (called from release_pages called from
    free_pages_and_swap_cache from tlb_flush_mmu from tlb_finish_mmu from
    exit_mmap from mmput from exit_mm from do_exit).

    Somewhat reproducible, takes a few hours: the old struct mem_cgroup has
    been freed and poisoned by SLAB_DEBUG, but mem_cgroup_lru_del_list() is
    still trying to update its stats, and take page off lru before freeing.

    A task, or a charge, or a page on lru: each secures a memcg against
    removal. In this case, the last task has been moved out of the old memcg,
    and it is exiting: anonymous pages are uncharged one by one from the
    memcg, as they are zapped from its pagetables, so the charge gets down to
    0; but the pages themselves are queued in an mmu_gather for freeing.

    Most of those pages will be on lru (and force_empty is careful to
    lru_add_drain_all, to add pages from pagevec to lru first), but not
    necessarily all: perhaps some have been isolated for page reclaim, perhaps
    some isolated for other reasons. So, force_empty may find no task, no
    charge and no page on lru, and let the removal proceed.

    There would still be no problem if these pages were immediately freed; but
    typically (and the put_page_testzero protocol demands it) they have to be
    added back to lru before they are found freeable, then removed from lru
    and freed. We don't see the issue when adding, because the
    mem_cgroup_iter() loops keep their own reference to the memcg being
    scanned; but when it comes to mem_cgroup_lru_del_list().

    I believe this was not an issue in v3.2: there, PageCgroupAcctLRU and
    PageCgroupUsed flags were used (like a trick with mirrors) to deflect view
    of pc->mem_cgroup to the stable root_mem_cgroup when neither set.
    38c5d72f3ebe ("memcg: simplify LRU handling by new rule") mercifully
    removed those convolutions, but left this General Protection Fault.

    But it's surprisingly easy to restore the old behaviour: just check
    PageCgroupUsed in mem_cgroup_lru_add_list() (which decides on which lruvec
    to add), and reset pc to root_mem_cgroup if page is uncharged. A risky
    change? just going back to how it worked before; testing, and an audit of
    uses of pc->mem_cgroup, show no problem.

    And there's a nice bonus: with mem_cgroup_lru_add_list() itself making
    sure that an uncharged page goes to root lru, mem_cgroup_reset_owner() no
    longer has any purpose, and we can safely revert 4e5f01c2b9b9 ("memcg:
    clear pc->mem_cgroup if necessary").

    Calling update_page_reclaim_stat() after add_page_to_lru_list() in swap.c
    is not strictly necessary: the lru_lock there, with RCU before memcg
    structures are freed, makes mem_cgroup_get_reclaim_stat_from_page safe
    without that; but it seems cleaner to rely on one dependency less.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

04 Feb, 2012

1 commit

  • Postpone resetting page->mapping until the final remove_migration_ptes().
    Otherwise the expression PageAnon(migration_entry_to_page(entry)) does not
    work.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

13 Jan, 2012

3 commits

  • This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT
    mode that avoids writing back pages to backing storage. Async compaction
    maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
    For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
    used.

    This avoids sync compaction stalling for an excessive length of time,
    particularly when copying files to a USB stick where there might be a
    large number of dirty pages backed by a filesystem that does not support
    ->writepages.

    [aarcange@redhat.com: This patch is heavily based on Andrea's work]
    [akpm@linux-foundation.org: fix fs/nfs/write.c build]
    [akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Asynchronous compaction is used when allocating transparent hugepages to
    avoid blocking for long periods of time. Due to reports of stalling,
    there was a debate on disabling synchronous compaction but this severely
    impacted allocation success rates. Part of the reason was that many dirty
    pages are skipped in asynchronous compaction by the following check;

    if (PageDirty(page) && !sync &&
    mapping->a_ops->migratepage != migrate_page)
    rc = -EBUSY;

    This skips over all mapping aops using buffer_migrate_page() even though
    it is possible to migrate some of these pages without blocking. This
    patch updates the ->migratepage callback with a "sync" parameter. It is
    the responsibility of the callback to fail gracefully if migration would
    block.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This is a preparation before removing a flag PCG_ACCT_LRU in page_cgroup
    and reducing atomic ops/complexity in memcg LRU handling.

    In some cases, pages are added to lru before charge to memcg and pages
    are not classfied to memory cgroup at lru addtion. Now, the lru where
    the page should be added is determined a bit in page_cgroup->flags and
    pc->mem_cgroup. I'd like to remove the check of flag.

    To handle the case pc->mem_cgroup may contain stale pointers if pages
    are added to LRU before classification. This patch resets
    pc->mem_cgroup to root_mem_cgroup before lru additions.

    [akpm@linux-foundation.org: fix CONFIG_CGROUP_MEM_CONT=n build]
    [hughd@google.com: fix CONFIG_CGROUP_MEM_RES_CTLR=y CONFIG_CGROUP_MEM_RES_CTLR_SWAP=n build]
    [akpm@linux-foundation.org: ksm.c needs memcontrol.h, per Michal]
    [hughd@google.com: stop oops in mem_cgroup_reset_owner()]
    [hughd@google.com: fix page migration to reset_owner]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Miklos Szeredi
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Ying Han
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

11 Jan, 2012

3 commits


09 Dec, 2011

1 commit


07 Nov, 2011

1 commit

  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of
    net: inet_timewait_sock doesnt need
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     

01 Nov, 2011

1 commit

  • unmap_and_move() is one a big messy function. Clean it up.

    Signed-off-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

31 Oct, 2011

1 commit


20 Oct, 2011

1 commit

  • I don't usually pay much attention to the stale "? " addresses in
    stack backtraces, but this lucky report from Pawel Sikora hints that
    mremap's move_ptes() has inadequate locking against page migration.

    3.0 BUG_ON(!PageLocked(p)) in migration_entry_to_page():
    kernel BUG at include/linux/swapops.h:105!
    RIP: 0010:[] []
    migration_entry_wait+0x156/0x160
    [] handle_pte_fault+0xae1/0xaf0
    [] ? __pte_alloc+0x42/0x120
    [] ? do_huge_pmd_anonymous_page+0xab/0x310
    [] handle_mm_fault+0x181/0x310
    [] ? vma_adjust+0x537/0x570
    [] do_page_fault+0x11d/0x4e0
    [] ? do_mremap+0x2d5/0x570
    [] page_fault+0x1f/0x30

    mremap's down_write of mmap_sem, together with i_mmap_mutex or lock,
    and pagetable locks, were good enough before page migration (with its
    requirement that every migration entry be found) came in, and enough
    while migration always held mmap_sem; but not enough nowadays, when
    there's memory hotremove and compaction.

    The danger is that move_ptes() lets a migration entry dodge around
    behind remove_migration_pte()'s back, so it's in the old location when
    looking at the new, then in the new location when looking at the old.

    Either mremap's move_ptes() must additionally take anon_vma lock(), or
    migration's remove_migration_pte() must stop peeking for is_swap_entry()
    before it takes pagetable lock.

    Consensus chooses the latter: we prefer to add overhead to migration
    than to mremapping, which gets used by JVMs and by exec stack setup.

    Reported-and-tested-by: Paweł Sikora
    Signed-off-by: Hugh Dickins
    Acked-by: Andrea Arcangeli
    Acked-by: Mel Gorman
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

17 Jun, 2011

1 commit

  • swapcache will reach the below code path in migrate_page_move_mapping,
    and swapcache is accounted as NR_FILE_PAGES but it's not accounted as
    NR_SHMEM.

    Hugh pointed out we must use PageSwapCache instead of comparing
    mapping to &swapper_space, to avoid build failure with CONFIG_SWAP=n.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

25 May, 2011

1 commit

  • Convert page_lock_anon_vma() over to use refcounts. This is done to
    prepare for the conversion of anon_vma from spinlock to mutex.

    Sadly this inceases the cost of page_lock_anon_vma() from one to two
    atomics, a follow up patch addresses this, lets keep that simple for now.

    Signed-off-by: Peter Zijlstra
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: Mel Gorman
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

31 Mar, 2011

1 commit


24 Mar, 2011

1 commit

  • Remove initialization of vaiable in caller of memory cgroup function.
    Actually, it's return value of memcg function but it's initialized in
    caller.

    Some memory cgroup uses following style to bring the result of start
    function to the end function for avoiding races.

    mem_cgroup_start_A(&(*ptr))
    /* Something very complicated can happen here. */
    mem_cgroup_end_A(*ptr)

    In some calls, *ptr should be initialized to NULL be caller. But it's
    ugly. This patch fixes that *ptr is initialized by _start function.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Johannes Weiner
    Acked-by: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

23 Mar, 2011

3 commits

  • __GFP_NO_KSWAPD allocations are usually very expensive and not mandatory
    to succeed as they have graceful fallback. Waiting for I/O in those,
    tends to be overkill in terms of latencies, so we can reduce their latency
    by disabling sync migrate.

    Unfortunately, even with async migration it's still possible for the
    process to be blocked waiting for a request slot (e.g. get_request_wait
    in the block layer) when ->writepage is called. To prevent
    __GFP_NO_KSWAPD blocking, this patch prevents ->writepage being called on
    dirty page cache for asynchronous migration.

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=31142

    [mel@csn.ul.ie: Avoid writebacks for NFS, retry locked pages, use bool]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Mel Gorman
    Cc: Arthur Marsh
    Cc: Clemens Ladisch
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Reported-by: Alex Villacis Lasso
    Tested-by: Alex Villacis Lasso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • The normal code pattern used in the kernel is: get/put.

    Signed-off-by: Peter Zijlstra
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • This function basically does:

    remove_from_page_cache(old);
    page_cache_release(old);
    add_to_page_cache_locked(new);

    Except it does this atomically, so there's no possibility for the "add" to
    fail because of a race.

    If memory cgroups are enabled, then the memory cgroup charge is also moved
    from the old page to the new.

    This function is currently used by fuse to move pages into the page cache
    on read, instead of copying the page contents.

    [minchan.kim@gmail.com: add freepage() hook to replace_page_cache_page()]
    Signed-off-by: Miklos Szeredi
    Acked-by: Rik van Riel
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

26 Feb, 2011

1 commit

  • The move_pages() usage of find_task_by_vpid() requires rcu_read_lock() to
    prevent free_pid() from reclaiming the pid.

    Without this patch, RCU warnings are printed in v2.6.38-rc4 move_pages()
    with:

    CONFIG_LOCKUP_DETECTOR=y
    CONFIG_PREEMPT=y
    CONFIG_LOCKDEP=y
    CONFIG_PROVE_LOCKING=y
    CONFIG_PROVE_RCU=y

    Previously, migrate_pages() went through a similar transformation
    replacing usage of tasklist_lock with rcu read lock:

    commit 55cfaa3cbdd29c4919ecb5fb8965c310f357e48c
    Author: Zeng Zhaoming
    Date: Thu Dec 2 14:31:13 2010 -0800

    mm/mempolicy.c: add rcu read lock to protect pid structure

    commit 1e50df39f6e2c3a4a3394df62baa8a213df16c54
    Author: KOSAKI Motohiro
    Date: Thu Jan 13 15:46:14 2011 -0800

    mempolicy: remove tasklist_lock from migrate_pages

    Signed-off-by: Greg Thelen
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: "Paul E. McKenney"
    Cc: Tetsuo Handa
    Cc: Sergey Senozhatsky
    Cc: Oleg Nesterov
    Cc: Zeng Zhaoming
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     

03 Feb, 2011

2 commits

  • If migrate_huge_page by memory-failure fails , it calls put_page in itself
    to decrease page reference and caller of migrate_huge_page also calls
    putback_lru_pages. It can do double free of page so it can make page
    corruption on page holder.

    In addtion, clean of pages on caller is consistent behavior with
    migrate_pages by cf608ac19c ("mm: compaction: fix COMPACTPAGEFAILED
    counting").

    Signed-off-by: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • In some cases migrate_pages could return zero while still leaving a few
    pages in the pagelist (and some caller wouldn't notice it has to call
    putback_lru_pages after commit cf608ac19c9 ("mm: compaction: fix
    COMPACTPAGEFAILED counting")).

    Add one missing putback_lru_pages not added by commit cf608ac19c95 ("mm:
    compaction: fix COMPACTPAGEFAILED counting").

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Minchan Kim
    Reviewed-by: Minchan Kim
    Cc: Christoph Lameter
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

26 Jan, 2011

1 commit

  • Callers of migrate_pages should putback_lru_pages to return pages
    isolated to LRU or free list. Now comment is rather confusing. It says
    caller always have to call it.

    It is more clear to point out that the caller has to call it if
    migrate_pages's return value isn't zero.

    Signed-off-by: Minchan Kim
    Cc: Christoph Lameter
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

14 Jan, 2011

8 commits

  • In the current implementation mem_cgroup_end_migration() decides whether
    the page migration has succeeded or not by checking "oldpage->mapping".

    But if we are tring to migrate a shmem swapcache, the page->mapping of it
    is NULL from the begining, so the check would be invalid. As a result,
    mem_cgroup_end_migration() assumes the migration has succeeded even if
    it's not, so "newpage" would be freed while it's not uncharged.

    This patch fixes it by passing mem_cgroup_end_migration() the result of
    the page migration.

    Signed-off-by: Daisuke Nishimura
    Reviewed-by: Minchan Kim
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Cc: Minchan Kim
    Reviewed-by: Johannes Weiner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • 2.6.37 added an unmap_and_move_huge_page() for memory failure recovery,
    but its anon_vma handling was still based around the 2.6.35 conventions.
    Update it to use page_lock_anon_vma, get_anon_vma, page_unlock_anon_vma,
    drop_anon_vma in the same way as we're now changing unmap_and_move().

    I don't particularly like to propose this for stable when I've not seen
    its problems in practice nor tested the solution: but it's clearly out of
    synch at present.

    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: "Jun'ichi Nomura"
    Cc: Andi Kleen
    Cc: [2.6.37, 2.6.36]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Increased usage of page migration in mmotm reveals that the anon_vma
    locking in unmap_and_move() has been deficient since 2.6.36 (or even
    earlier). Review at the time of f18194275c39835cb84563500995e0d503a32d9a
    ("mm: fix hang on anon_vma->root->lock") missed the issue here: the
    anon_vma to which we get a reference may already have been freed back to
    its slab (it is in use when we check page_mapped, but that can change),
    and so its anon_vma->root may be switched at any moment by reuse in
    anon_vma_prepare.

    Perhaps we could fix that with a get_anon_vma_unless_zero(), but let's
    not: just rely on page_lock_anon_vma() to do all the hard thinking for us,
    then we don't need any rcu read locking over here.

    In removing the rcu_unlock label: since PageAnon is a bit in
    page->mapping, it's impossible for a !page->mapping page to be anon; but
    insert VM_BUG_ON in case the implementation ever changes.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Hugh Dickins
    Reviewed-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: "Jun'ichi Nomura"
    Cc: Andi Kleen
    Cc: [2.6.37, 2.6.36]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • …lot during file page migration

    migrate_pages() -> unmap_and_move() only calls rcu_read_lock() for
    anonymous pages, as introduced by git commit
    989f89c57e6361e7d16fbd9572b5da7d313b073d ("fix rcu_read_lock() in page
    migraton"). The point of the RCU protection there is part of getting a
    stable reference to anon_vma and is only held for anon pages as file pages
    are locked which is sufficient protection against freeing.

    However, while a file page's mapping is being migrated, the radix tree is
    double checked to ensure it is the expected page. This uses
    radix_tree_deref_slot() -> rcu_dereference() without the RCU lock held
    triggering the following warning.

    [ 173.674290] ===================================================
    [ 173.676016] [ INFO: suspicious rcu_dereference_check() usage. ]
    [ 173.676016] ---------------------------------------------------
    [ 173.676016] include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!
    [ 173.676016]
    [ 173.676016] other info that might help us debug this:
    [ 173.676016]
    [ 173.676016]
    [ 173.676016] rcu_scheduler_active = 1, debug_locks = 0
    [ 173.676016] 1 lock held by hugeadm/2899:
    [ 173.676016] #0: (&(&inode->i_data.tree_lock)->rlock){..-.-.}, at: [<c10e3d2b>] migrate_page_move_mapping+0x40/0x1ab
    [ 173.676016]
    [ 173.676016] stack backtrace:
    [ 173.676016] Pid: 2899, comm: hugeadm Not tainted 2.6.37-rc5-autobuild
    [ 173.676016] Call Trace:
    [ 173.676016] [<c128cc01>] ? printk+0x14/0x1b
    [ 173.676016] [<c1063502>] lockdep_rcu_dereference+0x7d/0x86
    [ 173.676016] [<c10e3db5>] migrate_page_move_mapping+0xca/0x1ab
    [ 173.676016] [<c10e41ad>] migrate_page+0x23/0x39
    [ 173.676016] [<c10e491b>] buffer_migrate_page+0x22/0x107
    [ 173.676016] [<c10e48f9>] ? buffer_migrate_page+0x0/0x107
    [ 173.676016] [<c10e425d>] move_to_new_page+0x9a/0x1ae
    [ 173.676016] [<c10e47e6>] migrate_pages+0x1e7/0x2fa

    This patch introduces radix_tree_deref_slot_protected() which calls
    rcu_dereference_protected(). Users of it must pass in the
    mapping->tree_lock that is protecting this dereference. Holding the tree
    lock protects against parallel updaters of the radix tree meaning that
    rcu_dereference_protected is allowable.

    [akpm@linux-foundation.org: remove unneeded casts]
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: Milton Miller <miltonm@bga.com>
    Cc: Nick Piggin <nickpiggin@yahoo.com.au>
    Cc: Wu Fengguang <fengguang.wu@intel.com>
    Cc: <stable@kernel.org> [2.6.37.early]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     
  • No pmd_trans_huge should ever materialize in migration ptes areas, because
    we split the hugepage before migration ptes are instantiated.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • With the introduction of the boolean sync parameter, the API looks a
    little inconsistent as offlining is still an int. Convert offlining to a
    bool for the sake of being tidy.

    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • …ompaction in the faster path

    Migration synchronously waits for writeback if the initial passes fails.
    Callers of memory compaction do not necessarily want this behaviour if the
    caller is latency sensitive or expects that synchronous migration is not
    going to have a significantly better success rate.

    This patch adds a sync parameter to migrate_pages() allowing the caller to
    indicate if wait_on_page_writeback() is allowed within migration or not.
    For reclaim/compaction, try_to_compact_pages() is first called
    asynchronously, direct reclaim runs and then try_to_compact_pages() is
    called synchronously as there is a greater expectation that it'll succeed.

    [akpm@linux-foundation.org: build/merge fix]
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Rik van Riel <riel@redhat.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Andy Whitcroft <apw@shadowen.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     
  • Lumpy reclaim is disruptive. It reclaims a large number of pages and
    ignores the age of the pages it reclaims. This can incur significant
    stalls and potentially increase the number of major faults.

    Compaction has reached the point where it is considered reasonably stable
    (meaning it has passed a lot of testing) and is a potential candidate for
    displacing lumpy reclaim. This patch introduces an alternative to lumpy
    reclaim whe compaction is available called reclaim/compaction. The basic
    operation is very simple - instead of selecting a contiguous range of
    pages to reclaim, a number of order-0 pages are reclaimed and then
    compaction is later by either kswapd (compact_zone_order()) or direct
    compaction (__alloc_pages_direct_compact()).

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: use conventional task_struct naming]
    Signed-off-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman