18 Jan, 2011

2 commits

  • This reverts commit d8505dee1a87b8d41b9c4ee1325cd72258226fbc.

    Chris Mason ended up chasing down some page allocation errors and pages
    stuck waiting on the IO scheduler, and was able to narrow it down to two
    commits: commit 744ed1442757 ("mm: batch activate_page() to reduce lock
    contention") and d8505dee1a87 ("mm: simplify code of swap.c").

    This reverts the second one.

    Reported-and-debugged-by: Chris Mason
    Cc: Mel Gorman
    Cc: Andrew Morton
    Cc: Jens Axboe
    Cc: linux-mm
    Cc: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Cc: Shaohua Li
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • This reverts commit 744ed1442757767ffede5008bb13e0805085902e.

    Chris Mason ended up chasing down some page allocation errors and pages
    stuck waiting on the IO scheduler, and was able to narrow it down to two
    commits: commit 744ed1442757 ("mm: batch activate_page() to reduce lock
    contention") and d8505dee1a87 ("mm: simplify code of swap.c").

    This reverts the first of them.

    Reported-and-debugged-by: Chris Mason
    Cc: Mel Gorman
    Cc: Andrew Morton
    Cc: Jens Axboe
    Cc: linux-mm
    Cc: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Cc: Shaohua Li
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

17 Jan, 2011

1 commit

  • pmdp_get_and_clear/pmdp_clear_flush/pmdp_splitting_flush were trapped as
    BUG() and they were defined only to diminish the risk of build issues on
    not-x86 archs and to be consistent with the generic pte methods previously
    defined in include/asm-generic/pgtable.h.

    But they are causing more trouble than they were supposed to solve, so
    it's simpler not to define them when THP is off.

    This also corrects the export of pmdp_splitting_flush, which is currently
    unused (x86 isn't using the generic implementation in mm/pgtable-generic.c
    and no other arch needs it [yet]).

    Signed-off-by: Andrea Arcangeli
    Sam Ravnborg
    Cc: Stephen Rothwell
    Cc: "David S. Miller"
    Cc: Benjamin Herrenschmidt
    Cc: "Luck, Tony"
    Cc: James Bottomley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

15 Jan, 2011

2 commits


14 Jan, 2011

35 commits

  • * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (59 commits)
    ACPI / PM: Fix build problems for !CONFIG_ACPI related to NVS rework
    ACPI: fix resource check message
    ACPI / Battery: Update information on info notification and resume
    ACPI: Drop device flag wake_capable
    ACPI: Always check if _PRW is present before trying to evaluate it
    ACPI / PM: Check status of power resources under mutexes
    ACPI / PM: Rename acpi_power_off_device()
    ACPI / PM: Drop acpi_power_nocheck
    ACPI / PM: Drop acpi_bus_get_power()
    Platform / x86: Make fujitsu_laptop use acpi_bus_update_power()
    ACPI / Fan: Rework the handling of power resources
    ACPI / PM: Register power resource devices as soon as they are needed
    ACPI / PM: Register acpi_power_driver early
    ACPI / PM: Add function for updating device power state consistently
    ACPI / PM: Add function for device power state initialization
    ACPI / PM: Introduce __acpi_bus_get_power()
    ACPI / PM: Introduce function for refcounting device power resources
    ACPI / PM: Add functions for manipulating lists of power resources
    ACPI / PM: Prevent acpi_power_get_inferred_state() from making changes
    ACPICA: Update version to 20101209
    ...

    Linus Torvalds
     
  • In the current implementation mem_cgroup_end_migration() decides whether
    the page migration has succeeded or not by checking "oldpage->mapping".

    But if we are trying to migrate a shmem swapcache, its page->mapping is
    NULL from the beginning, so the check is invalid. As a result,
    mem_cgroup_end_migration() assumes the migration has succeeded even when
    it hasn't, so "newpage" would be freed while it is not uncharged.

    This patch fixes it by passing mem_cgroup_end_migration() the result of
    the page migration.

    Signed-off-by: Daisuke Nishimura
    Reviewed-by: Minchan Kim
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Cc: Minchan Kim
    Reviewed-by: Johannes Weiner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • In mem_cgroup_alloc() we currently do either kmalloc() or vmalloc(),
    followed by memset() to zero the memory. This can be more efficiently
    achieved by using kzalloc() and vzalloc(). There's also one situation
    where we can use kzalloc_node() - this is what's new in this version of
    the patch.

    Signed-off-by: Jesper Juhl
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Wu Fengguang
    Cc: Balbir Singh
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Juhl
     
  • Commit b1dd693e ("memcg: avoid deadlock between move charge and
    try_charge()") can cause another deadlock about mmap_sem on task migration
    if cpuset and memcg are mounted onto the same mount point.

    After the commit, cgroup_attach_task() has a sequence like:

    cgroup_attach_task()
      ss->can_attach()
        cpuset_can_attach()
        mem_cgroup_can_attach()
          down_read(&mmap_sem)        (1)
      ss->attach()
        cpuset_attach()
          mpol_rebind_mm()
            down_write(&mmap_sem)     (2)
            up_write(&mmap_sem)
          cpuset_migrate_mm()
            do_migrate_pages()
              down_read(&mmap_sem)
              up_read(&mmap_sem)
        mem_cgroup_move_task()
          mem_cgroup_clear_mc()
            up_read(&mmap_sem)

    We can deadlock at (2) because we have already acquired the mmap_sem at (1).

    But the commit itself is necessary to fix deadlocks which have existed
    before the commit like:

    Ex.1)
                  move charge                 |        try charge
    --------------------------------------+------------------------------
      mem_cgroup_can_attach()             |  down_write(&mmap_sem)
        mc.moving_task = current          |    ..
        mem_cgroup_precharge_mc()         |  __mem_cgroup_try_charge()
          mem_cgroup_count_precharge()    |    prepare_to_wait()
            down_read(&mmap_sem)          |    if (mc.moving_task)
            -> cannot acquire the lock    |    -> true
                                          |      schedule()
                                          |      -> move charge should wake it up

    Ex.2)
                  move charge                 |        try charge
    --------------------------------------+------------------------------
      mem_cgroup_can_attach()             |
        mc.moving_task = current          |
        mem_cgroup_precharge_mc()         |
          mem_cgroup_count_precharge()    |
            down_read(&mmap_sem)          |
            ..                            |
            up_read(&mmap_sem)            |
                                          |  down_write(&mmap_sem)
      mem_cgroup_move_task()              |    ..
        mem_cgroup_move_charge()          |  __mem_cgroup_try_charge()
          down_read(&mmap_sem)            |    prepare_to_wait()
          -> cannot acquire the lock      |    if (mc.moving_task)
                                          |    -> true
                                          |      schedule()
                                          |      -> move charge should wake it up

    This patch fixes all of these problems by:
    1. reverting the commit.
    2. To fix Ex.1, we set mc.moving_task only after mem_cgroup_count_precharge()
    has released the mmap_sem.
    3. To fix Ex.2, we use down_read_trylock() instead of down_read() in
    mem_cgroup_move_charge() and, if it fails to acquire the lock, cancel
    all extra charges, wake up all waiters, and retry the trylock (see the
    sketch below).
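
    A minimal userspace sketch of the trylock-and-retry pattern in point 3,
    with a pthreads rwlock standing in for mmap_sem and hypothetical helpers
    for the memcg-specific steps:

    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    static pthread_rwlock_t mmap_sem_analogue = PTHREAD_RWLOCK_INITIALIZER;

    /* Hypothetical stand-ins for the memcg-specific work. */
    static void cancel_extra_charges(void) { puts("cancel extra charges"); }
    static void wake_up_waiters(void)      { puts("wake up waiters"); }

    static void move_charge(void)
    {
        /* Instead of blocking in down_read() (which can deadlock against
         * mpol_rebind_mm()'s down_write()), back off and retry. */
        while (pthread_rwlock_tryrdlock(&mmap_sem_analogue) != 0) {
            cancel_extra_charges();
            wake_up_waiters();
            sched_yield();          /* give the writer a chance, then retry */
        }

        /* ... walk the mm and move the charges ... */

        pthread_rwlock_unlock(&mmap_sem_analogue);
    }

    int main(void)
    {
        move_charge();
        return 0;
    }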

    Signed-off-by: Daisuke Nishimura
    Reported-by: Ben Blum
    Cc: Miao Xie
    Cc: David Rientjes
    Cc: Paul Menage
    Cc: Hiroyuki Kamezawa
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • Signed-off-by: Minchan Kim
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Adding the number of swap pages to the byte limit of a memory control
    group makes no sense. Convert the pages to bytes before adding them.

    The only user of this code is the OOM killer, and the way it is used means
    that the error results in a higher OOM badness value. Since the cgroup
    limit is the same for all tasks in the cgroup, the error should have no
    practical impact at the moment.

    But let's not wait for future or changing users to trip over it.
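
    A rough illustration of the unit fix, assuming 4 KiB pages and made-up
    numbers:

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12   /* assumption: 4 KiB pages */

    int main(void)
    {
        uint64_t limit_bytes = 512ULL << 20;    /* memsw limit, in bytes */
        uint64_t total_swap_pages = 131072;     /* swap size, in pages */

        /* Wrong: adds a page count to a byte count. */
        uint64_t mixed_units = limit_bytes + total_swap_pages;
        /* Right: convert pages to bytes before adding. */
        uint64_t consistent = limit_bytes + (total_swap_pages << PAGE_SHIFT);

        printf("mixed=%llu bytes, consistent=%llu bytes\n",
               (unsigned long long)mixed_units, (unsigned long long)consistent);
        return 0;
    }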

    Signed-off-by: Johannes Weiner
    Cc: Greg Thelen
    Cc: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Introduce a new bit spin lock, PCG_MOVE_LOCK, to synchronize the page
    accounting and migration code. This reworks the locking scheme of
    _update_stat() and _move_account() by adding the new lock bit
    PCG_MOVE_LOCK, which is always taken with IRQs disabled.

    1. If pages are being migrated from a memcg, then updates to that
    memcg page statistics are protected by grabbing PCG_MOVE_LOCK using
    move_lock_page_cgroup(). In an upcoming commit, memcg dirty page
    accounting will be updating memcg page accounting (specifically: num
    writeback pages) from IRQ context (softirq). Avoid a deadlocking
    nested spin lock attempt by disabling irq on the local processor when
    grabbing the PCG_MOVE_LOCK.

    2. The lock for update_page_stat is used only to avoid races with
    move_account(), so IRQ awareness of lock_page_cgroup() itself is not
    a problem. The problem is the race between mem_cgroup_update_page_stat()
    and mem_cgroup_move_account_page().
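
    A userspace sketch of the bit-spin-lock half of this scheme, using C11
    atomics; the kernel takes a page_cgroup flag bit via bit_spin_lock() with
    local_irq_save(), which has no userspace equivalent, so only the bit
    locking is modelled and the bit position is illustrative:

    #include <stdatomic.h>
    #include <stdio.h>

    #define PCG_MOVE_LOCK_BIT 0     /* illustrative bit position */

    static atomic_ulong pc_flags;   /* stands in for page_cgroup->flags */

    static void move_lock_page_cgroup(void)
    {
        /* Kernel: local_irq_save() first, so a softirq updating writeback
         * counters cannot spin on a lock its own CPU already holds. */
        while (atomic_fetch_or_explicit(&pc_flags, 1UL << PCG_MOVE_LOCK_BIT,
                                        memory_order_acquire) &
               (1UL << PCG_MOVE_LOCK_BIT))
            ;                       /* bit was already set: spin */
    }

    static void move_unlock_page_cgroup(void)
    {
        atomic_fetch_and_explicit(&pc_flags, ~(1UL << PCG_MOVE_LOCK_BIT),
                                  memory_order_release);
        /* Kernel: local_irq_restore() afterwards. */
    }

    int main(void)
    {
        move_lock_page_cgroup();
        /* ... update or move the page statistics ... */
        move_unlock_page_cgroup();
        puts("took and released PCG_MOVE_LOCK");
        return 0;
    }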

    Trade-off:
    * Changing lock_page_cgroup() to always disable IRQs (or
    local_bh) has some impact on performance, and I think
    it's bad to disable IRQs when it's not necessary.
    * Adding a new lock makes move_account() slower. Scores are
    below.

    Performance Impact: moving a 8G anon process.

    Before:
    real 0m0.792s
    user 0m0.000s
    sys 0m0.780s

    After:
    real 0m0.854s
    user 0m0.000s
    sys 0m0.842s

    This score is bad but planned patches for optimization can reduce
    this impact.

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Greg Thelen
    Reviewed-by: Minchan Kim
    Acked-by: Daisuke Nishimura
    Cc: Andrea Righi
    Cc: Balbir Singh
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Replace usage of the mem_cgroup_update_file_mapped() memcg
    statistic update routine with two new routines:
    * mem_cgroup_inc_page_stat()
    * mem_cgroup_dec_page_stat()

    As before, only the file_mapped statistic is managed. However, these more
    general interfaces allow for new statistics to be more easily added. New
    statistics are added with memcg dirty page accounting.
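
    A minimal sketch of the shape of the new interface; in the kernel the
    functions take the page and update per-memcg counters, but a flat array
    is enough to show how new statistics slot in:

    #include <stdio.h>

    enum mem_cgroup_page_stat_item {
        MEMCG_NR_FILE_MAPPED,   /* the only statistic managed so far */
        /* future items (e.g. dirty, writeback) slot in here */
        MEMCG_NR_STAT_ITEMS,
    };

    static long page_stat[MEMCG_NR_STAT_ITEMS];

    static void mem_cgroup_inc_page_stat(enum mem_cgroup_page_stat_item idx)
    {
        page_stat[idx]++;
    }

    static void mem_cgroup_dec_page_stat(enum mem_cgroup_page_stat_item idx)
    {
        page_stat[idx]--;
    }

    int main(void)
    {
        mem_cgroup_inc_page_stat(MEMCG_NR_FILE_MAPPED);   /* e.g. on map */
        mem_cgroup_dec_page_stat(MEMCG_NR_FILE_MAPPED);   /* e.g. on unmap */
        printf("file_mapped = %ld\n", page_stat[MEMCG_NR_FILE_MAPPED]);
        return 0;
    }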

    Signed-off-by: Greg Thelen
    Signed-off-by: Andrea Righi
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Minchan Kim
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
  • The zone->lru_lock is heavily contended in workloads where activate_page()
    is frequently used. We can batch activate_page() calls to reduce the lock
    contention. The batched pages are added to the zone's LRU lists when the
    pool is full or when page reclaim tries to drain them.
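
    A minimal userspace sketch of the batching pattern just described (a
    fixed-size pool, drained under the lock when full; the page type and
    helpers are stand-ins, and the kernel keeps one such pagevec per CPU):

    #include <stdio.h>

    #define PAGEVEC_SIZE 14   /* same batch size the kernel's pagevec uses */

    struct pagevec {
        unsigned long nr;
        void *pages[PAGEVEC_SIZE];
    };

    /* Hypothetical stand-in for the work done under zone->lru_lock. */
    static void move_to_active_list(void *page)
    {
        printf("activate %p\n", page);
    }

    /* Drain: take the lock once and activate the whole batch. */
    static void pagevec_drain(struct pagevec *pvec)
    {
        /* spin_lock_irq(&zone->lru_lock) in the kernel */
        for (unsigned long i = 0; i < pvec->nr; i++)
            move_to_active_list(pvec->pages[i]);
        pvec->nr = 0;
        /* spin_unlock_irq(&zone->lru_lock) */
    }

    /* activate_page() analogue: only touch the lock when the pool fills. */
    static void activate_page(struct pagevec *pvec, void *page)
    {
        pvec->pages[pvec->nr++] = page;
        if (pvec->nr == PAGEVEC_SIZE)
            pagevec_drain(pvec);
    }

    int main(void)
    {
        struct pagevec pvec = { .nr = 0 };
        int dummy[32];

        for (int i = 0; i < 32; i++)
            activate_page(&pvec, &dummy[i]);
        pagevec_drain(&pvec);           /* reclaim-style final drain */
        return 0;
    }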

    For example, in a 4 socket, 64 CPU system, we create a sparse file and 64
    processes that shared-map the file. Each process reads the whole mapping
    and then exits; the process exit does unmap_vmas() and causes a lot of
    activate_page() calls. In such a workload we saw about a 58% total time
    reduction with the patch below. Other workloads with a lot of
    activate_page() calls also benefit.

    I tested some microbenchmarks:
    case-anon-cow-rand-mt 0.58%
    case-anon-cow-rand -3.30%
    case-anon-cow-seq-mt -0.51%
    case-anon-cow-seq -5.68%
    case-anon-r-rand-mt 0.23%
    case-anon-r-rand 0.81%
    case-anon-r-seq-mt -0.71%
    case-anon-r-seq -1.99%
    case-anon-rx-rand-mt 2.11%
    case-anon-rx-seq-mt 3.46%
    case-anon-w-rand-mt -0.03%
    case-anon-w-rand -0.50%
    case-anon-w-seq-mt -1.08%
    case-anon-w-seq -0.12%
    case-anon-wx-rand-mt -5.02%
    case-anon-wx-seq-mt -1.43%
    case-fork 1.65%
    case-fork-sleep -0.07%
    case-fork-withmem 1.39%
    case-hugetlb -0.59%
    case-lru-file-mmap-read-mt -0.54%
    case-lru-file-mmap-read 0.61%
    case-lru-file-mmap-read-rand -2.24%
    case-lru-file-readonce -0.64%
    case-lru-file-readtwice -11.69%
    case-lru-memcg -1.35%
    case-mmap-pread-rand-mt 1.88%
    case-mmap-pread-rand -15.26%
    case-mmap-pread-seq-mt 0.89%
    case-mmap-pread-seq -69.72%
    case-mmap-xread-rand-mt 0.71%
    case-mmap-xread-seq-mt 0.38%

    The most significant are:
    case-lru-file-readtwice -11.69%
    case-mmap-pread-rand -15.26%
    case-mmap-pread-seq -69.72%

    which use activate_page() a lot. The others are basically noise because
    each run has slight differences.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Shaohua Li
    Cc: Andi Kleen
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Clean up and remove duplicated code. The next patch will also use the
    pagevec_lru_move_fn() introduced here.

    Signed-off-by: Shaohua Li
    Cc: Andi Kleen
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • It's old-fashioned and unneeded.

    akpm:/usr/src/25> size mm/page_alloc.o
       text    data    bss     dec    hex filename
      39884 1241317  18808 1300009 13d629 mm/page_alloc.o (before)
      39838 1241317  18808 1299963 13d5fb mm/page_alloc.o (after)

    Acked-by: David Rientjes
    Acked-by: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • 2.6.37 added an unmap_and_move_huge_page() for memory failure recovery,
    but its anon_vma handling was still based around the 2.6.35 conventions.
    Update it to use page_lock_anon_vma, get_anon_vma, page_unlock_anon_vma,
    drop_anon_vma in the same way as we're now changing unmap_and_move().

    I don't particularly like to propose this for stable when I've not seen
    its problems in practice nor tested the solution: but it's clearly out of
    synch at present.

    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: "Jun'ichi Nomura"
    Cc: Andi Kleen
    Cc: [2.6.37, 2.6.36]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Increased usage of page migration in mmotm reveals that the anon_vma
    locking in unmap_and_move() has been deficient since 2.6.36 (or even
    earlier). Review at the time of f18194275c39835cb84563500995e0d503a32d9a
    ("mm: fix hang on anon_vma->root->lock") missed the issue here: the
    anon_vma to which we get a reference may already have been freed back to
    its slab (it is in use when we check page_mapped, but that can change),
    and so its anon_vma->root may be switched at any moment by reuse in
    anon_vma_prepare.

    Perhaps we could fix that with a get_anon_vma_unless_zero(), but let's
    not: just rely on page_lock_anon_vma() to do all the hard thinking for us,
    then we don't need any rcu read locking over here.

    In removing the rcu_unlock label: since PageAnon is a bit in
    page->mapping, it's impossible for a !page->mapping page to be anon; but
    insert VM_BUG_ON in case the implementation ever changes.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Hugh Dickins
    Reviewed-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: "Jun'ichi Nomura"
    Cc: Andi Kleen
    Cc: [2.6.37, 2.6.36]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • It was hard to explain the page counts which were causing new LTP tests
    of KSM to fail: we need to drain the per-cpu pagevecs to LRU occasionally.

    Signed-off-by: Hugh Dickins
    Reported-by: CAI Qian
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When parsing changes to the huge page pool sizes made from userspace via
    the sysfs interface, bogus input values are being covered up by
    nr_hugepages_store_common and nr_overcommit_hugepages_store returning 0
    when strict_strtoul returns an error. This can cause an infinite loop in
    the nr_hugepages_store code. This patch changes the return value for
    these functions to -EINVAL when strict_strtoul returns an error.
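
    A hedged sketch of the change in return convention, with strtoul()
    standing in for strict_strtoul() and the store helper simplified for
    illustration:

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Simplified store helper: returns the parsed count, or -EINVAL on
     * bogus input instead of the old 0 (which made the sysfs caller
     * believe nothing was consumed and retry forever). */
    static long nr_hugepages_store(const char *buf)
    {
        char *end;
        unsigned long count;

        errno = 0;
        count = strtoul(buf, &end, 10);
        if (errno || end == buf || *end != '\0')
            return -EINVAL;

        /* ... apply the new pool size ... */
        return (long)count;
    }

    int main(void)
    {
        printf("%ld\n", nr_hugepages_store("128"));    /* 128 */
        printf("%ld\n", nr_hugepages_store("bogus"));  /* -EINVAL */
        return 0;
    }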

    Signed-off-by: Eric B Munson
    Reported-by: CAI Qian
    Cc: Andrea Arcangeli
    Cc: Eric B Munson
    Cc: Michal Hocko
    Cc: Nishanth Aravamudan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     
  • Huge pages with order >= MAX_ORDER must be allocated at boot via the
    kernel command line; they cannot be allocated or freed once the kernel is
    up and running. Currently we allow values to be written to the sysfs and
    sysctl files controlling the pool size for these huge page sizes. This
    patch makes the store functions for nr_hugepages and nr_overcommit_hugepages
    return -EINVAL when the pool for a page size >= MAX_ORDER is changed.

    [akpm@linux-foundation.org: avoid multiple return paths in nr_hugepages_store_common()]
    [caiqian@redhat.com: add checking in hugetlb_overcommit_handler()]
    Signed-off-by: Eric B Munson
    Reported-by: CAI Qian
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Cc: Nishanth Aravamudan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     
  • proc_doulongvec_minmax may fail if the given buffer doesn't represent a
    valid number. If we provide something invalid we will initialize the
    resulting value (nr_overcommit_huge_pages in this case) to a random value
    from the stack.

    The issue was introduced by a3d0c6aa, when the default handler was
    replaced by the helper function, where the return value is not checked.

    Reproducer:
    echo "" > /proc/sys/vm/nr_overcommit_hugepages

    [akpm@linux-foundation.org: correctly propagate proc_doulongvec_minmax return code]
    Signed-off-by: Michal Hocko
    Cc: CAI Qian
    Cc: Nishanth Aravamudan
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • As it stands this code will degenerate into a busy-wait if the calling task
    has signal_pending().

    Cc: Rolf Eike Beer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • dma_pool_free() scans for the page to free in the pool list while holding
    the pool lock, then releases the lock basically to acquire it immediately
    again. Modify the code to only take the lock once.

    This will do some additional loops and computations with the lock held
    if memory debugging is activated. If it is not activated, the only new
    operations under this lock are one if and one subtraction.

    Signed-off-by: Rolf Eike Beer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rolf Eike Beer
     
  • The previous approach to calculating the combined index was

    page_idx & ~(1 << order)

    but we get the same result with

    page_idx & buddy_idx

    This reduces instructions slightly and enhances readability.
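
    A quick self-contained check of the equivalence, using
    buddy_idx = page_idx ^ (1 << order):

    #include <assert.h>
    #include <stdio.h>

    int main(void)
    {
        for (unsigned int order = 0; order < 11; order++) {
            for (unsigned long page_idx = 0; page_idx < 4096; page_idx++) {
                unsigned long buddy_idx = page_idx ^ (1UL << order);
                unsigned long old_way = page_idx & ~(1UL << order);
                unsigned long new_way = page_idx & buddy_idx;

                assert(old_way == new_way);
            }
        }
        printf("page_idx & buddy_idx == page_idx & ~(1 << order) holds\n");
        return 0;
    }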

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix used-unintialised warning]
    Signed-off-by: KyongHo Cho
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KyongHo Cho
     
  • Even if CONFIG_COMPAT_BRK is set in the kernel configuration, it can still
    be overridden by the randomize_va_space sysctl.

    If this is the case, the min_brk computation in the sys_brk()
    implementation is wrong, as it solely takes the COMPAT_BRK setting into
    account, assuming that the brk start is not randomized. But that might
    not be the case if the randomize_va_space sysctl has been set to '2' at
    the time the binary was loaded from disk.

    In such case, the check has to be done in a same way as in
    !CONFIG_COMPAT_BRK case.

    In addition to that, the check for the COMPAT_BRK case introduced back in
    a5b4592c ("brk: make sys_brk() honor COMPAT_BRK when computing lower
    bound") is slightly wrong -- the lower bound shouldn't be mm->end_code
    but mm->end_data, as that's where legacy applications expect the brk
    section to start (i.e. immediately after the last global variable).
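
    A userspace sketch of the corrected lower-bound choice described above;
    struct mm_fields and the compat_brk/brk_randomized flags are illustrative
    stand-ins for the kernel's state, not the actual implementation:

    #include <stdio.h>

    struct mm_fields {
        unsigned long start_brk;  /* (possibly randomized) brk start */
        unsigned long end_data;   /* end of the data segment */
    };

    static unsigned long min_brk(const struct mm_fields *mm, int compat_brk,
                                 int brk_randomized)
    {
        if (compat_brk && !brk_randomized)
            return mm->end_data;   /* legacy apps: brk follows the data segment */
        return mm->start_brk;      /* randomized (or !COMPAT_BRK) case */
    }

    int main(void)
    {
        struct mm_fields mm = { .start_brk = 0x7f0000000000UL,
                                .end_data  = 0x601000UL };

        printf("%#lx\n", min_brk(&mm, 1, 1));   /* randomize_va_space == 2 */
        printf("%#lx\n", min_brk(&mm, 1, 0));   /* classic COMPAT_BRK case */
        return 0;
    }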

    [akpm@linux-foundation.org: fix comment]
    Signed-off-by: Jiri Kosina
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Kosina
     
  • The NODEMASK_ALLOC macro may dynamically allocate memory for its second
    argument ('nodes_allowed' in this context).

    In nr_hugepages_store_common() we may abort early if strict_strtoul()
    fails, but in that case we do not free the memory already allocated to
    'nodes_allowed', causing a memory leak.

    This patch closes the leak by freeing the memory in the error path.
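
    A minimal sketch of the error-path fix, with malloc()/free() standing in
    for NODEMASK_ALLOC()/NODEMASK_FREE() and a simplified parse check:

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical stand-in for the failing strict_strtoul() call. */
    static int bad_input(const char *buf)
    {
        char *end;

        strtoul(buf, &end, 10);
        return end == buf;
    }

    static int nr_hugepages_store_common(const char *buf)
    {
        void *nodes_allowed = malloc(sizeof(unsigned long) * 8);

        if (!nodes_allowed)
            return -ENOMEM;

        if (bad_input(buf)) {
            free(nodes_allowed);    /* the leak: this free was missing */
            return -EINVAL;
        }

        /* ... adjust the pool using nodes_allowed, then ... */
        free(nodes_allowed);
        return 0;
    }

    int main(void)
    {
        printf("%d\n", nr_hugepages_store_common("not-a-number"));
        return 0;
    }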

    [akpm@linux-foundation.org: use NODEMASK_FREE, per Minchan Kim]
    Signed-off-by: Jesper Juhl
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Juhl
     
  • …lot during file page migration

    migrate_pages() -> unmap_and_move() only calls rcu_read_lock() for
    anonymous pages, as introduced by git commit
    989f89c57e6361e7d16fbd9572b5da7d313b073d ("fix rcu_read_lock() in page
    migraton"). The point of the RCU protection there is part of getting a
    stable reference to anon_vma and is only held for anon pages as file pages
    are locked which is sufficient protection against freeing.

    However, while a file page's mapping is being migrated, the radix tree is
    double checked to ensure it is the expected page. This uses
    radix_tree_deref_slot() -> rcu_dereference() without the RCU lock held,
    triggering the following warning.

    [ 173.674290] ===================================================
    [ 173.676016] [ INFO: suspicious rcu_dereference_check() usage. ]
    [ 173.676016] ---------------------------------------------------
    [ 173.676016] include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!
    [ 173.676016]
    [ 173.676016] other info that might help us debug this:
    [ 173.676016]
    [ 173.676016]
    [ 173.676016] rcu_scheduler_active = 1, debug_locks = 0
    [ 173.676016] 1 lock held by hugeadm/2899:
    [ 173.676016] #0: (&(&inode->i_data.tree_lock)->rlock){..-.-.}, at: [<c10e3d2b>] migrate_page_move_mapping+0x40/0x1ab
    [ 173.676016]
    [ 173.676016] stack backtrace:
    [ 173.676016] Pid: 2899, comm: hugeadm Not tainted 2.6.37-rc5-autobuild
    [ 173.676016] Call Trace:
    [ 173.676016] [<c128cc01>] ? printk+0x14/0x1b
    [ 173.676016] [<c1063502>] lockdep_rcu_dereference+0x7d/0x86
    [ 173.676016] [<c10e3db5>] migrate_page_move_mapping+0xca/0x1ab
    [ 173.676016] [<c10e41ad>] migrate_page+0x23/0x39
    [ 173.676016] [<c10e491b>] buffer_migrate_page+0x22/0x107
    [ 173.676016] [<c10e48f9>] ? buffer_migrate_page+0x0/0x107
    [ 173.676016] [<c10e425d>] move_to_new_page+0x9a/0x1ae
    [ 173.676016] [<c10e47e6>] migrate_pages+0x1e7/0x2fa

    This patch introduces radix_tree_deref_slot_protected() which calls
    rcu_dereference_protected(). Users of it must pass in the
    mapping->tree_lock that is protecting this dereference. Holding the tree
    lock protects against parallel updaters of the radix tree meaning that
    rcu_dereference_protected is allowable.

    [akpm@linux-foundation.org: remove unneeded casts]
    Signed-off-by: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Milton Miller
    Cc: Nick Piggin
    Cc: Wu Fengguang
    Cc: [2.6.37.early]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Clean up some code with a common compound_trans_head() helper.

    Signed-off-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Marcelo Tosatti
    Cc: Avi Kivity
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This makes KSM fully operational with THP pages. Subpages are scanned
    while the hugepage is still in place, delivering maximum CPU performance;
    only if there is a match, and we are going to deduplicate memory, is the
    single hugepage containing the matching subpage split.

    There will be no false sharing between ksmd and khugepaged. khugepaged
    won't collapse 2M virtual regions with KSM pages inside. ksmd should also
    only split pages when the checksum matches and we're likely to split a
    hugepage for some long-living KSM page (the usual KSM heuristic to avoid
    sharing pages that get de-COWed).

    Signed-off-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • MADV_HUGEPAGE and MADV_NOHUGEPAGE were fully effective only if run after
    mmap and before touching the memory. While this is enough for most
    usages, it's little effort to make madvise more dynamic at runtime on an
    existing mapping by making khugepaged aware of madvise.

    MADV_HUGEPAGE: register in khugepaged immediately, without waiting for a
    page fault (which may never happen if all pages are already mapped and
    the "enabled" knob was set to madvise during the initial page faults).

    MADV_NOHUGEPAGE: skip vmas marked VM_NOHUGEPAGE in khugepaged to stop
    collapsing pages where it is not needed.
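
    A small usage sketch of the two hints on an already-populated mapping
    (the fallback constants match the kernel ABI values, in case the libc
    headers predate them):

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #ifndef MADV_HUGEPAGE
    #define MADV_HUGEPAGE 14        /* kernel ABI value */
    #endif
    #ifndef MADV_NOHUGEPAGE
    #define MADV_NOHUGEPAGE 15
    #endif

    int main(void)
    {
        size_t len = 16UL << 20;                 /* 16 MiB mapping */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return 1;

        memset(buf, 0, len);                     /* fault the pages in first */

        /* Ask khugepaged to start collapsing this already-mapped range. */
        if (madvise(buf, len, MADV_HUGEPAGE))
            perror("madvise(MADV_HUGEPAGE)");

        /* ... later, opt the range back out of collapsing ... */
        if (madvise(buf, len, MADV_NOHUGEPAGE))
            perror("madvise(MADV_NOHUGEPAGE)");

        munmap(buf, len);
        return 0;
    }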

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: Andrea Arcangeli
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Add madvise MADV_NOHUGEPAGE to mark regions that are not important to be
    hugepage backed. Return -EINVAL if the vma is not of an anonymous type,
    or the feature isn't built into the kernel. Never silently return
    success.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Read compound_trans_order safely. Noop for CONFIG_TRANSPARENT_HUGEPAGE=n.

    Signed-off-by: Andrea Arcangeli
    Cc: Daisuke Nishimura
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • hugetlbfs was changed to allow memory failure to migrate the hugetlbfs
    pages and that broke THP as split_huge_page was then called on hugetlbfs
    pages too.

    compound_head/order was also run unsafely on THP pages, which can be
    split at any time.

    All compound_head() invocations in memory-failure.c that are run on pages
    that aren't pinned and that can be freed and reused from under us (while
    compound_head is running) are buggy, because compound_head can return a
    dangling pointer. I'm not fixing this here: it is a generic memory-failure
    bug, not specific to THP, and it applies to hugetlbfs too, so it can be
    fixed later, after THP is merged upstream.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Add debug checks for invariants that, if broken, could lead to the
    mapcount vs. page_mapcount debug checks triggering later in
    split_huge_page.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Make sure we scale up nr_rotated when we encounter a referenced
    transparent huge page. This ensures pageout scanning balance is not
    distorted when there are huge pages on the LRU.

    Signed-off-by: Rik van Riel
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Count each transparent hugepage as HPAGE_PMD_NR pages in the LRU
    statistics, so the Active(anon) and Inactive(anon) statistics in
    /proc/meminfo are correct.

    Signed-off-by: Rik van Riel
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • On small systems, the extra memory used by the anti-fragmentation memory
    reserve, and the fact that huge pages are so much bigger than base pages,
    can easily outweigh the benefit of fewer TLB misses.

    A less obvious concern arises when running on a NUMA machine with
    asymmetric node sizes, one of which is very small: the reserve could make
    that node unusable.

    In case of the crashdump kernel, OOMs have been observed due to the
    anti-fragmentation memory reserve taking up a large fraction of the
    crashdump image.

    This patch disables transparent hugepages on systems with less than 1GB of
    RAM, but the hugepage subsystem is fully initialized so administrators can
    enable THP through /sys if desired.

    Signed-off-by: Rik van Riel
    Acked-by: Avi Kivity
    Signed-off-by: Andrea Arcangeli
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • It makes no sense not to enable compaction for small order pages as we
    don't want to end up with bad order 2 allocations and good and graceful
    order 9 allocations.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This takes advantage of memory compaction to properly generate pages of
    order > 0 if regular page reclaim fails and priority level becomes more
    severe and we don't reach the proper watermarks.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli