18 Jan, 2011

2 commits

  • This reverts commit d8505dee1a87b8d41b9c4ee1325cd72258226fbc.

    Chris Mason ended up chasing down some page allocation errors and pages
    stuck waiting on the IO scheduler, and was able to narrow it down to two
    commits: commit 744ed1442757 ("mm: batch activate_page() to reduce lock
    contention") and d8505dee1a87 ("mm: simplify code of swap.c").

    This reverts the second one.

    Reported-and-debugged-by: Chris Mason
    Cc: Mel Gorman
    Cc: Andrew Morton
    Cc: Jens Axboe
    Cc: linux-mm
    Cc: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Cc: Shaohua Li
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • This reverts commit 744ed1442757767ffede5008bb13e0805085902e.

    Chris Mason ended up chasing down some page allocation errors and pages
    stuck waiting on the IO scheduler, and was able to narrow it down to two
    commits: commit 744ed1442757 ("mm: batch activate_page() to reduce lock
    contention") and d8505dee1a87 ("mm: simplify code of swap.c").

    This reverts the first of them.

    Reported-and-debugged-by: Chris Mason
    Cc: Mel Gorman
    Cc: Andrew Morton
    Cc: Jens Axboe
    Cc: linux-mm
    Cc: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Cc: Shaohua Li
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

17 Jan, 2011

1 commit

  • pmdp_get_and_clear/pmdp_clear_flush/pmdp_splitting_flush were trapped as
    BUG() and they were defined only to diminish the risk of build issues on
    not-x86 archs and to be consistent with the generic pte methods previously
    defined in include/asm-generic/pgtable.h.

    But they are causing more trouble than they were supposed to solve, so
    it's simpler not to define them when THP is off.

    This also corrects the export of pmdp_splitting_flush, which is currently
    unused (x86 isn't using the generic implementation in mm/pgtable-generic.c
    and no other arch needs it [yet]).

    Signed-off-by: Andrea Arcangeli
    Sam Ravnborg
    Cc: Stephen Rothwell
    Cc: "David S. Miller"
    Cc: Benjamin Herrenschmidt
    Cc: "Luck, Tony"
    Cc: James Bottomley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

15 Jan, 2011

2 commits


14 Jan, 2011

35 commits

  • * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (59 commits)
    ACPI / PM: Fix build problems for !CONFIG_ACPI related to NVS rework
    ACPI: fix resource check message
    ACPI / Battery: Update information on info notification and resume
    ACPI: Drop device flag wake_capable
    ACPI: Always check if _PRW is present before trying to evaluate it
    ACPI / PM: Check status of power resources under mutexes
    ACPI / PM: Rename acpi_power_off_device()
    ACPI / PM: Drop acpi_power_nocheck
    ACPI / PM: Drop acpi_bus_get_power()
    Platform / x86: Make fujitsu_laptop use acpi_bus_update_power()
    ACPI / Fan: Rework the handling of power resources
    ACPI / PM: Register power resource devices as soon as they are needed
    ACPI / PM: Register acpi_power_driver early
    ACPI / PM: Add function for updating device power state consistently
    ACPI / PM: Add function for device power state initialization
    ACPI / PM: Introduce __acpi_bus_get_power()
    ACPI / PM: Introduce function for refcounting device power resources
    ACPI / PM: Add functions for manipulating lists of power resources
    ACPI / PM: Prevent acpi_power_get_inferred_state() from making changes
    ACPICA: Update version to 20101209
    ...

    Linus Torvalds
     
  • In the current implementation mem_cgroup_end_migration() decides whether
    the page migration has succeeded or not by checking "oldpage->mapping".

    But if we are trying to migrate a shmem swapcache, its page->mapping is
    NULL from the beginning, so the check is invalid. As a result,
    mem_cgroup_end_migration() assumes the migration has succeeded even when
    it hasn't, so "newpage" would be freed while it is not uncharged.

    This patch fixes it by passing mem_cgroup_end_migration() the result of
    the page migration.

    Signed-off-by: Daisuke Nishimura
    Reviewed-by: Minchan Kim
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Cc: Minchan Kim
    Reviewed-by: Johannes Weiner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • In mem_cgroup_alloc() we currently do either kmalloc() or vmalloc(),
    followed by memset() to zero the memory. This can be more efficiently
    achieved by using kzalloc() and vzalloc(). There's also one situation
    where we can use kzalloc_node() - this is what's new in this version of
    the patch.

    Signed-off-by: Jesper Juhl
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Wu Fengguang
    Cc: Balbir Singh
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Juhl
     
  • Commit b1dd693e ("memcg: avoid deadlock between move charge and
    try_charge()") can cause another deadlock about mmap_sem on task migration
    if cpuset and memcg are mounted onto the same mount point.

    After the commit, cgroup_attach_task() has a sequence like:

    cgroup_attach_task()
      ss->can_attach()
        cpuset_can_attach()
        mem_cgroup_can_attach()
          down_read(&mmap_sem)        (1)
      ss->attach()
        cpuset_attach()
          mpol_rebind_mm()
            down_write(&mmap_sem)     (2)
            up_write(&mmap_sem)
          cpuset_migrate_mm()
            do_migrate_pages()
              down_read(&mmap_sem)
              up_read(&mmap_sem)
        mem_cgroup_move_task()
          mem_cgroup_clear_mc()
            up_read(&mmap_sem)

    We can deadlock at (2) because we have already acquired the mmap_sem at (1).

    But the commit itself is necessary to fix deadlocks which have existed
    before the commit like:

    Ex.1)
                  move charge                 |        try charge
    --------------------------------------+------------------------------
      mem_cgroup_can_attach()             |  down_write(&mmap_sem)
        mc.moving_task = current          |    ..
        mem_cgroup_precharge_mc()         |  __mem_cgroup_try_charge()
          mem_cgroup_count_precharge()    |    prepare_to_wait()
            down_read(&mmap_sem)          |    if (mc.moving_task)
            -> cannot acquire the lock    |    -> true
                                          |      schedule()
                                          |      -> move charge should wake it up

    Ex.2)
                  move charge                 |        try charge
    --------------------------------------+------------------------------
      mem_cgroup_can_attach()             |
        mc.moving_task = current          |
        mem_cgroup_precharge_mc()         |
          mem_cgroup_count_precharge()    |
            down_read(&mmap_sem)          |
            ..                            |
            up_read(&mmap_sem)            |
                                          |  down_write(&mmap_sem)
      mem_cgroup_move_task()              |    ..
        mem_cgroup_move_charge()          |  __mem_cgroup_try_charge()
          down_read(&mmap_sem)            |    prepare_to_wait()
          -> cannot acquire the lock      |    if (mc.moving_task)
                                          |    -> true
                                          |      schedule()
                                          |      -> move charge should wake it up

    This patch fixes all of these problems by:
    1. reverting the commit.
    2. To fix Ex.1, we set mc.moving_task only after mem_cgroup_count_precharge()
    has released the mmap_sem.
    3. To fix Ex.2, we use down_read_trylock() instead of down_read() in
    mem_cgroup_move_charge() and, if it fails to acquire the lock, cancel
    all extra charges, wake up all waiters, and retry the trylock (see the
    sketch below).
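
    A minimal userspace sketch of the trylock-and-retry pattern in point 3,
    with a pthreads rwlock standing in for mmap_sem and hypothetical helpers
    for the memcg-specific steps:

    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    static pthread_rwlock_t mmap_sem_analogue = PTHREAD_RWLOCK_INITIALIZER;

    /* Hypothetical stand-ins for the memcg-specific work. */
    static void cancel_extra_charges(void) { puts("cancel extra charges"); }
    static void wake_up_waiters(void)      { puts("wake up waiters"); }

    static void move_charge(void)
    {
        /* Instead of blocking in down_read() (which can deadlock against
         * mpol_rebind_mm()'s down_write()), back off and retry. */
        while (pthread_rwlock_tryrdlock(&mmap_sem_analogue) != 0) {
            cancel_extra_charges();
            wake_up_waiters();
            sched_yield();          /* give the writer a chance, then retry */
        }

        /* ... walk the mm and move the charges ... */

        pthread_rwlock_unlock(&mmap_sem_analogue);
    }

    int main(void)
    {
        move_charge();
        return 0;
    }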

    Signed-off-by: Daisuke Nishimura
    Reported-by: Ben Blum
    Cc: Miao Xie
    Cc: David Rientjes
    Cc: Paul Menage
    Cc: Hiroyuki Kamezawa
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • Signed-off-by: Minchan Kim
    Acked-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Adding the number of swap pages to the byte limit of a memory control
    group makes no sense. Convert the pages to bytes before adding them.

    The only user of this code is the OOM killer, and the way it is used means
    that the error results in a higher OOM badness value. Since the cgroup
    limit is the same for all tasks in the cgroup, the error should have no
    practical impact at the moment.

    But let's not wait for future or changing users to trip over it.
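
    A rough illustration of the unit fix, assuming 4 KiB pages and made-up
    numbers:

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12   /* assumption: 4 KiB pages */

    int main(void)
    {
        uint64_t limit_bytes = 512ULL << 20;    /* memsw limit, in bytes */
        uint64_t total_swap_pages = 131072;     /* swap size, in pages */

        /* Wrong: adds a page count to a byte count. */
        uint64_t mixed_units = limit_bytes + total_swap_pages;
        /* Right: convert pages to bytes before adding. */
        uint64_t consistent = limit_bytes + (total_swap_pages << PAGE_SHIFT);

        printf("mixed=%llu bytes, consistent=%llu bytes\n",
               (unsigned long long)mixed_units, (unsigned long long)consistent);
        return 0;
    }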

    Signed-off-by: Johannes Weiner
    Cc: Greg Thelen
    Cc: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Introduce a new bit spin lock, PCG_MOVE_LOCK, to synchronize the page
    accounting and migration code. This reworks the locking scheme of
    _update_stat() and _move_account() by adding the new lock bit
    PCG_MOVE_LOCK, which is always taken with IRQs disabled.

    1. If pages are being migrated from a memcg, then updates to that
    memcg page statistics are protected by grabbing PCG_MOVE_LOCK using
    move_lock_page_cgroup(). In an upcoming commit, memcg dirty page
    accounting will be updating memcg page accounting (specifically: num
    writeback pages) from IRQ context (softirq). Avoid a deadlocking
    nested spin lock attempt by disabling irq on the local processor when
    grabbing the PCG_MOVE_LOCK.

    2. The lock for update_page_stat is used only to avoid races with
    move_account(), so IRQ awareness of lock_page_cgroup() itself is not
    a problem. The problem is the race between mem_cgroup_update_page_stat()
    and mem_cgroup_move_account_page().
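
    A userspace sketch of the bit-spin-lock half of this scheme, using C11
    atomics; the kernel takes a page_cgroup flag bit via bit_spin_lock() with
    local_irq_save(), which has no userspace equivalent, so only the bit
    locking is modelled and the bit position is illustrative:

    #include <stdatomic.h>
    #include <stdio.h>

    #define PCG_MOVE_LOCK_BIT 0     /* illustrative bit position */

    static atomic_ulong pc_flags;   /* stands in for page_cgroup->flags */

    static void move_lock_page_cgroup(void)
    {
        /* Kernel: local_irq_save() first, so a softirq updating writeback
         * counters cannot spin on a lock its own CPU already holds. */
        while (atomic_fetch_or_explicit(&pc_flags, 1UL << PCG_MOVE_LOCK_BIT,
                                        memory_order_acquire) &
               (1UL << PCG_MOVE_LOCK_BIT))
            ;                       /* bit was already set: spin */
    }

    static void move_unlock_page_cgroup(void)
    {
        atomic_fetch_and_explicit(&pc_flags, ~(1UL << PCG_MOVE_LOCK_BIT),
                                  memory_order_release);
        /* Kernel: local_irq_restore() afterwards. */
    }

    int main(void)
    {
        move_lock_page_cgroup();
        /* ... update or move the page statistics ... */
        move_unlock_page_cgroup();
        puts("took and released PCG_MOVE_LOCK");
        return 0;
    }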

    Trade-off:
    * Changing lock_page_cgroup() to always disable IRQs (or
    local_bh) has some impact on performance, and I think
    it's bad to disable IRQs when it's not necessary.
    * Adding a new lock makes move_account() slower. Scores are
    below.

    Performance Impact: moving a 8G anon process.

    Before:
    real 0m0.792s
    user 0m0.000s
    sys 0m0.780s

    After:
    real 0m0.854s
    user 0m0.000s
    sys 0m0.842s

    This score is bad but planned patches for optimization can reduce
    this impact.

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Greg Thelen
    Reviewed-by: Minchan Kim
    Acked-by: Daisuke Nishimura
    Cc: Andrea Righi
    Cc: Balbir Singh
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Replace usage of the mem_cgroup_update_file_mapped() memcg
    statistic update routine with two new routines:
    * mem_cgroup_inc_page_stat()
    * mem_cgroup_dec_page_stat()

    As before, only the file_mapped statistic is managed. However, these more
    general interfaces allow for new statistics to be more easily added. New
    statistics are added with memcg dirty page accounting.
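
    A minimal sketch of the shape of the new interface; in the kernel the
    functions take the page and update per-memcg counters, but a flat array
    is enough to show how new statistics slot in:

    #include <stdio.h>

    enum mem_cgroup_page_stat_item {
        MEMCG_NR_FILE_MAPPED,   /* the only statistic managed so far */
        /* future items (e.g. dirty, writeback) slot in here */
        MEMCG_NR_STAT_ITEMS,
    };

    static long page_stat[MEMCG_NR_STAT_ITEMS];

    static void mem_cgroup_inc_page_stat(enum mem_cgroup_page_stat_item idx)
    {
        page_stat[idx]++;
    }

    static void mem_cgroup_dec_page_stat(enum mem_cgroup_page_stat_item idx)
    {
        page_stat[idx]--;
    }

    int main(void)
    {
        mem_cgroup_inc_page_stat(MEMCG_NR_FILE_MAPPED);   /* e.g. on map */
        mem_cgroup_dec_page_stat(MEMCG_NR_FILE_MAPPED);   /* e.g. on unmap */
        printf("file_mapped = %ld\n", page_stat[MEMCG_NR_FILE_MAPPED]);
        return 0;
    }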

    Signed-off-by: Greg Thelen
    Signed-off-by: Andrea Righi
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Minchan Kim
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
  • The zone->lru_lock is heavily contended in workloads where activate_page()
    is frequently used. We can batch activate_page() calls to reduce the lock
    contention. The batched pages are added to the zone's LRU lists when the
    pool is full or when page reclaim tries to drain them.
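
    A minimal userspace sketch of the batching pattern just described (a
    fixed-size pool, drained under the lock when full; the page type and
    helpers are stand-ins, and the kernel keeps one such pagevec per CPU):

    #include <stdio.h>

    #define PAGEVEC_SIZE 14   /* same batch size the kernel's pagevec uses */

    struct pagevec {
        unsigned long nr;
        void *pages[PAGEVEC_SIZE];
    };

    /* Hypothetical stand-in for the work done under zone->lru_lock. */
    static void move_to_active_list(void *page)
    {
        printf("activate %p\n", page);
    }

    /* Drain: take the lock once and activate the whole batch. */
    static void pagevec_drain(struct pagevec *pvec)
    {
        /* spin_lock_irq(&zone->lru_lock) in the kernel */
        for (unsigned long i = 0; i < pvec->nr; i++)
            move_to_active_list(pvec->pages[i]);
        pvec->nr = 0;
        /* spin_unlock_irq(&zone->lru_lock) */
    }

    /* activate_page() analogue: only touch the lock when the pool fills. */
    static void activate_page(struct pagevec *pvec, void *page)
    {
        pvec->pages[pvec->nr++] = page;
        if (pvec->nr == PAGEVEC_SIZE)
            pagevec_drain(pvec);
    }

    int main(void)
    {
        struct pagevec pvec = { .nr = 0 };
        int dummy[32];

        for (int i = 0; i < 32; i++)
            activate_page(&pvec, &dummy[i]);
        pagevec_drain(&pvec);           /* reclaim-style final drain */
        return 0;
    }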

    For example, in a 4 socket, 64 CPU system, we create a sparse file and 64
    processes that shared-map the file. Each process reads the whole mapping
    and then exits; the process exit does unmap_vmas() and causes a lot of
    activate_page() calls. In such a workload we saw about a 58% total time
    reduction with the patch below. Other workloads with a lot of
    activate_page() calls also benefit.

    I tested some microbenchmarks:
    case-anon-cow-rand-mt 0.58%
    case-anon-cow-rand -3.30%
    case-anon-cow-seq-mt -0.51%
    case-anon-cow-seq -5.68%
    case-anon-r-rand-mt 0.23%
    case-anon-r-rand 0.81%
    case-anon-r-seq-mt -0.71%
    case-anon-r-seq -1.99%
    case-anon-rx-rand-mt 2.11%
    case-anon-rx-seq-mt 3.46%
    case-anon-w-rand-mt -0.03%
    case-anon-w-rand -0.50%
    case-anon-w-seq-mt -1.08%
    case-anon-w-seq -0.12%
    case-anon-wx-rand-mt -5.02%
    case-anon-wx-seq-mt -1.43%
    case-fork 1.65%
    case-fork-sleep -0.07%
    case-fork-withmem 1.39%
    case-hugetlb -0.59%
    case-lru-file-mmap-read-mt -0.54%
    case-lru-file-mmap-read 0.61%
    case-lru-file-mmap-read-rand -2.24%
    case-lru-file-readonce -0.64%
    case-lru-file-readtwice -11.69%
    case-lru-memcg -1.35%
    case-mmap-pread-rand-mt 1.88%
    case-mmap-pread-rand -15.26%
    case-mmap-pread-seq-mt 0.89%
    case-mmap-pread-seq -69.72%
    case-mmap-xread-rand-mt 0.71%
    case-mmap-xread-seq-mt 0.38%

    The most significant are:
    case-lru-file-readtwice -11.69%
    case-mmap-pread-rand -15.26%
    case-mmap-pread-seq -69.72%

    which use activate_page() a lot. The others are basically noise because
    each run has slight differences.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Shaohua Li
    Cc: Andi Kleen
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Clean up and remove duplicated code. The next patch will also use the
    pagevec_lru_move_fn() introduced here.

    Signed-off-by: Shaohua Li
    Cc: Andi Kleen
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • It's old-fashioned and unneeded.

    akpm:/usr/src/25> size mm/page_alloc.o
       text    data    bss     dec    hex filename
      39884 1241317  18808 1300009 13d629 mm/page_alloc.o (before)
      39838 1241317  18808 1299963 13d5fb mm/page_alloc.o (after)

    Acked-by: David Rientjes
    Acked-by: Mel Gorman
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • 2.6.37 added an unmap_and_move_huge_page() for memory failure recovery,
    but its anon_vma handling was still based around the 2.6.35 conventions.
    Update it to use page_lock_anon_vma, get_anon_vma, page_unlock_anon_vma,
    drop_anon_vma in the same way as we're now changing unmap_and_move().

    I don't particularly like to propose this for stable when I've not seen
    its problems in practice nor tested the solution: but it's clearly out of
    synch at present.

    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: "Jun'ichi Nomura"
    Cc: Andi Kleen
    Cc: [2.6.37, 2.6.36]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Increased usage of page migration in mmotm reveals that the anon_vma
    locking in unmap_and_move() has been deficient since 2.6.36 (or even
    earlier). Review at the time of f18194275c39835cb84563500995e0d503a32d9a
    ("mm: fix hang on anon_vma->root->lock") missed the issue here: the
    anon_vma to which we get a reference may already have been freed back to
    its slab (it is in use when we check page_mapped, but that can change),
    and so its anon_vma->root may be switched at any moment by reuse in
    anon_vma_prepare.

    Perhaps we could fix that with a get_anon_vma_unless_zero(), but let's
    not: just rely on page_lock_anon_vma() to do all the hard thinking for us,
    then we don't need any rcu read locking over here.

    In removing the rcu_unlock label: since PageAnon is a bit in
    page->mapping, it's impossible for a !page->mapping page to be anon; but
    insert VM_BUG_ON in case the implementation ever changes.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Hugh Dickins
    Reviewed-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: "Jun'ichi Nomura"
    Cc: Andi Kleen
    Cc: [2.6.37, 2.6.36]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • It was hard to explain the page counts which were causing new LTP tests
    of KSM to fail: we need to drain the per-cpu pagevecs to LRU occasionally.

    Signed-off-by: Hugh Dickins
    Reported-by: CAI Qian
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When parsing changes to the huge page pool sizes made from userspace via
    the sysfs interface, bogus input values are being covered up by
    nr_hugepages_store_common and nr_overcommit_hugepages_store returning 0
    when strict_strtoul returns an error. This can cause an infinite loop in
    the nr_hugepages_store code. This patch changes the return value for
    these functions to -EINVAL when strict_strtoul returns an error.
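
    A hedged sketch of the change in return convention, with strtoul()
    standing in for strict_strtoul() and the store helper simplified for
    illustration:

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Simplified store helper: returns the parsed count, or -EINVAL on
     * bogus input instead of the old 0 (which made the sysfs caller
     * believe nothing was consumed and retry forever). */
    static long nr_hugepages_store(const char *buf)
    {
        char *end;
        unsigned long count;

        errno = 0;
        count = strtoul(buf, &end, 10);
        if (errno || end == buf || *end != '\0')
            return -EINVAL;

        /* ... apply the new pool size ... */
        return (long)count;
    }

    int main(void)
    {
        printf("%ld\n", nr_hugepages_store("128"));    /* 128 */
        printf("%ld\n", nr_hugepages_store("bogus"));  /* -EINVAL */
        return 0;
    }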

    Signed-off-by: Eric B Munson
    Reported-by: CAI Qian
    Cc: Andrea Arcangeli
    Cc: Eric B Munson
    Cc: Michal Hocko
    Cc: Nishanth Aravamudan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     
  • Huge pages with order >= MAX_ORDER must be allocated at boot via the
    kernel command line; they cannot be allocated or freed once the kernel is
    up and running. Currently we allow values to be written to the sysfs and
    sysctl files controlling the pool size for these huge page sizes. This
    patch makes the store functions for nr_hugepages and nr_overcommit_hugepages
    return -EINVAL when the pool for a page size >= MAX_ORDER is changed.

    [akpm@linux-foundation.org: avoid multiple return paths in nr_hugepages_store_common()]
    [caiqian@redhat.com: add checking in hugetlb_overcommit_handler()]
    Signed-off-by: Eric B Munson
    Reported-by: CAI Qian
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Cc: Nishanth Aravamudan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     
  • proc_doulongvec_minmax may fail if the given buffer doesn't represent a
    valid number. If we provide something invalid we will initialize the
    resulting value (nr_overcommit_huge_pages in this case) to a random value
    from the stack.

    The issue was introduced by a3d0c6aa, when the default handler was
    replaced by the helper function, where the return value is not checked.

    Reproducer:
    echo "" > /proc/sys/vm/nr_overcommit_hugepages

    [akpm@linux-foundation.org: correctly propagate proc_doulongvec_minmax return code]
    Signed-off-by: Michal Hocko
    Cc: CAI Qian
    Cc: Nishanth Aravamudan
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • As it stands this code will degenerate into a busy-wait if the calling task
    has signal_pending().

    Cc: Rolf Eike Beer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • dma_pool_free() scans for the page to free in the pool list while holding
    the pool lock, then releases the lock basically to acquire it immediately
    again. Modify the code to only take the lock once.

    This will do some additional loops and computations with the lock held
    if memory debugging is activated. If it is not activated, the only new
    operations under this lock are one if and one subtraction.

    Signed-off-by: Rolf Eike Beer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rolf Eike Beer
     
  • The previous approach to calculating the combined index was

    page_idx & ~(1 << order)

    but we get the same result with

    page_idx & buddy_idx

    This reduces instructions slightly and enhances readability.
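
    A quick self-contained check of the equivalence, using
    buddy_idx = page_idx ^ (1 << order):

    #include <assert.h>
    #include <stdio.h>

    int main(void)
    {
        for (unsigned int order = 0; order < 11; order++) {
            for (unsigned long page_idx = 0; page_idx < 4096; page_idx++) {
                unsigned long buddy_idx = page_idx ^ (1UL << order);
                unsigned long old_way = page_idx & ~(1UL << order);
                unsigned long new_way = page_idx & buddy_idx;

                assert(old_way == new_way);
            }
        }
        printf("page_idx & buddy_idx == page_idx & ~(1 << order) holds\n");
        return 0;
    }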

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix used-unintialised warning]
    Signed-off-by: KyongHo Cho
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KyongHo Cho
     
  • Even if CONFIG_COMPAT_BRK is set in the kernel configuration, it can still
    be overridden by the randomize_va_space sysctl.

    If this is the case, the min_brk computation in the sys_brk()
    implementation is wrong, as it solely takes the COMPAT_BRK setting into
    account, assuming that the brk start is not randomized. But that might
    not be the case if the randomize_va_space sysctl has been set to '2' at
    the time the binary was loaded from disk.

    In such case, the check has to be done in a same way as in
    !CONFIG_COMPAT_BRK case.

    In addition to that, the check for the COMPAT_BRK case introduced back in
    a5b4592c ("brk: make sys_brk() honor COMPAT_BRK when computing lower
    bound") is slightly wrong -- the lower bound shouldn't be mm->end_code
    but mm->end_data, as that's where legacy applications expect the brk
    section to start (i.e. immediately after the last global variable).
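
    A userspace sketch of the corrected lower-bound choice described above;
    struct mm_fields and the compat_brk/brk_randomized flags are illustrative
    stand-ins for the kernel's state, not the actual implementation:

    #include <stdio.h>

    struct mm_fields {
        unsigned long start_brk;  /* (possibly randomized) brk start */
        unsigned long end_data;   /* end of the data segment */
    };

    static unsigned long min_brk(const struct mm_fields *mm, int compat_brk,
                                 int brk_randomized)
    {
        if (compat_brk && !brk_randomized)
            return mm->end_data;   /* legacy apps: brk follows the data segment */
        return mm->start_brk;      /* randomized (or !COMPAT_BRK) case */
    }

    int main(void)
    {
        struct mm_fields mm = { .start_brk = 0x7f0000000000UL,
                                .end_data  = 0x601000UL };

        printf("%#lx\n", min_brk(&mm, 1, 1));   /* randomize_va_space == 2 */
        printf("%#lx\n", min_brk(&mm, 1, 0));   /* classic COMPAT_BRK case */
        return 0;
    }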

    [akpm@linux-foundation.org: fix comment]
    Signed-off-by: Jiri Kosina
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Kosina
     
  • The NODEMASK_ALLOC macro may dynamically allocate memory for its second
    argument ('nodes_allowed' in this context).

    In nr_hugepages_store_common() we may abort early if strict_strtoul()
    fails, but in that case we do not free the memory already allocated to
    'nodes_allowed', causing a memory leak.

    This patch closes the leak by freeing the memory in the error path.
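
    A minimal sketch of the error-path fix, with malloc()/free() standing in
    for NODEMASK_ALLOC()/NODEMASK_FREE() and a simplified parse check:

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical stand-in for the failing strict_strtoul() call. */
    static int bad_input(const char *buf)
    {
        char *end;

        strtoul(buf, &end, 10);
        return end == buf;
    }

    static int nr_hugepages_store_common(const char *buf)
    {
        void *nodes_allowed = malloc(sizeof(unsigned long) * 8);

        if (!nodes_allowed)
            return -ENOMEM;

        if (bad_input(buf)) {
            free(nodes_allowed);    /* the leak: this free was missing */
            return -EINVAL;
        }

        /* ... adjust the pool using nodes_allowed, then ... */
        free(nodes_allowed);
        return 0;
    }

    int main(void)
    {
        printf("%d\n", nr_hugepages_store_common("not-a-number"));
        return 0;
    }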

    [akpm@linux-foundation.org: use NODEMASK_FREE, per Minchan Kim]
    Signed-off-by: Jesper Juhl
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Juhl
     
  • …lot during file page migration

    migrate_pages() -> unmap_and_move() only calls rcu_read_lock() for
    anonymous pages, as introduced by git commit
    989f89c57e6361e7d16fbd9572b5da7d313b073d ("fix rcu_read_lock() in page
    migraton"). The point of the RCU protection there is part of getting a
    stable reference to anon_vma and is only held for anon pages as file pages
    are locked which is sufficient protection against freeing.

    However, while a file page's mapping is being migrated, the radix tree is
    double checked to ensure it is the expected page. This uses
    radix_tree_deref_slot() -> rcu_dereference() without the RCU lock held,
    triggering the following warning.

    [ 173.674290] ===================================================
    [ 173.676016] [ INFO: suspicious rcu_dereference_check() usage. ]
    [ 173.676016] ---------------------------------------------------
    [ 173.676016] include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!
    [ 173.676016]
    [ 173.676016] other info that might help us debug this:
    [ 173.676016]
    [ 173.676016]
    [ 173.676016] rcu_scheduler_active = 1, debug_locks = 0
    [ 173.676016] 1 lock held by hugeadm/2899:
    [ 173.676016] #0: (&(&inode->i_data.tree_lock)->rlock){..-.-.}, at: [<c10e3d2b>] migrate_page_move_mapping+0x40/0x1ab
    [ 173.676016]
    [ 173.676016] stack backtrace:
    [ 173.676016] Pid: 2899, comm: hugeadm Not tainted 2.6.37-rc5-autobuild
    [ 173.676016] Call Trace:
    [ 173.676016] [<c128cc01>] ? printk+0x14/0x1b
    [ 173.676016] [<c1063502>] lockdep_rcu_dereference+0x7d/0x86
    [ 173.676016] [<c10e3db5>] migrate_page_move_mapping+0xca/0x1ab
    [ 173.676016] [<c10e41ad>] migrate_page+0x23/0x39
    [ 173.676016] [<c10e491b>] buffer_migrate_page+0x22/0x107
    [ 173.676016] [<c10e48f9>] ? buffer_migrate_page+0x0/0x107
    [ 173.676016] [<c10e425d>] move_to_new_page+0x9a/0x1ae
    [ 173.676016] [<c10e47e6>] migrate_pages+0x1e7/0x2fa

    This patch introduces radix_tree_deref_slot_protected() which calls
    rcu_dereference_protected(). Users of it must pass in the
    mapping->tree_lock that is protecting this dereference. Holding the tree
    lock protects against parallel updaters of the radix tree meaning that
    rcu_dereference_protected is allowable.

    [akpm@linux-foundation.org: remove unneeded casts]
    Signed-off-by: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: Milton Miller
    Cc: Nick Piggin
    Cc: Wu Fengguang
    Cc: [2.6.37.early]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Clean up some code with a common compound_trans_head() helper.

    Signed-off-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Marcelo Tosatti
    Cc: Avi Kivity
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This makes KSM fully operational with THP pages. Subpages are scanned
    while the hugepage is still in place, delivering maximum CPU performance;
    only if there is a match, and we are going to deduplicate memory, is the
    single hugepage containing the matching subpage split.

    There will be no false sharing between ksmd and khugepaged. khugepaged
    won't collapse 2M virtual regions with KSM pages inside. ksmd should also
    only split pages when the checksum matches and we're likely to split a
    hugepage for some long-living KSM page (the usual KSM heuristic to avoid
    sharing pages that get de-COWed).

    Signed-off-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • MADV_HUGEPAGE and MADV_NOHUGEPAGE were fully effective only if run after
    mmap and before touching the memory. While this is enough for most
    usages, it's little effort to make madvise more dynamic at runtime on an
    existing mapping by making khugepaged aware of madvise.

    MADV_HUGEPAGE: register in khugepaged immediately, without waiting for a
    page fault (which may never happen if all pages are already mapped and
    the "enabled" knob was set to madvise during the initial page faults).

    MADV_NOHUGEPAGE: skip vmas marked VM_NOHUGEPAGE in khugepaged to stop
    collapsing pages where it is not needed.
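
    A small usage sketch of the two hints on an already-populated mapping
    (the fallback constants match the kernel ABI values, in case the libc
    headers predate them):

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #ifndef MADV_HUGEPAGE
    #define MADV_HUGEPAGE 14        /* kernel ABI value */
    #endif
    #ifndef MADV_NOHUGEPAGE
    #define MADV_NOHUGEPAGE 15
    #endif

    int main(void)
    {
        size_t len = 16UL << 20;                 /* 16 MiB mapping */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return 1;

        memset(buf, 0, len);                     /* fault the pages in first */

        /* Ask khugepaged to start collapsing this already-mapped range. */
        if (madvise(buf, len, MADV_HUGEPAGE))
            perror("madvise(MADV_HUGEPAGE)");

        /* ... later, opt the range back out of collapsing ... */
        if (madvise(buf, len, MADV_NOHUGEPAGE))
            perror("madvise(MADV_NOHUGEPAGE)");

        munmap(buf, len);
        return 0;
    }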

    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: Andrea Arcangeli
    Cc: Michael Kerrisk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Add madvise MADV_NOHUGEPAGE to mark regions that are not important to be
    hugepage backed. Return -EINVAL if the vma is not of an anonymous type,
    or the feature isn't built into the kernel. Never silently return
    success.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Read compound_trans_order safely. Noop for CONFIG_TRANSPARENT_HUGEPAGE=n.

    Signed-off-by: Andrea Arcangeli
    Cc: Daisuke Nishimura
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • hugetlbfs was changed to allow memory failure to migrate the hugetlbfs
    pages and that broke THP as split_huge_page was then called on hugetlbfs
    pages too.

    compound_head/order was also run unsafely on THP pages, which can be
    split at any time.

    All compound_head() invocations in memory-failure.c that are run on pages
    that aren't pinned and that can be freed and reused from under us (while
    compound_head is running) are buggy, because compound_head can return a
    dangling pointer. I'm not fixing this here: it is a generic memory-failure
    bug, not specific to THP, and it applies to hugetlbfs too, so it can be
    fixed later, after THP is merged upstream.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Add debug checks for invariants that, if broken, could lead to the
    mapcount vs. page_mapcount debug checks triggering later in
    split_huge_page.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Make sure we scale up nr_rotated when we encounter a referenced
    transparent huge page. This ensures pageout scanning balance is not
    distorted when there are huge pages on the LRU.

    Signed-off-by: Rik van Riel
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Count each transparent hugepage as HPAGE_PMD_NR pages in the LRU
    statistics, so the Active(anon) and Inactive(anon) statistics in
    /proc/meminfo are correct.

    Signed-off-by: Rik van Riel
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • On small systems, the extra memory used by the anti-fragmentation memory
    reserve, and the fact that huge pages are so much bigger than base pages,
    can easily outweigh the benefit of fewer TLB misses.

    A less obvious concern arises when running on a NUMA machine with
    asymmetric node sizes, one of which is very small: the reserve could make
    that node unusable.

    In case of the crashdump kernel, OOMs have been observed due to the
    anti-fragmentation memory reserve taking up a large fraction of the
    crashdump image.

    This patch disables transparent hugepages on systems with less than 1GB of
    RAM, but the hugepage subsystem is fully initialized so administrators can
    enable THP through /sys if desired.

    Signed-off-by: Rik van Riel
    Acked-by: Avi Kivity
    Signed-off-by: Andrea Arcangeli
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • It makes no sense not to enable compaction for small order pages as we
    don't want to end up with bad order 2 allocations and good and graceful
    order 9 allocations.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This takes advantage of memory compaction to properly generate pages of
    order > 0 if regular page reclaim fails and priority level becomes more
    severe and we don't reach the proper watermarks.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli