Eric Lee / smarc-fsl-linux-kernel

16 Feb, 2011

1 commit

a7d6e4ecd thp: prevent hugepages during args/env copying into the user stack ... Browse Code »

Transparent hugepages can only be created if rmap is fully
functional. So we must prevent hugepages to be created while
is_vma_temporary_stack() is true.

This also optmizes away some harmless but unnecessary setting of
khugepaged_scan.address and it switches some BUG_ON to VM_BUG_ON.

Signed-off-by: Andrea Arcangeli
Acked-by: Rik van Riel
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2011-02-16 07:21:11 +0800

12 Feb, 2011

5 commits

678ff896a memcg: fix leak of accounting at failure path of hugepage collapsing ... Browse Code »

mem_cgroup_uncharge_page() should be called in all failure cases after
mem_cgroup_charge_newpage() is called in huge_memory.c::collapse_huge_page()

[ 4209.076861] BUG: Bad page state in process khugepaged pfn:1e9800
[ 4209.077601] page:ffffea0006b14000 count:0 mapcount:0 mapping: (null) index:0x2800
[ 4209.078674] page flags: 0x40000000004000(head)
[ 4209.079294] pc:ffff880214a30000 pc->flags:2146246697418756 pc->mem_cgroup:ffffc9000177a000
[ 4209.082177] (/A)
[ 4209.082500] Pid: 31, comm: khugepaged Not tainted 2.6.38-rc3-mm1 #1
[ 4209.083412] Call Trace:
[ 4209.083678] [] ? bad_page+0xe4/0x140
[ 4209.084240] [] ? free_pages_prepare+0xd6/0x120
[ 4209.084837] [] ? rwsem_down_failed_common+0xbd/0x150
[ 4209.085509] [] ? __free_pages_ok+0x32/0xe0
[ 4209.086110] [] ? free_compound_page+0x1b/0x20
[ 4209.086699] [] ? __put_compound_page+0x1c/0x30
[ 4209.087333] [] ? put_compound_page+0x4d/0x200
[ 4209.087935] [] ? put_page+0x45/0x50
[ 4209.097361] [] ? khugepaged+0x9e9/0x1430
[ 4209.098364] [] ? autoremove_wake_function+0x0/0x40
[ 4209.099121] [] ? khugepaged+0x0/0x1430
[ 4209.099780] [] ? kthread+0x96/0xa0
[ 4209.100452] [] ? kernel_thread_helper+0x4/0x10
[ 4209.101214] [] ? kthread+0x0/0xa0
[ 4209.101842] [] ? kernel_thread_helper+0x0/0x10

Signed-off-by: KAMEZAWA Hiroyuki
Acked-by: Daisuke Nishimura
Reviewed-by: Johannes Weiner
Cc: Andrea Arcangeli
Reviewed-by: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2011-02-12 08:12:20 +0800
f0fdc5e8e vmscan: fix zone shrinking exit when scan work is done ... Browse Code »

Commit 3e7d34497067 ("mm: vmscan: reclaim order-0 and use compaction
instead of lumpy reclaim") introduced an indefinite loop in
shrink_zone().

It meant to break out of this loop when no pages had been reclaimed and
not a single page was even scanned. The way it would detect the latter
is by taking a snapshot of sc->nr_scanned at the beginning of the
function and comparing it against the new sc->nr_scanned after the scan
loop. But it would re-iterate without updating that snapshot, looping
forever if sc->nr_scanned changed at least once since shrink_zone() was
invoked.

This is not the sole condition that would exit that loop, but it
requires other processes to change the zone state, as the reclaimer that
is stuck obviously can not anymore.

This is only happening for higher-order allocations, where reclaim is
run back to back with compaction.

Signed-off-by: Johannes Weiner
Reported-by: Michal Hocko
Tested-by: Kent Overstreet
Reported-by: Kent Overstreet
Acked-by: Mel Gorman
Cc: Andrea Arcangeli
Cc: Rik van Riel
Reviewed-by: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2011-02-12 08:12:20 +0800
419d8c96d mlock: do not munlock pages in __do_fault() ... Browse Code »

If the page is going to be written to, __do_page needs to break COW.

However, the old page (before breaking COW) was never mapped mapped into
the current pte (__do_fault is only called when the pte is not present),
so vmscan can't have marked the old page as PageMlocked due to being
mapped in __do_fault's VMA. Therefore, __do_fault() does not need to
worry about clearing PageMlocked() on the old page.

Signed-off-by: Michel Lespinasse
Reviewed-by: KAMEZAWA Hiroyuki
Acked-by: Hugh Dickins
Cc: Rik van Riel
Cc: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2011-02-12 08:12:20 +0800
e15f8c01a mlock: fix race when munlocking pages in do_wp_page() ... Browse Code »

vmscan can lazily find pages that are mapped within VM_LOCKED vmas, and
set the PageMlocked bit on these pages, transfering them onto the
unevictable list. When do_wp_page() breaks COW within a VM_LOCKED vma,
it may need to clear PageMlocked on the old page and set it on the new
page instead.

This change fixes an issue where do_wp_page() was clearing PageMlocked
on the old page while the pte was still pointing to it (as well as
rmap). Therefore, we were not protected against vmscan immediately
transfering the old page back onto the unevictable list. This could
cause pages to get stranded there forever.

I propose to move the corresponding code to the end of do_wp_page(),
after the pte (and rmap) have been pointed to the new page.
Additionally, we can use munlock_vma_page() instead of
clear_page_mlock(), so that the old page stays mlocked if there are
still other VM_LOCKED vmas mapping it.

Signed-off-by: Michel Lespinasse
Reviewed-by: KAMEZAWA Hiroyuki
Acked-by: Hugh Dickins
Cc: Rik van Riel
Cc: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2011-02-12 08:12:20 +0800
e6d2e2b2b memblock: don't adjust size in memblock_find_base() ... Browse Code »

While applying patch to use memblock to find aperture for 64bit x86.
Ingo found system with 1g + force_iommu

> No AGP bridge found
> Node 0: aperture @ 38000000 size 32 MB
> Aperture pointing to e820 RAM. Ignoring.
> Your BIOS doesn't leave a aperture memory hole
> Please enable the IOMMU option in the BIOS setup
> This costs you 64 MB of RAM
> Cannot allocate aperture memory hole (0,65536K)

the corresponding code:

addr = memblock_find_in_range(0, 1ULL<< 0xffffffff) {
printk(KERN_ERR
"Cannot allocate aperture memory hole (%lx,%uK)\n",
addr, aper_size>>10);
return 0;
}
memblock_x86_reserve_range(addr, addr + aper_size, "aperture64")

fails because memblock core code align the size with 512M. That could
make size way too big.

So don't align the size in that case.

actually __memblock_alloc_base, the another caller already align that
before calling that function.

BTW. x86 does not use __memblock_alloc_base...

Signed-off-by: Yinghai Lu
Cc: Ingo Molnar
Cc: David Miller
Cc: "H. Peter Anvin"
Cc: Benjamin Herrenschmidt
Cc: Dave Airlie
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Yinghai Lu
2011-02-12 08:12:20 +0800

03 Feb, 2011

11 commits

3751d6043 memcg: fix event counting breakage from recent THP update ... Browse Code »

Changes in e401f1761 ("memcg: modify accounting function for supporting
THP better") adds nr_pages to support multiple page size in
memory_cgroup_charge_statistics.

But counting the number of event nees abs(nr_pages) for increasing
counters. This patch fixes event counting.

Signed-off-by: KAMEZAWA Hiroyuki
Reviewed-by: Johannes Weiner
Cc: Daisuke Nishimura
Cc: Balbir Singh
Cc: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2011-02-03 08:03:19 +0800
8493ae439 memcg: never OOM when charging huge pages ... Browse Code »

Huge page coverage should obviously have less priority than the continued
execution of a process.

Never kill a process when charging it a huge page fails. Instead, give up
after the first failed reclaim attempt and fall back to regular pages.

Signed-off-by: Johannes Weiner
Acked-by: KAMEZAWA Hiroyuki
Reviewed-by: Minchan Kim
Cc: Balbir Singh
Cc: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2011-02-03 08:03:19 +0800
19942822d memcg: prevent endless loop when charging huge pages to near-limit group ... Browse Code »

If reclaim after a failed charging was unsuccessful, the limits are
checked again, just in case they settled by means of other tasks.

This is all fine as long as every charge is of size PAGE_SIZE, because in
that case, being below the limit means having at least PAGE_SIZE bytes
available.

But with transparent huge pages, we may end up in an endless loop where
charging and reclaim fail, but we keep going because the limits are not
yet exceeded, although not allowing for a huge page.

Fix this up by explicitely checking for enough room, not just whether we
are within limits.

Signed-off-by: Johannes Weiner
Acked-by: KAMEZAWA Hiroyuki
Reviewed-by: Minchan Kim
Cc: Balbir Singh
Cc: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2011-02-03 08:03:19 +0800
9221edb71 memcg: prevent endless loop when charging huge pages ... Browse Code »

The charging code can encounter a charge size that is bigger than a
regular page in two situations: one is a batched charge to fill the
per-cpu stocks, the other is a huge page charge.

This code is distributed over two functions, however, and only the outer
one is aware of huge pages. In case the charging fails, the inner
function will tell the outer function to retry if the charge size is
bigger than regular pages--assuming batched charging is the only case.
And the outer function will retry forever charging a huge page.

This patch makes sure the inner function can distinguish between batch
charging and a single huge page charge. It will only signal another
attempt if batch charging failed, and go into regular reclaim when it is
called on behalf of a huge page.

Signed-off-by: Johannes Weiner
Acked-by: KAMEZAWA Hiroyuki
Reviewed-by: Minchan Kim
Cc: Balbir Singh
Cc: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2011-02-03 08:03:19 +0800
af241a083 thp: fix unsuitable behavior for hwpoisoned tail page ... Browse Code »

When a tail page of THP is poisoned, memory-failure will do nothing except
setting PG_hwpoison, while the expected behavior is that the process, who
is using the poisoned tail page, should be killed.

The above problem is caused by lru check of the poisoned tail page of THP.
Because PG_lru flag is only set on the head page of THP, the check always
consider the poisoned tail page as NON lru page.

So the lru check for the tail page of THP should be avoided, as like as
hugetlb.

This patch adds !PageTransCompound() before lru check for THP, because of
the check (!PageHuge() && !PageTransCompound()) the whole branch could be
optimized away at build time when both hugetlbfs and THP are set with "N"
(or in archs not supporting either of those).

[akpm@linux-foundation.org: fix unrelated typo in shake_page() comment]
Signed-off-by: Jin Dongming
Reviewed-by: Hidetoshi Seto
Cc: Andrea Arcangeli
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jin Dongming
2011-02-03 08:03:19 +0800
a6d30ddda thp: fix the wrong reported address of hwpoisoned hugepages ... Browse Code »

When the tail page of THP is poisoned, the head page will be poisoned too.
And the wrong address, address of head page, will be sent with sigbus
always.

So when the poisoned page is used by Guest OS which is running on KVM,
after the address changing(hva->gpa) by qemu, the unexpected process on
Guest OS will be killed by sigbus.

What we expected is that the process using the poisoned tail page could be
killed on Guest OS, but not that the process using the healthy head page
is killed.

Since it is not good to poison the healthy page, avoid poisoning other
than the page which is really poisoned.
(While we poison all pages in a huge page in case of hugetlb,
we can do this for THP thanks to split_huge_page().)

Here we fix two parts:
1. Isolate the poisoned page only to make sure
the reported address is the address of poisoned page.
2. make the poisoned page work as the poisoned regular page.

[akpm@linux-foundation.org: fix spello in comment]
Signed-off-by: Jin Dongming
Reviewed-by: Hidetoshi Seto
Cc: Andrea Arcangeli
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jin Dongming
2011-02-03 08:03:19 +0800
efeda7a41 thp: fix splitting of hwpoisoned hugepages ... Browse Code »

The poisoned THP is now split with split_huge_page() in
collect_procs_anon(). If kmalloc() is failed in collect_procs(),
split_huge_page() could not be called. And the work after
split_huge_page() for collecting the processes using poisoned page will
not be done, too. So the processes using the poisoned page could not be
killed.

The condition becomes worse when CONFIG_DEBUG_VM == "Y". Because the
poisoned THP could not be split, system panic will be caused by
VM_BUG_ON(PageTransHuge(page)) in try_to_unmap().

This patch does:
1. move split_huge_page() to the place before collect_procs().
This can be sure the failure of splitting THP is caused by itself.
2. when splitting THP is failed, stop the operations after it.
This can avoid unexpected system panic or non sense works.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Jin Dongming
Reviewed-by: Hidetoshi Seto
Cc: Andrea Arcangeli
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jin Dongming
2011-02-03 08:03:19 +0800
48db54ee2 mm/migration: fix page corruption during hugepage migration ... Browse Code »

If migrate_huge_page by memory-failure fails , it calls put_page in itself
to decrease page reference and caller of migrate_huge_page also calls
putback_lru_pages. It can do double free of page so it can make page
corruption on page holder.

In addtion, clean of pages on caller is consistent behavior with
migrate_pages by cf608ac19c ("mm: compaction: fix COMPACTPAGEFAILED
counting").

Signed-off-by: Minchan Kim
Cc: Andrea Arcangeli
Cc: Christoph Lameter
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Minchan Kim
2011-02-03 08:03:18 +0800
57fc4a5ee mm: when migrate_pages returns 0, all pages must have been released ... Browse Code »

In some cases migrate_pages could return zero while still leaving a few
pages in the pagelist (and some caller wouldn't notice it has to call
putback_lru_pages after commit cf608ac19c9 ("mm: compaction: fix
COMPACTPAGEFAILED counting")).

Add one missing putback_lru_pages not added by commit cf608ac19c95 ("mm:
compaction: fix COMPACTPAGEFAILED counting").

Signed-off-by: Andrea Arcangeli
Signed-off-by: Minchan Kim
Reviewed-by: Minchan Kim
Cc: Christoph Lameter
Acked-by: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2011-02-03 08:03:18 +0800
552b372ba memsw: deprecate noswapaccount kernel parameter and schedule it for removal ... Browse Code »

noswapaccount couldn't be used to control memsw for both on/off cases so
we have added swapaccount[=0|1] parameter. This way we can turn the
feature in two ways noswapaccount resp. swapaccount=0. We have kept the
original noswapaccount but I think we should remove it after some time as
it just makes more command line parameters without any advantages and also
the code to handle parameters is uglier if we want both parameters.

Signed-off-by: Michal Hocko
Requested-by: KAMEZAWA Hiroyuki
Acked-by: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2011-02-03 08:03:18 +0800
fceda1bf4 memsw: handle swapaccount kernel parameter correctly ... Browse Code »

__setup based kernel command line parameters handlers which are handled in
obsolete_checksetup are provided with the parameter value including =
(more precisely everything right after the parameter name).

This means that the current implementation of swapaccount[=1|0] doesn't
work at all because if there is a value for the parameter then we are
testing for "0" resp. "1" but we are getting "=0" resp. "=1" and if
there is no parameter value we are getting an empty string rather than
NULL.

The original noswapccount parameter, which doesn't care about the value,
works correctly.

Signed-off-by: Michal Hocko
Acked-by: KAMEZAWA Hiroyuki
Cc: Daisuke Nishimura
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2011-02-03 08:03:18 +0800

02 Feb, 2011

1 commit

fdf4c587a mlock: operate on any regions with protection != PROT_NONE ... Browse Code »

As Tao Ma noticed, change 5ecfda0 breaks blktrace. This is because
blktrace mmaps a file with PROT_WRITE permissions but without PROT_READ,
so my attempt to not unnecessarity break COW during mlock ended up
causing mlock to fail with a permission problem.

I am proposing to let mlock ignore vma protection in all cases except
PROT_NONE. In particular, mlock should not fail for PROT_WRITE regions
(as in the blktrace case, which broke at 5ecfda0) or for PROT_EXEC
regions (which seem to me like they were always broken).

Signed-off-by: Michel Lespinasse
Acked-by: Rik van Riel
Signed-off-by: Linus Torvalds

Michel Lespinasse
2011-02-02 07:20:50 +0800

31 Jan, 2011

1 commit

4fda11685 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-2.6-cm ... Browse Code »

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-2.6-cm:
kmemleak: Allow kmemleak metadata allocations to fail
kmemleak: remove memset by using kzalloc

Linus Torvalds
2011-01-31 10:55:38 +0800

28 Jan, 2011

2 commits

6ae4bd1f0 kmemleak: Allow kmemleak metadata allocations to fail ... Browse Code »

This patch adds __GFP_NORETRY and __GFP_NOMEMALLOC flags to the kmemleak
metadata allocations so that it has a smaller effect on the users of the
kernel slab allocator. Since kmemleak allocations can now fail more
often, this patch also reduces the verbosity by passing __GFP_NOWARN and
not dumping the stack trace when a kmemleak allocation fails.

Signed-off-by: Catalin Marinas
Reported-by: Toralf Förster
Acked-by: Pekka Enberg
Acked-by: David Rientjes
Cc: Ted Ts'o

Catalin Marinas
2011-01-28 02:32:06 +0800
0a08739e8 kmemleak: remove memset by using kzalloc ... Browse Code »

We don't need to memset if we just use kzalloc() rather than kmalloc() in
kmemleak_test_init().

Signed-off-by: Jesper Juhl
Reviewed-by: Minchan Kim
Signed-off-by: Catalin Marinas

Jesper Juhl
2011-01-28 02:31:51 +0800

26 Jan, 2011

9 commits

52dbb9050 memcg: fix race at move_parent around compound_order() ... Browse Code »

A fix up mem_cgroup_move_parent() which use compound_order() in
asynchronous manner. This compound_order() may return unknown value
because we don't take lock. Use PageTransHuge() and HPAGE_SIZE instead
of it.

Also clean up for mem_cgroup_move_parent().
- remove unnecessary initialization of local variable.
- rename charge_size -> page_size
- remove unnecessary (wrong) comment.
- added a comment about THP.

Note:
Current design take compound_page_lock() in caller of move_account().
This should be revisited when we implement direct move_task of hugepage
without splitting.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KAMEZAWA Hiroyuki
Reviewed-by: Johannes Weiner
Acked-by: Daisuke Nishimura
Cc: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2011-01-26 08:50:04 +0800
3d37c4a91 memcg: bugfix check mem_cgroup_disabled() at split fixup ... Browse Code »

mem_cgroup_disabled() should be checked at splitting. If disabled, no
heavy work is necesary.

Signed-off-by: KAMEZAWA Hiroyuki
Acked-by: Daisuke Nishimura
Reviewed-by: Johannes Weiner
Cc: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2011-01-26 08:50:03 +0800
01c88e2d6 memcg: fix account leak at failure of memsw acconting ... Browse Code »

Commit 4b53433468 ("memcg: clean up try_charge main loop") removes a
cancel of charge at case: memory charge-> success. mem+swap charge->
failure.

This leaks usage of memory. Fix it.

Signed-off-by: KAMEZAWA Hiroyuki
Reviewed-by: Johannes Weiner
Acked-by: Daisuke Nishimura
Cc: Balbir Singh
Cc: [2.6.36+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2011-01-26 08:50:03 +0800
28bd65781 mm: migration: clarify migrate_pages() comment ... Browse Code »

Callers of migrate_pages should putback_lru_pages to return pages
isolated to LRU or free list. Now comment is rather confusing. It says
caller always have to call it.

It is more clear to point out that the caller has to call it if
migrate_pages's return value isn't zero.

Signed-off-by: Minchan Kim
Cc: Christoph Lameter
Cc: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Minchan Kim
2011-01-26 08:50:02 +0800
33a938774 mm: compaction: don't depend on HUGETLB_PAGE ... Browse Code »

Commit 5d6892407 ("thp: select CONFIG_COMPACTION if TRANSPARENT_HUGEPAGE
enabled") causes this warning during the configuration process:

warning: (TRANSPARENT_HUGEPAGE) selects COMPACTION which has unmet
direct dependencies (EXPERIMENTAL && HUGETLB_PAGE && MMU)

COMPACTION doesn't depend on HUGETLB_PAGE, it doesn't depend on THP
either, it is also useful for regular alloc_pages(order > 0) including
the very kernel stack during fork (THREAD_ORDER = 1). It's always
better to enable COMPACTION.

The warning should be an error because we would end up with MIGRATION
not selected, and COMPACTION wouldn't work without migration (despite it
seems to build with an inline migrate_pages returning -ENOSYS).

I'd also like to remove EXPERIMENTAL: compaction has been in the kernel
for some releases (for full safety the default remains disabled which I
think is enough).

Signed-off-by: Andrea Arcangeli
Reported-by: Luca Tettamanti
Tested-by: Luca Tettamanti
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2011-01-26 08:50:02 +0800
8dba474f0 mm/memcontrol.c: fix uninitialized variable use in mem_cgroup_move_parent() ... Browse Code »

In mm/memcontrol.c::mem_cgroup_move_parent() there's a path that jumps
to the 'put_back' label

ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, charge);
if (ret || !parent)
goto put_back;

where we'll

if (charge > PAGE_SIZE)
compound_unlock_irqrestore(page, flags);

but, we have not assigned anything to 'flags' at this point, nor have we
called 'compound_lock_irqsave()' (which is what sets 'flags'). The
'put_back' label should be moved below the call to
compound_unlock_irqrestore() as per this patch.

Signed-off-by: Jesper Juhl
Cc: Balbir Singh
Cc: Daisuke Nishimura
Cc: KAMEZAWA Hiroyuki
Cc: Pavel Emelianov
Cc: "Kirill A. Shutemov"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jesper Juhl
2011-01-26 08:50:01 +0800
2ff754fa8 mm: clear pages_scanned only if draining a pcp adds pages to the buddy allocator ... Browse Code »
43

Commit 0e093d99763e ("writeback: do not sleep on the congestion queue if
there are no congested BDIs or if significant congestion is not being
encountered in the current zone") uncovered a livelock in the page
allocator that resulted in tasks infinitely looping trying to find
memory and kswapd running at 100% cpu.

The issue occurs because drain_all_pages() is called immediately
following direct reclaim when no memory is freed and try_to_free_pages()
returns non-zero because all zones in the zonelist do not have their
all_unreclaimable flag set.

When draining the per-cpu pagesets back to the buddy allocator for each
zone, the zone->pages_scanned counter is cleared to avoid erroneously
setting zone->all_unreclaimable later. The problem is that no pages may
actually be drained and, thus, the unreclaimable logic never fails
direct reclaim so the oom killer may be invoked.

This apparently only manifested after wait_iff_congested() was
introduced and the zone was full of anonymous memory that would not
congest the backing store. The page allocator would infinitely loop if
there were no other tasks waiting to be scheduled and clear
zone->pages_scanned because of drain_all_pages() as the result of this
change before kswapd could scan enough pages to trigger the reclaim
logic. Additionally, with every loop of the page allocator and in the
reclaim path, kswapd would be kicked and would end up running at 100%
cpu. In this scenario, current and kswapd are all running continuously
with kswapd incrementing zone->pages_scanned and current clearing it.

The problem is even more pronounced when current swaps some of its
memory to swap cache and the reclaimable logic then considers all active
anonymous memory in the all_unreclaimable logic, which requires a much
higher zone->pages_scanned value for try_to_free_pages() to return zero
that is never attainable in this scenario.

Before wait_iff_congested(), the page allocator would incur an
unconditional timeout and allow kswapd to elevate zone->pages_scanned to
a level that the oom killer would be called the next time it loops.

The fix is to only attempt to drain pcp pages if there is actually a
quantity to be drained. The unconditional clearing of
zone->pages_scanned in free_pcppages_bulk() need not be changed since
other callers already ensure that draining will occur. This patch
ensures that free_pcppages_bulk() will actually free memory before
calling into it from drain_all_pages() so zone->pages_scanned is only
cleared if appropriate.

Signed-off-by: David Rientjes
Cc: Mel Gorman
Reviewed-by: Johannes Weiner
Cc: Minchan Kim
Cc: Wu Fengguang
Cc: KAMEZAWA Hiroyuki
Cc: KOSAKI Motohiro
Reviewed-by: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Rientjes
2011-01-26 08:50:01 +0800
f33261d75 mm: fix deferred congestion timeout if preferred zone is not allowed ... Browse Code »

Before 0e093d99763e ("writeback: do not sleep on the congestion queue if
there are no congested BDIs or if significant congestion is not being
encountered in the current zone"), preferred_zone was only used for NUMA
statistics, to determine the zoneidx from which to allocate from given
the type requested, and whether to utilize memory compaction.

wait_iff_congested(), though, uses preferred_zone to determine if the
congestion wait should be deferred because its dirty pages are backed by
a congested bdi. This incorrectly defers the timeout and busy loops in
the page allocator with various cond_resched() calls if preferred_zone
is not allowed in the current context, usually consuming 100% of a cpu.

This patch ensures preferred_zone is an allowed zone in the fastpath
depending on whether current is constrained by its cpuset or nodes in
its mempolicy (when the nodemask passed is non-NULL). This is correct
since the fastpath allocation always passes ALLOC_CPUSET when trying to
allocate memory. In the slowpath, this patch resets preferred_zone to
the first zone of the allowed type when the allocation is not
constrained by current's cpuset, i.e. it does not pass ALLOC_CPUSET.

This patch also ensures preferred_zone is from the set of allowed nodes
when called from within direct reclaim since allocations are always
constrained by cpusets in this context (it is blockable).

Both of these uses of cpuset_current_mems_allowed are protected by
get_mems_allowed().

Signed-off-by: David Rientjes
Cc: Mel Gorman
Cc: Johannes Weiner
Cc: Minchan Kim
Cc: Wu Fengguang
Cc: KAMEZAWA Hiroyuki
Cc: KOSAKI Motohiro
Acked-by: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Rientjes
2011-01-26 08:50:00 +0800
f95ba941d mm/pgtable-generic.c: fix CONFIG_SWAP=n build ... Browse Code »

mips (and sparc32):

In file included from arch/mips/include/asm/tlb.h:21,
from mm/pgtable-generic.c:9:
include/asm-generic/tlb.h: In function `tlb_flush_mmu':
include/asm-generic/tlb.h:76: error: implicit declaration of function `release_pages'
include/asm-generic/tlb.h: In function `tlb_remove_page':
include/asm-generic/tlb.h:105: error: implicit declaration of function `page_cache_release'

free_pages_and_swap_cache() and free_page_and_swap_cache() are macros
which call release_pages() and page_cache_release(). The obvious fix is
to include pagemap.h in swap.h, where those macros are defined. But that
breaks sparc for weird reasons.

So fix it within mm/pgtable-generic.c instead.

Reported-by: Yoichi Yuasa
Cc: Geert Uytterhoeven
Acked-by: Sam Ravnborg
Cc: Sergei Shtylyov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrew Morton
2011-01-26 08:49:58 +0800

21 Jan, 2011

10 commits

713735b42 memcg: correctly order reading PCG_USED and pc->mem_cgroup ... Browse Code »

The placement of the read-side barrier is confused: the writer first
sets pc->mem_cgroup, then PCG_USED. The read-side barrier has to be
between testing PCG_USED and reading pc->mem_cgroup.

Signed-off-by: Johannes Weiner
Acked-by: KAMEZAWA Hiroyuki
Acked-by: Daisuke Nishimura
Cc: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2011-01-21 09:02:06 +0800
382e27daa mm: fix truncate_setsize() comment ... Browse Code »

Contrary to what the comment says, truncate_setsize() should be called
*before* filesystem truncated blocks.

Signed-off-by: Jan Kara
Cc: Christoph Hellwig
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jan Kara
2011-01-21 09:02:06 +0800
987eba66e memcg: fix rmdir, force_empty with THP ... Browse Code »

Now, when THP is enabled, memcg's rmdir() function is broken because
move_account() for THP page is not supported.

This will cause account leak or -EBUSY issue at rmdir().
This patch fixes the issue by supporting move_account() THP pages.

Signed-off-by: KAMEZAWA Hiroyuki
Cc: Daisuke Nishimura
Cc: Balbir Singh
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2011-01-21 09:02:06 +0800
ece35ca81 memcg: fix LRU accounting with THP ... Browse Code »

memory cgroup's LRU stat should take care of size of pages because
Transparent Hugepage inserts hugepage into LRU. If this value is the
number wrong, memory reclaim will not work well.

Note: only head page of THP's huge page is linked into LRU.

Signed-off-by: KAMEZAWA Hiroyuki
Cc: Daisuke Nishimura
Cc: Balbir Singh
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2011-01-21 09:02:06 +0800
ca3e02141 memcg: fix USED bit handling at uncharge in THP ... Browse Code »

Now, under THP:

at charge:
- PageCgroupUsed bit is set to all page_cgroup on a hugepage.
....set to 512 pages.
at uncharge
- PageCgroupUsed bit is unset on the head page.

So, some pages will remain with "Used" bit.

This patch fixes that Used bit is set only to the head page.
Used bits for tail pages will be set at splitting if necessary.

This patch adds this lock order:
compound_lock() -> page_cgroup_move_lock().

[akpm@linux-foundation.org: fix warning]
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Daisuke Nishimura
Cc: Balbir Singh
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2011-01-21 09:02:06 +0800
e401f1761 memcg: modify accounting function for supporting THP better ... Browse Code »

mem_cgroup_charge_statisics() was designed for charging a page but now, we
have transparent hugepage. To fix problems (in following patch) it's
required to change the function to get the number of pages as its
arguments.

The new function gets following as argument.
- type of page rather than 'pc'
- size of page which is accounted.

Signed-off-by: KAMEZAWA Hiroyuki
Cc: Daisuke Nishimura
Cc: Balbir Singh
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2011-01-21 09:02:05 +0800
82478fb7b mm: compaction: prevent division-by-zero during user-requested compaction ... Browse Code »

Up until 3e7d344 ("mm: vmscan: reclaim order-0 and use compaction instead
of lumpy reclaim"), compaction skipped calculating the fragmentation index
of a zone when compaction was explicitely requested through the procfs
knob.

However, when compaction_suitable was introduced, it did not come with an
extra check for order == -1, set on explicit compaction requests, and
passed this order on to the fragmentation index calculation, where it
overshifts the number of requested pages, leading to a division by zero.

This patch makes sure that order == -1 is recognized as the flag it is
rather than passing it along as valid order parameter.

[akpm@linux-foundation.org: add comment, per Mel]
Signed-off-by: Johannes Weiner
Reviewed-by: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2011-01-21 09:02:05 +0800
3305de51b mm/vmscan.c: remove duplicate include of compaction.h ... Browse Code »

Signed-off-by: Jesper Juhl
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jesper Juhl
2011-01-21 09:02:05 +0800
abb65272a memblock: fix memblock_is_region_memory() ... Browse Code »

memblock_is_region_memory() uses reserved memblocks to search for the
given region, while it should use the memory memblocks.

I encountered the problem with OMAP's framebuffer ram allocation.
Normally the ram is allocated dynamically, and this function is not
called. However, if we want to pass the framebuffer from the bootloader
to the kernel (to retain the boot image), this function is used to check
the validity of the kernel parameters for the framebuffer ram area.

Signed-off-by: Tomi Valkeinen
Acked-by: Yinghai Lu
Cc: Benjamin Herrenschmidt
Cc: "H. Peter Anvin"
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Tomi Valkeinen
2011-01-21 09:02:05 +0800
453c71926 thp: keep highpte mapped until it is no longer needed ... Browse Code »

Two users reported THP-related crashes on 32-bit x86 machines. Their oops
reports indicated an invalid pte, and subsequent code inspection showed
that the highpte is actually used after unmap.

The fix is to unmap the pte only after all operations against it are
finished.

Signed-off-by: Johannes Weiner
Reported-by: Ilya Dryomov
Reported-by: werner
Cc: Andrea Arcangeli
Tested-by: Ilya Dryomov
Tested-by: Steven Rostedt
Signed-off-by: Linus Torvalds

Johannes Weiner
2011-01-21 09:02:05 +0800