08 Jan, 2013

1 commit

  • When running the following command under a shell, it returns an error:
    sh/$ echo 1 > /proc/sys/vm/compact_memory
    sh/$ sh: write error: Bad address

    After strace, I found the following log:
    ...
    write(1, "1\n", 2) = 3
    write(1, "", 4294967295) = -1 EFAULT (Bad address)
    write(2, "echo: write error: Bad address\n", 31echo: write error: Bad address
    ) = 31

    This shows that the system returned 3 (COMPACT_COMPLETE) after writing data to compact_memory.

    The fix is to have sysctl_compaction_handler return 0, instead of 3
    (COMPACT_COMPLETE), after compaction_nodes has finished.
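
    A minimal sketch of the shape of such a fix, assuming the 3.x-era handler
    signature in mm/compaction.c (the node-compaction loop is the routine
    called compaction_nodes above); this is not the literal upstream diff:

        int sysctl_compaction_handler(struct ctl_table *table, int write,
                        void __user *buffer, size_t *length, loff_t *ppos)
        {
                if (write)
                        compact_nodes();        /* compact every online node */

                /* Do not propagate COMPACT_COMPLETE (3) back to the write()
                 * path; report success instead. */
                return 0;
        }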

    Signed-off-by: Jason Liu
    Suggested-by: David Rientjes
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc:
    Signed-off-by: Andrew Morton

    Jason Liu
     

26 Nov, 2012

1 commit

  • Commit 88f5acf88ae6 ("mm: page allocator: adjust the per-cpu counter
    threshold when memory is low") changed how free_pages is calculated,
    but forgot that we used to do free_pages - ((1 << order) - 1), so we
    ended up off by two when calculating free_pages.
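
    In outline, the fix restores the subtraction early in
    __zone_watermark_ok() (a sketch of the relevant line, not the full
    function):

        /* a high-order request needs (1 << order) contiguous pages; check
         * the watermark as if all but one of them were already gone */
        free_pages -= (1 << order) - 1;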

    Reported-by: Wang Sheng-Hui
    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

25 Jul, 2012

3 commits

  • mm core part

    - After the USB driver primes a bulk transfer (IN or OUT, take OUT
    for example) on ep1, only one dTD is primed. A USB interrupt
    (bit 0 of USBSTS) is issued, and the endptcomplete register reads
    0x2, which means an OUT transfer on ep1 has completed. At this point
    the ep1 OUT queue head status is 0x1e18000 and the next dTD pointer
    is 0x1, which means the transfer is done and everything is OK, while
    the dTD token status is 0x2008080, which means this dTD is still
    active and not yet completed.
    - Audio SDMA and Ethernet have a similar issue
    - The root cause has not been found yet
    - Workaround (see the sketch below):
    change the non-cacheable bufferable memory to non-cacheable
    non-bufferable memory to make this issue disappear.
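
    On ARM, "non-cacheable bufferable" versus "non-cacheable non-bufferable"
    typically corresponds to the memory type selected by
    pgprot_writecombine() versus pgprot_noncached(). A hypothetical
    illustration of such a workaround in a driver mmap path (not the actual
    i.MX patch):

        /* Before: Normal memory, non-cacheable, bufferable */
        vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);

        /* Workaround: Strongly-ordered, non-cacheable, non-bufferable */
        vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);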

    Signed-off-by: Tony LIU

    Tony LIU
     
  • commit 9ab4233dd08036fe34a89c7dc6f47a8bf2eb29eb upstream.

    Otherwise the code races with munmap (causing a use-after-free
    of the vma) or with close (causing a use-after-free of the struct
    file).

    The bug was introduced by commit 90ed52ebe481 ("[PATCH] holepunch: fix
    mmap_sem i_mutex deadlock")

    [bwh: Backported to 3.2:
    - Adjust context
    - madvise_remove() calls vmtruncate_range(), not do_fallocate()]
    [luto: Backported to 3.0: Adjust context]

    Cc: Hugh Dickins
    Cc: Miklos Szeredi
    Cc: Badari Pulavarty
    Cc: Nick Piggin
    Signed-off-by: Ben Hutchings
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Greg Kroah-Hartman

    Andy Lutomirski
     
  • Add a function to check the end address of memory, including reserved
    memory. This API provides the top address of physical memory, which can
    be used to check whether a physical address is valid in drivers such as
    the VPU driver.
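
    A hypothetical sketch of such a helper (the function name and the use of
    memblock are assumptions, not the vendor API):

        /* Highest physical address covered by RAM, including reserved regions */
        phys_addr_t imx_get_phys_mem_end(void)
        {
                return memblock_end_of_DRAM();
        }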

    Signed-off-by: Zhang Jiejing

    Zhang Jiejing
     

18 Jun, 2012

3 commits

  • commit c50ac050811d6485616a193eb0f37bfbd191cc89 and
    4523e1458566a0e8ecfaff90f380dd23acc44d27 upstream.

    When called for anonymous (non-shared) mappings, hugetlb_reserve_pages()
    does a resv_map_alloc(). It depends on code in hugetlbfs's
    vm_ops->close() to release that allocation.

    However, in the mmap() failure path, we do a plain unmap_region() without
    the remove_vma() which actually calls vm_ops->close().

    This is a decent fix. This leak could get reintroduced if new code (say,
    after hugetlb_reserve_pages() in hugetlbfs_file_mmap()) decides to return
    an error. But, I think it would have to unroll the reservation anyway.

    Christoph's test case:

    http://marc.info/?l=linux-mm&m=133728900729735

    This patch applies to 3.4 and later. A version for earlier kernels is at
    https://lkml.org/lkml/2012/5/22/418.
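
    A sketch of the shape of the fix described above (the error-path context
    in hugetlb_reserve_pages() is abbreviated and partly assumed):

        static void resv_map_put(struct vm_area_struct *vma)
        {
                struct resv_map *reservations = vma_resv_map(vma);

                if (!reservations)
                        return;
                kref_put(&reservations->refs, resv_map_release);
        }

        /* ... in hugetlb_reserve_pages()'s error path ... */
        out_err:
                if (vma && !(vma->vm_flags & VM_MAYSHARE))
                        resv_map_put(vma);      /* undo resv_map_alloc() */
                return ret;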

    Signed-off-by: Dave Hansen
    Acked-by: Mel Gorman
    Acked-by: KOSAKI Motohiro
    Reported-by: Christoph Lameter
    Tested-by: Christoph Lameter
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dave Hansen
     
  • commit dbda591d920b4c7692725b13e3f68ecb251e9080 upstream.

    The transfer of ->flags causes some of the static mapping virtual
    addresses to be prematurely freed (before the mapping is removed) because
    VM_LAZY_FREE gets "set" if tmp->flags has VM_IOREMAP set. This might
    cause subsequent vmalloc/ioremap calls to fail because it might allocate
    one of the freed virtual address ranges that aren't unmapped.

    va->flags has different types of flags from tmp->flags. If a region with
    VM_IOREMAP set is registered with vm_area_add_early(), it will be removed
    by __purge_vmap_area_lazy().

    Fix vmalloc_init() to correctly initialize vmap_area for the given
    vm_struct.

    Also initialise va->vm. If it is not set, find_vm_area() for the early
    vm regions will always fail.
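
    A rough sketch of the corrected loop in vmalloc_init() (surrounding
    context assumed):

        for (tmp = vmlist; tmp; tmp = tmp->next) {
                va = kzalloc(sizeof(struct vmap_area), GFP_NOWAIT);
                va->flags = VM_VM_AREA;   /* not tmp->flags: a VM_IOREMAP bit
                                             would read as VM_LAZY_FREE here */
                va->va_start = (unsigned long)tmp->addr;
                va->va_end = va->va_start + tmp->size;
                va->vm = tmp;             /* so find_vm_area() works for early regions */
                __insert_vmap_area(va);
        }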

    Signed-off-by: KyongHo Cho
    Cc: "Olav Haugan"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    KyongHo
     
  • commit db1aecafef58b5dda39c4228debe2c845e4a27ab upstream.

    vmap_area->private is a void *, but the field is never used for anything
    other than a vm_struct pointer. So change it to a struct vm_struct *
    named vm, to improve readability and type checking.
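
    In outline, the change looks like this (struct trimmed to the relevant
    member; other fields omitted):

        struct vmap_area {
                ...
                struct vm_struct *vm;   /* was: void *private */
                ...
        };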

    Signed-off-by: Minchan Kim
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Minchan Kim
     

09 Jun, 2012

1 commit

  • commit e48982734ea0500d1eba4f9d96195acc5406cad6 upstream.

    Commit 645747462435 ("vmscan: detect mapped file pages used only once")
    made mapped pages have another round in inactive list because they might
    be just short lived and so we could consider them again next time. This
    heuristic helps to reduce pressure on the active list under streaming
    IO workloads.

    This patch fixes a regression introduced by this commit for heavy shmem
    based workloads because unlike Anon pages, which are excluded from this
    heuristic because they are usually long lived, shmem pages are handled
    as a regular page cache.

    This doesn't work quite well, unfortunately, if the workload is mostly
    backed by shmem (in memory database sitting on 80% of memory) with a
    streaming IO in the background (backup - up to 20% of memory). Anon
    inactive list is full of (dirty) shmem pages when watermarks are hit.
    Shmem pages are kept in the inactive list (they are referenced) in the
    first round and it is hard to reclaim anything else so we reach lower
    scanning priorities very quickly which leads to an excessive swap out.

    Let's fix this by excluding all swap backed pages (they tend to be long
    lived wrt. the regular page cache anyway) from used-once heuristic and
    rather activate them if they are referenced.
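
    The shape of the fix in page_check_references() (sketch; the rest of the
    function is omitted):

        if (referenced_ptes) {
                /* Swap-backed pages (anon, shmem, tmpfs) tend to be long
                 * lived: skip the used-once heuristic and activate them. */
                if (PageSwapBacked(page))
                        return PAGEREF_ACTIVATE;
                ...
        }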

    The customer's workload is shmem backed database (80% of RAM) and they
    are measuring transactions/s with an IO in the background (20%).
    Transactions touch more or less random rows in the table. The
    transaction rate fell by a factor of 3 (in the worst case) because of
    commit 64574746. This patch restores the previous numbers.

    Signed-off-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

01 Jun, 2012

1 commit

  • commit 05f144a0d5c2207a0349348127f996e104ad7404 upstream.

    Dave Jones' system call fuzz testing tool "trinity" triggered the
    following bug error with slab debugging enabled

    =============================================================================
    BUG numa_policy (Not tainted): Poison overwritten
    -----------------------------------------------------------------------------

    INFO: 0xffff880146498250-0xffff880146498250. First byte 0x6a instead of 0x6b
    INFO: Allocated in mpol_new+0xa3/0x140 age=46310 cpu=6 pid=32154
    __slab_alloc+0x3d3/0x445
    kmem_cache_alloc+0x29d/0x2b0
    mpol_new+0xa3/0x140
    sys_mbind+0x142/0x620
    system_call_fastpath+0x16/0x1b
    INFO: Freed in __mpol_put+0x27/0x30 age=46268 cpu=6 pid=32154
    __slab_free+0x2e/0x1de
    kmem_cache_free+0x25a/0x260
    __mpol_put+0x27/0x30
    remove_vma+0x68/0x90
    exit_mmap+0x118/0x140
    mmput+0x73/0x110
    exit_mm+0x108/0x130
    do_exit+0x162/0xb90
    do_group_exit+0x4f/0xc0
    sys_exit_group+0x17/0x20
    system_call_fastpath+0x16/0x1b
    INFO: Slab 0xffffea0005192600 objects=27 used=27 fp=0x (null) flags=0x20000000004080
    INFO: Object 0xffff880146498250 @offset=592 fp=0xffff88014649b9d0

    This implied a reference counting bug and the problem happened during
    mbind().

    mbind() applies a new memory policy to a range and uses mbind_range() to
    merge existing VMAs or split them as necessary. In the event of splits,
    mpol_dup() will allocate a new struct mempolicy and maintain existing
    reference counts whose rules are documented in
    Documentation/vm/numa_memory_policy.txt .

    The problem occurs with shared memory policies. The vm_op->set_policy
    increments the reference count if necessary and split_vma() and
    vma_merge() have already handled the existing reference counts.
    However, policy_vma() screws it up by replacing an existing
    vma->vm_policy with one that potentially has the wrong reference count
    leading to a premature free. This patch removes the damage caused by
    policy_vma().

    With this patch applied Dave's trinity tool runs an mbind test for 5
    minutes without error. /proc/slabinfo reported that there are no
    numa_policy or shared_policy_node objects allocated after the test
    completed and the shared memory region was deleted.

    Signed-off-by: Mel Gorman
    Cc: Dave Jones
    Cc: KOSAKI Motohiro
    Cc: Stephen Wilson
    Cc: Christoph Lameter
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     

22 May, 2012

4 commits

  • commit 8c7577637ca31385e92769a77e2ab5b428e8b99c upstream.

    When the last event is unregistered, there is no need to keep the spare
    array anymore. So free it to avoid memory leak.

    Signed-off-by: Sha Zhengju
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Sha Zhengju
     
  • commit 6bc2e853c6b46a6041980d58200ad9b0a73a60ff upstream.

    Systems with 8 TBytes of memory or greater can hit a problem where only
    the first 8 TB of memory shows up. This is due to "int i" being
    smaller than "unsigned long start_aligned", causing the high bits to be
    dropped.

    The fix is to change `i' to unsigned long to match start_aligned
    and end_aligned.
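
    Why a 32-bit "int" truncates here (illustrative arithmetic, not the
    patch text): with 4 KiB pages the PFN at the 8 TiB boundary is
    2^43 / 2^12 = 2^31, which no longer fits in a signed 32-bit int
    (maximum 2^31 - 1), so assigning the unsigned long PFN to the int loop
    counter drops the high bits. The declaration change is simply:

        unsigned long i;        /* was: int i -- truncated PFNs >= 2^31 */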

    Thanks to Jack Steiner for assistance tracking this down.

    Signed-off-by: Russ Anderson
    Cc: Jack Steiner
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Cc: David S. Miller
    Cc: Yinghai Lu
    Cc: Gavin Shan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Russ Anderson
     
  • commit 4998a6c0edce7fae9c0a5463f6ec3fa585258ee7 upstream.

    Commit 66aebce747eaf ("hugetlb: fix race condition in hugetlb_fault()")
    added code to avoid a race condition by elevating the page refcount in
    hugetlb_fault() while calling hugetlb_cow().

    However, one code path in hugetlb_cow() includes an assertion that the
    page count is 1, whereas it may now also have the value 2 in this path.

    The consensus is that this BUG_ON has served its purpose, so rather than
    extending it to cover both cases, we just remove it.
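
    The removed assertion, in sketch form (the exact surrounding context in
    hugetlb_cow()'s retry path is assumed):

        BUG_ON(page_count(old_page) != 1);   /* count may legitimately be 2 now */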

    Signed-off-by: Chris Metcalf
    Acked-by: Mel Gorman
    Acked-by: Hillf Danton
    Acked-by: Hugh Dickins
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Chris Metcalf
     
  • commit 42b64281453249dac52861f9b97d18552a7ec62b upstream.

    pcpu_embed_first_chunk() allocates memory for each node, copies percpu
    data and frees unused portions of it before proceeding to the next
    group. This assumes that allocations for different nodes don't
    overlap; however, depending on memory topology, the bootmem allocator
    may end up allocating memory from a different node than the requested
    one which may overlap with the portion freed from one of the previous
    percpu areas. This leads to percpu groups for different nodes
    overlapping which is a serious bug.

    This patch separates out copy & partial free from the allocation loop
    such that all allocations are complete before partial frees happen.
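
    Conceptually, the restructuring looks like this (pseudo-C; the helper
    names are invented for illustration and are not the real percpu
    internals):

        /* before: allocate, copy and partially free one group at a time */
        for_each_group(g) {
                areas[g] = bootmem_alloc_near_node(g);
                copy_percpu_data(g);
                free_unused_tail(areas[g]);   /* may free memory that a later
                                                 allocation then reuses */
        }

        /* after: finish every allocation first, then copy and trim */
        for_each_group(g)
                areas[g] = bootmem_alloc_near_node(g);
        for_each_group(g) {
                copy_percpu_data(g);
                free_unused_tail(areas[g]);
        }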

    This also fixes overlapping frees which could happen on allocation
    failure path - out_free_areas path frees whole groups but the groups
    could have portions freed at that point.

    Signed-off-by: Tejun Heo
    Reported-by: "Pavel V. Panteleev"
    Tested-by: "Pavel V. Panteleev"
    LKML-Reference:
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

28 Apr, 2012

1 commit

  • commit aca50bd3b4c4bb5528a1878158ba7abce41de534 upstream.

    Mel reports a BUG_ON(slot == NULL) in radix_tree_tag_set() on s390
    3.0.13: called from __set_page_dirty_nobuffers() when page_remove_rmap()
    tries to transfer dirty flag from s390 storage key to struct page and
    radix_tree.

    That would be because of reclaim's shrink_page_list() calling
    add_to_swap() on this page at the same time: first PageSwapCache is set
    (causing page_mapping(page) to appear as &swapper_space), then
    page->private set, then tree_lock taken, then page inserted into
    radix_tree - so there's an interval before taking the lock when the
    radix_tree slot is empty.

    We could fix this by moving __add_to_swap_cache()'s spin_lock_irq up
    before the SetPageSwapCache. But a better fix is simply to do what's
    five years overdue: Ken Chen introduced __set_page_dirty_no_writeback()
    (if !PageDirty TestSetPageDirty) for tmpfs to skip all the radix_tree
    overhead, and swap is just the same - it ignores the radix_tree tag, and
    does not participate in dirty page accounting, so should be using
    __set_page_dirty_no_writeback() too.
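
    The fix, in outline (mm/swap_state.c; a sketch of the relevant
    operation, other members elided):

        static const struct address_space_operations swap_aops = {
                .writepage      = swap_writepage,
                .set_page_dirty = __set_page_dirty_no_writeback,
                                        /* was __set_page_dirty_nobuffers */
                ...
        };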

    s390 testing now confirms that this does indeed fix the problem.

    Reported-by: Mel Gorman
    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Andrew Morton
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Rik van Riel
    Cc: Ken Chen
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     

23 Apr, 2012

1 commit

  • commit 66aebce747eaf9bc456bf1f1b217d8db843031d0 upstream.

    The race is as follows:

    Suppose a multi-threaded task forks a new process (on cpu A), thus
    bumping up the ref count on all the pages. While the fork is occurring
    (and thus we have marked all the PTEs as read-only), another thread in
    the original process (on cpu B) tries to write to a huge page, taking an
    access violation from the write-protect and calling hugetlb_cow(). Now,
    suppose the fork() fails. It will undo the COW and decrement the ref
    count on the pages, so the ref count on the huge page drops back to 1.
    Meanwhile hugetlb_cow() also decrements the ref count by one on the
    original page, since the original address space doesn't need it any
    more, having copied a new page to replace the original page. This
    leaves the ref count at zero, and when we call unlock_page(), we panic.

    fork on CPU A                           fault on CPU B
    =============                           ==============
    ...
    down_write(&parent->mmap_sem);
    down_write_nested(&child->mmap_sem);
    ...
    while duplicating vmas
      if error
        break;
    ...
    up_write(&child->mmap_sem);
    up_write(&parent->mmap_sem);            ...
                                            down_read(&parent->mmap_sem);
                                            ...
                                            lock_page(page);
                                            handle COW
                                            page_mapcount(old_page) == 2
                                            alloc and prepare new_page
    ...
    handle error
    page_remove_rmap(page);
    put_page(page);
    ...
                                            fold new_page into pte
                                            page_remove_rmap(page);
                                            put_page(page);
                                            ...
                                   oops ==> unlock_page(page);
                                            up_read(&parent->mmap_sem);

    The solution is to take an extra reference to the page while we are
    holding the lock on it.
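
    The shape of the fix in hugetlb_fault() (sketch; surrounding code
    abbreviated):

        page = pte_page(entry);
        get_page(page);                 /* extra ref held across hugetlb_cow() */
        if (page != pagecache_page)
                lock_page(page);
        ...
        if (page != pagecache_page)
                unlock_page(page);
        put_page(page);                 /* drop the extra reference */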

    Signed-off-by: Chris Metcalf
    Cc: Hillf Danton
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Chris Metcalf
     

03 Apr, 2012

3 commits

  • commit 66c4c35c6bc5a1a452b024cf0364635b28fd94e4 upstream.

    sysfs_slab_add() calls various sysfs functions that actually may
    end up in userspace doing all sorts of things.

    Release the slub_lock after adding the kmem_cache structure to the list.
    At that point the address of the kmem_cache is not known so we are
    guaranteed exclusive access to the following modifications to the
    kmem_cache structure.

    If the sysfs_slab_add fails then reacquire the slub_lock to
    remove the kmem_cache structure from the list.
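
    The shape of the change in kmem_cache_create() (sketch; error handling
    abbreviated):

        list_add(&s->list, &slab_caches);
        up_write(&slub_lock);           /* drop the lock before sysfs work */
        if (sysfs_slab_add(s)) {
                down_write(&slub_lock); /* reacquire only to undo the list_add */
                list_del(&s->list);
                ...
        }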

    Reported-by: Sasha Levin
    Acked-by: Eric Dumazet
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg
    Signed-off-by: Greg Kroah-Hartman

    Christoph Lameter
     
  • commit f5bf18fa22f8c41a13eb8762c7373eb3a93a7333 upstream.

    While testing AMS (Active Memory Sharing) / CMO (Cooperative Memory
    Overcommit) on powerpc, we tripped the following:

    kernel BUG at mm/bootmem.c:483!
    cpu 0x0: Vector: 700 (Program Check) at [c000000000c03940]
    pc: c000000000a62bd8: .alloc_bootmem_core+0x90/0x39c
    lr: c000000000a64bcc: .sparse_early_usemaps_alloc_node+0x84/0x29c
    sp: c000000000c03bc0
    msr: 8000000000021032
    current = 0xc000000000b0cce0
    paca = 0xc000000001d80000
    pid = 0, comm = swapper
    kernel BUG at mm/bootmem.c:483!
    enter ? for help
    [c000000000c03c80] c000000000a64bcc
    .sparse_early_usemaps_alloc_node+0x84/0x29c
    [c000000000c03d50] c000000000a64f10 .sparse_init+0x12c/0x28c
    [c000000000c03e20] c000000000a474f4 .setup_arch+0x20c/0x294
    [c000000000c03ee0] c000000000a4079c .start_kernel+0xb4/0x460
    [c000000000c03f90] c000000000009670 .start_here_common+0x1c/0x2c

    This is

    BUG_ON(limit && goal + size > limit);

    and after some debugging, it seems that

    goal = 0x7ffff000000
    limit = 0x80000000000

    and sparse_early_usemaps_alloc_node ->
    sparse_early_usemaps_alloc_pgdat_section calls

    return alloc_bootmem_section(usemap_size() * count, section_nr);

    This is on a system with 8TB available via the AMS pool, and as a quirk
    of AMS in firmware, all of that memory shows up in node 0. So, we end
    up with an allocation that will fail the goal/limit constraints.

    In theory, we could "fall-back" to alloc_bootmem_node() in
    sparse_early_usemaps_alloc_node(), but since we actually have HOTREMOVE
    defined, we'll BUG_ON() instead. A simple solution appears to be to
    unconditionally remove the limit condition in alloc_bootmem_section,
    meaning allocations are allowed to cross section boundaries (necessary
    for systems of this size).
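
    The shape of the change in alloc_bootmem_section() (sketch; the exact
    function body is assumed from the description above):

        pfn = section_nr_to_pfn(section_nr);
        goal = pfn << PAGE_SHIFT;
        /* no per-section "limit" anymore: pass 0 so the allocation may
         * cross section boundaries */
        return alloc_bootmem_core(bdata, size, SMP_CACHE_BYTES, goal, 0);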

    Johannes Weiner pointed out that if alloc_bootmem_section() no longer
    guarantees section-locality, we need check_usemap_section_nr() to print
    possible cross-dependencies between node descriptors and the usemaps
    allocated through it. That makes the two loops in
    sparse_early_usemaps_alloc_node() identical, so re-factor the code a
    bit.

    [akpm@linux-foundation.org: code simplification]
    Signed-off-by: Nishanth Aravamudan
    Cc: Dave Hansen
    Cc: Anton Blanchard
    Cc: Paul Mackerras
    Cc: Ben Herrenschmidt
    Cc: Robert Jennings
    Acked-by: Johannes Weiner
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Nishanth Aravamudan
     
  • commit 1a5a9906d4e8d1976b701f889d8f35d54b928f25 upstream.

    In some cases it may happen that pmd_none_or_clear_bad() is called with
    the mmap_sem hold in read mode. In those cases the huge page faults can
    allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a
    false positive from pmd_bad() that will not like to see a pmd
    materializing as trans huge.

    It's not khugepaged causing the problem, khugepaged holds the mmap_sem
    in write mode (and all those sites must hold the mmap_sem in read mode
    to prevent pagetables to go away from under them, during code review it
    seems vm86 mode on 32bit kernels requires that too unless it's
    restricted to 1 thread per process or UP builds). The race is only with
    the huge pagefaults that can convert a pmd_none() into a
    pmd_trans_huge().

    Effectively all these pmd_none_or_clear_bad() sites running with
    mmap_sem in read mode are somewhat speculative with the page faults, and
    the result is always undefined when they run simultaneously. This is
    probably why it wasn't common to run into this. For example if the
    madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
    fault, the hugepage will not be zapped, if the page fault runs first it
    will be zapped.

    Altering pmd_bad() not to error out if it finds hugepmds won't be enough
    to fix this, because zap_pmd_range would then proceed to call
    zap_pte_range (which would be incorrect if the pmd become a
    pmd_trans_huge()).

    The simplest way to fix this is to read the pmd in the local stack
    (regardless of what we read, no need of actual CPU barriers, only
    compiler barrier needed), and be sure it is not changing under the code
    that computes its value. Even if the real pmd is changing under the
    value we hold on the stack, we don't care. If we actually end up in
    zap_pte_range it means the pmd was not none already and it was not huge,
    and it can't become huge from under us (khugepaged locking explained
    above).

    All we need is to enforce that there is no way anymore that in a code
    path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad
    can run into a hugepmd. The overhead of a barrier() is just a compiler
    tweak and should not be measurable (I only added it for THP builds). I
    don't exclude different compiler versions may have prevented the race
    too by caching the value of *pmd on the stack (that hasn't been
    verified, but it wouldn't be impossible considering
    pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines
    and there's no external function called in between pmd_trans_huge and
    pmd_none_or_clear_bad).

    if (pmd_trans_huge(*pmd)) {
    if (next-addr != HPAGE_PMD_SIZE) {
    VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
    split_huge_page_pmd(vma->vm_mm, pmd);
    } else if (zap_huge_pmd(tlb, vma, pmd, addr))
    continue;
    /* fall through */
    }
    if (pmd_none_or_clear_bad(pmd))
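
    A condensed sketch of the helper this approach introduces (body
    abbreviated; treat the details as approximate):

        static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
        {
                pmd_t pmdval = *pmd;    /* snapshot onto the stack */
        #ifdef CONFIG_TRANSPARENT_HUGEPAGE
                barrier();              /* keep the compiler from re-reading *pmd */
        #endif
                if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
                        return 1;       /* caller must not descend to pte level */
                if (unlikely(pmd_bad(pmdval))) {
                        pmd_clear_bad(pmd);
                        return 1;
                }
                return 0;
        }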

    Because this race condition could be exercised without special
    privileges this was reported in CVE-2012-1179.

    The race was identified and fully explained by Ulrich who debugged it.
    I'm quoting his accurate explanation below, for reference.

    ====== start quote =======
    mapcount 0 page_mapcount 1
    kernel BUG at mm/huge_memory.c:1384!

    At some point prior to the panic, a "bad pmd ..." message similar to the
    following is logged on the console:

    mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).

    The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
    the page's PMD table entry.

    143 void pmd_clear_bad(pmd_t *pmd)
    144 {
    -> 145 pmd_ERROR(*pmd);
    146 pmd_clear(pmd);
    147 }

    After the PMD table entry has been cleared, there is an inconsistency
    between the actual number of PMD table entries that are mapping the page
    and the page's map count (_mapcount field in struct page). When the page
    is subsequently reclaimed, __split_huge_page() detects this inconsistency.

    1381 if (mapcount != page_mapcount(page))
    1382 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
    1383 mapcount, page_mapcount(page));
    -> 1384 BUG_ON(mapcount != page_mapcount(page));

    The root cause of the problem is a race of two threads in a multithreaded
    process. Thread B incurs a page fault on a virtual address that has never
    been accessed (PMD entry is zero) while Thread A is executing an madvise()
    system call on a virtual address within the same 2 MB (huge page) range.

            virtual address space
            .---------------------.
            |                     |
            |                     |
          .-|---------------------|
          | |                     |
          | |                     |<-- B(fault)
          | |                     |
     2 MB | |/////////////////////|-.
     huge < |/////////////////////|  > A(range)
     page | |/////////////////////|-'
          | |                     |
          | |                     |
          '-|---------------------|
            |                     |
            |                     |
            '---------------------'

    - Thread A is executing an madvise(..., MADV_DONTNEED) system call
    on the virtual address range "A(range)" shown in the picture.

    sys_madvise
    // Acquire the semaphore in shared mode.
    down_read(&current->mm->mmap_sem)
    ...
    madvise_vma
    switch (behavior)
    case MADV_DONTNEED:
    madvise_dontneed
    zap_page_range
    unmap_vmas
    unmap_page_range
    zap_pud_range
    zap_pmd_range
    //
    // Assume that this huge page has never been accessed.
    // I.e. content of the PMD entry is zero (not mapped).
    //
    if (pmd_trans_huge(*pmd)) {
    // We don't get here due to the above assumption.
    }
    //
    // Assume that Thread B incurred a page fault and
    .---------> // sneaks in here as shown below.
    | //
    | if (pmd_none_or_clear_bad(pmd))
    | {
    | if (unlikely(pmd_bad(*pmd)))
    | pmd_clear_bad
    | {
    | pmd_ERROR
    | // Log "bad pmd ..." message here.
    | pmd_clear
    | // Clear the page's PMD entry.
    | // Thread B incremented the map count
    | // in page_add_new_anon_rmap(), but
    | // now the page is no longer mapped
    | // by a PMD entry (-> inconsistency).
    | }
    | }
    |
    v
    - Thread B is handling a page fault on virtual address "B(fault)" shown
    in the picture.

    ...
    do_page_fault
    __do_page_fault
    // Acquire the semaphore in shared mode.
    down_read_trylock(&mm->mmap_sem)
    ...
    handle_mm_fault
    if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
    // We get here due to the above assumption (PMD entry is zero).
    do_huge_pmd_anonymous_page
    alloc_hugepage_vma
    // Allocate a new transparent huge page here.
    ...
    __do_huge_pmd_anonymous_page
    ...
    spin_lock(&mm->page_table_lock)
    ...
    page_add_new_anon_rmap
    // Here we increment the page's map count (starts at -1).
    atomic_set(&page->_mapcount, 0)
    set_pmd_at
    // Here we set the page's PMD entry which will be cleared
    // when Thread A calls pmd_clear_bad().
    ...
    spin_unlock(&mm->page_table_lock)

    The mmap_sem does not prevent the race because both threads are acquiring
    it in shared mode (down_read). Thread B holds the page_table_lock while
    the page's map count and PMD table entry are updated. However, Thread A
    does not synchronize on that lock.

    ====== end quote =======

    [akpm@linux-foundation.org: checkpatch fixes]
    Reported-by: Ulrich Obergfell
    Signed-off-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Dave Jones
    Acked-by: Larry Woodman
    Acked-by: Rik van Riel
    Cc: Mark Salter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrea Arcangeli
     

13 Mar, 2012

3 commits

  • commit 1c641e84719429bbfe62a95ed3545ee7fe24408f upstream.

    Dave Jones reports a few Fedora users hitting the BUG_ON(mm->nr_ptes...)
    in exit_mmap() recently.

    Quoting Hugh's discovery and explanation of the SMP race condition:

    "mm->nr_ptes had unusual locking: down_read mmap_sem plus
    page_table_lock when incrementing, down_write mmap_sem (or mm_users
    0) when decrementing; whereas THP is careful to increment and
    decrement it under page_table_lock.

    Now most of those paths in THP also hold mmap_sem for read or write
    (with appropriate checks on mm_users), but two do not: when
    split_huge_page() is called by hwpoison_user_mappings(), and when
    called by add_to_swap().

    It's conceivable that the latter case is responsible for the
    exit_mmap() BUG_ON mm->nr_ptes that has been reported on Fedora."

    The simplest way to fix it without having to alter the locking is to make
    split_huge_page() a noop in nr_ptes terms, by counting the preallocated
    pagetables that exist for every mapped hugepage. It was an arbitrary
    choice not to count them and either way is not wrong or right, because
    they are not used but they're still allocated.

    Reported-by: Dave Jones
    Reported-by: Hugh Dickins
    Signed-off-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Cc: David Rientjes
    Cc: Josh Boyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrea Arcangeli
     
  • commit b94cfaf6685d691dc3fab023cf32f65e9b7be09c upstream.

    Don't clear vm_mm in a deleted VMA as it's unnecessary and might
    conceivably break the filesystem or driver VMA close routine.

    Reported-by: Al Viro
    Signed-off-by: David Howells
    Acked-by: Al Viro
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Howells
     
  • commit 371528caec553785c37f73fa3926ea0de84f986f upstream.

    There is an issue when memcg unregisters events that were attached to
    the same eventfd:

    - On the first call mem_cgroup_usage_unregister_event() removes all
    events attached to a given eventfd, and if there were no events left,
    thresholds->primary would become NULL;

    - Since there were several events registered, cgroups core will call
    mem_cgroup_usage_unregister_event() again, but now kernel will oops,
    as the function doesn't expect that threshold->primary may be NULL.

    That's a good question whether mem_cgroup_usage_unregister_event()
    should actually remove all events in one go, but nowadays it can't
    do any better as cftype->unregister_event callback doesn't pass
    any private event-associated cookie. So, let's fix the issue by
    simply checking for threshold->primary.
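
    The minimal guard the patch adds near the top of
    mem_cgroup_usage_unregister_event() (sketch; locking context omitted):

        if (!thresholds->primary)
                goto unlock;    /* everything was removed on the first call */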

    FWIW, w/o the patch the following oops may be observed:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
    IP: [] mem_cgroup_usage_unregister_event+0x9c/0x1f0
    Pid: 574, comm: kworker/0:2 Not tainted 3.3.0-rc4+ #9 Bochs Bochs
    RIP: 0010:[] [] mem_cgroup_usage_unregister_event+0x9c/0x1f0
    RSP: 0018:ffff88001d0b9d60 EFLAGS: 00010246
    Process kworker/0:2 (pid: 574, threadinfo ffff88001d0b8000, task ffff88001de91cc0)
    Call Trace:
    [] cgroup_event_remove+0x2b/0x60
    [] process_one_work+0x174/0x450
    [] worker_thread+0x123/0x2d0

    Signed-off-by: Anton Vorontsov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Anton Vorontsov
     

01 Mar, 2012

1 commit

  • commit 918e556ec214ed2f584e4cac56d7b29e4bb6bf27 upstream.

    Lock i_mmap_mutex for access to the VMA prio list to prevent concurrent
    access. Currently, certain parts of the mmap handling are protected by
    the region mutex, but not all.

    Reported-by: Al Viro
    Signed-off-by: David Howells
    Acked-by: Al Viro
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Howells
     

21 Feb, 2012

1 commit

  • commit 73736e0387ba0e6d2b703407b4d26168d31516a7 upstream.

    Zhihua Che reported a possible memleak in slub allocator on
    CONFIG_PREEMPT=y builds.

    It is possible that the current thread migrates right before disabling
    irqs in __slab_alloc(). We must check c->freelist again, and perform a
    normal allocation instead of scratching c->freelist.
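
    The shape of the added check in __slab_alloc(), after irqs have been
    disabled (sketch):

        /* must check c->freelist again: we may have migrated to another cpu,
         * or an IRQ may have refilled the freelist in the meantime */
        object = c->freelist;
        if (object)
                goto load_freelist;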

    Many thanks to Zhihua Che for spotting this bug, introduced in 2.6.39

    V2: It's also possible that an IRQ freed one (or several) object(s) and
    populated c->freelist, so it's not a CONFIG_PREEMPT-only problem.

    Reported-by: Zhihua Che
    Signed-off-by: Eric Dumazet
    Acked-by: Christoph Lameter
    Signed-off-by: Pekka Enberg
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

14 Feb, 2012

5 commits

  • commit b9980cdcf2524c5fe15d8cbae9c97b3ed6385563 upstream.

    Fix CONFIG_TRANSPARENT_HUGEPAGE=y CONFIG_SMP=n CONFIG_DEBUG_VM=y
    CONFIG_DEBUG_SPINLOCK=n kernel: spin_is_locked() is then always false,
    and so triggers some BUGs in Transparent HugePage codepaths.

    asm-generic/bug.h mentions this problem, and provides a WARN_ON_SMP(x);
    but being too lazy to add VM_BUG_ON_SMP, BUG_ON_SMP, WARN_ON_SMP_ONCE,
    VM_WARN_ON_SMP_ONCE, just test NR_CPUS != 1 in the existing VM_BUG_ONs.
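
    The shape of the change (a sketch of one of the adjusted assertions):

        VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&mm->page_table_lock));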

    Signed-off-by: Hugh Dickins
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     
  • commit dc9086004b3d5db75997a645b3fe08d9138b7ad0 upstream.

    When isolating pages for migration, migration starts at the start of a
    zone while the free scanner starts at the end of the zone. Migration
    avoids entering a new zone by never going beyond the free scanner.

    Unfortunately, in very rare cases nodes can overlap. When this happens,
    migration isolates pages without the LRU lock held, corrupting lists
    which will trigger errors in reclaim or during page free such as in the
    following oops

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    IP: [] free_pcppages_bulk+0xcc/0x450
    PGD 1dda554067 PUD 1e1cb58067 PMD 0
    Oops: 0000 [#1] SMP
    CPU 37
    Pid: 17088, comm: memcg_process_s Tainted: G X
    RIP: free_pcppages_bulk+0xcc/0x450
    Process memcg_process_s (pid: 17088, threadinfo ffff881c2926e000, task ffff881c2926c0c0)
    Call Trace:
    free_hot_cold_page+0x17e/0x1f0
    __pagevec_free+0x90/0xb0
    release_pages+0x22a/0x260
    pagevec_lru_move_fn+0xf3/0x110
    putback_lru_page+0x66/0xe0
    unmap_and_move+0x156/0x180
    migrate_pages+0x9e/0x1b0
    compact_zone+0x1f3/0x2f0
    compact_zone_order+0xa2/0xe0
    try_to_compact_pages+0xdf/0x110
    __alloc_pages_direct_compact+0xee/0x1c0
    __alloc_pages_slowpath+0x370/0x830
    __alloc_pages_nodemask+0x1b1/0x1c0
    alloc_pages_vma+0x9b/0x160
    do_huge_pmd_anonymous_page+0x160/0x270
    do_page_fault+0x207/0x4c0
    page_fault+0x25/0x30

    The "X" in the taint flag means that external modules were loaded but but
    is unrelated to the bug triggering. The real problem was because the PFN
    layout looks like this

    Zone PFN ranges:
    DMA 0x00000010 -> 0x00001000
    DMA32 0x00001000 -> 0x00100000
    Normal 0x00100000 -> 0x01e80000
    Movable zone start PFN for each node
    early_node_map[14] active PFN ranges
    0: 0x00000010 -> 0x0000009b
    0: 0x00000100 -> 0x0007a1ec
    0: 0x0007a354 -> 0x0007a379
    0: 0x0007f7ff -> 0x0007f800
    0: 0x00100000 -> 0x00680000
    1: 0x00680000 -> 0x00e80000
    0: 0x00e80000 -> 0x01080000
    1: 0x01080000 -> 0x01280000
    0: 0x01280000 -> 0x01480000
    1: 0x01480000 -> 0x01680000
    0: 0x01680000 -> 0x01880000
    1: 0x01880000 -> 0x01a80000
    0: 0x01a80000 -> 0x01c80000
    1: 0x01c80000 -> 0x01e80000

    The fix is straight-forward. isolate_migratepages() has to make a
    similar check to isolate_freepages() to ensure that it never isolates pages
    from a zone it does not hold the LRU lock for.
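
    The shape of the added check in isolate_migratepages() (sketch):

        page = pfn_to_page(low_pfn);
        if (page_zone(page) != zone)    /* nodes can overlap: skip foreign pages */
                continue;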

    This was discovered in a 3.0-based kernel but it affects 3.1.x, 3.2.x
    and current mainline.

    Signed-off-by: Mel Gorman
    Acked-by: Michal Nazarewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • …ing isolation for migration

    commit 0bf380bc70ecba68cb4d74dc656cc2fa8c4d801a upstream.

    When isolating for migration, migration starts at the start of a zone
    which is not necessarily pageblock aligned. Further, it stops isolating
    when COMPACT_CLUSTER_MAX pages are isolated so migrate_pfn is generally
    not aligned. This allows isolate_migratepages() to call pfn_to_page() on
    an invalid PFN which can result in a crash. This was originally reported
    against a 3.0-based kernel with the following trace in a crash dump.

    PID: 9902 TASK: d47aecd0 CPU: 0 COMMAND: "memcg_process_s"
    #0 [d72d3ad0] crash_kexec at c028cfdb
    #1 [d72d3b24] oops_end at c05c5322
    #2 [d72d3b38] __bad_area_nosemaphore at c0227e60
    #3 [d72d3bec] bad_area at c0227fb6
    #4 [d72d3c00] do_page_fault at c05c72ec
    #5 [d72d3c80] error_code (via page_fault) at c05c47a4
    EAX: 00000000 EBX: 000c0000 ECX: 00000001 EDX: 00000807 EBP: 000c0000
    DS: 007b ESI: 00000001 ES: 007b EDI: f3000a80 GS: 6f50
    CS: 0060 EIP: c030b15a ERR: ffffffff EFLAGS: 00010002
    #6 [d72d3cb4] isolate_migratepages at c030b15a
    #7 [d72d3d14] zone_watermark_ok at c02d26cb
    #8 [d72d3d2c] compact_zone at c030b8de
    #9 [d72d3d68] compact_zone_order at c030bba1
    #10 [d72d3db4] try_to_compact_pages at c030bc84
    #11 [d72d3ddc] __alloc_pages_direct_compact at c02d61e7
    #12 [d72d3e08] __alloc_pages_slowpath at c02d66c7
    #13 [d72d3e78] __alloc_pages_nodemask at c02d6a97
    #14 [d72d3eb8] alloc_pages_vma at c030a845
    #15 [d72d3ed4] do_huge_pmd_anonymous_page at c03178eb
    #16 [d72d3f00] handle_mm_fault at c02f36c6
    #17 [d72d3f30] do_page_fault at c05c70ed
    #18 [d72d3fb0] error_code (via page_fault) at c05c47a4
    EAX: b71ff000 EBX: 00000001 ECX: 00001600 EDX: 00000431
    DS: 007b ESI: 08048950 ES: 007b EDI: bfaa3788
    SS: 007b ESP: bfaa36e0 EBP: bfaa3828 GS: 6f50
    CS: 0073 EIP: 080487c8 ERR: ffffffff EFLAGS: 00010202

    It was also reported by Herbert van den Bergh against 3.1-based kernel
    with the following snippet from the console log.

    BUG: unable to handle kernel paging request at 01c00008
    IP: [<c0522399>] isolate_migratepages+0x119/0x390
    *pdpt = 000000002f7ce001 *pde = 0000000000000000

    It is expected that it also affects 3.2.x and current mainline.

    The problem is that pfn_valid is only called on the first PFN being
    checked and that PFN is not necessarily aligned. Let's say we have a case
    like this

    H = MAX_ORDER_NR_PAGES boundary
    | = pageblock boundary
    m = cc->migrate_pfn
    f = cc->free_pfn
    o = memory hole

    H------|------H------|----m-Hoooooo|ooooooH-f----|------H

    The migrate_pfn is just below a memory hole and the free scanner is beyond
    the hole. When isolate_migratepages started, it scans from migrate_pfn to
    migrate_pfn+pageblock_nr_pages which is now in a memory hole. It checks
    pfn_valid() on the first PFN but then scans into the hole where there are
    not necessarily valid struct pages.

    This patch ensures that isolate_migratepages calls pfn_valid when
    necessary.
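
    The shape of the added check in isolate_migratepages() (sketch):

        /* Only call pfn_valid() when crossing a MAX_ORDER_NR_PAGES boundary;
         * within such a block validity cannot change. */
        if ((low_pfn & (MAX_ORDER_NR_PAGES - 1)) == 0 && !pfn_valid(low_pfn)) {
                low_pfn += MAX_ORDER_NR_PAGES - 1;
                continue;
        }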

    Reported-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com>
    Tested-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com>
    Signed-off-by: Mel Gorman <mgorman@suse.de>
    Acked-by: Michal Nazarewicz <mina86@mina86.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

    Mel Gorman
     
  • commit 99f02ef1f18631eb0a4e0ea0a3d56878dbcb4b90 upstream.

    Fix a race condition that shows in conjunction with xip_file_fault() when
    two threads of the same user process fault on the same memory page.

    In this case, the race winner will install the page table entry and the
    unlucky loser will cause an oops: xip_file_fault calls vm_insert_pfn (via
    vm_insert_mixed) which drops out at this check:

    retval = -EBUSY;
    if (!pte_none(*pte))
    goto out_unlock;

    The resulting -EBUSY return value will trigger a BUG_ON() in
    xip_file_fault.

    This fix simply considers the fault as fixed in this case, because the
    race winner has successfully installed the pte.
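
    The shape of the fix in xip_file_fault() (sketch):

        err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address, xip_pfn);
        if (err == -ENOMEM)
                return VM_FAULT_OOM;
        /* -EBUSY means another thread already installed the pte: fine */
        if (err != -EBUSY)
                BUG_ON(err);
        return VM_FAULT_NOPAGE;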

    [akpm@linux-foundation.org: use conventional (and consistent) comment layout]
    Reported-by: David Sadler
    Signed-off-by: Carsten Otte
    Reported-by: Louis Alex Eisner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Carsten Otte
     
  • commit 3deaa7190a8da38453c4fabd9dec7f66d17fff67 upstream.

    Herbert Poetzl reported a performance regression since 2.6.39. The test
    is a simple dd read, but with big block size. The reason is:

    T1: ra (A, A+128k), (A+128k, A+256k)
    T2: lock_page for page A, submit the 256k
    T3: hit page A+128k, ra (A+256k, A+384k). The range isn't submitted
    because of the plug, and there isn't any lock_page till we hit page A+256k
    because all pages from A to A+256k are in memory
    T4: hit page A+256k, ra (A+384k, A+512k). Because of the plug, the range
    isn't submitted again.
    T5: lock_page A+256k, so (A+256k, A+512k) will be submitted. The task is
    waiting for (A+256k, A+512k) to finish.

    There is no request to disk in T3 and T4, so readahead pipeline breaks.

    We really don't need block plug for generic_file_aio_read() for buffered
    I/O. The readahead already has plug and has fine grained control when I/O
    should be submitted. Deleting plug for buffered I/O fixes the regression.
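
    In outline, the fix deletes the top-level plug from
    generic_file_aio_read(); the removed calls were of this shape (sketch):

        struct blk_plug plug;

        blk_start_plug(&plug);
        /* ... buffered/direct read path ... */
        blk_finish_plug(&plug);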

    One side effect is that the plug makes the request size 256k, whereas it
    is 128k without it. This is because the default readahead size is 128k,
    which is not a reason we need the plug here.

    Vivek said:

    : We submit some readahead IO to device request queue but because of nested
    : plug, queue never gets unplugged. When read logic reaches a page which is
    : not in page cache, it waits for page to be read from the disk
    : (lock_page_killable()) and that time we flush the plug list.
    :
    : So effectively read ahead logic is kind of broken in parts because of
    : nested plugging. Removing top level plug (generic_file_aio_read()) for
    : buffered reads, will allow unplugging queue earlier for readahead.

    Signed-off-by: Shaohua Li
    Signed-off-by: Wu Fengguang
    Reported-by: Herbert Poetzl
    Tested-by: Eric Dumazet
    Cc: Christoph Hellwig
    Cc: Jens Axboe
    Cc: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Shaohua Li
     

26 Jan, 2012

2 commits

  • commit 687875fb7de4a95223af20ee024282fa9099f860 upstream.

    Fix the following NULL ptr dereference caused by

    cat /sys/devices/system/memory/memory0/removable

    Pid: 13979, comm: sed Not tainted 3.0.13-0.5-default #1 IBM BladeCenter LS21 -[7971PAM]-/Server Blade
    RIP: __count_immobile_pages+0x4/0x100
    Process sed (pid: 13979, threadinfo ffff880221c36000, task ffff88022e788480)
    Call Trace:
    is_pageblock_removable_nolock+0x34/0x40
    is_mem_section_removable+0x74/0xf0
    show_mem_removable+0x41/0x70
    sysfs_read_file+0xfe/0x1c0
    vfs_read+0xc7/0x130
    sys_read+0x53/0xa0
    system_call_fastpath+0x16/0x1b

    We are crashing because we are trying to dereference NULL zone which
    came from pfn=0 (struct page ffffea0000000000). According to the boot
    log this page is marked reserved:
    e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)

    and early_node_map confirms that:
    early_node_map[3] active PFN ranges
    1: 0x00000010 -> 0x0000009c
    1: 0x00000100 -> 0x000bffa3
    1: 0x00100000 -> 0x00240000

    The problem is that memory_present works in PAGE_SECTION_MASK aligned
    blocks so the reserved range sneaks into the section as well. This
    also means that free_area_init_node will not take care of those reserved
    pages and they stay uninitialized.

    When we try to read the removable status we walk through all available
    sections and hope that the zone is valid for all pages in the section.
    But this is not true in this case as the zone and nid are not initialized.

    We have only one node in this particular case and it is marked as node=1
    (rather than 0) and that made the problem visible because page_to_nid will
    return 0 and there are no zones on the node.

    Let's check that the zone is valid and that the given pfn falls into its
    boundaries and mark the section not removable. This might cause some
    false positives, probably, but we do not have any sane way to find out
    whether the page is reserved by the platform or it is just not used for
    whatever other reasons.
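
    The shape of the added validity check in is_pageblock_removable_nolock()
    (sketch; the exact bounds test is assumed from the description above):

        zone = page_zone(page);
        pfn = page_to_pfn(page);
        /* the section may extend past the zone: such pfns are not removable */
        if (zone->zone_start_pfn > pfn ||
            zone->zone_start_pfn + zone->spanned_pages <= pfn)
                return 0;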

    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     
  • commit ab936cbcd02072a34b60d268f94440fd5cf1970b upstream.

    Commit ef6a3c6311 ("mm: add replace_page_cache_page() function") added a
    function replace_page_cache_page(). This function replaces a page in the
    radix-tree with a new page. When doing this, the memory cgroup needs to fix
    up the accounting information: memcg needs to check the PCG_USED bit etc.

    In some (many?) cases, 'newpage' is on the LRU before calling
    replace_page_cache(). So, memcg's LRU accounting information should be
    fixed, too.

    This patch adds mem_cgroup_replace_page_cache() and removes the old hooks.
    In that function, old pages will be unaccounted without touching
    res_counter and new page will be accounted to the memcg (of old page).
    When overwriting pc->mem_cgroup of newpage, take zone->lru_lock and avoid
    races with LRU handling.

    Background:
    replace_page_cache_page() is called by FUSE code in its splice() handling.
    Here, 'newpage' is replacing oldpage but this newpage is not a newly allocated
    page and may be on LRU. LRU mis-accounting will be critical for memory cgroup
    because rmdir() checks the whole LRU is empty and there is no account leak.
    If a page is on the other LRU than it should be, rmdir() will fail.

    This bug was added in March 2011, but no bug report yet. I guess there
    are not many people who use memcg and FUSE at the same time with upstream
    kernels.

    The result of this bug is that admin cannot destroy a memcg because of
    account leak. So, no panic, no deadlock. And, even if an active cgroup
    exists, umount can succeed. So there is no problem at shutdown.

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Miklos Szeredi
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    KAMEZAWA Hiroyuki
     

07 Jan, 2012

5 commits

  • commit b0365c8d0cb6e79eb5f21418ae61ab511f31b575 upstream.

    If a huge page is enqueued under the protection of hugetlb_lock, then the
    operation is atomic and safe.

    Signed-off-by: Hillf Danton
    Reviewed-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hillf Danton
     
  • commit a41c58a6665cc995e237303b05db42100b71b65e upstream.

    If the request is to create a non-root group and we fail to meet it, we
    should leave the root unchanged.

    Signed-off-by: Hillf Danton
    Signed-off-by: Hugh Dickins
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hillf Danton
     
  • commit e6f67b8c05f5e129e126f4409ddac6f25f58ffcb upstream.

    lockdep reports a deadlock in jfs because a special inode's rw semaphore
    is taken recursively. The mapping's gfp mask is GFP_NOFS, but is not
    used when __read_cache_page() calls add_to_page_cache_lru().
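
    In outline, the fix makes __read_cache_page() honour the caller's gfp
    mask (sketch):

        err = add_to_page_cache_lru(page, mapping, index, gfp);
                                        /* was: GFP_KERNEL, which ignores the
                                           mapping's GFP_NOFS restriction */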

    Signed-off-by: Dave Kleikamp
    Acked-by: Hugh Dickins
    Acked-by: Al Viro
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dave Kleikamp
     
  • commit ff05b6f7ae762b6eb464183eec994b28ea09f6dd upstream.

    An integer overflow will happen on 64bit archs if task's sum of rss,
    swapents and nr_ptes exceeds (2^31)/1000 value. This was introduced by
    commit

    f755a04 oom: use pte pages in OOM score

    where the oom score computation was divided into several steps and it's no
    longer computed as one expression in unsigned long(rss, swapents, nr_pte
    are unsigned long), where the result value assigned to points(int) is in
    range(1..1000). So there could be an int overflow while computing

    176 points *= 1000;

    and points may have a negative value, meaning the oom score for a mem hog
    task will be one, because only the negative value is considered in:

    196 if (points <= 0)
    197         return 1;

    Acked-by: KOSAKI Motohiro
    Acked-by: Oleg Nesterov
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Frantisek Hrbata
     
  • commit 9f57bd4d6dc69a4e3bf43044fa00fcd24dd363e3 upstream.

    per_cpu_ptr_to_phys() incorrectly rounds its result for the non-kmalloc
    case down to the page boundary, which is bogus for any non-page-aligned
    address.

    This affects the only in-tree user of this function - sysfs handler
    for per-cpu 'crash_notes' physical address. The trouble is that the
    crash_notes per-cpu variable is not page-aligned:

    crash_notes = 0xc08e8ed4
    PER-CPU OFFSET VALUES:
    CPU 0: 3711f000
    CPU 1: 37129000
    CPU 2: 37133000
    CPU 3: 3713d000

    So, the per-cpu addresses are:
    crash_notes on CPU 0: f7a07ed4 => phys 36b57ed4
    crash_notes on CPU 1: f7a11ed4 => phys 36b4ded4
    crash_notes on CPU 2: f7a1bed4 => phys 36b43ed4
    crash_notes on CPU 3: f7a25ed4 => phys 36b39ed4

    However, /sys/devices/system/cpu/cpu*/crash_notes says:
    /sys/devices/system/cpu/cpu0/crash_notes: 36b57000
    /sys/devices/system/cpu/cpu1/crash_notes: 36b4d000
    /sys/devices/system/cpu/cpu2/crash_notes: 36b43000
    /sys/devices/system/cpu/cpu3/crash_notes: 36b39000

    As you can see, all values are rounded down to a page
    boundary. Consequently, this is where kexec sets up the NOTE segments,
    and thus where the secondary kernel is looking for them. However, when
    the first kernel crashes, it saves the notes to the unaligned
    addresses, where they are not found.

    Fix it by adding offset_in_page() to the translated page address.
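
    The shape of the fix in per_cpu_ptr_to_phys() (sketch; only the return
    expressions shown, taken from the description above):

        return page_to_phys(pcpu_addr_to_page(addr)) + offset_in_page(addr);
        /* and likewise for the vmalloc-backed case:
         *   page_to_phys(vmalloc_to_page(addr)) + offset_in_page(addr) */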

    -tj: Combined Eugene's and Petr's commit messages.

    Signed-off-by: Eugene Surovegin
    Signed-off-by: Tejun Heo
    Reported-by: Petr Tesarik
    Signed-off-by: Greg Kroah-Hartman

    Eugene Surovegin
     

22 Dec, 2011

3 commits

  • commit a855b84c3d8c73220d4d3cd392a7bee7c83de70e upstream.

    Percpu allocator recorded the cpus which map to the first and last
    units in pcpu_first/last_unit_cpu respectively and used them to
    determine the address range of a chunk - e.g. it assumed that the
    first unit has the lowest address in a chunk while the last unit has
    the highest address.

    This simply isn't true. Groups in a chunk can have arbitrary positive
    or negative offsets from the previous one and there is no guarantee
    that the first unit occupies the lowest offset while the last one the
    highest.

    Fix it by actually comparing unit offsets to determine cpus occupying
    the lowest and highest offsets. Also, rename pcpu_first/last_unit_cpu
    to pcpu_low/high_unit_cpu to avoid confusion.

    The chunk address range is used to flush cache on vmalloc area
    map/unmap and decide whether a given address is in the first chunk by
    per_cpu_ptr_to_phys() and the bug was discovered by invalid
    per_cpu_ptr_to_phys() translation for crash_note.

    Kudos to Dave Young for tracking down the problem.

    Signed-off-by: Tejun Heo
    Reported-by: WANG Cong
    Reported-by: Dave Young
    Tested-by: Dave Young
    LKML-Reference:
    Signed-off-by: Thomas Renninger
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • commit 1368edf0647ac112d8cfa6ce47257dc950c50f5c upstream.

    Commit f5252e00 ("mm: avoid null pointer access in vm_struct via
    /proc/vmallocinfo") adds newly allocated vm_structs to the vmlist after
    it is fully initialised. Unfortunately, it did not check that
    __vmalloc_area_node() successfully populated the area. In the event of
    allocation failure, the vmalloc area is freed but the pointer to freed
    memory is inserted into the vmlist, leading to a crash later in
    get_vmalloc_info().

    This patch adds a check for __vmalloc_area_node() failure within
    __vmalloc_node_range. It does not use "goto fail" as in the previous
    error path as a warning was already displayed by __vmalloc_area_node()
    before it called vfree in its failure path.
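
    The shape of the added check in __vmalloc_node_range() (sketch):

        addr = __vmalloc_area_node(area, gfp_mask, prot, node, caller);
        if (!addr)
                return NULL;    /* area already freed and warned about */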

    Credit goes to Luciano Chavez for doing all the real work of identifying
    exactly where the problem was.

    Signed-off-by: Mel Gorman
    Reported-by: Luciano Chavez
    Tested-by: Luciano Chavez
    Reviewed-by: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit d021563888312018ca65681096f62e36c20e63cc upstream.

    setup_zone_migrate_reserve() expects that zone->start_pfn starts at
    pageblock_nr_pages aligned pfn otherwise we could access beyond an
    existing memblock resulting in the following panic if
    CONFIG_HOLES_IN_ZONE is not configured and we do not check pfn_valid:

    IP: [] setup_zone_migrate_reserve+0xcd/0x180
    *pdpt = 0000000000000000 *pde = f000ff53f000ff53
    Oops: 0000 [#1] SMP
    Pid: 1, comm: swapper Not tainted 3.0.7-0.7-pae #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
    EIP: 0060:[] EFLAGS: 00010006 CPU: 0
    EIP is at setup_zone_migrate_reserve+0xcd/0x180
    EAX: 000c0000 EBX: f5801fc0 ECX: 000c0000 EDX: 00000000
    ESI: 000c01fe EDI: 000c01fe EBP: 00140000 ESP: f2475f58
    DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
    Process swapper (pid: 1, ti=f2474000 task=f2472cd0 task.ti=f2474000)
    Call Trace:
    [] __setup_per_zone_wmarks+0xec/0x160
    [] setup_per_zone_wmarks+0xf/0x20
    [] init_per_zone_wmark_min+0x27/0x86
    [] do_one_initcall+0x2b/0x160
    [] kernel_init+0xbe/0x157
    [] kernel_thread_helper+0x6/0xd
    Code: a5 39 f5 89 f7 0f 46 fd 39 cf 76 40 8b 03 f6 c4 08 74 32 eb 91 90 89 c8 c1 e8 0e 0f be 80 80 2f 86 c0 8b 14 85 60 2f 86 c0 89 c8 82 b4 12 00 00 c1 e0 05 03 82 ac 12 00 00 8b 00 f6 c4 08 0f
    EIP: [] setup_zone_migrate_reserve+0xcd/0x180 SS:ESP 0068:f2475f58
    CR2: 00000000000012b4

    We crashed in pageblock_is_reserved() when accessing pfn 0xc0000 because
    highstart_pfn = 0x36ffe.

    The issue was introduced in 3.0-rc1 by 6d3163ce ("mm: check if any page
    in a pageblock is reserved before marking it MIGRATE_RESERVE").

    Make sure that start_pfn is always aligned to pageblock_nr_pages to
    ensure that pfn_valid is always called at the start of each pageblock.
    Architectures with holes in pageblocks will be correctly handled by
    pfn_valid_within in pageblock_is_reserved.
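
    The shape of the fix in setup_zone_migrate_reserve() (sketch):

        start_pfn = zone->zone_start_pfn;
        end_pfn = start_pfn + zone->spanned_pages;
        start_pfn = roundup(start_pfn, pageblock_nr_pages);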

    Signed-off-by: Michal Hocko
    Signed-off-by: Mel Gorman
    Tested-by: Dang Bo
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Arve Hjønnevåg
    Cc: KOSAKI Motohiro
    Cc: John Stultz
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko