21 May, 2011

1 commit

  • Commit e66eed651fd1 ("list: remove prefetching from regular list
    iterators") removed the include of prefetch.h from list.h, which
    uncovered several cases that had apparently relied on that rather
    obscure header file dependency.

    So this fixes things up a bit, using

    grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]')
    grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]')

    to guide us in finding files that either need
    inclusion, or have it despite not needing it.

    There are more of them around (mostly network drivers), but this gets
    many core ones.
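
    For illustration, a minimal sketch (not from any real driver) of what
    an affected file needs after this change:

    /*
     * Code like this used to compile only because <linux/list.h>
     * happened to drag in <linux/prefetch.h>; the include must now
     * be explicit.
     */
    #include <linux/list.h>
    #include <linux/prefetch.h>     /* no longer implied by list.h */

    static void warm_next(struct list_head *pos)
    {
            prefetch(pos->next);    /* declared in linux/prefetch.h */
    }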

    Reported-by: Stephen Rothwell
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

18 May, 2011

1 commit

  • ZONE_CONGESTED should be a state of global memory reclaim. If it is
    not, a busy memcg sets the flag and causes unnecessary throttling in
    wait_iff_congested() for memory reclaim in other contexts. This makes
    system performance bad.

    I'll think later about whether a "memcg is congested!" flag is
    required or not. But this fix is required first.
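
    A minimal sketch of the intended behaviour (naming follows the vmscan
    code of the day; the exact upstream diff may differ): only global
    reclaim may mark a zone congested.

    /* in shrink_inactive_list(): memcg-limit reclaim must not set the
     * flag, or it throttles unrelated reclaimers in wait_iff_congested() */
    if (nr_dirty && nr_dirty == nr_congested && scanning_global_lru(sc))
            zone_set_flag(zone, ZONE_CONGESTED);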

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Daisuke Nishimura
    Acked-by: Ying Han
    Cc: Balbir Singh
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

15 May, 2011

1 commit

  • Shame on me! Commit b1dea800ac39 "tmpfs: fix race between umount and
    writepage" fixed the advertized race, but introduced another: as even
    its comment makes clear, we cannot safely rely on a peek at list_empty()
    while holding no lock - until info->swapped is set, shmem_unuse_inode()
    may delete any formerly-swapped inode from the shmem_swaplist, which
    in this case would leave a swap area impossible to swapoff.

    Although I don't relish taking the mutex every time, I don't care much
    for the alternatives either; and at least the peek at list_empty() in
    shmem_evict_inode() (a hotter path since most inodes would never have
    been swapped) remains safe, because we already truncated the whole file.
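
    A sketch of the resulting shmem_writepage() pattern (simplified from
    the mm/shmem.c of this era): take the mutex unconditionally instead of
    peeking at list_empty() without a lock.

    mutex_lock(&shmem_swaplist_mutex);
    if (list_empty(&info->swaplist))
            list_add_tail(&info->swaplist, &shmem_swaplist);
    /* ... set up the swap entry while still protected ... */
    mutex_unlock(&shmem_swaplist_mutex);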

    Signed-off-by: Hugh Dickins
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

12 May, 2011

7 commits

  • Testing the shmem_swaplist replacements for igrab() revealed another bug:
    writes to /dev/loop0 on a tmpfs file which fills its filesystem were
    sometimes failing with "Buffer I/O error"s.

    These came from ENOSPC failures of shmem_getpage(), when racing with
    swapoff: the same could happen when racing with another shmem_getpage(),
    pulling the page in from swap in between our find_lock_page() and our
    taking the info->lock (though not in the single-threaded loop case).

    This is unacceptable, and surprising that I've not noticed it before:
    it dates back many years, but (presumably) was made a lot easier to
    reproduce in 2.6.36, which sited a page preallocation in the race window.

    Fix it by rechecking the page cache before settling on an ENOSPC error.
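
    The shape of the fix, as a simplified sketch (error paths omitted):
    look in the page cache once more before giving up.

    /* the block count said "full", but a racing shmem_getpage() or
     * swapoff may just have brought the page in: recheck first */
    page = find_lock_page(mapping, index);
    if (page)
            goto repeat;    /* use the page someone else instantiated */
    error = -ENOSPC;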

    Signed-off-by: Hugh Dickins
    Cc: Konstantin Khlebnikov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The use of igrab() in swapoff's shmem_unuse_inode() is just as vulnerable
    to umount as that in shmem_writepage().

    Fix this instance by extending the protection of shmem_swaplist_mutex
    right across shmem_unuse_inode(): while it's on the list, the inode cannot
    be evicted (and the filesystem cannot be unmounted) without
    shmem_evict_inode() taking that mutex to remove it from the list.

    But since shmem_writepage() might take that mutex, we should avoid making
    memory allocations or memcg charges while holding it: prepare them at the
    outer level in shmem_unuse(). When mem_cgroup_cache_charge() was
    originally placed, we didn't know until that point that the page from swap
    was actually a shmem page; but nowadays it's noted in the swap_map, so
    we're safe to charge upfront. For the radix_tree, do as is done in
    shmem_getpage(): preload upfront, but don't pin to the cpu; so we make a
    habit of refreshing the node pool, but might dip into GFP_NOWAIT reserves
    on occasion if subsequently preempted.
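
    A sketch of that preload habit (as in shmem_getpage(); simplified):
    reserve radix-tree nodes with GFP_KERNEL, but release the cpu pinning
    immediately.

    error = radix_tree_preload(GFP_KERNEL);  /* may sleep, refills pool */
    if (error)
            goto out;
    radix_tree_preload_end();       /* don't stay pinned to this cpu */
    /* the later insert may then dip into GFP_NOWAIT reserves
     * if we were preempted and migrated in between */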

    With the allocation and charge moved out from shmem_unuse_inode(),
    we can also hold index map and info->lock over from finding the entry.

    Signed-off-by: Hugh Dickins
    Cc: Konstantin Khlebnikov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Konstantin Khlebnikov reports that a dangerous race between umount and
    shmem_writepage can be reproduced by this script:

    for i in {1..300} ; do
        mkdir $i
        while true ; do
            mount -t tmpfs none $i
            dd if=/dev/zero of=$i/test bs=1M count=$(($RANDOM % 100))
            umount $i
        done &
    done

    on a 6xCPU node with 8Gb RAM: kernel very unstable after this accident. =)

    Kernel log:

    VFS: Busy inodes after unmount of tmpfs.
    Self-destruct in 5 seconds. Have a nice day...

    WARNING: at lib/list_debug.c:53 __list_del_entry+0x8d/0x98()
    list_del corruption. prev->next should be ffff880222fdaac8, but was (null)
    Pid: 11222, comm: mount.tmpfs Not tainted 2.6.39-rc2+ #4
    Call Trace:
    warn_slowpath_common+0x80/0x98
    warn_slowpath_fmt+0x41/0x43
    __list_del_entry+0x8d/0x98
    evict+0x50/0x113
    iput+0x138/0x141
    ...
    BUG: unable to handle kernel paging request at ffffffffffffffff
    IP: shmem_free_blocks+0x18/0x4c
    Pid: 10422, comm: dd Tainted: G W 2.6.39-rc2+ #4
    Call Trace:
    shmem_recalc_inode+0x61/0x66
    shmem_writepage+0xba/0x1dc
    pageout+0x13c/0x24c
    shrink_page_list+0x28e/0x4be
    shrink_inactive_list+0x21f/0x382
    ...

    shmem_writepage() calls igrab() on the inode for the page which came from
    page reclaim, to add it later into shmem_swaplist for swapoff operation.

    This igrab() can race with super-block deactivating process:

    shrink_inactive_list()             deactivate_super()
      pageout()                          tmpfs_fs_type->kill_sb()
        shmem_writepage()                  kill_litter_super()
                                             generic_shutdown_super()
                                               evict_inodes()
          igrab()
                                                 atomic_read(&inode->i_count)
                                                   skip-inode
          iput()
                                             if (!list_empty(&sb->s_inodes))
                                               printk("VFS: Busy inodes after...

    This igrab-iput pair was added in commit 1b1b32f2c6f6 "tmpfs: fix
    shmem_swaplist races" based on incorrect assumptions: igrab() protects the
    inode from concurrent eviction by deletion, but it does nothing to protect
    it from concurrent unmounting, which goes ahead despite the raised
    i_count.

    So this use of igrab() was wrong all along, but the race was made much
    worse in 2.6.37 when commit 63997e98a3be "split invalidate_inodes()"
    replaced two attempts at invalidate_inodes() by a single evict_inodes().

    Konstantin posted a plausible patch, raising sb->s_active too: I'm unsure
    whether it was correct or not; but burnt once by igrab(), I am sure that
    we don't want to rely more deeply upon externals here.

    Fix it by adding the inode to shmem_swaplist earlier, while the page
    lock on the page in the page cache still secures the inode against
    eviction, without artificially raising i_count. It was originally added
    later because shmem_unuse_inode() is liable to remove an inode from the
    list while it's unswapped; but we can guard against that by taking the
    spinlock before dropping the mutex.
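
    A sketch of the resulting order in shmem_writepage() (simplified): the
    spinlock is taken before the mutex is dropped, closing the window in
    which shmem_unuse_inode() could delete the entry.

    mutex_lock(&shmem_swaplist_mutex);
    if (list_empty(&info->swaplist))
            list_add_tail(&info->swaplist, &shmem_swaplist);
    spin_lock(&info->lock);         /* before the mutex goes away */
    mutex_unlock(&shmem_swaplist_mutex);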

    Reported-by: Konstantin Khlebnikov
    Signed-off-by: Hugh Dickins
    Tested-by: Konstantin Khlebnikov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Commit dde79e005a769 ("page_cgroup: reduce allocation overhead for
    page_cgroup array for CONFIG_SPARSEMEM") introduced a regression: the
    memory cgroup data structures all end up on node 0, because the first
    attempt at allocating them does not pass in a node hint. Since the
    initialization runs on CPU #0, it all ends up on node 0. This is a
    problem on large memory systems, where node 0 would lose a lot of
    memory.

    Change the alloc_pages_exact() to alloc_pages_exact_nid(). This will
    still fall back to other nodes if not enough memory is available.

    [ RED-PEN: right now it would fall back first before trying
    vmalloc_node. Probably not the best strategy ... But I left it like
    that for now. ]
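
    A sketch of the change in the page_cgroup allocation path (gfp flags
    illustrative), using the helper introduced by the companion patch
    below:

    /* was: alloc_pages_exact(table_size, flags) -- implicitly "here" */
    base = alloc_pages_exact_nid(nid, table_size,
                                 GFP_KERNEL | __GFP_NOWARN);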

    Signed-off-by: Andi Kleen
    Reported-by: Doug Nelson
    Cc: David Rientjes
    Reviewed-by: Michal Hocko
    Cc: Dave Hansen
    Acked-by: Balbir Singh
    Acked-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Add an alloc_pages_exact_nid() that allocates on a specific node.

    The naming is quite broken, but fixing that would need a larger renaming
    action.
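
    Roughly the shape of the helper, as a sketch modeled on
    alloc_pages_exact() (see mm/page_alloc.c for the real version):

    void *alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask)
    {
            unsigned order = get_order(size);
            struct page *p = alloc_pages_node(nid, gfp_mask, order);

            if (!p)
                    return NULL;
            /* trim the over-allocation down to exactly 'size' */
            return make_alloc_exact((unsigned long)page_address(p),
                                    order, size);
    }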

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: tweak comment]
    Signed-off-by: Andi Kleen
    Cc: Michal Hocko
    Cc: Balbir Singh
    Cc: KOSAKI Motohiro
    Cc: Dave Hansen
    Cc: David Rientjes
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Stefan found that nobootmem does not work on his system, which has
    only 8M of RAM. It causes an early panic:

    BIOS-provided physical RAM map:
    BIOS-88: 0000000000000000 - 000000000009f000 (usable)
    BIOS-88: 0000000000100000 - 0000000000840000 (usable)
    bootconsole [earlyser0] enabled
    Notice: NX (Execute Disable) protection missing in CPU or disabled in BIOS!
    DMI not present or invalid.
    last_pfn = 0x840 max_arch_pfn = 0x100000
    init_memory_mapping: 0000000000000000-0000000000840000
    8MB LOWMEM available.
    mapped low ram: 0 - 00840000
    low ram: 0 - 00840000
    Zone PFN ranges:
    DMA 0x00000001 -> 0x00001000
    Normal empty
    Movable zone start PFN for each node
    early_node_map[2] active PFN ranges
    0: 0x00000001 -> 0x0000009f
    0: 0x00000100 -> 0x00000840
    BUG: Int 6: CR2 (null)
    EDI c034663c ESI (null) EBP c0329f38 ESP c0329ef4
    EBX c0346380 EDX 00000006 ECX ffffffff EAX fffffff4
    err (null) EIP c0353191 CS c0320060 flg 00010082
    Stack: (null) c030c533 000007cd (null) c030c533 00000001 (null) (null)
    00000003 0000083f 00000018 00000002 00000002 c0329f6c c03534d6 (null)
    (null) 00000100 00000840 (null) c0329f64 00000001 00001000 (null)
    Pid: 0, comm: swapper Not tainted 2.6.36 #5
    Call Trace:
    [] ? 0xc02e3707
    [] 0xc035e6e5
    [] ? 0xc0353191
    [] 0xc03534d6
    [] 0xc034f1cd
    [] 0xc034a824
    [] ? 0xc03513cb
    [] 0xc0349432
    [] 0xc0349066

    It turns out that we should ignore the low limit of 16M.

    Use alloc_bootmem_node_nopanic() in this case.
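
    A sketch of the nopanic fallback (simplified from the nobootmem code
    of this era): if nothing is free above the 16M goal, retry with no
    lower limit instead of panicking.

    again:
            ptr = __alloc_memory_core_early(pgdat->node_id, size, align,
                                            goal, -1ULL);
            if (ptr)
                    return ptr;
            if (goal) {
                    goal = 0;       /* ignore the 16M lower limit */
                    goto again;
            }
            return NULL;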

    [akpm@linux-foundation.org: less mess]
    Signed-off-by: Yinghai LU
    Reported-by: Stefan Hellermann
    Tested-by: Stefan Hellermann
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Thomas Gleixner
    Cc: [2.6.34+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     
  • lru_deactivate_fn() should not move a page that is on the unevictable
    LRU onto the inactive list. Otherwise, we can hit a BUG when we use
    isolate_lru_pages(), as __isolate_lru_page() could return -EINVAL.
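
    A sketch of the added guard at the top of lru_deactivate_fn():

    static void lru_deactivate_fn(struct page *page, void *arg)
    {
            if (!PageLRU(page))
                    return;
            if (PageUnevictable(page))  /* never demote these */
                    return;
            /* ... move the page towards the inactive list ... */
    }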

    Reported-by: Ying Han
    Tested-by: Ying Han
    Signed-off-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

10 May, 2011

2 commits

  • Commit a626ca6a6564 ("vm: fix vm_pgoff wrap in stack expansion") fixed
    the case of an expanding mapping causing vm_pgoff wrapping when you had
    downward stack expansion. But there was another case where IA64 and
    PA-RISC expand mappings: upward expansion.

    This fixes that case too.
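
    A sketch of the guard in expand_upwards(), mirroring the grows-down
    check from a626ca6a6564 (simplified):

    error = -ENOMEM;
    /* refuse growth whose last page offset would wrap vm_pgoff */
    if (vma->vm_pgoff + (size >> PAGE_SHIFT) >= vma->vm_pgoff)
            error = acct_stack_growth(vma, size, grow);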

    Signed-off-by: Hugh Dickins
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The Linux kernel excludes the guard page when performing mlock on a
    VMA with a down-growing stack. However, some architectures have an
    up-growing stack, and locking the guard page should be excluded in
    this case too.

    This patch fixes lvm2 on PA-RISC (and possibly other architectures with
    up-growing stack). lvm2 calculates number of used pages when locking and
    when unlocking and reports an internal error if the numbers mismatch.

    [ Patch changed fairly extensively to also fix /proc/<pid>/maps for the
    grows-up case, and to move things around a bit to clean it all up and
    share the infrastructure with the /proc bits.

    Tested on ia64 that has both grow-up and grow-down segments - Linus ]
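
    A sketch of the grows-up counterpart to the existing grows-down guard
    page test (close to the helpers this patch adds to
    include/linux/mm.h):

    static inline int stack_guard_page_end(struct vm_area_struct *vma,
                                           unsigned long addr)
    {
            return (vma->vm_flags & VM_GROWSUP) &&
                    (vma->vm_end == addr) &&
                    !vma_growsup(vma->vm_next, addr);
    }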

    Signed-off-by: Mikulas Patocka
    Tested-by: Tony Luck
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
     

05 May, 2011

2 commits

  • The logic in __get_user_pages() used to skip the stack guard page lookup
    whenever the caller wasn't interested in seeing what the actual page
    was. But Michel Lespinasse points out that there are cases where we
    don't care about the physical page itself (so 'pages' may be NULL), but
    do want to make sure a page is mapped into the virtual address space.

    So using the existence of the "pages" array as an indication of whether
    to look up the guard page or not isn't actually so great, and we really
    should just use the FOLL_MLOCK bit. But because that bit was only set
    for the VM_LOCKED case (and not all vma's necessarily have it, even for
    mlock()), we couldn't do that originally.

    Fix that by moving the VM_LOCKED check deeper into the call-chain, which
    actually simplifies many things. Now mlock() gets simpler, and we can
    also check for FOLL_MLOCK in __get_user_pages() and the code ends up
    much more straightforward.
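
    A sketch of the resulting test in __get_user_pages()
    (stack_guard_page() being the local helper in mm/memory.c):

    /* only mlock wants to fault in the guard page specially */
    if ((gup_flags & FOLL_MLOCK) && stack_guard_page(vma, start))
            goto next_page;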

    Reported-and-reviewed-by: Michel Lespinasse
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • The SLUB allocator's use of the cmpxchg_double logic was wrong: it
    actually needs the irq-safe one.

    That happens automatically when we use the native unlocked 'cmpxchg8b'
    instruction, but when compiling the kernel for older x86 CPUs that do
    not support that instruction, we fall back to the generic emulation
    code.

    And if you don't specify that you want the irq-safe version, the generic
    code ends up just open-coding the cmpxchg8b equivalent without any
    protection against interrupts or preemption. Which definitely doesn't
    work for SLUB.

    This was reported by Werner Landgraf, who saw instability with his
    distro-kernel that was compiled to support pretty much everything
    under the sun. Most big Linux distributions tend to compile for PPro
    and later, and would never have noticed this problem.

    This also fixes the prototypes for the irqsafe cmpxchg_double functions
    to use 'bool' like they should.

    [ Btw, that whole "generic code defaults to no protection" design just
    sounds stupid - if the code needs no protection, there is no reason to
    use "cmpxchg_double" to begin with. So we should probably just remove
    the unprotected version entirely as pointless. - Linus ]
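
    A sketch of the corrected call shape in SLUB's allocation fast path
    (argument names illustrative):

    /* irq-safe variant: the generic fallback disables interrupts,
     * while the native cmpxchg8b/cmpxchg16b path needs no extra cost */
    if (unlikely(!irqsafe_cpu_cmpxchg_double(
                    s->cpu_slab->freelist, s->cpu_slab->tid,
                    object, tid,
                    get_freepointer(s, object), next_tid(tid))))
            goto redo;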

    Signed-off-by: Thomas Gleixner
    Reported-and-tested-by: werner
    Acked-and-tested-by: Ingo Molnar
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Jens Axboe
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1105041539050.3005@ionos
    Signed-off-by: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     

29 Apr, 2011

3 commits

  • With transparent hugepage support, handle_mm_fault() has to be careful
    that a normal PMD has been established before handling a PTE fault. To
    achieve this, it used __pte_alloc() directly instead of pte_alloc_map as
    pte_alloc_map is unsafe to run against a huge PMD. pte_offset_map() is
    called once it is known the PMD is safe.

    pte_alloc_map() is smart enough to check if a PTE is already present
    before calling __pte_alloc(), but this check was lost. As a
    consequence, PTEs may be allocated unnecessarily and the page table
    lock taken. This useless PTE does get cleaned up, but it's a
    performance hit which is visible in page_test from aim9.

    This patch simply re-adds the check normally done by pte_alloc_map()
    to see if the PTE needs to be allocated before taking the page table
    lock. The effect is noticeable in page_test from aim9.

    AIM9
                     2.6.38-vanilla  2.6.38-checkptenone
    creat-clo       446.10 ( 0.00%)    424.47 (-5.10%)
    page_test        38.10 ( 0.00%)     42.04 ( 9.37%)
    brk_test         52.45 ( 0.00%)     51.57 (-1.71%)
    exec_test       382.00 ( 0.00%)    456.90 (16.39%)
    fork_test        60.11 ( 0.00%)     67.79 (11.34%)
    MMTests Statistics: duration
    Total Elapsed Time (seconds)       611.90     612.22
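
    The re-added check, as a sketch (this is the small guard the patch
    restores in the fault path):

    if (unlikely(pmd_none(*pmd)) &&
        unlikely(__pte_alloc(mm, vma, pmd, address)))
            return VM_FAULT_OOM;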

    (While this affects 2.6.38, it is a performance rather than a
    functional bug, and normally outside the rules for -stable. While the
    big performance differences are on a microbenchmark, the difference in
    fork and exec performance may be significant enough that -stable wants
    to consider the patch.)

    Reported-by: Raz Ben Yehuda
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Reviewed-by: Andrea Arcangeli
    Reviewed-by: Minchan Kim
    Acked-by: Johannes Weiner
    Cc: [2.6.38.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • PTE pages eat up memory just like anything else, but we do not account for
    them in any way in the OOM scores. They are also _guaranteed_ to get
    freed up when a process is OOM killed, while RSS is not.
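
    A sketch of the accounting change in oom_badness() (simplified):
    page-table pages, tracked in mm->nr_ptes, count alongside RSS and
    swap.

    points = get_mm_rss(p->mm) + p->mm->nr_ptes;
    points += get_mm_counter(p->mm, MM_SWAPENTS);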

    Reported-by: Dave Hansen
    Signed-off-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Oleg Nesterov
    Acked-by: David Rientjes
    Cc: [2.6.36+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The huge_memory.c THP page fault was allowed to run if vm_ops was null
    (which would succeed for /dev/zero MAP_PRIVATE, as the f_op->mmap
    wouldn't set up a special vma->vm_ops and it would fall back to
    regular anonymous memory), but other THP logic wasn't fully activated
    for vmas with a non-NULL vm_file (/dev/zero has a non-NULL
    vma->vm_file).

    So this removes the vm_file checks so that /dev/zero also can safely use
    THP (the other albeit safer approach to fix this bug would have been to
    prevent the THP initial page fault to run if vm_file was set).

    After removing the vm_file checks, this also makes huge_memory.c
    stricter in khugepaged for the DEBUG_VM=y case. It doesn't replace the
    vm_file check with an is_pfn_mapping check (but it keeps checking for
    VM_PFNMAP under VM_BUG_ON) because for an is_cow_mapping() mapping,
    VM_PFNMAP should only be allowed to exist before the first page fault,
    and in turn when vma->anon_vma is null (so preventing khugepaged
    registration). So I tend to think the previous comment (saying that if
    vm_file was set, VM_PFNMAP might have been set and we could still be
    registered in khugepaged, even though anon_vma must be non-NULL for
    khugepaged registration) was too paranoid. The is_linear_pfn_mapping
    check is also, I think, superfluous (as described by the comment), but
    under DEBUG_VM it is safe to stay.

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=33682
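
    To illustrate the point only, a hypothetical predicate (not the
    upstream code): THP eligibility keys off vm_ops and the vm_flags, not
    vm_file.

    /* hypothetical helper, for illustration */
    static bool vma_thp_eligible(struct vm_area_struct *vma)
    {
            /* /dev/zero MAP_PRIVATE: vm_file != NULL but vm_ops == NULL,
             * so it behaves as anonymous memory and may use THP */
            return !vma->vm_ops &&
                   !(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP));
    }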

    Signed-off-by: Andrea Arcangeli
    Reported-by: Caspar Zhang
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: [2.6.38.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

15 Apr, 2011

10 commits

  • The conventional format for boolean attributes in sysfs is numeric ("0" or
    "1" followed by new-line). Any boolean attribute can then be read and
    written using a generic function. Using the strings "yes [no]", "[yes]
    no" (read), "yes" and "no" (write) will frustrate this.

    [akpm@linux-foundation.org: use kstrtoul()]
    [akpm@linux-foundation.org: test_bit() doesn't return 1/0, per Neil]
    Signed-off-by: Ben Hutchings
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Tested-by: David Rientjes
    Cc: NeilBrown
    Cc: [2.6.38.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Hutchings
     
  • This is an almost-revert of commit 93b43fa ("oom: give the dying task a
    higher priority").

    That commit dramatically improved oom killer logic when a fork-bomb
    occurs. But I've found that it has a nasty corner case: the cpu cgroup
    has a strange default RT runtime. It's 0! That means that if a process
    under a cpu cgroup is promoted to an RT scheduling class, the process
    never runs at all.

    If an admin inserts a !RT process into a cpu cgroup with rtruntime=0,
    it usually runs perfectly, because a !RT task isn't affected by the
    rtruntime knob. But if it's promoted to an RT task via an explicit
    setscheduler() syscall or an OOM, the task can't run at all. In short,
    the oom killer doesn't work at all if admins are using the cpu cgroup
    and don't touch the rtruntime knob.

    Eventually, the kernel may hang up when an oom kill occurs. The
    original author Luis and I agreed to disable this logic.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Luis Claudio R. Goncalves
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Acked-by: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The all_unreclaimable check in direct reclaim was introduced in 2.6.19
    by the following commit:

    2006 Sep 25; commit 408d8544; oom: use unreclaimable info

    And it went through a strange history. First, the following commit
    unintentionally broke the logic:

    2008 Apr 29; commit a41f24ea; page allocator: smarter retry of
    costly-order allocations

    Two years later, I found the obviously meaningless code fragment and
    restored the original intention with the following commit:

    2010 Jun 04; commit bb21c7ce; vmscan: fix do_try_to_free_pages()
    return value when priority==0

    But the logic didn't work when a 32bit highmem system went into
    hibernation, and Minchan slightly changed the algorithm and fixed it:

    2010 Sep 22: commit d1908362: vmscan: check all_unreclaimable
    in direct reclaim path

    But recently, Andrey Vagin found a new corner case. Look:

    struct zone {
            ..
            int all_unreclaimable;
            ..
            unsigned long pages_scanned;
            ..
    }

    zone->all_unreclaimable and zone->pages_scanned are neither atomic
    variables nor protected by a lock. Therefore a zone can reach a state
    of zone->pages_scanned=0 and zone->all_unreclaimable=1. In this case,
    the current all_unreclaimable() returns false even though
    zone->all_unreclaimable=1.

    This resulted in the kernel hanging up when executing a loop of the form

    1. fork
    2. mmap
    3. touch memory
    4. read memory
    5. munmmap

    as described in
    http://www.gossamer-threads.com/lists/linux/kernel/1348725#1348725

    Is this an ignorable minor issue? No. Unfortunately, x86 has a very
    small DMA zone and it easily becomes zone->all_unreclaimable=1. And
    once it becomes all_unreclaimable=1, it never returns to
    all_unreclaimable=0. Why? If all_unreclaimable=1, vmscan only tries
    DEF_PRIORITY reclaim, and a-few-lru-pages>>DEF_PRIORITY is always 0:
    no page scan happens at all!

    Eventually, the oom-killer never works on such systems. That said, we
    can't use zone->pages_scanned for this purpose. This patch restores
    all_unreclaimable()'s use of zone->all_unreclaimable as before, and in
    addition adds an oom_killer_disabled check to avoid reintroducing the
    issue of commit d1908362 ("vmscan: check all_unreclaimable in direct
    reclaim path").

    Reported-by: Andrey Vagin
    Signed-off-by: KOSAKI Motohiro
    Cc: Nick Piggin
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • In __access_remote_vm() we need to check that we have found the right
    vma, not the following vma before we try to access it. Otherwise we
    might call the vma's access routine with an address which does not fall
    inside the vma.
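
    The fix, as a sketch: find_vma() returns the first vma that ends above
    the address, so the start must be checked too.

    vma = find_vma(mm, addr);
    if (!vma || vma->vm_start > addr)
            break;  /* addr falls in a hole, not in this vma */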

    It was discovered on a current kernel, but with an unreleased driver;
    from memory, it was strace leading to a kernel bad access, but it
    obviously depends on what the access implementation does.

    Looking at other access implementations I only see:

    $ git grep -A 5 vm_operations|grep access
    arch/powerpc/platforms/cell/spufs/file.c- .access = spufs_mem_mmap_access,
    arch/x86/pci/i386.c- .access = generic_access_phys,
    drivers/char/mem.c- .access = generic_access_phys
    fs/sysfs/bin.c- .access = bin_access,

    The spufs one looks like it might behave badly given the wrong vma, it
    assumes vma->vm_file->private_data is a spu_context, and looks like it
    would probably blow up pretty quickly if it wasn't.

    generic_access_phys() only uses the vma to check vm_flags and get the
    mm, and then walks page tables using the address. So it should bail on
    the vm_flags check, or at worst let you access some other VM_IO mapping.

    And bin_access() just proxies to another access implementation.

    Signed-off-by: Michael Ellerman
    Reviewed-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Ellerman
     
  • 5520e89 ("brk: fix min_brk lower bound computation for COMPAT_BRK")
    tried to get the whole logic of brk randomization for legacy
    (libc5-based) applications finally right.

    It turns out that the way introduced by that patch to detect whether
    brk has actually been randomized in the end still doesn't work for
    those binaries, as reported by Geert:

    : /sbin/init from my old m68k ramdisk exits prematurely.
    :
    : Before the patch:
    :
    : | brk(0x80005c8e) = 0x80006000
    :
    : After the patch:
    :
    : | brk(0x80005c8e) = 0x80005c8e
    :
    : Old libc5 considers brk() to have failed if the return value is not
    : identical to the requested value.

    I don't like it, but currently see no better option than a bit flag in
    task_struct to catch the CONFIG_COMPAT_BRK && randomize_va_space == 2
    case.

    Signed-off-by: Jiri Kosina
    Tested-by: Geert Uytterhoeven
    Reported-by: Geert Uytterhoeven
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Kosina
     
  • If you fill up a tmpfs, df was showing

    tmpfs 460800 - - - /tmp

    because of an off-by-one in the max_blocks checks. Fix it so df shows

    tmpfs 460800 460800 0 100% /tmp
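
    A sketch of the corrected comparison (call sites simplified): reject
    once used_blocks has reached max_blocks, not only once it has
    exceeded it.

    if (percpu_counter_compare(&sbinfo->used_blocks,
                               sbinfo->max_blocks) >= 0)   /* was: > 0 */
            return -ENOSPC;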

    Signed-off-by: Hugh Dickins
    Cc: Tim Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • I found it difficult to make sense of transparent huge pages without
    having any counters for their actions. Add some counters to vmstat for
    allocation of transparent hugepages and fallback to smaller pages.

    Optional patch, but useful for development and understanding the system.
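
    A sketch of the instrumentation pattern in the huge fault path (event
    names as added by this patch; surrounding code simplified):

    if (unlikely(!page)) {
            count_vm_event(THP_FAULT_FALLBACK); /* fell back to 4k pages */
            goto out_fallback;
    }
    count_vm_event(THP_FAULT_ALLOC);            /* got a huge page */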

    Contains improvements from Andrea Arcangeli and Johannes Weiner

    [akpm@linux-foundation.org: coding-style fixes]
    [hannes@cmpxchg.org: fix vmstat_text[] entries]
    Signed-off-by: Andi Kleen
    Acked-by: Andrea Arcangeli
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Signed-off-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • The memory hotplug case involves calling build_all_zonelists(), which
    in turn calls into setup_zone_pageset(). The latter is marked
    __meminit while build_all_zonelists() itself has no particular
    annotation. build_all_zonelists() is only handed a non-NULL pointer in
    the case of memory hotplug through an existing __meminit path, so the
    setup_zone_pageset() reference is always safe.

    The options as such are either to flag build_all_zonelists() as __ref (as
    per __build_all_zonelists()), or to simply discard the __meminit
    annotation from setup_zone_pageset().

    Signed-off-by: Paul Mundt
    Acked-by: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Mundt
     
  • If CONFIG_FLATMEM is enabled, the pfn is calculated in online_page()
    more than once. It is possible to optimize that and use the value
    established at the beginning of the function.

    Signed-off-by: Daniel Kiper
    Acked-by: Dave Hansen
    Cc: KAMEZAWA Hiroyuki
    Cc: Wu Fengguang
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Acked-by: David Rientjes
    Reviewed-by: Jesper Juhl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Kiper
     

13 Apr, 2011

2 commits

  • Commit 982134ba6261 ("mm: avoid wrapping vm_pgoff in mremap()") fixed
    the case of an expanding mapping causing vm_pgoff wrapping when you
    used mremap. But there was another case where we expand mappings
    hiding in plain sight: the automatic stack expansion.

    This fixes that case too.

    This one was also found by Robert Święcki, using his nasty system call
    fuzzer tool. Good job.

    Reported-and-tested-by: Robert Święcki
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Commit 53a7706d5ed8 ("mlock: do not hold mmap_sem for extended periods
    of time") changed mlock() to care about the exact number of pages that
    __get_user_pages() had brought it. Before, it would only care about
    errors.

    And that doesn't work, because we also handled one page specially in
    __mlock_vma_pages_range(), namely the stack guard page. So when that
    case was handled, the number of pages that the function returned was off
    by one. In particular, it could be zero, and then the caller would end
    up not making any progress at all.

    Rather than try to fix up that off-by-one error for the mlock case
    specially, this just moves the logic to handle the stack guard page
    into __get_user_pages() itself, thus making all the counts come out
    right automatically.

    Reported-by: Robert Święcki
    Cc: Hugh Dickins
    Cc: Oleg Nesterov
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

07 Apr, 2011

1 commit

  • The normal mmap paths all avoid creating a mapping where the pgoff
    inside the mapping could wrap around due to overflow. However, an
    expanding mremap() can take such a non-wrapping mapping and make it
    bigger and cause a wrapping condition.

    Noticed by Robert Swiecki when running a system call fuzzer, where it
    caused a BUG_ON() due to terminally confusing the vma_prio_tree code. A
    vma dumping patch by Hugh then pinpointed the crazy wrapped case.
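
    A sketch of the added check on a growing mremap() (close to the
    upstream diff in vma_to_resize()):

    if (new_len > old_len) {
            unsigned long pgoff;

            pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
            pgoff += vma->vm_pgoff;
            if (pgoff + (new_len >> PAGE_SHIFT) < pgoff)
                    return ERR_PTR(-EINVAL);    /* would wrap */
    }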

    Reported-and-tested-by: Robert Swiecki
    Acked-by: Hugh Dickins
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

30 Mar, 2011

1 commit

  • * 'frv' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-frv:
    FRV: Use generic show_interrupts()
    FRV: Convert genirq namespace
    frv: Select GENERIC_HARDIRQS_NO_DEPRECATED
    frv: Convert cpu irq_chip to new functions
    frv: Convert mb93493 irq_chip to new functions
    frv: Convert mb93093 irq_chip to new function
    frv: Convert mb93091 irq_chip to new functions
    frv: Fix typo from __do_IRQ overhaul
    frv: Remove stale irq_chip.end
    FRV: Do some cleanups
    FRV: Missing node arg in alloc_thread_info_node() macro
    NOMMU: implement access_remote_vm
    NOMMU: support SMP dynamic percpu_alloc
    NOMMU: percpu should use is_vmalloc_addr().

    Linus Torvalds
     

29 Mar, 2011

1 commit

  • Recent vm changes brought in a new function which the core procfs code
    utilizes. So implement it for nommu systems too to avoid link failures.
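
    On nommu the implementation can be a thin wrapper; a sketch (the
    signature matches the MMU version in mm/memory.c):

    int access_remote_vm(struct mm_struct *mm, unsigned long addr,
                         void *buf, int len, int write)
    {
            return __access_remote_vm(NULL, mm, addr, buf, len, write);
    }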

    Signed-off-by: Mike Frysinger
    Signed-off-by: David Howells
    Tested-by: Simon Horman
    Tested-by: Ithamar Adema
    Acked-by: Greg Ungerer

    Mike Frysinger
     

28 Mar, 2011

2 commits

  • per_cpu_ptr_to_phys() uses VMALLOC_START and VMALLOC_END to determine if an
    address is in the vmalloc() region or not. This is incorrect on NOMMU as
    there is no real vmalloc() capability (vmalloc() is emulated by kmalloc()).

    The correct way to do this is to use is_vmalloc_addr(). This encapsulates the
    vmalloc() region test in MMU mode and just returns 0 in NOMMU mode.

    On FRV in NOMMU mode, the percpu compilation fails without this patch:

    mm/percpu.c: In function 'per_cpu_ptr_to_phys':
    mm/percpu.c:1011: error: 'VMALLOC_START' undeclared (first use in this function)
    mm/percpu.c:1011: error: (Each undeclared identifier is reported only once
    mm/percpu.c:1011: error: for each function it appears in.)
    mm/percpu.c:1012: error: 'VMALLOC_END' undeclared (first use in this function)
    mm/percpu.c:1018: warning: control reaches end of non-void function
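
    A sketch of the portable test in per_cpu_ptr_to_phys() (simplified):
    is_vmalloc_addr() compiles on both MMU and NOMMU, and simply returns
    false on the latter.

    if (!is_vmalloc_addr(addr))
            return __pa(addr);                      /* linear mapping */
    return page_to_phys(vmalloc_to_page(addr));     /* vmalloc area */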

    Signed-off-by: David Howells

    David Howells
     
  • Fix incorrect kernel-doc function notation in mm/memory.c:

    Warning(mm/memory.c:3718): Cannot understand * @access_remote_vm - access another process' address space
    on line 3718 - I thought it was a doc line
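
    kernel-doc wants the summary line as "name - description", with the
    "@" prefix reserved for parameters; a conforming header looks like:

    /**
     * access_remote_vm - access another process' address space
     * @mm:         the mm_struct of the target address space
     * @addr:       start address to access
     */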

    Signed-off-by: Randy Dunlap
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

25 Mar, 2011

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    fs: simplify iget & friends
    fs: pull inode->i_lock up out of writeback_single_inode
    fs: rename inode_lock to inode_hash_lock
    fs: move i_wb_list out from under inode_lock
    fs: move i_sb_list out from under inode_lock
    fs: remove inode_lock from iput_final and prune_icache
    fs: Lock the inode LRU list separately
    fs: factor inode disposal
    fs: protect inode->i_state with inode->i_lock
    autofs4: Do not potentially dereference NULL pointer returned by fget() in autofs_dev_ioctl_setpipefd()
    autofs4 - remove autofs4_lock
    autofs4 - fix d_manage() return on rcu-walk
    autofs4 - fix autofs4_expire_indirect() traversal
    autofs4 - fix dentry leak in autofs4_expire_direct()
    autofs4 - reinstate last used update on access
    vfs - check non-mountpoint dentry might block in __follow_mount_rcu()

    Linus Torvalds