13 Oct, 2012

3 commits

  • Pull third pile of VFS updates from Al Viro:
    "Stuff from Jeff Layton, mostly. Sanitizing interplay between audit
    and namei, removing a lot of insanity from audit_inode() mess and
    getting things ready for his ESTALE patchset."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    procfs: don't need a PATH_MAX allocation to hold a string representation of an int
    vfs: embed struct filename inside of names_cache allocation if possible
    audit: make audit_inode take struct filename
    vfs: make path_openat take a struct filename pointer
    vfs: turn do_path_lookup into wrapper around struct filename variant
    audit: allow audit code to satisfy getname requests from its names_list
    vfs: define struct filename and have getname() return it
    vfs: unexport getname and putname symbols
    acct: constify the name arg to acct_on
    vfs: allocate page instead of names_cache buffer in mount_block_root
    audit: overhaul __audit_inode_child to accomodate retrying
    audit: optimize audit_compare_dname_path
    audit: make audit_compare_dname_path use parent_len helper
    audit: remove dirlen argument to audit_compare_dname_path
    audit: set the name_len in audit_inode for parent lookups
    audit: add a new "type" field to audit_names struct
    audit: reverse arguments to audit_inode_child
    audit: no need to walk list in audit_inode if name is NULL
    audit: pass in dentry to audit_copy_inode wherever possible
    audit: remove unnecessary NULL ptr checks from do_path_lookup

    Linus Torvalds
     
  • ...and fix up the callers. For do_file_open_root, just declare a
    struct filename on the stack and fill out the .name field. For
    do_filp_open, make it also take a struct filename pointer, and fix up its
    callers to call it appropriately.

    For filp_open, add a variant that takes a struct filename pointer and turn
    filp_open into a wrapper around it.
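
    A minimal sketch of the resulting filp_open() wrapper (assuming the
    struct-filename variant is named file_open_name(), as in the final tree;
    details may differ):

    struct file *filp_open(const char *filename, int flags, umode_t mode)
    {
            /* wrap the bare string in an on-stack struct filename */
            struct filename name = {.name = filename};

            return file_open_name(&name, flags, mode);
    }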

    Signed-off-by: Jeff Layton
    Signed-off-by: Al Viro

    Jeff Layton
     
  • getname() is intended to copy pathname strings from userspace into a
    kernel buffer. The result is just a string in kernel space. It would
    however be quite helpful to be able to attach some ancillary info to
    the string.

    For instance, we could attach some audit-related info to reduce the
    amount of audit-related processing needed. When auditing is enabled,
    we could also call getname() on the string more than once and not
    need to recopy it from userspace.

    This patchset converts the getname()/putname() interfaces to return
    a struct instead of a string. For now, the struct just tracks the
    string in kernel space and the original userland pointer for it.

    Later, we'll add other information to the struct as it becomes
    convenient.
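
    For reference, the shape of the structure as introduced here is roughly
    the following (a sketch; the field comments are ours):

    struct filename {
            const char              *name;  /* pointer to the kernel-space copy */
            const char __user       *uptr;  /* original userland pointer */
    };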

    Signed-off-by: Jeff Layton
    Signed-off-by: Al Viro

    Jeff Layton
     

12 Oct, 2012

3 commits

  • Pull SLAB fix from Pekka Enberg:
    "This contains a lockdep false positive fix from Jiri Kosina I missed
    from the previous pull request."

    * 'slab/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
    mm, slab: release slab_mutex earlier in kmem_cache_destroy()

    Linus Torvalds
     
  • Pull pile 2 of vfs updates from Al Viro:
    "Stuff in this one - assorted fixes, lglock tidy-up, death to
    lock_super().

    There'll be a VFS pile tomorrow (with patches from Jeff Layton,
    sanitizing getname() and related parts of audit and preparing for
    ESTALE fixes), but I'd rather push the stuff in this one ASAP - some
    of the bugs closed here are quite unpleasant."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    vfs: bogus warnings in fs/namei.c
    consitify do_mount() arguments
    lglock: add DEFINE_STATIC_LGLOCK()
    lglock: make the per_cpu locks static
    lglock: remove unused DEFINE_LGLOCK_LOCKDEP()
    MAX_LFS_FILESIZE definition for 64bit needs LL...
    tmpfs,ceph,gfs2,isofs,reiserfs,xfs: fix fh_len checking
    vfs: drop lock/unlock super
    ufs: drop lock/unlock super
    sysv: drop lock/unlock super
    hpfs: drop lock/unlock super
    fat: drop lock/unlock super
    ext3: drop lock/unlock super
    exofs: drop lock/unlock super
    dup3: Return an error when oldfd == newfd.
    fs: handle failed audit_log_start properly
    fs: prevent use after free in auditing when symlink following was denied

    Linus Torvalds
     
  • Pull writeback fixes from Fengguang Wu:
    "Three trivial writeback fixes"

    * 'writeback-for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    CPU hotplug, writeback: Don't call writeback_set_ratelimit() too often during hotplug
    writeback: correct comment for move_expired_inodes()
    backing-dev: use kstrto* in preference to simple_strtoul

    Linus Torvalds
     

10 Oct, 2012

2 commits

  • Commit 1331e7a1bbe1 ("rcu: Remove _rcu_barrier() dependency on
    __stop_machine()") introduced slab_mutex -> cpu_hotplug.lock dependency
    through kmem_cache_destroy() -> rcu_barrier() -> _rcu_barrier() ->
    get_online_cpus().

    Lockdep thinks that this might actually result in ABBA deadlock,
    and reports it as below:

    === [ cut here ] ===
    ======================================================
    [ INFO: possible circular locking dependency detected ]
    3.6.0-rc5-00004-g0d8ee37 #143 Not tainted
    -------------------------------------------------------
    kworker/u:2/40 is trying to acquire lock:
    (rcu_sched_state.barrier_mutex){+.+...}, at: [] _rcu_barrier+0x26/0x1e0

    but task is already holding lock:
    (slab_mutex){+.+.+.}, at: [] kmem_cache_destroy+0x45/0xe0

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #2 (slab_mutex){+.+.+.}:
    [] validate_chain+0x632/0x720
    [] __lock_acquire+0x309/0x530
    [] lock_acquire+0x121/0x190
    [] __mutex_lock_common+0x5c/0x450
    [] mutex_lock_nested+0x3e/0x50
    [] cpuup_callback+0x2f/0xbe
    [] notifier_call_chain+0x93/0x140
    [] __raw_notifier_call_chain+0x9/0x10
    [] _cpu_up+0xba/0x14e
    [] cpu_up+0xbc/0x117
    [] smp_init+0x6b/0x9f
    [] kernel_init+0x147/0x1dc
    [] kernel_thread_helper+0x4/0x10

    -> #1 (cpu_hotplug.lock){+.+.+.}:
    [] validate_chain+0x632/0x720
    [] __lock_acquire+0x309/0x530
    [] lock_acquire+0x121/0x190
    [] __mutex_lock_common+0x5c/0x450
    [] mutex_lock_nested+0x3e/0x50
    [] get_online_cpus+0x37/0x50
    [] _rcu_barrier+0xbb/0x1e0
    [] rcu_barrier_sched+0x10/0x20
    [] rcu_barrier+0x9/0x10
    [] deactivate_locked_super+0x49/0x90
    [] deactivate_super+0x61/0x70
    [] mntput_no_expire+0x127/0x180
    [] sys_umount+0x6e/0xd0
    [] system_call_fastpath+0x16/0x1b

    -> #0 (rcu_sched_state.barrier_mutex){+.+...}:
    [] check_prev_add+0x3de/0x440
    [] validate_chain+0x632/0x720
    [] __lock_acquire+0x309/0x530
    [] lock_acquire+0x121/0x190
    [] __mutex_lock_common+0x5c/0x450
    [] mutex_lock_nested+0x3e/0x50
    [] _rcu_barrier+0x26/0x1e0
    [] rcu_barrier_sched+0x10/0x20
    [] rcu_barrier+0x9/0x10
    [] kmem_cache_destroy+0xd1/0xe0
    [] nf_conntrack_cleanup_net+0xe4/0x110 [nf_conntrack]
    [] nf_conntrack_cleanup+0x2a/0x70 [nf_conntrack]
    [] nf_conntrack_net_exit+0x5e/0x80 [nf_conntrack]
    [] ops_exit_list+0x39/0x60
    [] cleanup_net+0xfb/0x1b0
    [] process_one_work+0x26b/0x4c0
    [] worker_thread+0x12e/0x320
    [] kthread+0x9e/0xb0
    [] kernel_thread_helper+0x4/0x10

    other info that might help us debug this:

    Chain exists of:
    rcu_sched_state.barrier_mutex --> cpu_hotplug.lock --> slab_mutex

    Possible unsafe locking scenario:

           CPU0                           CPU1
           ----                           ----
    lock(slab_mutex);
                                   lock(cpu_hotplug.lock);
                                   lock(slab_mutex);
    lock(rcu_sched_state.barrier_mutex);

    *** DEADLOCK ***
    === [ cut here ] ===

    This is actually a false positive. Lockdep has no way of knowing the fact
    that the ABBA can actually never happen, because of special semantics of
    cpu_hotplug.refcount and its handling in cpu_hotplug_begin(); the mutual
    exclusion there is not achieved through mutex, but through
    cpu_hotplug.refcount.

    The "neither cpu_up() nor cpu_down() will proceed past cpu_hotplug_begin()
    until everyone who called get_online_cpus() will call put_online_cpus()"
    semantics is totally invisible to lockdep.

    This patch therefore moves the unlock of slab_mutex so that rcu_barrier()
    is called with it unlocked. It has two advantages:

    - it slightly reduces hold time of slab_mutex; as it's used to protect
    the cachep list, it's not necessary to hold it over kmem_cache_free()
    call any more
    - it silences the lockdep false positive warning, as it avoids lockdep ever
    learning about slab_mutex -> cpu_hotplug.lock dependency
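
    A simplified sketch of the reordering (not the literal mm/slab.c code;
    helper steps are elided):

    void kmem_cache_destroy(struct kmem_cache *cachep)
    {
            get_online_cpus();
            mutex_lock(&slab_mutex);
            /* ... drop the last reference and unlink cachep from the list ... */
            mutex_unlock(&slab_mutex);      /* released *before* rcu_barrier() */

            if (cachep->flags & SLAB_DESTROY_BY_RCU)
                    rcu_barrier();          /* no slab_mutex -> cpu_hotplug.lock edge */

            /* ... free the kmem_cache structure itself ... */
            put_online_cpus();
    }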

    Reviewed-by: Paul E. McKenney
    Reviewed-by: Srivatsa S. Bhat
    Acked-by: David Rientjes
    Signed-off-by: Jiri Kosina
    Signed-off-by: Pekka Enberg

    Jiri Kosina
     
  • Fuzzing with trinity oopsed on the 1st instruction of shmem_fh_to_dentry(),
    u64 inum = fid->raw[2];
    which is unhelpfully reported as at the end of shmem_alloc_inode():

    BUG: unable to handle kernel paging request at ffff880061cd3000
    IP: [] shmem_alloc_inode+0x40/0x40
    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Call Trace:
    [] ? exportfs_decode_fh+0x79/0x2d0
    [] do_handle_open+0x163/0x2c0
    [] sys_open_by_handle_at+0xc/0x10
    [] tracesys+0xe1/0xe6

    Right, tmpfs is being stupid to access fid->raw[2] before validating that
    fh_len includes it: the buffer kmalloc'ed by do_sys_name_to_handle() may
    fall at the end of a page, and the next page not be present.

    But some other filesystems (ceph, gfs2, isofs, reiserfs, xfs) are being
    careless about fh_len too, in fh_to_dentry() and/or fh_to_parent(), and
    could oops in the same way: add the missing fh_len checks to those.
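
    The added guard has roughly this shape (illustrative sketch based on the
    tmpfs handler; the other filesystems get equivalent checks):

    static struct dentry *shmem_fh_to_dentry(struct super_block *sb,
                    struct fid *fid, int fh_len, int fh_type)
    {
            u64 inum;

            if (fh_len < 3)         /* don't touch raw[2] unless the handle has it */
                    return NULL;

            inum = fid->raw[2];
            inum = (inum << 32) | fid->raw[1];
            /* ... inode lookup as before ... */
    }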

    Reported-by: Sasha Levin
    Signed-off-by: Hugh Dickins
    Cc: Al Viro
    Cc: Sage Weil
    Cc: Steven Whitehouse
    Cc: Christoph Hellwig
    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Hugh Dickins
     

09 Oct, 2012

32 commits

  • Invalidation sequences are handled in various ways on various
    architectures.

    One way, which sparc64 uses, is to let the set_*_at() functions accumulate
    pending flushes into a per-cpu array. Then the flush_tlb_range() et al.
    calls process the pending TLB flushes.

    In this regime, the __tlb_remove_*tlb_entry() implementations are
    essentially NOPs.

    The canonical PTE zap in mm/memory.c is:

    ptent = ptep_get_and_clear_full(mm, addr, pte,
                                    tlb->fullmm);
    tlb_remove_tlb_entry(tlb, pte, addr);

    With a subsequent tlb_flush_mmu() if needed.

    Mirror this in the THP PMD zapping using:

    orig_pmd = pmdp_get_and_clear(tlb->mm, addr, pmd);
    page = pmd_page(orig_pmd);
    tlb_remove_pmd_tlb_entry(tlb, pmd, addr);

    This properly accommodates TLB flush mechanisms like the one described
    above.

    Signed-off-by: David S. Miller
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Gerald Schaefer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Miller
     
  • The transparent huge page code passes a PMD pointer in as the third
    argument of update_mmu_cache(), which expects a PTE pointer.

    This never got noticed because X86 implements update_mmu_cache() as a
    macro and thus we don't get any type checking, and X86 is the only
    architecture which supports transparent huge pages currently.

    Before other architectures can support transparent huge pages properly we
    need to add a new interface which will take a PMD pointer as the third
    argument rather than a PTE pointer.
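
    On architectures with nothing to do here, the new hook can be a no-op; a
    sketch of what the x86 side looks like (an assumption, mirroring the
    existing update_mmu_cache() macro):

    /* PMD-pointer variant of the cache hook; x86 keeps it a no-op */
    #define update_mmu_cache_pmd(vma, addr, pmd) do { } while (0)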

    [akpm@linux-foundation.org: implement update_mm_cache_pmd() for s390]
    Signed-off-by: David S. Miller
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Gerald Schaefer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Miller
     
  • …YYYYYYYYYYYYYYYY>" warning

    When our x86 box calls __remove_pages(), release_mem_region() shows many
    warnings, and the box cannot unregister iomem_resource:

    "Trying to free nonexistent resource <XXXXXXXXXXXXXXXX-YYYYYYYYYYYYYYYY>"

    release_mem_region() has been changed to be called in PAGES_PER_SECTION
    chunks by commit de7f0cba9678 ("memory hotplug: release memory regions in
    PAGES_PER_SECTION chunks"), because powerpc registers iomem_resource in
    PAGES_PER_SECTION chunks. But when I hot add memory on an x86 box,
    iomem_resource is registered per _CRS, not per PAGES_PER_SECTION chunk, so
    the x86 box fails to unregister iomem_resource.

    The patch fixes the problem.

    Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Jiang Liu <liuj97@gmail.com>
    Cc: Len Brown <len.brown@intel.com>
    Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
    Cc: Paul Mackerras <paulus@samba.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Cc: Wen Congyang <wency@cn.fujitsu.com>
    Cc: Dave Hansen <dave@linux.vnet.ibm.com>
    Cc: Nathan Fontenot <nfont@austin.ibm.com>
    Cc: Badari Pulavarty <pbadari@us.ibm.com>
    Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Yasuaki Ishimatsu
     
  • Acked-by: David Rientjes
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • In the paranoid case of sysctl kernel.kptr_restrict=2, mask the kernel
    virtual addresses in /proc/vmallocinfo too.
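
    The usual way to honour kptr_restrict in a seq_file is to print the
    addresses with %pK instead of %p; a sketch of the affected line in
    s_show() (illustrative):

    seq_printf(m, "0x%pK-0x%pK %7ld",
               v->addr, v->addr + v->size, v->size);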

    Signed-off-by: Kees Cook
    Reported-by: Brad Spengler
    Acked-by: KOSAKI Motohiro
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • NR_MLOCK is only accounted in single page units: there's no logic to
    handle transparent hugepages. This patch checks the appropriate number of
    pages to adjust the statistics by so that the correct amount of memory is
    reflected.

    Currently:

    $ grep Mlocked /proc/meminfo
    Mlocked: 19636 kB

    #define MAP_SIZE (4 << 30) /* 4GB */

    void *ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
    mlock(ptr, MAP_SIZE);

    $ grep Mlocked /proc/meminfo
    Mlocked: 29844 kB

    munlock(ptr, MAP_SIZE);

    $ grep Mlocked /proc/meminfo
    Mlocked: 19636 kB

    And with this patch:

    $ grep Mlock /proc/meminfo
    Mlocked: 19636 kB

    mlock(ptr, MAP_SIZE);

    $ grep Mlock /proc/meminfo
    Mlocked: 4213664 kB

    munlock(ptr, MAP_SIZE);

    $ grep Mlock /proc/meminfo
    Mlocked: 19636 kB
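
    The accounting fix itself boils down to adjusting NR_MLOCK by the number
    of base pages a (possibly huge) page covers, rather than by one; a sketch
    of the mlock side (simplified):

    if (!TestSetPageMlocked(page))
            mod_zone_page_state(page_zone(page), NR_MLOCK,
                                hpage_nr_pages(page));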

    Signed-off-by: David Rientjes
    Reported-by: Hugh Dickins
    Acked-by: Hugh Dickins
    Reviewed-by: Andrea Arcangeli
    Cc: Naoya Horiguchi
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Reviewed-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • When a transparent hugepage is mapped and it is included in an mlock()
    range, follow_page() incorrectly avoids setting the page's mlock bit and
    moving it to the unevictable lru.

    This is evident if you try to mlock(), munlock(), and then mlock() a
    range again. Currently:

    #define MAP_SIZE (4 << 30) /* 4GB */

    void *ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
    mlock(ptr, MAP_SIZE);

    $ grep -E "Unevictable|Inactive\(anon" /proc/meminfo
    Inactive(anon): 6304 kB
    Unevictable: 4213924 kB

    munlock(ptr, MAP_SIZE);

    Inactive(anon): 4186252 kB
    Unevictable: 19652 kB

    mlock(ptr, MAP_SIZE);

    Inactive(anon): 4198556 kB
    Unevictable: 21684 kB

    Notice that less than 2MB was added to the unevictable list; this is
    because these pages in the range are not transparent hugepages since the
    4GB range was allocated with mmap() and has no specific alignment. If
    posix_memalign() were used instead, unevictable would not have grown at
    all on the second mlock().

    The fix is to call mlock_vma_page() so that the mlock bit is set and the
    page is added to the unevictable list. With this patch:

    mlock(ptr, MAP_SIZE);

    Inactive(anon): 4056 kB
    Unevictable: 4213940 kB

    munlock(ptr, MAP_SIZE);

    Inactive(anon): 4198268 kB
    Unevictable: 19636 kB

    mlock(ptr, MAP_SIZE);

    Inactive(anon): 4008 kB
    Unevictable: 4213940 kB
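
    A sketch of where the fix lands, in the THP branch of follow_page(),
    mirroring what the normal pte path already does (details may differ from
    the final hunk):

    if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
            /* trylock: don't deadlock against concurrent truncation */
            if (page->mapping && trylock_page(page)) {
                    lru_add_drain();        /* push cached pages onto the LRU */
                    if (page->mapping)
                            mlock_vma_page(page);
                    unlock_page(page);
            }
    }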

    Signed-off-by: David Rientjes
    Acked-by: Hugh Dickins
    Reviewed-by: Andrea Arcangeli
    Cc: Naoya Horiguchi
    Cc: KAMEZAWA Hiroyuki
    Cc: Johannes Weiner
    Cc: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
    remove_memory() will be called when hot removing a memory device. But
    even when memory is offlined this way, userspace is not notified. So the
    patch updates the memory block's state and sends a notification to
    userspace.

    Additionally, the memory device may contain more than one memory block.
    If a memory block has already been offlined, __offline_pages() will fail,
    so we should try to offline one memory block at a time.

    Thus remove_memory() also checks each memory block's state, so there is
    no need to check a memory block's state before calling remove_memory().

    Signed-off-by: Wen Congyang
    Signed-off-by: Yasuaki Ishimatsu
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Len Brown
    Cc: Christoph Lameter
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     
  • remove_memory() is called in two cases:
    1. echo offline >/sys/devices/system/memory/memoryXX/state
    2. hot remove a memory device

    In the 1st case, the memory block's state is changed and a notification
    that the state changed is sent to userland after calling remove_memory(),
    so the user can notice that the memory block changed.

    But in the 2nd case, the memory block's state is not changed and no
    notification is sent to userspace even though remove_memory() is called,
    so the user cannot notice that the memory block changed.

    To prepare for adding the notification at memory hot remove, the patch
    splits the two cases as follows:
    The 1st case uses offline_pages() to offline memory.
    The 2nd case uses remove_memory() to offline memory, change the memory
    block's state and send the notification.

    This patch does not yet implement the notification in remove_memory().

    Signed-off-by: Wen Congyang
    Signed-off-by: Yasuaki Ishimatsu
    Cc: David Rientjes
    Cc: Jiang Liu
    Cc: Len Brown
    Cc: Christoph Lameter
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wen Congyang
     
    The following section mismatch warning is thrown during the build:

    WARNING: vmlinux.o(.text+0x32408f): Section mismatch in reference from the function memblock_type_name() to the variable .meminit.data:memblock
    The function memblock_type_name() references
    the variable __meminitdata memblock.
    This is often because memblock_type_name lacks a __meminitdata
    annotation or the annotation of memblock is wrong.

    This is because memblock_type_name() references the memblock variable,
    which has the __meminitdata attribute; hence the warning (even though the
    function is inline).

    [akpm@linux-foundation.org: remove inline]
    Signed-off-by: Raghavendra D Prabhu
    Cc: Tejun Heo
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Raghavendra D Prabhu
     
    reclaim_clean_pages_from_list() reclaims clean pages before migration, so
    cc.nr_migratepages should be updated. Currently this causes no problem,
    but it could go wrong if we try to use the value in the future.

    Signed-off-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Marek Szyprowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    Presently CMA cannot migrate mlocked pages, so it ends up failing to
    allocate contiguous memory space.

    This patch allows mlocked pages to be migrated out. Of course, this can
    affect realtime processes, but in the CMA use case, failing a contiguous
    memory allocation is far worse than variable access latency to an mlocked
    page while CMA is running. If someone wants to make the system realtime,
    he shouldn't enable CMA, because stalls can still happen at random times.

    [akpm@linux-foundation.org: tweak comment text, per Mel]
    Signed-off-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Marek Szyprowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Signed-off-by: Robert P. J. Day
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robert P. J. Day
     
  • Simply remove UNEVICTABLE_MLOCKFREED and unevictable_pgs_mlockfreed line
    from /proc/vmstat: Johannes and Mel point out that it was very unlikely to
    have been used by any tool, and of course we can restore it easily enough
    if that turns out to be wrong.

    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michel Lespinasse
    Cc: Ying Han
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
    During memory hotplug, I found NR_ISOLATED_[ANON|FILE] increasing, causing
    the kernel to hang: when the system doesn't have enough free pages, it
    enters reclaim but never reclaims any pages, because too_many_isolated()
    stays true, and loops forever.

    The cause is that when we do memory hot-add after memory remove,
    __zone_pcp_update() clears a zone's ZONE_STAT_ITEMS in setup_pageset()
    although the vm_stat_diff of all CPUs still have values.

    In addition, when we offline all pages of the zone, we reset them in
    zone_pcp_reset() without draining, so we lose some zone stat items.

    Reviewed-by: Wen Congyang
    Signed-off-by: Minchan Kim
    Cc: Kamezawa Hiroyuki
    Cc: Yasuaki Ishimatsu
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    Revert commit 0def08e3acc2 because check_range() can't fail in
    migrate_to_node() considering current use cases.

    Quote from Johannes

    : I think it makes sense to revert. Not because of the semantics, but I
    : just don't see how check_range() could even fail for this callsite:
    :
    : 1. we pass mm->mmap->vm_start in there, so we should not fail due to
    : find_vma()
    :
    : 2. we pass MPOL_MF_DISCONTIG_OK, so the discontig checks do not apply
    : and so can not fail
    :
    : 3. we pass MPOL_MF_MOVE | MPOL_MF_MOVE_ALL, the page table loops will
    : continue until addr == end, so we never fail with -EIO

    A new VM_BUG_ON is also added to catch any future migrate_to_node() use
    case that might pass MPOL_MF_STRICT.

    Suggested-by: Johannes Weiner
    Signed-off-by: Minchan Kim
    Acked-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Vasiliy Kulikov
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • In order to allow sleeping during invalidate_page mmu notifier calls, we
    need to avoid calling when holding the PT lock. In addition to its direct
    calls, invalidate_page can also be called as a substitute for a change_pte
    call, in case the notifier client hasn't implemented change_pte.

    This patch drops the invalidate_page call from change_pte, and instead
    wraps all calls to change_pte with invalidate_range_start and
    invalidate_range_end calls.

    Note that change_pte still cannot sleep after this patch, and that clients
    implementing change_pte should not take action on it in case the number of
    outstanding invalidate_range_start calls is larger than one, otherwise
    they might miss a later invalidation.
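
    Concretely, a change_pte site now looks roughly like this (sketch):

    mmu_notifier_invalidate_range_start(mm, address, address + PAGE_SIZE);
    ...
    set_pte_at_notify(mm, address, page_table, entry); /* may fire ->change_pte */
    ...
    mmu_notifier_invalidate_range_end(mm, address, address + PAGE_SIZE);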

    Signed-off-by: Haggai Eran
    Cc: Andrea Arcangeli
    Cc: Sagi Grimberg
    Cc: Peter Zijlstra
    Cc: Xiao Guangrong
    Cc: Or Gerlitz
    Cc: Haggai Eran
    Cc: Shachar Raindel
    Cc: Liran Liss
    Cc: Christoph Lameter
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haggai Eran
     
  • In order to allow sleeping during mmu notifier calls, we need to avoid
    invoking them under the page table spinlock. This patch solves the
    problem by calling invalidate_page notification after releasing the lock
    (but before freeing the page itself), or by wrapping the page invalidation
    with calls to invalidate_range_begin and invalidate_range_end.

    To prevent accidental changes to the invalidate_range_end arguments after
    the call to invalidate_range_begin, the patch introduces a convention of
    saving the arguments in consistently named locals:

    unsigned long mmun_start; /* For mmu_notifiers */
    unsigned long mmun_end; /* For mmu_notifiers */

    ...

    mmun_start = ...
    mmun_end = ...
    mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);

    ...

    mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);

    The patch changes code to use this convention for all calls to
    mmu_notifier_invalidate_range_start/end, except those where the calls are
    close enough so that anyone who glances at the code can see the values
    aren't changing.

    This patchset is a preliminary step towards on-demand paging design to be
    added to the RDMA stack.

    Why do we want on-demand paging for Infiniband?

    Applications register memory with an RDMA adapter using system calls,
    and subsequently post IO operations that refer to the corresponding
    virtual addresses directly to HW. Until now, this was achieved by
    pinning the memory during the registration calls. The goal of on demand
    paging is to avoid pinning the pages of registered memory regions (MRs).
    This will allow users the same flexibility they get when swapping any
    other part of their processes address spaces. Instead of requiring the
    entire MR to fit in physical memory, we can allow the MR to be larger,
    and only fit the current working set in physical memory.

    Why should anyone care? What problems are users currently experiencing?

    This can make programming with RDMA much simpler. Today, developers
    that are working with more data than their RAM can hold need either to
    deregister and reregister memory regions throughout their process's
    life, or keep a single memory region and copy the data to it. On demand
    paging will allow these developers to register a single MR at the
    beginning of their process's life, and let the operating system manage
    which pages needs to be fetched at a given time. In the future, we
    might be able to provide a single memory access key for each process
    that would provide the entire process's address as one large memory
    region, and the developers wouldn't need to register memory regions at
    all.

    Is there any prospect that any other subsystems will utilise these
    infrastructural changes? If so, which and how, etc?

    As for other subsystems, I understand that XPMEM wanted to sleep in
    MMU notifiers, as Christoph Lameter wrote at
    http://lkml.indiana.edu/hypermail/linux/kernel/0802.1/0460.html and
    perhaps Andrea knows about other use cases.

    Scheduling in mmu notifications is required since we need to sync the
    hardware with the secondary page tables change. A TLB flush of an IO
    device is inherently slower than a CPU TLB flush, so our design works by
    sending the invalidation request to the device, and waiting for an
    interrupt before exiting the mmu notifier handler.

    Avi said:

    kvm may be a buyer. kvm::mmu_lock, which serializes guest page
    faults, also protects long operations such as destroying large ranges.
    It would be good to convert it into a spinlock, but as it is used inside
    mmu notifiers, this cannot be done.

    (there are alternatives, such as keeping the spinlock and using a
    generation counter to do the teardown in O(1), which is what the "may"
    is doing up there).

    [akpm@linux-foundation.org: possible speed tweak in hugetlb_cow(), cleanups]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Haggai Eran
    Cc: Peter Zijlstra
    Cc: Xiao Guangrong
    Cc: Or Gerlitz
    Cc: Haggai Eran
    Cc: Shachar Raindel
    Cc: Liran Liss
    Cc: Christoph Lameter
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sagi Grimberg
     
  • Commit 0c176d52b0b2 ("mm: hugetlb: fix pgoff computation when unmapping
    page from vma") fixed pgoff calculation but it has replaced it by
    vma_hugecache_offset() which is not approapriate for offsets used for
    vma_prio_tree_foreach() because that one expects index in page units
    rather than in huge_page_shift.

    Johannes said:

    : The resulting index may not be too big, but it can be too small: assume
    : hpage size of 2M and the address to unmap to be 0x200000. This is regular
    : page index 512 and hpage index 1. If you have a VMA that maps the file
    : only starting at the second huge page, that VMAs vm_pgoff will be 512 but
    : you ask for offset 1 and miss it even though it does map the page of
    : interest. hugetlb_cow() will try to unmap, miss the vma, and retry the
    : cow until the allocation succeeds or the skipped vma(s) go away.

    Signed-off-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • RECLAIM_DISTANCE represents the distance between nodes at which it is
    deemed too costly to allocate from; it's preferred to try to reclaim from
    a local zone before falling back to allocating on a remote node with such
    a distance.

    To do this, zone_reclaim_mode is set if the distance between any two
    nodes on the system is greater than this distance. This, however, ends
    up causing the page allocator to reclaim from every zone regardless of
    its affinity.

    What we really want is to reclaim only from zones that are closer than
    RECLAIM_DISTANCE. This patch adds a nodemask to each node that
    represents the set of nodes that are within this distance. During the
    zone iteration, if the bit for a zone's node is set for the local node,
    then reclaim is attempted; otherwise, the zone is skipped.
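
    A sketch of the check during zone iteration (the helper and field names
    here follow our recollection of the final patch and should be treated as
    assumptions):

    /* reclaim only from zones whose node is within RECLAIM_DISTANCE of ours */
    static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
    {
            return node_isset(local_zone->node, zone->zone_pgdat->reclaim_nodes);
    }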

    [akpm@linux-foundation.org: fix CONFIG_NUMA=n build]
    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • We should not be seeing non-0 unevictable_pgs_mlockfreed any longer. So
    remove free_page_mlock() from the page freeing paths: __PG_MLOCKED is
    already in PAGE_FLAGS_CHECK_AT_FREE, so free_pages_check() will now be
    checking it, reporting "BUG: Bad page state" if it's ever found set.
    Leave UNEVICTABLE_MLOCKFREED and unevictable_pgs_mlockfreed in place, but
    comment that they are now always 0.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michel Lespinasse
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • We had thought that pages could no longer get freed while still marked as
    mlocked; but Johannes Weiner posted this program to demonstrate that
    truncating an mlocked private file mapping containing COWed pages is still
    mishandled:

    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/mman.h>

    int main(void)
    {
            char *map;
            int fd;

            system("grep mlockfreed /proc/vmstat");
            fd = open("chigurh", O_CREAT|O_EXCL|O_RDWR);
            unlink("chigurh");
            ftruncate(fd, 4096);
            map = mmap(NULL, 4096, PROT_WRITE, MAP_PRIVATE, fd, 0);
            map[0] = 11;
            mlock(map, sizeof(fd));
            ftruncate(fd, 0);
            close(fd);
            munlock(map, sizeof(fd));
            munmap(map, 4096);
            system("grep mlockfreed /proc/vmstat");
            return 0;
    }

    The anon COWed pages are not caught by truncation's clear_page_mlock() of
    the pagecache pages; but unmap_mapping_range() unmaps them, so we ought to
    look out for them there in page_remove_rmap(). Indeed, why should
    truncation or invalidation be doing the clear_page_mlock() when removing
    from pagecache? mlock is a property of mapping in userspace, not a
    property of pagecache: an mlocked unmapped page is nonsensical.

    Reported-by: Johannes Weiner
    Signed-off-by: Hugh Dickins
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Michel Lespinasse
    Cc: Ying Han
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • page_evictable(page, vma) is an irritant: almost all its callers pass
    NULL for vma. Remove the vma arg and use mlocked_vma_newpage(vma, page)
    explicitly in the couple of places it's needed. But in those places we
    don't even need page_evictable() itself! They're dealing with a freshly
    allocated anonymous page, which has no "mapping" and cannot be mlocked yet.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Michel Lespinasse
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • In fuzzing with trinity, lockdep protested "possible irq lock inversion
    dependency detected" when isolate_lru_page() reenabled interrupts while
    still holding the supposedly irq-safe tree_lock:

    invalidate_inode_pages2
      invalidate_complete_page2
        spin_lock_irq(&mapping->tree_lock)
        clear_page_mlock
          isolate_lru_page
            spin_unlock_irq(&zone->lru_lock)

    isolate_lru_page() is correct to enable interrupts unconditionally:
    invalidate_complete_page2() is incorrect to call clear_page_mlock() while
    holding tree_lock, which is supposed to nest inside lru_lock.

    Both truncate_complete_page() and invalidate_complete_page() call
    clear_page_mlock() before taking tree_lock to remove page from radix_tree.
    I guess invalidate_complete_page2() preferred to test PageDirty (again)
    under tree_lock before committing to the munlock; but since the page has
    already been unmapped, its state is already somewhat inconsistent, and no
    worse if clear_page_mlock() moved up.

    Reported-by: Sasha Levin
    Deciphered-by: Andrew Morton
    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Michel Lespinasse
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • kmem code uses this function and it is better to not use forward
    declarations for static inline functions as some (older) compilers don't
    like it:

    gcc version 4.3.4 [gcc-4_3-branch revision 152973] (SUSE Linux)

    mm/memcontrol.c:421: warning: `mem_cgroup_is_root' declared inline after being called
    mm/memcontrol.c:421: warning: previous declaration of `mem_cgroup_is_root' was here

    Signed-off-by: Michal Hocko
    Cc: Glauber Costa
    Cc: Sachin Kamat
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • TCP kmem accounting is currently guarded by CONFIG_MEMCG_KMEM ifdefs but
    the code is not used if !CONFIG_INET so we should rather test for both.
    The same applies to net/sock.h, net/ip.h and net/tcp_memcontrol.h but
    let's keep those outside of any ifdefs because it is considered safer wrt.
    future maintainability.

    Tested with
    - CONFIG_INET && CONFIG_MEMCG_KMEM
    - !CONFIG_INET && CONFIG_MEMCG_KMEM
    - CONFIG_INET && !CONFIG_MEMCG_KMEM
    - !CONFIG_INET && !CONFIG_MEMCG_KMEM
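
    The resulting guard looks something like this (sketch):

    #include <net/sock.h>           /* headers stay outside any ifdef, as above */
    #include <net/ip.h>
    #include <net/tcp_memcontrol.h>

    #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
    /* ... TCP kmem accounting code ... */
    #endif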

    Signed-off-by: Sachin Kamat
    Signed-off-by: Michal Hocko
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    I think zone->present_pages indicates the pages that the buddy system can
    manage; it should be:

    zone->present_pages = spanned pages - absent pages - bootmem pages,

    but is now:
    zone->present_pages = spanned pages - absent pages - memmap pages.

    spanned pages: total size, including holes.
    absent pages: holes.
    bootmem pages: pages used in system boot, managed by bootmem allocator.
    memmap pages: pages used by page structs.

    This may cause zone->present_pages to be less than it should be. For
    example, NUMA node 1 has ZONE_NORMAL and ZONE_MOVABLE; its memmap and
    other bootmem are allocated from ZONE_MOVABLE, so ZONE_NORMAL's
    present_pages should be spanned pages - absent pages, but currently it
    also subtracts the memmap pages (free_area_init_core), which are actually
    allocated from ZONE_MOVABLE. When all memory of a zone is offlined, this
    causes zone->present_pages to drop below 0; because present_pages is an
    unsigned long, it wraps to a very large integer, which indirectly makes
    zone->watermark[WMARK_MIN] a large integer (setup_per_zone_wmarks()),
    then makes totalreserve_pages a large integer
    (calculate_totalreserve_pages()), and finally causes memory allocation to
    fail when forking a process (__vm_enough_memory()).

    [root@localhost ~]# dmesg
    -bash: fork: Cannot allocate memory

    I think the bug described in

    http://marc.info/?l=linux-mm&m=134502182714186&w=2

    is also caused by wrong zone present pages.

    This patch intends to fix-up zone->present_pages when memory are freed to
    buddy system on x86_64 and IA64 platforms.

    Signed-off-by: Jianguo Wu
    Signed-off-by: Jiang Liu
    Reported-by: Petr Tesarik
    Tested-by: Petr Tesarik
    Cc: "Luck, Tony"
    Cc: Mel Gorman
    Cc: Yinghai Lu
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianguo Wu
     
  • Now that lumpy reclaim has been removed, compaction is the only way left
    to free up contiguous memory areas. It is time to just enable
    CONFIG_COMPACTION by default.

    Signed-off-by: Rik van Riel
    Cc: Mel Gorman
    Acked-by: Rafael Aquini
    Acked-by: Johannes Weiner
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
    update_mmu_cache() takes a pointer (to pte_t by default) as the last
    argument, but huge_memory.c passes a pmd_t value. The patch changes the
    argument to a pmd_t pointer.

    Signed-off-by: Catalin Marinas
    Signed-off-by: Steve Capper
    Signed-off-by: Will Deacon
    Cc: Arnd Bergmann
    Reviewed-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Gerald Schaefer
    Reviewed-by: Andrea Arcangeli
    Cc: Chris Metcalf
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     
    If NUMA is enabled, the indicator is not reset if the previous page
    request failed, causing us to trigger the BUG_ON() in
    khugepaged_alloc_page().

    Signed-off-by: Xiao Guangrong
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Michel Lespinasse
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • The changelog for commit 6a6dccba2fdc ("mm: cma: don't replace lowmem
    pages with highmem") mentioned that lowmem pages can be replaced by
    highmem pages during CMA migration. 6a6dccba2fdc fixed that issue.

    Quote from that changelog:

    : The filesystem layer expects pages in the block device's mapping to not
    : be in highmem (the mapping's gfp mask is set in bdget()), but CMA can
    : currently replace lowmem pages with highmem pages, leading to crashes in
    : filesystem code such as the one below:
    :
    : Unable to handle kernel NULL pointer dereference at virtual address 00000400
    : pgd = c0c98000
    : [00000400] *pgd=00c91831, *pte=00000000, *ppte=00000000
    : Internal error: Oops: 817 [#1] PREEMPT SMP ARM
    : CPU: 0 Not tainted (3.5.0-rc5+ #80)
    : PC is at __memzero+0x24/0x80
    : ...
    : Process fsstress (pid: 323, stack limit = 0xc0cbc2f0)
    : Backtrace:
    : [] (ext4_getblk+0x0/0x180) from [] (ext4_bread+0x1c/0x98)
    : [] (ext4_bread+0x0/0x98) from [] (ext4_mkdir+0x160/0x3bc)
    : r4:c15337f0
    : [] (ext4_mkdir+0x0/0x3bc) from [] (vfs_mkdir+0x8c/0x98)
    : [] (vfs_mkdir+0x0/0x98) from [] (sys_mkdirat+0x74/0xac)
    : r6:00000000 r5:c152eb40 r4:000001ff r3:c14b43f0
    : [] (sys_mkdirat+0x0/0xac) from [] (sys_mkdir+0x20/0x24)
    : r6:beccdcf0 r5:00074000 r4:beccdbbc
    : [] (sys_mkdir+0x0/0x24) from [] (ret_fast_syscall+0x0/0x30)

    Memory hotplug has the same problem as CMA, so the same fix can be
    applied to memory hotplug as well; fix it by reusing that approach.

    Signed-off-by: Minchan Kim
    Cc: Kamezawa Hiroyuki
    Reviewed-by: Yasuaki Ishimatsu
    Acked-by: Michal Nazarewicz
    Cc: Marek Szyprowski
    Cc: Wen Congyang
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • __alloc_contig_migrate_alloc() can be used by memory-hotplug so refactor
    it out (move + rename as a common name) into page_isolation.c.
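
    The refactored helper presumably ends up looking something like this
    (sketch; the common name alloc_migrate_target is our assumption):

    struct page *alloc_migrate_target(struct page *page, unsigned long private,
                                      int **resultp)
    {
            gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE;

            if (PageHighMem(page))
                    gfp_mask |= __GFP_HIGHMEM;      /* don't replace lowmem with highmem */

            return alloc_page(gfp_mask);
    }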

    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Minchan Kim
    Cc: Kamezawa Hiroyuki
    Reviewed-by: Yasuaki Ishimatsu
    Acked-by: Michal Nazarewicz
    Cc: Marek Szyprowski
    Cc: Wen Congyang
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim