Eric Lee / smarc-fsl-linux-kernel

10 Mar, 2012

1 commit

be22aece6 memcg: revert fix to mapcount check for this release ... Browse Code »

Respectfully revert commit e6ca7b89dc76 "memcg: fix mapcount check
in move charge code for anonymous page" for the 3.3 release, so that
it behaves exactly like releases 2.6.35 through 3.2 in this respect.

Horiguchi-san's commit is correct in itself, 1 makes much more sense
than 2 in that check; but it does not go far enough - swapcount
should be considered too - if we really want such a check at all.

We appear to have reached agreement now, and expect that 3.4 will
remove the mapcount check, but had better not make 3.3 different.

Signed-off-by: Hugh Dickins
Reviewed-by: Naoya Horiguchi
Signed-off-by: Linus Torvalds

Hugh Dickins
2012-03-10 07:32:20 +0800

07 Mar, 2012

4 commits

097d59106 vm: avoid using find_vma_prev() unnecessarily ... Browse Code »

Several users of "find_vma_prev()" were not in fact interested in the
previous vma if there was no primary vma to be found either. And in
those cases, we're much better off just using the regular "find_vma()",
and then "prev" can be looked up by just checking vma->vm_prev.

The find_vma_prev() semantics are fairly subtle (see Mikulas' recent
commit 83cd904d271b: "mm: fix find_vma_prev"), and the whole "return
prev by reference" means that it generates worse code too.

Thus this "let's avoid using this inconvenient and clearly too subtle
interface when we don't really have to" patch.

Cc: Mikulas Patocka
Cc: KOSAKI Motohiro
Signed-off-by: Linus Torvalds

Linus Torvalds
2012-03-07 10:23:36 +0800
83cd904d2 mm: fix find_vma_prev ... Browse Code »
43

Commit 6bd4837de96e ("mm: simplify find_vma_prev()") broke memory
management on PA-RISC.

After application of the patch, programs that allocate big arrays on the
stack crash with segfault, for example, this will crash if compiled
without optimization:

int main()
{
char array[200000];
array[199999] = 0;
return 0;
}

The reason is that PA-RISC has up-growing stack and the stack is usually
the last memory area. In the above example, a page fault happens above
the stack.

Previously, if we passed too high address to find_vma_prev, it returned
NULL and stored the last VMA in *pprev. After "simplify find_vma_prev"
change, it stores NULL in *pprev. Consequently, the stack area is not
found and it is not expanded, as it used to be before the change.

This patch restores the old behavior and makes it return the last VMA in
*pprev if the requested address is higher than address of any other VMA.

Signed-off-by: Mikulas Patocka
Acked-by: KOSAKI Motohiro
Signed-off-by: Linus Torvalds

Mikulas Patocka
2012-03-07 08:48:03 +0800
ce8fea7aa mmap: EINVAL not ENOMEM when rejecting VM_GROWS ... Browse Code »

Currently error is -ENOMEM when rejecting VM_GROWSDOWN|VM_GROWSUP
from shared anonymous: hoist the file case's -EINVAL up for both.

Signed-off-by: Hugh Dickins
Signed-off-by: Linus Torvalds

Hugh Dickins
2012-03-07 05:49:08 +0800
c09ff089a page_cgroup: fix horrid swap accounting regression ... Browse Code »

Why is memcg's swap accounting so broken? Insane counts, wrong
ownership, unfreeable structures, which later get freed and then
accessed after free.

Turns out to be a tiny a little 3.3-rc1 regression in 9fb4b7cc0724
"page_cgroup: add helper function to get swap_cgroup": the helper
function (actually named lookup_swap_cgroup()) returns an address using
void* arithmetic, but the structure in question is a short.

Signed-off-by: Hugh Dickins
Reviewed-by: Bob Liu
Cc: Michal Hocko
Cc: KAMEZAWA Hiroyuki
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2012-03-07 00:18:23 +0800

06 Mar, 2012

8 commits

3e85fb9cd Merge branch 'akpm' (Andrew's patch bomb) ... Browse Code »

Merge the emailed seties of 19 patches from Andrew Morton

* akpm:
rapidio/tsi721: fix queue wrapping bug in inbound doorbell handler
memcg: fix mapcount check in move charge code for anonymous page
mm: thp: fix BUG on mm->nr_ptes
alpha: fix 32/64-bit bug in futex support
memcg: fix GPF when cgroup removal races with last exit
debugobjects: Fix selftest for static warnings
floppy/scsi: fix setting of BIO flags
memcg: fix deadlock by inverting lrucare nesting
drivers/rtc/rtc-r9701.c: fix crash in r9701_remove()
c2port: class_create() returns an ERR_PTR
pps: class_create() returns an ERR_PTR, not NULL
hung_task: fix the broken rcu_lock_break() logic
vfork: kill PF_STARTING
coredump_wait: don't call complete_vfork_done()
vfork: make it killable
vfork: introduce complete_vfork_done()
aio: wake up waiters when freeing unused kiocbs
kprobes: return proper error code from register_kprobe()
kmsg_dump: don't run on non-error paths by default

Linus Torvalds
2012-03-06 07:50:25 +0800
e6ca7b89d memcg: fix mapcount check in move charge code for anonymous page ... Browse Code »
43

Currently the charge on shared anonyous pages is supposed not to moved in
task migration. To implement this, we need to check that mapcount > 1,
instread of > 2. So this patch fixes it.

Signed-off-by: Naoya Horiguchi
Reviewed-by: Daisuke Nishimura
Cc: Andrea Arcangeli
Cc: KAMEZAWA Hiroyuki
Cc: Hillf Danton
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Naoya Horiguchi
2012-03-06 07:49:43 +0800
1c641e847 mm: thp: fix BUG on mm->nr_ptes ... Browse Code »
1

Dave Jones reports a few Fedora users hitting the BUG_ON(mm->nr_ptes...)
in exit_mmap() recently.

Quoting Hugh's discovery and explanation of the SMP race condition:

"mm->nr_ptes had unusual locking: down_read mmap_sem plus
page_table_lock when incrementing, down_write mmap_sem (or mm_users
0) when decrementing; whereas THP is careful to increment and
decrement it under page_table_lock.

Now most of those paths in THP also hold mmap_sem for read or write
(with appropriate checks on mm_users), but two do not: when
split_huge_page() is called by hwpoison_user_mappings(), and when
called by add_to_swap().

It's conceivable that the latter case is responsible for the
exit_mmap() BUG_ON mm->nr_ptes that has been reported on Fedora."

The simplest way to fix it without having to alter the locking is to make
split_huge_page() a noop in nr_ptes terms, so by counting the preallocated
pagetables that exists for every mapped hugepage. It was an arbitrary
choice not to count them and either way is not wrong or right, because
they are not used but they're still allocated.

Reported-by: Dave Jones
Reported-by: Hugh Dickins
Signed-off-by: Andrea Arcangeli
Acked-by: Hugh Dickins
Cc: David Rientjes
Cc: Josh Boyer
Cc: [3.0.x, 3.1.x, 3.2.x]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2012-03-06 07:49:43 +0800
7512102cf memcg: fix GPF when cgroup removal races with last exit ... Browse Code »

When moving tasks from old memcg (with move_charge_at_immigrate on new
memcg), followed by removal of old memcg, hit General Protection Fault in
mem_cgroup_lru_del_list() (called from release_pages called from
free_pages_and_swap_cache from tlb_flush_mmu from tlb_finish_mmu from
exit_mmap from mmput from exit_mm from do_exit).

Somewhat reproducible, takes a few hours: the old struct mem_cgroup has
been freed and poisoned by SLAB_DEBUG, but mem_cgroup_lru_del_list() is
still trying to update its stats, and take page off lru before freeing.

A task, or a charge, or a page on lru: each secures a memcg against
removal. In this case, the last task has been moved out of the old memcg,
and it is exiting: anonymous pages are uncharged one by one from the
memcg, as they are zapped from its pagetables, so the charge gets down to
0; but the pages themselves are queued in an mmu_gather for freeing.

Most of those pages will be on lru (and force_empty is careful to
lru_add_drain_all, to add pages from pagevec to lru first), but not
necessarily all: perhaps some have been isolated for page reclaim, perhaps
some isolated for other reasons. So, force_empty may find no task, no
charge and no page on lru, and let the removal proceed.

There would still be no problem if these pages were immediately freed; but
typically (and the put_page_testzero protocol demands it) they have to be
added back to lru before they are found freeable, then removed from lru
and freed. We don't see the issue when adding, because the
mem_cgroup_iter() loops keep their own reference to the memcg being
scanned; but when it comes to mem_cgroup_lru_del_list().

I believe this was not an issue in v3.2: there, PageCgroupAcctLRU and
PageCgroupUsed flags were used (like a trick with mirrors) to deflect view
of pc->mem_cgroup to the stable root_mem_cgroup when neither set.
38c5d72f3ebe ("memcg: simplify LRU handling by new rule") mercifully
removed those convolutions, but left this General Protection Fault.

But it's surprisingly easy to restore the old behaviour: just check
PageCgroupUsed in mem_cgroup_lru_add_list() (which decides on which lruvec
to add), and reset pc to root_mem_cgroup if page is uncharged. A risky
change? just going back to how it worked before; testing, and an audit of
uses of pc->mem_cgroup, show no problem.

And there's a nice bonus: with mem_cgroup_lru_add_list() itself making
sure that an uncharged page goes to root lru, mem_cgroup_reset_owner() no
longer has any purpose, and we can safely revert 4e5f01c2b9b9 ("memcg:
clear pc->mem_cgroup if necessary").

Calling update_page_reclaim_stat() after add_page_to_lru_list() in swap.c
is not strictly necessary: the lru_lock there, with RCU before memcg
structures are freed, makes mem_cgroup_get_reclaim_stat_from_page safe
without that; but it seems cleaner to rely on one dependency less.

Signed-off-by: Hugh Dickins
Cc: KAMEZAWA Hiroyuki
Cc: Johannes Weiner
Cc: Konstantin Khlebnikov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2012-03-06 07:49:43 +0800
9ce70c024 memcg: fix deadlock by inverting lrucare nesting ... Browse Code »
43

We have forgotten the rules of lock nesting: the irq-safe ones must be
taken inside the non-irq-safe ones, otherwise we are open to deadlock:

CPU0 CPU1
---- ----
lock(&(&pc->lock)->rlock);
local_irq_disable();
lock(&(&zone->lru_lock)->rlock);
lock(&(&pc->lock)->rlock);

lock(&(&zone->lru_lock)->rlock);

To check a different locking issue, I happened to add a spin_lock to
memcg's bit_spin_lock in lock_page_cgroup(), and lockdep very quickly
complained about __mem_cgroup_commit_charge_lrucare() (on CPU1 above).

So delete __mem_cgroup_commit_charge_lrucare(), passing a bool lrucare to
__mem_cgroup_commit_charge() instead, taking zone->lru_lock under
lock_page_cgroup() in the lrucare case.

The original was using spin_lock_irqsave, but we'd be in more trouble if
it were ever called at interrupt time: unconditional _irq is enough. And
ClearPageLRU before del from lru, SetPageLRU before add to lru: no strong
reason, but that is the ordering used consistently elsewhere.

Fixes 36b62ad539498d00c2d280a151a ("memcg: simplify corner case handling
of LRU").

Signed-off-by: Hugh Dickins
Acked-by: Johannes Weiner
Cc: Konstantin Khlebnikov
Acked-by: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2012-03-06 07:49:43 +0800
789ce9b9c Merge branch 'for-3.3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu ... Browse Code »

Pull per-cpu patches from Tejun Heo:
"This pull request contains four patches. One replaces manual clearing
with bitmap_clear(), two fix generic definition of __this_cpu ops so
that they don't choose unnecessarily strict arch version. One makes
_this_cpu definition use raw_local_irq_*() so that it doesn't end up
wrecking irq on/off state tracking when used from inside lockdep.

Of the four patches, the raw_local_irq_*() update is the most
important, so please feel free to cherry pick only that one patch and
ignore the rest if you want to - commit e920d5971d 'percpu: use
raw_local_irq_* in _this_cpu op'."

* 'for-3.3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu: fix __this_cpu_{sub,inc,dec}_return() definition
percpu: use raw_local_irq_* in _this_cpu op
percpu: fix generic definition of __this_cpu_add_and_return()
percpu: use bitmap_clear

Linus Torvalds
2012-03-06 06:28:36 +0800
cd2934a3b flush_tlb_range() needs ->page_table_lock when ->mmap_sem is not held ... Browse Code »
43

All other callers already hold either ->mmap_sem (exclusive) or
->page_table_lock. And we need it because some page table flushing
instanced do work explicitly with ge tables.

See e.g. arch/powerpc/mm/tlb_hash32.c, flush_tlb_range() and
flush_range() in there. The same goes for uml, with a lot more
extensive playing with page tables.

Almost all callers are actually fine - flush_tlb_range() may have no
need to bother playing with page tables, but it can do so safely; again,
this caller is the sole exception - everything else either has exclusive
->mmap_sem on the mm in question, or mm->page_table_lock is held.

Signed-off-by: Al Viro
Signed-off-by: Linus Torvalds

Al Viro
2012-03-06 05:51:32 +0800
835ee7978 VM_GROWS{UP,DOWN} shouldn't be set on shmem VMAs ... Browse Code »

Signed-off-by: Al Viro
Signed-off-by: Linus Torvalds

Al Viro
2012-03-06 05:51:32 +0800

01 Mar, 2012

1 commit

847854f59 memblock: Fix size aligning of memblock_alloc_base_nid() ... Browse Code »

memblock allocator aligns @size to @align to reduce the amount
of fragmentation. Commit:

7bd0b0f0da ("memblock: Reimplement memblock allocation using reverse free area iterator")

Broke it by incorrectly relocating @size aligning to
memblock_find_in_range_node(). As the aligned size is not
propagated back to memblock_alloc_base_nid(), the actually
reserved size isn't aligned.

While this increases memory use for memblock reserved array,
this shouldn't cause any critical failure; however, it seems
that the size aligning was hiding a use-beyond-allocation bug in
sparc64 and losing the aligning causes boot failure.

The underlying problem is currently being debugged but this is a
proper fix in itself, it's already pretty late in -rc cycle for
boot failures and reverting the change for debugging isn't
difficult. Restore the size aligning moving it to
memblock_alloc_base_nid().

Reported-by: Meelis Roos
Signed-off-by: Tejun Heo
Cc: David S. Miller
Cc: Grant Likely
Cc: Rob Herring
Cc: Linus Torvalds
Cc: Andrew Morton
Link: http://lkml.kernel.org/r/20120228205621.GC3252@dhcp-172-17-108-109.mtv.corp.google.com
Signed-off-by: Ingo Molnar
LKML-Reference:

Tejun Heo
2012-03-01 17:53:18 +0800

25 Feb, 2012

3 commits

b94cfaf66 NOMMU: Don't need to clear vm_mm when deleting a VMA ... Browse Code »
1

Don't clear vm_mm in a deleted VMA as it's unnecessary and might
conceivably break the filesystem or driver VMA close routine.

Reported-by: Al Viro
Signed-off-by: David Howells
Acked-by: Al Viro
cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds

David Howells
2012-02-25 00:59:04 +0800
918e556ec NOMMU: Lock i_mmap_mutex for access to the VMA prio list ... Browse Code »
1

Lock i_mmap_mutex for access to the VMA prio list to prevent concurrent
access. Currently, certain parts of the mmap handling are protected by
the region mutex, but not all.

Reported-by: Al Viro
Signed-off-by: David Howells
Acked-by: Al Viro
cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds

David Howells
2012-02-25 00:59:04 +0800
371528cae mm: memcg: Correct unregistring of events attached to the same eventfd ... Browse Code »
1

There is an issue when memcg unregisters events that were attached to
the same eventfd:

- On the first call mem_cgroup_usage_unregister_event() removes all
events attached to a given eventfd, and if there were no events left,
thresholds->primary would become NULL;

- Since there were several events registered, cgroups core will call
mem_cgroup_usage_unregister_event() again, but now kernel will oops,
as the function doesn't expect that threshold->primary may be NULL.

That's a good question whether mem_cgroup_usage_unregister_event()
should actually remove all events in one go, but nowadays it can't
do any better as cftype->unregister_event callback doesn't pass
any private event-associated cookie. So, let's fix the issue by
simply checking for threshold->primary.

FWIW, w/o the patch the following oops may be observed:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
IP: [] mem_cgroup_usage_unregister_event+0x9c/0x1f0
Pid: 574, comm: kworker/0:2 Not tainted 3.3.0-rc4+ #9 Bochs Bochs
RIP: 0010:[] [] mem_cgroup_usage_unregister_event+0x9c/0x1f0
RSP: 0018:ffff88001d0b9d60 EFLAGS: 00010246
Process kworker/0:2 (pid: 574, threadinfo ffff88001d0b8000, task ffff88001de91cc0)
Call Trace:
[] cgroup_event_remove+0x2b/0x60
[] process_one_work+0x174/0x450
[] worker_thread+0x123/0x2d0

Cc: stable
Signed-off-by: Anton Vorontsov
Acked-by: KAMEZAWA Hiroyuki
Cc: Kirill A. Shutemov
Cc: Michal Hocko
Signed-off-by: Linus Torvalds

Anton Vorontsov
2012-02-25 00:55:51 +0800

14 Feb, 2012

1 commit

074b85175 vfs: fix panic in __d_lookup() with high dentry hashtable counts ... Browse Code »

When the number of dentry cache hash table entries gets too high
(2147483648 entries), as happens by default on a 16TB system, use of a
signed integer in the dcache_init() initialization loop prevents the
dentry_hashtable from getting initialized, causing a panic in
__d_lookup(). Fix this in dcache_init() and similar areas.

Signed-off-by: Dimitri Sivanich
Acked-by: David S. Miller
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Al Viro

Dimitri Sivanich
2012-02-14 09:45:38 +0800

11 Feb, 2012

1 commit

af5feae3d Merge tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux ... Browse Code »

fix 1 mysterious divide error
fix 3 NULL dereference bugs in writeback tracing, on SD card removal w/o umount

* tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
writeback: fix dereferencing NULL bdi->dev on trace_writeback_queue
lib: proportion: lower PROP_MAX_SHIFT to 32 on 64-bit kernel
writeback: fix NULL bdi->dev in trace writeback_single_inode
backing-dev: fix wakeup timer races with bdi_unregister()

Linus Torvalds
2012-02-11 01:05:52 +0800

09 Feb, 2012

2 commits

b9980cdcf mm: fix UP THP spin_is_locked BUGs ... Browse Code »
1

Fix CONFIG_TRANSPARENT_HUGEPAGE=y CONFIG_SMP=n CONFIG_DEBUG_VM=y
CONFIG_DEBUG_SPINLOCK=n kernel: spin_is_locked() is then always false,
and so triggers some BUGs in Transparent HugePage codepaths.

asm-generic/bug.h mentions this problem, and provides a WARN_ON_SMP(x);
but being too lazy to add VM_BUG_ON_SMP, BUG_ON_SMP, WARN_ON_SMP_ONCE,
VM_WARN_ON_SMP_ONCE, just test NR_CPUS != 1 in the existing VM_BUG_ONs.

Signed-off-by: Hugh Dickins
Cc: Andrea Arcangeli
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2012-02-09 11:03:51 +0800
dc9086004 mm: compaction: check for overlapping nodes during isolation for migration ... Browse Code »
1

When isolating pages for migration, migration starts at the start of a
zone while the free scanner starts at the end of the zone. Migration
avoids entering a new zone by never going beyond the free scanned.

Unfortunately, in very rare cases nodes can overlap. When this happens,
migration isolates pages without the LRU lock held, corrupting lists
which will trigger errors in reclaim or during page free such as in the
following oops

BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
IP: [] free_pcppages_bulk+0xcc/0x450
PGD 1dda554067 PUD 1e1cb58067 PMD 0
Oops: 0000 [#1] SMP
CPU 37
Pid: 17088, comm: memcg_process_s Tainted: G X
RIP: free_pcppages_bulk+0xcc/0x450
Process memcg_process_s (pid: 17088, threadinfo ffff881c2926e000, task ffff881c2926c0c0)
Call Trace:
free_hot_cold_page+0x17e/0x1f0
__pagevec_free+0x90/0xb0
release_pages+0x22a/0x260
pagevec_lru_move_fn+0xf3/0x110
putback_lru_page+0x66/0xe0
unmap_and_move+0x156/0x180
migrate_pages+0x9e/0x1b0
compact_zone+0x1f3/0x2f0
compact_zone_order+0xa2/0xe0
try_to_compact_pages+0xdf/0x110
__alloc_pages_direct_compact+0xee/0x1c0
__alloc_pages_slowpath+0x370/0x830
__alloc_pages_nodemask+0x1b1/0x1c0
alloc_pages_vma+0x9b/0x160
do_huge_pmd_anonymous_page+0x160/0x270
do_page_fault+0x207/0x4c0
page_fault+0x25/0x30

The "X" in the taint flag means that external modules were loaded but but
is unrelated to the bug triggering. The real problem was because the PFN
layout looks like this

Zone PFN ranges:
DMA 0x00000010 -> 0x00001000
DMA32 0x00001000 -> 0x00100000
Normal 0x00100000 -> 0x01e80000
Movable zone start PFN for each node
early_node_map[14] active PFN ranges
0: 0x00000010 -> 0x0000009b
0: 0x00000100 -> 0x0007a1ec
0: 0x0007a354 -> 0x0007a379
0: 0x0007f7ff -> 0x0007f800
0: 0x00100000 -> 0x00680000
1: 0x00680000 -> 0x00e80000
0: 0x00e80000 -> 0x01080000
1: 0x01080000 -> 0x01280000
0: 0x01280000 -> 0x01480000
1: 0x01480000 -> 0x01680000
0: 0x01680000 -> 0x01880000
1: 0x01880000 -> 0x01a80000
0: 0x01a80000 -> 0x01c80000
1: 0x01c80000 -> 0x01e80000

The fix is straight-forward. isolate_migratepages() has to make a
similar check to isolate_freepage to ensure that it never isolates pages
from a zone it does not hold the LRU lock for.

This was discovered in a 3.0-based kernel but it affects 3.1.x, 3.2.x
and current mainline.

Signed-off-by: Mel Gorman
Acked-by: Michal Nazarewicz
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2012-02-09 11:03:51 +0800

05 Feb, 2012

1 commit

82bdc843c Merge branch 'akpm' ... Browse Code »

* akpm:
mm: compaction: check pfn_valid when entering a new MAX_ORDER_NR_PAGES block during isolation for migration
readahead: fix pipeline break caused by block plug
kprobes: fix a memory leak in function pre_handler_kretprobe()
drivers/tty/vt/vt_ioctl.c: fix KDFONTOP 32bit compatibility layer
lkdtm: avoid calling lkdtm_do_action() with spinlock held
mm/filemap_xip.c: fix race condition in xip_file_fault()
mm/memcontrol.c: fix warning with CONFIG_NUMA=n
avr32: select generic atomic64_t support
mm: postpone migrated page mapping reset
xtensa: fix memscan()
MAINTAINERS: update lguest F: patterns
MAINTAINERS: remove staging sections
MAINTAINERS: remove iMX5 section
MAINTAINERS: update partitions block F: patterns

Linus Torvalds
2012-02-05 02:51:54 +0800

04 Feb, 2012

6 commits

0bf380bc7 mm: compaction: check pfn_valid when entering a new MAX_ORDER_NR_PAGES block dur… ... Browse Code »
1

…ing isolation for migration

When isolating for migration, migration starts at the start of a zone
which is not necessarily pageblock aligned. Further, it stops isolating
when COMPACT_CLUSTER_MAX pages are isolated so migrate_pfn is generally
not aligned. This allows isolate_migratepages() to call pfn_to_page() on
an invalid PFN which can result in a crash. This was originally reported
against a 3.0-based kernel with the following trace in a crash dump.

PID: 9902 TASK: d47aecd0 CPU: 0 COMMAND: "memcg_process_s"
#0 [d72d3ad0] crash_kexec at c028cfdb
#1 [d72d3b24] oops_end at c05c5322
#2 [d72d3b38] __bad_area_nosemaphore at c0227e60
#3 [d72d3bec] bad_area at c0227fb6
#4 [d72d3c00] do_page_fault at c05c72ec
#5 [d72d3c80] error_code (via page_fault) at c05c47a4
EAX: 00000000 EBX: 000c0000 ECX: 00000001 EDX: 00000807 EBP: 000c0000
DS: 007b ESI: 00000001 ES: 007b EDI: f3000a80 GS: 6f50
CS: 0060 EIP: c030b15a ERR: ffffffff EFLAGS: 00010002
#6 [d72d3cb4] isolate_migratepages at c030b15a
#7 [d72d3d14] zone_watermark_ok at c02d26cb
#8 [d72d3d2c] compact_zone at c030b8de
#9 [d72d3d68] compact_zone_order at c030bba1
#10 [d72d3db4] try_to_compact_pages at c030bc84
#11 [d72d3ddc] __alloc_pages_direct_compact at c02d61e7
#12 [d72d3e08] __alloc_pages_slowpath at c02d66c7
#13 [d72d3e78] __alloc_pages_nodemask at c02d6a97
#14 [d72d3eb8] alloc_pages_vma at c030a845
#15 [d72d3ed4] do_huge_pmd_anonymous_page at c03178eb
#16 [d72d3f00] handle_mm_fault at c02f36c6
#17 [d72d3f30] do_page_fault at c05c70ed
#18 [d72d3fb0] error_code (via page_fault) at c05c47a4
EAX: b71ff000 EBX: 00000001 ECX: 00001600 EDX: 00000431
DS: 007b ESI: 08048950 ES: 007b EDI: bfaa3788
SS: 007b ESP: bfaa36e0 EBP: bfaa3828 GS: 6f50
CS: 0073 EIP: 080487c8 ERR: ffffffff EFLAGS: 00010202

It was also reported by Herbert van den Bergh against 3.1-based kernel
with the following snippet from the console log.

BUG: unable to handle kernel paging request at 01c00008
IP: [<c0522399>] isolate_migratepages+0x119/0x390
*pdpt = 000000002f7ce001 *pde = 0000000000000000

It is expected that it also affects 3.2.x and current mainline.

The problem is that pfn_valid is only called on the first PFN being
checked and that PFN is not necessarily aligned. Lets say we have a case
like this

H = MAX_ORDER_NR_PAGES boundary
| = pageblock boundary
m = cc->migrate_pfn
f = cc->free_pfn
o = memory hole

H------|------H------|----m-Hoooooo|ooooooH-f----|------H

The migrate_pfn is just below a memory hole and the free scanner is beyond
the hole. When isolate_migratepages started, it scans from migrate_pfn to
migrate_pfn+pageblock_nr_pages which is now in a memory hole. It checks
pfn_valid() on the first PFN but then scans into the hole where there are
not necessarily valid struct pages.

This patch ensures that isolate_migratepages calls pfn_valid when
necessary.

Reported-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com>
Tested-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Mel Gorman
2012-02-04 08:16:41 +0800
3deaa7190 readahead: fix pipeline break caused by block plug ... Browse Code »
1

Herbert Poetzl reported a performance regression since 2.6.39. The test
is a simple dd read, but with big block size. The reason is:

T1: ra (A, A+128k), (A+128k, A+256k)
T2: lock_page for page A, submit the 256k
T3: hit page A+128K, ra (A+256k, A+384). the range isn't submitted
because of plug and there isn't any lock_page till we hit page A+256k
because all pages from A to A+256k is in memory
T4: hit page A+256k, ra (A+384, A+ 512). Because of plug, the range isn't
submitted again.
T5: lock_page A+256k, so (A+256k, A+512k) will be submitted. The task is
waitting for (A+256k, A+512k) finish.

There is no request to disk in T3 and T4, so readahead pipeline breaks.

We really don't need block plug for generic_file_aio_read() for buffered
I/O. The readahead already has plug and has fine grained control when I/O
should be submitted. Deleting plug for buffered I/O fixes the regression.

One side effect is plug makes the request size 256k, the size is 128k
without it. This is because default ra size is 128k and not a reason we
need plug here.

Vivek said:

: We submit some readahead IO to device request queue but because of nested
: plug, queue never gets unplugged. When read logic reaches a page which is
: not in page cache, it waits for page to be read from the disk
: (lock_page_killable()) and that time we flush the plug list.
:
: So effectively read ahead logic is kind of broken in parts because of
: nested plugging. Removing top level plug (generic_file_aio_read()) for
: buffered reads, will allow unplugging queue earlier for readahead.

Signed-off-by: Shaohua Li
Signed-off-by: Wu Fengguang
Reported-by: Herbert Poetzl
Tested-by: Eric Dumazet
Cc: Christoph Hellwig
Cc: Jens Axboe
Cc: Vivek Goyal
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Shaohua Li
2012-02-04 08:16:41 +0800
99f02ef1f mm/filemap_xip.c: fix race condition in xip_file_fault() ... Browse Code »
1

Fix a race condition that shows in conjunction with xip_file_fault() when
two threads of the same user process fault on the same memory page.

In this case, the race winner will install the page table entry and the
unlucky loser will cause an oops: xip_file_fault calls vm_insert_pfn (via
vm_insert_mixed) which drops out at this check:

retval = -EBUSY;
if (!pte_none(*pte))
goto out_unlock;

The resulting -EBUSY return value will trigger a BUG_ON() in
xip_file_fault.

This fix simply considers the fault as fixed in this case, because the
race winner has successfully installed the pte.

[akpm@linux-foundation.org: use conventional (and consistent) comment layout]
Reported-by: David Sadler
Signed-off-by: Carsten Otte
Reported-by: Louis Alex Eisner
Cc: Hugh Dickins
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Carsten Otte
2012-02-04 08:16:41 +0800
82b3f2a71 mm/memcontrol.c: fix warning with CONFIG_NUMA=n ... Browse Code »

mm/memcontrol.c: In function 'memcg_check_events':
mm/memcontrol.c:779: warning: unused variable 'do_numainfo'

Acked-by: Michal Hocko
Cc: Li Zefan
Cc: Hiroyuki KAMEZAWA
Cc: Johannes Weiner
Acked-by: "Kirill A. Shutemov"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrew Morton
2012-02-04 08:16:40 +0800
35512ecae mm: postpone migrated page mapping reset ... Browse Code »

Postpone resetting page->mapping until the final remove_migration_ptes().
Otherwise the expression PageAnon(migration_entry_to_page(entry)) does not
work.

Signed-off-by: Konstantin Khlebnikov
Cc: Hugh Dickins
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Konstantin Khlebnikov
2012-02-04 08:16:40 +0800
7c7ed8ec3 Merge tag 'kmemleak-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux ... Browse Code »

Trivial kmemleak bug-fixes:

- Early logging doesn't stop when kmemleak is off by default.
- Zero-size scanning areas should be ignored (currently it prints a
warning).

* tag 'kmemleak-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux:
kmemleak: Disable early logging when kmemleak is off by default
kmemleak: Only scan non-zero-size areas

Linus Torvalds
2012-02-04 04:41:31 +0800

03 Feb, 2012

1 commit

8cdb878dc Fix race in process_vm_rw_core ... Browse Code »

This fixes the race in process_vm_core found by Oleg (see

http://article.gmane.org/gmane.linux.kernel/1235667/

for details).

This has been updated since I last sent it as the creation of the new
mm_access() function did almost exactly the same thing as parts of the
previous version of this patch did.

In order to use mm_access() even when /proc isn't enabled, we move it to
kernel/fork.c where other related process mm access functions already
are.

Signed-off-by: Chris Yeoh
Signed-off-by: Linus Torvalds

Christopher Yeoh
2012-02-03 04:55:17 +0800

01 Feb, 2012

1 commit

2673b4cf5 backing-dev: fix wakeup timer races with bdi_unregister() ... Browse Code »

While 7a401a972df8e18 ("backing-dev: ensure wakeup_timer is deleted")
addressed the problem of the bdi being freed with a queued wakeup
timer, there are other races that could happen if the wakeup timer
expires after/during bdi_unregister(), before bdi_destroy() is called.

wakeup_timer_fn() could attempt to wakeup a task which has already has
been freed, or could access a NULL bdi->dev via the wake_forker_thread
tracepoint.

Cc:
Cc: Jens Axboe
Reported-by: Chanho Min
Reviewed-by: Namjae Jeon
Signed-off-by: Rabin Vincent
Signed-off-by: Wu Fengguang

Rabin Vincent
2012-02-01 16:52:49 +0800

27 Jan, 2012

1 commit

2437dcbf5 Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rcu: Add missing __cpuinit annotation in rcutorture code
sched: Add "const" to is_idle_task() parameter
rcu: Make rcutorture bool parameters really bool (core code)
memblock: Fix alloc failure due to dumb underflow protection in memblock_find_in_range_node()

Linus Torvalds
2012-01-27 04:45:41 +0800

25 Jan, 2012

1 commit

701b259f4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Davem says:

1) Fix JIT code generation on x86-64 for divide by zero, from Eric Dumazet.

2) tg3 header length computation correction from Eric Dumazet.

3) More build and reference counting fixes for socket memory cgroup
code from Glauber Costa.

4) module.h snuck back into a core header after all the hard work we
did to remove that, from Paul Gortmaker and Jesper Dangaard Brouer.

5) Fix PHY naming regression and add some new PCI IDs in stmmac, from
Alessandro Rubini.

6) Netlink message generation fix in new team driver, should only advertise
the entries that changed during events, from Jiri Pirko.

7) SRIOV VF registration and unregistration fixes, and also add a
missing PCI ID, from Roopa Prabhu.

8) Fix infinite loop in tx queue flush code of brcmsmac, from Stanislaw Gruszka.

9) ftgmac100/ftmac100 build fix, missing interrupt.h include.

10) Memory leak fix in net/hyperv do_set_mutlicast() handling, from Wei Yongjun.

11) Off by one fix in netem packet scheduler, from Vijay Subramanian.

12) TCP loss detection fix from Yuchung Cheng.

13) TCP reset packet MD5 calculation uses wrong address, fix from Shawn Lu.

14) skge carrier assertion and DMA mapping fixes from Stephen Hemminger.

15) Congestion recovery undo performed at the wrong spot in BIC and CUBIC
congestion control modules, fix from Neal Cardwell.

16) Ethtool ETHTOOL_GSSET_INFO is unnecessarily restrictive, from Michał Mirosław.

17) Fix triggerable race in ipv6 sysctl handling, from Francesco Ruggeri.

18) Statistics bug fixes in mlx4 from Eugenia Emantayev.

19) rds locking bug fix during info dumps, from your's truly.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (67 commits)
rds: Make rds_sock_lock BH rather than IRQ safe.
netprio_cgroup.h: dont include module.h from other includes
net: flow_dissector.c missing include linux/export.h
team: send only changed options/ports via netlink
net/hyperv: fix possible memory leak in do_set_multicast()
drivers/net: dsa/mv88e6xxx.c files need linux/module.h
stmmac: added PCI identifiers
llc: Fix race condition in llc_ui_recvmsg
stmmac: fix phy naming inconsistency
dsa: Add reporting of silicon revision for Marvell 88E6123/88E6161/88E6165 switches.
tg3: fix ipv6 header length computation
skge: add byte queue limit support
mv643xx_eth: Add Rx Discard and Rx Overrun statistics
bnx2x: fix compilation error with SOE in fw_dump
bnx2x: handle CHIP_REVISION during init_one
bnx2x: allow user to change ring size in ISCSI SD mode
bnx2x: fix Big-Endianess in ethtool -t
bnx2x: fixed ethtool statistics for MF modes
bnx2x: credit-leakage fixup on vlan_mac_del_all
macvlan: fix a possible use after free
...

Linus Torvalds
2012-01-25 07:51:40 +0800

24 Jan, 2012

7 commits

9f9f1acd7 mm: fix rss count leakage during migration ... Browse Code »

Memory migration fills a pte with a migration entry and it doesn't
update the rss counters. Then it replaces the migration entry with the
new page (or the old one if migration failed). But between these two
passes this pte can be unmaped, or a task can fork a child and it will
get a copy of this migration entry. Nobody accounts for this in the rss
counters.

This patch properly adjust rss counters for migration entries in
zap_pte_range() and copy_one_pte(). Thus we avoid extra atomic
operations on the migration fast-path.

Signed-off-by: Konstantin Khlebnikov
Cc: Hugh Dickins
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Konstantin Khlebnikov
2012-01-24 00:38:49 +0800
245132643 SHM_UNLOCK: fix Unevictable pages stranded after swap ... Browse Code »

Commit cc39c6a9bbde ("mm: account skipped entries to avoid looping in
find_get_pages") correctly fixed an infinite loop; but left a problem
that find_get_pages() on shmem would return 0 (appearing to callers to
mean end of tree) when it meets a run of nr_pages swap entries.

The only uses of find_get_pages() on shmem are via pagevec_lookup(),
called from invalidate_mapping_pages(), and from shmctl SHM_UNLOCK's
scan_mapping_unevictable_pages(). The first is already commented, and
not worth worrying about; but the second can leave pages on the
Unevictable list after an unusual sequence of swapping and locking.

Fix that by using shmem_find_get_pages_and_swap() (then ignoring the
swap) instead of pagevec_lookup().

But I don't want to contaminate vmscan.c with shmem internals, nor
shmem.c with LRU locking. So move scan_mapping_unevictable_pages() into
shmem.c, renaming it shmem_unlock_mapping(); and rename
check_move_unevictable_page() to check_move_unevictable_pages(), looping
down an array of pages, oftentimes under the same lock.

Leave out the "rotate unevictable list" block: that's a leftover from
when this was used for /proc/sys/vm/scan_unevictable_pages, whose flawed
handling involved looking at pages at tail of LRU.

Was there significance to the sequence first ClearPageUnevictable, then
test page_evictable, then SetPageUnevictable here? I think not, we're
under LRU lock, and have no barriers between those.

Signed-off-by: Hugh Dickins
Reviewed-by: KOSAKI Motohiro
Cc: Minchan Kim
Cc: Rik van Riel
Cc: Shaohua Li
Cc: Eric Dumazet
Cc: Johannes Weiner
Cc: Michel Lespinasse
Cc: [back to 3.1 but will need respins]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2012-01-24 00:38:48 +0800
85046579b SHM_UNLOCK: fix long unpreemptible section ... Browse Code »

scan_mapping_unevictable_pages() is used to make SysV SHM_LOCKed pages
evictable again once the shared memory is unlocked. It does this with
pagevec_lookup()s across the whole object (which might occupy most of
memory), and takes 300ms to unlock 7GB here. A cond_resched() every
PAGEVEC_SIZE pages would be good.

However, KOSAKI-san points out that this is called under shmem.c's
info->lock, and it's also under shm.c's shm_lock(), both spinlocks.
There is no strong reason for that: we need to take these pages off the
unevictable list soonish, but those locks are not required for it.

So move the call to scan_mapping_unevictable_pages() from shmem.c's
unlock handling up to shm.c's unlock handling. Remove the recently
added barrier, not needed now we have spin_unlock() before the scan.

Use get_file(), with subsequent fput(), to make sure we have a reference
to mapping throughout scan_mapping_unevictable_pages(): that's something
that was previously guaranteed by the shm_lock().

Remove shmctl's lru_add_drain_all(): we don't fault in pages at SHM_LOCK
time, and we lazily discover them to be Unevictable later, so it serves
no purpose for SHM_LOCK; and serves no purpose for SHM_UNLOCK, since
pages still on pagevec are not marked Unevictable.

The original code avoided redundant rescans by checking VM_LOCKED flag
at its level: now avoid them by checking shp's SHM_LOCKED.

The original code called scan_mapping_unevictable_pages() on a locked
area at shm_destroy() time: perhaps we once had accounting cross-checks
which required that, but not now, so skip the overhead and just let
inode eviction deal with them.

Put check_move_unevictable_page() and scan_mapping_unevictable_pages()
under CONFIG_SHMEM (with stub for the TINY case when ramfs is used),
more as comment than to save space; comment them used for SHM_UNLOCK.

Signed-off-by: Hugh Dickins
Reviewed-by: KOSAKI Motohiro
Cc: Minchan Kim
Cc: Rik van Riel
Cc: Shaohua Li
Cc: Eric Dumazet
Cc: Johannes Weiner
Cc: Michel Lespinasse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2012-01-24 00:38:48 +0800
409eb8c26 mm/hugetlb.c: undo change to page mapcount in fault handler ... Browse Code »

Page mapcount should be updated only if we are sure that the page ends
up in the page table otherwise we would leak if we couldn't COW due to
reservations or if idx is out of bounds.

Signed-off-by: Hillf Danton
Reviewed-by: Michal Hocko
Acked-by: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hillf Danton
2012-01-24 00:38:48 +0800
6568d4a9c mm: memcg: update the correct soft limit tree during migration ... Browse Code »
43

end_migration() passes the old page instead of the new page to commit
the charge. This page descriptor is not used for committing itself,
though, since we also pass the (correct) page_cgroup descriptor. But
it's used to find the soft limit tree through the page's zone, so the
soft limit tree of the old page's zone is updated instead of that of the
new page's, which might get slightly out of date until the next charge
reaches the ratelimit point.

This glitch has been present since 5564e88 ("memcg: condense
page_cgroup-to-page lookup points").

This fixes a bug that I introduced in 2.6.38. It's benign enough (to my
knowledge) that we probably don't want this for stable.

Reported-by: Hugh Dickins
Signed-off-by: Johannes Weiner
Acked-by: KAMEZAWA Hiroyuki
Acked-by: Michal Hocko
Acked-by: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2012-01-24 00:38:48 +0800
656a07062 mm: __count_immobile_pages(): make sure the node is online ... Browse Code »

page_zone() requires an online node otherwise we are accessing NULL
NODE_DATA. This is not an issue at the moment because node_zones are
located at the structure beginning but this might change in the future
so better be careful about that.

Signed-off-by: Michal Hocko
Signed-off-by: KAMEZAWA Hiroyuki
Acked-by: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2012-01-24 00:38:47 +0800
687875fb7 mm: fix NULL ptr dereference in __count_immobile_pages ... Browse Code »
1

Fix the following NULL ptr dereference caused by

cat /sys/devices/system/memory/memory0/removable

Pid: 13979, comm: sed Not tainted 3.0.13-0.5-default #1 IBM BladeCenter LS21 -[7971PAM]-/Server Blade
RIP: __count_immobile_pages+0x4/0x100
Process sed (pid: 13979, threadinfo ffff880221c36000, task ffff88022e788480)
Call Trace:
is_pageblock_removable_nolock+0x34/0x40
is_mem_section_removable+0x74/0xf0
show_mem_removable+0x41/0x70
sysfs_read_file+0xfe/0x1c0
vfs_read+0xc7/0x130
sys_read+0x53/0xa0
system_call_fastpath+0x16/0x1b

We are crashing because we are trying to dereference NULL zone which
came from pfn=0 (struct page ffffea0000000000). According to the boot
log this page is marked reserved:
e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)

and early_node_map confirms that:
early_node_map[3] active PFN ranges
1: 0x00000010 -> 0x0000009c
1: 0x00000100 -> 0x000bffa3
1: 0x00100000 -> 0x00240000

The problem is that memory_present works in PAGE_SECTION_MASK aligned
blocks so the reserved range sneaks into the the section as well. This
also means that free_area_init_node will not take care of those reserved
pages and they stay uninitialized.

When we try to read the removable status we walk through all available
sections and hope that the zone is valid for all pages in the section.
But this is not true in this case as the zone and nid are not initialized.

We have only one node in this particular case and it is marked as node=1
(rather than 0) and that made the problem visible because page_to_nid will
return 0 and there are no zones on the node.

Let's check that the zone is valid and that the given pfn falls into its
boundaries and mark the section not removable. This might cause some
false positives, probably, but we do not have any sane way to find out
whether the page is reserved by the platform or it is just not used for
whatever other reasons.

Signed-off-by: Michal Hocko
Acked-by: Mel Gorman
Cc: KAMEZAWA Hiroyuki
Cc: Andrea Arcangeli
Cc: David Rientjes
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2012-01-24 00:38:47 +0800

23 Jan, 2012

1 commit

376be5ff8 net: fix socket memcg build with !CONFIG_NET ... Browse Code »

There is still a build bug with the sock memcg code, that triggers
with !CONFIG_NET, that survived my series of randconfig builds.

Signed-off-by: Glauber Costa
Reported-by: Randy Dunlap
CC: Hiroyouki Kamezawa
Signed-off-by: David S. Miller

Glauber Costa
2012-01-23 04:08:45 +0800