07 Dec, 2010

2 commits

  • * 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
    PM / Hibernate: Fix memory corruption related to swap
    PM / Hibernate: Use async I/O when reading compressed hibernation image

    Linus Torvalds
     
  • There is a problem: swap pages allocated before the creation of
    a hibernation image can be released and reused for storing the
    contents of different memory pages while the image is being saved.
    Since the kernel stored in the image is unaware of this, memory
    corruption occurs after resume from hibernation, especially on
    systems with relatively small RAM that need to swap often.

    This issue can be addressed by keeping the GFP_IOFS bits clear
    in gfp_allowed_mask during the entire hibernation, including the
    saving of the image, until the system is finally turned off or
    the hibernation is aborted. Unfortunately, for this purpose
    it's necessary to rework the way in which the hibernate and
    suspend code manipulates gfp_allowed_mask.

    This change is based on an earlier patch from Hugh Dickins.
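
    A minimal sketch of the approach, with illustrative helper names (the
    exact helpers in the patch may differ):

        static gfp_t saved_gfp_mask;

        void pm_restrict_gfp_mask(void)
        {
                /* remember the current mask; forbid I/O and FS allocations */
                saved_gfp_mask = gfp_allowed_mask;
                gfp_allowed_mask &= ~GFP_IOFS;
        }

        void pm_restore_gfp_mask(void)
        {
                /* undo on power-off or when hibernation is aborted */
                if (saved_gfp_mask) {
                        gfp_allowed_mask = saved_gfp_mask;
                        saved_gfp_mask = 0;
                }
        }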

    Signed-off-by: Rafael J. Wysocki
    Reported-by: Ondrej Zary
    Acked-by: Hugh Dickins
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: stable@kernel.org

    Rafael J. Wysocki
     


04 Dec, 2010

1 commit

  • Commit f7cb1933621bce66a77f690776a16fe3ebbc4d58 ("SLUB: Pass active
    and inactive redzone flags instead of boolean to debug functions")
    missed two instances of check_object(). This caused a lot of warnings
    during 'slabinfo -v', eventually leading to a crash:

    BUG ext4_xattr: Freepointer corrupt
    ...
    BUG buffer_head: Freepointer corrupt
    ...
    BUG ext4_alloc_context: Freepointer corrupt
    ...
    ...
    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    IP: [] file_sb_list_del+0x1c/0x35
    PGD 79d78067 PUD 79e67067 PMD 0
    Oops: 0002 [#1] SMP
    last sysfs file: /sys/kernel/slab/:t-0000192/validate

    This patch fixes the problem by converting the two missed instances.
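
    For illustration, the conversion has this shape (a sketch, not the
    literal hunks):

        /* before: boolean "active" argument */
        check_object(s, page, object, 0);
        check_object(s, page, object, 1);

        /* after: explicit redzone flags, matching f7cb1933621b */
        check_object(s, page, object, SLUB_RED_INACTIVE);
        check_object(s, page, object, SLUB_RED_ACTIVE);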

    Acked-by: Christoph Lameter
    Signed-off-by: Tero Roponen
    Signed-off-by: Pekka Enberg

    Tero Roponen
     

03 Dec, 2010

6 commits

  • commit 62b61f611e ("ksm: memory hotremove migration only") caused the
    following new lockdep warning.

    =======================================================
    [ INFO: possible circular locking dependency detected ]
    -------------------------------------------------------
    bash/1621 is trying to acquire lock:
    ((memory_chain).rwsem){.+.+.+}, at: []
    __blocking_notifier_call_chain+0x69/0xc0

    but task is already holding lock:
    (ksm_thread_mutex){+.+.+.}, at: []
    ksm_memory_callback+0x3a/0xc0

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (ksm_thread_mutex){+.+.+.}:
    [] lock_acquire+0xaa/0x140
    [] __mutex_lock_common+0x44/0x3f0
    [] mutex_lock_nested+0x48/0x60
    [] ksm_memory_callback+0x3a/0xc0
    [] notifier_call_chain+0x8c/0xe0
    [] __blocking_notifier_call_chain+0x7e/0xc0
    [] blocking_notifier_call_chain+0x16/0x20
    [] memory_notify+0x1b/0x20
    [] remove_memory+0x1cc/0x5f0
    [] memory_block_change_state+0xfd/0x1a0
    [] store_mem_state+0xe2/0xf0
    [] sysdev_store+0x20/0x30
    [] sysfs_write_file+0xe6/0x170
    [] vfs_write+0xc8/0x190
    [] sys_write+0x54/0x90
    [] system_call_fastpath+0x16/0x1b

    -> #0 ((memory_chain).rwsem){.+.+.+}:
    [] __lock_acquire+0x155a/0x1600
    [] lock_acquire+0xaa/0x140
    [] down_read+0x51/0xa0
    [] __blocking_notifier_call_chain+0x69/0xc0
    [] blocking_notifier_call_chain+0x16/0x20
    [] memory_notify+0x1b/0x20
    [] remove_memory+0x56e/0x5f0
    [] memory_block_change_state+0xfd/0x1a0
    [] store_mem_state+0xe2/0xf0
    [] sysdev_store+0x20/0x30
    [] sysfs_write_file+0xe6/0x170
    [] vfs_write+0xc8/0x190
    [] sys_write+0x54/0x90
    [] system_call_fastpath+0x16/0x1b

    But it's a false positive. Both memory_chain.rwsem and ksm_thread_mutex
    are taken under an outer lock (mem_hotplug_mutex), so they cannot
    deadlock.

    Thus, this patch annotates ksm_thread_mutex as not being a deadlock
    source.
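
    The usual way to tell lockdep that such a nesting is intentional is
    mutex_lock_nested(); a sketch of what the annotation could look like
    in ksm_memory_callback():

        /*
         * Both locks are always taken under mem_hotplug_mutex, so the
         * reported cycle cannot actually occur; annotate the nesting.
         */
        mutex_lock_nested(&ksm_thread_mutex, SINGLE_DEPTH_NESTING);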

    [akpm@linux-foundation.org: update comment, from Hugh]
    Signed-off-by: KOSAKI Motohiro
    Acked-by: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Presently hwpoison uses lock_system_sleep() to prevent a race with
    memory hotplug. However, lock_system_sleep() is a no-op if
    CONFIG_HIBERNATION=n. Therefore we need a new lock.
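
    A sketch of what such a lock implies (names illustrative): a mutex
    that exists regardless of CONFIG_HIBERNATION, taken by both the
    hwpoison and the hotplug paths:

        static DEFINE_MUTEX(mem_hotplug_mutex);

        void lock_memory_hotplug(void)
        {
                mutex_lock(&mem_hotplug_mutex);
        }

        void unlock_memory_hotplug(void)
        {
                mutex_unlock(&mem_hotplug_mutex);
        }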

    Signed-off-by: KOSAKI Motohiro
    Cc: Andi Kleen
    Cc: Kamezawa Hiroyuki
    Suggested-by: Hugh Dickins
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • On stock 2.6.37-rc4, running:

    # mount lilith:/export /mnt/lilith
    # find /mnt/lilith/ -type f -print0 | xargs -0 file

    crashes the machine fairly quickly under Xen. Often it results in oops
    messages, but the couple of times I tried just now, it just hung quietly
    and made Xen print some rude messages:

    (XEN) mm.c:2389:d80 Bad type (saw 7400000000000001 != exp
    3000000000000000) for mfn 1d7058 (pfn 18fa7)
    (XEN) mm.c:964:d80 Attempt to create linear p.t. with write perms
    (XEN) mm.c:2389:d80 Bad type (saw 7400000000000010 != exp
    1000000000000000) for mfn 1d2e04 (pfn 1d1fb)
    (XEN) mm.c:2965:d80 Error while pinning mfn 1d2e04

    Which means the domain tried to map a pagetable page RW, which would
    allow it to map arbitrary memory, so Xen stopped it. This is because
    vm_unmap_ram() left some pages mapped in the vmalloc area after NFS had
    finished with them, and those pages got recycled as pagetable pages
    while still having these RW aliases.

    Removing those mappings immediately removes the Xen-visible aliases, and
    so it has no problem with those pages being reused as pagetable pages.
    Deferring the TLB flush doesn't upset Xen because it can flush the TLB
    itself as needed to maintain its invariants.

    When unmapping a region in the vmalloc space, clear the ptes
    immediately. There's no point in deferring this because there's no
    amortization benefit.

    The TLBs are left dirty, and they are flushed lazily to amortize the
    cost of the IPIs.

    The specific motivation for this patch is an oops-causing regression
    since 2.6.36 when using NFS under Xen, triggered by the NFS client's
    use of vm_map_ram() introduced in 56e4ebf877b60 ("NFS: readdir with
    vmapped pages"). XFS also uses vm_map_ram() and could cause similar
    problems.
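
    Conceptually, the unmap path changes from "defer everything" to
    "clear the ptes now, flush TLBs lazily"; a sketch using names from
    mm/vmalloc.c, but not the literal hunks:

        static void free_unmap_vmap_area(struct vmap_area *va)
        {
                flush_cache_vunmap(va->va_start, va->va_end);
                /* clear the ptes immediately, killing any aliases ... */
                vunmap_page_range(va->va_start, va->va_end);
                /* ... but leave the TLB flush to the lazy purge path */
                free_vmap_area_noflush(va);
        }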

    Signed-off-by: Jeremy Fitzhardinge
    Cc: Nick Piggin
    Cc: Bryan Schumaker
    Cc: Trond Myklebust
    Cc: Alex Elder
    Cc: Dave Chinner
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeremy Fitzhardinge
     
  • The nr_dirty_[background_]threshold fields are misplaced before the
    numa_* fields, so users will read strange values: before the patch,
    nr_dirty_background_threshold read as 0 (the value from numa_miss).

    This is the right order:

    numa_hit 128501
    numa_miss 0
    numa_foreign 0
    numa_interleave 7388
    numa_local 128501
    numa_other 0
    nr_dirty_threshold 144291
    nr_dirty_background_threshold 72145
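
    The values are emitted in enum order, so the names in vmstat_text
    must follow the same order; schematically (a sketch, with the
    surrounding entries elided in comments):

        static const char * const vmstat_text[] = {
                /* ... zone stat names ... */
                "numa_hit",
                "numa_miss",
                "numa_foreign",
                "numa_interleave",
                "numa_local",
                "numa_other",
                /* the thresholds belong here, after the numa_* fields */
                "nr_dirty_threshold",
                "nr_dirty_background_threshold",
                /* ... vm event names ... */
        };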

    Signed-off-by: Wu Fengguang
    Cc: Michael Rubin
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • find_task_by_vpid() should be protected by rcu_read_lock(), to
    prevent free_pid() from reclaiming the pid.
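
    The standard pattern at such a call site (a sketch, not the literal
    patch):

        rcu_read_lock();
        task = find_task_by_vpid(pid);
        if (task)
                get_task_struct(task);  /* pin it before leaving RCU */
        rcu_read_unlock();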

    Signed-off-by: Zeng Zhaoming
    Cc: "Paul E. McKenney"
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zeng Zhaoming
     
  • Have hugetlb_fault() call unlock_page(page) only if it had previously
    called lock_page(page).

    Setting CONFIG_DEBUG_VM=y and then running the libhugetlbfs test suite
    resulted in tripping the VM_BUG_ON(!PageLocked(page)) in
    unlock_page(), which had been called by hugetlb_fault() when page ==
    pagecache_page. This patch remedies the problem.
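
    The shape of the fix, as a sketch of the tail of hugetlb_fault():

        /*
         * Only unlock what was locked: 'page' was locked above only
         * when it is not the pagecache page.
         */
        if (page != pagecache_page)
                unlock_page(page);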

    Signed-off-by: Dean Nelson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dean Nelson
     

25 Nov, 2010

6 commits

  • Commit d33b9f45 ("mm: hugetlb: fix hugepage memory leak in
    walk_page_range()") introduced a check whether a vma is a hugetlbfs
    one, and later in 5dc37642 ("mm hugetlb: add hugepage support to
    pagemap") it was moved under #ifdef CONFIG_HUGETLB_PAGE, but a
    needless find_vma() call was left behind and its result is not used
    anywhere else in the function.

    The side effect of caching the vma for @addr inside walk->mm is
    utilized neither in walk_page_range() nor in the called functions.

    Signed-off-by: David Sterba
    Reviewed-by: Naoya Horiguchi
    Acked-by: Andi Kleen
    Cc: Andy Whitcroft
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Lee Schermerhorn
    Cc: Matt Mackall
    Acked-by: Mel Gorman
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Sterba
     
  • … under stop_machine_run()

    During memory hotplug, build_all_zonelists() may be called under
    stop_machine_run(). In this function, setup_zone_pageset() is called.
    But this is a bug, because it does page allocation under
    stop_machine_run().

    Here is a report from Alok Kataria.

    BUG: sleeping function called from invalid context at kernel/mutex.c:94
    in_atomic(): 0, irqs_disabled(): 1, pid: 4, name: migration/0
    Pid: 4, comm: migration/0 Not tainted 2.6.35.6-45.fc14.x86_64 #1
    Call Trace:
    [<ffffffff8103d12b>] __might_sleep+0xeb/0xf0
    [<ffffffff81468245>] mutex_lock+0x24/0x50
    [<ffffffff8110eaa6>] pcpu_alloc+0x6d/0x7ee
    [<ffffffff81048888>] ? load_balance+0xbe/0x60e
    [<ffffffff8103a1b3>] ? rt_se_boosted+0x21/0x2f
    [<ffffffff8103e1cf>] ? dequeue_rt_stack+0x18b/0x1ed
    [<ffffffff8110f237>] __alloc_percpu+0x10/0x12
    [<ffffffff81465e22>] setup_zone_pageset+0x38/0xbe
    [<ffffffff810d6d81>] ? build_zonelists_node.clone.58+0x79/0x8c
    [<ffffffff81452539>] __build_all_zonelists+0x419/0x46c
    [<ffffffff8108ef01>] ? cpu_stopper_thread+0xb2/0x198
    [<ffffffff8108f075>] stop_machine_cpu_stop+0x8e/0xc5
    [<ffffffff8108efe7>] ? stop_machine_cpu_stop+0x0/0xc5
    [<ffffffff8108ef57>] cpu_stopper_thread+0x108/0x198
    [<ffffffff81467a37>] ? schedule+0x5b2/0x5cc
    [<ffffffff8108ee4f>] ? cpu_stopper_thread+0x0/0x198
    [<ffffffff81065f29>] kthread+0x7f/0x87
    [<ffffffff8100aae4>] kernel_thread_helper+0x4/0x10
    [<ffffffff81065eaa>] ? kthread+0x0/0x87
    [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
    Built 5 zonelists in Node order, mobility grouping on. Total pages: 289456
    Policy zone: Normal

    This patch fixes the issue by moving setup_zone_pageset() out from
    under stop_machine_run(). It obviously does not need to be called
    there.
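
    Schematically, the sleeping allocation is hoisted out of the
    stop_machine section (a sketch, not the literal patch):

        /* in build_all_zonelists(): process context, may sleep */
        if (zone)       /* non-NULL when called from memory hotplug */
                setup_zone_pageset(zone);
        /* only the zonelist rebuild itself runs under stop_machine() */
        stop_machine(__build_all_zonelists, pgdat, NULL);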

    [akpm@linux-foundation.org: remove unneeded local]
    Reported-by: Alok Kataria <akataria@vmware.com>
    Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Petr Vandrovec <petr@vmware.com>
    Cc: Pekka Enberg <penberg@cs.helsinki.fi>
    Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    KAMEZAWA Hiroyuki
     
  • Swap accounting can be enabled by the CONFIG_CGROUP_MEM_RES_CTLR_SWAP
    configuration option, and it is then turned on by default. There is a
    boot option (noswapaccount) which can disable this feature.

    This makes it hard for distributors to enable the configuration
    option, as this feature leads to bigger memory consumption, which is
    a no-go for a general-purpose distribution kernel. On the other hand,
    swap accounting may be very useful for some workloads.

    This patch adds a new configuration option which controls the default
    behavior (CGROUP_MEM_RES_CTLR_SWAP_ENABLED). If the option is selected
    then the feature is turned on by default.

    It also adds a new boot parameter, swapaccount[=1|0], which extends
    the original noswapaccount parameter semantics with enable/disable
    logic (it defaults to 1 if no value is provided, to stay consistent
    with noswapaccount).

    The default behavior is unchanged (if CONFIG_CGROUP_MEM_RES_CTLR_SWAP
    is enabled then CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED is enabled as
    well).
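
    A sketch of such a boot-parameter parser, following the semantics
    described above (the actual patch may differ in detail):

        #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED
        static int really_do_swap_account __initdata = 1;
        #else
        static int really_do_swap_account __initdata = 0;
        #endif

        static int __init enable_swap_account(char *s)
        {
                /* bare "swapaccount" or "swapaccount=1" enables it */
                if (!*s || !strcmp(s, "=1"))
                        really_do_swap_account = 1;
                else if (!strcmp(s, "=0"))
                        really_do_swap_account = 0;
                return 1;
        }
        __setup("swapaccount", enable_swap_account);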

    Signed-off-by: Michal Hocko
    Acked-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • __mem_cgroup_try_charge() can be called under down_write(&mmap_sem)
    (e.g. mlock does it). This means it can cause a deadlock if it races
    with move charge:

    Ex.1)
    move charge | try charge
    --------------------------------------+------------------------------
    mem_cgroup_can_attach() | down_write(&mmap_sem)
    mc.moving_task = current | ..
    mem_cgroup_precharge_mc() | __mem_cgroup_try_charge()
    mem_cgroup_count_precharge() | prepare_to_wait()
    down_read(&mmap_sem) | if (mc.moving_task)
    -> cannot acquire the lock | -> true
    | schedule()

    Ex.2)
    move charge | try charge
    --------------------------------------+------------------------------
    mem_cgroup_can_attach() |
    mc.moving_task = current |
    mem_cgroup_precharge_mc() |
    mem_cgroup_count_precharge() |
    down_read(&mmap_sem) |
    .. |
    up_read(&mmap_sem) |
    | down_write(&mmap_sem)
    mem_cgroup_move_task() | ..
    mem_cgroup_move_charge() | __mem_cgroup_try_charge()
    down_read(&mmap_sem) | prepare_to_wait()
    -> cannot acquire the lock | if (mc.moving_task)
    | -> true
    | schedule()

    To avoid this deadlock, we do all the move charge work (both
    can_attach() and attach()) under one mmap_sem section. And after this
    patch, we set/clear mc.moving_task outside mc.lock, because we use
    the lock only to check mc.from/to.

    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • Fix this:

    kernel BUG at mm/memcontrol.c:2155!
    invalid opcode: 0000 [#1]
    last sysfs file:

    Pid: 18, comm: sh Not tainted 2.6.37-rc3 #3 /Bochs
    EIP: 0060:[] EFLAGS: 00000246 CPU: 0
    EIP is at mem_cgroup_move_account+0xe2/0xf0
    EAX: 00000004 EBX: c6f931d4 ECX: c681c300 EDX: c681c000
    ESI: c681c300 EDI: ffffffea EBP: c681c000 ESP: c46f3e30
    DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
    Process sh (pid: 18, ti=c46f2000 task=c6826e60 task.ti=c46f2000)
    Stack:
    00000155 c681c000 0805f000 c46ee180 c46f3e5c c7058820 c1074d37 00000000
    08060000 c46db9a0 c46ec080 c7058820 0805f000 08060000 c46f3e98 c1074c50
    c106c75e c46f3e98 c46ec080 08060000 0805ffff c46db9a0 c46f3e98 c46e0340
    Call Trace:
    [] ? mem_cgroup_move_charge_pte_range+0xe7/0x130
    [] ? mem_cgroup_move_charge_pte_range+0x0/0x130
    [] ? walk_page_range+0xee/0x1d0
    [] ? mem_cgroup_move_task+0x66/0x90
    [] ? mem_cgroup_move_charge_pte_range+0x0/0x130
    [] ? mem_cgroup_move_task+0x0/0x90
    [] ? cgroup_attach_task+0x136/0x200
    [] ? cgroup_tasks_write+0x48/0xc0
    [] ? cgroup_file_write+0xde/0x220
    [] ? do_page_fault+0x17d/0x3f0
    [] ? alloc_fd+0x2d/0xd0
    [] ? cgroup_file_write+0x0/0x220
    [] ? vfs_write+0x92/0xc0
    [] ? sys_write+0x41/0x70
    [] ? syscall_call+0x7/0xb
    Code: 03 00 74 09 8b 44 24 04 e8 1c f1 ff ff 89 73 04 8d 86 b0 00 00 00 b9 01 00 00 00 89 da 31 ff e8 65 f5 ff ff e9 4d ff ff ff 0f 0b 0b 0f 0b 0f 0b 90 8d b4 26 00 00 00 00 83 ec 10 8b 0d f4 e3
    EIP: [] mem_cgroup_move_account+0xe2/0xf0 SS:ESP 0068:c46f3e30
    ---[ end trace 7daa1582159b6532 ]---

    lock_page_cgroup() and unlock_page_cgroup() are implemented using
    bit_spinlock. bit_spinlock doesn't touch the bit if we are on a
    non-SMP machine, so we can't use the bit to check whether the lock
    was taken.

    Let's introduce is_page_cgroup_locked(), based on bit_spin_is_locked()
    instead of PageCgroupLocked, to fix it.
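
    A sketch of the helper (named page_is_cgroup_locked() after the akpm
    rename noted below):

        static inline int page_is_cgroup_locked(struct page_cgroup *pc)
        {
                /* unlike testing the bit directly, this works on !SMP too */
                return bit_spin_is_locked(PCG_LOCK, &pc->flags);
        }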

    [akpm@linux-foundation.org: s/is_page_cgroup_locked/page_is_cgroup_locked/]
    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Johannes Weiner
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Depending on processor speed, page size, and the amount of memory a
    process is allowed to amass, cleanup of a large VM may freeze the system
    for many seconds. This can result in a watchdog timeout.

    Make sure other tasks receive some service when cleaning up large VMs.
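
    The classic remedy is a cond_resched() in the teardown loop; a sketch
    of a nommu-style exit path (illustrative):

        /* release the VMAs one at a time, yielding the CPU in between */
        while ((vma = mm->mmap)) {
                mm->mmap = vma->vm_next;
                delete_vma_from_mm(vma);
                delete_vma(mm, vma);
                cond_resched();
        }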

    Signed-off-by: Steven J. Magnani
    Cc: Greg Ungerer
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven J. Magnani
     


14 Nov, 2010

1 commit

  • There are two places that do not release the slub_lock.

    The respective bugs were introduced by the sysfs changes ab4d5ed5
    ("slub: Enable sysfs support for !CONFIG_SLUB_DEBUG") and 2bce6485
    ("slub: Allow removal of slab caches during boot").

    Acked-by: Christoph Lameter
    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Pekka Enberg

    Pavel Emelyanov
     

12 Nov, 2010

4 commits

  • Salman Qazi describes the following radix-tree bug:

    In the following case, we can get a deadlock:

    0. The radix tree contains two items, one of which has index 0.
    1. The reader (in this case find_get_pages) takes the rcu_read_lock.
    2. The reader acquires slot(s) for item(s) including the index 0 item.
    3. The non-zero index item is deleted, and as a consequence the other
       item is moved to the root of the tree. The place where it used to
       be is queued for deletion after the readers finish.
    3b. The zero item is deleted, removing it from the direct slot; it
       remains in the rcu-delayed indirect node.
    4. The reader looks at the index 0 slot, and finds that the page has
       a 0 ref count.
    5. The reader looks at it again, hoping that the item will either be
       freed or the ref count will increase. This never happens, as the
       slot it is looking at will never be updated. Also, this slot can
       never be reclaimed because the reader is holding rcu_read_lock and
       is in an infinite loop.

    The fix is to extend the existing "indirect pointer" case, which
    already requires a slot-lookup retry, into a general "retry the
    lookup" bit.
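
    A sketch of the resulting lookup pattern (the helper name matches
    mainline; details may differ):

        page = radix_tree_deref_slot(slot);
        if (unlikely(!page))
                continue;
        /*
         * A moved entry leaves the retry bit behind; restart the whole
         * lookup instead of spinning on a stale slot.
         */
        if (radix_tree_deref_retry(page))
                goto restart;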

    Signed-off-by: Nick Piggin
    Reported-by: Salman Qazi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • nr_dirty and nr_congested are increased only when the page is dirty.
    So if all pages are clean, both of them will be zero. In this case,
    we should not mark the zone congested.
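
    The check thus becomes, schematically:

        /*
         * Mark the zone congested only if dirty pages were seen at all
         * and every one of them hit congestion on writeback.
         */
        if (nr_dirty && nr_dirty == nr_congested)
                zone_set_flag(zone, ZONE_CONGESTED);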

    Signed-off-by: Shaohua Li
    Reviewed-by: Johannes Weiner
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • 70 hours into some stress tests of a 2.6.32-based enterprise kernel, we
    ran into a NULL dereference in here:

        int block_is_partially_uptodate(struct page *page, read_descriptor_t *desc,
                                        unsigned long from)
        {
    --->        struct inode *inode = page->mapping->host;

    It looks like page->mapping was the culprit (the xmon trace is below).
    After closer examination, I realized that do_generic_file_read() does
    a find_get_page(), and eventually locks the page before calling
    block_is_partially_uptodate(). However, it doesn't revalidate
    page->mapping after the page is locked. So there's a small window
    between the find_get_page() and ->is_partially_uptodate() where the
    page could get truncated and page->mapping cleared.

    We _have_ a reference, so it can't get reclaimed, but it certainly
    can be truncated.

    I think the correct thing is to check page->mapping after the
    trylock_page(), and jump out if it got truncated. This patch has been
    running in the test environment for a month or so now, and we have not
    seen this bug pop up again.
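
    A sketch of the revalidation in do_generic_file_read(), consistent
    with the description above:

        if (!trylock_page(page))
                goto page_not_up_to_date;
        /* Did it get truncated before we got the lock? */
        if (!page->mapping)
                goto page_not_up_to_date_locked;
        if (!mapping->a_ops->is_partially_uptodate(page, desc, offset))
                goto page_not_up_to_date_locked;
        unlock_page(page);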

    xmon info:

    1f:mon> e
    cpu 0x1f: Vector: 300 (Data Access) at [c0000002ae36f770]
    pc: c0000000001e7a6c: .block_is_partially_uptodate+0xc/0x100
    lr: c000000000142944: .generic_file_aio_read+0x1e4/0x770
    sp: c0000002ae36f9f0
    msr: 8000000000009032
    dar: 0
    dsisr: 40000000
    current = 0xc000000378f99e30
    paca = 0xc000000000f66300
    pid = 21946, comm = bash
    1f:mon> r
    R00 = 0025c0500000006d R16 = 0000000000000000
    R01 = c0000002ae36f9f0 R17 = c000000362cd3af0
    R02 = c000000000e8cd80 R18 = ffffffffffffffff
    R03 = c0000000031d0f88 R19 = 0000000000000001
    R04 = c0000002ae36fa68 R20 = c0000003bb97b8a0
    R05 = 0000000000000000 R21 = c0000002ae36fa68
    R06 = 0000000000000000 R22 = 0000000000000000
    R07 = 0000000000000001 R23 = c0000002ae36fbb0
    R08 = 0000000000000002 R24 = 0000000000000000
    R09 = 0000000000000000 R25 = c000000362cd3a80
    R10 = 0000000000000000 R26 = 0000000000000002
    R11 = c0000000001e7b60 R27 = 0000000000000000
    R12 = 0000000042000484 R28 = 0000000000000001
    R13 = c000000000f66300 R29 = c0000003bb97b9b8
    R14 = 0000000000000001 R30 = c000000000e28a08
    R15 = 000000000000ffff R31 = c0000000031d0f88
    pc = c0000000001e7a6c .block_is_partially_uptodate+0xc/0x100
    lr = c000000000142944 .generic_file_aio_read+0x1e4/0x770
    msr = 8000000000009032 cr = 22000488
    ctr = c0000000001e7a60 xer = 0000000020000000 trap = 300
    dar = 0000000000000000 dsisr = 40000000
    1f:mon> t
    [link register ] c000000000142944 .generic_file_aio_read+0x1e4/0x770
    [c0000002ae36f9f0] c000000000142a14 .generic_file_aio_read+0x2b4/0x770 (unreliable)
    [c0000002ae36fb40] c0000000001b03e4 .do_sync_read+0xd4/0x160
    [c0000002ae36fce0] c0000000001b153c .vfs_read+0xec/0x1f0
    [c0000002ae36fd80] c0000000001b1768 .SyS_read+0x58/0xb0
    [c0000002ae36fe30] c00000000000852c syscall_exit+0x0/0x40
    --- Exception: c00 (System Call) at 00000080a840bc54
    SP (fffca15df30) is in userspace
    1f:mon> di c0000000001e7a6c
    c0000000001e7a6c e9290000 ld r9,0(r9)
    c0000000001e7a70 418200c0 beq c0000000001e7b30 # .block_is_partially_uptodate+0xd0/0x100
    c0000000001e7a74 e9440008 ld r10,8(r4)
    c0000000001e7a78 78a80020 clrldi r8,r5,32
    c0000000001e7a7c 3c000001 lis r0,1
    c0000000001e7a80 812900a8 lwz r9,168(r9)
    c0000000001e7a84 39600001 li r11,1
    c0000000001e7a88 7c080050 subf r0,r8,r0
    c0000000001e7a8c 7f805040 cmplw cr7,r0,r10
    c0000000001e7a90 7d6b4830 slw r11,r11,r9
    c0000000001e7a94 796b0020 clrldi r11,r11,32
    c0000000001e7a98 419d00a8 bgt cr7,c0000000001e7b40 # .block_is_partially_uptodate+0xe0/0x100
    c0000000001e7a9c 7fa55840 cmpld cr7,r5,r11
    c0000000001e7aa0 7d004214 add r8,r0,r8
    c0000000001e7aa4 79080020 clrldi r8,r8,32
    c0000000001e7aa8 419c0078 blt cr7,c0000000001e7b20 # .block_is_partially_uptodate+0xc0/0x100

    Signed-off-by: Dave Hansen
    Reviewed-by: Minchan Kim
    Reviewed-by: Johannes Weiner
    Acked-by: Rik van Riel
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • The original code had a NULL dereference if alloc_percpu() failed.
    This was introduced in commit 711d3d2c9bc3 ("memcg: cpu hotplug aware
    percpu count updates").
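
    The shape of the fix, as a sketch of mem_cgroup_alloc():

        mem->stat = alloc_percpu(struct mem_cgroup_stat_cpu);
        if (!mem->stat)
                goto out_free;  /* bail out instead of dereferencing NULL */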

    Signed-off-by: Dan Carpenter
    Reviewed-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter
     

10 Nov, 2010

1 commit

  • As pointed out by Linus, commit dab5855 ("perf_counter: Add mmap
    event hooks to mprotect()") is fundamentally wrong, as
    mprotect_fixup() can free 'vma' due to merging. Fix the problem by
    moving the perf_event_mmap() hook into mprotect_fixup().

    Note: there's another successful return path from mprotect_fixup() if
    the old flags equal the new flags. We don't, however, need to call
    perf_event_mmap() there because 'perf' already knows the VMA is
    executable.
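
    Schematically, the hook moves to a point where the (possibly merged)
    vma is still known to be valid (a sketch):

        /* tail of mprotect_fixup(): any merge/split happened above,
         * and 'vma' is still valid here, so notify perf now */
        perf_event_mmap(vma);
        return 0;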

    Reported-by: Dave Jones
    Analyzed-by: Linus Torvalds
    Cc: Ingo Molnar
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Pekka Enberg
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     

04 Nov, 2010

1 commit

  • Fix a regression introduced by commit 79da826aee6 ("writeback: report
    dirty thresholds in /proc/vmstat").

    The incorrect pointer arithmetic can result in problems like this:

    BUG: unable to handle kernel paging request at 07c06d16
    IP: [] strnlen+0x6/0x20
    Call Trace:
    [] ? string+0x39/0xe0
    [] ? __wake_up_common+0x4b/0x80
    [] ? vsnprintf+0x1ec/0x380
    [] ? seq_printf+0x2e/0x60
    [] ? vmstat_show+0x26/0x30
    [] ? seq_read+0xa6/0x380
    [] ? seq_read+0x0/0x380
    [] ? proc_reg_read+0x5f/0x90
    [] ? vfs_read+0xa1/0x140
    [] ? proc_reg_read+0x0/0x90
    [] ? sys_read+0x41/0x70
    [] ? sysenter_do_call+0x12/0x26
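
    A sketch of the corrected arithmetic in vmstat_start(): each stat
    group is written at its own offset into one buffer, and the cursor
    advances by whole groups (consistent with the mainline fix, though
    the exact code may differ):

        for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
                v[i] = global_page_state(i);
        v += NR_VM_ZONE_STAT_ITEMS;

        global_dirty_limits(v + NR_DIRTY_BG_THRESHOLD,
                            v + NR_DIRTY_THRESHOLD);
        v += NR_VM_WRITEBACK_STAT_ITEMS;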

    Reported-by: Tetsuo Handa
    Cc: Michael Rubin
    Signed-off-by: Wu Fengguang
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     


30 Oct, 2010

1 commit

  • Normal syscall audit doesn't catch the 5th argument of a syscall. It
    also doesn't catch the contents of userland structures pointed to by
    a syscall argument, so for both the old and new mmap(2) ABIs it
    doesn't record the descriptor we are mapping. For the old one it also
    misses the flags.

    Signed-off-by: Al Viro

    Al Viro
     

29 Oct, 2010

2 commits

  • Signed-off-by: Al Viro

    Al Viro
     
  • When a node contains only HighMem memory, slab_node(MPOL_BIND)
    dereferences a NULL pointer.

    [ This code seems to go back all the way to commit 19770b32609b: "mm:
    filter based on a nodemask as well as a gfp_mask". Which was back in
    April 2008, and it got merged into 2.6.26. - Linus ]
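
    A sketch of a NULL-safe MPOL_BIND case in slab_node(), consistent
    with the description:

        struct zonelist *zonelist;
        struct zone *zone;
        enum zone_type highest_zoneidx = gfp_zone(GFP_KERNEL);

        zonelist = &NODE_DATA(numa_node_id())->node_zonelists[0];
        (void)first_zones_zonelist(zonelist, highest_zoneidx,
                                   &policy->v.nodes, &zone);
        /* zone is NULL when the node has no usable (non-HighMem) zone */
        return zone ? zone->node : numa_node_id();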

    Signed-off-by: Eric Dumazet
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: Andrew Morton
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

28 Oct, 2010

10 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-mn10300: (44 commits)
    MN10300: Save frame pointer in thread_info struct rather than global var
    MN10300: Change "Matsushita" to "Panasonic".
    MN10300: Create a defconfig for the ASB2364 board
    MN10300: Update the ASB2303 defconfig
    MN10300: ASB2364: Add support for SMSC911X and SMC911X
    MN10300: ASB2364: Handle the IRQ multiplexer in the FPGA
    MN10300: Generic time support
    MN10300: Specify an ELF HWCAP flag for MN10300 Atomic Operations Unit support
    MN10300: Map userspace atomic op regs as a vmalloc page
    MN10300: And Panasonic AM34 subarch and implement SMP
    MN10300: Delete idle_timestamp from irq_cpustat_t
    MN10300: Make various interrupt priority settings configurable
    MN10300: Optimise do_csum()
    MN10300: Implement atomic ops using atomic ops unit
    MN10300: Make the FPU operate in non-lazy mode under SMP
    MN10300: SMP TLB flushing
    MN10300: Use the [ID]PTEL2 registers rather than [ID]PTEL for TLB control
    MN10300: Make the use of PIDR to mark TLB entries controllable
    MN10300: Rename __flush_tlb*() to local_flush_tlb*()
    MN10300: AM34 erratum requires MMUCTR read and write on exception entry
    ...

    Linus Torvalds
     
  • Replace the iterated page_cache_release() with release_pages(), which
    is faster and shorter.

    This needs release_pages() to be exported to modules.
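
    The pattern being replaced, schematically (field names illustrative):

        /* before: one atomic put per page */
        for (i = 0; i < req->num_pages; i++)
                page_cache_release(req->pages[i]);

        /* after: one batched call (the last argument is the "cold" hint) */
        release_pages(req->pages, req->num_pages, 0);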

    Suggested-by: Andrew Morton
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
  • This patch extracts the core logic from mem_cgroup_update_file_mapped()
    as mem_cgroup_update_file_stat() and adds a wrapper.

    As a planned future update, the memory cgroup has to count dirty pages
    to implement dirty_ratio/limit. Moreover, the number of dirty pages is
    required to kick the flusher thread to start writeback (at present,
    there is no kick).

    This patch is preparation for that and makes the other statistics
    implementation clearer. Just a clean-up.
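
    A sketch of the resulting wrapper:

        void mem_cgroup_update_file_mapped(struct page *page, int val)
        {
                /* thin wrapper; dirty/writeback stats can reuse the core */
                mem_cgroup_update_file_stat(page,
                                MEM_CGROUP_STAT_FILE_MAPPED, val);
        }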

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Reviewed-by: Greg Thelen
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • An event counter, MEM_CGROUP_ON_MOVE, is used as a quick check of
    whether a file stat update can be done asynchronously or not. At
    present it uses a percpu counter and for_each_possible_cpu() to
    update.

    This patch replaces for_each_possible_cpu() with for_each_online_cpu()
    and adds the necessary synchronization logic for CPU hotplug.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • At present, the memory cgroup's per-cpu counter uses
    for_each_possible_cpu() to get the value. It's better to use
    for_each_online_cpu() and a CPU hotplug handler.

    This patch only handles the statistics counters. MEM_CGROUP_ON_MOVE
    will be handled in another patch.
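
    A sketch of the pattern (field and callback names illustrative): read
    only online CPUs, and fold a dying CPU's counts into a saved base
    from a hotplug callback so nothing is lost:

        /* reader side: online CPUs plus what dead CPUs left behind */
        s64 val = mem->nocpu_base.count[idx];

        for_each_online_cpu(cpu)
                val += per_cpu_ptr(mem->stat, cpu)->count[idx];

        /* CPU_DEAD callback: drain the dead CPU's counts into the base */
        for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
                mem->nocpu_base.count[i] +=
                        per_cpu_ptr(mem->stat, cpu)->count[i];
                per_cpu_ptr(mem->stat, cpu)->count[i] = 0;
        }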

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • In memory cgroup management, we sometimes have to walk through a
    subhierarchy of cgroups to gather information, lock something, etc.

    Now, the mem_cgroup_walk_tree() function is provided for that. It
    calls the given callback function for each cgroup found. But the bad
    thing is that it has to be passed a fixed-style function and a
    "void *" argument, which adds a lot of type casting to memcontrol.c.

    To make the code clean, this patch replaces walk_tree() with

    for_each_mem_cgroup_tree(iter, root)

    an iterator-style call. The good point is that the iterator call
    doesn't have to assume what kind of function is called under it. A
    bad point is that it may cause a reference-count leak if a caller
    uses "break" to leave the loop by mistake.

    I think the benefit is larger. The modified code seems straightforward
    and easy to read, because we don't have mysterious callbacks and
    pointer casts.
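
    A sketch of what such an iterator macro can look like (illustrative,
    not the literal definition):

        #define for_each_mem_cgroup_tree(iter, root)            \
                for (iter = mem_cgroup_start_loop(root);        \
                     iter != NULL;                              \
                     iter = mem_cgroup_get_next(iter, root, true))

        /* usage: no callback and no void* casts at the call site */
        for_each_mem_cgroup_tree(iter, mem)
                num++;          /* e.g. count the cgroups in the subtree */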

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • When accounting file events per memory cgroup, we need to find the
    memory cgroup via page_cgroup->mem_cgroup. At present, we use
    lock_page_cgroup() to guarantee that pc->mem_cgroup is not overwritten
    while we make use of it.

    But, considering the context in which page_cgroups for files are
    accessed, we can use an alternative lightweight mutual exclusion in
    most cases.

    When handling file caches, the only race we have to take care of is
    "moving" the account, IOW, overwriting page_cgroup->mem_cgroup. (See
    the comment in the patch.)

    Unlike charge/uncharge, "move" happens infrequently: only at rmdir()
    and at task moving (with special settings).

    This patch adds a race checker for file-cache-status accounting vs.
    account moving. The new per-cpu-per-memcg counter MEM_CGROUP_ON_MOVE
    is added. The routine for an account move is:
    1. Increment it before starting the move
    2. Call synchronize_rcu()
    3. Decrement it after the end of the move
    By this, the file-status-counting routine can check whether it needs
    to call lock_page_cgroup(). In most cases, it doesn't need to.
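
    A rough sketch of the protocol (names illustrative, not the literal
    memcontrol.c code; see steps 1-3 above):

        /* mover side */
        static void mem_cgroup_start_move(struct mem_cgroup *mem)
        {
                int cpu;

                for_each_online_cpu(cpu)        /* step 1 */
                        per_cpu_ptr(mem->stat, cpu)
                                ->count[MEM_CGROUP_ON_MOVE] += 1;
                synchronize_rcu();              /* step 2 */
        }
        /* ... move the accounts, then decrement again (step 3) ... */

        /* file-stat updater side, under rcu_read_lock() */
        if (unlikely(mem_cgroup_on_move(mem)))  /* a move in flight? */
                lock_page_cgroup(pc);           /* fall back to the lock */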

    The following is perf data of a process which mmap()s/munmap()s 32MB
    of file cache in a minute.

    Before patch:
    28.25% mmap mmap [.] main
    22.64% mmap [kernel.kallsyms] [k] page_fault
    9.96% mmap [kernel.kallsyms] [k] mem_cgroup_update_file_mapped
    3.67% mmap [kernel.kallsyms] [k] filemap_fault
    3.50% mmap [kernel.kallsyms] [k] unmap_vmas
    2.99% mmap [kernel.kallsyms] [k] __do_fault
    2.76% mmap [kernel.kallsyms] [k] find_get_page

    After patch:
    30.00% mmap mmap [.] main
    23.78% mmap [kernel.kallsyms] [k] page_fault
    5.52% mmap [kernel.kallsyms] [k] mem_cgroup_update_file_mapped
    3.81% mmap [kernel.kallsyms] [k] unmap_vmas
    3.26% mmap [kernel.kallsyms] [k] find_get_page
    3.18% mmap [kernel.kallsyms] [k] __do_fault
    3.03% mmap [kernel.kallsyms] [k] filemap_fault
    2.40% mmap [kernel.kallsyms] [k] handle_mm_fault
    2.40% mmap [kernel.kallsyms] [k] do_page_fault

    This patch reduces memcg's cost to some extent.
    (mem_cgroup_update_file_mapped is called by both map and unmap.)

    Note: it seems some more improvements are still required, but it's
    not yet clear which; maybe removing the set/unset of the flag is
    needed.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Presently the memory cgroup accounts file-mapped pages with a counter
    and a flag. The counter works in the same way as zone_stat, but the
    FileMapped flag exists only in memcg (to help move_account).

    This flag can be updated wrongly in one case. Assume CPU0 and CPU1,
    with one thread mapping a page on CPU0 and another thread unmapping
    it on CPU1.

    CPU0                              CPU1
                                      rmv rmap (mapcount 1->0)
    add rmap (mapcount 0->1)
    lock_page_cgroup()
    memcg counter+1                   (some delay)
    set MAPPED FLAG.
    unlock_page_cgroup()
                                      lock_page_cgroup()
                                      memcg counter-1
                                      clear MAPPED flag

    In the above sequence the counter is properly updated but the FLAG is
    not. This means that representing a state by a flag which is
    maintained by a counter needs some special care.

    To handle this, when clearing the flag, this patch checks the
    mapcount directly and clears the flag only when mapcount == 0. (If
    mapcount > 0, someone will bring it to zero later and the flag will
    be cleared then.)

    The reverse case, dec-after-inc, cannot be a problem because
    page_table_lock() works well for it. (IOW, to produce the above
    sequence, two processes would have to touch the same page at once
    with map/unmap.)
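
    A sketch of the resulting update logic:

        if (val > 0) {
                __this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
                SetPageCgroupFileMapped(pc);
        } else {
                __this_cpu_dec(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
                if (!page_mapped(page))
                        /* only the last unmap may clear the flag */
                        ClearPageCgroupFileMapped(pc);
        }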

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • It appears i386 uses the kmap_atomic infrastructure regardless of
    CONFIG_HIGHMEM, which results in a compile error when highmem is
    disabled.

    Cure this by providing the few needed bits for both CONFIG_HIGHMEM
    and CONFIG_X86_32.
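
    Schematically (a sketch of the highmem.h bits; exact helpers may
    differ):

        /* the kmap_atomic slot bookkeeping is needed by 32-bit x86
         * even when CONFIG_HIGHMEM is disabled */
        #if defined(CONFIG_HIGHMEM) || defined(CONFIG_X86_32)

        DECLARE_PER_CPU(int, __kmap_atomic_idx);

        static inline int kmap_atomic_idx_push(void)
        {
                return __this_cpu_inc_return(__kmap_atomic_idx) - 1;
        }

        static inline void kmap_atomic_idx_pop(void)
        {
                __this_cpu_dec(__kmap_atomic_idx);
        }

        #endif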

    Signed-off-by: Peter Zijlstra
    Reported-by: Chris Wilson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Save the current exception frame pointer in the thread_info struct
    rather than in a global variable, as the latter makes SMP tricky,
    especially when preemption is also enabled.

    This also replaces __frame with current_frame() and rearranges header file
    inclusions to make it all compile.
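
    A sketch of the replacement accessor (illustrative; see the MN10300
    headers for the real definition):

        /* per-thread exception frame pointer: SMP- and preempt-safe,
         * unlike a single global __frame variable */
        #define current_frame() (current_thread_info()->frame)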

    Signed-off-by: David Howells
    Acked-by: Akira Takeuchi

    David Howells
     

27 Oct, 2010

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
    split invalidate_inodes()
    fs: skip I_FREEING inodes in writeback_sb_inodes
    fs: fold invalidate_list into invalidate_inodes
    fs: do not drop inode_lock in dispose_list
    fs: inode split IO and LRU lists
    fs: switch bdev inode bdi's correctly
    fs: fix buffer invalidation in invalidate_list
    fsnotify: use dget_parent
    smbfs: use dget_parent
    exportfs: use dget_parent
    fs: use RCU read side protection in d_validate
    fs: clean up dentry lru modification
    fs: split __shrink_dcache_sb
    fs: improve DCACHE_REFERENCED usage
    fs: use percpu counter for nr_dentry and nr_dentry_unused
    fs: simplify __d_free
    fs: take dcache_lock inside __d_path
    fs: do not assign default i_ino in new_inode
    fs: introduce a per-cpu last_ino allocator
    new helper: ihold()
    ...

    Linus Torvalds
     
  • PF_FLUSHER is only ever set, never tested; remove it.

    Signed-off-by: Peter Zijlstra
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra