10 Sep, 2010

15 commits

  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    block: Range check cpu in blk_cpu_to_group
    scatterlist: prevent invalid free when alloc fails
    writeback: Fix lost wake-up shutting down writeback thread
    writeback: do not lose wakeup events when forking bdi threads
    cciss: fix reporting of max queue depth since init
    block: switch s390 tape_block and mg_disk to elevator_change()
    block: add function call to switch the IO scheduler from a driver
    fs/bio-integrity.c: return -ENOMEM on kmalloc failure
    bio-integrity.c: remove dependency on __GFP_NOFAIL
    BLOCK: fix bio.bi_rw handling
    block: put dev->kobj in blk_register_queue fail path
    cciss: handle allocation failure
    cfq-iosched: Documentation help for new tunables
    cfq-iosched: blktrace print per slice sector stats
    cfq-iosched: Implement tunable group_idle
    cfq-iosched: Do group share accounting in IOPS when slice_idle=0
    cfq-iosched: Do not idle if slice_idle=0
    cciss: disable doorbell reset on reset_devices
    blkio: Fix return code for mkdir calls

    Linus Torvalds
     
  • When under significant memory pressure, a process enters direct reclaim
    and immediately afterwards tries to allocate a page. If it fails and no
    further progress is made, it's possible the system will go OOM. However,
    on systems with large amounts of memory, it's possible that a significant
    number of pages are on per-cpu lists and inaccessible to the calling
    process. This leads to a process entering direct reclaim more often than
    it should, increasing the pressure on the system and compounding the
    problem.

    This patch notes that if direct reclaim is making progress but allocations
    are still failing, the system is already under heavy pressure. In
    this case, it drains the per-cpu lists and tries the allocation a second
    time before continuing.

    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Cc: Dave Chinner
    Cc: Wu Fengguang
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
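
A rough sketch of the retry described in the entry above; the helper names here (`do_direct_reclaim`, `try_alloc_page`, `drain_local_pcp_lists`) are simplified stand-ins for the allocator's internals, not the actual kernel functions:

```c
#include <stdbool.h>

/* Simplified stand-ins for the allocator internals (illustrative only). */
struct page;
struct page *try_alloc_page(void);          /* take a page from the free lists       */
unsigned long do_direct_reclaim(void);      /* returns the number of pages reclaimed */
void drain_local_pcp_lists(void);           /* flush this CPU's per-cpu page lists   */

/*
 * If reclaim made progress but the allocation still failed, free pages may be
 * stranded on per-cpu lists: drain them and retry once before giving up.
 */
static struct page *alloc_page_after_reclaim(void)
{
	unsigned long progress = do_direct_reclaim();
	bool drained = false;
	struct page *page;

retry:
	page = try_alloc_page();
	if (!page && progress && !drained) {
		drain_local_pcp_lists();
		drained = true;
		goto retry;
	}
	return page;
}
```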
     
  • …low and kswapd is awake

    Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as it is
    cheaper than scanning a number of lists. To avoid synchronization
    overhead, counter deltas are maintained on a per-cpu basis and drained
    both periodically and when the delta is above a threshold. On large CPU
    systems, the difference between the estimated and real value of
    NR_FREE_PAGES can be very high. If NR_FREE_PAGES is much higher than the
    number of pages actually free in the buddy allocator, the VM can allocate
    pages below the min watermark, at worst reducing the real number of free
    pages to zero. Even if the OOM killer kills a victim to free memory, no
    memory may actually be freed if the exit path requires a new page,
    resulting in livelock.

    This patch introduces a zone_page_state_snapshot() function (courtesy of
    Christoph) that takes a slightly more accurate view of an arbitrary vmstat
    counter. It is used to read NR_FREE_PAGES while kswapd is awake to avoid
    the watermark being accidentally broken. The estimate is not perfect and
    may result in cache line bounces but is expected to be lighter than the
    IPI calls necessary to continually drain the per-cpu counters while kswapd
    is awake.

    Signed-off-by: Christoph Lameter <cl@linux.com>
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Christoph Lameter
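
The snapshot idea can be illustrated with the simplified sketch below; the field names are assumptions, not the kernel's struct zone, but the shape matches what the entry describes (global counter plus undrained per-cpu deltas):

```c
/* Illustrative only: assumed field names, not the kernel's struct zone. */
#define NR_CPUS_SKETCH 64

struct zone_sketch {
	long vm_stat_free;                   /* global NR_FREE_PAGES estimate     */
	long percpu_delta[NR_CPUS_SKETCH];   /* per-cpu deltas not yet folded in  */
};

/* Fold the undrained per-cpu deltas into the global counter for a more
 * accurate (but still not exact) reading, clamped at zero. */
static long free_pages_snapshot(const struct zone_sketch *zone)
{
	long x = zone->vm_stat_free;

	for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		x += zone->percpu_delta[cpu];

	return x < 0 ? 0 : x;
}
```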
     
  • When allocating a page, the system uses NR_FREE_PAGES counters to
    determine if watermarks would remain intact after the allocation was made.
    This check is made without interrupts disabled or the zone lock held and
    so is race-prone by nature. Unfortunately, when pages are being freed in
    batch, the counters are updated before the pages are added on the list.
    During this window, the counters are misleading as the pages do not exist
    yet. When under significant pressure on systems with large numbers of
    CPUs, it's possible for processes to make progress even though they should
    have been stalled. This is particularly problematic if a number of the
    processes are using GFP_ATOMIC as the min watermark can be accidentally
    breached and in extreme cases, the system can livelock.

    This patch updates the counters after the pages have been added to the
    list. This makes the allocator more cautious with respect to preserving
    the watermarks and mitigates livelock possibilities.

    [akpm@linux-foundation.org: avoid modifying incoming args]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
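
The ordering change above amounts to the following sketch (helper names are made up for illustration; this is not the kernel's batch-free code):

```c
/* Simplified stand-ins, not the kernel's batch-free path. */
struct page;
void add_to_free_list(struct page *page);   /* caller holds the zone lock */
void mod_free_pages_counter(long delta);    /* adjusts NR_FREE_PAGES      */

/* Put every page on the free list first; only then raise the counter, so a
 * racing lock-free watermark check never counts pages that cannot yet be
 * allocated. */
static void free_pages_batch(struct page **pages, int count)
{
	for (int i = 0; i < count; i++)
		add_to_free_list(pages[i]);

	mod_free_pages_counter(count);
}
```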
     
    refresh_zone_stat_thresholds() calculates its thresholds based on the
    number of online CPUs. It's called at CPU offlining but needs to be
    called at onlining, too.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Tests with recent firmware on Intel X25-M 80GB and OCZ Vertex 60GB SSDs
    show a shift since I last tested in December: in part because of firmware
    updates, in part because of the necessary move from barriers to awaiting
    completion at the block layer. While discard at swapon still shows as
    slightly beneficial on both, discarding 1MB swap cluster when allocating
    is now disadvantageous: adds 25% overhead on Intel, adds 230% on OCZ (YMMV).

    Surrender: discard as presently implemented is more hindrance than help
    for swap; but might prove useful on other devices, or with improvements.
    So continue to do the discard at swapon, but make discard while swapping
    conditional on a SWAP_FLAG_DISCARD to sys_swapon() (which has been using
    only the lower 16 bits of int flags).

    We can add a --discard or -d to swapon(8), and a "discard" to swap in
    /etc/fstab: matching the mount option for btrfs, ext4, fat, gfs2, nilfs2.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Hellwig
    Cc: Nigel Cunningham
    Cc: Tejun Heo
    Cc: Jens Axboe
    Cc: James Bottomley
    Cc: "Martin K. Petersen"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
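
From user space, opting in would look roughly like the sketch below. The swapon(2) wrapper is the normal glibc one; the numeric flag value is an assumption for illustration only (the entry just says the new flag has to live above the low 16 bits sys_swapon() has historically used):

```c
#include <sys/swap.h>

/* Assumed value: the first bit above the historical 16-bit flag space. */
#ifndef SWAP_FLAG_DISCARD
#define SWAP_FLAG_DISCARD 0x10000
#endif

/* Enable a swap area and request discard-while-swapping for it. */
int swapon_with_discard(const char *path)
{
	return swapon(path, SWAP_FLAG_DISCARD);
}
```

A swapon(8) --discard/-d option or a "discard" keyword in /etc/fstab, as suggested above, would simply map onto the same flag.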
     
  • The swap code already uses synchronous discards, no need to add I/O
    barriers.

    This fixes the worst of the terrible slowdown in swap allocation for
    hibernation, reported on 2.6.35 by Nigel Cunningham; but does not entirely
    eliminate that regression.

    [tj@kernel.org: superfluous newlines removed]
    Signed-off-by: Christoph Hellwig
    Tested-by: Nigel Cunningham
    Signed-off-by: Tejun Heo
    Signed-off-by: Hugh Dickins
    Cc: Jens Axboe
    Cc: James Bottomley
    Cc: "Martin K. Petersen"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Move the hibernation check from scan_swap_map() into try_to_free_swap():
    to catch not only the common case when hibernation's allocation itself
    triggers swap reuse, but also the less likely case when concurrent page
    reclaim (shrink_page_list) happens to call try_to_free_swap on a page.

    Hibernation already clears __GFP_IO from the gfp_allowed_mask, to stop
    reclaim from going to swap: check that to prevent swap reuse too.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: "Rafael J. Wysocki"
    Cc: Ondrej Zary
    Cc: Andrea Gelmini
    Cc: Balbir Singh
    Cc: Andrea Arcangeli
    Cc: Nigel Cunningham
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Please revert 2.6.36-rc commit d2997b1042ec150616c1963b5e5e919ffd0b0ebf
    "hibernation: freeze swap at hibernation". It complicated matters by
    adding a second swap allocation path, just for hibernation; without in any
    way fixing the issue that it was intended to address - page reclaim after
    fixing the hibernation image might free swap from a page already imaged as
    swapcache, letting its swap be reallocated to store a different page of
    the image: resulting in data corruption if the imaged page were freed as
    clean then swapped back in. Pages freed to si->swap_map were still in
    danger of being reallocated by the alternative allocation path.

    I guess it inadvertently fixed slow SSD swap allocation for hibernation,
    as reported by Nigel Cunningham: by missing out the discards that occur on
    the usual swap allocation path; but that was unintentional, and needs a
    separate fix.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: "Rafael J. Wysocki"
    Cc: Ondrej Zary
    Cc: Andrea Gelmini
    Cc: Balbir Singh
    Cc: Andrea Arcangeli
    Cc: Nigel Cunningham
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • I have been seeing problems on Tegra 2 (ARMv7 SMP) systems with HIGHMEM
    enabled on 2.6.35 (plus some patches targeted at 2.6.36 to perform cache
    maintenance lazily), and the root cause appears to be that the mm bouncing
    code is calling flush_dcache_page before it copies the bounce buffer into
    the bio.

    The bounced page needs to be flushed after data is copied into it, to
    ensure that architecture implementations can synchronize instruction and
    data caches if necessary.

    Signed-off-by: Gary King
    Cc: Tejun Heo
    Cc: Russell King
    Acked-by: Jens Axboe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gary King
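
The required ordering is simply "copy, then flush", as in the sketch below; `kernel_address_of` is an assumed stand-in for the real page-mapping helper, while flush_dcache_page() is the architecture hook named in the entry:

```c
#include <string.h>

/* Stand-ins for the kernel helpers involved (illustrative only). */
struct page;
void *kernel_address_of(struct page *page);   /* assumed: kernel mapping of the page */
void flush_dcache_page(struct page *page);    /* arch cache-maintenance hook         */

/* Copy the bounce buffer back into the real page first, then flush, so the
 * architecture can synchronize its caches against the new contents. */
static void copy_back_bounce(struct page *to, const void *bounce_buf, size_t len)
{
	memcpy(kernel_address_of(to), bounce_buf, len);
	flush_dcache_page(to);
}
```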
     
    next_active_pageblock() is for finding the next _used_ pageblock. It
    skips several blocks when it finds a chunk of free pages larger than a
    pageblock. But it has 2 bugs.

    1. We hold no lock, so page_order(page) - pageblock_order can be negative.
    2. The pageblocks_stride += is wrong; it should skip page_order(p) pages.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Wu Fengguang
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Iram reported that compaction's too_many_isolated() loops forever.
    (http://www.spinics.net/lists/linux-mm/msg08123.html)

    The meminfo at the time showed that inactive anon was zero, because the
    system had been under no memory pressure until then, so all anon pages
    sat on the active LRU. Compaction can isolate pages from the active LRU
    as well as the inactive LRU, which is different from vmscan's isolation;
    that is why there are two too_many_isolated() checks.

    But while compaction can isolate from both the active and inactive
    lists, its too_many_isolated() only considered the inactive list. That
    caused Iram's problem.

    This patch counts active and inactive pages together, since we cannot
    predict from which list, or how many, pages compaction will isolate.
    It changes the check from (nr_isolated > nr_inactive) to
    nr_isolated > (nr_active + nr_inactive) / 2.

    Signed-off-by: Minchan Kim
    Reported-by: Iram Shahzad
    Acked-by: Mel Gorman
    Acked-by: Wu Fengguang
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
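
The new check boils down to the comparison stated in the entry; a minimal sketch (field names are illustrative, not the kernel's zone counters):

```c
#include <stdbool.h>

/* Field names are illustrative, not the kernel's zone counters. */
struct lru_counts {
	unsigned long nr_active;
	unsigned long nr_inactive;
	unsigned long nr_isolated;
};

/* Old check: nr_isolated > nr_inactive (loops forever when inactive is 0).
 * New check: compare against half of active + inactive combined. */
static bool too_many_isolated(const struct lru_counts *c)
{
	return c->nr_isolated > (c->nr_active + c->nr_inactive) / 2;
}
```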
     
    COMPACTION enables MIGRATION, but MIGRATION spawns a warning if NUMA or
    memory hotplug isn't selected, even though MIGRATION doesn't actually
    depend on them. I guess it's just trying to be strict, double-checking
    who's enabling it, but it doesn't know that COMPACTION also enables
    MIGRATION.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
    The pte_same check is reliable only if the swap entry remains pinned (by
    the page lock on the swapcache). We also have to ensure the swapcache
    isn't removed before we take the lock, as try_to_free_swap won't care
    about the page pin.

    One of the possible impacts of this patch is that a KSM-shared page can
    point to the anon_vma of another process, which could exit before the page
    is freed.

    This can leave a page with a pointer to a recycled anon_vma object, or
    worse, a pointer to something that is no longer an anon_vma.

    [riel@redhat.com: changelog help]
    Signed-off-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Reviewed-by: Rik van Riel
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • So it can be used by all that need to check for that.

    Signed-off-by: Stefan Bader
    Signed-off-by: Linus Torvalds

    Stefan Bader
     

08 Sep, 2010

1 commit


29 Aug, 2010

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
    ceph: fix get_ticket_handler() error handling
    ceph: don't BUG on ENOMEM during mds reconnect
    ceph: ceph_mdsc_build_path() returns an ERR_PTR
    ceph: Fix warnings
    ceph: ceph_get_inode() returns an ERR_PTR
    ceph: initialize fields on new dentry_infos
    ceph: maintain i_head_snapc when any caps are dirty, not just for data
    ceph: fix osd request lru adjustment when sending request
    ceph: don't improperly set dir complete when holding EXCL cap
    mm: exporting account_page_dirty
    ceph: direct requests in snapped namespace based on nonsnap parent
    ceph: queue cap snap writeback for realm children on snap update
    ceph: include dirty xattrs state in snapped caps
    ceph: fix xattr cap writeback
    ceph: fix multiple mds session shutdown

    Linus Torvalds
     
  • After several hours, kbuild tests hang with anon_vma_prepare() spinning on
    a newly allocated anon_vma's lock - on a box with CONFIG_TREE_PREEMPT_RCU=y
    (which makes this very much more likely, but it could happen without).

    The ever-subtle page_lock_anon_vma() now needs a further twist: since
    anon_vma_prepare() and anon_vma_fork() are liable to change the ->root
    of a reused anon_vma structure at any moment, page_lock_anon_vma()
    needs to check page_mapped() again before succeeding, otherwise
    page_unlock_anon_vma() might address a different root->lock.

    Signed-off-by: Hugh Dickins
    Reviewed-by: Rik van Riel
    Cc: Christoph Lameter
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

27 Aug, 2010

3 commits

    When pcpu_build_alloc_info() searches for the best upa value, it ignores
    the current value if the number of wasted units exceeds 1/3 of the total
    number of CPUs. But the comment in the code says that the value is
    ignored if wastage is over 25%. Modify the comment to match the code.

    Signed-off-by: Namhyung Kim
    Signed-off-by: Tejun Heo

    Namhyung Kim
     
  • The original code did not free the old map. This patch fixes it.

    tj: use @old as memcpy source instead of @chunk->map, and indentation
    and description update

    Signed-off-by: Huang Shijie
    Signed-off-by: Tejun Heo
    Cc: stable@kernel.org

    Huang Shijie
     
  • This patch fixes the following issue:

    INFO: task mount.nfs4:1120 blocked for more than 120 seconds.
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    mount.nfs4 D 00000000fffc6a21 0 1120 1119 0x00000000
    ffff880235643948 0000000000000046 ffffffff00000000 ffffffff00000000
    ffff880235643fd8 ffff880235314760 00000000001d44c0 ffff880235643fd8
    00000000001d44c0 00000000001d44c0 00000000001d44c0 00000000001d44c0
    Call Trace:
    [] schedule_timeout+0x34/0xf1
    [] ? wait_for_common+0x3f/0x130
    [] ? trace_hardirqs_on+0xd/0xf
    [] wait_for_common+0xd2/0x130
    [] ? default_wake_function+0x0/0xf
    [] ? _raw_spin_unlock+0x26/0x2a
    [] wait_for_completion+0x18/0x1a
    [] sync_inodes_sb+0xca/0x1bc
    [] __sync_filesystem+0x47/0x7e
    [] sync_filesystem+0x47/0x4b
    [] generic_shutdown_super+0x22/0xd2
    [] kill_anon_super+0x11/0x4f
    [] nfs4_kill_super+0x3f/0x72 [nfs]
    [] deactivate_locked_super+0x21/0x41
    [] deactivate_super+0x40/0x45
    [] mntput_no_expire+0xb8/0xed
    [] release_mounts+0x9a/0xb0
    [] put_mnt_ns+0x6a/0x7b
    [] nfs_follow_remote_path+0x19a/0x296 [nfs]
    [] nfs4_try_mount+0x75/0xaf [nfs]
    [] nfs4_get_sb+0x276/0x2ff [nfs]
    [] vfs_kern_mount+0xb8/0x196
    [] do_kern_mount+0x48/0xe8
    [] do_mount+0x771/0x7e8
    [] sys_mount+0x83/0xbd
    [] system_call_fastpath+0x16/0x1b

    The reason for this hang was a race condition: when the flusher thread is
    forking a bdi thread, we use 'kthread_run()', so the thread starts running
    _before_ we make it visible in 'bdi->wb.task'. The bdi thread runs, does
    all its work, and goes to sleep; 'bdi->wb.task' is still NULL. And this is
    a dangerous time window.

    If at this time someone queues work for this bdi, they do not see the bdi
    thread and wake up the forker thread instead! But the forker has already
    forked this bdi thread; it just has not made it visible yet.

    The result is that we lose the wakeup event for this bdi thread and the
    NFS4 code waits forever.

    To fix the problem, we should use 'kthread_create()' for creating bdi
    threads, then make them visible in 'bdi->wb.task', and only after that
    wake them up. This is exactly what this patch does.

    Signed-off-by: Artem Bityutskiy
    Signed-off-by: Jens Axboe

    Artem Bityutskiy
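
A kernel-style sketch of that ordering, close to but not identical to the actual mm/backing-dev.c change (the wb_sketch container and function name are stand-ins; kthread_create() and wake_up_process() are the real primitives named in the entry):

```c
#include <linux/kthread.h>
#include <linux/err.h>
#include <linux/sched.h>

/* Hypothetical container standing in for bdi->wb in this sketch. */
struct wb_sketch {
	struct task_struct *task;
};

/*
 * Create the thread first (it does not run yet), publish it where waiters
 * look for it, and only then wake it, so a wakeup queued in between cannot
 * be lost.
 */
static int fork_wb_thread(struct wb_sketch *wb, int (*threadfn)(void *), void *data)
{
	struct task_struct *task = kthread_create(threadfn, data, "flush-sketch");

	if (IS_ERR(task))
		return PTR_ERR(task);

	wb->task = task;	/* visible to anyone queueing work... */
	wake_up_process(task);	/* ...before the thread starts running */
	return 0;
}
```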
     

25 Aug, 2010

2 commits

  • * '2.6.36-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev:
    xfs: do not discard page cache data on EAGAIN
    xfs: don't do memory allocation under the CIL context lock
    xfs: Reduce log force overhead for delayed logging
    xfs: dummy transactions should not dirty VFS state
    xfs: ensure f_ffree returned by statfs() is non-negative
    xfs: handle negative wbc->nr_to_write during sync writeback
    writeback: write_cache_pages doesn't terminate at nr_to_write <= 0
    xfs: fix untrusted inode number lookup
    xfs: ensure we mark all inodes in a freed cluster XFS_ISTALE
    xfs: unlock items before allowing the CIL to commit

    Linus Torvalds
     
  • pa-risc and ia64 have stacks that grow upwards. Check that
    they do not run into other mappings. By making VM_GROWSUP
    0x0 on architectures that do not ever use it, we can avoid
    some unpleasant #ifdefs in check_stack_guard_page().

    Signed-off-by: Tony Luck
    Signed-off-by: Linus Torvalds

    Luck, Tony
     

24 Aug, 2010

1 commit

  • I noticed XFS writeback in 2.6.36-rc1 was much slower than it should have
    been. Enabling writeback tracing showed:

    flush-253:16-8516 [007] 1342952.351608: wbc_writepage: bdi 253:16: towrt=1024 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
    flush-253:16-8516 [007] 1342952.351654: wbc_writepage: bdi 253:16: towrt=1023 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
    flush-253:16-8516 [000] 1342952.369520: wbc_writepage: bdi 253:16: towrt=0 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
    flush-253:16-8516 [000] 1342952.369542: wbc_writepage: bdi 253:16: towrt=-1 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0
    flush-253:16-8516 [000] 1342952.369549: wbc_writepage: bdi 253:16: towrt=-2 skip=0 mode=0 kupd=0 bgrd=1 reclm=0 cyclic=1 more=0 older=0x0 start=0x0 end=0x0

    Writeback is not terminating in background writeback if ->writepage is
    returning with wbc->nr_to_write == 0, resulting in sub-optimal single page
    writeback on XFS.

    Fix the write_cache_pages loop to terminate correctly when this situation
    occurs and so prevent this sub-optimal background writeback pattern. This
    improves sustained sequential buffered write performance from around
    250MB/s to 750MB/s for a 100GB file on an XFS filesystem on my 8p test VM.

    Cc:
    Signed-off-by: Dave Chinner
    Reviewed-by: Wu Fengguang
    Reviewed-by: Christoph Hellwig

    Dave Chinner
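
The loop fix amounts to stopping as soon as nr_to_write reaches or drops below zero for non-integrity writeback; a simplified sketch (stand-in types, not the real write_cache_pages()):

```c
#include <stdbool.h>

/* Simplified stand-ins; not the real writeback_control or address_space. */
struct wbc_sketch {
	long nr_to_write;
	bool for_integrity;	/* true for WB_SYNC_ALL-style writeback */
};

int writepage_sketch(int pageno);	/* pretend ->writepage */

/*
 * Background (non-integrity) writeback must stop once nr_to_write reaches
 * zero or goes negative, since ->writepage may itself decrement it past
 * zero; testing only for exact equality lets the loop run on.
 */
static int write_pages_sketch(struct wbc_sketch *wbc, int npages)
{
	for (int pageno = 0; pageno < npages; pageno++) {
		int ret = writepage_sketch(pageno);
		if (ret)
			return ret;

		wbc->nr_to_write--;
		if (wbc->nr_to_write <= 0 && !wbc->for_integrity)
			break;
	}
	return 0;
}
```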
     

23 Aug, 2010

2 commits


21 Aug, 2010

7 commits

  • Like the mlock() change previously, this makes the stack guard check
    code use vma->vm_prev to see what the mapping below the current stack
    is, rather than have to look it up with find_vma().

    Also, accept an abutting stack segment, since that happens naturally if
    you split the stack with mlock or mprotect.

    Tested-by: Ian Campbell
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • If we've split the stack vma, only the lowest one has the guard page.
    Now that we have a doubly linked list of vma's, checking this is trivial.

    Tested-by: Ian Campbell
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • It's a really simple list, and several of the users want to go backwards
    in it to find the previous vma. So rather than have to look up the
    previous entry with 'find_vma_prev()' or something similar, just make it
    doubly linked instead.

    Tested-by: Ian Campbell
    Signed-off-by: Linus Torvalds

    Linus Torvalds
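
With the back pointer, "what is below this vma?" becomes a single dereference; an illustrative sketch with simplified types (not the real vm_area_struct):

```c
/* Illustrative types only, not the kernel's vm_area_struct. */
struct vma_sketch {
	unsigned long vm_start, vm_end;
	struct vma_sketch *vm_next, *vm_prev;
};

/* With ->vm_prev, the mapping just below a stack vma is one dereference away
 * instead of a find_vma_prev()-style lookup; returns nonzero if it abuts. */
static int prev_mapping_abuts(const struct vma_sketch *stack_vma)
{
	const struct vma_sketch *prev = stack_vma->vm_prev;

	return prev && prev->vm_end == stack_vma->vm_start;
}
```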
     
  • dump_tasks() needs to hold the RCU read lock around its access of the
    target task's UID. To this end it should use task_uid() as it only needs
    that one thing from the creds.

    The fact that dump_tasks() holds tasklist_lock is insufficient to prevent the
    target process replacing its credentials on another CPU.

    This patch therefore changes the code to call rcu_read_lock() explicitly.

    ===================================================
    [ INFO: suspicious rcu_dereference_check() usage. ]
    ---------------------------------------------------
    mm/oom_kill.c:410 invoked rcu_dereference_check() without protection!

    other info that might help us debug this:

    rcu_scheduler_active = 1, debug_locks = 1
    4 locks held by kworker/1:2/651:
    #0: (events){+.+.+.}, at: []
    process_one_work+0x137/0x4a0
    #1: (moom_work){+.+...}, at: []
    process_one_work+0x137/0x4a0
    #2: (tasklist_lock){.+.+..}, at: []
    out_of_memory+0x164/0x3f0
    #3: (&(&p->alloc_lock)->rlock){+.+...}, at: []
    find_lock_task_mm+0x2e/0x70

    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: David Howells
    Acked-by: Paul E. McKenney
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
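
The pattern described above, in a kernel-style sketch (simplified, not the actual dump_tasks() hunk): hold the RCU read lock across the credential access and take only the uid via task_uid().

```c
#include <linux/cred.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>

/* Read the target task's uid with the RCU read lock held, so a concurrent
 * credential replacement on another CPU cannot free the creds under us. */
static uid_t get_task_uid_rcu(struct task_struct *p)
{
	uid_t uid;

	rcu_read_lock();
	uid = task_uid(p);
	rcu_read_unlock();

	return uid;
}
```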
     
  • Commit 0aad4b3124 ("oom: fold __out_of_memory into out_of_memory")
    introduced a tasklist_lock leak, which caused the following warnings and
    a panic.

    ================================================
    [ BUG: lock held when returning to user space! ]
    ------------------------------------------------
    rsyslogd/1422 is leaving the kernel with locks still held!
    1 lock held by rsyslogd/1422:
    #0: (tasklist_lock){.+.+.+}, at: [] out_of_memory+0x164/0x3f0
    BUG: scheduling while atomic: rsyslogd/1422/0x00000002
    INFO: lockdep is turned off.

    This patch fixes it.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Commit b940fd7035 ("oom: remove unnecessary code and cleanup") added an
    unnecessary NULL pointer dereference. Remove it.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
    When radix_tree_maxindex() is ~0UL, it can happen that scanning overflows
    the index and the tree traversal code goes astray, reading memory until it
    hits unreadable memory. Check for overflow and exit in that case.

    Signed-off-by: Jan Kara
    Cc: Christoph Hellwig
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

18 Aug, 2010

1 commit

    list_add() corruption messages were reported from shmem_fill_super()'s
    recently introduced percpu_counter_init(): shmem_put_super() needs to
    remember to call percpu_counter_destroy(). Also check the error return
    from percpu_counter_init().

    Reported-bisected-and-tested-by: Tetsuo Handa
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
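
The required pairing looks roughly like this kernel-style sketch (made-up struct and function names, using the two-argument percpu_counter_init() of that era rather than the shmem code itself):

```c
#include <linux/percpu_counter.h>

/* Illustrative pairing only; names are made up, not the shmem code. */
struct sb_sketch {
	struct percpu_counter used_blocks;
};

static int sb_sketch_fill(struct sb_sketch *s, unsigned long used)
{
	/* Check the error: percpu_counter_init() can fail to allocate. */
	return percpu_counter_init(&s->used_blocks, used);
}

static void sb_sketch_put(struct sb_sketch *s)
{
	/* Without this, the counter stays on the global list: list corruption. */
	percpu_counter_destroy(&s->used_blocks);
}
```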
     

16 Aug, 2010

1 commit

  • This commit makes the stack guard page somewhat less visible to user
    space. It does this by:

    - not showing the guard page in /proc/<pid>/maps

    It looks like lvm-tools will actually read /proc/self/maps to figure
    out where all its mappings are, and effectively do a specialized
    "mlockall()" in user space. By not showing the guard page as part of
    the mapping (by just adding PAGE_SIZE to the start for grows-up
    pages), lvm-tools ends up not being aware of it.

    - by also teaching the _real_ mlock() functionality not to try to lock
    the guard page.

    That would just expand the mapping down to create a new guard page,
    so there really is no point in trying to lock it in place.

    It would perhaps be nice to show the guard page specially in
    /proc/<pid>/maps (or at least mark grow-down segments some way), but
    let's not open ourselves up to more breakage by user space from programs
    that depend on the exact details of the 'maps' file.

    Special thanks to Henrique de Moraes Holschuh for diving into lvm-tools
    source code to see what was going on with the whole new warning.

    Reported-and-tested-by: François Valenduc
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

15 Aug, 2010

2 commits

  • Remove leading /** from non-kernel-doc function comments to prevent
    kernel-doc warnings.

    Signed-off-by: Randy Dunlap
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • We do in fact need to unmap the page table _before_ doing the whole
    stack guard page logic, because if it is needed (mainly 32-bit x86 with
    PAE and CONFIG_HIGHPTE, but other architectures may use it too) then it
    will do a kmap_atomic/kunmap_atomic.

    And those kmaps will create an atomic region that we cannot do
    allocations in. However, the whole stack expand code will need to do
    anon_vma_prepare() and vma_lock_anon_vma() and they cannot do that in an
    atomic region.

    Now, a better model might actually be to do the anon_vma_prepare() when
    _creating_ a VM_GROWSDOWN segment, and not have to worry about any of
    this at page fault time. But in the meantime, this is the
    straightforward fix for the issue.

    See https://bugzilla.kernel.org/show_bug.cgi?id=16588 for details.

    Reported-by: Wylda
    Reported-by: Sedat Dilek
    Reported-by: Mike Pagano
    Reported-by: François Valenduc
    Tested-by: Ed Tomlinson
    Cc: Pekka Enberg
    Cc: Greg KH
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

14 Aug, 2010

2 commits

  • Remove an extraneous no_printk() in mm/nommu.c that got missed when the
    function got generalised from several things that used it in commit
    12fdff3fc248 ("Add a dummy printk function for the maintenance of unused
    printks").

    Without this, the following error is observed:

    mm/nommu.c:41: error: conflicting types for 'no_printk'
    include/linux/kernel.h:314: error: previous definition of 'no_printk' was here

    Reported-by: Michal Simek
    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds

    David Howells
     
  • .. which didn't show up in my tests because it's a no-op on x86-64 and
    most other architectures. But we enter the function with the last-level
    page table mapped, and should unmap it at exit.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

13 Aug, 2010

1 commit

  • This is a rather minimally invasive patch to solve the problem of the
    user stack growing into a memory mapped area below it. Whenever we fill
    the first page of the stack segment, expand the segment down by one
    page.

    Now, admittedly some odd application might _want_ the stack to grow down
    into the preceding memory mapping, and so we may at some point need to
    make this a process tunable (some people might also want to have more
    than a single page of guarding), but let's try the minimal approach
    first.

    Tested with trivial application that maps a single page just below the
    stack, and then starts recursing. Without this, we will get a SIGSEGV
    _after_ the stack has smashed the mapping. With this patch, we'll get a
    nice SIGBUS just as the stack touches the page just above the mapping.

    Requested-by: Keith Packard
    Signed-off-by: Linus Torvalds

    Linus Torvalds
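
A simplified reconstruction of the guard-page idea described above (types and names are stand-ins, not the real fault path): a fault in the lowest page of a grows-down stack vma first tries to grow the stack by one page, and if that would collide with the mapping below, the caller raises SIGBUS instead of silently overlapping it.

```c
/* Illustrative only: simplified types, not the real fault-handling code. */
#define GUARD_PAGE_SIZE 4096UL

struct stack_vma_sketch {
	unsigned long vm_start;   /* lowest address of the grows-down stack vma */
};

/* Assumed helper: grow the vma down to cover 'address'; fails if it would
 * run into the mapping below. */
int expand_stack_sketch(struct stack_vma_sketch *vma, unsigned long address);

static int check_stack_guard_page_sketch(struct stack_vma_sketch *vma,
					 unsigned long address)
{
	/* Fault in the first (lowest) page: treat it as the guard page. */
	if (address < vma->vm_start + GUARD_PAGE_SIZE)
		return expand_stack_sketch(vma, address - GUARD_PAGE_SIZE);

	return 0;
}
```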