25 May, 2011

27 commits

  • Rework the existing mmu_gather infrastructure.

    The direct purpose of these patches was to allow preemptible mmu_gather,
    but even without that I think these patches provide an improvement to the
    status quo.

    The first 9 patches rework the mmu_gather infrastructure. For review
    purposes I've split them into generic and per-arch patches, with the
    last of those a generic cleanup.

    The next patch provides generic RCU page-table freeing, and the followup
    is a patch converting s390 to use this. I've also got 4 patches from
    DaveM lined up (not included in this series) that use this to implement
    gup_fast() for sparc64.

    Then there is one patch that extends the generic mmu_gather batching.

    After that follow the mm preemptibility patches, which make parts of the
    mm a lot more preemptible. They convert i_mmap_lock and anon_vma->lock
    to mutexes, which together with the mmu_gather rework makes mmu_gather
    preemptible as well.

    Making i_mmap_lock a mutex also enables a clean-up of the truncate code.

    This also allows for preemptible mmu_notifiers, something that XPMEM I
    think wants.

    Furthermore, it removes the new and universally detested unmap_mutex.

    This patch:

    Remove the first obstacle towards a fully preemptible mmu_gather.

    The current scheme assumes mmu_gather is always done with preemption
    disabled and uses per-cpu storage for the page batches. Change this to
    try to allocate a page for batching and, in case of failure, use a small
    on-stack array to make some progress.
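
    A minimal sketch of that allocation pattern, assuming illustrative
    structure and field names rather than the exact mmu_gather layout:

    struct mmu_gather_batch {
            struct mmu_gather_batch *next;
            unsigned int            nr;
            struct page             *pages[];
    };

    /* Hedged sketch: try to get a full page for the next batch; if that
     * fails the caller keeps using the small on-stack array embedded in
     * the mmu_gather, so forward progress is still made. */
    static bool tlb_next_batch(struct mmu_gather *tlb)
    {
            struct mmu_gather_batch *batch;

            batch = (void *)__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
            if (!batch)
                    return false;   /* fall back to the on-stack batch */

            batch->next = NULL;
            batch->nr = 0;
            tlb->active->next = batch;
            tlb->active = batch;
            return true;
    }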

    Preemptible mmu_gather is desired in general and usable once i_mmap_lock
    becomes a mutex. Doing it before the mutex conversion saves us from
    having to rework the code by moving the mmu_gather bits inside the
    pte_lock.

    Also avoid flushing the TLB batches from under the pte lock; this is
    useful even without the i_mmap_lock conversion as it significantly
    reduces pte lock hold times.

    [akpm@linux-foundation.org: fix comment tpyo]
    Signed-off-by: Peter Zijlstra
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Currently we have expand_upwards exported while expand_downwards is
    accessible only via expand_stack or expand_stack_downwards.

    check_stack_guard_page is a nice example of the asymmetry. It uses
    expand_stack for the VM_GROWSDOWN case while expand_upwards is called
    for the VM_GROWSUP case.

    Let's clean this up by exporting both functions and making the names
    consistent. Let's use expand_{upwards,downwards} because expanding
    doesn't always involve stack manipulation (an example is
    ia64_do_page_fault which uses expand_upwards for registers backing store
    expansion). expand_downwards has to be defined for both
    CONFIG_STACK_GROWS{UP,DOWN} because get_arg_page calls the downwards
    version in the early process initialization phase for growsup
    configuration.
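
    For illustration, a hedged sketch of how the renamed pair could be used
    from check_stack_guard_page(); this is simplified and omits the actual
    guard-page conditions:

    /* Sketch only: grow the area by one page in the right direction. */
    static int check_stack_guard_page(struct vm_area_struct *vma,
                                      unsigned long address)
    {
            if ((vma->vm_flags & VM_GROWSDOWN) && address == vma->vm_start)
                    return expand_downwards(vma, address - PAGE_SIZE);
            if ((vma->vm_flags & VM_GROWSUP) && address + PAGE_SIZE == vma->vm_end)
                    return expand_upwards(vma, address + PAGE_SIZE);
            return 0;
    }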

    Signed-off-by: Michal Hocko
    Acked-by: Hugh Dickins
    Cc: James Bottomley
    Cc: "Luck, Tony"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The vmap allocator is used to, among other things, allocate per-cpu vmap
    blocks, where each vmap block is naturally aligned to its own size.
    Obviously, leaving a guard page after each vmap area forbids packing vmap
    blocks efficiently and can make the kernel run out of possible vmap blocks
    long before overall vmap space is exhausted.

    The new interface to map a user-supplied page array into linear vmalloc
    space (vm_map_ram) insists on allocating from a vmap block (instead of
    falling back to a custom area) when the area size is below a certain
    threshold. With heavy users of this interface (e.g. XFS) and limited
    vmalloc space on 32-bit, vmap block exhaustion is a real problem.
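
    For context, a hedged usage sketch of that interface, assuming the
    vm_map_ram()/vm_unmap_ram() signatures of this kernel era (vm_map_ram()
    still took a pgprot argument):

    /* Map a caller-supplied page array into contiguous vmalloc space. */
    void *addr = vm_map_ram(pages, nr_pages, -1 /* any node */, PAGE_KERNEL);

    if (addr) {
            /* ... use the linearly mapped buffer ... */
            vm_unmap_ram(addr, nr_pages);
    }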

    Remove the guard page from the core vmap allocator. vmalloc and the old
    vmap interface enforce a guard page on their own at a higher level.

    Note that without this patch, we had accidental guard pages after those
    vm_map_ram areas that happened to be at the end of a vmap block, but not
    between every area. This patch removes this accidental guard page only.

    If we want guard pages after every vm_map_ram area, this should be done
    separately. And just like with vmalloc and the old interface on a
    different level, not in the core allocator.

    Mel pointed out: "If necessary, the guard page could be reintroduced as a
    debugging-only option (CONFIG_DEBUG_PAGEALLOC?). Otherwise it seems
    reasonable."

    Signed-off-by: Johannes Weiner
    Cc: Nick Piggin
    Cc: Dave Chinner
    Acked-by: Mel Gorman
    Cc: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • There's a kernel-wide shortage of per-process flags, so it's always
    helpful to trim one when possible without incurring a significant penalty.
    It's even more important when you're planning on adding a per-process
    flag yourself, which I plan to do shortly for transparent hugepages.

    PF_OOM_ORIGIN is used by ksm and swapoff to prefer current since it has a
    tendency to allocate large amounts of memory and should be preferred for
    killing over other tasks. We'd rather immediately kill the task making
    the errant syscall than penalize an innocent task.

    This patch removes PF_OOM_ORIGIN since its behavior is equivalent to
    setting the process's oom_score_adj to OOM_SCORE_ADJ_MAX.

    The process's old oom_score_adj is stored and then set to
    OOM_SCORE_ADJ_MAX during the time it used to have PF_OOM_ORIGIN. The old
    value is then reinstated when the process should no longer be considered a
    high priority for oom killing.
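
    A minimal sketch of that save-and-restore pattern; the helper name is
    hypothetical, not necessarily what the patch adds:

    /* Hypothetical helper: mark current as the preferred oom victim and
     * return the previous oom_score_adj so it can be restored later. */
    static int oom_score_adj_set_max(void)
    {
            int old = current->signal->oom_score_adj;

            current->signal->oom_score_adj = OOM_SCORE_ADJ_MAX;
            return old;
    }

    /* ksm/swapoff style usage: */
    int old_adj = oom_score_adj_set_max();
    /* ... memory-hungry operation ... */
    current->signal->oom_score_adj = old_adj;
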

    Signed-off-by: David Rientjes
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Cc: Hugh Dickins
    Cc: Izik Eidus
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • It's uncertain this has been beneficial, so it's safer to undo it. All
    other compaction users would still go in synchronous mode if a first
    attempt at async compaction failed. Hopefully we don't need to force
    special behavior for THP (which is the only __GFP_NO_KSWAPD user so far
    and the easiest to exercise and notice). This also makes __GFP_NO_KSWAPD
    return to its original strict semantics of only bypassing kswapd, as THP
    allocations have khugepaged for the async THP allocations/compactions.

    Signed-off-by: Andrea Arcangeli
    Cc: Alex Villacis Lasso
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Currently, cpu hotplug updates pcp->stat_threshold, but memory hotplug
    doesn't. There is no reason for this.
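
    A hedged sketch of the idea: let memory hotplug recompute the per-cpu
    stat thresholds just as cpu hotplug already does (the callback name is
    illustrative):

    /* Illustrative memory-hotplug notifier refreshing pcp->stat_threshold. */
    static int vmstat_memory_callback(struct notifier_block *self,
                                      unsigned long action, void *arg)
    {
            if (action == MEM_ONLINE || action == MEM_OFFLINE)
                    refresh_zone_stat_thresholds();
            return NOTIFY_OK;
    }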

    [akpm@linux-foundation.org: fix CONFIG_SMP=n build]
    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Currently, memory hotplug calls setup_per_zone_wmarks() and
    calculate_zone_inactive_ratio(), but doesn't call
    setup_per_zone_lowmem_reserve().

    This means the number of reserved pages isn't updated when memory
    hotplug occurs. This patch fixes it.
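
    A hedged sketch of the intent, assuming the online/offline path simply
    calls the existing recalculation helpers:

    /* In the memory online/offline path (sketch): */
    setup_per_zone_wmarks();             /* already called today */
    calculate_zone_inactive_ratio(zone); /* already called today */
    setup_per_zone_lowmem_reserve();     /* the missing call this patch adds */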

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Commit bce7394a3e ("page-allocator: reset wmark_min and inactive ratio of
    zone when hotplug happens") introduced invalid section references. Now,
    setup_per_zone_inactive_ratio() is marked __init and then it can't be
    referenced from memory hotplug code.

    This patch marks it as __meminit and also marks caller as __ref.
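
    A minimal sketch of that annotation pattern (bodies elided):

    /* Was __init; __meminit keeps it available for memory hotplug. */
    static void __meminit setup_per_zone_inactive_ratio(void)
    {
            /* ... */
    }

    /* The hotplug-path caller is marked __ref so the section-mismatch
     * checker accepts the reference to a __meminit function. */
    int __ref online_pages(unsigned long pfn, unsigned long nr_pages)
    {
            /* ... */
            setup_per_zone_inactive_ratio();
            return 0;
    }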

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Cc: Yasunori Goto
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • When an oom killing occurs, almost all processes are getting stuck at the
    following two points.

    1) __alloc_pages_nodemask
    2) __lock_page_or_retry

    1) is not very problematic because TIF_MEMDIE leads to an allocation
    failure and the task gets out of the page allocator.

    2) is more problematic. In an OOM situation, zones typically don't have
    page cache at all and memory starvation might lead to greatly reduced IO
    performance. When a fork bomb occurs, TIF_MEMDIE tasks don't die quickly,
    meaning that the fork bomb may create new processes faster than the
    oom-killer can kill them. Then, the system may become livelocked.

    This patch makes the page fault interruptible by SIGKILL.
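
    A hedged sketch of the check this enables in an arch fault handler
    (simplified; loosely modeled on the x86 path of this era):

    fault = handle_mm_fault(mm, vma, address, flags);

    if (fault & VM_FAULT_RETRY) {
            /* The wait in the fault path was interrupted by SIGKILL:
             * bail out instead of retrying so the dying task can exit. */
            if (fatal_signal_pending(current))
                    return;
    }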

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Matthew Wilcox
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
    commit 2687a356 ("Add lock_page_killable") introduced killable
    lock_page(). Similarly, this patch introduces a killable
    wait_on_page_locked().
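
    A minimal sketch of the new helper, assuming a killable counterpart of
    wait_on_page_bit() is available:

    /* Sketch: return 0 once the page is unlocked, or an error if a fatal
     * signal arrived while waiting. */
    static inline int wait_on_page_locked_killable(struct page *page)
    {
            if (PageLocked(page))
                    return wait_on_page_bit_killable(page, PG_locked);
            return 0;
    }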

    Signed-off-by: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Matthew Wilcox
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
    commit 2ac390370a ("writeback: add
    /sys/devices/system/node/<node>/vmstat") added a per-node vmstat entry.
    But strangely it only shows nr_written and nr_dirtied.

    # cat /sys/devices/system/node/node20/vmstat
    nr_written 0
    nr_dirtied 0

    Of course, that's not adequate. With this patch, the vmstat file shows
    all VM statistics, like /proc/vmstat.

    # cat /sys/devices/system/node/node0/vmstat
    nr_free_pages 899224
    nr_inactive_anon 201
    nr_active_anon 17380
    nr_inactive_file 31572
    nr_active_file 28277
    nr_unevictable 0
    nr_mlock 0
    nr_anon_pages 17321
    nr_mapped 8640
    nr_file_pages 60107
    nr_dirty 33
    nr_writeback 0
    nr_slab_reclaimable 6850
    nr_slab_unreclaimable 7604
    nr_page_table_pages 3105
    nr_kernel_stack 175
    nr_unstable 0
    nr_bounce 0
    nr_vmscan_write 0
    nr_writeback_temp 0
    nr_isolated_anon 0
    nr_isolated_file 0
    nr_shmem 260
    nr_dirtied 1050
    nr_written 938
    numa_hit 962872
    numa_miss 0
    numa_foreign 0
    numa_interleave 8617
    numa_local 962872
    numa_other 0
    nr_anon_transparent_hugepages 0

    [akpm@linux-foundation.org: no externs in .c files]
    Signed-off-by: KOSAKI Motohiro
    Cc: Michael Rubin
    Cc: Wu Fengguang
    Acked-by: David Rientjes
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
    Because 'ret' is declared as int, not unsigned long, there is no need to
    cast the error constants to unsigned long. If you compile this code on a
    64-bit machine somehow, you'll see the following warning:

    CC mm/nommu.o
    mm/nommu.c: In function `do_mmap_pgoff':
    mm/nommu.c:1411: warning: overflow in implicit constant conversion
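
    For illustration, the kind of assignment that triggers the warning
    versus the fixed form (a sketch, not the exact nommu.c lines):

    int ret;

    /* warns on 64-bit: the unsigned long constant does not fit in an int */
    ret = (unsigned long) -EINVAL;

    /* fixed: 'ret' is an int, so no cast is needed */
    ret = -EINVAL;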

    Signed-off-by: Namhyung Kim
    Acked-by: Greg Ungerer
    Cc: David Howells
    Cc: Paul Mundt
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • If f_op->read() fails and sysctl_nr_trim_pages > 1, there could be a
    memory leak between @region->vm_end and @region->vm_top.

    Signed-off-by: Namhyung Kim
    Acked-by: Greg Ungerer
    Cc: David Howells
    Cc: Paul Mundt
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • Now we have the sorted vma list, use it in do_munmap() to check that we
    have an exact match.

    Signed-off-by: Namhyung Kim
    Acked-by: Greg Ungerer
    Cc: David Howells
    Cc: Paul Mundt
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
    Now we have the sorted vma list, use it in find_vma[_exact]() rather
    than doing a linear search on the rb-tree.
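
    A hedged sketch of the list-based lookup (using the standard mm->mmap /
    vm_next linkage; details of the real nommu version may differ):

    /* Sketch: walk the sorted VMA list instead of the rb-tree. */
    struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
    {
            struct vm_area_struct *vma;

            for (vma = mm->mmap; vma; vma = vma->vm_next) {
                    if (vma->vm_start > addr)
                            return NULL;    /* sorted: we went past addr */
                    if (vma->vm_end > addr)
                            return vma;
            }
            return NULL;
    }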

    Signed-off-by: Namhyung Kim
    Acked-by: Greg Ungerer
    Cc: David Howells
    Cc: Paul Mundt
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • Since commit 297c5eee3724 ("mm: make the vma list be doubly linked") made
    it a doubly linked list, we don't need to scan the list when deleting
    @vma.

    And the original code didn't update the prev pointer. Fix it too.
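
    A minimal sketch of the O(1) unlink this enables, including the prev
    pointer update the original code missed:

    /* Sketch: unlink @vma from mm->mmap without scanning the list. */
    if (vma->vm_prev)
            vma->vm_prev->vm_next = vma->vm_next;
    else
            mm->mmap = vma->vm_next;

    if (vma->vm_next)
            vma->vm_next->vm_prev = vma->vm_prev;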

    Signed-off-by: Namhyung Kim
    Acked-by: Greg Ungerer
    Cc: David Howells
    Cc: Paul Mundt
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • When I was reading nommu code, I found that it handles the vma list/tree
    in an unusual way. IIUC, because there can be more than one
    identical/overlapping vma in the list/tree, it sorts the tree more
    strictly and does a linear search on the tree. But this isn't applied to
    the list (i.e. the list could be constructed in a different order than
    the tree, so we can't use the list when finding the first vma in that
    order).

    Since inserting/sorting a vma in the tree and linking it into the list
    are done at the same time, we can easily construct both in the same
    order. And since a linear search on the tree can be more costly than one
    on the list, the traversal can be converted to use the list.

    Also, after commit 297c5eee3724 ("mm: make the vma list be doubly
    linked") made the list doubly linked, there were a couple of places that
    needed to be fixed to construct the list properly.

    Patch 1/6 is a preparation. It maintains the list sorted same as the tree
    and construct doubly-linked list properly. Patch 2/6 is a simple
    optimization for the vma deletion. Patch 3/6 and 4/6 convert tree
    traversal to list traversal and the rest are simple fixes and cleanups.

    This patch:

    @vma added into @mm should be sorted by start addr, end addr and VMA
    struct addr in that order because we may get identical VMAs in the @mm.
    However this was true only for the rbtree, not for the list.

    This patch fixes this by remembering 'rb_prev' during the tree traversal
    like find_vma_prepare() does and linking the @vma via __vma_link_list().
    After this patch, we can iterate over all VMAs in the correct order
    simply by walking the @mm->mmap list.
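
    A hedged sketch of the list-link half of that: 'prev' is the in-order
    predecessor remembered while walking down the rb-tree, and the new @vma
    is linked into the doubly-linked list right after it so that list order
    matches tree order:

    vma->vm_prev = prev;
    if (prev) {
            vma->vm_next = prev->vm_next;
            prev->vm_next = vma;
    } else {
            vma->vm_next = mm->mmap;
            mm->mmap = vma;
    }
    if (vma->vm_next)
            vma->vm_next->vm_prev = vma;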

    [akpm@linux-foundation.org: avoid duplicating __vma_link_list()]
    Signed-off-by: Namhyung Kim
    Acked-by: Greg Ungerer
    Cc: David Howells
    Cc: Paul Mundt
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • Signed-off-by: Sergey Senozhatsky
    Reviewed-by: Christoph Lameter
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Avoid merging a VMA with another VMA which is cloned from the parent process.

    The cloned VMA shares the anon_vma lock with the parent process's VMA. If
    we do the merge, more VMAs (even though the new range is only for the
    current process) use the parent process's anon_vma lock. This introduces
    scalability issues. find_mergeable_anon_vma() already considers this
    case.

    Signed-off-by: Shaohua Li
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • If we only change vma->vm_end, we can avoid taking anon_vma lock even if
    'insert' isn't NULL, which is the case of split_vma.

    As I understand it, we need the lock before because rmap must get the
    'insert' VMA when we adjust old VMA's vm_end (the 'insert' VMA is linked
    to anon_vma list in __insert_vm_struct before).

    But now this isn't true any more. The 'insert' VMA is already linked to
    the anon_vma list in __split_vma() (with anon_vma_clone()) instead of in
    __insert_vm_struct(), so there is no race in which rmap could fail to
    get the required VMAs. The anon_vma lock is therefore unnecessary, and
    dropping it saves one lock acquisition in the brk case and improves
    scalability.

    Signed-off-by: Shaohua Li
    Cc: Rik van Riel
    Acked-by: Hugh Dickins
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
    Make some variables have correct alignment/section to avoid cache
    issues. In a workload which heavily does mmap/munmap, these variables
    are used frequently.
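
    A hedged illustration of the kind of annotation meant here; the variable
    names are examples, not necessarily the ones the patch touches:

    /* Keep frequently-read, rarely-written globals in the read-mostly
     * section so they don't share cache lines with write-hot data. */
    static struct kmem_cache *vm_area_cachep __read_mostly;
    int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;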

    Signed-off-by: Shaohua Li
    Cc: Andi Kleen
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Architectures that implement their own show_mem() function did not pass
    the filter argument to show_free_areas() to appropriately avoid emitting
    the state of nodes that are disallowed in the current context. This patch
    now passes the filter argument to show_free_areas() so those nodes are now
    avoided.

    This patch also removes the show_free_areas() wrapper around
    __show_free_areas() and converts existing callers to pass an empty filter.

    ia64 emits additional information for each node, so skip_free_areas_zone()
    must be made global to filter disallowed nodes and it is converted to use
    a nid argument rather than a zone for this use case.

    Signed-off-by: David Rientjes
    Cc: Russell King
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Kyle McMartin
    Cc: Helge Deller
    Cc: James Bottomley
    Cc: "David S. Miller"
    Cc: Guan Xuetao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • It has been reported on some laptops that kswapd is consuming large
    amounts of CPU and not being scheduled when SLUB is enabled during large
    amounts of file copying. It is expected that this is due to kswapd
    missing every cond_resched() point because:

    1) shrink_page_list() calls cond_resched() if inactive pages were
    isolated, which in turn may not happen if all_unreclaimable is set in
    shrink_zones(). If, for whatever reason, all_unreclaimable is set on
    all zones, we can miss calling cond_resched().

    2) balance_pgdat() only calls cond_resched() if the zones are not
    balanced. For a high-order allocation that is balanced, it checks
    order-0 again. During that window, order-0 might have become
    unbalanced, so it loops again for order-0 and returns that it was
    reclaiming for order-0 to kswapd(). It can then find that a caller has
    rewoken kswapd for a high-order allocation and re-enter balance_pgdat()
    without ever calling cond_resched().

    3) shrink_slab() only calls cond_resched() if we are reclaiming slab
    pages. If there are a large number of direct reclaimers, the
    shrinker_rwsem can be contended and prevent kswapd from calling
    cond_resched().

    This patch modifies the shrink_slab() case, as sketched below. If the
    semaphore is contended, the caller will still call cond_resched(). After
    each successful call into a shrinker, the cond_resched() check remains
    in case one shrinker is particularly slow.
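
    A hedged sketch of that control flow, simplified from shrink_slab() (the
    function name is changed to mark it as illustrative):

    unsigned long shrink_slab_sketch(unsigned long scanned, gfp_t gfp_mask,
                                     unsigned long lru_pages)
    {
            struct shrinker *shrinker;
            unsigned long ret = 0;

            if (!down_read_trylock(&shrinker_rwsem)) {
                    /* Contended: don't block, but still give the
                     * scheduler a chance before bailing out. */
                    cond_resched();
                    return 1;
            }

            list_for_each_entry(shrinker, &shrinker_list, list) {
                    /* ... call into the shrinker ... */
                    cond_resched();     /* kept after every shrinker call */
            }

            up_read(&shrinker_rwsem);
            return ret;
    }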

    [mgorman@suse.de: preserve call to cond_resched after each call into shrinker]
    Signed-off-by: Mel Gorman
    Signed-off-by: Minchan Kim
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Wu Fengguang
    Cc: James Bottomley
    Tested-by: Colin King
    Cc: Raghavendra D Prabhu
    Cc: Jan Kara
    Cc: Chris Mason
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: [2.6.38+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • There are a few reports of people experiencing hangs when copying large
    amounts of data with kswapd using a large amount of CPU which appear to be
    due to recent reclaim changes. SLUB using high orders is the trigger but
    not the root cause as SLUB has been using high orders for a while. The
    root cause was bugs introduced into reclaim which are addressed by the
    following two patches.

    Patch 1 corrects logic introduced by commit 1741c877 ("mm: kswapd:
    keep kswapd awake for high-order allocations until a percentage of
    the node is balanced") to allow kswapd to go to sleep when
    balanced for high orders.

    Patch 2 notes that it is possible for kswapd to miss every
    cond_resched() and updates shrink_slab() so it'll at least reach
    that scheduling point.

    Chris Wood reports that these two patches in isolation are sufficient to
    prevent the system hanging. AFAIK, they should also resolve similar hangs
    experienced by James Bottomley.

    This patch:

    Johannes Weiner pointed out that the logic in commit 1741c877 ("mm:
    kswapd: keep kswapd awake for high-order allocations until a percentage
    of the node is balanced") is backwards. Instead of allowing kswapd to go
    to sleep when balanced for high-order allocations, it keeps kswapd
    running uselessly.
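
    A hedged sketch of the inverted check; the helper names follow the
    commit text, but treat the exact code as illustrative:

    /* sleeping_prematurely() should report "premature" only when the node
     * is NOT yet balanced for the requested order. */
    if (order)
            return !pgdat_balanced(pgdat, balanced, classzone_idx);
            /* previously: return pgdat_balanced(...), which kept kswapd
             * awake exactly when it was allowed to sleep */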

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Signed-off-by: Johannes Weiner
    Reviewed-by: Wu Fengguang
    Cc: James Bottomley
    Tested-by: Colin King
    Cc: Raghavendra D Prabhu
    Cc: Jan Kara
    Cc: Chris Mason
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Reviewed-by: Minchan Kim
    Reviewed-by: Wu Fengguang
    Cc: [2.6.38+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit 442b06bcea23 ("slub: Remove node check in slab_free") added a
    call to deactivate_slab() in the debug case in __slab_alloc(), which
    unlocks the current slab used for allocation. Going to the label
    'unlock_out' then does it again.

    Also, in the debug case we do not need all the other processing that the
    'unlock_out' path does. We always fall back to the slow path in the
    debug case. So the tid update is useless.

    Similarly, ALLOC_SLOWPATH would just be incremented for all allocations.
    Also a pretty useless thing.

    So simply restore irq flags and return the object.

    Signed-off-by: Christoph Lameter
    Reported-and-bisected-by: James Morris
    Reported-by: Ingo Molnar
    Reported-by: Jens Axboe
    Cc: Pekka Enberg
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • * 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6: (29 commits)
    [S390] cpu hotplug: fix external interrupt subclass mask handling
    [S390] oprofile: dont access lowcore
    [S390] oprofile: add missing irq stats counter
    [S390] Ignore sendmmsg system call note wired up warning
    [S390] s390,oprofile: fix compile error for !CONFIG_SMP
    [S390] s390,oprofile: fix alert counter increment
    [S390] Remove unused includes in process.c
    [S390] get CPC image name
    [S390] sclp: event buffer dissection
    [S390] chsc: process channel-path-availability information
    [S390] refactor page table functions for better pgste support
    [S390] merge page_test_dirty and page_clear_dirty
    [S390] qdio: prevent compile warning
    [S390] sclp: remove unnecessary sendmask check
    [S390] convert old cpumask API into new one
    [S390] pfault: cleanup code
    [S390] pfault: cpu hotplug vs missing completion interrupts
    [S390] smp: add __noreturn attribute to cpu_die()
    [S390] percpu: implement arch specific irqsafe_cpu_ops
    [S390] vdso: disable gcov profiling
    ...

    Linus Torvalds
     
  • * 'for-2.6.40' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
    percpu: Unify input section names
    percpu: Avoid extra NOP in percpu_cmpxchg16b_double
    percpu: Cast away printk format warning
    percpu: Always align percpu output section to PAGE_SIZE

    Fix up fairly trivial conflict in arch/x86/include/asm/percpu.h as per Tejun

    Linus Torvalds
     

24 May, 2011

4 commits

  • Tejun Heo
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
    slub: Deal with hyperthetical case of PAGE_SIZE > 2M
    slub: Remove node check in slab_free
    slub: avoid label inside conditional
    slub: Make CONFIG_DEBUG_PAGE_ALLOC work with new fastpath
    slub: Avoid warning for !CONFIG_SLUB_DEBUG
    slub: Remove CONFIG_CMPXCHG_LOCAL ifdeffery
    slub: Move debug handlign in __slab_free
    slub: Move node determination out of hotpath
    slub: Eliminate repeated use of c->page through a new page variable
    slub: get_map() function to establish map of free objects in a slab
    slub: Use NUMA_NO_NODE in get_partial
    slub: Fix a typo in config name

    Linus Torvalds
     
  • Conflicts:
    mm/slub.c

    Pekka Enberg
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
    b43: fix comment typo reqest -> request
    Haavard Skinnemoen has left Atmel
    cris: typo in mach-fs Makefile
    Kconfig: fix copy/paste-ism for dell-wmi-aio driver
    doc: timers-howto: fix a typo ("unsgined")
    perf: Only include annotate.h once in tools/perf/util/ui/browsers/annotate.c
    md, raid5: Fix spelling error in comment ('Ofcourse' --> 'Of course').
    treewide: fix a few typos in comments
    regulator: change debug statement be consistent with the style of the rest
    Revert "arm: mach-u300/gpio: Fix mem_region resource size miscalculations"
    audit: acquire creds selectively to reduce atomic op overhead
    rtlwifi: don't touch with treewide double semicolon removal
    treewide: cleanup continuations and remove logging message whitespace
    ath9k_hw: don't touch with treewide double semicolon removal
    include/linux/leds-regulator.h: fix syntax in example code
    tty: fix typo in descripton of tty_termios_encode_baud_rate
    xtensa: remove obsolete BKL kernel option from defconfig
    m68k: fix comment typo 'occcured'
    arch:Kconfig.locks Remove unused config option.
    treewide: remove extra semicolons
    ...

    Linus Torvalds
     

23 May, 2011

1 commit

  • The page_clear_dirty primitive always sets the default storage key
    which resets the access control bits and the fetch protection bit.
    That will surprise a KVM guest that sets non-zero access control
    bits or the fetch protection bit. Merge page_test_dirty and
    page_clear_dirty back to a single function and only clear the
    dirty bit from the storage key.

    In addition move the function page_test_and_clear_dirty and
    page_test_and_clear_young to page.h where they belong. This
    requires changing the parameter from a struct page * to a page frame
    number.

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     

21 May, 2011

3 commits

    We can set the page pointer in the percpu structure to NULL to have the
    same effect as setting c->node to NUMA_NO_NODE.

    Gets rid of one check in slab_free() that was only used for
    forcing the slab_free to the slowpath for debugging.

    We still need to set c->node to NUMA_NO_NODE to force the
    slab_alloc() fastpath to the slowpath in case of debugging.

    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     
  • Commit 778dd893ae78 ("tmpfs: fix race between umount and swapoff")
    forgot the new rules for strict atomic kmap nesting, causing

    WARNING: at arch/x86/mm/highmem_32.c:81

    from __kunmap_atomic(), then

    BUG: unable to handle kernel paging request at fffb9000

    from shmem_swp_set() when shmem_unuse_inode() is handling swapoff with
    highmem in use. My disgrace again.

    See
    https://bugzilla.kernel.org/show_bug.cgi?id=35352

    Reported-by: Witold Baryluk
    Signed-off-by: Hugh Dickins
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Commit e66eed651fd1 ("list: remove prefetching from regular list
    iterators") removed the include of prefetch.h from list.h, which
    uncovered several cases that had apparently relied on that rather
    obscure header file dependency.

    So this fixes things up a bit, using

    grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]')
    grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]')

    to guide us in finding files that either need
    inclusion, or have it despite not needing it.

    There are more of them around (mostly network drivers), but this gets
    many core ones.

    Reported-by: Stephen Rothwell
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

20 May, 2011

2 commits


18 May, 2011

3 commits

    ZONE_CONGESTED should be a state of global memory reclaim. If not, a
    busy memcg sets this and causes unnecessary throttling in
    wait_iff_congested() for memory reclaim in other contexts. This makes
    system performance bad.

    I'll think later about whether a "memcg is congested!" flag is required
    or not. But this fix is required first.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Daisuke Nishimura
    Acked-by: Ying Han
    Cc: Balbir Singh
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Jumping to a label inside a conditional is considered poor style,
    especially considering the current organization of __slab_alloc().

    This removes the 'load_from_page' label and just duplicates the three
    lines of code that it uses:

    c->node = page_to_nid(page);
    c->page = page;
    goto load_freelist;

    since it's probably not worth making this a separate helper function.

    Acked-by: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Pekka Enberg

    David Rientjes
     
    The fastpath can do a speculative access to a page that
    CONFIG_DEBUG_PAGE_ALLOC may have marked as invalid, in order to retrieve
    the pointer to the next free object.

    Use probe_kernel_read in that case in order not to cause a page fault.
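
    A hedged sketch of such a guarded read, modeled on a "safe" freelist
    pointer helper (treat the helper name and fields as illustrative):

    /* Read the next-free pointer stored inside the object without risking
     * a fault if DEBUG_PAGEALLOC has unmapped the containing page. */
    static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
    {
            void *p;

    #ifdef CONFIG_DEBUG_PAGEALLOC
            probe_kernel_read(&p, (void **)(object + s->offset), sizeof(p));
    #else
            p = *(void **)(object + s->offset);
    #endif
            return p;
    }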

    Cc: # 38.x
    Reported-by: Eric Dumazet
    Signed-off-by: Christoph Lameter
    Signed-off-by: Eric Dumazet
    Signed-off-by: Pekka Enberg

    Christoph Lameter