21 Aug, 2008

7 commits

  • XIP can call into get_xip_mem concurrently with the same file,offset with
    create=1. This usually maps down to get_block, which expects the page
    lock to prevent such a situation. This causes ext2 to explode for one
    reason or another.

    Serialise those calls for the moment. For common usages today, I suspect
    get_xip_mem is rarely called to create new blocks. In the future, as XIP
    technologies evolve, we might need to look at which operations require
    scalability and rework the locking to suit.
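
    A minimal sketch of the serialisation, assuming a single global mutex in
    mm/filemap_xip.c (the exact name and placement are illustrative):

        /* Serialise block-creating get_xip_mem calls. */
        static DEFINE_MUTEX(xip_sparse_mutex);

        mutex_lock(&xip_sparse_mutex);
        error = mapping->a_ops->get_xip_mem(mapping, pgoff, 1,
                                            &xip_mem, &xip_pfn);
        mutex_unlock(&xip_sparse_mutex);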

    Signed-off-by: Nick Piggin
    Cc: Jared Hulbert
    Acked-by: Carsten Otte
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • XIP has a race between sparse pages being inserted into page tables and
    sparse pages being zapped when it's time to put a non-sparse page in.

    What can happen is that a process can be left with a dangling sparse page
    in a MAP_SHARED mapping, while the rest of the world sees the non-sparse
    version -- i.e. data corruption.

    Guard these operations with a seqlock, making fault-in-sparse-pages the
    slowpath, and try-to-unmap-sparse-pages the fastpath.
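
    A hedged sketch of the seqcount pattern this describes (names and details
    are illustrative): the fault side re-checks a sequence counter and retries
    if an unmap ran concurrently, so the unmap side stays cheap:

        static DEFINE_SPINLOCK(xip_sparse_lock);
        static seqcount_t xip_sparse_seq = SEQCNT_ZERO;

        /* fault-in-sparse-pages, the slowpath: retry on a raced unmap */
        do {
                seq = read_seqcount_begin(&xip_sparse_seq);
                /* ... look up and map the sparse (zero) page ... */
        } while (read_seqcount_retry(&xip_sparse_seq, seq));

        /* try-to-unmap-sparse-pages, the fastpath */
        spin_lock(&xip_sparse_lock);
        write_seqcount_begin(&xip_sparse_seq);
        /* ... zap the sparse page from all page tables ... */
        write_seqcount_end(&xip_sparse_seq);
        spin_unlock(&xip_sparse_lock);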

    Signed-off-by: Nick Piggin
    Cc: Jared Hulbert
    Acked-by: Carsten Otte
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • There is a race with dirty page accounting where a page may not properly
    be accounted for.

    clear_page_dirty_for_io() calls page_mkclean; then TestClearPageDirty.

    page_mkclean walks the rmaps for that page, and for each one it cleans
    and write-protects the pte if it was dirty. It uses page_check_address
    to find the pte. That function has a shortcut to avoid the ptl if the
    pte is not present. Unfortunately, the pte can be switched to not-present
    and then back to present by other code that holds the page table lock --
    this should not be a signal for page_mkclean to ignore that pte, because
    it may be dirty.

    For example, powerpc64's set_pte_at will clear a previously present pte
    before setting it to the desired value. There may also be other code in
    core mm or in arch that does similar things.

    The consequence of the bug is loss of data integrity due to msync, and
    loss of dirty page accounting accuracy. XIP's __xip_unmap could easily
    also be unreliable (depending on the exact XIP locking scheme), which can
    lead to data corruption.

    Fix this by having an option to always take ptl to check the pte in
    page_check_address.

    It's possible to retain this optimization for page_referenced and
    try_to_unmap.
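
    The shape of the change, hedged (parameter naming from memory of the
    2.6.27-era rmap code; treat it as illustrative): page_check_address()
    grows a flag that forces it to take the ptl even when the quick
    pte_present() peek suggests the pte is gone:

        pte_t *page_check_address(struct page *page, struct mm_struct *mm,
                                  unsigned long address, spinlock_t **ptlp,
                                  int sync);

        /* page_mkclean passes sync=1 and always checks under the ptl;
         * page_referenced and try_to_unmap keep sync=0 and the shortcut. */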

    Signed-off-by: Nick Piggin
    Cc: Jared Hulbert
    Cc: Carsten Otte
    Cc: Hugh Dickins
    Acked-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Absolute alignment requirements may never be applied to node-relative
    offsets. Andreas Herrmann spotted this flaw when a bootmem allocation on
    an unaligned node was itself not aligned, because the combination of an
    unaligned node with an aligned offset into that node is not guaranteed to
    be aligned itself.

    This patch introduces two helper functions that align a node-relative
    index or offset with respect to the node's starting address so that the
    absolute PFN or virtual address that results from combining the two
    satisfies the requested alignment.

    Then all the broken ALIGN()s in alloc_bootmem_core() are replaced by these
    helpers.
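
    A hedged sketch of the two helpers (the names and the exact node-base
    field are from memory and may differ from the patch):

        static unsigned long align_idx(struct bootmem_data *bdata,
                                       unsigned long idx, unsigned long step)
        {
                unsigned long base = PFN_DOWN(bdata->node_boot_start);

                /*
                 * Align the node-relative index so that base + idx
                 * satisfies the requested alignment.
                 */
                return ALIGN(base + idx, step) - base;
        }

        static unsigned long align_off(struct bootmem_data *bdata,
                                       unsigned long off, unsigned long align)
        {
                unsigned long base = bdata->node_boot_start;

                /* Same as align_idx, for byte offsets. */
                return ALIGN(base + off, align) - base;
        }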

    Signed-off-by: Johannes Weiner
    Reported-by: Andreas Herrmann
    Debugged-by: Andreas Herrmann
    Reviewed-by: Andreas Herrmann
    Tested-by: Andreas Herrmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • mminit_loglevel is now used from mminit_verify_zonelist
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marcin Slusarz
     
  • Adjust show_swap_cache_info() to show "Free swap" as a
    signed long: the signed format is preferable, because during swapoff
    nr_swap_pages can legitimately go negative, so it makes more sense to
    show it thus (it used to be shown redundantly, once as signed and once
    as unsigned).

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Add a comment to s390's page_test_dirty/page_clear_dirty/page_set_dirty
    dance in page_remove_rmap(): I was wrong to think the PageSwapCache test
    could be avoided, and would like a comment in there to remind me. And
    mention s390, to help us remember that this block is not really common.

    Also move down the "It would be tidy to reset PageAnon" comment: it does
    not belong to s390's block, and it would be unwise to reset PageAnon
    before we're done with testing it.
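
    For reference, a hedged sketch of the shape of that s390 block in
    page_remove_rmap(), reconstructed from the description (the PageSwapCache
    test must stay because an anon page that has been to swap may carry a
    dirty storage key):

        if ((!PageAnon(page) || PageSwapCache(page)) &&
            page_test_dirty(page)) {
                /* s390 only: carry the storage key's dirty state over */
                page_clear_dirty(page);
                set_page_dirty(page);
        }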

    Signed-off-by: Hugh Dickins
    Acked-by: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

16 Aug, 2008

1 commit


15 Aug, 2008

1 commit

  • This is the minimal sequence that jams the allocator:

    void *p, *q, *r;
    p = alloc_bootmem(PAGE_SIZE);
    q = alloc_bootmem(64);
    free_bootmem(p, PAGE_SIZE);
    p = alloc_bootmem(PAGE_SIZE);
    r = alloc_bootmem(64);

    after this sequence (assuming that the allocator was empty or page-aligned
    before), pointer "q" will be equal to pointer "r".

    What's happening inside the allocator:
    p = alloc_bootmem(PAGE_SIZE);
    in allocator: last_end_off == PAGE_SIZE, bitmap contains bits 10000...
    q = alloc_bootmem(64);
    in allocator: last_end_off == PAGE_SIZE + 64, bitmap contains 11000...
    free_bootmem(p, PAGE_SIZE);
    in allocator: last_end_off == PAGE_SIZE + 64, bitmap contains 01000...
    p = alloc_bootmem(PAGE_SIZE);
    in allocator: last_end_off == PAGE_SIZE, bitmap contains 11000...
    r = alloc_bootmem(64);

    and now:

    it finds bit "2", as a place where to allocate (sidx)

    it hits the condition

    if (bdata->last_end_off && PFN_DOWN(bdata->last_end_off) + 1 == sidx))
    start_off = ALIGN(bdata->last_end_off, align);

    -you can see that the condition is true, so it assigns start_off =
    ALIGN(bdata->last_end_off, align); (that is PAGE_SIZE) and allocates
    over already allocated block.

    With the patch it tries to continue at the end of previous allocation only
    if the previous allocation ended in the middle of the page.
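
    A hedged sketch of the corrected condition, going by the description
    above (merge with the previous allocation only when last_end_off is not
    page-aligned, i.e. the previous allocation ended mid-page):

        if (bdata->last_end_off & (PAGE_SIZE - 1) &&
            PFN_DOWN(bdata->last_end_off) + 1 == sidx)
                start_off = ALIGN(bdata->last_end_off, align);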

    Signed-off-by: Mikulas Patocka
    Acked-by: Johannes Weiner
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
     

14 Aug, 2008

1 commit

  • Fix the setting of PF_SUPERPRIV by __capable(), as it could corrupt the
    flags of the target process if that is not the current process and it is
    trying to change its own flags in a different way at the same time.

    __capable() is using neither atomic ops nor locking to protect t->flags. This
    patch removes __capable() and introduces has_capability() that doesn't set
    PF_SUPERPRIV on the process being queried.

    This patch further splits security_ptrace() in two:

    (1) security_ptrace_may_access(). This passes judgement only on whether one
    process may access another (PTRACE_MODE_ATTACH for ptrace() and
    PTRACE_MODE_READ for /proc), and takes a pointer to the child process.
    current is the parent.

    (2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only,
    and takes only a pointer to the parent process. current is the child.

    In Smack and commoncap, this uses has_capability() to determine whether
    the parent will be permitted to use PTRACE_ATTACH if normal checks fail.
    This does not set PF_SUPERPRIV.

    Two of the instances of __capable() actually only act on current, and so have
    been changed to calls to capable().

    Of the places that were using __capable():

    (1) The OOM killer calls __capable() thrice when weighing the killability of a
    process. All of these now use has_capability().

    (2) cap_ptrace() and smack_ptrace() were using __capable() to check to see
    whether the parent was allowed to trace any process. As mentioned above,
    these have been split. For PTRACE_ATTACH and /proc, capable() is now
    used, and for PTRACE_TRACEME, has_capability() is used.

    (3) cap_safe_nice() only ever saw current, so now uses capable().

    (4) smack_setprocattr() rejected accesses to tasks other than current just
    after calling __capable(), so the order of these two tests has been
    switched and capable() is used instead.

    (5) In smack_file_send_sigiotask(), we need to allow privileged processes to
    receive SIGIO on files they're manipulating.

    (6) In smack_task_wait(), we let a process wait for a privileged process,
    whether or not the process doing the waiting is privileged.

    I've tested this with the LTP SELinux and syscalls test scripts.
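
    A hedged sketch of the behavioural difference (2.6.27-era API, signatures
    from memory):

        /* capable() acts on current and records the privilege use by
         * setting PF_SUPERPRIV internally on success: */
        if (capable(CAP_SYS_NICE))
                /* current may proceed */;

        /* has_capability() queries an arbitrary task and never touches
         * its t->flags, so it is safe on non-current tasks: */
        if (has_capability(p, CAP_SYS_ADMIN))
                /* e.g. the OOM killer weighing the killability of p */;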

    Signed-off-by: David Howells
    Acked-by: Serge Hallyn
    Acked-by: Casey Schaufler
    Acked-by: Andrew G. Morgan
    Acked-by: Al Viro
    Signed-off-by: James Morris

    David Howells
     

13 Aug, 2008

7 commits

  • Signed-off-by: Huang Weiyi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Weiyi
     
  • Signed-off-by: MinChan Kim
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    MinChan Kim
     
  • [Andrew this should replace the previous version which did not check
    the returns from the region prepare for errors. This has been tested by
    us and Gerald and it looks good.

    Bah, while reviewing the locking based on your previous email I spotted
    that we need to check the return from the vma_needs_reservation call for
    allocation errors. Here is an updated patch to correct this. This passes
    testing here.]

    Signed-off-by: Andy Whitcroft
    Tested-by: Gerald Schaefer
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • In the normal case, hugetlbfs reserves hugepages at map time so that the
    pages exist for future faults. A struct file_region is used to track when
    reservations have been consumed and where. These file_regions are
    allocated as necessary with kmalloc(), which can sleep with the
    mm->page_table_lock held. This is wrong and triggers a may-sleep warning
    when PREEMPT is enabled.

    Updates to the underlying file_region are done in two phases. The first
    phase prepares the region for the change, allocating any necessary memory
    without actually making the change. The second phase actually commits the
    change. This patch makes use of that by checking the reservations before
    the page_table_lock is taken, triggering any necessary allocations. The
    check may then be safely repeated within the locks without any allocations
    being required.
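
    A hedged sketch of the two-phase pattern in the fault path (the helper
    names follow mm/hugetlb.c of that era, but treat the exact signatures as
    illustrative):

        /* Phase 1, outside the lock: may kmalloc() and sleep. */
        if (vma_needs_reservation(h, vma, address) < 0)
                return VM_FAULT_OOM;

        spin_lock(&mm->page_table_lock);
        /* Phase 2, under the lock: commit; no allocation happens now. */
        vma_commit_reservation(h, vma, address);
        spin_unlock(&mm->page_table_lock);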

    Credit to Mel Gorman for diagnosing this failure and initial versions of
    the patch.

    Signed-off-by: Andy Whitcroft
    Tested-by: Gerald Schaefer
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Whitcroft
     
  • Got an oops in mem_cgroup_shrink_usage() when testing loop over tmpfs:
    yes, of course, loop0 has no mm -- the other entry points check for an
    mm, but this one didn't.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • .. since a failed allocation is (initially) handled gracefully, and the
    function panic()s explicitly if the retries with smaller sizes have also
    failed.

    Signed-off-by: Jan Beulich
    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Beulich
     
  • The s390 software large page emulation implements shared page tables by
    using page->index of the first tail page from a compound large page to
    store page table information. This is set up in arch_prepare_hugepage(),
    which is called from alloc_fresh_huge_page_node().

    A similar call to arch_prepare_hugepage() is missing for surplus large
    pages that are allocated in alloc_buddy_huge_page(), which breaks the
    software emulation mode for (surplus) large pages on s390. This patch
    adds the missing call to arch_prepare_hugepage(). It will have no effect
    on other architectures where arch_prepare_hugepage() is a nop.

    Also, use the correct order in the error path in alloc_fresh_huge_page_node().

    Acked-by: Martin Schwidefsky
    Signed-off-by: Gerald Schaefer
    Acked-by: Nick Piggin
    Acked-by: Adam Litke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     

12 Aug, 2008

4 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
    fix spinlock recursion in hvc_console
    stop_machine: remove unused variable
    modules: extend initcall_debug functionality to the module loader
    export virtio_rng.h
    lguest: use get_user_pages_fast() instead of get_user_pages()
    mm: Make generic weak get_user_pages_fast and EXPORT_GPL it
    lguest: don't set MAC address for guest unless specified

    Linus Torvalds
     
  • Out of line get_user_pages_fast fallback implementation, make it a weak
    symbol, get rid of CONFIG_HAVE_GET_USER_PAGES_FAST.

    Export the symbol to modules so lguest can use it.
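
    A hedged sketch of what the generic fallback looks like as a weak symbol
    (details from memory; treat as illustrative):

        int __attribute__((weak)) get_user_pages_fast(unsigned long start,
                        int nr_pages, int write, struct page **pages)
        {
                struct mm_struct *mm = current->mm;
                int ret;

                /* Slow, generic path: mmap_sem plus get_user_pages(). */
                down_read(&mm->mmap_sem);
                ret = get_user_pages(current, mm, start, nr_pages,
                                     write, 0, pages, NULL);
                up_read(&mm->mmap_sem);

                return ret;
        }
        EXPORT_SYMBOL_GPL(get_user_pages_fast);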

    Signed-off-by: Nick Piggin
    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • Merge branch 'core-fixes-for-linus' of
    git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    lockdep: fix debug_lock_alloc
    lockdep: increase MAX_LOCKDEP_KEYS
    generic-ipi: fix stack and rcu interaction bug in smp_call_function_mask()
    lockdep: fix overflow in the hlock shrinkage code
    lockdep: rename map_[acquire|release]() => lock_map_[acquire|release]()
    lockdep: handle chains involving classes defined in modules
    mm: fix mm_take_all_locks() locking order
    lockdep: annotate mm_take_all_locks()
    lockdep: spin_lock_nest_lock()
    lockdep: lock protection locks
    lockdep: map_acquire
    lockdep: shrink held_lock structure
    lockdep: re-annotate scheduler runqueues
    lockdep: lock_set_subclass - reset a held lock's subclass
    lockdep: change scheduler annotation
    debug_locks: set oops_in_progress if we will log messages.
    lockdep: fix combinatorial explosion in lock subgraph traversal

    Linus Torvalds
     
  • Ingo Molnar
     

11 Aug, 2008

2 commits

  • Lockdep spotted:

    =======================================================
    [ INFO: possible circular locking dependency detected ]
    2.6.27-rc1 #270
    -------------------------------------------------------
    qemu-kvm/2033 is trying to acquire lock:
    (&inode->i_data.i_mmap_lock){----}, at: [] mm_take_all_locks+0xc2/0xea

    but task is already holding lock:
    (&anon_vma->lock){----}, at: [] mm_take_all_locks+0x70/0xea

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&anon_vma->lock){----}:
    [] __lock_acquire+0x11be/0x14d2
    [] lock_acquire+0x5e/0x7a
    [] _spin_lock+0x3b/0x47
    [] vma_adjust+0x200/0x444
    [] split_vma+0x12f/0x146
    [] mprotect_fixup+0x13c/0x536
    [] sys_mprotect+0x1a9/0x21e
    [] system_call_fastpath+0x16/0x1b
    [] 0xffffffffffffffff

    -> #0 (&inode->i_data.i_mmap_lock){----}:
    [] __lock_acquire+0xedb/0x14d2
    [] lock_release_non_nested+0x1c2/0x219
    [] lock_release+0x127/0x14a
    [] _spin_unlock+0x1e/0x50
    [] mm_drop_all_locks+0x7f/0xb0
    [] do_mmu_notifier_register+0xe2/0x112
    [] mmu_notifier_register+0xe/0x10
    [] kvm_dev_ioctl+0x11e/0x287 [kvm]
    [] vfs_ioctl+0x2a/0x78
    [] do_vfs_ioctl+0x257/0x274
    [] sys_ioctl+0x55/0x78
    [] system_call_fastpath+0x16/0x1b
    [] 0xffffffffffffffff

    other info that might help us debug this:

    5 locks held by qemu-kvm/2033:
    #0: (&mm->mmap_sem){----}, at: [] do_mmu_notifier_register+0x55/0x112
    #1: (mm_all_locks_mutex){--..}, at: [] mm_take_all_locks+0x34/0xea
    #2: (&anon_vma->lock){----}, at: [] mm_take_all_locks+0x70/0xea
    #3: (&anon_vma->lock){----}, at: [] mm_take_all_locks+0x70/0xea
    #4: (&anon_vma->lock){----}, at: [] mm_take_all_locks+0x70/0xea

    stack backtrace:
    Pid: 2033, comm: qemu-kvm Not tainted 2.6.27-rc1 #270

    Call Trace:
    [] print_circular_bug_tail+0xb8/0xc3
    [] __lock_acquire+0xedb/0x14d2
    [] ? add_lock_to_list+0x7e/0xad
    [] ? mm_take_all_locks+0x70/0xea
    [] ? mm_take_all_locks+0x70/0xea
    [] lock_release_non_nested+0x1c2/0x219
    [] ? mm_take_all_locks+0xc2/0xea
    [] ? mm_take_all_locks+0xc2/0xea
    [] ? trace_hardirqs_on_caller+0x4d/0x115
    [] ? mm_drop_all_locks+0x7f/0xb0
    [] lock_release+0x127/0x14a
    [] _spin_unlock+0x1e/0x50
    [] mm_drop_all_locks+0x7f/0xb0
    [] do_mmu_notifier_register+0xe2/0x112
    [] mmu_notifier_register+0xe/0x10
    [] kvm_dev_ioctl+0x11e/0x287 [kvm]
    [] ? file_has_perm+0x83/0x8e
    [] vfs_ioctl+0x2a/0x78
    [] do_vfs_ioctl+0x257/0x274
    [] sys_ioctl+0x55/0x78
    [] system_call_fastpath+0x16/0x1b

    Which the locking hierarchy in mm/rmap.c confirms as valid.

    Fix this by first taking all the mapping->i_mmap_lock instances and then
    taking all anon_vma->lock instances.
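
    A hedged sketch of the fix in mm_take_all_locks(): split its single pass
    over the vmas into two passes, so every i_mmap_lock is taken before any
    anon_vma->lock (helper names are illustrative):

        /* First pass: file-backed vmas. */
        for (vma = mm->mmap; vma; vma = vma->vm_next)
                if (vma->vm_file && vma->vm_file->f_mapping)
                        vm_lock_mapping(mm, vma->vm_file->f_mapping);

        /* Second pass: anonymous vmas. */
        for (vma = mm->mmap; vma; vma = vma->vm_next)
                if (vma->anon_vma)
                        vm_lock_anon_vma(mm, vma->anon_vma);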

    Signed-off-by: Peter Zijlstra
    Acked-by: Hugh Dickins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The nesting is correct due to holding mmap_sem; use the new annotation
    to annotate this.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

10 Aug, 2008

1 commit


07 Aug, 2008

1 commit


06 Aug, 2008

2 commits

  • gcc 4.3.0 correctly emits the following warnings.
    When a vma covering addr is found, find_vma_prepare indeed returns without
    setting pprev, rb_link, and rb_parent.

    mm/mmap.c: In function `insert_vm_struct':
    mm/mmap.c:2085: warning: `rb_parent' may be used uninitialized in this function
    mm/mmap.c:2085: warning: `rb_link' may be used uninitialized in this function
    mm/mmap.c:2084: warning: `prev' may be used uninitialized in this function
    mm/mmap.c: In function `copy_vma':
    mm/mmap.c:2124: warning: `rb_parent' may be used uninitialized in this function
    mm/mmap.c:2124: warning: `rb_link' may be used uninitialized in this function
    mm/mmap.c:2123: warning: `prev' may be used uninitialized in this function
    mm/mmap.c: In function `do_brk':
    mm/mmap.c:1951: warning: `rb_parent' may be used uninitialized in this function
    mm/mmap.c:1951: warning: `rb_link' may be used uninitialized in this function
    mm/mmap.c:1949: warning: `prev' may be used uninitialized in this function
    mm/mmap.c: In function `mmap_region':
    mm/mmap.c:1092: warning: `rb_parent' may be used uninitialized in this function
    mm/mmap.c:1092: warning: `rb_link' may be used uninitialized in this function
    mm/mmap.c:1089: warning: `prev' may be used uninitialized in this function

    Hugh adds: in fact, none of find_vma_prepare's callers use those values
    when a vma is found to be already covering addr; it's either an error or
    an occasion to munmap and repeat. Okay, let's quieten the compiler (but I
    would prefer it if pprev, rb_link and rb_parent were meaningful in that
    case, rather than whatever's in them from descending the tree).

    Signed-off-by: Benny Halevy
    Signed-off-by: Hugh Dickins
    Cc: "Ryan Hope"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benny Halevy
     
  • gcc-3.2:

    mm/mm_init.c:77:1: directives may not be used inside a macro argument
    mm/mm_init.c:76:47: unterminated argument list invoking macro "mminit_dprintk"
    mm/mm_init.c: In function `mminit_verify_pageflags_layout':
    mm/mm_init.c:80: `mminit_dprintk' undeclared (first use in this function)
    mm/mm_init.c:80: (Each undeclared identifier is reported only once
    mm/mm_init.c:80: for each function it appears in.)
    mm/mm_init.c:80: syntax error before numeric constant

    Also fix a typo in a comment.
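
    A hedged, generic illustration of the gcc-3.2 limitation and the usual
    fix (the actual mm_init.c code differs; this only shows the pattern):

        /* Broken with gcc-3.2: a preprocessor directive inside the
         * argument list of a macro invocation. */
        mminit_dprintk(MMINIT_TRACE, "pageflags_layout",
        #ifdef NODE_NOT_IN_PAGE_FLAGS
                       "Node not in page flags\n");
        #else
                       "Node in page flags\n");
        #endif

        /* Fix: hoist the conditional out of the macro argument. */
        #ifdef NODE_NOT_IN_PAGE_FLAGS
        mminit_dprintk(MMINIT_TRACE, "pageflags_layout",
                       "Node not in page flags\n");
        #else
        mminit_dprintk(MMINIT_TRACE, "pageflags_layout",
                       "Node in page flags\n");
        #endif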

    Reported-by: Adrian Bunk
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

05 Aug, 2008

4 commits

  • This patch changes the static MIN_PARTIAL to a dynamic per-cache ->min_partial
    value that is calculated from object size. The bigger the object size, the more
    pages we keep on the partial list.

    I tested SLAB, SLUB, and SLUB with this patch on Jens Axboe's 'netio' example
    script of the fio benchmarking tool. The script stresses the networking
    subsystem which should also give a fairly good beating of kmalloc() et al.

    To run the test yourself, first clone the fio repository:

    git clone git://git.kernel.dk/fio.git

    and then run the following command n times on your machine:

    time ./fio examples/netio

    The results on my 2-way 64-bit x86 machine are as follows:

    [ the minimum, maximum, and average are captured from 50 individual runs ]

    real time (seconds)
                    min     max     avg     sd
    SLAB            22.76   23.38   22.98   0.17
    SLUB            22.80   25.78   23.46   0.72
    SLUB (dynamic)  22.74   23.54   23.00   0.20

    sys time (seconds)
                    min     max     avg     sd
    SLAB             6.90    8.28    7.70   0.28
    SLUB             7.42   16.95    8.89   2.28
    SLUB (dynamic)   7.17    8.64    7.73   0.29

    user time (seconds)
                    min     max     avg     sd
    SLAB            36.89   38.11   37.50   0.29
    SLUB            30.85   37.99   37.06   1.67
    SLUB (dynamic)  36.75   38.07   37.59   0.32

    As you can see from the above numbers, this patch brings SLUB to the same
    level as SLAB for this particular workload, fixing a ~2% regression. I'd
    expect this change to help similar workloads that allocate a lot of
    objects that are close to the size of a page.
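
    A hedged sketch of the heuristic (the clamp bounds are illustrative; the
    patch derives ->min_partial from the object size roughly like this):

        /*
         * Keep more pages on the partial list for caches with larger
         * objects, since refilling from the page allocator costs more.
         */
        static void set_min_partial(struct kmem_cache *s)
        {
                unsigned long min = ilog2(s->size);

                if (min < MIN_PARTIAL)          /* e.g. 5 */
                        min = MIN_PARTIAL;
                else if (min > MAX_PARTIAL)     /* e.g. 10 */
                        min = MAX_PARTIAL;
                s->min_partial = min;
        }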

    Cc: Matthew Wilcox
    Cc: Andrew Morton
    Acked-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Pekka Enberg
     
  • Converting page lock to new locking bitops requires a change of page flag
    operation naming, so we might as well convert it to something nicer
    (!TestSetPageLocked_Lock => trylock_page, SetPageLocked => set_page_locked).

    This also facilitates lockdeping of page lock.
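
    For example, a caller that open-coded the old bitop now reads as follows
    (hedged: trylock_page() returns nonzero when the lock was acquired, i.e.
    the sense is inverted relative to TestSetPageLocked):

        /* before */
        if (!TestSetPageLocked(page))
                /* got the page lock */;

        /* after */
        if (trylock_page(page))
                /* got the page lock */;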

    Signed-off-by: Nick Piggin
    Acked-by: KOSAKI Motohiro
    Acked-by: Peter Zijlstra
    Acked-by: Andrew Morton
    Acked-by: Benjamin Herrenschmidt
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6: (29 commits)
    sh: enable maple_keyb in dreamcast_defconfig.
    SH2(A) cache update
    nommu: Provide vmalloc_exec().
    add addrespace definition for sh2a.
    sh: Kill off ARCH_SUPPORTS_AOUT and remnants of a.out support.
    sh: define GENERIC_HARDIRQS_NO__DO_IRQ.
    sh: define GENERIC_LOCKBREAK.
    sh: Save NUMA node data in vmcore for crash dumps.
    sh: module_alloc() should be using vmalloc_exec().
    sh: Fix up __bug_table handling in module loader.
    sh: Add documentation and integrate into docbook build.
    sh: Fix up broken kerneldoc comments.
    maple: Kill useless private_data pointer.
    maple: Clean up maple_driver_register/unregister routines.
    input: Clean up maple keyboard driver
    maple: allow removal and reinsertion of keyboard driver module
    sh: /proc/asids depends on MMU.
    arch/sh/boards/mach-se/7343/irq.c: removed duplicated #include
    arch/sh/boards/board-ap325rxa.c: removed duplicated #include
    sh/boards/Makefile typo fix
    ...

    Linus Torvalds
     
  • Halesh says:

    Please find below the test case provided to test mlock.

    Test Case :
    ===========================

    #include <stdio.h>
    #include <stdlib.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/mman.h>
    #include <sys/resource.h>

    int main(void)
    {
        int fd, ret, i = 0;
        char *addr, *addr1 = NULL;
        unsigned int page_size;
        struct rlimit rlim;

        if (0 != geteuid())
        {
            printf("Execute this pgm as root\n");
            exit(1);
        }

        /* create a file */
        if ((fd = open("mmap_test.c", O_RDWR|O_CREAT, 0755)) == -1)
        {
            printf("cant create test file\n");
            exit(1);
        }

        page_size = sysconf(_SC_PAGE_SIZE);

        /* set the MEMLOCK limit */
        rlim.rlim_cur = 2000;
        rlim.rlim_max = 2000;

        if ((ret = setrlimit(RLIMIT_MEMLOCK, &rlim)) != 0)
        {
            printf("Cant change limit values\n");
            exit(1);
        }

        addr = 0;
        while (1)
        {
            /* map a page into memory each time */
            if ((addr = (char *) mmap(addr, page_size, PROT_READ |
                    PROT_WRITE, MAP_SHARED, fd, 0)) == MAP_FAILED)
            {
                printf("cant do mmap on file\n");
                exit(1);
            }

            if (0 == i)
                addr1 = addr;
            i++;
            errno = 0;
            /* lock the mapped memory pagewise */
            if ((ret = mlock((char *)addr, 1500)) == -1)
            {
                printf("errno value is %d\n", errno);
                printf("cant lock maped region\n");
                exit(1);
            }
            addr = addr + page_size;
        }
    }
    ======================================================

    This test case results in an mlock() failure with errno 14, that is
    EFAULT, but it has nowhere been specified that mlock() will return
    EFAULT. When I tested the same on older kernels like 2.6.18, I got the
    correct result, i.e. errno 12 (ENOMEM).

    I think that in the mlock(2) source, setting errno to ENOMEM has been
    missed in do_mlock(), on mlock_fixup() failure.

    SUSv3 requires the following behavior from mlock(2).

    [ENOMEM]
    Some or all of the address range specified by the addr and
    len arguments does not correspond to valid mapped pages
    in the address space of the process.

    [EAGAIN]
    Some or all of the memory identified by the operation could not
    be locked when the call was made.

    This rule isn't so nice and is slightly strange, but many people think
    POSIX/SUS compliance is important.

    Reported-by: Halesh Sadashiv
    Tested-by: Halesh Sadashiv
    Signed-off-by: KOSAKI Motohiro
    Cc: [2.6.25.x, 2.6.26.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

04 Aug, 2008

1 commit

  • Now that SH has switched to vmalloc_exec() for PAGE_KERNEL_EXEC usage,
    it's apparent that nommu has no vmalloc_exec() definition of its own.
    Stub in the one from mm/vmalloc.c.
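
    The nommu stub presumably mirrors the MMU definition; a hedged sketch of
    what mm/nommu.c gains:

        void *vmalloc_exec(unsigned long size)
        {
                return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM,
                                 PAGE_KERNEL_EXEC);
        }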

    Signed-off-by: Paul Mundt

    Paul Mundt
     

03 Aug, 2008

1 commit

  • Brian Wang reported that a FUSE filesystem exported through NFS could
    return I/O errors on read. This was traced to splice_direct_to_actor()
    returning a short or zero count when racing with page invalidation.

    However this is not FUSE or NFSD specific, other filesystems (notably
    NFS) also call invalidate_inode_pages2() to purge stale data from the
    cache.

    If this happens while such pages are sitting in a pipe buffer, then
    splice(2) from the pipe can return zero, and read(2) from the pipe can
    return ENODATA.

    The zero return is especially bad, since it implies end-of-file or
    disconnected pipe/socket, and is documented as such for splice. But
    returning an error for read() is also nasty, when in fact there was no
    error (data becoming stale is not an error).

    The same problems can be triggered by "hole punching" with
    madvise(MADV_REMOVE).

    Fix this by not clearing the PG_uptodate flag on truncation and
    invalidation.

    Signed-off-by: Miklos Szeredi
    Acked-by: Nick Piggin
    Cc: Andrew Morton
    Cc: Jens Axboe
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     

02 Aug, 2008

3 commits

  • Delete 2 EXPORTs that were accidentally sent upstream.

    Signed-off-by: Jack Steiner
    Signed-off-by: Linus Torvalds

    Jack Steiner
     
  • Some platforms decide whether they support huge pages at boot time. On
    these, such as powerpc, HPAGE_SHIFT is a variable, not a constant, and is
    set to 0 when there is no such support.

    The patches to introduce multiple huge page support broke that, causing
    the kernel to crash at boot time on machines such as POWER3 which lack
    support for multiple page sizes.

    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6: (28 commits)
    mm/hugetlb.c must #include <asm/io.h>
    video: Fix up hp6xx driver build regressions.
    sh: defconfig updates.
    sh: Kill off stray mach-rsk7203 reference.
    serial: sh-sci: Fix up SH7760/SH7780/SH7785 early printk regression.
    sh: Move out individual boards without mach groups.
    sh: Make sure AT_SYSINFO_EHDR is exposed to userspace in asm/auxvec.h.
    sh: Allow SH-3 and SH-5 to use common headers.
    sh: Provide common CPU headers, prune the SH-2 and SH-2A directories.
    sh/maple: clean maple bus code
    sh: More header path fixups for mach dir refactoring.
    sh: Move out the solution engine headers to arch/sh/include/mach-se/
    sh: I2C fix for AP325RXA and Migo-R
    sh: Shuffle the board directories in to mach groups.
    sh: dma-sh: Fix up dreamcast dma.h mach path.
    sh: Switch KBUILD_DEFCONFIG to shx3_defconfig.
    sh: Add ARCH_DEFCONFIG entries for sh and sh64.
    sh: Fix compile error of Solution Engine
    sh: Proper __put_user_asm() size mismatch fix.
    sh: Stub in a dummy ENTRY_OFFSET for uImage offset calculation.
    ...

    Linus Torvalds
     

01 Aug, 2008

1 commit

  • For anonymous pages without a swap cache backing, the check in
    page_remove_rmap for the physical dirty bit is unnecessary. The
    instructions that are used to check and reset the dirty bit are
    expensive. Removing the check noticeably speeds up process exit.
    In addition, the clearing of the dirty bit in __SetPageUptodate is
    pointless as well. With these two changes there is no storage key
    operation for an anonymous page anymore, provided it does not hit
    swap space.

    The micro benchmark which repeatedly executes an empty shell script
    gets about 5% faster.

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     

31 Jul, 2008

3 commits

  • The iov_iter_advance() function would look at the iov->iov_len entry
    even though it might have iterated over the whole array, and iov was
    pointing past the end. This would cause DEBUG_PAGEALLOC to trigger a
    kernel page fault if the allocation was at the end of a page, and the
    next page was unallocated.

    The quick fix is to just change the order of the tests: check that there
    is any iovec data left before we check the iov entry itself.
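
    A hedged sketch of the reordering in iov_iter_advance() (the surrounding
    code is paraphrased):

        /* before: could dereference iov one entry past the array's end */
        while (bytes || !iov->iov_len)
                /* skip over / advance segments */;

        /* after: check that data remains before touching iov->iov_len */
        while (bytes || unlikely(i->count && !iov->iov_len))
                /* skip over / advance segments */;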

    Thanks to Alexey Dobriyan for finding this case, and testing the fix.

    Reported-and-tested-by: Alexey Dobriyan
    Cc: Nick Piggin
    Cc: Andrew Morton
    Cc: [2.6.25.x, 2.6.26.x]
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Exports needed by the GRU driver.

    Signed-off-by: Jack Steiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jack Steiner
     
  • zap_vma_ptes() is intended to be used by drivers to unmap ptes assigned
    to driver-private vmas. This interface is similar to zap_page_range() but
    is less general and less likely to be abused.
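
    A hedged sketch of the interface (signature from memory of the 2.6.27
    addition):

        /*
         * Unmap the ptes covering [address, address + size) of a vma the
         * driver owns; a thin, safer wrapper around zap_page_range().
         */
        int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
                         unsigned long size);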

    Needed by the GRU driver.

    Signed-off-by: Jack Steiner
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jack Steiner