27 Mar, 2009

4 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
    slob: fix lockup in slob_free()
    slub: use get_track()
    slub: rename calculate_min_partial() to set_min_partial()
    slub: add min_partial sysfs tunable
    slub: move min_partial to struct kmem_cache
    SLUB: Fix default slab order for big object sizes
    SLUB: Do not pass 8k objects through to the page allocator
    SLUB: Introduce and use SLUB_MAX_SIZE and SLUB_PAGE_SHIFT constants
    slob: clean up the code
    SLUB: Use ->objsize from struct kmem_cache_cpu in slab_free()

    Linus Torvalds
     
  • * 'for-2.6.30' of git://git.kernel.dk/linux-2.6-block:
    Get rid of pdflush_operation() in emergency sync and remount
    btrfs: get rid of current_is_pdflush() in btrfs_btree_balance_dirty
    Move the default_backing_dev_info out of readahead.c and into backing-dev.c
    block: Repeated lines in switching-sched.txt
    bsg: Remove bogus check against request_queue->max_sectors
    block: WARN in __blk_put_request() for potential bio leak
    loop: fix circular locking in loop_clr_fd()
    loop: support barrier writes
    bsg: add support for tail queuing
    cpqarray: enable bus mastering
    block: genhd.h cleanup patch
    block: add private bio_set for bio integrity allocations
    block: genhd.h comment needs updating
    block: get rid of unused blkdev_free_rq() define
    block: remove various blk_queue_*() setting functions in blk_init_queue_node()
    cciss: add BUILD_BUG_ON() for catching bad CommandList_struct alignment
    block: don't create bio_vec slabs of less than the inline number
    block: cleanup bio_alloc_bioset()

    Linus Torvalds
     
  • …s/security-testing-2.6

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (71 commits)
    SELinux: inode_doinit_with_dentry drop no dentry printk
    SELinux: new permission between tty audit and audit socket
    SELinux: open perm for sock files
    smack: fixes for unlabeled host support
    keys: make procfiles per-user-namespace
    keys: skip keys from another user namespace
    keys: consider user namespace in key_permission
    keys: distinguish per-uid keys in different namespaces
    integrity: ima iint radix_tree_lookup locking fix
    TOMOYO: Do not call tomoyo_realpath_init unless registered.
    integrity: ima scatterlist bug fix
    smack: fix lots of kernel-doc notation
    TOMOYO: Don't create securityfs entries unless registered.
    TOMOYO: Fix exception policy read failure.
    SELinux: convert the avc cache hash list to an hlist
    SELinux: code readability with avc_cache
    SELinux: remove unused av.decided field
    SELinux: more careful use of avd in avc_has_perm_noaudit
    SELinux: remove the unused ae.used
    SELinux: check seqno when updating an avc_node
    ...

    Linus Torvalds
     
  • Enlarge default dirty ratios from 5/10 to 10/20. This fixes [Bug
    #12809] iozone regression with 2.6.29-rc6.

    The iozone benchmarks are performed on a 1200M file, with 8GB ram.

    iozone -i 0 -i 1 -i 2 -i 3 -i 4 -r 4k -s 64k -s 512m -s 1200m -b tmp.xls
    iozone -B -r 4k -s 64k -s 512m -s 1200m -b tmp.xls

    The performance regression is triggered by commit 1cf6e7d83bf3 ("mm: task
    dirty accounting fix"), which makes the dirty accounting more
    correct/thorough.

    The default 5/10 dirty ratios were picked (a) with the old dirty logic
    and (b) largely at random and (c) designed to be aggressive. In
    particular, that (a) means that having fixed some of the dirty
    accounting, maybe the real bug is now that it was always too aggressive,
    just hidden by an accounting issue.

    The enlarged 10/20 dirty ratios are just about enough to fix the regression.

    [ We will have to look at how this affects the old fsync() latency issue,
    but that probably will need independent work. - Linus ]
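
    As a quick way to confirm what a running kernel ended up with, the new
    defaults are visible through the usual procfs knobs; a minimal sketch
    that just reads them back (no kernel internals assumed):

    #include <stdio.h>

    static void show(const char *path)
    {
            char buf[32];
            FILE *f = fopen(path, "r");

            if (f && fgets(buf, sizeof(buf), f))
                    printf("%s = %s", path, buf);
            if (f)
                    fclose(f);
    }

    int main(void)
    {
            show("/proc/sys/vm/dirty_background_ratio");
            show("/proc/sys/vm/dirty_ratio");
            return 0;
    }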

    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Reported-by: "Lin, Ming M"
    Tested-by: "Lin, Ming M"
    Signed-off-by: Wu Fengguang
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     

23 Mar, 2009

2 commits

  • Don't hold SLOB lock when freeing the page. Reduces lock hold width. See
    the following thread for discussion of the bug:

    http://marc.info/?l=linux-kernel&m=123709983214143&w=2
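
    The pattern, in a userspace analogue (pthreads and a hypothetical list
    type, not the actual slob_free() code): unlink the object while the lock
    is held, but hand the memory back only after dropping it, so the lock is
    never held across the free path.

    #include <pthread.h>
    #include <stdlib.h>

    struct node { struct node *next; };

    static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct node *free_list;

    static void release_node(struct node *n)
    {
            struct node *victim = NULL;

            pthread_mutex_lock(&list_lock);
            if (free_list == n) {            /* unlink under the lock ... */
                    free_list = n->next;
                    victim = n;
            }
            pthread_mutex_unlock(&list_lock);

            free(victim);                    /* ... free with the lock dropped */
    }

    int main(void)
    {
            struct node *n = malloc(sizeof(*n));

            if (!n)
                    return 1;
            n->next = NULL;
            free_list = n;
            release_node(n);
            return 0;
    }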

    Reported-by: Ingo Molnar
    Acked-by: Matt Mackall
    Signed-off-by: Nick Piggin
    Signed-off-by: Pekka Enberg

    Nick Piggin
     
  • Use get_track() in set_track()

    Signed-off-by: Akinobu Mita
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Pekka Enberg

    Akinobu Mita
     

13 Mar, 2009

1 commit

  • Even when page reclaim runs under a mem_cgroup, the number of pages to
    scan is determined by the status of the global LRU. Fix that.

    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

28 Feb, 2009

2 commits

  • I just got this new warning from kmemcheck:

    WARNING: kmemcheck: Caught 32-bit read from freed memory (c7806a60)
    a06a80c7ecde70c1a04080c700000000a06709c1000000000000000000000000
    f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f
    ^

    Pid: 0, comm: swapper Not tainted (2.6.29-rc4 #230)
    EIP: 0060:[] EFLAGS: 00000286 CPU: 0
    EIP is at __purge_vmap_area_lazy+0x117/0x140
    EAX: 00070f43 EBX: c7806a40 ECX: c1677080 EDX: 00027b66
    ESI: 00002001 EDI: c170df0c EBP: c170df00 ESP: c178830c
    DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
    CR0: 80050033 CR2: c7806b14 CR3: 01775000 CR4: 00000690
    DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
    DR6: 00004000 DR7: 00000000
    [] free_unmap_vmap_area_noflush+0x6e/0x70
    [] remove_vm_area+0x2a/0x70
    [] __vunmap+0x45/0xe0
    [] vunmap+0x1e/0x30
    [] text_poke+0x95/0x150
    [] alternatives_smp_unlock+0x49/0x60
    [] alternative_instructions+0x11b/0x124
    [] check_bugs+0xbd/0xdc
    [] start_kernel+0x2ed/0x360
    [] __init_begin+0x9e/0xa9
    [] 0xffffffff

    It happened here:

    $ addr2line -e vmlinux -i c1096df7
    mm/vmalloc.c:540

    Code:

    list_for_each_entry(va, &valist, purge_list)
            __free_vmap_area(va);

    It's this instruction:

    mov 0x20(%ebx),%edx

    Which corresponds to a dereference of va->purge_list.next:

    (gdb) p ((struct vmap_area *) 0)->purge_list.next
    Cannot access memory at address 0x20

    It seems that we should use "safe" list traversal here, as the element
    is freed inside the loop. Please verify that this is the right fix.
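
    A userspace illustration of the same bug class (a plain singly linked
    list, not the kernel list API): advancing through an element that was
    just freed is a use-after-free, so the "safe" variant saves the successor
    before freeing.

    #include <stdlib.h>

    struct item { struct item *next; };

    static void free_all(struct item *head)
    {
            struct item *it, *next;

            /* unsafe would be: for (it = head; it; it = it->next) free(it);
             * which reads it->next from memory that was just freed */
            for (it = head; it; it = next) {
                    next = it->next;         /* save the successor first */
                    free(it);
            }
    }

    int main(void)
    {
            struct item *a = malloc(sizeof(*a));
            struct item *b = malloc(sizeof(*b));

            if (!a || !b)
                    return 1;
            a->next = b;
            b->next = NULL;
            free_all(a);
            return 0;
    }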

    Acked-by: Nick Piggin
    Signed-off-by: Vegard Nossum
    Cc: Pekka Enberg
    Cc: Ingo Molnar
    Cc: "Paul E. McKenney"
    Cc: [2.6.28.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vegard Nossum
     
  • The new vmap allocator can wrap the address and get confused in the case
    of large allocations or VMALLOC_END near the end of address space.

    Problem reported by Christoph Hellwig on a 32-bit XFS workload.

    Signed-off-by: Nick Piggin
    Reported-by: Christoph Hellwig
    Cc: [2.6.28.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

26 Feb, 2009

1 commit

  • Each time I exit Firefox, /proc/meminfo's Committed_AS goes down almost
    400 kB: OVERCOMMIT_NEVER would be allowing overcommits it should
    prohibit.

    Commit fc8744adc870a8d4366908221508bb113d8b72ee "Stop playing silly
    games with the VM_ACCOUNT flag" changed shmem_file_setup() to set the
    shmem file's VM_ACCOUNT flag according to VM_NORESERVE not being set in
    the vma flags; but did so only _after_ the shmem_acct_size(flags, size)
    call which is expected to pre-account a shared anonymous object.

    It's all clearer if we switch shmem.c over to use VM_NORESERVE
    throughout in place of !VM_ACCOUNT.

    But I very nearly sent in a patch which mistakenly removed the
    accounting from tmpfs files: shmem_get_inode()'s memset was good for not
    setting VM_ACCOUNT, but now it needs to set VM_NORESERVE.

    Rather than setting that by default, then perhaps clearing it again in
    shmem_file_setup(), let's pass it as a flag to shmem_get_inode(): that
    allows us to remove the #ifdef CONFIG_SHMEM from shmem_file_setup().

    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

23 Feb, 2009

2 commits

  • Now that a cache's min_partial has been moved to struct kmem_cache, it's
    possible to easily tune it from userspace by adding a sysfs attribute.

    It may not be desirable to keep a large number of partial slabs around
    if a cache is used infrequently and memory, especially when constrained
    by a cgroup, is scarce. It's better to allow userspace to set the
    minimum policy per cache instead of relying explicitly on
    kmem_cache_shrink().

    The memory savings from simply moving min_partial from struct
    kmem_cache_node to struct kmem_cache is obviously not significant
    (unless maybe you're from SGI or something), at the largest it's

    # allocated caches * (MAX_NUMNODES - 1) * sizeof(unsigned long)

    The true savings occurs when userspace reduces the number of partial
    slabs that would otherwise be wasted, especially on machines with a
    large number of nodes (ia64 with CONFIG_NODES_SHIFT at 10 for default?).
    While the kernel estimates an ideal value for n->min_partial and ensures
    it is within a sane range, userspace currently has no input other than
    writing to /sys/kernel/slab/cache/shrink.

    There simply isn't any better heuristic to add when calculating the
    partial values for a better estimate that works for all possible caches.
    And since it's currently a static value, the user really has no way of
    reclaiming that wasted space, which can be significant when constrained
    by a cgroup (either cpusets or, later, memory controller slab limits)
    without shrinking it entirely.

    This also allows the user to specify that increased fragmentation and
    more partial slabs are actually desired to avoid the cost of allocating
    new slabs at runtime for specific caches.

    There's also no reason why this should be a per-struct kmem_cache_node
    value in the first place. You could argue that a machine would have
    such node size asymmetries that it should be specified on a per-node
    basis, but we know nobody is doing that right now since it's a purely
    static value at the moment and there's no convenient way to tune that
    via slub's sysfs interface.
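
    A small usage sketch for the new attribute (assuming a kmalloc-256 cache
    exists and slub's sysfs tree is in the usual place): raise min_partial
    for one cache so more partial slabs are kept around.

    #include <stdio.h>

    int main(void)
    {
            const char *attr = "/sys/kernel/slab/kmalloc-256/min_partial";
            FILE *f = fopen(attr, "w");    /* needs root */

            if (!f) {
                    perror(attr);
                    return 1;
            }
            fprintf(f, "8\n");
            fclose(f);
            return 0;
    }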

    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Pekka Enberg

    David Rientjes
     
  • Although it allows for better cacheline use, it is unnecessary to save a
    copy of the cache's min_partial value in each kmem_cache_node.

    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Pekka Enberg

    David Rientjes
     

22 Feb, 2009

4 commits

  • * hibernate:
    PM: Fix suspend_console and resume_console to use only one semaphore
    PM: Wait for console in resume
    PM: Fix pm_notifiers during user mode hibernation
    swsusp: clean up shrink_all_zones()
    swsusp: dont fiddle with swappiness
    PM: fix build for CONFIG_PM unset
    PM/hibernate: fix "swap breaks after hibernation failures"
    PM/resume: wait for device probing to finish
    Consolidate driver_probe_done() loops into one place

    Linus Torvalds
     
  • Move local variables to innermost possible scopes and use local
    variables to cache calculations/reads done more than once.

    No change in functionality (intended).

    Signed-off-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Rafael J. Wysocki
    Cc: Len Brown
    Cc: Greg KH
    Acked-by: Pavel Machek
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • sc.swappiness is not used in the swsusp memory shrinking path, so do not
    set it.

    Signed-off-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Rafael J. Wysocki
    Cc: Len Brown
    Cc: Greg KH
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • http://bugzilla.kernel.org/show_bug.cgi?id=12239

    The image writing code dropped a reference to the current swap device.
    This doesn't show up if the hibernation succeeds - because it doesn't
    affect the image which gets resumed. But it means multiple _failed_
    hibernations end up freeing the swap device while it is still in use!

    swsusp_write() finds the block device for the swap file using swap_type_of().
    It then uses blkdev_get() / blkdev_put() to open and close the block device.

    Unfortunately, blkdev_get() assumes ownership of the inode of the block_device
    passed to it. So blkdev_put() calls iput() on the inode. This is by design
    and other callers expect this behaviour. The fix is for swap_type_of() to take
    a reference on the inode using bdget().

    Signed-off-by: Alan Jenkins
    Signed-off-by: Rafael J. Wysocki
    Cc: Len Brown
    Cc: Greg KH
    Signed-off-by: Linus Torvalds

    Alan Jenkins
     

21 Feb, 2009

2 commits

  • Impact: proper vcache flush on unmap_kernel_range()

    flush_cache_vunmap() should be called before pages are unmapped. Add
    a call to it in unmap_kernel_range().

    Signed-off-by: Tejun Heo
    Acked-by: Nick Piggin
    Acked-by: David S. Miller
    Cc: [2.6.28.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • kzfree() is a wrapper for kfree() that additionally zeroes the underlying
    memory before releasing it to the slab allocator.

    Currently there is code which memset()s the memory region of an object
    before releasing it back to the slab allocator to make sure
    security-sensitive data are really zeroed out after use.

    These callsites can then just use kzfree() which saves some code, makes
    users greppable and allows for a stupid destructor that isn't necessarily
    aware of the actual object size.
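
    In userspace terms the idea looks roughly like this (an analogue, not the
    kernel implementation; glibc's malloc_usable_size() stands in for ksize(),
    which is what lets the helper zero the whole object without being told
    its size):

    #include <malloc.h>
    #include <stdlib.h>
    #include <string.h>

    /* zero the whole underlying allocation, then release it */
    static void zfree(void *p)
    {
            if (!p)
                    return;
            memset(p, 0, malloc_usable_size(p));
            free(p);
    }

    int main(void)
    {
            char *secret = malloc(64);

            if (!secret)
                    return 1;
            strcpy(secret, "hunter2");
            zfree(secret);          /* key material doesn't linger in the heap */
            return 0;
    }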

    Signed-off-by: Johannes Weiner
    Reviewed-by: Pekka Enberg
    Cc: Matt Mackall
    Acked-by: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

20 Feb, 2009

2 commits

  • The default order of kmalloc-8192 on a 2*4 Stoakley machine reveals an
    issue in calculate_order().

    slab_size  order  name
    -------------------------------------------------
    4096       3      sgpool-128
    8192       2      kmalloc-8192
    16384      3      kmalloc-16384

    kmalloc-8192's default order is smaller than sgpool-128's.

    On a 4*4 Tigerton machine, a similar issue appears with another kmem_cache.

    calculate_order() uses 'min_objects /= 2;' to shrink min_objects. Combined
    with the size calculation/checking in slab_order(), the above issue
    sometimes appears.

    The patch below, against 2.6.29-rc2, fixes it.

    I checked the default orders of all kmem_caches and none become smaller
    than before, so the patch shouldn't hurt performance.

    Signed-off-by: Zhang Yanmin
    Signed-off-by: Pekka Enberg

    Zhang Yanmin
     
  • As a preparatory patch for bumping up the page allocator pass-through
    threshold, introduce two new constants, SLUB_MAX_SIZE and SLUB_PAGE_SHIFT,
    and convert mm/slub.c to use them.

    Reported-by: "Zhang, Yanmin"
    Tested-by: "Zhang, Yanmin"
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg

    Christoph Lameter
     

19 Feb, 2009

5 commits

  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    block: fix deadlock in blk_abort_queue() for drivers that readd to timeout list
    block: fix booting from partitioned md array
    block: revert part of 18ce3751ccd488c78d3827e9f6bf54e6322676fb
    cciss: PCI power management reset for kexec
    paride/pg.c: xs(): &&/|| confusion
    fs/bio: bio_alloc_bioset: pass right object ptr to mempool_free
    block: fix bad definition of BIO_RW_SYNC
    bsg: Fix sense buffer bug in SG_IO

    Linus Torvalds
     
  • Now, early_pfn_in_nid(PFN, NID) may return false if the PFN is a hole,
    and memmap initialization is then not done for it. This caused trouble
    for sparc boot.

    To fix this, such a PFN should be initialized and marked as PG_reserved.
    This patch changes early_pfn_in_nid() to return true if the PFN is a hole.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reported-by: David Miller
    Tested-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Heiko Carstens
    Cc: [2.6.25.x, 2.6.26.x, 2.6.27.x, 2.6.28.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • What's happening is that the assertion in mm/page_alloc.c:move_freepages()
    is triggering:

    BUG_ON(page_zone(start_page) != page_zone(end_page));

    Once I knew this is what was happening, I added some annotations:

    if (unlikely(page_zone(start_page) != page_zone(end_page))) {
            printk(KERN_ERR "move_freepages: Bogus zones: "
                   "start_page[%p] end_page[%p] zone[%p]\n",
                   start_page, end_page, zone);
            printk(KERN_ERR "move_freepages: "
                   "start_zone[%p] end_zone[%p]\n",
                   page_zone(start_page), page_zone(end_page));
            printk(KERN_ERR "move_freepages: "
                   "start_pfn[0x%lx] end_pfn[0x%lx]\n",
                   page_to_pfn(start_page), page_to_pfn(end_page));
            printk(KERN_ERR "move_freepages: "
                   "start_nid[%d] end_nid[%d]\n",
                   page_to_nid(start_page), page_to_nid(end_page));
            ...

    And here's what I got:

    move_freepages: Bogus zones: start_page[2207d0000] end_page[2207dffc0] zone[fffff8103effcb00]
    move_freepages: start_zone[fffff8103effcb00] end_zone[fffff8003fffeb00]
    move_freepages: start_pfn[0x81f600] end_pfn[0x81f7ff]
    move_freepages: start_nid[1] end_nid[0]

    My memory layout on this box is:

    [ 0.000000] Zone PFN ranges:
    [ 0.000000] Normal 0x00000000 -> 0x0081ff5d
    [ 0.000000] Movable zone start PFN for each node
    [ 0.000000] early_node_map[8] active PFN ranges
    [ 0.000000] 0: 0x00000000 -> 0x00020000
    [ 0.000000] 1: 0x00800000 -> 0x0081f7ff
    [ 0.000000] 1: 0x0081f800 -> 0x0081fe50
    [ 0.000000] 1: 0x0081fed1 -> 0x0081fed8
    [ 0.000000] 1: 0x0081feda -> 0x0081fedb
    [ 0.000000] 1: 0x0081fedd -> 0x0081fee5
    [ 0.000000] 1: 0x0081fee7 -> 0x0081ff51
    [ 0.000000] 1: 0x0081ff59 -> 0x0081ff5d

    So it's a block move in that 0x81f600-->0x81f7ff region which triggers
    the problem.

    This patch:

    The declaration of early_pfn_to_nid() is scattered over per-arch include
    files, and it is hard to know which declaration is actually used. That
    makes the fix for memmap init unnecessarily difficult.

    This patch moves all of the declarations into include/linux/mm.h.

    After this,
    if !CONFIG_NODES_POPULATES_NODE_MAP && !CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
        -> use the static definition in include/linux/mm.h
    else if !CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
        -> use the generic definition in mm/page_alloc.c
    else
        -> the per-arch back end function will be called.
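
    The selection above amounts to a compile-time dispatch roughly like the
    following (a sketch of the described logic, not the actual
    include/linux/mm.h text; built in userspace with none of the CONFIG_*
    macros defined, it falls into the first branch):

    #include <stdio.h>

    #if !defined(CONFIG_NODES_POPULATES_NODE_MAP) && \
        !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID)
    /* no node map and no arch hook: every early PFN belongs to node 0 */
    static inline int early_pfn_to_nid(unsigned long pfn)
    {
            (void)pfn;
            return 0;
    }
    #elif !defined(CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID)
    int early_pfn_to_nid(unsigned long pfn);    /* generic, mm/page_alloc.c */
    #else
    int early_pfn_to_nid(unsigned long pfn);    /* per-arch back end */
    #endif

    int main(void)
    {
            printf("early_pfn_to_nid(0x1000) = %d\n", early_pfn_to_nid(0x1000UL));
            return 0;
    }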

    Signed-off-by: KAMEZAWA Hiroyuki
    Tested-by: KOSAKI Motohiro
    Reported-by: David Miller
    Cc: Mel Gorman
    Cc: Heiko Carstens
    Cc: [2.6.25.x, 2.6.26.x, 2.6.27.x, 2.6.28.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • YAMAMOTO-san noticed that task_dirty_inc doesn't seem to be called properly for
    cases where set_page_dirty is not used to dirty a page (eg. mark_buffer_dirty).

    Additionally, there is some inconsistency about when task_dirty_inc is
    called. It is used for dirty balancing, however it even gets called for
    __set_page_dirty_no_writeback.

    So rather than increment it in a set_page_dirty wrapper, move it down to
    exactly where the dirty page accounting stats are incremented.

    Cc: YAMAMOTO Takashi
    Signed-off-by: Nick Piggin
    Acked-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • We have get_vm_area_caller() and __get_vm_area(), but not
    __get_vm_area_caller().

    On powerpc, I use __get_vm_area() to separate the ranges of addresses
    given to vmalloc vs. ioremap (various good reasons for that) so in order
    to be able to implement the new caller tracking in /proc/vmallocinfo, I
    need a "_caller" variant of it.

    (akpm: needed for ongoing powerpc development, so merge it early)

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Benjamin Herrenschmidt
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Herrenschmidt
     

18 Feb, 2009

2 commits

  • We can't OR shift values, so get rid of BIO_RW_SYNC and use BIO_RW_SYNCIO
    and BIO_RW_UNPLUG explicitly. This brings back the behaviour from before
    213d9417fec62ef4c3675621b9364a667954d4dd.
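
    The underlying pitfall, with made-up bit numbers (an illustration, not the
    kernel's BIO_RW_* values): these constants are bit positions, so OR-ing
    them before shifting selects an unrelated bit, while OR-ing the shifted
    masks gives the intended combination.

    #include <stdio.h>

    enum { RW_SYNCIO = 3, RW_UNPLUG = 4 };      /* bit numbers, hypothetical */

    int main(void)
    {
            unsigned wrong = 1u << (RW_SYNCIO | RW_UNPLUG);          /* 1 << 7 */
            unsigned right = (1u << RW_SYNCIO) | (1u << RW_UNPLUG);  /* bits 3, 4 */

            printf("wrong mask: %#x, right mask: %#x\n", wrong, right);
            return 0;
    }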

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • …git/tip/linux-2.6-tip

    * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86, vm86: fix preemption bug
    x86, olpc: fix model detection without OFW
    x86, hpet: fix for LS21 + HPET = boot hang
    x86: CPA avoid repeated lazy mmu flush
    x86: warn if arch_flush_lazy_mmu_cpu is called in preemptible context
    x86/paravirt: make arch_flush_lazy_mmu/cpu disable preemption
    x86, pat: fix warn_on_once() while mapping 0-1MB range with /dev/mem
    x86/cpa: make sure cpa is safe to call in lazy mmu mode
    x86, ptrace, mm: fix double-free on race

    Linus Torvalds
     

13 Feb, 2009

2 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
    mm: Export symbol ksize()

    Linus Torvalds
     
  • A bug was introduced into write_cache_pages cyclic writeout by commit
    31a12666d8f0c22235297e1c1575f82061480029 ("mm: write_cache_pages cyclic
    fix"). The intention (and comments) is that we should cycle back and
    look for more dirty pages at the beginning of the file if there is no
    more work to be done.

    But the !done condition was dropped from the test. This means that any
    time the page writeout loop breaks (eg. due to nr_to_write == 0), we
    will set index to 0, then goto again. This will set done_index to
    index, then find done is set, so will proceed to the end of the
    function. When updating mapping->writeback_index for cyclic writeout,
    we now use done_index == 0, so we're always cycling back to 0.

    This seemed to be causing random mmap writes (slapadd and iozone) to
    start writing more pages from the LRU and writeout would slowdown, and
    caused bugzilla entry

    http://bugzilla.kernel.org/show_bug.cgi?id=12604

    about Berkeley DB slowing down dramatically.

    With this patch, iozone random write performance is increased nearly
    5x on my system (iozone -B -r 4k -s 64k -s 512m -s 1200m on ext2).

    Signed-off-by: Nick Piggin
    Reported-and-tested-by: Jan Kara
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

12 Feb, 2009

6 commits

  • Commit 7b2cd92adc5430b0c1adeb120971852b4ea1ab08 ("crypto: api - Fix
    zeroing on free") added a modular user of ksize(). Export the symbol to
    fix crypto.ko compilation.

    Cc: Herbert Xu
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Pekka Enberg

    Kirill A. Shutemov
     
  • Christophe Saout reported [in precursor to:
    http://marc.info/?l=linux-kernel&m=123209902707347&w=4]:

    > Note that I also saw a different issue with CONFIG_UNEVICTABLE_LRU.
    > Seems like Xen tears down current->mm early on process termination, so
    > that __get_user_pages in exit_mmap causes nasty messages when the
    > process had any mlocked pages. (in fact, it somehow manages to get into
    > the swapping code and produces a null pointer dereference trying to get
    > a swap token)

    Jeremy explained:

    Yes. In the normal case under Xen, an in-use pagetable is "pinned",
    meaning that it is RO to the kernel, and all updates must go via hypercall
    (or writes are trapped and emulated, which is much the same thing). An
    unpinned pagetable is not currently in use by any process, and can be
    directly accessed as normal RW pages.

    As an optimisation at process exit time, we unpin the pagetable as early
    as possible (switching the process to init_mm), so that all the normal
    pagetable teardown can happen with direct memory accesses.

    This happens in exit_mmap() -> arch_exit_mmap(). The munlocking happens
    a few lines below. The obvious thing to do would be to move
    arch_exit_mmap() to below the munlock code, but I think we'd want to
    call it even if mm->mmap is NULL, just to be on the safe side.

    Thus, this patch:

    exit_mmap() needs to unlock any locked vmas before calling arch_exit_mmap,
    as the latter may switch the current mm to init_mm, which would cause the
    former to fail.

    Signed-off-by: Jeremy Fitzhardinge
    Signed-off-by: Lee Schermerhorn
    Cc: Christophe Saout
    Cc: Keir Fraser
    Cc: Alex Williamson
    Cc: [2.6.28.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jeremy Fitzhardinge
     
  • Commit dcf6a79dda5cc2a2bec183e50d829030c0972aaa ("write-back: fix
    nr_to_write counter") fixed the nr_to_write counter but didn't set the
    break condition properly.

    If nr_to_write == 0 after being decremented it will loop one more time
    before setting done = 1 and breaking the loop.

    [akpm@linux-foundation.org: coding-style fixes]
    Cc: Artem Bityutskiy
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Federico Cuello
     
  • page_cgroup's page allocation at init/memory hotplug uses kmalloc() and
    vmalloc(). If kmalloc() fails, vmalloc() is used.

    This is because vmalloc() is a very limited resource on 32-bit systems,
    so we want to use kmalloc() first.

    But in this kind of call, __GFP_NOWARN should be specified.

    Reported-by: Heiko Carstens
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Acked-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • When I tested the following program, I found that the mlocked counter
    behaves strangely: it cannot free some mlocked pages.

    This is because try_to_unmap_file() doesn't check the real page mappings
    in the vmas.

    That is because the goal of an address_space for a file is to find all
    processes into which the file's specific interval is mapped. It is
    related to the file's interval, not to pages.

    Even if the page isn't really mapped by the vma, it returns SWAP_MLOCK
    since the vma has VM_LOCKED, and then calls try_to_mlock_page(). After
    this the mlocked counter is increased again.

    A COWed anonymous page in a file-backed vma can be such a case. This
    patch resolves it.

    -- my test program --

    #include <sys/mman.h>

    int main(void)
    {
            mlockall(MCL_CURRENT);
            return 0;
    }

    -- before --

    root@barrios-target-linux:~# cat /proc/meminfo | egrep 'Mlo|Unev'
    Unevictable: 0 kB
    Mlocked: 0 kB

    -- after --

    root@barrios-target-linux:~# cat /proc/meminfo | egrep 'Mlo|Unev'
    Unevictable: 8 kB
    Mlocked: 8 kB

    Signed-off-by: MinChan Kim
    Acked-by: Lee Schermerhorn
    Acked-by: KOSAKI Motohiro
    Tested-by: Lee Schermerhorn
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    MinChan Kim
     
  • We need to pass an unsigned long as the minimum, because it gets cast to
    an unsigned long in the sysctl handler. If we pass an int, we'll access
    four more bytes on 64-bit arches, resulting in a random minimum value.
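
    A userspace illustration of the overread (assuming a little-endian LP64
    machine; the struct is a stand-in for the sysctl table entry, not kernel
    code): reading an unsigned long through the address of an int pulls in
    the four bytes that happen to follow it.

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            struct { int min; int neighbour; } ctl = { 1, 0x12345678 };
            unsigned long seen;

            /* the handler's view: one unsigned long starting at &ctl.min */
            memcpy(&seen, &ctl.min, sizeof(seen));
            printf("intended minimum: %d, value read as unsigned long: %#lx\n",
                   ctl.min, seen);
            return 0;
    }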

    [rientjes@google.com: fix type of `old_bytes']
    Signed-off-by: Sven Wegener
    Cc: Peter Zijlstra
    Cc: Dave Chinner
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sven Wegener