24 Sep, 2009

16 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    truncate: use new helpers
    truncate: new helpers
    fs: fix overflow in sys_mount() for in-kernel calls
    fs: Make unload_nls() NULL pointer safe
    freeze_bdev: grab active reference to frozen superblocks
    freeze_bdev: kill bd_mount_sem
    exofs: remove BKL from super operations
    fs/romfs: correct error-handling code
    vfs: seq_file: add helpers for data filling
    vfs: remove redundant position check in do_sendfile
    vfs: change sb->s_maxbytes to a loff_t
    vfs: explicitly cast s_maxbytes in fiemap_check_ranges
    libfs: return error code on failed attr set
    seq_file: return a negative error code when seq_path_root() fails.
    vfs: optimize touch_time() too
    vfs: optimization for touch_atime()
    vfs: split generic_forget_inode() so that hugetlbfs does not have to copy it
    fs/inode.c: add dev-id and inode number for debugging in init_special_inode()
    libfs: make simple_read_from_buffer conventional

    Linus Torvalds
     
  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
    HWPOISON: Enable error_remove_page on btrfs
    HWPOISON: Add simple debugfs interface to inject hwpoison on arbitrary PFNs
    HWPOISON: Add madvise() based injector for hardware poisoned pages v4
    HWPOISON: Enable error_remove_page for NFS
    HWPOISON: Enable .remove_error_page for migration aware file systems
    HWPOISON: The high level memory error handler in the VM v7
    HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
    HWPOISON: shmem: call set_page_dirty() with locked page
    HWPOISON: Define a new error_remove_page address space op for async truncation
    HWPOISON: Add invalidate_inode_page
    HWPOISON: Refactor truncate to allow direct truncating of page v2
    HWPOISON: check and isolate corrupted free pages v2
    HWPOISON: Handle hardware poisoned pages in try_to_unmap
    HWPOISON: Use bitmask/action code for try_to_unmap behaviour
    HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
    HWPOISON: Add poison check to page fault handling
    HWPOISON: Add basic support for poisoned pages in fault handler v3
    HWPOISON: Add new SIGBUS error codes for hardware poison signals
    HWPOISON: Add support for poison swap entries v2
    HWPOISON: Export some rmap vma locking to outside world
    ...

    Linus Torvalds
     
  • It's unused.

    It isn't needed -- the read or write flag is already passed, and sysctl
    shouldn't care about the rest.

    It _was_ used in two places in arch/frv for some reason.

    Signed-off-by: Alexey Dobriyan
    Cc: David Howells
    Cc: "Eric W. Biederman"
    Cc: Al Viro
    Cc: Ralf Baechle
    Cc: Martin Schwidefsky
    Cc: Ingo Molnar
    Cc: "David S. Miller"
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • We now count MEM_CGROUP_STAT_SWAPOUT, so we can show swap usage. It would
    be useful for users to show swap usage in the memory.stat file, because then
    they don't need to calculate memsw.usage - res.usage to know the swap usage.
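
    A minimal sketch of how such a memory.stat line could be emitted (illustrative
    only: mem_cgroup_read_stat() and the cgroup map callback are used roughly as in
    the memcg code of this era, but the exact call sites may differ):

    /* inside the memory.stat show routine, with 'mem' the cgroup being read */
    s64 swap_pages = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_SWAPOUT);
    cb->fill(cb, "swap", (u64)swap_pages * PAGE_SIZE);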

    Signed-off-by: Daisuke Nishimura
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • Reduce the resource counter overhead (mostly spinlock) associated with the
    root cgroup. This is a part of the several patches to reduce mem cgroup
    overhead. I had posted other approaches earlier (including using percpu
    counters). Those patches will be a natural addition and will be added
    iteratively on top of these.

    The patch stops resource counter accounting for the root cgroup. The data
    for display is derived from the statistics we maintain via
    mem_cgroup_charge_statistics (which is more scalable). What happens today
    is that, we do double accounting, once using res_counter_charge() and once
    using memory_cgroup_charge_statistics(). For the root, since we don't
    implement limits any more, we don't need to track every charge via
    res_counter_charge() and check for limit being exceeded and reclaim.

    The main mem->res usage_in_bytes can be derived by summing the cache and
    rss usage data from memory statistics (MEM_CGROUP_STAT_RSS and
    MEM_CGROUP_STAT_CACHE). However, for memsw->res usage_in_bytes, we need
    additional data about swapped out memory. This patch adds a
    MEM_CGROUP_STAT_SWAPOUT and uses that along with MEM_CGROUP_STAT_RSS and
    MEM_CGROUP_STAT_CACHE to derive the memsw data. This data is computed
    recursively when hierarchy is enabled.
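
    A minimal sketch of the derivation described above (the helper is hypothetical;
    the stat indices are the ones named in the text, though exact accessor names in
    the tree may differ):

    static u64 example_root_usage(struct mem_cgroup *mem, bool swap)
    {
            s64 pages;

            pages  = mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_CACHE);
            pages += mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_RSS);
            if (swap)       /* memsw: also count what has been swapped out */
                    pages += mem_cgroup_read_stat(&mem->stat, MEM_CGROUP_STAT_SWAPOUT);
            return (u64)pages * PAGE_SIZE;
    }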

    The test results I see on a 24-way system show that

    1. The lock contention disappears from /proc/lock_stats
    2. The results of the test are comparable to running with
    cgroup_disable=memory.

    Here is a sample of my program runs

    Without Patch

    Performance counter stats for '/home/balbir/parallel_pagefault':

    7192804.124144 task-clock-msecs # 23.937 CPUs
    424691 context-switches # 0.000 M/sec
    267 CPU-migrations # 0.000 M/sec
    28498113 page-faults # 0.004 M/sec
    5826093739340 cycles # 809.989 M/sec
    408883496292 instructions # 0.070 IPC
    7057079452 cache-references # 0.981 M/sec
    3036086243 cache-misses # 0.422 M/sec

    300.485365680 seconds time elapsed

    With cgroup_disable=memory

    Performance counter stats for '/home/balbir/parallel_pagefault':

    7182183.546587 task-clock-msecs # 23.915 CPUs
    425458 context-switches # 0.000 M/sec
    203 CPU-migrations # 0.000 M/sec
    92545093 page-faults # 0.013 M/sec
    6034363609986 cycles # 840.185 M/sec
    437204346785 instructions # 0.072 IPC
    6636073192 cache-references # 0.924 M/sec
    2358117732 cache-misses # 0.328 M/sec

    300.320905827 seconds time elapsed

    With this patch applied

    Performance counter stats for '/home/balbir/parallel_pagefault':

    7191619.223977 task-clock-msecs # 23.955 CPUs
    422579 context-switches # 0.000 M/sec
    88 CPU-migrations # 0.000 M/sec
    91946060 page-faults # 0.013 M/sec
    5957054385619 cycles # 828.333 M/sec
    1058117350365 instructions # 0.178 IPC
    9161776218 cache-references # 1.274 M/sec
    1920494280 cache-misses # 0.267 M/sec

    300.218764862 seconds time elapsed

    Data from Prarit (kernel compile with make -j64 on a 64
    CPU/32G machine)

    For a single run

    Without patch

    real 27m8.988s
    user 87m24.916s
    sys 382m6.037s

    With patch

    real 4m18.607s
    user 84m58.943s
    sys 50m52.682s

    With config turned off

    real 4m54.972s
    user 90m13.456s
    sys 50m19.711s

    NOTE: The data looks counterintuitive due to the increased performance
    with the patch, even over the config being turned off. We probably need
    more runs, but so far all testing has shown that the patches definitely
    help.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Balbir Singh
    Cc: Prarit Bhargava
    Cc: Andi Kleen
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Daisuke Nishimura
    Cc: KOSAKI Motohiro
    Cc: Paul Menage
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Implement reclaim from groups over their soft limit

    Permit reclaim from memory cgroups on contention (via the direct reclaim
    path).

    memory cgroup soft limit reclaim finds the group that exceeds its soft
    limit by the largest number of pages and reclaims pages from it and then
    reinserts the cgroup into its correct place in the rbtree.

    Add additional checks to mem_cgroup_hierarchical_reclaim() to detect long
    loops in case all swap is turned off. The code has been refactored and
    the loop check (loop < 2) has been enhanced for soft limits. For soft
    limits, we try to do more targeted reclaim. Instead of bailing out after
    two loops, the routine now reclaims memory proportional to the size by
    which the soft limit is exceeded. The proportion has been empirically
    determined.
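
    A minimal sketch of the proportional target described (the helper is
    hypothetical; an accessor like res_counter_soft_limit_excess() is what this
    series adds to the resource counters):

    static unsigned long example_soft_limit_target(struct mem_cgroup *mem)
    {
            /* pages by which the group exceeds its soft limit: reclaim roughly
             * this much instead of always bailing out after two loops */
            return res_counter_soft_limit_excess(&mem->res) >> PAGE_SHIFT;
    }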

    [akpm@linux-foundation.org: build fix]
    [kamezawa.hiroyu@jp.fujitsu.com: fix softlimit css refcnt handling]
    [nishimura@mxp.nes.nec.co.jp: refcount of the "victim" should be decremented before exiting the loop]
    Signed-off-by: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Acked-by: KOSAKI Motohiro
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Refactor mem_cgroup_hierarchical_reclaim()

    Refactor the arguments passed to mem_cgroup_hierarchical_reclaim() into
    flags, so that new parameters don't have to be passed as we make the
    reclaim routine more flexible

    Signed-off-by: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Organize cgroups over their soft limit in an RB-Tree

    Introduce an RB-Tree for storing memory cgroups that are over their soft
    limit. The overall goal is to

    1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded.
    We are careful about updates; they take place only after a particular
    time interval has passed.
    2. Remove the node from the RB-Tree when the usage goes back below the
    soft limit.

    The next set of patches will exploit the RB-Tree to get the group that is
    over its soft limit by the largest amount and reclaim from it, when we
    face memory contention.
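
    A minimal sketch of the tree insertion described above, keyed by how far each
    group exceeds its soft limit (type and field names are illustrative, not
    necessarily those used in the tree):

    static void example_tree_insert(struct mem_cgroup_per_zone *mz,
                                    struct rb_root *root)
    {
            struct rb_node **p = &root->rb_node;
            struct rb_node *parent = NULL;
            struct mem_cgroup_per_zone *entry;

            while (*p) {
                    parent = *p;
                    entry = rb_entry(parent, struct mem_cgroup_per_zone, tree_node);
                    if (mz->usage_in_excess < entry->usage_in_excess)
                            p = &(*p)->rb_left;     /* rightmost node = largest excess */
                    else
                            p = &(*p)->rb_right;
            }
            rb_link_node(&mz->tree_node, parent, p);
            rb_insert_color(&mz->tree_node, root);
    }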

    [hugh.dickins@tiscali.co.uk: CONFIG_CGROUP_MEM_RES_CTLR=y CONFIG_PREEMPT=y fails to boot]
    Signed-off-by: Balbir Singh
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Cc: Jiri Slaby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Add an interface to allow get/set of soft limits. Soft limits for memory
    plus swap controller (memsw) are currently not supported. Resource
    counters have been enhanced to support soft limits and new type
    RES_SOFT_LIMIT has been added. Unlike hard limits, soft limits can be
    directly set and do not need any reclaim or checks before setting them to
    a newer value.

    Kamezawa-San raised a question as to whether soft limit should belong to
    res_counter. Since all resources understand the basic concepts of hard
    and soft limits, it is justified to add soft limits here. Soft limits are
    a generic resource usage feature, even file system quotas support soft
    limits.
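
    A minimal sketch of the enhancement described (field placement is illustrative;
    the other members are the usual res_counter fields):

    struct res_counter {
            unsigned long long usage;       /* current usage */
            unsigned long long max_usage;   /* high-water mark */
            unsigned long long limit;       /* hard limit: enforced */
            unsigned long long soft_limit;  /* new: advisory, set directly, no reclaim */
            unsigned long long failcnt;
            spinlock_t lock;
            struct res_counter *parent;
    };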

    Signed-off-by: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Add comments for the reason of smp_wmb() in mem_cgroup_commit_charge().
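
    A minimal sketch of the ordering the new comment documents (simplified from the
    commit path; SetPageCgroupUsed() is the flag setter referred to elsewhere in
    this series):

    pc->mem_cgroup = mem;
    /* make pc->mem_cgroup visible before the USED bit, so that anyone who
     * tests USED without lock_page_cgroup() also sees a valid mem_cgroup */
    smp_wmb();
    SetPageCgroupUsed(pc);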

    [akpm@linux-foundation.org: coding-style fixes]
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Change the memory cgroup to remove the overhead associated with accounting
    all pages in the root cgroup. As a side-effect, we can no longer set a
    memory hard limit in the root cgroup.

    A new flag to track whether the page has been accounted or not has been
    added as well. Flags are now set atomically for page_cgroup,
    pcg_default_flags is now obsolete and removed.

    [akpm@linux-foundation.org: fix a few documentation glitches]
    Signed-off-by: Balbir Singh
    Signed-off-by: Daisuke Nishimura
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Li Zefan
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Alter the ss->can_attach and ss->attach functions to be able to deal with
    a whole threadgroup at a time, for use in cgroup_attach_proc. (This is a
    pre-patch to cgroup-procs-writable.patch.)

    Currently, the new mode of the attach function can only tell the subsystem
    about the old cgroup of the threadgroup leader. No subsystem currently
    needs that information for each thread that's being moved, but if one were
    to be added (for example, one that counts tasks within a group), this
    would need to be reworked to tell the subsystem the right
    information.

    [hidave.darkstar@gmail.com: fix build]
    Signed-off-by: Ben Blum
    Signed-off-by: Paul Menage
    Acked-by: Li Zefan
    Reviewed-by: Matt Helsley
    Cc: "Eric W. Biederman"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Dave Young
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • Now that ksm is in mainline it is better to change the default values to
    better fit most users.

    This patch changes the ksm default values to be:

    ksm_thread_pages_to_scan = 100 (instead of 200)
    ksm_thread_sleep_millisecs = 20 (like before)
    ksm_run = KSM_RUN_STOP (instead of KSM_RUN_MERGE - meaning ksm is
    disabled by default)
    ksm_max_kernel_pages = nr_free_buffer_pages / 4 (instead of 2046)

    The important aspect of this patch is that it disables ksm by default and
    sets the number of kernel pages that can be allocated to a reasonable
    number.
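
    The defaults listed above, roughly as they would appear in mm/ksm.c (a hedged
    sketch; the last value is computed at init time from nr_free_buffer_pages()
    rather than being a compile-time constant):

    static unsigned int ksm_thread_pages_to_scan = 100;
    static unsigned int ksm_thread_sleep_millisecs = 20;
    static unsigned int ksm_run = KSM_RUN_STOP;    /* KSM is now disabled by default */
    static unsigned long ksm_max_kernel_pages;     /* set to nr_free_buffer_pages() / 4 */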

    Signed-off-by: Izik Eidus
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Izik Eidus
     
  • Introduce new truncate helpers truncate_pagecache and inode_newsize_ok.
    vmtruncate is also consolidated from mm/memory.c and mm/nommu.c and
    into mm/truncate.c.
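
    A minimal sketch of how a filesystem's setattr path might use the new helpers
    (the filesystem and function are hypothetical; the helper signatures follow
    this series to the best of my knowledge):

    static int examplefs_setsize(struct inode *inode, loff_t newsize)
    {
            loff_t oldsize = inode->i_size;
            int error;

            error = inode_newsize_ok(inode, newsize);    /* rlimit and s_maxbytes checks */
            if (error)
                    return error;

            i_size_write(inode, newsize);
            truncate_pagecache(inode, oldsize, newsize); /* drop now out-of-range pages */
            return 0;
    }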

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    npiggin@suse.de
     
  • This slipped past the previous sweeps.

    Signed-off-by: Rusty Russell
    Acked-by: Christoph Lameter

    Rusty Russell
     
  • My 58fa879e1e640a1856f736b418984ebeccee1c95 "mm: FOLL flags for GUP flags"
    broke CONFIG_NOMMU build by forgetting to update nommu.c foll_flags type:

    mm/nommu.c:171: error: conflicting types for `__get_user_pages'
    mm/internal.h:254: error: previous declaration of `__get_user_pages' was here
    make[1]: *** [mm/nommu.o] Error 1

    My 03f6462a3ae78f36eb1f0ee8b4d5ae2f7859c1d5 "mm: move highest_memmap_pfn"
    broke CONFIG_NOMMU build by forgetting to add a nommu.c highest_memmap_pfn:

    mm/built-in.o: In function `memmap_init_zone':
    (.meminit.text+0x326): undefined reference to `highest_memmap_pfn'
    mm/built-in.o: In function `memmap_init_zone':
    (.meminit.text+0x32d): undefined reference to `highest_memmap_pfn'

    Fix both breakages, and give myself 30 lashes (ouch!)

    Reported-by: Michal Simek
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

23 Sep, 2009

3 commits

  • Some archs define MODULES_VADDR/MODULES_END which is not in the VMALLOC
    area. This is handled only on x86-64. This patch makes it more generic,
    so that we can use vread/vwrite to access the area.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Jiri Slaby
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: WANG Cong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Originally, walk_memory_resource() was introduced to traverse all memory
    of "System RAM" for detecting memory hotplug/unplug ranges. For doing so,
    the flags IORESOURCE_MEM|IORESOURCE_BUSY were used, and this was enough for
    memory hotplug.

    But when used for another purpose, /proc/kcore, this may include some
    firmware areas marked as IORESOURCE_BUSY | IORESOURCE_MEM. This patch makes
    the check strict, to find out busy "System RAM".

    Note: PPC64 keeps its own walk_memory_resource(), which walks through
    ppc64's lmb information. Because the old kclist_add() is called per lmb,
    this patch makes no difference in behavior, finally.

    And this patch removes the CONFIG_MEMORY_HOTPLUG check from this function,
    because pfn_valid() just shows "there is a memmap or not" and cannot be used
    for "there is physical memory or not"; this function is useful generically
    for scanning physical memory ranges.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: WANG Cong
    Cc: Américo Wang
    Cc: David Rientjes
    Cc: Roland Dreier
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • A patch to give a better overview of the userland application stack usage,
    especially for embedded Linux.

    Currently you are only able to dump the main process/thread stack usage,
    which is shown in /proc/<pid>/status by the "VmStk" value. But you get no
    information about the stack memory consumed by the threads.

    There is an enhancement in /proc/<pid>/{task/*,}/*maps which marks
    the vm mapping where the thread stack pointer resides with "[thread stack
    xxxxxxxx]". xxxxxxxx is the maximum size of the stack. This is valuable
    information, because libpthread doesn't set the start of the stack to the
    top of the mapped area, depending on the pthread usage.

    A sample output of /proc/<pid>/task/<tid>/maps looks like:

    08048000-08049000 r-xp 00000000 03:00 8312 /opt/z
    08049000-0804a000 rw-p 00001000 03:00 8312 /opt/z
    0804a000-0806b000 rw-p 00000000 00:00 0 [heap]
    a7d12000-a7d13000 ---p 00000000 00:00 0
    a7d13000-a7f13000 rw-p 00000000 00:00 0 [thread stack: 001ff4b4]
    a7f13000-a7f14000 ---p 00000000 00:00 0
    a7f14000-a7f36000 rw-p 00000000 00:00 0
    a7f36000-a8069000 r-xp 00000000 03:00 4222 /lib/libc.so.6
    a8069000-a806b000 r--p 00133000 03:00 4222 /lib/libc.so.6
    a806b000-a806c000 rw-p 00135000 03:00 4222 /lib/libc.so.6
    a806c000-a806f000 rw-p 00000000 00:00 0
    a806f000-a8083000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0
    a8083000-a8084000 r--p 00013000 03:00 14462 /lib/libpthread.so.0
    a8084000-a8085000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0
    a8085000-a8088000 rw-p 00000000 00:00 0
    a8088000-a80a4000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2
    a80a4000-a80a5000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2
    a80a5000-a80a6000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2
    afaf5000-afb0a000 rw-p 00000000 00:00 0 [stack]
    ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso]

    Also there is a new entry "stack usage" in /proc/<pid>/{task/*,}/status
    which will give you the current stack usage in kB.

    A sample output of /proc/self/status looks like:

    Name: cat
    State: R (running)
    Tgid: 507
    Pid: 507
    .
    .
    .
    CapBnd: fffffffffffffeff
    voluntary_ctxt_switches: 0
    nonvoluntary_ctxt_switches: 0
    Stack usage: 12 kB

    I also fixed the stack base address in /proc/<pid>/{task/*,}/stat to be the
    base address of the associated thread stack and not the one of the main
    process. This makes more sense.

    [akpm@linux-foundation.org: fs/proc/array.c now needs walk_page_range()]
    Signed-off-by: Stefani Seibold
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stefani Seibold
     

22 Sep, 2009

21 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
    trivial: fix typo in aic7xxx comment
    trivial: fix comment typo in drivers/ata/pata_hpt37x.c
    trivial: typo in kernel-parameters.txt
    trivial: fix typo in tracing documentation
    trivial: add __init/__exit macros in drivers/gpio/bt8xxgpio.c
    trivial: add __init macro/ fix of __exit macro location in ipmi_poweroff.c
    trivial: remove unnecessary semicolons
    trivial: Fix duplicated word "options" in comment
    trivial: kbuild: remove extraneous blank line after declaration of usage()
    trivial: improve help text for mm debug config options
    trivial: doc: hpfall: accept disk device to unload as argument
    trivial: doc: hpfall: reduce risk that hpfall can do harm
    trivial: SubmittingPatches: Fix reference to renumbered step
    trivial: fix typos "man[ae]g?ment" -> "management"
    trivial: media/video/cx88: add __init/__exit macros to cx88 drivers
    trivial: fix typo in CONFIG_DEBUG_FS in gcov doc
    trivial: fix missing printk space in amd_k7_smp_check
    trivial: fix typo s/ketymap/keymap/ in comment
    trivial: fix typo "to to" in multiple files
    trivial: fix typos in comments s/DGBU/DBGU/
    ...

    Linus Torvalds
     
  • Some architectures (like the Blackfin arch) implement some of the
    "simpler" features that one would expect out of a MMU such as memory
    protection.

    In our case, we actually get read/write/exec protection down to the page
    boundary so processes can't stomp on each other, let alone the kernel.

    There is a performance decrease (which depends greatly on the workload),
    however, as the hardware/software interaction was not optimized at design
    time.

    Signed-off-by: Bernd Schmidt
    Signed-off-by: Bryan Wu
    Signed-off-by: Mike Frysinger
    Acked-by: David Howells
    Acked-by: Greg Ungerer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bernd Schmidt
     
  • When the mm being switched to matches the active mm, we don't need to
    increment and then drop the mm count. In a simple benchmark this happens
    in about 50% of time. Making that conditional reduces contention on that
    cacheline on SMP systems.
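
    A minimal sketch of the fast path described above (simplified from use_mm();
    task-lock handling omitted):

    active_mm = tsk->active_mm;
    if (active_mm != mm)
            atomic_inc(&mm->mm_count);      /* only pin the new mm if it differs */
    tsk->mm = mm;
    tsk->active_mm = mm;
    switch_mm(active_mm, mm, tsk);
    if (active_mm != mm)
            mmdrop(active_mm);              /* and only drop the old one if replaced */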

    Acked-by: Andrea Arcangeli
    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael S. Tsirkin
     
  • Anyone who wants to do copy to/from user from a kernel thread, needs
    use_mm (like what fs/aio has). Move that into mm/, to make reusing and
    exporting easier down the line, and make aio use it. Next intended user,
    besides aio, will be vhost-net.
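
    A minimal sketch of the pattern this move enables for kernel threads (the
    worker function and its arguments are hypothetical; use_mm()/unuse_mm() are the
    helpers being moved):

    static void example_worker(struct mm_struct *mm, void __user *uaddr,
                               const void *buf, size_t len)
    {
            use_mm(mm);                     /* adopt the user mm in this kernel thread */
            if (copy_to_user(uaddr, buf, len))
                    printk(KERN_WARNING "example: copy_to_user failed\n");
            unuse_mm(mm);                   /* detach from it again */
    }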

    Acked-by: Andrea Arcangeli
    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael S. Tsirkin
     
  • Fixes the following kmemcheck false positive (the compiler is using
    a 32-bit mov to load the 16-bit sbinfo->mode in shmem_fill_super):

    [ 0.337000] Total of 1 processors activated (3088.38 BogoMIPS).
    [ 0.352000] CPU0 attaching NULL sched-domain.
    [ 0.360000] WARNING: kmemcheck: Caught 32-bit read from uninitialized
    memory (9f8020fc)
    [ 0.361000]
    a44240820000000041f6998100000000000000000000000000000000ff030000
    [ 0.368000] i i i i i i i i i i i i i i i i u u u u i i i i i i i i i i u
    u
    [ 0.375000] ^
    [ 0.376000]
    [ 0.377000] Pid: 9, comm: khelper Not tainted (2.6.31-tip #206) P4DC6
    [ 0.378000] EIP: 0060:[] EFLAGS: 00010246 CPU: 0
    [ 0.379000] EIP is at shmem_fill_super+0xb5/0x120
    [ 0.380000] EAX: 00000000 EBX: 9f845400 ECX: 824042a4 EDX: 8199f641
    [ 0.381000] ESI: 9f8020c0 EDI: 9f845400 EBP: 9f81af68 ESP: 81cd6eec
    [ 0.382000] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
    [ 0.383000] CR0: 8005003b CR2: 9f806200 CR3: 01ccd000 CR4: 000006d0
    [ 0.384000] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
    [ 0.385000] DR6: ffff4ff0 DR7: 00000400
    [ 0.386000] [] get_sb_nodev+0x3c/0x80
    [ 0.388000] [] shmem_get_sb+0x14/0x20
    [ 0.390000] [] vfs_kern_mount+0x4f/0x120
    [ 0.392000] [] init_tmpfs+0x7e/0xb0
    [ 0.394000] [] do_basic_setup+0x17/0x30
    [ 0.396000] [] kernel_init+0x57/0xa0
    [ 0.398000] [] kernel_thread_helper+0x7/0x10
    [ 0.400000] [] 0xffffffff
    [ 0.402000] khelper used greatest stack depth: 2820 bytes left
    [ 0.407000] calling init_mmap_min_addr+0x0/0x10 @ 1
    [ 0.408000] initcall init_mmap_min_addr+0x0/0x10 returned 0 after 0 usecs

    Reported-by: Ingo Molnar
    Analysed-by: Vegard Nossum
    Signed-off-by: Pekka Enberg
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pekka Enberg
     
  • Add a flag for mmap that will be used to request a huge page region that
    will look like anonymous memory to userspace. This is accomplished by
    using a file on the internal vfsmount. MAP_HUGETLB is a modifier of
    MAP_ANONYMOUS and so must be specified with it. The region will behave
    the same as a MAP_ANONYMOUS region using small pages.
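
    A minimal userspace sketch of the new flag (assumes huge pages have been
    reserved, e.g. via /proc/sys/vm/nr_hugepages, and a 2 MB huge page size; the
    length must be a multiple of the huge page size):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t len = 2UL << 20;         /* one 2 MB huge page */
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
            if (p == MAP_FAILED) {
                    perror("mmap(MAP_HUGETLB)");
                    return 1;
            }
            munmap(p, len);
            return 0;
    }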

    [akpm@linux-foundation.org: fix arch definitions of MAP_HUGETLB]
    Signed-off-by: Eric B Munson
    Acked-by: David Rientjes
    Cc: Mel Gorman
    Cc: Adam Litke
    Cc: David Gibson
    Cc: Lee Schermerhorn
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     
  • shmem_zero_setup() does not change vm_start, pgoff or vm_flags; only some
    drivers change them (such as drivers/video/bfin-t350mcqb-fb.c).

    Move this code to a more proper place to save cycles for shared
    anonymous mappings.

    Signed-off-by: Huang Shijie
    Reviewed-by: Minchan Kim
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
     
  • We noticed very erratic behavior [throughput] with the AIM7 shared
    workload running on recent distro [SLES11] and mainline kernels on an
    8-socket, 32-core, 256GB x86_64 platform. On the SLES11 kernel
    [2.6.27.19+] with Barcelona processors, as we increased the load [10s of
    thousands of tasks], the throughput would vary between two "plateaus"--one
    at ~65K jobs per minute and one at ~130K jpm. The simple patch below
    causes the results to smooth out at the ~130k plateau.

    But wait, there's more:

    We do not see this behavior on smaller platforms--e.g., 4 socket/8 core.
    This could be the result of the larger number of cpus on the larger
    platform--a scalability issue--or it could be the result of the larger
    number of interconnect "hops" between some nodes in this platform and how
    the tasks for a given load end up distributed over the nodes' cpus and
    memories--a stochastic NUMA effect.

    The variability in the results is less pronounced [on the same platform]
    with Shanghai processors and with mainline kernels. With 31-rc6 on
    Shanghai processors and 288 file systems on 288 fibre attached storage
    volumes, the curves [jpm vs load] are both quite flat with the patched
    kernel consistently producing ~3.9% better throughput [~80K jpm vs ~77K
    jpm] than the unpatched kernel.

    Profiling indicated that the "slow" runs were incurring high[er]
    contention on an anon_vma lock in vma_adjust(), apparently called from the
    sbrk() system call.

    The patch:

    A comment in mm/mmap.c:vma_adjust() suggests that we don't really need the
    anon_vma lock when we're only adjusting the end of a vma, as is the case
    for brk(). The comment questions whether it's worth while to optimize for
    this case. Apparently, on the newer, larger x86_64 platforms, with
    interesting NUMA topologies, it is worth while--especially considering
    that the patch [if correct!] is quite simple.

    We can detect this condition--no overlap with next vma--by noting a NULL
    "importer". The anon_vma pointer will also be NULL in this case, so
    simply avoid loading vma->anon_vma to avoid the lock.

    However, we DO need to take the anon_vma lock when we're inserting a vma
    ['insert' non-NULL] even when we have no overlap [NULL "importer"], so we
    need to check for 'insert', as well. And Hugh points out that we should
    also take it when adjusting vm_start (so that rmap.c can rely upon
    vma_address() while it holds the anon_vma lock).
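
    A minimal sketch of the resulting condition (simplified; see vma_adjust() in
    mm/mmap.c for the real code):

    struct anon_vma *anon_vma = NULL;

    if (vma->anon_vma && (insert || importer || start != vma->vm_start))
            anon_vma = vma->anon_vma;       /* brk() only moving 'end' takes none of these */
    if (anon_vma)
            spin_lock(&anon_vma->lock);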

    akpm: Zhang Yanmin reports a 150% throughput improvement with aim7, so it
    might be -stable material even though this isn't a regression: "this
    issue is not clear on dual socket Nehalem machine (2*4*2 cpu), but is
    severe on large machine (4*8*2 cpu)"

    [hugh.dickins@tiscali.co.uk: test vma start too]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Eric Whitney
    Tested-by: "Zhang, Yanmin"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
     
  • CONFIG_SHMEM off gives you (ramfs masquerading as) tmpfs, even when
    CONFIG_TMPFS is off: that's a little anomalous, and I'd intended to make
    more sense of it by removing CONFIG_TMPFS altogether, always enabling its
    code when CONFIG_SHMEM; but so many defconfigs have CONFIG_SHMEM on
    CONFIG_TMPFS off that we'd better leave that as is.

    But there is no point in asking for CONFIG_TMPFS if CONFIG_SHMEM is off:
    make TMPFS depend on SHMEM, which also prevents TMPFS_POSIX_ACL
    shmem_acl.o from being pointlessly built into the kernel when SHMEM is off.

    And a selfish change, to prevent the world from being rebuilt when I
    switch between CONFIG_SHMEM on and off: the only CONFIG_SHMEM in the
    header files is mm.h shmem_lock() - give that a shmem.c stub instead.

    Signed-off-by: Hugh Dickins
    Acked-by: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • If (flags & MAP_LOCKED) is true, it means vm_flags already contains
    the VM_LOCKED bit, which is set by calc_vm_flag_bits().

    So there is no need to set it again; just remove the redundant code.

    Signed-off-by: Huang Shijie
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
     
  • Move highest_memmap_pfn __read_mostly from page_alloc.c next to zero_pfn
    __read_mostly in memory.c: to help them share a cacheline, since they're
    very often tested together in vm_normal_page().

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Reinstate anonymous use of ZERO_PAGE to all architectures, not just to
    those which __HAVE_ARCH_PTE_SPECIAL: as suggested by Nick Piggin.

    Contrary to how I'd imagined it, there's nothing ugly about this, just a
    zero_pfn test built into one or another block of vm_normal_page().

    But the MIPS ZERO_PAGE-of-many-colours case demands is_zero_pfn() and
    my_zero_pfn() inlines. Reinstate its mremap move_pte() shuffling of
    ZERO_PAGEs we did from 2.6.17 to 2.6.19? Not unless someone shouts for
    that: it would have to take vm_flags to weed out some cases.
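
    A minimal sketch of the generic inlines described (MIPS supplies its own
    versions to cope with its multiple zero-page colours):

    extern unsigned long zero_pfn;

    static inline int is_zero_pfn(unsigned long pfn)
    {
            return pfn == zero_pfn;
    }

    static inline unsigned long my_zero_pfn(unsigned long addr)
    {
            return zero_pfn;
    }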

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Rename hugetlbfs_backed() to hugetlbfs_pagecache_present()
    and add more comments, as suggested by Mel Gorman.

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • I'm still reluctant to clutter __get_user_pages() with another flag, just
    to avoid touching ZERO_PAGE count in mlock(); though we can add that later
    if it shows up as an issue in practice.

    But when mlocking, we can test page->mapping slightly earlier, to avoid
    the potentially bouncy rescheduling of lock_page on ZERO_PAGE - mlock
    didn't lock_page in olden ZERO_PAGE days, so we might have regressed.

    And when munlocking, it turns out that FOLL_DUMP coincidentally does
    what's needed to avoid all updates to ZERO_PAGE, so use that here also.
    Plus add comment suggested by KAMEZAWA Hiroyuki.

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: Nick Piggin
    Acked-by: Mel Gorman
    Cc: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • __get_user_pages() has been taking its own GUP flags, then processing
    them into FOLL flags for follow_page(). Though oddly named, the FOLL
    flags are more widely used, so pass them to __get_user_pages() now.
    Sorry, VM flags, VM_FAULT flags and FAULT_FLAGs are still distinct.

    (The patch to __get_user_pages() looks peculiar, with both gup_flags
    and foll_flags: the gup_flags remain constant; but as before there's
    an exceptional case, out of scope of the patch, in which foll_flags
    per page have FOLL_WRITE masked off.)

    Signed-off-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • KAMEZAWA Hiroyuki has observed customers of earlier kernels taking
    advantage of the ZERO_PAGE: which we stopped do_anonymous_page() from
    using in 2.6.24. And there were a couple of regression reports on LKML.

    Following suggestions from Linus, reinstate do_anonymous_page() use of
    the ZERO_PAGE; but this time avoid dirtying its struct page cacheline
    with (map)count updates - let vm_normal_page() regard it as abnormal.

    Use it only on arches which __HAVE_ARCH_PTE_SPECIAL (x86, s390, sh32,
    most powerpc): that's not essential, but minimizes additional branches
    (keeping them in the unlikely pte_special case); and incidentally
    excludes mips (some models of which needed eight colours of ZERO_PAGE
    to avoid costly exceptions).

    Don't be fanatical about avoiding ZERO_PAGE updates: get_user_pages()
    callers won't want to make exceptions for it, so increment its count
    there. Changes to mlock and migration? happily seems not needed.

    In most places it's quicker to check pfn than struct page address:
    prepare a __read_mostly zero_pfn for that. Does get_dump_page()
    still need its ZERO_PAGE check? probably not, but keep it anyway.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • do_anonymous_page() has been wrong to dirty the pte regardless.
    If it's not going to mark the pte writable, then it won't help
    to mark it dirty here, and clogs up memory with pages which will
    need swap instead of being thrown away. Especially wrong if no
    overcommit is chosen, and this vma is not yet VM_ACCOUNTed -
    we could exceed the limit and OOM despite no overcommit.
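
    A minimal sketch of the corrected behaviour (simplified from
    do_anonymous_page(): only a writable mapping gets a writable, dirty pte):

    entry = mk_pte(page, vma->vm_page_prot);
    if (vma->vm_flags & VM_WRITE)
            entry = pte_mkwrite(pte_mkdirty(entry));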

    Signed-off-by: Hugh Dickins
    Cc:
    Acked-by: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • follow_hugetlb_page() shouldn't be guessing about the coredump case
    either: pass the foll_flags down to it, instead of just the write bit.

    Remove that obscure huge_zeropage_ok() test. The decision is easy,
    though unlike the non-huge case - here vm_ops->fault is always set.
    But we know that a fault would serve up zeroes, unless there's
    already a hugetlbfs pagecache page to back the range.

    (Alternatively, since hugetlb pages aren't swapped out under pressure,
    you could save more dump space by arguing that a page not yet faulted
    into this process cannot be relevant to the dump; but that would be
    more surprising.)

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The "FOLL_ANON optimization" and its use_zero_page() test have caused
    confusion and bugs: why does it test VM_SHARED? for the very good but
    unsatisfying reason that VMware crashed without. As we look to maybe
    reinstating anonymous use of the ZERO_PAGE, we need to sort this out.

    Easily done: it's silly for __get_user_pages() and follow_page() to
    be guessing whether it's safe to assume that they're being used for
    a coredump (which can take a shortcut snapshot where other uses must
    handle a fault) - just tell them with GUP_FLAGS_DUMP and FOLL_DUMP.

    get_dump_page() doesn't even want a ZERO_PAGE: an error suits fine.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • In preparation for the next patch, add a simple get_dump_page(addr)
    interface for the CONFIG_ELF_CORE dumpers to use, instead of calling
    get_user_pages() directly. They're not interested in errors: they
    just want to use holes as much as possible, to save space and make
    sure that the data is aligned where the headers said it would be.

    Oh, and don't use that horrid DUMP_SEEK(off) macro!
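
    A minimal sketch of the dumper-side usage described (simplified; dump_write()
    and dump_seek() stand for the ELF coredump helpers, and error handling is
    omitted):

    struct page *page = get_dump_page(addr);

    if (page) {
            void *kaddr = kmap(page);
            dump_write(file, kaddr, PAGE_SIZE);
            kunmap(page);
            page_cache_release(page);
    } else {
            dump_seek(file, PAGE_SIZE);     /* a hole: just advance the file position */
    }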

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • GUP_FLAGS_IGNORE_VMA_PERMISSIONS and GUP_FLAGS_IGNORE_SIGKILL were
    flags added solely to prevent __get_user_pages() from doing some of
    what it usually does, in the munlock case: we can now remove them.

    Signed-off-by: Hugh Dickins
    Acked-by: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Mel Gorman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins