29 Mar, 2012

37 commits

  • tools/ is the better place for vm tools which are used by many people.
    Moving them to tools also makes them available to more users, instead
    of hiding them in the Documentation folder.

    This patch moves page-types.c to tools/vm/page-types.c. It also adds a
    Makefile in tools/vm and fixes two coding style problems: a) change a
    const array to 'const char * const', b) change spaces to a tab for
    indentation.

    Signed-off-by: Dave Young
    Acked-by: Wu Fengguang
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Frederic Weisbecker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Young
     
  • So a "make run_tests" will build the tests before trying to run them.

    Acked-by: Frederic Weisbecker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Remove the run_tests script and launch the selftests by calling "make
    run_tests" from the selftests top directory instead. This delegates to
    the Makefile in each selftest directory, where it is decided how to launch
    the local test.

    This removes the need to add each selftest directory to the now removed
    "run_tests" top script.

    Signed-off-by: Frederic Weisbecker
    Cc: Dave Young
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Frederic Weisbecker
     
  • Replace radix_tree_gang_lookup_slot() and
    radix_tree_gang_lookup_tag_slot() in the page-cache lookup functions
    with the brand-new radix-tree direct iterator. This avoids the double
    scanning and pointer copying.

    The iterator doesn't stop after nr_pages page-get failures in a row; it
    continues the lookup till the end of the radix tree. Thus we can safely
    remove these restart conditions.

    Unfortunately, the old implementation didn't forbid nr_pages == 0; this
    corner case does not fit into the new code, so the patch adds an extra
    check at the beginning.

    Signed-off-by: Konstantin Khlebnikov
    Tested-by: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Rewrite radix_tree_gang_lookup_* functions using the new radix-tree
    iterator.

    Signed-off-by: Konstantin Khlebnikov
    Tested-by: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • A series of radix tree cleanups, and usage of them in the core pagecache
    code.

    Micro-benchmark:

    lookup 14 slots (typical page-vector size)
    in a radix tree where each slot is filled and tagged
    before/after - nsec per full scan through the tree

    * Intel Sandy Bridge i7-2620M 4Mb L3
    New code always faster

    * AMD Athlon 6000+ 2x1Mb L2, without L3
    New code generally faster,
    Minor degradation (marked with "*") for huge sparse trees

    * i386 on Sandy Bridge
    New code faster for common cases: tagged and dense trees.
    Some degradations for non-tagged lookup on sparse trees.

    Ideally, an __ffs() analog for finding the first non-zero long element
    in an array would help here; gcc sometimes cannot optimize this loop
    correctly.

    Numbers:

    CPU: Intel Sandy Bridge i7-2620M 4Mb L3

    radix-tree with 1024 slots:

    tagged lookup

    step 1 before 7156 after 3613
    step 2 before 5399 after 2696
    step 3 before 4779 after 1928
    step 4 before 4456 after 1429
    step 5 before 4292 after 1213
    step 6 before 4183 after 1052
    step 7 before 4157 after 951
    step 8 before 4016 after 812
    step 9 before 3952 after 851
    step 10 before 3937 after 732
    step 11 before 4023 after 709
    step 12 before 3872 after 657
    step 13 before 3892 after 633
    step 14 before 3720 after 591
    step 15 before 3879 after 578
    step 16 before 3561 after 513

    normal lookup

    step 1 before 4266 after 3301
    step 2 before 2695 after 2129
    step 3 before 2083 after 1712
    step 4 before 1801 after 1534
    step 5 before 1628 after 1313
    step 6 before 1551 after 1263
    step 7 before 1475 after 1185
    step 8 before 1432 after 1167
    step 9 before 1373 after 1092
    step 10 before 1339 after 1134
    step 11 before 1292 after 1056
    step 12 before 1319 after 1030
    step 13 before 1276 after 1004
    step 14 before 1256 after 987
    step 15 before 1228 after 992
    step 16 before 1247 after 999

    radix-tree with 1024*1024*128 slots:

    tagged lookup

    step 1 before 1086102841 after 674196409
    step 2 before 816839155 after 498138306
    step 7 before 599728907 after 240676762
    step 15 before 555729253 after 185219677
    step 63 before 606637748 after 128585664
    step 64 before 608384432 after 102945089
    step 65 before 596987114 after 123996019
    step 128 before 304459225 after 56783056
    step 256 before 158846855 after 31232481
    step 512 before 86085652 after 18950595
    step 12345 before 6517189 after 1674057

    normal lookup

    step 1 before 626064869 after 544418266
    step 2 before 418809975 after 336321473
    step 7 before 242303598 after 207755560
    step 15 before 208380563 after 176496355
    step 63 before 186854206 after 167283638
    step 64 before 176188060 after 170143976
    step 65 before 185139608 after 167487116
    step 128 before 88181865 after 86913490
    step 256 before 45733628 after 45143534
    step 512 before 24506038 after 23859036
    step 12345 before 2177425 after 2018662

    * AMD Athlon 6000+ 2x1Mb L2, without L3

    radix-tree with 1024 slots:

    tag-lookup

    step 1 before 8164 after 5379
    step 2 before 5818 after 5581
    step 3 before 4959 after 4213
    step 4 before 4371 after 3386
    step 5 before 4204 after 2997
    step 6 before 4950 after 2744
    step 7 before 4598 after 2480
    step 8 before 4251 after 2288
    step 9 before 4262 after 2243
    step 10 before 4175 after 2131
    step 11 before 3999 after 2024
    step 12 before 3979 after 1994
    step 13 before 3842 after 1929
    step 14 before 3750 after 1810
    step 15 before 3735 after 1810
    step 16 before 3532 after 1660

    normal-lookup

    step 1 before 7875 after 5847
    step 2 before 4808 after 4071
    step 3 before 4073 after 3462
    step 4 before 3677 after 3074
    step 5 before 4308 after 2978
    step 6 before 3911 after 3807
    step 7 before 3635 after 3522
    step 8 before 3313 after 3202
    step 9 before 3280 after 3257
    step 10 before 3166 after 3083
    step 11 before 3066 after 3026
    step 12 before 2985 after 2982
    step 13 before 2925 after 2924
    step 14 before 2834 after 2808
    step 15 before 2805 after 2803
    step 16 before 2647 after 2622

    radix-tree with 1024*1024*128 slots:

    tag-lookup

    step 1 before 1288059720 after 951736580
    step 2 before 961292300 after 884212140
    step 7 before 768905140 after 547267580
    step 15 before 771319480 after 456550640
    step 63 before 504847640 after 242704304
    step 64 before 392484800 after 177920786
    step 65 before 491162160 after 246895264
    step 128 before 208084064 after 97348392
    step 256 before 112401035 after 51408126
    step 512 before 75825834 after 29145070
    step 12345 before 5603166 after 2847330

    normal-lookup

    step 1 before 1025677120 after 861375100
    step 2 before 647220080 after 572258540
    step 7 before 505518960 after 484041813
    step 15 before 430483053 after 444815320 *
    step 63 before 388113453 after 404250546 *
    step 64 before 374154666 after 396027440 *
    step 65 before 381423973 after 396704853 *
    step 128 before 190078700 after 202619384 *
    step 256 before 100886756 after 102829108 *
    step 512 before 64074505 after 56158720
    step 12345 before 4237289 after 4422299 *

    * i686 on Sandy Bridge

    radix-tree with 1024 slots:

    tagged lookup

    step 1 before 7990 after 4019
    step 2 before 5698 after 2897
    step 3 before 5013 after 2475
    step 4 before 4630 after 1721
    step 5 before 4346 after 1759
    step 6 before 4299 after 1556
    step 7 before 4098 after 1513
    step 8 before 4115 after 1222
    step 9 before 3983 after 1390
    step 10 before 4077 after 1207
    step 11 before 3921 after 1231
    step 12 before 3894 after 1116
    step 13 before 3840 after 1147
    step 14 before 3799 after 1090
    step 15 before 3797 after 1059
    step 16 before 3783 after 745

    normal lookup

    step 1 before 5103 after 3499
    step 2 before 3299 after 2550
    step 3 before 2489 after 2370
    step 4 before 2034 after 2302 *
    step 5 before 1846 after 2268 *
    step 6 before 1752 after 2249 *
    step 7 before 1679 after 2164 *
    step 8 before 1627 after 2153 *
    step 9 before 1542 after 2095 *
    step 10 before 1479 after 2109 *
    step 11 before 1469 after 2009 *
    step 12 before 1445 after 2039 *
    step 13 before 1411 after 2013 *
    step 14 before 1374 after 2046 *
    step 15 before 1340 after 1975 *
    step 16 before 1331 after 2000 *

    radix-tree with 1024*1024*128 slots:

    tagged lookup

    step 1 before 1225865377 after 667153553
    step 2 before 842427423 after 471533007
    step 7 before 609296153 after 276260116
    step 15 before 544232060 after 226859105
    step 63 before 519209199 after 141343043
    step 64 before 588980279 after 141951339
    step 65 before 521099710 after 138282060
    step 128 before 298476778 after 83390628
    step 256 before 149358342 after 43602609
    step 512 before 76994713 after 22911077
    step 12345 before 5328666 after 1472111

    normal lookup

    step 1 before 819284564 after 533635310
    step 2 before 512421605 after 364956155
    step 7 before 271443305 after 305721345 *
    step 15 before 223591630 after 273960216 *
    step 63 before 190320247 after 217770207 *
    step 64 before 178538168 after 267411372 *
    step 65 before 186400423 after 215347937 *
    step 128 before 88106045 after 140540612 *
    step 256 before 44812420 after 70660377 *
    step 512 before 24435438 after 36328275 *
    step 12345 before 2123924 after 2148062 *

    bloat-o-meter delta for this patchset + patchset with related shmem cleanups

    bloat-o-meter: x86_64

    add/remove: 4/3 grow/shrink: 5/6 up/down: 928/-939 (-11)
    function old new delta
    radix_tree_next_chunk - 499 +499
    shmem_unuse 428 554 +126
    shmem_radix_tree_replace 131 227 +96
    find_get_pages_tag 354 419 +65
    find_get_pages_contig 345 407 +62
    find_get_pages 362 396 +34
    __kstrtab_radix_tree_next_chunk - 22 +22
    __ksymtab_radix_tree_next_chunk - 16 +16
    __kcrctab_radix_tree_next_chunk - 8 +8
    radix_tree_gang_lookup_slot 204 203 -1
    static.shmem_xattr_set 384 381 -3
    radix_tree_gang_lookup_tag_slot 208 191 -17
    radix_tree_gang_lookup 231 187 -44
    radix_tree_gang_lookup_tag 247 199 -48
    shmem_unlock_mapping 278 190 -88
    __lookup 217 - -217
    __lookup_tag 242 - -242
    radix_tree_locate_item 279 - -279

    bloat-o-meter: i386

    add/remove: 3/3 grow/shrink: 8/9 up/down: 1075/-1275 (-200)
    function old new delta
    radix_tree_next_chunk - 757 +757
    shmem_unuse 352 449 +97
    find_get_pages_contig 269 322 +53
    shmem_radix_tree_replace 113 154 +41
    find_get_pages_tag 277 318 +41
    dcache_dir_lseek 426 458 +32
    __kstrtab_radix_tree_next_chunk - 22 +22
    vc_do_resize 968 977 +9
    snd_pcm_lib_read1 725 733 +8
    __ksymtab_radix_tree_next_chunk - 8 +8
    netlbl_cipsov4_list 1120 1127 +7
    find_get_pages 293 291 -2
    new_slab 467 459 -8
    bitfill_unaligned_rev 425 417 -8
    radix_tree_gang_lookup_tag_slot 177 146 -31
    blk_dump_cmd 267 229 -38
    radix_tree_gang_lookup_slot 212 134 -78
    shmem_unlock_mapping 221 128 -93
    radix_tree_gang_lookup_tag 275 162 -113
    radix_tree_gang_lookup 255 126 -129
    __lookup 227 - -227
    __lookup_tag 271 - -271
    radix_tree_locate_item 277 - -277

    This patch:

    Implement a clean, simple and effective radix-tree iteration routine.

    Iteration is divided into two phases:
    * look up the next chunk in a radix-tree leaf node
    * iterate through the slots in this chunk

    The main iterator function, radix_tree_next_chunk(), returns a pointer
    to the first slot and stores the index of the next-to-last slot in
    struct radix_tree_iter. For tagged iteration it also constructs a
    bitmask of tags for the returned chunk. All additional logic is
    implemented as static-inline functions and macros.

    This also adds radix_tree_find_next_bit(), a static-inline variant of
    find_next_bit() optimized for small constant-size arrays, because
    find_next_bit() is too heavy for searching in an array with one or two
    long elements.

    [akpm@linux-foundation.org: rework comments a bit]
    Signed-off-by: Konstantin Khlebnikov
    Tested-by: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • If CONFIG_NET_NS, CONFIG_UTS_NS and CONFIG_IPC_NS are disabled,
    ns_entries[] becomes empty and things like
    ns_entries[ARRAY_SIZE(ns_entries) - 1] will explode.

    Reported-by: Richard Weinberger
    Cc: "Eric W. Biederman"
    Cc: Daniel Lezcano
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Rename the nbd_device variable from "lo" to "nbd", since "lo" is just a
    name copied from loop.c.

    Signed-off-by: Wanlong Gao
    Cc: Paul Clements
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanlong Gao
     
  • In the case of a child pid namespace, rebooting the system does not
    really make sense. When the pid namespace is used in conjunction with
    the other namespaces in order to create a linux container, the reboot
    syscall leads to some problems.

    A container can reboot the host. That can be fixed by dropping the
    sys_reboot capability, but then we are unable to correctly poweroff/
    halt/reboot a container, and the container stays stuck at shutdown time
    with the container's init process waiting indefinitely.

    After several attempts, no solution from userspace was found to reliably
    handle the shutdown from a container.

    This patch proposes to make the init process of the child pid namespace
    exit with a signal status set to: SIGINT if the child pid namespace
    called "halt/poweroff" and SIGHUP if the child pid namespace called
    "reboot". When the reboot syscall is called and we are not in the
    initial pid namespace, we kill the pid namespace for "HALT", "POWEROFF",
    "RESTART", and "RESTART2". Otherwise we return EINVAL.

    Returning EINVAL is also an easy way to check whether this feature is
    supported by the kernel when invoking another 'reboot' option like CAD.

    This way the parent process of the child pid namespace knows whether it
    rebooted or not and can take the right decision.

    Test case:
    ==========

    #include <alloca.h>
    #include <stdio.h>
    #include <sched.h>
    #include <unistd.h>
    #include <signal.h>
    #include <sys/reboot.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    #include <linux/reboot.h>

    static int do_reboot(void *arg)
    {
            int *cmd = arg;

            if (reboot(*cmd))
                    printf("failed to reboot(%d): %m\n", *cmd);
            return 0;
    }

    int test_reboot(int cmd, int sig)
    {
            long stack_size = 4096;
            void *stack = alloca(stack_size) + stack_size;
            int status;
            pid_t ret;

            ret = clone(do_reboot, stack, CLONE_NEWPID | SIGCHLD, &cmd);
            if (ret < 0) {
                    printf("failed to clone: %m\n");
                    return -1;
            }

            if (wait(&status) < 0) {
                    printf("unexpected wait error: %m\n");
                    return -1;
            }

            if (!WIFSIGNALED(status)) {
                    printf("child process exited but was not signaled\n");
                    return -1;
            }

            if (WTERMSIG(status) != sig) {
                    printf("signal termination is not the one expected\n");
                    return -1;
            }

            return 0;
    }

    int main(int argc, char *argv[])
    {
            int status;

            status = test_reboot(LINUX_REBOOT_CMD_RESTART, SIGHUP);
            if (status < 0)
                    return 1;
            printf("reboot(LINUX_REBOOT_CMD_RESTART) succeeded\n");

            status = test_reboot(LINUX_REBOOT_CMD_RESTART2, SIGHUP);
            if (status < 0)
                    return 1;
            printf("reboot(LINUX_REBOOT_CMD_RESTART2) succeeded\n");

            status = test_reboot(LINUX_REBOOT_CMD_HALT, SIGINT);
            if (status < 0)
                    return 1;
            printf("reboot(LINUX_REBOOT_CMD_HALT) succeeded\n");

            status = test_reboot(LINUX_REBOOT_CMD_POWER_OFF, SIGINT);
            if (status < 0)
                    return 1;
            printf("reboot(LINUX_REBOOT_CMD_POWER_OFF) succeeded\n");

            status = test_reboot(LINUX_REBOOT_CMD_CAD_ON, -1);
            if (status >= 0) {
                    printf("reboot(LINUX_REBOOT_CMD_CAD_ON) should have failed\n");
                    return 1;
            }
            printf("reboot(LINUX_REBOOT_CMD_CAD_ON) has failed as expected\n");

            return 0;
    }

    [akpm@linux-foundation.org: tweak and add comments]
    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Daniel Lezcano
    Acked-by: Serge Hallyn
    Tested-by: Serge Hallyn
    Reviewed-by: Oleg Nesterov
    Cc: Michael Kerrisk
    Cc: "Eric W. Biederman"
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Lezcano
     
  • Use bitmap_set() instead of using set_bit() for each bit. This conversion
    is valid because the bitmap is private in the function call and atomic
    bitops were unnecessary.

    This also includes a minor change:
    - Use bitmap_copy() for shorter typing

    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • The IPMI watchdog timer clears or extends the timer on reboot/shutdown.
    It was using the non-locking routine for setting the watchdog timer, but
    this was causing race conditions. Instead, use the locking version to
    avoid the races. It seems to work fine.

    Signed-off-by: Corey Minyard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Corey Minyard
     
  • Now that the IPMI driver is using a tasklet, we can simplify the
    locking in the driver and get rid of the message lock.

    Signed-off-by: Corey Minyard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Corey Minyard
     
  • The part of the IPMI driver that delivered panic information to the
    event log and extended the watchdog timeout during a panic was not
    properly handling the messages. It used static messages to avoid
    allocation, but wasn't properly waiting for them or properly handling
    the refcounts.

    Signed-off-by: Corey Minyard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Corey Minyard
     
  • The IPMI driver would release a lock, deliver a message, then relock.
    This is obviously ugly, and this patch converts the message handler
    interface to use a tasklet to schedule work. This lets the receive
    handler be called from an interrupt handler with interrupts enabled.

    Signed-off-by: Corey Minyard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Corey Minyard
     
  • We currently time out and retry KCS transactions after 1 second of waiting
    for IBF or OBF. This appears to be too short for some hardware. The IPMI
    spec says "All system software wait loops should include error timeouts.
    For simplicity, such timeouts are not shown explicitly in the flow
    diagrams. A five-second timeout or greater is recommended". Change the
    timeout to five seconds to satisfy the slow hardware.

    Signed-off-by: Matthew Garrett
    Signed-off-by: Corey Minyard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Garrett
     
  • Call the event handler immediately after starting the next message.

    This change considerably decreases the IPMI transaction time (cuts off
    ~9ms for a single ipmitool transaction).

    Signed-off-by: Srinivas_Gowda
    Signed-off-by: Corey Minyard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Srinivas_Gowda
     
  • The crashkernel reservation needs to know the total memory size. The
    current get_total_mem() simply uses max_pfn - min_low_pfn, which is
    wrong because it includes memory holes in the middle.

    Especially for a kvm guest with memory > 0xe0000000, qemu splits the
    memory as below:

    if (ram_size >= 0xe0000000) {
            above_4g_mem_size = ram_size - 0xe0000000;
            below_4g_mem_size = 0xe0000000;
    } else {
            below_4g_mem_size = ram_size;
    }

    So for a 4G-memory guest, seabios will insert a 512M usable region
    beyond 4G. Thus, in the above case, max_pfn - min_low_pfn will be
    larger than the original memory size.

    Fix this issue by using memblock_phys_mem_size() to get the total
    memory size.

    Signed-off-by: Dave Young
    Reviewed-by: WANG Cong
    Reviewed-by: Simon Horman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Young
     
  • When using crashkernel=2M-256M, the kernel doesn't give any warning.
    This is sometimes misleading.

    Signed-off-by: Zhenzhong Duan
    Acked-by: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhenzhong Duan
     
  • nommu platforms don't have very interesting swapper_pg_dir pointers and
    usually just #define them to NULL, meaning that we can't include them in
    the vmcoreinfo on the kexec crash path.

    This patch only saves the swapper_pg_dir if we have an MMU.

    Signed-off-by: Will Deacon
    Reviewed-by: Simon Horman
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Will Deacon
     
  • This has been marked as obsolete for quite a while now, so it is time
    to remove it altogether. While doing this, get rid of first_cpu() as
    well. Also, remove the redundant setting of cpu_online_mask in
    smp_prepare_cpus(), because the generic code would have already set
    cpu 0 in cpu_online_mask.

    Reported-by: Tony Luck
    Signed-off-by: Srivatsa S. Bhat
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Srivatsa S. Bhat
     
  • __any_online_cpu() is not optimal and also unnecessary, so replace its
    uses with faster cpumask_* operations.

    Signed-off-by: Srivatsa S. Bhat
    Cc: Eric Dumazet
    Cc: Venkatesh Pallipadi
    Cc: Rusty Russell
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Srivatsa S. Bhat
     
  • Calculate a cpumask of CPUs with per-cpu pages in any zone and only send
    an IPI requesting CPUs to drain these pages to the buddy allocator if they
    actually have pages when asked to flush.

    This patch saves 85%+ of the IPIs asking to drain per-cpu pages in the
    case of severe memory pressure that leads to OOM, since in these cases
    multiple, possibly concurrent, allocation requests end up in the direct
    reclaim code path. The per-cpu pages are reclaimed on the first
    allocation failure, so for most of the subsequent allocation attempts,
    until the memory pressure is off (possibly via the OOM killer), there
    are no per-cpu pages on most CPUs (and there can easily be hundreds of
    them).

    This also has the side effect of shortening the average latency of direct
    reclaim by 1 or more order of magnitude since waiting for all the CPUs to
    ACK the IPI takes a long time.

    Tested by running "hackbench 400" on an 8-CPU x86 VM and observing the
    difference between the number of direct reclaim attempts that end up in
    drain_all_pages() and those where more than 1/2 of the online CPUs had
    any per-cpu pages, using the vmstat counters introduced in the next
    patch in the series and /proc/interrupts.

    In the test scenario, this was seen to save around 3600 global
    IPIs after triggering an OOM on a concurrent workload:

    $ cat /proc/vmstat | tail -n 2
    pcp_global_drain 0
    pcp_global_ipi_saved 0

    $ cat /proc/interrupts | grep CAL
    CAL: 1 2 1 2
    2 2 2 2 Function call interrupts

    $ hackbench 400
    [OOM messages snipped]

    $ cat /proc/vmstat | tail -n 2
    pcp_global_drain 3647
    pcp_global_ipi_saved 3642

    $ cat /proc/interrupts | grep CAL
    CAL: 6 13 6 3
    3 3 1 2 7 Function call interrupts

    Please note that if the global drain is removed from the direct reclaim
    path as a patch from Mel Gorman currently suggests this should be replaced
    with an on_each_cpu_cond invocation.

    Signed-off-by: Gilad Ben-Yossef
    Acked-by: Mel Gorman
    Cc: KOSAKI Motohiro
    Acked-by: Christoph Lameter
    Acked-by: Peter Zijlstra
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Andi Kleen
    Acked-by: Michal Nazarewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gilad Ben-Yossef
     
  • In several code paths, such as when unmounting a file system (but not
    only), we send an IPI to ask each cpu to invalidate its local LRU BHs.

    On multi-core systems, many cpus may not have any LRU BHs because they
    are idle, or because they have not performed any file system accesses
    since the last invalidation (e.g. CPUs crunching on high-performance
    computing nodes that write results to shared memory, or ones only using
    filesystems that do not use the bh layer). This can lead to a loss of
    performance each time someone switches the KVM (the virtual keyboard
    and screen type, not the hypervisor) if it has a USB storage device
    plugged in.

    This patch attempts to only send an IPI to cpus that have LRU BHs.

    Signed-off-by: Gilad Ben-Yossef
    Acked-by: Peter Zijlstra
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gilad Ben-Yossef
     
  • flush_all() is called for each kmem_cache_destroy(). So every cache
    being destroyed dynamically ends up sending an IPI to each CPU in the
    system, regardless of whether the cache has ever been used there.

    For example, if you close the Infiniband ipath driver char device file,
    the close file ops calls kmem_cache_destroy(). So running some
    infiniband config tool on a single CPU dedicated to system tasks might
    interrupt the rest of the 127 CPUs dedicated to some CPU-intensive or
    latency-sensitive task.

    I suspect there is a good chance that every line in the output of "git
    grep kmem_cache_destroy linux/ | grep '\->'" has a similar scenario.

    This patch attempts to rectify this issue by sending an IPI to flush
    the per-cpu objects back to the free lists only to CPUs that seem to
    have such objects.

    The check of which CPUs to IPI is racy, but we don't care, since asking
    a CPU without per-cpu objects to flush does no damage; and as far as I
    can tell flush_all() by itself is racy against allocs on remote CPUs
    anyway, so if you required flush_all() to be deterministic, you had to
    arrange for locking regardless.

    Without this patch the following artificial test case:

    $ cd /sys/kernel/slab
    $ for DIR in *; do cat $DIR/alloc_calls > /dev/null; done

    produces 166 IPIs on a cpuset-isolated CPU. With it, it produces none.

    The code path of memory allocation failure for the
    CONFIG_CPUMASK_OFFSTACK=y config was tested using the fault injection
    framework.

    Signed-off-by: Gilad Ben-Yossef
    Acked-by: Christoph Lameter
    Cc: Chris Metcalf
    Acked-by: Peter Zijlstra
    Cc: Frederic Weisbecker
    Cc: Russell King
    Cc: Pekka Enberg
    Cc: Matt Mackall
    Cc: Sasha Levin
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Mel Gorman
    Cc: Alexander Viro
    Cc: Avi Kivity
    Cc: Michal Nazarewicz
    Cc: Kosaki Motohiro
    Cc: Milton Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gilad Ben-Yossef
     
  • Add the on_each_cpu_cond() function, which wraps on_each_cpu_mask() and
    calculates the cpumask of cpus to IPI by calling a function supplied as
    a parameter in order to determine whether to IPI each specific cpu.

    The function works around allocation failure of the cpumask variable in
    the CONFIG_CPUMASK_OFFSTACK=y case by iterating over the cpus, sending
    an IPI one at a time via smp_call_function_single().

    The function is useful since it allows separating the specific code
    that decides, in each case, whether to IPI a specific cpu for a
    specific request from the common boilerplate code of creating the mask,
    handling failures, etc.

    [akpm@linux-foundation.org: s/gfpflags/gfp_flags/]
    [akpm@linux-foundation.org: avoid double-evaluation of `info' (per Michal), parenthesise evaluation of `cond_func']
    [akpm@linux-foundation.org: s/CPU/CPUs, use all 80 cols in comment]
    Signed-off-by: Gilad Ben-Yossef
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Acked-by: Peter Zijlstra
    Cc: Frederic Weisbecker
    Cc: Russell King
    Cc: Pekka Enberg
    Cc: Matt Mackall
    Cc: Sasha Levin
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Alexander Viro
    Cc: Avi Kivity
    Acked-by: Michal Nazarewicz
    Cc: Kosaki Motohiro
    Cc: Milton Miller
    Reviewed-by: "Srivatsa S. Bhat"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gilad Ben-Yossef
     
  • We have lots of infrastructure in place to partition multi-core systems
    such that we have a group of CPUs that are dedicated to specific task:
    cgroups, scheduler and interrupt affinity, and the isolcpus= boot
    parameter.
    Still, kernel code will at times interrupt all CPUs in the system via IPIs
    for various needs. These IPIs are useful and cannot be avoided
    altogether, but in certain cases it is possible to interrupt only specific
    CPUs that have useful work to do and not the entire system.

    This patch set, inspired by discussions with Peter Zijlstra and Frederic
    Weisbecker when testing the nohz task patch set, is a first stab at trying
    to explore doing this by locating the places where such global IPI calls
    are being made and turning the global IPI into an IPI for a specific group
    of CPUs. The purpose of the patch set is to get feedback if this is the
    right way to go for dealing with this issue and indeed, if the issue is
    even worth dealing with at all. Based on the feedback from this patch set
    I plan to offer further patches that address similar issue in other code
    paths.

    This patch creates an on_each_cpu_mask() and on_each_cpu_cond()
    infrastructure API (the former derived from existing arch-specific
    versions in Tile and Arm) and uses them to turn several global IPI
    invocations into per-CPU-group invocations.

    Core kernel:

    on_each_cpu_mask() calls a function on processors specified by cpumask,
    which may or may not include the local processor.

    You must not call this function with disabled interrupts or from a
    hardware interrupt handler or from a bottom half handler.

    arch/arm:

    Note that the generic version is a little different from the Arm one:

    1. It has the mask as its first parameter
    2. It calls the function on the calling CPU with interrupts disabled,
    but this should be OK since the function is called on the other CPUs
    with interrupts disabled anyway.

    arch/tile:

    The API is the same as the tile private one, but the generic version
    also calls the function on the local CPU with interrupts disabled in
    the UP case.

    This is OK since the function is called on the other CPUs
    with interrupts disabled.

    Signed-off-by: Gilad Ben-Yossef
    Reviewed-by: Christoph Lameter
    Acked-by: Chris Metcalf
    Acked-by: Peter Zijlstra
    Cc: Frederic Weisbecker
    Cc: Russell King
    Cc: Pekka Enberg
    Cc: Matt Mackall
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Sasha Levin
    Cc: Mel Gorman
    Cc: Alexander Viro
    Cc: Avi Kivity
    Acked-by: Michal Nazarewicz
    Cc: Kosaki Motohiro
    Cc: Milton Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gilad Ben-Yossef
     
  • Most system calls taking flags first check that the flags passed in are
    valid, and that helps userspace to detect when new flags are supported.

    But swapon never did so: start checking now, to help if we ever want to
    support more swap_flags in future.

    It's difficult to get stray bits set in an int, and swapon is not widely
    used, so this is most unlikely to break any userspace; but we can just
    revert if it turns out to do so.
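
    The pattern being adopted can be sketched like this (the flag values
    mirror the swapon flags of this era, but the helper name and the plain
    error constant are invented for the example, not the kernel code):
    reject any bit outside the known set, so userspace probing for a new
    flag gets a clean error instead of silent acceptance.

```c
#include <assert.h>

/* Illustrative sketch of a swapon-style flags check; names and the
 * error constant are made up for this example. */
#define SWAP_FLAG_PREFER    0x8000    /* set if a swap priority is given */
#define SWAP_FLAG_PRIO_MASK 0x7fff
#define SWAP_FLAG_DISCARD   0x10000

#define MY_EINVAL 22

static int check_swap_flags(int swap_flags)
{
    /* any bit outside the known flags means the caller is asking for
     * something this kernel does not support yet */
    if (swap_flags & ~(SWAP_FLAG_PREFER | SWAP_FLAG_PRIO_MASK |
                       SWAP_FLAG_DISCARD))
        return -MY_EINVAL;
    return 0;
}
```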

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The size of coredump files is limited by RLIMIT_CORE, however, allocating
    large amounts of memory results in three negative consequences:

    - the coredumping process may be chosen for oom kill and quickly deplete
    all memory reserves in oom conditions preventing further progress from
    being made or tasks from exiting,

    - the coredumping process may cause other processes to be oom killed
    without fault of their own as the result of a SIGSEGV, for example, in
    the coredumping process, or

    - the coredumping process may result in a livelock while writing to the
    dump file if it needs memory to allocate while other threads are in
    the exit path waiting on the coredumper to complete.

    This is fixed by implying __GFP_NORETRY in the page allocator for
    coredumping processes when reclaim has failed so the allocations fail and
    the process continues to exit.
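
    In miniature, the decision reads like the following sketch (the flag
    value and helper name are invented for illustration; the real logic
    lives in the page allocator's slow path):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the fix; MY_GFP_NORETRY and coredump_gfp() are made-up
 * names. Once direct reclaim has already failed, a task that is dumping
 * core gets the no-retry behaviour implied, so its allocation fails
 * promptly instead of looping and draining memory reserves. */
#define MY_GFP_NORETRY 0x1000u

static unsigned int coredump_gfp(unsigned int gfp_mask,
                                 bool dumping_core, bool reclaim_failed)
{
    if (dumping_core && reclaim_failed)
        gfp_mask |= MY_GFP_NORETRY;
    return gfp_mask;
}
```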

    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • pmd_trans_unstable() should be called before pmd_offset_map() in the
    locations where the mmap_sem is held for reading.

    Signed-off-by: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Larry Woodman
    Cc: Ulrich Obergfell
    Cc: Rik van Riel
    Cc: Mark Salter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Holepunching filesystems ext4 and xfs are using truncate_inode_pages_range
    but forgetting to unmap pages first (ocfs2 remembers). This is not really
    a bug, since races already require truncate_inode_page() to handle that
    case once the page is locked; but it can be very inefficient if the file
    being punched happens to be mapped into many vmas.

    Provide a drop-in replacement truncate_pagecache_range() which does the
    unmapping pass first, handling the awkward mismatch between arguments to
    truncate_inode_pages_range() and arguments to unmap_mapping_range().

    Note that holepunching does not unmap privately COWed pages in the range:
    POSIX requires that we do so when truncating, but it's hard to justify,
    difficult to implement without an i_size cutoff, and no filesystem is
    attempting to implement it.
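
    The "awkward mismatch" is that truncate_inode_pages_range() takes an
    inclusive byte range while unmap_mapping_range() takes a page-aligned
    start plus a length. A sketch of the translation (PAGE_SIZE fixed at
    4096 here; the helper is illustrative, not the kernel function): only
    whole pages lying entirely inside the hole are unmapped, and partial
    pages at the edges are left for the truncate pass to deal with.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL

/* Illustrative translation from an inclusive byte range [lstart, lend]
 * (truncate_inode_pages_range() style) to a page-aligned start + length
 * pair (unmap_mapping_range() style). Round the start up and the
 * exclusive end down to page boundaries. */
static void hole_to_unmap_args(uint64_t lstart, uint64_t lend,
                               uint64_t *holebegin, uint64_t *holelen)
{
    uint64_t begin = (lstart + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);
    uint64_t end   = (lend + 1) & ~(PAGE_SIZE - 1);

    *holebegin = begin;
    *holelen   = end > begin ? end - begin : 0;  /* 0: no whole page inside */
}
```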

    Signed-off-by: Hugh Dickins
    Cc: "Theodore Ts'o"
    Cc: Andreas Dilger
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Ben Myers
    Cc: Alex Elder
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • bda7bad62bc4 ("procfs: speed up /proc/pid/stat, statm") broke /proc/statm
    - 'text' is printed twice by mistake.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reported-by: Ulrich Drepper
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Pull trivial writeback fixes from Wu Fengguang:
    "They've been tested in linux-next for 20 days actually."

    * tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: Remove outdated comment
    fs: Remove bogus wait in write_inode_now()

    Linus Torvalds
     
  • Pull ext4 updates for 3.4 from Ted Ts'o:
    "Ext4 commits for 3.3 merge window; mostly cleanups and bug fixes

    The changes to export dirty_writeback_interval are from Artem's s_dirt
    cleanup patch series. The same is true of the change to remove the
    s_dirt helper functions which never got used by anyone in-tree. I've
    run these changes by Al Viro, and am carrying them so that Artem can
    more easily fix up the rest of the file systems during the next merge
    window. (Originally we had hoped to remove the use of s_dirt from
    ext4 during this merge window, but his patches had some bugs, so I
    ultimately ended dropping them from the ext4 tree.)"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (66 commits)
    vfs: remove unused superblock helpers
    mm: export dirty_writeback_interval
    ext4: remove useless s_dirt assignment
    ext4: write superblock only once on unmount
    ext4: do not mark superblock as dirty unnecessarily
    ext4: correct ext4_punch_hole return codes
    ext4: remove restrictive checks for EOFBLOCKS_FL
    ext4: always set then trimmed blocks count into len
    ext4: fix trimmed block count accounting
    ext4: fix start and len arguments handling in ext4_trim_fs()
    ext4: update s_free_{inodes,blocks}_count during online resize
    ext4: change some printk() calls to use ext4_msg() instead
    ext4: avoid output message interleaving in ext4_error_()
    ext4: remove trailing newlines from ext4_msg() and ext4_error() messages
    ext4: add no_printk argument validation, fix fallout
    ext4: remove redundant "EXT4-fs: " from uses of ext4_msg
    ext4: give more helpful error message in ext4_ext_rm_leaf()
    ext4: remove unused code from ext4_ext_map_blocks()
    ext4: rewrite punch hole to use ext4_ext_remove_space()
    jbd2: cleanup journal tail after transaction commit
    ...

    Linus Torvalds
     
  • Pull Ceph updates for 3.4-rc1 from Sage Weil:
    "Alex has been busy. There are a range of rbd and libceph cleanups,
    especially surrounding device setup and teardown, and a few critical
    fixes in that code. There are more cleanups in the messenger code,
    virtual xattrs, a fix for CRC calculation/checks, and lots of other
    miscellaneous stuff.

    There's a patch from Amon Ott to make inos behave a bit better on
    32-bit boxes, some decode check fixes from Xi Wang, a network
    throttling fix from Jim Schutt, and a couple of RBD fixes from Josh
    Durgin.

    No new functionality, just a lot of cleanup and bug fixing."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (65 commits)
    rbd: move snap_rwsem to the device, rename to header_rwsem
    ceph: fix three bugs, two in ceph_vxattrcb_file_layout()
    libceph: isolate kmap() call in write_partial_msg_pages()
    libceph: rename "page_shift" variable to something sensible
    libceph: get rid of zero_page_address
    libceph: only call kernel_sendpage() via helper
    libceph: use kernel_sendpage() for sending zeroes
    libceph: fix inverted crc option logic
    libceph: some simple changes
    libceph: small refactor in write_partial_kvec()
    libceph: do crc calculations outside loop
    libceph: separate CRC calculation from byte swapping
    libceph: use "do" in CRC-related Boolean variables
    ceph: ensure Boolean options support both senses
    libceph: a few small changes
    libceph: make ceph_tcp_connect() return int
    libceph: encapsulate some messenger cleanup code
    libceph: make ceph_msgr_wq private
    libceph: encapsulate connection kvec operations
    libceph: move prepare_write_banner()
    ...

    Linus Torvalds
     
  • Pull ext3, UDF, and quota fixes from Jan Kara:
    "A couple of ext3 & UDF fixes and also one improvement in quota
    locking."

    * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
    ext3: fix start and len arguments handling in ext3_trim_fs()
    udf: Fix deadlock in udf_release_file()
    udf: Fix file entry logicalBlocksRecorded
    udf: Fix handling of i_blocks
    quota: Make quota code not call tty layer with dqptr_sem held
    udf: Init/maintain file entry checkpoint field
    ext3: Update ctime in ext3_splice_branch() only when needed
    ext3: Don't call dquot_free_block() if we don't update anything
    udf: Remove unnecessary OOM messages

    Linus Torvalds
     
  • Pull 9p changes for the 3.4 merge window from Eric Van Hensbergen.

    * tag 'for-linus-3.4-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
    9p: statfs should not override server f_type
    net/9p: handle flushed Tclunk/Tremove
    net/9p: don't allow Tflush to be interrupted

    Linus Torvalds
     
  • In d_materialise_unique() there are 3 subcases to the 'aliased dentry'
    case; in two subcases the inode i_lock is properly released but this
    does not occur in the -ELOOP subcase.

    This seems to have been introduced by commit 1836750115f2 ("fix loop
    checks in d_materialise_unique()").
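
    The bug class here is the familiar one-error-path-forgets-the-unlock
    shape; a toy model of it (a plain flag stands in for the inode i_lock,
    and none of the names below are the real dcache code):

```c
#include <assert.h>

/* Toy model of the locking bug: every return taken after the lock is
 * acquired must drop it, including the error path. MY_ELOOP and
 * resolve_alias() are invented names. */
#define MY_ELOOP 40

static int lock_held;   /* stands in for the inode i_lock */

static int resolve_alias(int is_loop)
{
    int err = 0;

    lock_held = 1;            /* spin_lock(&inode->i_lock) */
    if (is_loop)
        err = -MY_ELOOP;      /* the buggy version returned from here,
                                 still holding the lock */
    lock_held = 0;            /* spin_unlock(&inode->i_lock) */
    return err;
}
```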

    Signed-off-by: Michel Lespinasse
    Cc: stable@vger.kernel.org # v3.0+
    [ Added a comment, and moved the unlock to where we generate the -ELOOP,
    which seems to be more natural.

    You probably can't actually trigger this without a buggy network file
    server - d_materialize_unique() is for finding aliases on non-local
    filesystems, and the d_ancestor() case is for a hardlinked directory
    loop.

    But we should be robust in the case of such buggy servers anyway. ]
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

28 Mar, 2012

3 commits

  • Pull s390 patches part 2 from Martin Schwidefsky:
    "Some minor improvements and one additional feature for the 3.4 merge
    window: Hendrik added perf support for the s390 CPU counters."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
    [S390] register cpu devices for SMP=n
    [S390] perf: add support for s390x CPU counters
    [S390] oprofile: Allow multiple users of the measurement alert interrupt
    [S390] qdio: log all adapter characteristics
    [S390] Remove unncessary export of arch_pick_mmap_layout

    Linus Torvalds
     
  • Pull UML changes from Richard Weinberger:
    "Mostly bug fixes and cleanups"

    * 'for-linus-3.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml: (35 commits)
    um: Update defconfig
    um: Switch to large mcmodel on x86_64
    MTD: Relax dependencies
    um: Wire CONFIG_GENERIC_IO up
    um: Serve io_remap_pfn_range()
    Introduce CONFIG_GENERIC_IO
    um: allow SUBARCH=x86
    um: most of the SUBARCH uses can be killed
    um: deadlock in line_write_interrupt()
    um: don't bother trying to rebuild CHECKFLAGS for USER_OBJS
    um: use the right ifdef around exports in user_syms.c
    um: a bunch of headers can be killed by using generic-y
    um: ptrace-generic.h doesn't need user.h
    um: kill HOST_TASK_PID
    um: remove pointless include of asm/fixmap.h from asm/pgtable.h
    um: asm-offsets.h might as well come from underlying arch...
    um: merge processor_{32,64}.h a bit...
    um: switch close_chan() to struct line
    um: race fix: initialize delayed_work *before* registering IRQ
    um: line->have_irq is never checked...
    ...

    Linus Torvalds
     
  • Pull arch/microblaze fixes from Michal Simek

    * 'next' of git://git.monstr.eu/linux-2.6-microblaze:
    microblaze: Handle TLB skip size dynamically
    microblaze: Introduce TLB skip size
    microblaze: Improve TLB calculation for small systems
    microblaze: Extend space for compiled-in FDT to 32kB
    microblaze: Clear all MSR flags on the first kernel instruction
    microblaze: Use node name instead of compatible string
    microblaze: Fix mapin_ram function
    microblaze: Highmem support
    microblaze: Use active regions
    microblaze: Show more detailed information about memory
    microblaze: Introduce fixmap
    microblaze: mm: Fix lowmem max memory size limits
    microblaze: mm: Use ZONE_DMA instead of ZONE_NORMAL
    microblaze: trivial: Fix typo fault in timer.c
    microblaze: Use vsprintf extention %pf with builtin_return_address
    microblaze: Add PVR version string for MB 8.20.b and 8.30.a
    microblaze: Fix makefile to work with latest toolchain
    microblaze: Fix typo in early_printk.c

    Linus Torvalds