29 Mar, 2012

37 commits

  • tools/ is the better place for vm tools which are used by many people.
    Moving them to tools also makes them available to more users, instead
    of hiding them in the Documentation folder.

    This patch moves page-types.c to tools/vm/page-types.c. It also adds a
    Makefile in tools/vm and fixes two coding style problems: a) change a
    const array to 'const char * const', b) change spaces to a tab for
    indentation.

    Signed-off-by: Dave Young
    Acked-by: Wu Fengguang
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Frederic Weisbecker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Young
     
  • So a "make run_tests" will build the tests before trying to run them.

    Acked-by: Frederic Weisbecker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Remove the run_tests script and launch the selftests by calling "make
    run_tests" from the selftests top directory instead. This delegates to
    the Makefile in each selftest directory, where it is decided how to launch
    the local test.

    This removes the need to add each selftest directory to the now removed
    "run_tests" top script.

    Signed-off-by: Frederic Weisbecker
    Cc: Dave Young
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Frederic Weisbecker
     
  • Replace radix_tree_gang_lookup_slot() and
    radix_tree_gang_lookup_tag_slot() in the page-cache lookup functions
    with the brand-new radix-tree direct iterator. This avoids the double
    scanning and pointer copying.

    The iterator doesn't stop after nr_pages page-get failures in a row; it
    continues the lookup till the end of the radix tree. Thus we can safely
    remove these restart conditions.

    Unfortunately, the old implementation didn't forbid nr_pages == 0; this
    corner case does not fit into the new code, so the patch adds an extra
    check at the beginning.

    Signed-off-by: Konstantin Khlebnikov
    Tested-by: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Rewrite radix_tree_gang_lookup_* functions using the new radix-tree
    iterator.

    Signed-off-by: Konstantin Khlebnikov
    Tested-by: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • A series of radix tree cleanups, and usage of them in the core pagecache
    code.

    Micro-benchmark:

    lookup 14 slots (typical page-vector size)
    in a radix tree where each slot is filled and tagged
    before/after - nsec per full scan through the tree

    * Intel Sandy Bridge i7-2620M 4Mb L3
    New code always faster

    * AMD Athlon 6000+ 2x1Mb L2, without L3
    New code generally faster,
    Minor degradation (marked with "*") for huge sparse trees

    * i386 on Sandy Bridge
    New code faster for common cases: tagged and dense trees.
    Some degradations for non-tagged lookup on sparse trees.

    Ideally, an __ffs() analog for finding the first non-zero long element
    in an array would help here; gcc sometimes cannot optimize this loop
    correctly.

    Numbers:

    CPU: Intel Sandy Bridge i7-2620M 4Mb L3

    radix-tree with 1024 slots:

    tagged lookup

    step 1 before 7156 after 3613
    step 2 before 5399 after 2696
    step 3 before 4779 after 1928
    step 4 before 4456 after 1429
    step 5 before 4292 after 1213
    step 6 before 4183 after 1052
    step 7 before 4157 after 951
    step 8 before 4016 after 812
    step 9 before 3952 after 851
    step 10 before 3937 after 732
    step 11 before 4023 after 709
    step 12 before 3872 after 657
    step 13 before 3892 after 633
    step 14 before 3720 after 591
    step 15 before 3879 after 578
    step 16 before 3561 after 513

    normal lookup

    step 1 before 4266 after 3301
    step 2 before 2695 after 2129
    step 3 before 2083 after 1712
    step 4 before 1801 after 1534
    step 5 before 1628 after 1313
    step 6 before 1551 after 1263
    step 7 before 1475 after 1185
    step 8 before 1432 after 1167
    step 9 before 1373 after 1092
    step 10 before 1339 after 1134
    step 11 before 1292 after 1056
    step 12 before 1319 after 1030
    step 13 before 1276 after 1004
    step 14 before 1256 after 987
    step 15 before 1228 after 992
    step 16 before 1247 after 999

    radix-tree with 1024*1024*128 slots:

    tagged lookup

    step 1 before 1086102841 after 674196409
    step 2 before 816839155 after 498138306
    step 7 before 599728907 after 240676762
    step 15 before 555729253 after 185219677
    step 63 before 606637748 after 128585664
    step 64 before 608384432 after 102945089
    step 65 before 596987114 after 123996019
    step 128 before 304459225 after 56783056
    step 256 before 158846855 after 31232481
    step 512 before 86085652 after 18950595
    step 12345 before 6517189 after 1674057

    normal lookup

    step 1 before 626064869 after 544418266
    step 2 before 418809975 after 336321473
    step 7 before 242303598 after 207755560
    step 15 before 208380563 after 176496355
    step 63 before 186854206 after 167283638
    step 64 before 176188060 after 170143976
    step 65 before 185139608 after 167487116
    step 128 before 88181865 after 86913490
    step 256 before 45733628 after 45143534
    step 512 before 24506038 after 23859036
    step 12345 before 2177425 after 2018662

    * AMD Athlon 6000+ 2x1Mb L2, without L3

    radix-tree with 1024 slots:

    tag-lookup

    step 1 before 8164 after 5379
    step 2 before 5818 after 5581
    step 3 before 4959 after 4213
    step 4 before 4371 after 3386
    step 5 before 4204 after 2997
    step 6 before 4950 after 2744
    step 7 before 4598 after 2480
    step 8 before 4251 after 2288
    step 9 before 4262 after 2243
    step 10 before 4175 after 2131
    step 11 before 3999 after 2024
    step 12 before 3979 after 1994
    step 13 before 3842 after 1929
    step 14 before 3750 after 1810
    step 15 before 3735 after 1810
    step 16 before 3532 after 1660

    normal-lookup

    step 1 before 7875 after 5847
    step 2 before 4808 after 4071
    step 3 before 4073 after 3462
    step 4 before 3677 after 3074
    step 5 before 4308 after 2978
    step 6 before 3911 after 3807
    step 7 before 3635 after 3522
    step 8 before 3313 after 3202
    step 9 before 3280 after 3257
    step 10 before 3166 after 3083
    step 11 before 3066 after 3026
    step 12 before 2985 after 2982
    step 13 before 2925 after 2924
    step 14 before 2834 after 2808
    step 15 before 2805 after 2803
    step 16 before 2647 after 2622

    radix-tree with 1024*1024*128 slots:

    tag-lookup

    step 1 before 1288059720 after 951736580
    step 2 before 961292300 after 884212140
    step 7 before 768905140 after 547267580
    step 15 before 771319480 after 456550640
    step 63 before 504847640 after 242704304
    step 64 before 392484800 after 177920786
    step 65 before 491162160 after 246895264
    step 128 before 208084064 after 97348392
    step 256 before 112401035 after 51408126
    step 512 before 75825834 after 29145070
    step 12345 before 5603166 after 2847330

    normal-lookup

    step 1 before 1025677120 after 861375100
    step 2 before 647220080 after 572258540
    step 7 before 505518960 after 484041813
    step 15 before 430483053 after 444815320 *
    step 63 before 388113453 after 404250546 *
    step 64 before 374154666 after 396027440 *
    step 65 before 381423973 after 396704853 *
    step 128 before 190078700 after 202619384 *
    step 256 before 100886756 after 102829108 *
    step 512 before 64074505 after 56158720
    step 12345 before 4237289 after 4422299 *

    * i686 on Sandy Bridge

    radix-tree with 1024 slots:

    tagged lookup

    step 1 before 7990 after 4019
    step 2 before 5698 after 2897
    step 3 before 5013 after 2475
    step 4 before 4630 after 1721
    step 5 before 4346 after 1759
    step 6 before 4299 after 1556
    step 7 before 4098 after 1513
    step 8 before 4115 after 1222
    step 9 before 3983 after 1390
    step 10 before 4077 after 1207
    step 11 before 3921 after 1231
    step 12 before 3894 after 1116
    step 13 before 3840 after 1147
    step 14 before 3799 after 1090
    step 15 before 3797 after 1059
    step 16 before 3783 after 745

    normal lookup

    step 1 before 5103 after 3499
    step 2 before 3299 after 2550
    step 3 before 2489 after 2370
    step 4 before 2034 after 2302 *
    step 5 before 1846 after 2268 *
    step 6 before 1752 after 2249 *
    step 7 before 1679 after 2164 *
    step 8 before 1627 after 2153 *
    step 9 before 1542 after 2095 *
    step 10 before 1479 after 2109 *
    step 11 before 1469 after 2009 *
    step 12 before 1445 after 2039 *
    step 13 before 1411 after 2013 *
    step 14 before 1374 after 2046 *
    step 15 before 1340 after 1975 *
    step 16 before 1331 after 2000 *

    radix-tree with 1024*1024*128 slots:

    tagged lookup

    step 1 before 1225865377 after 667153553
    step 2 before 842427423 after 471533007
    step 7 before 609296153 after 276260116
    step 15 before 544232060 after 226859105
    step 63 before 519209199 after 141343043
    step 64 before 588980279 after 141951339
    step 65 before 521099710 after 138282060
    step 128 before 298476778 after 83390628
    step 256 before 149358342 after 43602609
    step 512 before 76994713 after 22911077
    step 12345 before 5328666 after 1472111

    normal lookup

    step 1 before 819284564 after 533635310
    step 2 before 512421605 after 364956155
    step 7 before 271443305 after 305721345 *
    step 15 before 223591630 after 273960216 *
    step 63 before 190320247 after 217770207 *
    step 64 before 178538168 after 267411372 *
    step 65 before 186400423 after 215347937 *
    step 128 before 88106045 after 140540612 *
    step 256 before 44812420 after 70660377 *
    step 512 before 24435438 after 36328275 *
    step 12345 before 2123924 after 2148062 *

    bloat-o-meter delta for this patchset + patchset with related shmem cleanups

    bloat-o-meter: x86_64

    add/remove: 4/3 grow/shrink: 5/6 up/down: 928/-939 (-11)
    function old new delta
    radix_tree_next_chunk - 499 +499
    shmem_unuse 428 554 +126
    shmem_radix_tree_replace 131 227 +96
    find_get_pages_tag 354 419 +65
    find_get_pages_contig 345 407 +62
    find_get_pages 362 396 +34
    __kstrtab_radix_tree_next_chunk - 22 +22
    __ksymtab_radix_tree_next_chunk - 16 +16
    __kcrctab_radix_tree_next_chunk - 8 +8
    radix_tree_gang_lookup_slot 204 203 -1
    static.shmem_xattr_set 384 381 -3
    radix_tree_gang_lookup_tag_slot 208 191 -17
    radix_tree_gang_lookup 231 187 -44
    radix_tree_gang_lookup_tag 247 199 -48
    shmem_unlock_mapping 278 190 -88
    __lookup 217 - -217
    __lookup_tag 242 - -242
    radix_tree_locate_item 279 - -279

    bloat-o-meter: i386

    add/remove: 3/3 grow/shrink: 8/9 up/down: 1075/-1275 (-200)
    function old new delta
    radix_tree_next_chunk - 757 +757
    shmem_unuse 352 449 +97
    find_get_pages_contig 269 322 +53
    shmem_radix_tree_replace 113 154 +41
    find_get_pages_tag 277 318 +41
    dcache_dir_lseek 426 458 +32
    __kstrtab_radix_tree_next_chunk - 22 +22
    vc_do_resize 968 977 +9
    snd_pcm_lib_read1 725 733 +8
    __ksymtab_radix_tree_next_chunk - 8 +8
    netlbl_cipsov4_list 1120 1127 +7
    find_get_pages 293 291 -2
    new_slab 467 459 -8
    bitfill_unaligned_rev 425 417 -8
    radix_tree_gang_lookup_tag_slot 177 146 -31
    blk_dump_cmd 267 229 -38
    radix_tree_gang_lookup_slot 212 134 -78
    shmem_unlock_mapping 221 128 -93
    radix_tree_gang_lookup_tag 275 162 -113
    radix_tree_gang_lookup 255 126 -129
    __lookup 227 - -227
    __lookup_tag 271 - -271
    radix_tree_locate_item 277 - -277

    This patch:

    Implement a clean, simple and effective radix-tree iteration routine.

    Iteration is divided into two phases:
    * look up the next chunk in a radix-tree leaf node
    * iterate through the slots in this chunk

    The main iterator function, radix_tree_next_chunk(), returns a pointer
    to the first slot and stores the index of the next-to-last slot in
    struct radix_tree_iter. For tagged iteration it also constructs a
    bitmask of tags for the returned chunk. All additional logic is
    implemented as static-inline functions and macros.

    This also adds radix_tree_find_next_bit(), a static-inline variant of
    find_next_bit() optimized for small constant-size arrays, because
    find_next_bit() is too heavy for searching in an array with one or two
    long elements.

    [akpm@linux-foundation.org: rework comments a bit]
    Signed-off-by: Konstantin Khlebnikov
    Tested-by: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • If CONFIG_NET_NS, CONFIG_UTS_NS and CONFIG_IPC_NS are disabled,
    ns_entries[] becomes empty and things like
    ns_entries[ARRAY_SIZE(ns_entries) - 1] will explode.

    Reported-by: Richard Weinberger
    Cc: "Eric W. Biederman"
    Cc: Daniel Lezcano
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Rename the nbd_device variable from "lo" to "nbd", since "lo" is just a
    name copied from loop.c.

    Signed-off-by: Wanlong Gao
    Cc: Paul Clements
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanlong Gao
     
  • In the case of a child pid namespace, rebooting the system does not
    really make sense. When the pid namespace is used in conjunction with
    the other namespaces in order to create a linux container, the reboot
    syscall leads to some problems.

    A container can reboot the host. That can be fixed by dropping the
    sys_reboot capability, but then we are unable to correctly poweroff/
    halt/reboot a container, and the container stays stuck at shutdown time
    with the container's init process waiting indefinitely.

    After several attempts, no solution from userspace was found to reliably
    handle the shutdown from a container.

    This patch proposes to make the init process of the child pid namespace
    exit with a signal status set to: SIGINT if the child pid namespace
    called "halt/poweroff" and SIGHUP if the child pid namespace called
    "reboot". When the reboot syscall is called and we are not in the
    initial pid namespace, we kill the pid namespace for "HALT", "POWEROFF",
    "RESTART", and "RESTART2". Otherwise we return EINVAL.

    Returning EINVAL is also an easy way to check whether this feature is
    supported by the kernel when invoking another 'reboot' option like CAD.

    This way the parent process of the child pid namespace knows whether it
    rebooted or not and can take the right decision.

    Test case:
    ==========

    #include <alloca.h>
    #include <stdio.h>
    #include <sched.h>
    #include <unistd.h>
    #include <signal.h>
    #include <sys/reboot.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    #include <linux/reboot.h>

    static int do_reboot(void *arg)
    {
            int *cmd = arg;

            if (reboot(*cmd))
                    printf("failed to reboot(%d): %m\n", *cmd);
            return 0;
    }

    int test_reboot(int cmd, int sig)
    {
            long stack_size = 4096;
            void *stack = alloca(stack_size) + stack_size;
            int status;
            pid_t ret;

            ret = clone(do_reboot, stack, CLONE_NEWPID | SIGCHLD, &cmd);
            if (ret < 0) {
                    printf("failed to clone: %m\n");
                    return -1;
            }

            if (wait(&status) < 0) {
                    printf("unexpected wait error: %m\n");
                    return -1;
            }

            if (!WIFSIGNALED(status)) {
                    printf("child process exited but was not signaled\n");
                    return -1;
            }

            if (WTERMSIG(status) != sig) {
                    printf("signal termination is not the one expected\n");
                    return -1;
            }

            return 0;
    }

    int main(int argc, char *argv[])
    {
            int status;

            status = test_reboot(LINUX_REBOOT_CMD_RESTART, SIGHUP);
            if (status < 0)
                    return 1;
            printf("reboot(LINUX_REBOOT_CMD_RESTART) succeeded\n");

            status = test_reboot(LINUX_REBOOT_CMD_RESTART2, SIGHUP);
            if (status < 0)
                    return 1;
            printf("reboot(LINUX_REBOOT_CMD_RESTART2) succeeded\n");

            status = test_reboot(LINUX_REBOOT_CMD_HALT, SIGINT);
            if (status < 0)
                    return 1;
            printf("reboot(LINUX_REBOOT_CMD_HALT) succeeded\n");

            status = test_reboot(LINUX_REBOOT_CMD_POWER_OFF, SIGINT);
            if (status < 0)
                    return 1;
            printf("reboot(LINUX_REBOOT_CMD_POWER_OFF) succeeded\n");

            status = test_reboot(LINUX_REBOOT_CMD_CAD_ON, -1);
            if (status >= 0) {
                    printf("reboot(LINUX_REBOOT_CMD_CAD_ON) should have failed\n");
                    return 1;
            }
            printf("reboot(LINUX_REBOOT_CMD_CAD_ON) has failed as expected\n");

            return 0;
    }

    [akpm@linux-foundation.org: tweak and add comments]
    [akpm@linux-foundation.org: checkpatch fixes]
    Signed-off-by: Daniel Lezcano
    Acked-by: Serge Hallyn
    Tested-by: Serge Hallyn
    Reviewed-by: Oleg Nesterov
    Cc: Michael Kerrisk
    Cc: "Eric W. Biederman"
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Lezcano
     
  • Use bitmap_set() instead of using set_bit() for each bit. This conversion
    is valid because the bitmap is private in the function call and atomic
    bitops were unnecessary.

    This also includes a minor change:
    - Use bitmap_copy() for shorter typing

    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • The IPMI watchdog timer clears or extends the timer on reboot/shutdown.
    It was using the non-locking routine for setting the watchdog timer, but
    this was causing race conditions. Instead, use the locking version to
    avoid the races. It seems to work fine.

    Signed-off-by: Corey Minyard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Corey Minyard
     
  • Now that the IPMI driver is using a tasklet, we can simplify the
    locking in the driver and get rid of the message lock.

    Signed-off-by: Corey Minyard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Corey Minyard
     
  • The part of the IPMI driver that delivered panic information to the
    event log and extended the watchdog timeout during a panic was not
    properly handling the messages. It used static messages to avoid
    allocation, but wasn't properly waiting for them or properly handling
    the refcounts.

    Signed-off-by: Corey Minyard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Corey Minyard
     
  • The IPMI driver would release a lock, deliver a message, then relock.
    This is obviously ugly, and this patch converts the message handler
    interface to use a tasklet to schedule work. This lets the receive
    handler be called from an interrupt handler with interrupts enabled.

    Signed-off-by: Corey Minyard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Corey Minyard
     
  • We currently time out and retry KCS transactions after 1 second of waiting
    for IBF or OBF. This appears to be too short for some hardware. The IPMI
    spec says "All system software wait loops should include error timeouts.
    For simplicity, such timeouts are not shown explicitly in the flow
    diagrams. A five-second timeout or greater is recommended". Change the
    timeout to five seconds to satisfy the slow hardware.

    Signed-off-by: Matthew Garrett
    Signed-off-by: Corey Minyard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Garrett
     
  • Call the event handler immediately after starting the next message.

    This change considerably decreases the IPMI transaction time (cuts off
    ~9ms for a single ipmitool transaction).

    Signed-off-by: Srinivas_Gowda
    Signed-off-by: Corey Minyard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Srinivas_Gowda
     
  • The crashkernel reservation needs to know the total memory size. The
    current get_total_mem() simply uses max_pfn - min_low_pfn, which is
    wrong because it includes memory holes in the middle.

    Especially for a kvm guest with memory > 0xe0000000, qemu splits the
    memory as below:

    if (ram_size >= 0xe0000000) {
            above_4g_mem_size = ram_size - 0xe0000000;
            below_4g_mem_size = 0xe0000000;
    } else {
            below_4g_mem_size = ram_size;
    }

    So for a 4G-memory guest, seabios will insert a 512M usable region
    beyond 4G. Thus, in the above case, max_pfn - min_low_pfn will be
    larger than the original memory size.

    Fix this issue by using memblock_phys_mem_size() to get the total
    memory size.

    Signed-off-by: Dave Young
    Reviewed-by: WANG Cong
    Reviewed-by: Simon Horman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Young
     
  • When using crashkernel=2M-256M, the kernel doesn't give any warning.
    This is sometimes misleading.

    Signed-off-by: Zhenzhong Duan
    Acked-by: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhenzhong Duan
     
  • nommu platforms don't have very interesting swapper_pg_dir pointers and
    usually just #define them to NULL, meaning that we can't include them in
    the vmcoreinfo on the kexec crash path.

    This patch only saves the swapper_pg_dir if we have an MMU.

    Signed-off-by: Will Deacon
    Reviewed-by: Simon Horman
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Will Deacon
     
  • This has been marked as obsolete for quite a while now, so it is time
    to remove it altogether. While doing this, get rid of first_cpu() as
    well. Also, remove the redundant setting of cpu_online_mask in
    smp_prepare_cpus(), because the generic code would have already set
    cpu 0 in cpu_online_mask.

    Reported-by: Tony Luck
    Signed-off-by: Srivatsa S. Bhat
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Srivatsa S. Bhat
     
  • __any_online_cpu() is not optimal and also unnecessary, so replace its
    uses with faster cpumask_* operations.

    Signed-off-by: Srivatsa S. Bhat
    Cc: Eric Dumazet
    Cc: Venkatesh Pallipadi
    Cc: Rusty Russell
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Srivatsa S. Bhat
     
  • Calculate a cpumask of CPUs with per-cpu pages in any zone and only send
    an IPI requesting CPUs to drain these pages to the buddy allocator if they
    actually have pages when asked to flush.

    This patch saves 85%+ of the IPIs asking to drain per-cpu pages in the
    case of severe memory pressure that leads to OOM, since in these cases
    multiple, possibly concurrent, allocation requests end up in the direct
    reclaim code path. The per-cpu pages are reclaimed on the first
    allocation failure, so for most of the subsequent allocation attempts,
    until the memory pressure is off (possibly via the OOM killer), there
    are no per-cpu pages on most CPUs (and there can easily be hundreds of
    them).

    This also has the side effect of shortening the average latency of direct
    reclaim by 1 or more order of magnitude since waiting for all the CPUs to
    ACK the IPI takes a long time.

    Tested by running "hackbench 400" on an 8-CPU x86 VM and observing the
    difference between the number of direct reclaim attempts that end up in
    drain_all_pages() and those where more than 1/2 of the online CPUs had
    any per-cpu pages, using the vmstat counters introduced in the next
    patch in the series and /proc/interrupts.

    In the test scenario, this was seen to save around 3600 global
    IPIs after triggering an OOM on a concurrent workload:

    $ cat /proc/vmstat | tail -n 2
    pcp_global_drain 0
    pcp_global_ipi_saved 0

    $ cat /proc/interrupts | grep CAL
    CAL: 1 2 1 2
    2 2 2 2 Function call interrupts

    $ hackbench 400
    [OOM messages snipped]

    $ cat /proc/vmstat | tail -n 2
    pcp_global_drain 3647
    pcp_global_ipi_saved 3642

    $ cat /proc/interrupts | grep CAL
    CAL: 6 13 6 3
    3 3 1 2 7 Function call interrupts

    Please note that if the global drain is removed from the direct reclaim
    path as a patch from Mel Gorman currently suggests this should be replaced
    with an on_each_cpu_cond invocation.

    Signed-off-by: Gilad Ben-Yossef
    Acked-by: Mel Gorman
    Cc: KOSAKI Motohiro
    Acked-by: Christoph Lameter
    Acked-by: Peter Zijlstra
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Andi Kleen
    Acked-by: Michal Nazarewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gilad Ben-Yossef
     
  • In several code paths, such as when unmounting a file system (but not
    only), we send an IPI to ask each cpu to invalidate its local LRU BHs.

    On multi-core systems, many cpus may not have any LRU BHs because they
    are idle, or because they have not performed any file system accesses
    since the last invalidation (e.g. CPUs crunching on high-performance
    computing nodes that write results to shared memory, or ones only using
    filesystems that do not use the bh layer). This can lead to a loss of
    performance each time someone switches the KVM (the virtual keyboard
    and screen type, not the hypervisor) if it has a USB storage device
    plugged in.

    This patch attempts to only send an IPI to cpus that have LRU BHs.

    Signed-off-by: Gilad Ben-Yossef
    Acked-by: Peter Zijlstra
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gilad Ben-Yossef
     
  • flush_all() is called for each kmem_cache_destroy(). So every cache
    being destroyed dynamically ends up sending an IPI to each CPU in the
    system, regardless of whether the cache has ever been used there.

    For example, if you close the Infiniband ipath driver char device file,
    the close file ops calls kmem_cache_destroy(). So running some
    infiniband config tool on a single CPU dedicated to system tasks might
    interrupt the rest of the 127 CPUs dedicated to some CPU-intensive or
    latency-sensitive task.

    I suspect there is a good chance that every line in the output of "git
    grep kmem_cache_destroy linux/ | grep '\->'" has a similar scenario.

    This patch attempts to rectify this issue by sending an IPI to flush
    the per-cpu objects back to the free lists only to CPUs that seem to
    have such objects.

    The check of which CPUs to IPI is racy, but we don't care, since asking
    a CPU without per-cpu objects to flush does no damage; and as far as I
    can tell flush_all() by itself is racy against allocs on remote CPUs
    anyway, so if you required flush_all() to be deterministic, you had to
    arrange for locking regardless.

    Without this patch the following artificial test case:

    $ cd /sys/kernel/slab
    $ for DIR in *; do cat $DIR/alloc_calls > /dev/null; done

    produces 166 IPIs on a cpuset-isolated CPU. With it, it produces none.

    The code path of memory allocation failure for the
    CONFIG_CPUMASK_OFFSTACK=y config was tested using the fault injection
    framework.

    Signed-off-by: Gilad Ben-Yossef
    Acked-by: Christoph Lameter
    Cc: Chris Metcalf
    Acked-by: Peter Zijlstra
    Cc: Frederic Weisbecker
    Cc: Russell King
    Cc: Pekka Enberg
    Cc: Matt Mackall
    Cc: Sasha Levin
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Mel Gorman
    Cc: Alexander Viro
    Cc: Avi Kivity
    Cc: Michal Nazarewicz
    Cc: Kosaki Motohiro
    Cc: Milton Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gilad Ben-Yossef
     
  • Add the on_each_cpu_cond() function, which wraps on_each_cpu_mask() and
    calculates the cpumask of cpus to IPI by calling a function supplied as
    a parameter in order to determine whether to IPI each specific cpu.

    The function works around allocation failure of the cpumask variable in
    the CONFIG_CPUMASK_OFFSTACK=y case by iterating over the cpus, sending
    an IPI one at a time via smp_call_function_single().

    The function is useful since it allows separating the specific code
    that decides, in each case, whether to IPI a specific cpu for a
    specific request from the common boilerplate code of creating the mask,
    handling failures, etc.

    [akpm@linux-foundation.org: s/gfpflags/gfp_flags/]
    [akpm@linux-foundation.org: avoid double-evaluation of `info' (per Michal), parenthesise evaluation of `cond_func']
    [akpm@linux-foundation.org: s/CPU/CPUs, use all 80 cols in comment]
    Signed-off-by: Gilad Ben-Yossef
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Acked-by: Peter Zijlstra
    Cc: Frederic Weisbecker
    Cc: Russell King
    Cc: Pekka Enberg
    Cc: Matt Mackall
    Cc: Sasha Levin
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Alexander Viro
    Cc: Avi Kivity
    Acked-by: Michal Nazarewicz
    Cc: Kosaki Motohiro
    Cc: Milton Miller
    Reviewed-by: "Srivatsa S. Bhat"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gilad Ben-Yossef
     
  • We have lots of infrastructure in place to partition multi-core systems
    such that we have a group of CPUs that are dedicated to specific task:
    cgroups, scheduler and interrupt affinity, and the isolcpus= boot
    parameter.
    Still, kernel code will at times interrupt all CPUs in the system via IPIs
    for various needs. These IPIs are useful and cannot be avoided
    altogether, but in certain cases it is possible to interrupt only specific
    CPUs that have useful work to do and not the entire system.

    This patch set, inspired by discussions with Peter Zijlstra and Frederic
    Weisbecker when testing the nohz task patch set, is a first stab at trying
    to explore doing this by locating the places where such global IPI calls
    are being made and turning the global IPI into an IPI for a specific group
    of CPUs. The purpose of the patch set is to get feedback if this is the
    right way to go for dealing with this issue and indeed, if the issue is
    even worth dealing with at all. Based on the feedback from this patch set
    I plan to offer further patches that address similar issue in other code
    paths.

    This patch creates an on_each_cpu_mask() and on_each_cpu_cond()
    infrastructure API (the former derived from existing arch-specific
    versions in Tile and Arm) and uses them to turn several global IPI
    invocations into per-CPU-group invocations.

    Core kernel:

    on_each_cpu_mask() calls a function on processors specified by cpumask,
    which may or may not include the local processor.

    You must not call this function with disabled interrupts or from a
    hardware interrupt handler or from a bottom half handler.

    arch/arm:

    Note that the generic version is a little different from the Arm one:

    1. It has the mask as its first parameter
    2. It calls the function on the calling CPU with interrupts disabled,
    but this should be OK since the function is called on the other CPUs
    with interrupts disabled anyway.

    arch/tile:

    The API is the same as the tile private one, but the generic version
    also calls the function on the local CPU with interrupts disabled in
    the UP case.

    This is OK since the function is called on the other CPUs
    with interrupts disabled.

    Signed-off-by: Gilad Ben-Yossef
    Reviewed-by: Christoph Lameter
    Acked-by: Chris Metcalf
    Acked-by: Peter Zijlstra
    Cc: Frederic Weisbecker
    Cc: Russell King
    Cc: Pekka Enberg
    Cc: Matt Mackall
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Sasha Levin
    Cc: Mel Gorman
    Cc: Alexander Viro
    Cc: Avi Kivity
    Acked-by: Michal Nazarewicz
    Cc: Kosaki Motohiro
    Cc: Milton Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gilad Ben-Yossef
     
  • Most system calls taking flags first check that the flags passed in are
    valid, and that helps userspace to detect when new flags are supported.

    But swapon never did so: start checking now, to help if we ever want to
    support more swap_flags in future.

    It's difficult to get stray bits set in an int, and swapon is not widely
    used, so this is most unlikely to break any userspace; but we can just
    revert if it turns out to do so.
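
    The pattern being adopted can be sketched like this (the flag values
    mirror the swapon flags of this era, but the helper name and the plain
    error constant are invented for the example, not the kernel code):
    reject any bit outside the known set, so userspace probing for a new
    flag gets a clean error instead of silent acceptance.

```c
#include <assert.h>

/* Illustrative sketch of a swapon-style flags check; names and the
 * error constant are made up for this example. */
#define SWAP_FLAG_PREFER    0x8000    /* set if a swap priority is given */
#define SWAP_FLAG_PRIO_MASK 0x7fff
#define SWAP_FLAG_DISCARD   0x10000

#define MY_EINVAL 22

static int check_swap_flags(int swap_flags)
{
    /* any bit outside the known flags means the caller is asking for
     * something this kernel does not support yet */
    if (swap_flags & ~(SWAP_FLAG_PREFER | SWAP_FLAG_PRIO_MASK |
                       SWAP_FLAG_DISCARD))
        return -MY_EINVAL;
    return 0;
}
```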

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The size of coredump files is limited by RLIMIT_CORE, however, allocating
    large amounts of memory results in three negative consequences:

    - the coredumping process may be chosen for oom kill and quickly deplete
    all memory reserves in oom conditions preventing further progress from
    being made or tasks from exiting,

    - the coredumping process may cause other processes to be oom killed
    without fault of their own as the result of a SIGSEGV, for example, in
    the coredumping process, or

    - the coredumping process may result in a livelock while writing to the
    dump file if it needs memory to allocate while other threads are in
    the exit path waiting on the coredumper to complete.

    This is fixed by implying __GFP_NORETRY in the page allocator for
    coredumping processes when reclaim has failed so the allocations fail and
    the process continues to exit.
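
    In miniature, the decision reads like the following sketch (the flag
    value and helper name are invented for illustration; the real logic
    lives in the page allocator's slow path):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the fix; MY_GFP_NORETRY and coredump_gfp() are made-up
 * names. Once direct reclaim has already failed, a task that is dumping
 * core gets the no-retry behaviour implied, so its allocation fails
 * promptly instead of looping and draining memory reserves. */
#define MY_GFP_NORETRY 0x1000u

static unsigned int coredump_gfp(unsigned int gfp_mask,
                                 bool dumping_core, bool reclaim_failed)
{
    if (dumping_core && reclaim_failed)
        gfp_mask |= MY_GFP_NORETRY;
    return gfp_mask;
}
```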

    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • pmd_trans_unstable() should be called before pmd_offset_map() in the
    locations where the mmap_sem is held for reading.

    Signed-off-by: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Larry Woodman
    Cc: Ulrich Obergfell
    Cc: Rik van Riel
    Cc: Mark Salter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Holepunching filesystems ext4 and xfs are using truncate_inode_pages_range
    but forgetting to unmap pages first (ocfs2 remembers). This is not really
    a bug, since races already require truncate_inode_page() to handle that
    case once the page is locked; but it can be very inefficient if the file
    being punched happens to be mapped into many vmas.

    Provide a drop-in replacement truncate_pagecache_range() which does the
    unmapping pass first, handling the awkward mismatch between arguments to
    truncate_inode_pages_range() and arguments to unmap_mapping_range().

    Note that holepunching does not unmap privately COWed pages in the range:
    POSIX requires that we do so when truncating, but it's hard to justify,
    difficult to implement without an i_size cutoff, and no filesystem is
    attempting to implement it.
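
    The "awkward mismatch" is that truncate_inode_pages_range() takes an
    inclusive byte range while unmap_mapping_range() takes a page-aligned
    start plus a length. A sketch of the translation (PAGE_SIZE fixed at
    4096 here; the helper is illustrative, not the kernel function): only
    whole pages lying entirely inside the hole are unmapped, and partial
    pages at the edges are left for the truncate pass to deal with.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL

/* Illustrative translation from an inclusive byte range [lstart, lend]
 * (truncate_inode_pages_range() style) to a page-aligned start + length
 * pair (unmap_mapping_range() style). Round the start up and the
 * exclusive end down to page boundaries. */
static void hole_to_unmap_args(uint64_t lstart, uint64_t lend,
                               uint64_t *holebegin, uint64_t *holelen)
{
    uint64_t begin = (lstart + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);
    uint64_t end   = (lend + 1) & ~(PAGE_SIZE - 1);

    *holebegin = begin;
    *holelen   = end > begin ? end - begin : 0;  /* 0: no whole page inside */
}
```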

    Signed-off-by: Hugh Dickins
    Cc: "Theodore Ts'o"
    Cc: Andreas Dilger
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Ben Myers
    Cc: Alex Elder
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • bda7bad62bc4 ("procfs: speed up /proc/pid/stat, statm") broke /proc/statm
    - 'text' is printed twice by mistake.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reported-by: Ulrich Drepper
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Pull trivial writeback fixes from Wu Fengguang:
    "They've been tested in linux-next for 20 days actually."

    * tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: Remove outdated comment
    fs: Remove bogus wait in write_inode_now()

    Linus Torvalds
     
  • Pull ext4 updates for 3.4 from Ted Ts'o:
    "Ext4 commits for 3.3 merge window; mostly cleanups and bug fixes

    The changes to export dirty_writeback_interval are from Artem's s_dirt
    cleanup patch series. The same is true of the change to remove the
    s_dirt helper functions which never got used by anyone in-tree. I've
    run these changes by Al Viro, and am carrying them so that Artem can
    more easily fix up the rest of the file systems during the next merge
    window. (Originally we had hoped to remove the use of s_dirt from
    ext4 during this merge window, but his patches had some bugs, so I
    ultimately ended dropping them from the ext4 tree.)"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (66 commits)
    vfs: remove unused superblock helpers
    mm: export dirty_writeback_interval
    ext4: remove useless s_dirt assignment
    ext4: write superblock only once on unmount
    ext4: do not mark superblock as dirty unnecessarily
    ext4: correct ext4_punch_hole return codes
    ext4: remove restrictive checks for EOFBLOCKS_FL
    ext4: always set then trimmed blocks count into len
    ext4: fix trimmed block count accounting
    ext4: fix start and len arguments handling in ext4_trim_fs()
    ext4: update s_free_{inodes,blocks}_count during online resize
    ext4: change some printk() calls to use ext4_msg() instead
    ext4: avoid output message interleaving in ext4_error_()
    ext4: remove trailing newlines from ext4_msg() and ext4_error() messages
    ext4: add no_printk argument validation, fix fallout
    ext4: remove redundant "EXT4-fs: " from uses of ext4_msg
    ext4: give more helpful error message in ext4_ext_rm_leaf()
    ext4: remove unused code from ext4_ext_map_blocks()
    ext4: rewrite punch hole to use ext4_ext_remove_space()
    jbd2: cleanup journal tail after transaction commit
    ...

    Linus Torvalds
     
  • Pull Ceph updates for 3.4-rc1 from Sage Weil:
    "Alex has been busy. There are a range of rbd and libceph cleanups,
    especially surrounding device setup and teardown, and a few critical
    fixes in that code. There are more cleanups in the messenger code,
    virtual xattrs, a fix for CRC calculation/checks, and lots of other
    miscellaneous stuff.

    There's a patch from Amon Ott to make inos behave a bit better on
    32-bit boxes, some decode check fixes from Xi Wang, a network
    throttling fix from Jim Schutt, and a couple of RBD fixes from Josh
    Durgin.

    No new functionality, just a lot of cleanup and bug fixing."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (65 commits)
    rbd: move snap_rwsem to the device, rename to header_rwsem
    ceph: fix three bugs, two in ceph_vxattrcb_file_layout()
    libceph: isolate kmap() call in write_partial_msg_pages()
    libceph: rename "page_shift" variable to something sensible
    libceph: get rid of zero_page_address
    libceph: only call kernel_sendpage() via helper
    libceph: use kernel_sendpage() for sending zeroes
    libceph: fix inverted crc option logic
    libceph: some simple changes
    libceph: small refactor in write_partial_kvec()
    libceph: do crc calculations outside loop
    libceph: separate CRC calculation from byte swapping
    libceph: use "do" in CRC-related Boolean variables
    ceph: ensure Boolean options support both senses
    libceph: a few small changes
    libceph: make ceph_tcp_connect() return int
    libceph: encapsulate some messenger cleanup code
    libceph: make ceph_msgr_wq private
    libceph: encapsulate connection kvec operations
    libceph: move prepare_write_banner()
    ...

    Linus Torvalds
     
  • Pull ext3, UDF, and quota fixes from Jan Kara:
    "A couple of ext3 & UDF fixes and also one improvement in quota
    locking."

    * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
    ext3: fix start and len arguments handling in ext3_trim_fs()
    udf: Fix deadlock in udf_release_file()
    udf: Fix file entry logicalBlocksRecorded
    udf: Fix handling of i_blocks
    quota: Make quota code not call tty layer with dqptr_sem held
    udf: Init/maintain file entry checkpoint field
    ext3: Update ctime in ext3_splice_branch() only when needed
    ext3: Don't call dquot_free_block() if we don't update anything
    udf: Remove unnecessary OOM messages

    Linus Torvalds
     
  • Pull 9p changes for the 3.4 merge window from Eric Van Hensbergen.

    * tag 'for-linus-3.4-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
    9p: statfs should not override server f_type
    net/9p: handle flushed Tclunk/Tremove
    net/9p: don't allow Tflush to be interrupted

    Linus Torvalds
     
  • In d_materialise_unique() there are 3 subcases to the 'aliased dentry'
    case; in two subcases the inode i_lock is properly released but this
    does not occur in the -ELOOP subcase.

    This seems to have been introduced by commit 1836750115f2 ("fix loop
    checks in d_materialise_unique()").
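
    The bug class here is the familiar one-error-path-forgets-the-unlock
    shape; a toy model of it (a plain flag stands in for the inode i_lock,
    and none of the names below are the real dcache code):

```c
#include <assert.h>

/* Toy model of the locking bug: every return taken after the lock is
 * acquired must drop it, including the error path. MY_ELOOP and
 * resolve_alias() are invented names. */
#define MY_ELOOP 40

static int lock_held;   /* stands in for the inode i_lock */

static int resolve_alias(int is_loop)
{
    int err = 0;

    lock_held = 1;            /* spin_lock(&inode->i_lock) */
    if (is_loop)
        err = -MY_ELOOP;      /* the buggy version returned from here,
                                 still holding the lock */
    lock_held = 0;            /* spin_unlock(&inode->i_lock) */
    return err;
}
```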

    Signed-off-by: Michel Lespinasse
    Cc: stable@vger.kernel.org # v3.0+
    [ Added a comment, and moved the unlock to where we generate the -ELOOP,
    which seems to be more natural.

    You probably can't actually trigger this without a buggy network file
    server - d_materialize_unique() is for finding aliases on non-local
    filesystems, and the d_ancestor() case is for a hardlinked directory
    loop.

    But we should be robust in the case of such buggy servers anyway. ]
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

28 Mar, 2012

3 commits

  • Pull s390 patches part 2 from Martin Schwidefsky:
    "Some minor improvements and one additional feature for the 3.4 merge
    window: Hendrik added perf support for the s390 CPU counters."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
    [S390] register cpu devices for SMP=n
    [S390] perf: add support for s390x CPU counters
    [S390] oprofile: Allow multiple users of the measurement alert interrupt
    [S390] qdio: log all adapter characteristics
    [S390] Remove unncessary export of arch_pick_mmap_layout

    Linus Torvalds
     
  • Pull UML changes from Richard Weinberger:
    "Mostly bug fixes and cleanups"

    * 'for-linus-3.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml: (35 commits)
    um: Update defconfig
    um: Switch to large mcmodel on x86_64
    MTD: Relax dependencies
    um: Wire CONFIG_GENERIC_IO up
    um: Serve io_remap_pfn_range()
    Introduce CONFIG_GENERIC_IO
    um: allow SUBARCH=x86
    um: most of the SUBARCH uses can be killed
    um: deadlock in line_write_interrupt()
    um: don't bother trying to rebuild CHECKFLAGS for USER_OBJS
    um: use the right ifdef around exports in user_syms.c
    um: a bunch of headers can be killed by using generic-y
    um: ptrace-generic.h doesn't need user.h
    um: kill HOST_TASK_PID
    um: remove pointless include of asm/fixmap.h from asm/pgtable.h
    um: asm-offsets.h might as well come from underlying arch...
    um: merge processor_{32,64}.h a bit...
    um: switch close_chan() to struct line
    um: race fix: initialize delayed_work *before* registering IRQ
    um: line->have_irq is never checked...
    ...

    Linus Torvalds
     
  • Pull arch/microblaze fixes from Michal Simek

    * 'next' of git://git.monstr.eu/linux-2.6-microblaze:
    microblaze: Handle TLB skip size dynamically
    microblaze: Introduce TLB skip size
    microblaze: Improve TLB calculation for small systems
    microblaze: Extend space for compiled-in FDT to 32kB
    microblaze: Clear all MSR flags on the first kernel instruction
    microblaze: Use node name instead of compatible string
    microblaze: Fix mapin_ram function
    microblaze: Highmem support
    microblaze: Use active regions
    microblaze: Show more detailed information about memory
    microblaze: Introduce fixmap
    microblaze: mm: Fix lowmem max memory size limits
    microblaze: mm: Use ZONE_DMA instead of ZONE_NORMAL
    microblaze: trivial: Fix typo fault in timer.c
    microblaze: Use vsprintf extention %pf with builtin_return_address
    microblaze: Add PVR version string for MB 8.20.b and 8.30.a
    microblaze: Fix makefile to work with latest toolchain
    microblaze: Fix typo in early_printk.c

    Linus Torvalds