25 Jun, 2015

1 commit

  • Change the default behavior of watchdog so it only runs on the
    housekeeping cores when nohz_full is enabled at build and boot time.
    Allow modifying the set of cores the watchdog is currently running on
    with a new kernel.watchdog_cpumask sysctl.

    In the current system, the watchdog subsystem runs a periodic timer that
    schedules the watchdog kthread to run. However, nohz_full cores are
    designed to allow userspace application code running on those cores to
    have 100% access to the CPU. So the watchdog system prevents the
    nohz_full application code from being able to run the way it wants to,
    thus the motivation to suppress the watchdog on nohz_full cores, which
    this patchset provides by default.

    However, if we disable the watchdog globally, then the housekeeping
    cores can't benefit from the watchdog functionality. So we allow
    disabling it only on some cores. See Documentation/lockup-watchdogs.txt
    for more information.

    [jhubbard@nvidia.com: fix a watchdog crash in some configurations]
    Signed-off-by: Chris Metcalf
    Acked-by: Don Zickus
    Cc: Ingo Molnar
    Cc: Ulrich Obergfell
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Frederic Weisbecker
    Signed-off-by: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Metcalf
     

19 Jun, 2015

1 commit

  • Eric reported that the timer_migration sysctl is not really nice
    performance wise as it needs to check at every timer insertion whether
    the feature is enabled or not. Further the check does not live in the
    timer code, so we have an extra function call which checks an extra
    cache line to figure out that it is disabled.

    We can do better and store that information in the per cpu (hr)timer
    bases. I pondered to use a static key, but that's a nightmare to
    update from the nohz code and the timer base cache line is hot anyway
    when we select a timer base.

    The old logic enabled the timer migration unconditionally if
    CONFIG_NO_HZ was set even if nohz was disabled on the kernel command
    line.

    With this modification, we start off with migration disabled. The user
    visible sysctl is still set to enabled. If the kernel switches to NOHZ
    migration is enabled, if the user did not disable it via the sysctl
    prior to the switch. If nohz=off is on the kernel command line,
    migration stays disabled no matter what.

    Before:
    47.76% hog [.] main
    14.84% [kernel] [k] _raw_spin_lock_irqsave
    9.55% [kernel] [k] _raw_spin_unlock_irqrestore
    6.71% [kernel] [k] mod_timer
    6.24% [kernel] [k] lock_timer_base.isra.38
    3.76% [kernel] [k] detach_if_pending
    3.71% [kernel] [k] del_timer
    2.50% [kernel] [k] internal_add_timer
    1.51% [kernel] [k] get_nohz_timer_target
    1.28% [kernel] [k] __internal_add_timer
    0.78% [kernel] [k] timerfn
    0.48% [kernel] [k] wake_up_nohz_cpu

    After:
    48.10% hog [.] main
    15.25% [kernel] [k] _raw_spin_lock_irqsave
    9.76% [kernel] [k] _raw_spin_unlock_irqrestore
    6.50% [kernel] [k] mod_timer
    6.44% [kernel] [k] lock_timer_base.isra.38
    3.87% [kernel] [k] detach_if_pending
    3.80% [kernel] [k] del_timer
    2.67% [kernel] [k] internal_add_timer
    1.33% [kernel] [k] __internal_add_timer
    0.73% [kernel] [k] timerfn
    0.54% [kernel] [k] wake_up_nohz_cpu

    Reported-by: Eric Dumazet
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Paul McKenney
    Cc: Frederic Weisbecker
    Cc: Viresh Kumar
    Cc: John Stultz
    Cc: Joonwoo Park
    Cc: Wenbo Wang
    Link: http://lkml.kernel.org/r/20150526224512.127050787@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

17 Apr, 2015

2 commits

  • When converting unsigned long to int overflows may occur. These currently
    are not detected when writing to the sysctl file system.

    E.g. on a system where int has 32 bits and long has 64 bits
    echo 0x800001234 > /proc/sys/kernel/threads-max
    has the same effect as
    echo 0x1234 > /proc/sys/kernel/threads-max

    The patch adds the missing check in do_proc_dointvec_conv.

    With the patch an overflow will result in an error EINVAL when writing to
    the the sysctl file system.

    Signed-off-by: Heinrich Schuchardt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heinrich Schuchardt
     
  • Users can change the maximum number of threads by writing to
    /proc/sys/kernel/threads-max.

    With the patch the value entered is checked against the same limits that
    apply when fork_init is called.

    Signed-off-by: Heinrich Schuchardt
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Guenter Roeck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heinrich Schuchardt
     

16 Apr, 2015

1 commit

  • Currently, pages which are marked as unevictable are protected from
    compaction, but not from other types of migration. The POSIX real time
    extension explicitly states that mlock() will prevent a major page
    fault, but the spirit of this is that mlock() should give a process the
    ability to control sources of latency, including minor page faults.
    However, the mlock manpage only explicitly says that a locked page will
    not be written to swap and this can cause some confusion. The
    compaction code today does not give a developer who wants to avoid swap
    but wants to have large contiguous areas available any method to achieve
    this state. This patch introduces a sysctl for controlling compaction
    behavior with respect to the unevictable lru. Users who demand no page
    faults after a page is present can set compact_unevictable_allowed to 0
    and users who need the large contiguous areas can enable compaction on
    locked memory by leaving the default value of 1.

    To illustrate this problem I wrote a quick test program that mmaps a
    large number of 1MB files filled with random data. These maps are
    created locked and read only. Then every other mmap is unmapped and I
    attempt to allocate huge pages to the static huge page pool. When the
    compact_unevictable_allowed sysctl is 0, I cannot allocate hugepages
    after fragmenting memory. When the value is set to 1, allocations
    succeed.

    Signed-off-by: Eric B Munson
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Acked-by: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Thomas Gleixner
    Cc: Christoph Lameter
    Cc: Peter Zijlstra
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     

15 Apr, 2015

3 commits

  • Merge first patchbomb from Andrew Morton:

    - arch/sh updates

    - ocfs2 updates

    - kernel/watchdog feature

    - about half of mm/

    * emailed patches from Andrew Morton : (122 commits)
    Documentation: update arch list in the 'memtest' entry
    Kconfig: memtest: update number of test patterns up to 17
    arm: add support for memtest
    arm64: add support for memtest
    memtest: use phys_addr_t for physical addresses
    mm: move memtest under mm
    mm, hugetlb: abort __get_user_pages if current has been oom killed
    mm, mempool: do not allow atomic resizing
    memcg: print cgroup information when system panics due to panic_on_oom
    mm: numa: remove migrate_ratelimited
    mm: fold arch_randomize_brk into ARCH_HAS_ELF_RANDOMIZE
    mm: split ET_DYN ASLR from mmap ASLR
    s390: redefine randomize_et_dyn for ELF_ET_DYN_BASE
    mm: expose arch_mmap_rnd when available
    s390: standardize mmap_rnd() usage
    powerpc: standardize mmap_rnd() usage
    mips: extract logic for mmap_rnd()
    arm64: standardize mmap_rnd() usage
    x86: standardize mmap_rnd() usage
    arm: factor out mmap ASLR into mmap_rnd
    ...

    Linus Torvalds
     
  • With the current user interface of the watchdog mechanism it is only
    possible to disable or enable both lockup detectors at the same time.
    This series introduces new kernel parameters and changes the semantics of
    some existing kernel parameters, so that the hard lockup detector and the
    soft lockup detector can be disabled or enabled individually. With this
    series applied, the user interface is as follows.

    - parameters in /proc/sys/kernel

    . soft_watchdog
    This is a new parameter to control and examine the run state of
    the soft lockup detector.

    . nmi_watchdog
    The semantics of this parameter have changed. It can now be used
    to control and examine the run state of the hard lockup detector.

    . watchdog
    This parameter is still available to control the run state of both
    lockup detectors at the same time. If this parameter is examined,
    it shows the logical OR of soft_watchdog and nmi_watchdog.

    . watchdog_thresh
    The semantics of this parameter are not affected by the patch.

    - kernel command line parameters

    . nosoftlockup
    The semantics of this parameter have changed. It can now be used
    to disable the soft lockup detector at boot time.

    . nmi_watchdog=0 or nmi_watchdog=1
    Disable or enable the hard lockup detector at boot time. The patch
    introduces '=1' as a new option.

    . nowatchdog
    The semantics of this parameter are not affected by the patch. It
    is still available to disable both lockup detectors at boot time.

    Also, remove the proc_dowatchdog() function which is no longer needed.

    [dzickus@redhat.com: wrote changelog]
    [dzickus@redhat.com: update documentation for kernel params and sysctl]
    Signed-off-by: Ulrich Obergfell
    Signed-off-by: Don Zickus
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Obergfell
     
  • Pull vfs update from Al Viro:
    "Part one:

    - struct filename-related cleanups

    - saner iov_iter_init() replacements (and switching the syscalls to
    use of those)

    - ntfs switch to ->write_iter() (Anton)

    - aio cleanups and splitting iocb into common and async parts
    (Christoph)

    - assorted fixes (me, bfields, Andrew Elble)

    There's a lot more, including the completion of switchover to
    ->{read,write}_iter(), d_inode/d_backing_inode annotations, f_flags
    race fixes, etc, but that goes after #for-davem merge. David has
    pulled it, and once it's in I'll send the next vfs pull request"

    * 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (35 commits)
    sg_start_req(): use import_iovec()
    sg_start_req(): make sure that there's not too many elements in iovec
    blk_rq_map_user(): use import_single_range()
    sg_io(): use import_iovec()
    process_vm_access: switch to {compat_,}import_iovec()
    switch keyctl_instantiate_key_common() to iov_iter
    switch {compat_,}do_readv_writev() to {compat_,}import_iovec()
    aio_setup_vectored_rw(): switch to {compat_,}import_iovec()
    vmsplice_to_user(): switch to import_iovec()
    kill aio_setup_single_vector()
    aio: simplify arguments of aio_setup_..._rw()
    aio: lift iov_iter_init() into aio_setup_..._rw()
    lift iov_iter into {compat_,}do_readv_writev()
    NFS: fix BUG() crash in notify_change() with patch to chown_common()
    dcache: return -ESTALE not -EBUSY on distributed fs race
    NTFS: Version 2.1.32 - Update file write from aio_write to write_iter.
    VFS: Add iov_iter_fault_in_multipages_readable()
    drop bogus check in file_open_root()
    switch security_inode_getattr() to struct path *
    constify tomoyo_realpath_from_path()
    ...

    Linus Torvalds
     

26 Mar, 2015

1 commit


18 Mar, 2015

1 commit


11 Feb, 2015

1 commit

  • Commit ed4d4902ebdd ("mm, hugetlb: remove hugetlb_zero and
    hugetlb_infinity") replaced 'unsigned long hugetlb_zero' with 'int zero'
    leading to out-of-bounds access in proc_doulongvec_minmax(). Use
    '.extra1 = NULL' instead of '.extra1 = &zero'. Passing NULL is
    equivalent to passing minimal value, which is 0 for unsigned types.

    Fixes: ed4d4902ebdd ("mm, hugetlb: remove hugetlb_zero and hugetlb_infinity")
    Signed-off-by: Andrey Ryabinin
    Reported-by: Dmitry Vyukov
    Suggested-by: Manfred Spraul
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

17 Dec, 2014

1 commit

  • Pull tracing updates from Steven Rostedt:
    "As the merge window is still open, and this code was not as complex as
    I thought it might be. I'm pushing this in now.

    This will allow Thomas to debug his irq work for 3.20.

    This adds two new features:

    1) Allow traceopoints to be enabled right after mm_init().

    By passing in the trace_event= kernel command line parameter,
    tracepoints can be enabled at boot up. For debugging things like
    the initialization of interrupts, it is needed to have tracepoints
    enabled very early. People have asked about this before and this
    has been on my todo list. As it can be helpful for Thomas to debug
    his upcoming 3.20 IRQ work, I'm pushing this now. This way he can
    add tracepoints into the IRQ set up and have users enable them when
    things go wrong.

    2) Have the tracepoints printed via printk() (the console) when they
    are triggered.

    If the irq code locks up or reboots the box, having the tracepoint
    output go into the kernel ring buffer is useless for debugging.
    But being able to add the tp_printk kernel command line option
    along with the trace_event= option will have these tracepoints
    printed as they occur, and that can be really useful for debugging
    early lock up or reboot problems.

    This code is not that intrusive and it passed all my tests. Thomas
    tried them out too and it works for his needs.

    Link: http://lkml.kernel.org/r/20141214201609.126831471@goodmis.org"

    * tag 'trace-3.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: Add tp_printk cmdline to have tracepoints go to printk()
    tracing: Move enabling tracepoints to just after rcu_init()

    Linus Torvalds
     

15 Dec, 2014

1 commit

  • Add the kernel command line tp_printk option that will have tracepoints
    that are active sent to printk() as well as to the trace buffer.

    Passing "tp_printk" will activate this. To turn it off, the sysctl
    /proc/sys/kernel/tracepoint_printk can have '0' echoed into it. Note,
    this only works if the cmdline option is used. Echoing 1 into the sysctl
    file without the cmdline option will have no affect.

    Note, this is a dangerous option. Having high frequency tracepoints send
    their data to printk() can possibly cause a live lock. This is another
    reason why this is only active if the command line option is used.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1412121539300.16494@nanos

    Suggested-by: Thomas Gleixner
    Tested-by: Thomas Gleixner
    Acked-by: Thomas Gleixner
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     

11 Dec, 2014

1 commit

  • There have been several times where I have had to rebuild a kernel to
    cause a panic when hitting a WARN() in the code in order to get a crash
    dump from a system. Sometimes this is easy to do, other times (such as
    in the case of a remote admin) it is not trivial to send new images to
    the user.

    A much easier method would be a switch to change the WARN() over to a
    panic. This makes debugging easier in that I can now test the actual
    image the WARN() was seen on and I do not have to engage in remote
    debugging.

    This patch adds a panic_on_warn kernel parameter and
    /proc/sys/kernel/panic_on_warn calls panic() in the
    warn_slowpath_common() path. The function will still print out the
    location of the warning.

    An example of the panic_on_warn output:

    The first line below is from the WARN_ON() to output the WARN_ON()'s
    location. After that the panic() output is displayed.

    WARNING: CPU: 30 PID: 11698 at /home/prarit/dummy_module/dummy-module.c:25 init_dummy+0x1f/0x30 [dummy_module]()
    Kernel panic - not syncing: panic_on_warn set ...

    CPU: 30 PID: 11698 Comm: insmod Tainted: G W OE 3.17.0+ #57
    Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.00.29.D696.1311111329 11/11/2013
    0000000000000000 000000008e3f87df ffff88080f093c38 ffffffff81665190
    0000000000000000 ffffffff818aea3d ffff88080f093cb8 ffffffff8165e2ec
    ffffffff00000008 ffff88080f093cc8 ffff88080f093c68 000000008e3f87df
    Call Trace:
    [] dump_stack+0x46/0x58
    [] panic+0xd0/0x204
    [] ? init_dummy+0x1f/0x30 [dummy_module]
    [] warn_slowpath_common+0xd0/0xd0
    [] ? dummy_greetings+0x40/0x40 [dummy_module]
    [] warn_slowpath_null+0x1a/0x20
    [] init_dummy+0x1f/0x30 [dummy_module]
    [] do_one_initcall+0xd4/0x210
    [] ? __vunmap+0xc2/0x110
    [] load_module+0x16a9/0x1b30
    [] ? store_uevent+0x70/0x70
    [] ? copy_module_from_fd.isra.44+0x129/0x180
    [] SyS_finit_module+0xa6/0xd0
    [] system_call_fastpath+0x12/0x17

    Successfully tested by me.

    hpa said: There is another very valid use for this: many operators would
    rather a machine shuts down than being potentially compromised either
    functionally or security-wise.

    Signed-off-by: Prarit Bhargava
    Cc: Jonathan Corbet
    Cc: Rusty Russell
    Cc: "H. Peter Anvin"
    Cc: Andi Kleen
    Cc: Masami Hiramatsu
    Acked-by: Yasuaki Ishimatsu
    Cc: Fabian Frederick
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Prarit Bhargava
     

28 Oct, 2014

1 commit

  • File /proc/sys/kernel/numa_balancing_scan_size_mb allows writing of zero.

    This bash command reproduces problem:

    $ while :; do echo 0 > /proc/sys/kernel/numa_balancing_scan_size_mb; \
    echo 256 > /proc/sys/kernel/numa_balancing_scan_size_mb; done

    divide error: 0000 [#1] SMP
    Modules linked in:
    CPU: 0 PID: 24112 Comm: bash Not tainted 3.17.0+ #8
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    task: ffff88013c852600 ti: ffff880037a68000 task.ti: ffff880037a68000
    RIP: 0010:[] [] task_scan_min+0x21/0x50
    RSP: 0000:ffff880037a6bce0 EFLAGS: 00010246
    RAX: 0000000000000a00 RBX: 00000000000003e8 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88013c852600
    RBP: ffff880037a6bcf0 R08: 0000000000000001 R09: 0000000000015c90
    R10: ffff880239bf6c00 R11: 0000000000000016 R12: 0000000000003fff
    R13: ffff88013c852600 R14: ffffea0008d1b000 R15: 0000000000000003
    FS: 00007f12bb048700(0000) GS:ffff88007da00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000001505678 CR3: 0000000234770000 CR4: 00000000000006f0
    Stack:
    ffff88013c852600 0000000000003fff ffff880037a6bd18 ffffffff810741d1
    ffff88013c852600 0000000000003fff 000000000002bfff ffff880037a6bda8
    ffffffff81077ef7 ffffea0008a56d40 0000000000000001 0000000000000001
    Call Trace:
    [] task_scan_max+0x11/0x40
    [] task_numa_fault+0x1f7/0xae0
    [] ? migrate_misplaced_page+0x276/0x300
    [] handle_mm_fault+0x62d/0xba0
    [] __do_page_fault+0x191/0x510
    [] ? native_smp_send_reschedule+0x42/0x60
    [] ? check_preempt_curr+0x80/0xa0
    [] ? wake_up_new_task+0x11c/0x1a0
    [] ? do_fork+0x14d/0x340
    [] ? get_unused_fd_flags+0x2b/0x30
    [] ? __fd_install+0x1f/0x60
    [] do_page_fault+0xc/0x10
    [] page_fault+0x22/0x30
    RIP [] task_scan_min+0x21/0x50
    RSP
    ---[ end trace 9a826d16936c04de ]---

    Also fix race in task_scan_min (it depends on compiler behaviour).

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Aaron Tomlin
    Cc: Andrew Morton
    Cc: Dario Faggioli
    Cc: David Rientjes
    Cc: Jens Axboe
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/1413455977.24793.78.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     

13 Oct, 2014

1 commit

  • Pull RCU updates from Ingo Molnar:
    "The main changes in this cycle were:

    - changes related to No-CBs CPUs and NO_HZ_FULL

    - RCU-tasks implementation

    - torture-test updates

    - miscellaneous fixes

    - locktorture updates

    - RCU documentation updates"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (81 commits)
    workqueue: Use cond_resched_rcu_qs macro
    workqueue: Add quiescent state between work items
    locktorture: Cleanup header usage
    locktorture: Cannot hold read and write lock
    locktorture: Fix __acquire annotation for spinlock irq
    locktorture: Support rwlocks
    rcu: Eliminate deadlock between CPU hotplug and expedited grace periods
    locktorture: Document boot/module parameters
    rcutorture: Rename rcutorture_runnable parameter
    locktorture: Add test scenario for rwsem_lock
    locktorture: Add test scenario for mutex_lock
    locktorture: Make torture scripting account for new _runnable name
    locktorture: Introduce torture context
    locktorture: Support rwsems
    locktorture: Add infrastructure for torturing read locks
    torture: Address race in module cleanup
    locktorture: Make statistics generic
    locktorture: Teach about lock debugging
    locktorture: Support mutexes
    locktorture: Add documentation
    ...

    Linus Torvalds
     

10 Oct, 2014

1 commit

  • The deprecation warnings for the scan_unevictable interface triggers by
    scripts doing `sysctl -a | grep something else'. This is annoying and not
    helpful.

    The interface has been defunct since 264e56d8247e ("mm: disable user
    interface to manually rescue unevictable pages"), which was in 2011, and
    there haven't been any reports of usecases for it, only reports that the
    deprecation warnings are annying. It's unlikely that anybody is using
    this interface specifically at this point, so remove it.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

17 Sep, 2014

1 commit


07 Aug, 2014

1 commit

  • They are unnecessary: "zero" can be used in place of "hugetlb_zero" and
    passing extra2 == NULL is equivalent to infinity.

    Signed-off-by: David Rientjes
    Cc: Joonsoo Kim
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Luiz Capitulino
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

24 Jun, 2014

2 commits

  • A 'softlockup' is defined as a bug that causes the kernel to loop in
    kernel mode for more than a predefined period to time, without giving
    other tasks a chance to run.

    Currently, upon detection of this condition by the per-cpu watchdog
    task, debug information (including a stack trace) is sent to the system
    log.

    On some occasions, we have observed that the "victim" rather than the
    actual "culprit" (i.e. the owner/holder of the contended resource) is
    reported to the user. Often this information has proven to be
    insufficient to assist debugging efforts.

    To avoid loss of useful debug information, for architectures which
    support NMI, this patch makes it possible to improve soft lockup
    reporting. This is accomplished by issuing an NMI to each cpu to obtain
    a stack trace.

    If NMI is not supported we just revert back to the old method. A sysctl
    and boot-time parameter is available to toggle this feature.

    [dzickus@redhat.com: add CONFIG_SMP in certain areas]
    [akpm@linux-foundation.org: additional CONFIG_SMP=n optimisations]
    [mq@suse.cz: fix warning]
    Signed-off-by: Aaron Tomlin
    Signed-off-by: Don Zickus
    Cc: David S. Miller
    Cc: Mateusz Guzik
    Cc: Oleg Nesterov
    Signed-off-by: Jan Moskyto Matejka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Tomlin
     
  • Oleg reports a division by zero error on zero-length write() to the
    percpu_pagelist_fraction sysctl:

    divide error: 0000 [#1] SMP DEBUG_PAGEALLOC
    CPU: 1 PID: 9142 Comm: badarea_io Not tainted 3.15.0-rc2-vm-nfs+ #19
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    task: ffff8800d5aeb6e0 ti: ffff8800d87a2000 task.ti: ffff8800d87a2000
    RIP: 0010: percpu_pagelist_fraction_sysctl_handler+0x84/0x120
    RSP: 0018:ffff8800d87a3e78 EFLAGS: 00010246
    RAX: 0000000000000f89 RBX: ffff88011f7fd000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000010
    RBP: ffff8800d87a3e98 R08: ffffffff81d002c8 R09: ffff8800d87a3f50
    R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000060
    R13: ffffffff81c3c3e0 R14: ffffffff81cfddf8 R15: ffff8801193b0800
    FS: 00007f614f1e9740(0000) GS:ffff88011f440000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007f614f1fa000 CR3: 00000000d9291000 CR4: 00000000000006e0
    Call Trace:
    proc_sys_call_handler+0xb3/0xc0
    proc_sys_write+0x14/0x20
    vfs_write+0xba/0x1e0
    SyS_write+0x46/0xb0
    tracesys+0xe1/0xe6

    However, if the percpu_pagelist_fraction sysctl is set by the user, it
    is also impossible to restore it to the kernel default since the user
    cannot write 0 to the sysctl.

    This patch allows the user to write 0 to restore the default behavior.
    It still requires a fraction equal to or larger than 8, however, as
    stated by the documentation for sanity. If a value in the range [1, 7]
    is written, the sysctl will return EINVAL.

    This successfully solves the divide by zero issue at the same time.

    Signed-off-by: David Rientjes
    Reported-by: Oleg Drokin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

20 Jun, 2014

1 commit

  • Pull sparc fixes from David Miller:
    "Sparc sparse fixes from Sam Ravnborg"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next: (67 commits)
    sparc64: fix sparse warnings in int_64.c
    sparc64: fix sparse warning in ftrace.c
    sparc64: fix sparse warning in kprobes.c
    sparc64: fix sparse warning in kgdb_64.c
    sparc64: fix sparse warnings in compat_audit.c
    sparc64: fix sparse warnings in init_64.c
    sparc64: fix sparse warnings in aes_glue.c
    sparc: fix sparse warnings in smp_32.c + smp_64.c
    sparc64: fix sparse warnings in perf_event.c
    sparc64: fix sparse warnings in kprobes.c
    sparc64: fix sparse warning in tsb.c
    sparc64: clean up compat_sigset_t.seta handling
    sparc64: fix sparse "Should it be static?" warnings in signal32.c
    sparc64: fix sparse warnings in sys_sparc32.c
    sparc64: fix sparse warning in pci.c
    sparc64: fix sparse warnings in smp_64.c
    sparc64: fix sparse warning in prom_64.c
    sparc64: fix sparse warning in btext.c
    sparc64: fix sparse warnings in sys_sparc_64.c + unaligned_64.c
    sparc64: fix sparse warning in process_64.c
    ...

    Conflicts:
    arch/sparc/include/asm/pgtable_64.h

    Linus Torvalds
     

13 Jun, 2014

1 commit

  • Pull networking updates from David Miller:

    1) Seccomp BPF filters can now be JIT'd, from Alexei Starovoitov.

    2) Multiqueue support in xen-netback and xen-netfront, from Andrew J
    Benniston.

    3) Allow tweaking of aggregation settings in cdc_ncm driver, from Bjørn
    Mork.

    4) BPF now has a "random" opcode, from Chema Gonzalez.

    5) Add more BPF documentation and improve test framework, from Daniel
    Borkmann.

    6) Support TCP fastopen over ipv6, from Daniel Lee.

    7) Add software TSO helper functions and use them to support software
    TSO in mvneta and mv643xx_eth drivers. From Ezequiel Garcia.

    8) Support software TSO in fec driver too, from Nimrod Andy.

    9) Add Broadcom SYSTEMPORT driver, from Florian Fainelli.

    10) Handle broadcasts more gracefully over macvlan when there are large
    numbers of interfaces configured, from Herbert Xu.

    11) Allow more control over fwmark used for non-socket based responses,
    from Lorenzo Colitti.

    12) Do TCP congestion window limiting based upon measurements, from Neal
    Cardwell.

    13) Support busy polling in SCTP, from Neal Horman.

    14) Allow RSS key to be configured via ethtool, from Venkata Duvvuru.

    15) Bridge promisc mode handling improvements from Vlad Yasevich.

    16) Don't use inetpeer entries to implement ID generation any more, it
    performs poorly, from Eric Dumazet.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1522 commits)
    rtnetlink: fix userspace API breakage for iproute2 < v3.9.0
    tcp: fixing TLP's FIN recovery
    net: fec: Add software TSO support
    net: fec: Add Scatter/gather support
    net: fec: Increase buffer descriptor entry number
    net: fec: Factorize feature setting
    net: fec: Enable IP header hardware checksum
    net: fec: Factorize the .xmit transmit function
    bridge: fix compile error when compiling without IPv6 support
    bridge: fix smatch warning / potential null pointer dereference
    via-rhine: fix full-duplex with autoneg disable
    bnx2x: Enlarge the dorq threshold for VFs
    bnx2x: Check for UNDI in uncommon branch
    bnx2x: Fix 1G-baseT link
    bnx2x: Fix link for KR with swapped polarity lane
    sctp: Fix sk_ack_backlog wrap-around problem
    net/core: Add VF link state control policy
    net/fsl: xgmac_mdio is dependent on OF_MDIO
    net/fsl: Make xgmac_mdio read error message useful
    net_sched: drr: warn when qdisc is not work conserving
    ...

    Linus Torvalds
     

07 Jun, 2014

4 commits

  • This typedef is unnecessary and should just be removed.

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • When writing to a sysctl string, each write, regardless of VFS position,
    begins writing the string from the start. This means the contents of
    the last write to the sysctl controls the string contents instead of the
    first:

    open("/proc/sys/kernel/modprobe", O_WRONLY) = 1
    write(1, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 4096) = 4096
    write(1, "/bin/true", 9) = 9
    close(1) = 0

    $ cat /proc/sys/kernel/modprobe
    /bin/true

    Expected behaviour would be to have the sysctl be "AAAA..." capped at
    maxlen (in this case KMOD_PATH_LEN: 256), instead of truncating to the
    contents of the second write. Similarly, multiple short writes would
    not append to the sysctl.

    The old behavior is unlike regular POSIX files enough that doing audits
    of software that interact with sysctls can end up in unexpected or
    dangerous situations. For example, "as long as the input starts with a
    trusted path" turns out to be an insufficient filter, as what must also
    happen is for the input to be entirely contained in a single write
    syscall -- not a common consideration, especially for high level tools.

    This provides kernel.sysctl_writes_strict as a way to make this behavior
    act in a less surprising manner for strings, and disallows non-zero file
    position when writing numeric sysctls (similar to what is already done
    when reading from non-zero file positions). For now, the default (0) is
    to warn about non-zero file position use, but retain the legacy
    behavior. Setting this to -1 disables the warning, and setting this to
    1 enables the file position respecting behavior.

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: move misplaced hunk, per Randy]
    Signed-off-by: Kees Cook
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Consolidate buffer length checking with new-line/end-of-line checking.
    Additionally, instead of reading user memory twice, just do the
    assignment during the loop.

    This change doesn't affect the potential races here. It was already
    possible to read a sysctl that was in the middle of a write. In both
    cases, the string will always be NULL terminated. The pre-existing race
    remains a problem to be solved.

    Signed-off-by: Kees Cook
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • When writing to a sysctl string, each write, regardless of VFS position,
    began writing the string from the start. This meant the contents of the
    last write to the sysctl controlled the string contents instead of the
    first.

    This misbehavior was featured in an exploit against Chrome OS. While
    it's not in itself a vulnerability, it's a weirdness that isn't on the
    mind of most auditors: "This filter looks correct, the first line
    written would not be meaningful to sysctl" doesn't apply here, since the
    size of the write and the contents of the final write are what matter
    when writing to sysctls.

    This adds the sysctl kernel.sysctl_writes_strict to control the write
    behavior. The default (0) reports when VFS position is non-0 on a
    write, but retains legacy behavior, -1 disables the warning, and 1
    enables the position-respecting behavior.

    The long-term plan here is to wait for userspace to be fixed in response
    to the new warning and to then switch the default kernel behavior to the
    new position-respecting behavior.

    This patch (of 4):

    The char buffer arguments are needlessly cast in weird places. Clean it
    up so things are easier to read.

    Signed-off-by: Kees Cook
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

05 Jun, 2014

1 commit

  • Pull x86 cdso updates from Peter Anvin:
    "Vdso cleanups and improvements largely from Andy Lutomirski. This
    makes the vdso a lot less ''special''"

    * 'x86/vdso' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/vdso, build: Make LE access macros clearer, host-safe
    x86/vdso, build: Fix cross-compilation from big-endian architectures
    x86/vdso, build: When vdso2c fails, unlink the output
    x86, vdso: Fix an OOPS accessing the HPET mapping w/o an HPET
    x86, mm: Replace arch_vma_name with vm_ops->name for vsyscalls
    x86, mm: Improve _install_special_mapping and fix x86 vdso naming
    mm, fs: Add vm_ops->name as an alternative to arch_vma_name
    x86, vdso: Fix an OOPS accessing the HPET mapping w/o an HPET
    x86, vdso: Remove vestiges of VDSO_PRELINK and some outdated comments
    x86, vdso: Move the vvar and hpet mappings next to the 64-bit vDSO
    x86, vdso: Move the 32-bit vdso special pages after the text
    x86, vdso: Reimplement vdso.so preparation in build-time C
    x86, vdso: Move syscall and sysenter setup into kernel/cpu/common.c
    x86, vdso: Clean up 32-bit vs 64-bit vdso params
    x86, mm: Ensure correct alignment of the fixmap

    Linus Torvalds
     

19 May, 2014

1 commit

  • Fix following warning:
    tsb.c:290:5: warning: symbol 'sysctl_tsb_ratio' was not declared. Should it be static?

    Add extern declaration in asm/setup.h and remove local declaration
    in kernel/sysctl.c

    Signed-off-by: Sam Ravnborg
    Signed-off-by: David S. Miller

    Sam Ravnborg
     

15 May, 2014

1 commit

  • ip_local_port_range is already per netns, so should ip_local_reserved_ports
    be. And since it is none by default we don't actually need it when we don't
    enable CONFIG_SYSCTL.

    By the way, rename inet_is_reserved_local_port() to inet_is_local_reserved_port()

    Cc: "David S. Miller"
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    WANG Cong
     

06 May, 2014

1 commit


26 Apr, 2014

1 commit


08 Apr, 2014

1 commit

  • As sysctl_hung_task_timeout_sec is unsigned long, when this value is
    larger then LONG_MAX/HZ, the function schedule_timeout_interruptible in
    watchdog will return immediately without sleep and with print :

    schedule_timeout: wrong timeout value ffffffffffffff83

    and then the funtion watchdog will call schedule_timeout_interruptible
    again and again. The screen will be filled with

    "schedule_timeout: wrong timeout value ffffffffffffff83"

    This patch does some check and correction in sysctl, to let the function
    schedule_timeout_interruptible allways get the valid parameter.

    Signed-off-by: Liu Hua
    Tested-by: Satoru Takeuchi
    Cc: [3.4+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Liu Hua
     

04 Apr, 2014

1 commit

  • There is plenty of anecdotal evidence and a load of blog posts
    suggesting that using "drop_caches" periodically keeps your system
    running in "tip top shape". Perhaps adding some kernel documentation
    will increase the amount of accurate data on its use.

    If we are not shrinking caches effectively, then we have real bugs.
    Using drop_caches will simply mask the bugs and make them harder to
    find, but certainly does not fix them, nor is it an appropriate
    "workaround" to limit the size of the caches. On the contrary, there
    have been bug reports on issues that turned out to be misguided use of
    cache dropping.

    Dropping caches is a very drastic and disruptive operation that is good
    for debugging and running tests, but if it creates bug reports from
    production use, kernel developers should be aware of its use.

    Add a bit more documentation about it, a syslog message to track down
    abusers, and vmstat drop counters to help analyze problem reports.

    [akpm@linux-foundation.org: checkpatch fixes]
    [hannes@cmpxchg.org: add runtime suppression control]
    Signed-off-by: Dave Hansen
    Signed-off-by: Michal Hocko
    Acked-by: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     

02 Apr, 2014

1 commit

  • Pull core block layer updates from Jens Axboe:
    "This is the pull request for the core block IO bits for the 3.15
    kernel. It's a smaller round this time, it contains:

    - Various little blk-mq fixes and additions from Christoph and
    myself.

    - Cleanup of the IPI usage from the block layer, and associated
    helper code. From Frederic Weisbecker and Jan Kara.

    - Duplicate code cleanup in bio-integrity from Gu Zheng. This will
    give you a merge conflict, but that should be easy to resolve.

    - blk-mq notify spinlock fix for RT from Mike Galbraith.

    - A blktrace partial accounting bug fix from Roman Pen.

    - Missing REQ_SYNC detection fix for blk-mq from Shaohua Li"

    * 'for-3.15/core' of git://git.kernel.dk/linux-block: (25 commits)
    blk-mq: add REQ_SYNC early
    rt,blk,mq: Make blk_mq_cpu_notify_lock a raw spinlock
    blk-mq: support partial I/O completions
    blk-mq: merge blk_mq_insert_request and blk_mq_run_request
    blk-mq: remove blk_mq_alloc_rq
    blk-mq: don't dump CPU -> hw queue map on driver load
    blk-mq: fix wrong usage of hctx->state vs hctx->flags
    blk-mq: allow blk_mq_init_commands() to return failure
    block: remove old blk_iopoll_enabled variable
    blktrace: fix accounting of partially completed requests
    smp: Rename __smp_call_function_single() to smp_call_function_single_async()
    smp: Remove wait argument from __smp_call_function_single()
    watchdog: Simplify a little the IPI call
    smp: Move __smp_call_function_single() below its safe version
    smp: Consolidate the various smp_call_function_single() declensions
    smp: Teach __smp_call_function_single() to check for offline cpus
    smp: Remove unused list_head from csd
    smp: Iterate functions through llist_for_each_entry_safe()
    block: Stop abusing rq->csd.list in blk-softirq
    block: Remove useless IPI struct initialization
    ...

    Linus Torvalds
     

13 Mar, 2014

1 commit

  • This was a debugging measure to toggle enabled/disabled
    when testing. But for real production setups, it's not
    safe to toggle this setting without either reloading
    drivers of quiescing IO first. Neither of which the toggle
    enforces.

    Additionally, it makes drivers deal with the conditional
    state.

    Remove it completely. It's up to the driver whether iopoll
    is enabled or not.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

02 Feb, 2014

1 commit


01 Feb, 2014

1 commit

  • Pull core debug changes from Ingo Molnar:
    "This contains mostly kernel debugging related updates:

    - make hung_task detection more configurable to distros
    - add final bits for x86 UV NMI debugging, with related KGDB changes
    - update the mailing-list of MAINTAINERS entries I'm involved with"

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    hung_task: Display every hung task warning
    sysctl: Add neg_one as a standard constraint
    x86/uv/nmi, kgdb/kdb: Fix UV NMI handler when KDB not configured
    x86/uv/nmi: Fix Sparse warnings
    kgdb/kdb: Fix no KDB config problem
    MAINTAINERS: Restore "L: linux-kernel@vger.kernel.org" entries

    Linus Torvalds
     

28 Jan, 2014

1 commit

  • Excessive migration of pages can hurt the performance of workloads
    that span multiple NUMA nodes. However, it turns out that the
    p->numa_migrate_deferred knob is a really big hammer, which does
    reduce migration rates, but does not actually help performance.

    Now that the second stage of the automatic numa balancing code
    has stabilized, it is time to replace the simplistic migration
    deferral code with something smarter.

    Signed-off-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Peter Zijlstra
    Cc: Chegu Vinod
    Link: http://lkml.kernel.org/r/1390860228-21539-2-git-send-email-riel@redhat.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     

25 Jan, 2014

1 commit

  • When khungtaskd detects hung tasks, it prints out
    backtraces from a number of those tasks.

    Limiting the number of backtraces being printed
    out can result in the user not seeing the information
    necessary to debug the issue. The hung_task_warnings
    sysctl controls this feature.

    This patch makes it possible for hung_task_warnings
    to accept a special value to print an unlimited
    number of backtraces when khungtaskd detects hung
    tasks.

    The special value is -1. To use this value it is
    necessary to change types from ulong to int.

    Signed-off-by: Aaron Tomlin
    Reviewed-by: Rik van Riel
    Acked-by: David Rientjes
    Cc: oleg@redhat.com
    Link: http://lkml.kernel.org/r/1390239253-24030-3-git-send-email-atomlin@redhat.com
    [ Build warning fix. ]
    Signed-off-by: Ingo Molnar

    Aaron Tomlin