07 Jun, 2014

23 commits

  • + fix small typo

    Signed-off-by: Fabian Frederick
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • trailing whitespace

    Signed-off-by: Paul McQuade
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul McQuade
     
  • Use #include instead of

    Signed-off-by: Paul McQuade
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul McQuade
     
  • schedstr, sleepstr and kvmstr are only used in strcmp & strlen

    Signed-off-by: Fabian Frederick
    Cc: Paul Gortmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • Signed-off-by: Fabian Frederick
    Cc: Paul Gortmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • -uid->gid
    -split some function declarations
    -if/then/else warning

    Signed-off-by: Fabian Frederick
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • When writing to a sysctl string, each write, regardless of VFS position,
    begins writing the string from the start. This means the contents of
    the last write to the sysctl controls the string contents instead of the
    first:

    open("/proc/sys/kernel/modprobe", O_WRONLY) = 1
    write(1, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 4096) = 4096
    write(1, "/bin/true", 9) = 9
    close(1) = 0

    $ cat /proc/sys/kernel/modprobe
    /bin/true

    Expected behaviour would be to have the sysctl be "AAAA..." capped at
    maxlen (in this case KMOD_PATH_LEN: 256), instead of truncating to the
    contents of the second write. Similarly, multiple short writes would
    not append to the sysctl.

    The old behavior is different enough from regular POSIX files that
    audits of software that interacts with sysctls can miss unexpected or
    dangerous situations. For example, "as long as the input starts with a
    trusted path" turns out to be an insufficient filter, as what must also
    happen is for the input to be entirely contained in a single write
    syscall -- not a common consideration, especially for high level tools.

    This provides kernel.sysctl_writes_strict as a way to make this behavior
    act in a less surprising manner for strings, and disallows non-zero file
    position when writing numeric sysctls (similar to what is already done
    when reading from non-zero file positions). For now, the default (0) is
    to warn about non-zero file position use, but retain the legacy
    behavior. Setting this to -1 disables the warning, and setting this to
    1 enables the file position respecting behavior.

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: move misplaced hunk, per Randy]
    Signed-off-by: Kees Cook
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
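
A minimal userspace sketch of the two write behaviours described in this commit, written in Python purely for illustration. It is a toy model, not the kernel's implementation: the class name SysctlString, the helper replay(), and the terminator handling (the stored string is capped at maxlen - 1 bytes) are assumptions.

```python
class SysctlString:
    """Toy model of a string sysctl backed by a fixed-size buffer."""

    def __init__(self, maxlen, strict=False):
        self.maxlen = maxlen    # buffer size; one byte is reserved for the terminator (assumption)
        self.strict = strict    # True models kernel.sysctl_writes_strict = 1
        self.value = b""        # current string contents
        self.pos = 0            # file position, advanced by accepted writes

    def write(self, data):
        if self.strict:
            # Position-respecting: continue at the file position, and ignore
            # writes that would start past the end of the stored string.
            if self.pos > len(self.value):
                return
            start = self.pos
        else:
            # Legacy: every write starts again at the beginning of the buffer.
            start = 0
        self.pos += len(data)
        # The string ends at the first NUL or newline in the input.
        for stop in (b"\0", b"\n"):
            data = data.split(stop, 1)[0]
        room = max(self.maxlen - 1 - start, 0)
        self.value = self.value[:start] + data[:room]


def replay(writes, strict):
    """Apply a sequence of writes to a fresh 256-byte sysctl and return its value."""
    sysctl = SysctlString(maxlen=256, strict=strict)
    for chunk in writes:
        sysctl.write(chunk)
    return sysctl.value
```

In this model, replaying the two writes from the strace excerpt in legacy mode leaves b"/bin/true", while strict mode keeps the capped run of "A" bytes. Two short writes, b"/bin/" followed by b"true", only concatenate in strict mode.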
     
  • Consolidate buffer length checking with new-line/end-of-line checking.
    Additionally, instead of reading user memory twice, just do the
    assignment during the loop.

    This change doesn't affect the potential races here. It was already
    possible to read a sysctl that was in the middle of a write. In both
    cases, the string will always be NUL-terminated. The pre-existing race
    remains a problem to be solved.

    Signed-off-by: Kees Cook
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • When writing to a sysctl string, each write, regardless of VFS position,
    began writing the string from the start. This meant the contents of the
    last write to the sysctl controlled the string contents instead of the
    first.

    This misbehavior was featured in an exploit against Chrome OS. While
    it's not in itself a vulnerability, it's a weirdness that isn't on the
    mind of most auditors: "This filter looks correct, the first line
    written would not be meaningful to sysctl" doesn't apply here, since the
    size of the write and the contents of the final write are what matter
    when writing to sysctls.

    This adds the sysctl kernel.sysctl_writes_strict to control the write
    behavior. The default (0) reports when VFS position is non-0 on a
    write, but retains legacy behavior, -1 disables the warning, and 1
    enables the position-respecting behavior.

    The long-term plan here is to wait for userspace to be fixed in response
    to the new warning and to then switch the default kernel behavior to the
    new position-respecting behavior.

    This patch (of 4):

    The char buffer arguments are needlessly cast in weird places. Clean it
    up so things are easier to read.

    Signed-off-by: Kees Cook
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • + some pr_warning -> pr_warn and checkpatch warning fixes

    Signed-off-by: Fabian Frederick
    Cc: Eric Biederman
    Cc: Vivek Goyal
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • Add a "crash_kexec_post_notifiers" boot option to run kdump after
    running panic_notifiers and dump kmsg. This can help in rare situations
    where kdump fails because of an unstable crashed kernel or a hardware
    failure (memory corruption on critical data/code), or where the 2nd
    kernel is already broken by the 1st kernel (it's a broken behavior, but
    who can guarantee that the "crashed" kernel works correctly?).

    Usage: add "crash_kexec_post_notifiers" to the kernel boot options.

    Note that this actually increases risks of the failure of kdump. This
    option should be set only if you worry about the rare case of kdump
    failure rather than increasing the chance of success.

    Signed-off-by: Masami Hiramatsu
    Acked-by: Motohiro Kosaki
    Acked-by: Vivek Goyal
    Cc: Eric Biederman
    Cc: Yoshihiro YUNOMAE
    Cc: Satoru MORIYA
    Cc: Tomoki Sekiyama
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masami Hiramatsu
     
  • There is a longstanding problem related to CPU hotplug which causes IPIs
    to be delivered to offline CPUs, and the smp-call-function IPI handler
    code prints out a warning whenever this is detected. Every once in a
    while this (usually harmless) warning gets reported on LKML, but so far
    it has not been completely fixed. Usually the solution involves finding
    out the IPI sender and fixing it by adding appropriate synchronization
    with CPU hotplug.

    However, while going through one such internal bug report, I found that
    there is a significant bug in the receiver side itself (more
    specifically, in stop-machine) that can lead to this problem even when
    the sender code is perfectly fine. This patchset fixes that
    synchronization problem in the CPU hotplug stop-machine code.

    Patch 1 adds some additional debug code to the smp-call-function
    framework, to help debug such issues easily.

    Patch 2 modifies the stop-machine code to ensure that any IPIs that were
    sent while the target CPU was online, would be noticed and handled by
    that CPU without fail before it goes offline. Thus, this avoids
    scenarios where IPIs are received on offline CPUs (as long as the sender
    uses proper hotplug synchronization).

    In fact, I debugged the problem by using Patch 1, and found that the
    payload of the IPI was always the block layer's trigger_softirq()
    function. But I was not able to find anything wrong with the block
    layer code. That's when I started looking at the stop-machine code and
    realized that there is a race-window which makes the IPI _receiver_ the
    culprit, not the sender. Patch 2 fixes that race and hence this should
    put an end to most of the hard-to-debug IPI-to-offline-CPU issues.

    This patch (of 2):

    Today the smp-call-function code just prints a warning if we get an IPI
    on an offline CPU. This info is sufficient to let us know that
    something went wrong, but often it is very hard to debug exactly who
    sent the IPI and why, from this info alone.

    In most cases, we get the warning about the IPI to an offline CPU,
    immediately after the CPU going offline comes out of the stop-machine
    phase and reenables interrupts. Since all online CPUs participate in
    stop-machine, the information regarding the sender of the IPI is already
    lost by the time we exit the stop-machine loop. So even if we dump the
    stack on each CPU at this point, we won't find anything useful since all
    of them will show the stack-trace of the stopper thread. So we need a
    better way to figure out who sent the IPI and why.

    To achieve this, when we detect an IPI targeted to an offline CPU, loop
    through the call-single-data linked list and print out the payload
    (i.e., the name of the function which was supposed to be executed by the
    target CPU). This would give us an insight as to who might have sent
    the IPI and help us debug this further.

    [akpm@linux-foundation.org: correctly suppress warning output on second and later occurrences]
    Signed-off-by: Srivatsa S. Bhat
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Tejun Heo
    Cc: Rusty Russell
    Cc: Frederic Weisbecker
    Cc: Christoph Hellwig
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Borislav Petkov
    Cc: Steven Rostedt
    Cc: Mike Galbraith
    Cc: Gautham R Shenoy
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rafael J. Wysocki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Srivatsa S. Bhat
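
The debugging aid described in this patch can be sketched as a small Python model. This is only an illustration under stated assumptions: the Cpu and IpiHandler classes and the message wording are hypothetical, and trigger_softirq() is an empty stand-in that borrows its name from the commit message.

```python
from collections import deque


def trigger_softirq():
    """Stand-in payload; only the name is taken from the commit message."""


class Cpu:
    def __init__(self, cpu_id, online=True):
        self.cpu_id = cpu_id
        self.online = online
        self.call_queue = deque()   # queued functions, standing in for call-single-data


class IpiHandler:
    def __init__(self):
        self.warned = False         # report the first occurrence only

    def handle(self, cpu):
        """Return the diagnostic lines for this IPI, then run the queued functions."""
        report = []
        if not cpu.online and not self.warned:
            self.warned = True
            report.append(f"IPI on offline CPU {cpu.cpu_id}")
            # Walk the queue before running it, so the payload names are still available.
            for func in cpu.call_queue:
                report.append(f"IPI callback {func.__name__} sent to offline CPU")
        while cpu.call_queue:
            cpu.call_queue.popleft()()
        return report
```

Printing the payload name is what points at the sender: the stack of the receiving CPU no longer says anything about who queued the function.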
     
  • Now that we have kernel_sigaction() we can change wait_for_helper() to
    use it and clean up the code a bit.

    Signed-off-by: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Al Viro
    Cc: David Woodhouse
    Cc: Frederic Weisbecker
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Cc: Mathieu Desnoyers
    Cc: Richard Weinberger
    Cc: Steven Rostedt
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Now that allow_signal() is really trivial we can unify it with
    disallow_signal(). Add the new helper, kernel_sigaction(), and
    reimplement allow_signal/disallow_signal as trivial wrappers.

    This saves one EXPORT_SYMBOL() and the new helper can have more users.

    Signed-off-by: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Al Viro
    Cc: David Woodhouse
    Cc: Frederic Weisbecker
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Cc: Mathieu Desnoyers
    Cc: Richard Weinberger
    Cc: Steven Rostedt
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • disallow_signal() simply sets SIG_IGN. This is not enough, and
    recalc_sigpending() is pointless because it can never change the
    state of TIF_SIGPENDING.

    If we ignore a signal, we also need to do flush_sigqueue_mask() for the
    case when this signal is pending. This way recalc_sigpending() can
    actually clear TIF_SIGPENDING and we do not "leak" the allocated
    siginfo's.

    Signed-off-by: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Al Viro
    Cc: David Woodhouse
    Cc: Frederic Weisbecker
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Cc: Mathieu Desnoyers
    Cc: Richard Weinberger
    Cc: Steven Rostedt
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
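
A toy Python model of why ignoring a signal also has to flush it from the pending queue. Everything here is a simplification under assumptions: the Task class, the string dispositions and old_disallow_signal() are hypothetical, blocked masks are left out, and kernel_sigaction() only borrows the name of the helper mentioned elsewhere in this log.

```python
SIG_IGN = "SIG_IGN"
HANDLED = "handled"     # stand-in for "the kthread wants to receive this signal"


class Task:
    def __init__(self):
        self.action = {}            # signal number -> current disposition
        self.pending = []           # queued signal numbers, standing in for siginfo entries
        self.sigpending = False     # models TIF_SIGPENDING

    def recalc_sigpending(self):
        self.sigpending = bool(self.pending)

    def send(self, sig):
        # Signals are ignored unless explicitly allowed (assumption of this model).
        if self.action.get(sig, SIG_IGN) == SIG_IGN:
            return
        self.pending.append(sig)
        self.recalc_sigpending()

    def kernel_sigaction(self, sig, action):
        self.action[sig] = action
        if action == SIG_IGN:
            # Drop already-queued instances so the pending flag can really clear.
            self.pending = [s for s in self.pending if s != sig]
            self.recalc_sigpending()


def allow_signal(task, sig):
    task.kernel_sigaction(sig, HANDLED)


def disallow_signal(task, sig):
    task.kernel_sigaction(sig, SIG_IGN)


def old_disallow_signal(task, sig):
    """Behaviour before the fix: set SIG_IGN and recalculate, without flushing."""
    task.action[sig] = SIG_IGN
    task.recalc_sigpending()    # cannot clear the flag: the entry is still queued
```

With a signal already queued, old_disallow_signal() leaves both the queue entry and the pending flag in place, whereas disallow_signal() clears them.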
     
  • allow_signal() does sigdelset(current->blocked) for historical reasons:
    previously it could be called by a daemonize()'ed kthread, and
    daemonize() played with current->blocked.

    Now that daemonize() has gone away we can remove sigdelset() and
    recalc_sigpending(). If a user really wants to unblock a signal, it
    must use sigprocmask() or set_current_blocked() explicitly.

    Signed-off-by: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Al Viro
    Cc: David Woodhouse
    Cc: Frederic Weisbecker
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Cc: Mathieu Desnoyers
    Cc: Richard Weinberger
    Cc: Steven Rostedt
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Move the declaration/definition of allow_signal/disallow_signal to
    signal.h/signal.c. The new place is more logical and allows the use of
    the static helpers in signal.c (see the next changes).

    While at it, make them return void and remove the valid_signal() check.
    Nobody checks the returned value, and in-kernel users must not pass the
    wrong signal number.

    Signed-off-by: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Al Viro
    Cc: David Woodhouse
    Cc: Frederic Weisbecker
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Cc: Mathieu Desnoyers
    Cc: Richard Weinberger
    Cc: Steven Rostedt
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The usage of "task_struct *t" and "current" in do_sigaction() looks really
    annoying and chaotic. Initially "t" is used as a cached value of current
    but not consistently, then it is reused as a loop variable and we have to
    use "current" again.

    Clean up this mess and also convert the code to use for_each_thread().

    Signed-off-by: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Al Viro
    Cc: David Woodhouse
    Cc: Frederic Weisbecker
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Cc: Mathieu Desnoyers
    Cc: Richard Weinberger
    Cc: Steven Rostedt
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • "rm_from_queue_full" looks ugly and misleading, especially now that
    rm_from_queue() has gone away. Rename it to flush_sigqueue_mask(), this
    matches flush_sigqueue() we already have.

    Also remove the obsolete comment which explains the difference with
    rm_from_queue() we already killed.

    Signed-off-by: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Al Viro
    Cc: David Woodhouse
    Cc: Frederic Weisbecker
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Cc: Mathieu Desnoyers
    Cc: Richard Weinberger
    Cc: Steven Rostedt
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • rm_from_queue() doesn't make sense. The only caller, prepare_signal(),
    can use rm_from_queue_full() with the same effect.

    While at it, change prepare_signal() to use for_each_thread() instead of
    do/while_each_thread.

    Signed-off-by: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Al Viro
    Cc: David Woodhouse
    Cc: Frederic Weisbecker
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Cc: Mathieu Desnoyers
    Cc: Richard Weinberger
    Cc: Steven Rostedt
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Cosmetic, but siginitset(0) looks a bit strange, sigemptyset() is what
    do_sigtimedwait() needs.

    Signed-off-by: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Al Viro
    Cc: David Woodhouse
    Cc: Frederic Weisbecker
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Cc: Mathieu Desnoyers
    Cc: Richard Weinberger
    Cc: Steven Rostedt
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • __wake_up_bit() checks waitqueue_active(), so the caller needs mb() as
    wake_up_bit() documents. Fix task_clear_jobctl_trapping() accordingly.

    Signed-off-by: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • When tracing a process in another pid namespace, it's important for fork
    event messages to contain the child's pid as seen from the tracer's pid
    namespace, not the parent's. Otherwise, the tracer won't be able to
    correlate the fork event with later SIGTRAP signals it receives from the
    child.

    We still risk a race condition if a ptracer from a different pid
    namespace attaches after we compute the pid_t value. However, sending a
    bogus fork event message in this unlikely scenario is still a vast
    improvement over the status quo where we always send bogus fork event
    messages to debuggers in a different pid namespace than the forking
    process.

    Signed-off-by: Matthew Dempsky
    Acked-by: Oleg Nesterov
    Cc: Kees Cook
    Cc: Julien Tinnes
    Cc: Roland McGrath
    Cc: Jan Kratochvil
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Dempsky
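
A small Python sketch of the idea, assuming a process has one pid number per namespace that can see it. The Pid class, the namespace names and fork_event_message() are hypothetical and only illustrate which number the tracer needs.

```python
class Pid:
    """A process id with one number per pid namespace that can see it."""

    def __init__(self, numbers):
        self.numbers = dict(numbers)    # namespace name -> pid number in that namespace

    def nr_in(self, namespace):
        # 0 stands for "not visible from this namespace" (assumption of this model).
        return self.numbers.get(namespace, 0)


def fork_event_message(child, tracer_namespace):
    """Value reported for a fork event: the child's pid as the tracer sees it."""
    return child.nr_in(tracer_namespace)
```

For a child known as 2 inside a container and as 4242 in the tracer's namespace, the event message has to carry 4242. Reporting 2 would leave the tracer unable to match the event against the signals it later receives from that child.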
     

06 Jun, 2014

1 commit

  • Pull ARM updates from Russell King:

    - Major clean-up of the L2 cache support code. The existing mess was
    becoming rather unmaintainable through all the additions that others
    have done over time. This turns it into a much nicer structure, and
    implements a few performance improvements as well.

    - Clean up some of the CP15 control register tweaks for alignment
    support, moving some code and data into alignment.c

    - DMA properties for ARM, from Santosh and reviewed by DT people. This
    adds DT properties to specify bus translations we can't discover
    automatically, and to indicate whether devices are coherent.

    - Hibernation support for ARM

    - Make ftrace work with read-only text in modules

    - add suspend support for PJ4B CPUs

    - rework interrupt masking for undefined instruction handling, which
    allows us to enable interrupts earlier in the handling of these
    exceptions.

    - support for big endian page tables

    - fix stacktrace support to exclude stacktrace functions from the
    trace, and add save_stack_trace_regs() implementation so that kprobes
    can record stack traces.

    - Add support for the Cortex-A17 CPU.

    - Remove last vestiges of ARM710 support.

    - Removal of ARM "meminfo" structure, finally converting us solely to
    memblock to handle the early memory initialisation.

    * 'for-linus' of git://ftp.arm.linux.org.uk/~rmk/linux-arm: (142 commits)
    ARM: ensure C page table setup code follows assembly code (part II)
    ARM: ensure C page table setup code follows assembly code
    ARM: consolidate last remaining open-coded alignment trap enable
    ARM: remove global cr_no_alignment
    ARM: remove CPU_CP15 conditional from alignment.c
    ARM: remove unused adjust_cr() function
    ARM: move "noalign" command line option to alignment.c
    ARM: provide common method to clear bits in CPU control register
    ARM: 8025/1: Get rid of meminfo
    ARM: 8060/1: mm: allow sub-architectures to override PCI I/O memory type
    ARM: 8066/1: correction for ARM patch 8031/2
    ARM: 8049/1: ftrace/add save_stack_trace_regs() implementation
    ARM: 8065/1: remove last use of CONFIG_CPU_ARM710
    ARM: 8062/1: Modify ldrt fixup handler to re-execute the userspace instruction
    ARM: 8047/1: rwsem: use asm-generic rwsem implementation
    ARM: l2c: trial at enabling some Cortex-A9 optimisations
    ARM: l2c: add warnings for stuff modifying aux_ctrl register values
    ARM: l2c: print a warning with L2C-310 caches if the cache size is modified
    ARM: l2c: remove old .set_debug method
    ARM: l2c: kill L2X0_AUX_CTRL_MASK before anyone else makes use of this
    ...

    Linus Torvalds
     

05 Jun, 2014

16 commits

  • Pull x86 vdso updates from Peter Anvin:
    "Vdso cleanups and improvements largely from Andy Lutomirski. This
    makes the vdso a lot less ''special''"

    * 'x86/vdso' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/vdso, build: Make LE access macros clearer, host-safe
    x86/vdso, build: Fix cross-compilation from big-endian architectures
    x86/vdso, build: When vdso2c fails, unlink the output
    x86, vdso: Fix an OOPS accessing the HPET mapping w/o an HPET
    x86, mm: Replace arch_vma_name with vm_ops->name for vsyscalls
    x86, mm: Improve _install_special_mapping and fix x86 vdso naming
    mm, fs: Add vm_ops->name as an alternative to arch_vma_name
    x86, vdso: Fix an OOPS accessing the HPET mapping w/o an HPET
    x86, vdso: Remove vestiges of VDSO_PRELINK and some outdated comments
    x86, vdso: Move the vvar and hpet mappings next to the 64-bit vDSO
    x86, vdso: Move the 32-bit vdso special pages after the text
    x86, vdso: Reimplement vdso.so preparation in build-time C
    x86, vdso: Move syscall and sysenter setup into kernel/cpu/common.c
    x86, vdso: Clean up 32-bit vs 64-bit vdso params
    x86, mm: Ensure correct alignment of the fixmap

    Linus Torvalds
     
  • Russell King
     
  • Merge misc updates from Andrew Morton:

    - a few fixes for 3.16. Cc'ed to stable so they'll get there somehow.

    - various misc fixes and cleanups

    - most of the ocfs2 queue. Review is slow...

    - most of MM. The MM queue is pretty huge this time, but not much in
    the way of feature work.

    - some tweaks under kernel/

    - printk maintenance work

    - updates to lib/

    - checkpatch updates

    - tweaks to init/

    * emailed patches from Andrew Morton: (276 commits)
    fs/autofs4/dev-ioctl.c: add __init to autofs_dev_ioctl_init
    fs/ncpfs/getopt.c: replace simple_strtoul by kstrtoul
    init/main.c: remove an ifdef
    kthreads: kill CLONE_KERNEL, change kernel_thread(kernel_init) to avoid CLONE_SIGHAND
    init/main.c: add initcall_blacklist kernel parameter
    init/main.c: don't use pr_debug()
    fs/binfmt_flat.c: make old_reloc() static
    fs/binfmt_elf.c: fix bool assignements
    fs/efs: convert printk(KERN_DEBUG to pr_debug
    fs/efs: add pr_fmt / use __func__
    fs/efs: convert printk to pr_foo()
    scripts/checkpatch.pl: device_initcall is not the only __initcall substitute
    checkpatch: check stable email address
    checkpatch: warn on unnecessary void function return statements
    checkpatch: prefer kstrto to sscanf(buf, "%", &bar);
    checkpatch: add warning for kmalloc/kzalloc with multiply
    checkpatch: warn on #defines ending in semicolon
    checkpatch: make --strict a default for files in drivers/net and net/
    checkpatch: always warn on missing blank line after variable declaration block
    checkpatch: fix wildcard DT compatible string checking
    ...

    Linus Torvalds
     
  • Fix 4 checkpatch warnings
    WARNING: sizeof *tv should be sizeof(*tv)

    Signed-off-by: Fabian Frederick
    Cc: "H. Peter Anvin"
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • ... instead of naked numbers.

    Stuff in sysrq.c used to set it to 8 which is supposed to mean above
    default level so set it to DEBUG instead as we're terminating/killing all
    tasks and we want to be verbose there.

    Also, correct the check in x86_64_start_kernel which should be >= as
    we're clearly issuing the string there for all debug levels, not only
    the magical 10.

    Signed-off-by: Borislav Petkov
    Acked-by: Kees Cook
    Acked-by: Randy Dunlap
    Cc: Joe Perches
    Cc: Valdis Kletnieks
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Borislav Petkov
     
  • If the log ring buffer becomes full, we silently overwrite old messages
    with new data. console_unlock will detect this case and fast-forward the
    console_* pointers to skip over the corrupted data, but nothing will be
    reported to the user.

    This patch hijacks the first valid log message after detecting that we
    dropped messages and prefixes it with a note detailing how many messages
    were dropped. For long (~1000 char) messages, this will result in some
    truncation of the real message, but given that we're dropping things
    anyway, that doesn't seem to be the end of the world.

    Signed-off-by: Will Deacon
    Acked-by: Peter Zijlstra
    Cc: Kay Sievers
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Will Deacon
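
The reporting scheme can be sketched with a bounded log and a console cursor. This Python model is an illustration under assumptions: LogBuffer and its fields are hypothetical, the capacity is counted in messages rather than bytes, the note's wording is assumed, and truncation of long messages is not modelled.

```python
from collections import deque


class LogBuffer:
    def __init__(self, capacity):
        self.capacity = capacity    # maximum number of stored messages
        self.records = deque()
        self.first_seq = 0          # sequence number of the oldest stored message
        self.next_seq = 0           # sequence number the next message will get
        self.console_seq = 0        # next message the console still has to print

    def store(self, text):
        if len(self.records) == self.capacity:
            # Buffer full: the oldest message is silently overwritten.
            self.records.popleft()
            self.first_seq += 1
        self.records.append(text)
        self.next_seq += 1

    def console_flush(self):
        """Return the lines the console prints, noting any messages it missed."""
        out = []
        note = ""
        if self.console_seq < self.first_seq:
            dropped = self.first_seq - self.console_seq
            note = f"** {dropped} printk messages dropped ** "
            self.console_seq = self.first_seq   # fast-forward past the lost data
        while self.console_seq < self.next_seq:
            text = self.records[self.console_seq - self.first_seq]
            out.append(note + text)     # only the first valid message carries the note
            note = ""
            self.console_seq += 1
        return out
```

Storing five messages in a three-slot buffer before the console runs makes the first printed line carry a note that two messages were lost. Later flushes print normally.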
     
  • Jiri Bohac pointed out that there are rare but potential deadlock
    possibilities when calling printk while holding the timekeeping
    seqlock.

    This is due to printk() triggering console sem wakeup, which can
    cause scheduling code to trigger hrtimers which may try to read
    the time.

    Specifically, as Jiri pointed out, that path is:
      printk
        vprintk_emit
          console_unlock
            up(&console_sem)
              __up
                wake_up_process
                  try_to_wake_up
                    ttwu_do_activate
                      ttwu_activate
                        activate_task
                          enqueue_task
                            enqueue_task_fair
                              hrtick_update
                                hrtick_start_fair
                                  hrtick_start_fair
                                    get_time
                                      ktime_get
                                        --> endless loop on
                                        read_seqcount_retry(&timekeeper_seq, ...)

    This patch tries to avoid this issue by using printk_deferred (previously
    named printk_sched) which should defer printing via an irq_work_queue.

    Signed-off-by: John Stultz
    Reported-by: Jiri Bohac
    Reviewed-by: Steven Rostedt
    Cc: Jan Kara
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Stultz
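
A single-threaded Python model of the hazard and of the deferred-printing workaround. It is a sketch under assumptions: SeqCount and Timekeeper are hypothetical classes, every printk() is made to read the time (in the kernel this only happens via the wake-up path listed above), and the endless loop is replaced by a bounded retry that raises.

```python
class SeqCount:
    def __init__(self):
        self.sequence = 0           # odd while a write is in progress

    def write_begin(self):
        self.sequence += 1

    def write_end(self):
        self.sequence += 1

    def read(self, fn, max_spins=1000):
        for _ in range(max_spins):
            start = self.sequence
            if start % 2 == 0:
                value = fn()
                if self.sequence == start:
                    return value
        # A real reader would keep retrying; the model gives up instead.
        raise RuntimeError("reader would spin forever: write in progress")


class Timekeeper:
    def __init__(self):
        self.seq = SeqCount()
        self.now = 0
        self.deferred = []          # messages queued for printing later
        self.console = []

    def ktime_get(self):
        return self.seq.read(lambda: self.now)

    def printk(self, msg):
        # Stands for the whole path that ends in ktime_get().
        self.console.append((self.ktime_get(), msg))

    def update(self, new_time, warning, deferred):
        self.seq.write_begin()
        self.now = new_time
        if deferred:
            self.deferred.append(warning)   # only queue inside the write section
        else:
            self.printk(warning)            # reads the time while the write is open
        self.seq.write_end()
        while self.deferred:                # print once the write section is closed
            self.printk(self.deferred.pop(0))
```

Calling update() with deferred=False raises inside the write section, which is the model's stand-in for the endless loop. With deferred=True the message is printed after the sequence counter is even again.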
     
  • Two of the three printk_deferred uses are really printk_once style
    uses, so add a printk_deferred_once macro to simplify those call
    sites.

    Signed-off-by: John Stultz
    Reviewed-by: Steven Rostedt
    Reviewed-by: Jan Kara
    Cc: Peter Zijlstra
    Cc: Jiri Bohac
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Stultz
     
  • After learning we'll need some sort of deferred printk functionality in
    the timekeeping core, Peter suggested we rename the printk_sched function
    so it can be reused by needed subsystems.

    This only changes the function name. No logic changes.

    Signed-off-by: John Stultz
    Reviewed-by: Steven Rostedt
    Cc: Jan Kara
    Cc: Peter Zijlstra
    Cc: Jiri Bohac
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Stultz
     
  • An earlier change in -mm (printk: remove separate printk_sched
    buffers...), removed the printk_sched irqsave/restore lines since it was
    safe for current users. Since we may be expanding usage of
    printk_sched(), disable preemption for this function to make it more
    generally safe to call.

    Signed-off-by: John Stultz
    Reviewed-by: Jan Kara
    Cc: Peter Zijlstra
    Cc: Jiri Bohac
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Stultz
     
  • To prevent deadlocks with doing a printk inside the scheduler,
    printk_sched() was created. The issue is that printk has a console_sem
    that it can grab and release. The release does a wake up if there's a
    task pending on the sem, and this wake up grabs the rq lock that is
    held in the scheduler. This leads to a possible deadlock if the wake up
    uses the same rq as the one with the rq lock held already.

    What printk_sched() does is save the printk write in a per cpu buffer
    and set the PRINTK_PENDING_SCHED flag. On a timer tick, if this flag is
    set, the printk() is done against the buffer.

    There's a couple of issues with this approach.

    1) If two printk_sched()s are called before the tick, the second one
    will overwrite the first one.

    2) The temporary buffer is 512 bytes and is per cpu. This is quite a
    bit of space wasted for something that is seldom used.

    In order to remove this, the printk_sched() can use the printk buffer
    instead, and delay the console_trylock()/console_unlock() to the queued
    work.

    Because printk_sched() would then be taking the logbuf_lock, the
    logbuf_lock must not be held while doing anything that may call into the
    scheduler functions, which includes wake ups. Unfortunately, printk()
    also has a console_sem that it uses, and on release, the up(&console_sem)
    may do a wake up of any pending waiters. This must be avoided while
    holding the logbuf_lock.

    Signed-off-by: Steven Rostedt
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
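
The first issue, and the effect of storing into the main log instead, can be shown with two toy classes. This Python sketch is an assumption-laden model: PerCpuSlot and SharedLog are hypothetical names, locking is left out entirely, and deferred_work() stands for whatever runs the delayed console output.

```python
class PerCpuSlot:
    """Old scheme: one pending message per CPU, printed on the next tick."""

    def __init__(self):
        self.slot = None
        self.console = []

    def printk_sched(self, msg):
        self.slot = msg             # a second call overwrites the first

    def deferred_work(self):
        if self.slot is not None:
            self.console.append(self.slot)
            self.slot = None


class SharedLog:
    """New scheme: store in the main log at once, defer only the console output."""

    def __init__(self):
        self.log = []
        self.printed = 0            # how many log entries reached the console
        self.console = []

    def printk_sched(self, msg):
        self.log.append(msg)

    def deferred_work(self):
        self.console.extend(self.log[self.printed:])
        self.printed = len(self.log)
```

Two calls before the deferred work runs leave only the second message on the console in the old scheme, while the shared log keeps both and needs no per-CPU buffer.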
     
  • We need interrupts disabled when calling console_trylock_for_printk()
    only so that cpu id we pass to can_use_console() remains valid (for
    other things console_sem provides all the exclusion we need and
    deadlocks on console_sem due to interrupts are impossible because we use
    down_trylock()). However if we are rescheduled, we are guaranteed to
    run on an online cpu so we can easily just get the cpu id in
    can_use_console().

    We can lose a bit of performance when we enable interrupts in
    vprintk_emit() and then disable them again in console_unlock() but OTOH
    it can somewhat reduce interrupt latency caused by console_unlock()
    especially since later in the patch series we will want to spin on
    console_sem in console_trylock_for_printk().

    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Printk calls mutex_acquire() / mutex_release() by hand to instrument
    lockdep about console_sem. However in some corner cases the
    instrumentation is missing. Fix the problem by creating helper functions
    for locking / unlocking console_sem which take care of lockdep
    instrumentation as well.

    Signed-off-by: Jan Kara
    Reported-by: Fabio Estevam
    Reported-by: Andy Shevchenko
    Tested-by: Fabio Estevam
    Tested-By: Valdis Kletnieks
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • There's no reason to hold logbuf_lock when entering
    console_trylock_for_printk().

    The first thing this function does is to call down_trylock(console_sem)
    and if that fails it immediately unlocks logbuf_lock. So logbuf_lock
    isn't needed for that branch. When down_trylock() succeeds, the rest of
    console_trylock() is OK without logbuf_lock (it is called without it
    from other places), and the only remaining thing in
    console_trylock_for_printk() is the can_use_console() call. For that call
    console_sem is enough (it iterates all consoles and checks the
    CON_ANYTIME flag).

    So we drop logbuf_lock before entering console_trylock_for_printk() which
    simplifies the code.

    [akpm@linux-foundation.org: fix have_callable_console() comment]
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • The comment about the interesting interlocking between logbuf_lock and
    console_sem is outdated.

    It was added in 2002 by commit a880f45a48be during conversion of
    console_lock to console_sem + logbuf_lock.

    At that time release_console_sem() (today's equivalent is
    console_unlock()) was indeed using logbuf_lock to avoid races between
    trylock on console_sem in printk() and unlock of console_sem. However
    these days the interlocking is gone and the races are avoided by
    rechecking logbuf state after releasing console_sem.

    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • I wonder if anyone uses printk return value but it is there and should be
    counted correctly.

    This patch modifies log_store() to return the number of really stored
    bytes from the 'text' part. Also it handles the return value in
    vprintk_emit().

    Note that log_store() is used also in cont_flush() but we could ignore the
    return value there. The function works with characters that were already
    counted earlier. In addition, the store could never fail here because the
    length of the printed text is limited by the "cont" buffer and "dict" is
    NULL.

    Signed-off-by: Petr Mladek
    Cc: Jan Kara
    Cc: Jiri Kosina
    Cc: Kay Sievers
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Petr Mladek