25 Jan, 2013

1 commit

  • The MSI specification has several constraints in comparison with
    MSI-X, the most notable of which is the inability to configure MSIs
    independently. As a result, it is impossible to dispatch
    interrupts from different queues to different CPUs. This
    largely devalues the support of multiple MSIs in SMP systems.

    Also, the need to allocate a contiguous block of vector
    numbers for devices capable of multiple MSIs might put
    considerable pressure on the x86 interrupt vector allocator and
    could lead to fragmentation of the interrupt vector space.

    This patch overcomes both drawbacks in the presence of IRQ remapping
    and lets devices take advantage of multiple queues and per-IRQ
    affinity assignments.

    Signed-off-by: Alexander Gordeev
    Cc: Bjorn Helgaas
    Cc: Suresh Siddha
    Cc: Yinghai Lu
    Cc: Matthew Wilcox
    Cc: Jeff Garzik
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/c8bd86ff56b5fc118257436768aaa04489ac0a4c.1353324359.git.agordeev@redhat.com
    Signed-off-by: Ingo Molnar

    Alexander Gordeev
     

23 Jan, 2013

5 commits

  • Commit 083b804c4d3e ("async: use workqueue for worker pool") made it
    possible that async jobs are moved from pending to running out-of-order.
    While pending async jobs will be queued and dispatched for execution in
    the same order, nothing guarantees they'll enter "1) move self to the
    running queue" of async_run_entry_fn() in the same order.

    Before the conversion, async implemented its own worker pool. An async
    worker, upon being woken up, fetches the first item from the pending
    list, which kept the executing lists sorted. The conversion to
    workqueue was done by adding work_struct to each async_entry and async
    just schedules the work item. The queueing and dispatching of such work
    items are still in order but now each worker thread is associated with a
    specific async_entry and moves that specific async_entry to the
    executing list. So, depending on which worker reaches that point
    earlier, which is non-deterministic, we may end up moving an async_entry
    with larger cookie before one with smaller one.

    This broke __lowest_in_progress(). running->domain may not be properly
    sorted and is not guaranteed to contain lower cookies than pending list
    when not empty. Fix it by ensuring sort-inserting to the running list
    and always looking at both pending and running when trying to determine
    the lowest cookie.
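
    The fix described above can be modelled in user space as follows. This
    is only a sketch: the struct, the function names and the `none`
    sentinel are assumptions for illustration, not the kernel's own code.

```c
#include <stddef.h>

/* Hypothetical stand-in for an async entry on a singly linked list. */
struct entry {
	unsigned long cookie;
	struct entry *next;
};

/* Insert e so that the list stays ordered by ascending cookie. */
static void sorted_insert(struct entry **head, struct entry *e)
{
	while (*head && (*head)->cookie < e->cookie)
		head = &(*head)->next;
	e->next = *head;
	*head = e;
}

/*
 * Lowest cookie across both lists. Each list is sorted, so only the heads
 * need checking. 'none' must be larger than any valid cookie and is
 * returned when both lists are empty.
 */
static unsigned long lowest_cookie(const struct entry *pending,
				   const struct entry *running,
				   unsigned long none)
{
	unsigned long lowest = none;

	if (running && running->cookie < lowest)
		lowest = running->cookie;
	if (pending && pending->cookie < lowest)
		lowest = pending->cookie;
	return lowest;
}
```

    Looking at both heads means the result stays correct even when an entry
    with a larger cookie reaches the running list first.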

    Over time, the async synchronization implementation became quite messy.
    We had better restructure it such that each async_entry is linked to two
    lists - one global and one per domain - and not move it when execution
    starts. There's no reason to distinguish pending and running. They
    behave the same for synchronization purposes.

    Signed-off-by: Tejun Heo
    Cc: Arjan van de Ven
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • Pull ftrace fix from Steven Rostedt:
    "Kprobes now uses the function tracer if it can. That is, if a probe
    is placed on a function mcount/nop location, and the arch supports it,
    instead of adding a breakpoint, kprobes will register a function
    callback as that is much more efficient.

    The function tracer needs to update modules before they run, and
    uses the module notifier to do so. But if something else in the
    module notifiers registers a kprobe at one of these locations, before
    ftrace can get to it, then the system could fail.

    The function tracer must be initialized early, otherwise module
    notifiers that probe will only work by chance."

    * tag 'trace-3.8-rc4-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    ftrace: Be first to run code modification on modules

    Linus Torvalds
     
  • wake_up_process() should never wake up a TASK_STOPPED/TRACED task.
    Change it to use TASK_NORMAL and add the WARN_ON().

    TASK_ALL has no other users, probably can be killed.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • putreg() assumes that the tracee is not running and pt_regs_access() can
    safely play with its stack. However, a killed tracee can return from
    ptrace_stop() to the low-level asm code and do RESTORE_REST, which means
    that the debugger can actually read/modify the kernel stack until the
    tracee does SAVE_REST again.

    set_task_blockstep() can race with SIGKILL too and in some sense this
    race is even worse, the very fact the tracee can be woken up breaks the
    logic.

    As Linus suggested, we can clear TASK_WAKEKILL around the arch_ptrace()
    call, which ensures that nobody can ever wake up the tracee while the
    debugger looks at it. Not only does this fix the mentioned problems, it
    also allows some cleanups/simplifications in the arch_ptrace() paths.

    Probably ptrace_unfreeze_traced() needs more callers, for example it
    makes sense to make the tracee killable for oom-killer before
    access_process_vm().

    While at it, add a comment to may_ptrace_stop() to explain why
    ptrace_stop() still can't rely on SIGKILL and signal_pending_state().

    Reported-by: Salman Qazi
    Reported-by: Suleiman Souhlal
    Suggested-by: Linus Torvalds
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Cleanup and preparation for the next change.

    signal_wake_up(resume => true) is overused. None of ptrace/jctl callers
    actually want to wakeup a TASK_WAKEKILL task, but they can't specify the
    necessary mask.

    Turn signal_wake_up() into signal_wake_up_state(state), reintroduce
    signal_wake_up() as a trivial helper, and add ptrace_signal_wake_up()
    which adds __TASK_TRACED.

    This way ptrace_signal_wake_up() can work "inside" ptrace_request()
    even if the tracee doesn't have the TASK_WAKEKILL bit set.
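
    The mask-based helpers can be illustrated with a small user-space model.
    This is a sketch under stated assumptions: the bit values, the struct
    and the model_ prefixed names are invented here and only mirror the
    shape of the helpers named above.

```c
/* Illustrative state bits only; the real kernel values may differ. */
#define M_INTERRUPTIBLE	0x1u
#define M_WAKEKILL	0x2u
#define M_TRACED	0x4u

/* Hypothetical task: 'state' holds the bits it may be woken from. */
struct task_model {
	unsigned int state;
	int woken;
};

/* Wake the task only if its state intersects the given mask. */
static int model_wake_up_state(struct task_model *t, unsigned int mask)
{
	if (!(t->state & mask))
		return 0;
	t->state = 0;
	t->woken = 1;
	return 1;
}

/* Generic form: the caller passes the extra state bits it wants to wake. */
static int model_signal_wake_up_state(struct task_model *t, unsigned int extra)
{
	return model_wake_up_state(t, extra | M_INTERRUPTIBLE);
}

/* Trivial helper keeping the old boolean interface. */
static int model_signal_wake_up(struct task_model *t, int resume)
{
	return model_signal_wake_up_state(t, resume ? M_WAKEKILL : 0);
}

/* ptrace variant: targets the traced bit instead of the wakekill bit. */
static int model_ptrace_signal_wake_up(struct task_model *t, int resume)
{
	return model_signal_wake_up_state(t, resume ? M_TRACED : 0);
}
```

    In this model a task whose state is only M_TRACED (wakekill bit cleared)
    ignores model_signal_wake_up() but is woken by
    model_ptrace_signal_wake_up().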

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

22 Jan, 2013

1 commit

  • If some other kernel subsystem has a module notifier, and adds a kprobe
    to a ftrace mcount point (now that kprobes work on ftrace points),
    when the ftrace notifier runs it will fail and disable ftrace, as well
    as kprobes that are attached to ftrace points.

    Here's the error:

    WARNING: at kernel/trace/ftrace.c:1618 ftrace_bug+0x239/0x280()
    Hardware name: Bochs
    Modules linked in: fat(+) stap_56d28a51b3fe546293ca0700b10bcb29__8059(F) nfsv4 auth_rpcgss nfs dns_resolver fscache xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack lockd sunrpc ppdev parport_pc parport microcode virtio_net i2c_piix4 drm_kms_helper ttm drm i2c_core [last unloaded: bid_shared]
    Pid: 8068, comm: modprobe Tainted: GF 3.7.0-0.rc8.git0.1.fc19.x86_64 #1
    Call Trace:
    [] warn_slowpath_common+0x7f/0xc0
    [] ? __probe_kernel_read+0x46/0x70
    [] ? 0xffffffffa017ffff
    [] ? 0xffffffffa017ffff
    [] warn_slowpath_null+0x1a/0x20
    [] ftrace_bug+0x239/0x280
    [] ftrace_process_locs+0x376/0x520
    [] ftrace_module_notify+0x47/0x50
    [] notifier_call_chain+0x4d/0x70
    [] __blocking_notifier_call_chain+0x58/0x80
    [] blocking_notifier_call_chain+0x16/0x20
    [] sys_init_module+0x73/0x220
    [] system_call_fastpath+0x16/0x1b
    ---[ end trace 9ef46351e53bbf80 ]---
    ftrace failed to modify [] init_once+0x0/0x20 [fat]
    actual: cc:bb:d2:4b:e1

    A kprobe was added to the init_once() function in the fat module on load.
    But this happened before ftrace could have touched the code. As ftrace
    didn't run yet, the kprobe system had no idea it was a ftrace point and
    simply added a breakpoint to the code (0xcc in the cc:bb:d2:4b:e1).

    Then when ftrace went to modify the location from a call to mcount/fentry
    into a nop, it didn't see a call op, but instead it saw the breakpoint op
    and not knowing what to do with it, ftrace shut itself down.

    The solution is to simply give the ftrace module notifier the max priority.
    This should have been done regardless, as the core code ftrace modification
    also happens very early on in boot up. This makes the module modification
    closer to core modification.
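
    Why priority decides who runs first can be shown with a minimal model
    of a priority-ordered callback chain. The struct and function names are
    hypothetical, and treating "max priority" as INT_MAX is an assumption
    of this sketch.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical notifier entry; higher priority runs earlier. */
struct notifier_model {
	int priority;
	struct notifier_model *next;
};

/*
 * Keep the chain ordered from highest to lowest priority. Entries with
 * equal priority stay in registration order.
 */
static void model_chain_register(struct notifier_model **head,
				 struct notifier_model *n)
{
	while (*head && (*head)->priority >= n->priority)
		head = &(*head)->next;
	n->next = *head;
	*head = n;
}
```

    With this ordering, an entry registered with INT_MAX ends up at the
    head of the chain even if another subsystem registered before it.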

    Link: http://lkml.kernel.org/r/20130107140333.593683061@goodmis.org

    Cc: stable@vger.kernel.org
    Acked-by: Masami Hiramatsu
    Reported-by: Frank Ch. Eigler
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

21 Jan, 2013

4 commits

  • Commit 1fb9341ac348 ("module: put modules in list much earlier") moved
    some of the module initialization code around, and in the process
    changed the exit paths too. But for the duplicate export symbol error
    case the change made the ddebug_cleanup path jump to after the module
    mutex unlock, even though it happens with the mutex held.

    Rusty has some patches to split this function up into some helper
    functions, hopefully the mess of complex goto targets will go away
    eventually.

    Reported-by: Dan Carpenter
    Cc: Rusty Russell
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Pull module fixes and a virtio block fix from Rusty Russell:
    "Various minor fixes, but a slightly more complex one to fix the
    per-cpu overload problem introduced recently by kvm id changes."

    * tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
    module: put modules in list much earlier.
    module: add new state MODULE_STATE_UNFORMED.
    module: prevent warning when finit_module a 0 sized file
    virtio-blk: Don't free ida when disk is in use

    Linus Torvalds
     
  • Pull misc syscall fixes from Al Viro:

    - compat syscall fixes (discussed back in December)

    - a couple of "make life easier for sigaltstack stuff by reducing
    inter-tree dependencies"

    - fix up compiler/asmlinkage calling convention disagreement of
    sys_clone()

    - misc

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    sys_clone() needs asmlinkage_protect
    make sure that /linuxrc has std{in,out,err}
    x32: fix sigtimedwait
    x32: fix waitid()
    switch compat_sys_wait4() and compat_sys_waitid() to COMPAT_SYSCALL_DEFINE
    switch compat_sys_sigaltstack() to COMPAT_SYSCALL_DEFINE
    CONFIG_GENERIC_SIGALTSTACK build breakage with asm-generic/syscalls.h
    Ensure that kernel_init_freeable() is not inlined into non __init code

    Linus Torvalds
     
  • The ia64 function "thread_matches()" has no users since commit
    e868a55c2a8c ("[IA64] remove find_thread_for_addr()"). Remove it.

    This allows us to make ptrace_check_attach() static to kernel/ptrace.c,
    which is good since we'll need to change the semantics of it and fix up
    all the callers.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

20 Jan, 2013

1 commit


17 Jan, 2013

1 commit

  • If the default iosched is built as module, the kernel may deadlock
    while trying to load the iosched module on device probe if the probing
    was running off async. This is because async_synchronize_full() at
    the end of module init ends up waiting for the async job which
    initiated the module loading.

    async A                              modprobe

    1. finds a device
    2. registers the block device
    3. request_module(default iosched)
                                         4. modprobe in userland
                                         5. load and init module
                                         6. async_synchronize_full()

    Async A waits for modprobe to finish in request_module() and modprobe
    waits for async A to finish in async_synchronize_full().

    Because there's no easy way to track dependencies once control goes out to
    userland, implementing properly nested flushing is difficult. For
    now, make module init perform async_synchronize_full() iff module init
    has queued async jobs as suggested by Linus.

    This avoids the described deadlock because iosched module doesn't use
    async and thus wouldn't invoke async_synchronize_full(). This is
    hacky and incomplete. It will deadlock if async module loading nests;
    however, this works around the known problem case and seems to be the
    best of bad options.
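
    The workaround boils down to a per-task "has queued async work" marker.
    A minimal model, assuming a hypothetical flag and helper names that are
    not taken from the patch itself:

```c
#include <stdbool.h>

/* Hypothetical per-task state; the field name is an assumption. */
struct loader_model {
	bool used_async;
};

/* Called whenever the task queues an async job. */
static void model_async_schedule(struct loader_model *t)
{
	t->used_async = true;
}

/* True when the end of module init should wait for async jobs. */
static bool model_init_needs_sync(const struct loader_model *t)
{
	return t->used_async;
}
```

    A loader that never queued async work (the iosched case) skips the
    flush and therefore cannot wait on the job that requested it.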

    For more details, please refer to the following thread.

    http://thread.gmane.org/gmane.linux.kernel/1420814

    Signed-off-by: Tejun Heo
    Reported-by: Alex Riesen
    Tested-by: Ming Lei
    Tested-by: Alex Riesen
    Cc: Arjan van de Ven
    Cc: Jens Axboe
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

15 Jan, 2013

2 commits

  • …ernel/git/rostedt/linux-trace

    Pull tracing regression fixes from Steven Rostedt:
    "The clean up patch commit 0fb9656d957d "tracing: Make tracing_enabled
    be equal to tracing_on" caused two regressions.

    1) The irqs off latency tracer no longer starts if tracing_on is off
    when the tracer is set, and then tracing_on is enabled. The
    tracing_on file needs the hook that tracing_enabled had to enable
    tracers if they request it (call the tracer's start() method).

    2) That commit had a separate change that really should have been a
    separate patch, but it must have been added accidentally with the -a
    option of git commit. But as the change is still related to the
    commit it wasn't noticed in review. That change altered the way
    blocking is done by the trace_pipe file with respect to the
    tracing_on settings. I've been told that this change breaks
    current userspace, and this specific change is being reverted."

    * tag 'trace-3.8-rc3-regression-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: Fix regression of trace_pipe
    tracing: Fix regression with irqsoff tracer and tracing_on file

    Linus Torvalds
     
  • Commit 0fb9656d "tracing: Make tracing_enabled be equal to tracing_on"
    changes the behaviour of trace_pipe, ie. it makes trace_pipe return if
    we've read something and tracing is enabled, and this means that we have
    to 'cat trace_pipe' again and again while running tests.

    IMO the right behaviour is that, if tracing is enabled, we always block and
    wait for the ring buffer; otherwise we may lose what we want, since the
    ring buffer's size is limited.

    Link: http://lkml.kernel.org/r/1358132051-5410-1-git-send-email-bo.li.liu@oracle.com

    Signed-off-by: Liu Bo
    Signed-off-by: Steven Rostedt

    Liu Bo
     

12 Jan, 2013

7 commits

  • Prarit's excellent bug report:
    > In recent Fedora releases (F17 & F18) some users have reported seeing
    > messages similar to
    >
    > [ 15.478160] kvm: Could not allocate 304 bytes percpu data
    > [ 15.478174] PERCPU: allocation failed, size=304 align=32, alloc from
    > reserved chunk failed
    >
    > during system boot. In some cases, users have also reported seeing this
    > message along with a failed load of other modules.
    >
    > What is happening is systemd is loading an instance of the kvm module for
    > each cpu found (see commit e9bda3b). When the module load occurs the kernel
    > currently allocates the modules percpu data area prior to checking to see
    > if the module is already loaded or is in the process of being loaded. If
    > the module is already loaded, or finishes load, the module loading code
    > releases the current instance's module's percpu data.

    Now we have a new state MODULE_STATE_UNFORMED, we can insert the
    module into the list (and thus guarantee its uniqueness) before we
    allocate the per-cpu region.

    Reported-by: Prarit Bhargava
    Signed-off-by: Rusty Russell
    Tested-by: Prarit Bhargava

    Rusty Russell
     
  • You should never look at such a module, so it's excised from all paths
    which traverse the modules list.

    We add the state at the end, to avoid gratuitous ABI break (ksplice).

    Signed-off-by: Rusty Russell

    Rusty Russell
     
  • audit_log_start() performs the same jiffies comparison in two places.
    If sufficient time has elapsed between the two comparisons, the second
    one produces a negative sleep duration:

    schedule_timeout: wrong timeout value fffffffffffffff0
    Pid: 6606, comm: trinity-child1 Not tainted 3.8.0-rc1+ #43
    Call Trace:
    schedule_timeout+0x305/0x340
    audit_log_start+0x311/0x470
    audit_log_exit+0x4b/0xfb0
    __audit_syscall_exit+0x25f/0x2c0
    sysret_audit+0x17/0x21

    Fix it by performing the comparison a single time.
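
    A user-space sketch of the single-comparison approach. The helper name
    and the explicit `now` parameter (one snapshot of the tick counter) are
    assumptions for illustration, not the actual audit code.

```c
/*
 * Compute the remaining wait from one snapshot of the counter. The
 * unsigned subtraction is reinterpreted as signed, which on the usual
 * two's-complement targets also copes with counter wraparound. A deadline
 * that has already passed yields 0 instead of a negative timeout.
 */
static long remaining_wait(unsigned long start, unsigned long wait,
			   unsigned long now)
{
	long remaining = (long)(start + wait - now);

	return remaining > 0 ? remaining : 0;
}
```

    Because the same value is used for both the decision and the sleep
    length, time passing between two separate reads can no longer turn a
    small positive timeout into a negative one.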

    Reported-by: Dave Jones
    Cc: Al Viro
    Cc: Eric Paris
    Reviewed-by: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • It's possible for audit_log_start() to return NULL. Handle it in the
    various callers.

    Signed-off-by: Kees Cook
    Cc: Al Viro
    Cc: Eric Paris
    Cc: Jeff Layton
    Cc: "Eric W. Biederman"
    Cc: Julien Tinnes
    Cc: Will Drewry
    Cc: Steve Grubb
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • The seccomp path was using AUDIT_ANOM_ABEND from when seccomp mode 1
    could only kill a process. While we still want to make sure an audit
    record is forced on a kill, this should use a separate record type since
    seccomp mode 2 introduces other behaviors.

    In the case of "handled" behaviors (process wasn't killed), only emit a
    record if the process is under inspection. This change also fixes
    userspace examination of seccomp audit events, since it was considered
    malformed due to missing fields of the AUDIT_ANOM_ABEND event type.

    Signed-off-by: Kees Cook
    Cc: Al Viro
    Cc: Eric Paris
    Cc: Jeff Layton
    Cc: "Eric W. Biederman"
    Cc: Julien Tinnes
    Acked-by: Will Drewry
    Acked-by: Steve Grubb
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • down_write_nest_lock() provides a means to annotate locking scenario
    where an outer lock is guaranteed to serialize the order nested locks
    are being acquired.

    This is analogous to the already existing mutex_lock_nest_lock() and
    spin_lock_nest_lock().

    Signed-off-by: Jiri Kosina
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Mel Gorman
    Tested-by: Sedat Dilek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Kosina
     
  • Commit 02404baf1b47 "tracing: Remove deprecated tracing_enabled file"
    removed the tracing_enabled file as it never worked properly and
    the tracing_on file should be used instead. But the tracing_on file
    didn't call into the tracer's start/stop routines like the
    tracing_enabled file did. This caused trace-cmd to break when it
    enabled the irqsoff tracer.

    If you just did "echo irqsoff > current_tracer" then it would work
    properly. But the tool trace-cmd disables tracing first by writing
    "0" into the tracing_on file. Then it writes "irqsoff" into
    current_tracer and then writes "1" into tracing_on. Unfortunately,
    the above commit changed the irqsoff tracer to check the tracing_on
    status instead of the tracing_enabled status. If it's disabled then
    it does not start the tracer internals.

    The problem is that writing "1" into tracing_on does not call the
    tracer's "start" routine like writing "1" into tracing_enabled did.
    This makes the irqsoff tracer not start when using the trace-cmd
    tool, and is a regression for userspace.

    The simple fix is to have the tracing_on file call the tracer's start()
    method when being enabled (and the stop() method when disabled).
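
    The idea can be modelled as a switch that forwards each change to
    optional hooks. Everything below (struct, demo hooks, function name)
    is a hypothetical user-space sketch, not the tracing code itself.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tracer with optional start/stop hooks. */
struct tracer_model {
	void (*start)(void);
	void (*stop)(void);
};

static int demo_running;	/* demo state toggled by the hooks below */
static void demo_start(void) { demo_running = 1; }
static void demo_stop(void) { demo_running = 0; }

/* Flip the on/off switch and notify the current tracer, if it has hooks. */
static void model_set_tracing_on(bool *tracing_on,
				 const struct tracer_model *cur, bool enable)
{
	*tracing_on = enable;
	if (enable && cur->start)
		cur->start();
	else if (!enable && cur->stop)
		cur->stop();
}
```

    A tracer selected while the switch was off then gets its start hook
    called as soon as the switch is turned back on.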

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

11 Jan, 2013

2 commits

  • Fix new kernel-doc warning in auditfilter.c:

    Warning(kernel/auditfilter.c:1157): Excess function parameter 'uid' description in 'audit_receive_filter'

    Signed-off-by: Randy Dunlap
    Cc: Al Viro
    Cc: Eric Paris
    Cc: linux-audit@redhat.com (subscribers-only)
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • …ernel/git/rostedt/linux-trace

    Pull tracing regression fix from Steven Rostedt:
    "A change that came in this merge window broke the writing to the
    trace_options file. It causes garbage to be read during the compare
    of option names, and breaks setting options via the trace_options
    file, although options can still be set via the options/<option>
    files."

    * tag 'trace-3.8-rc2-regression-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: Fix regression of trace_options file setting

    Linus Torvalds
     

10 Jan, 2013

1 commit

  • The latest change to allow trace options to be set on the command
    line also broke the trace_options file.

    The zeroing of the last byte of the option name that is echoed into
    the trace_options file was removed with the consolidation of some
    of the code. The compare between the option and what was written to
    the trace_options file fails because the string holding the data
    written doesn't terminate with a null character.

    A zero needs to be added to the end of the string copied from
    user space.
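
    The failure mode is easy to reproduce outside the kernel. In this
    sketch, memcpy() stands in for the copy from user space, and the helper
    name and buffer handling are assumptions for illustration.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy cnt bytes of input into buf (capacity bufsz), terminate the copy,
 * and compare it with an option name. Without the terminating zero,
 * strcmp() would keep reading whatever bytes follow the copied data.
 */
static int option_matches(const char *input, size_t cnt,
			  char *buf, size_t bufsz, const char *option)
{
	if (bufsz == 0)
		return 0;
	if (cnt >= bufsz)
		cnt = bufsz - 1;	/* leave room for the terminator */
	memcpy(buf, input, cnt);
	buf[cnt] = '\0';
	return strcmp(buf, option) == 0;
}
```

    The explicit terminator makes the comparison depend only on the bytes
    that were actually written, not on stale buffer contents.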

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

07 Jan, 2013

1 commit

  • Merge emailed fixes from Andrew Morton:
    "Bunch of fixes:

    - delayed IPC updates. I held back on this because of some possible
    outstanding bug reports, but they appear to have been addressed in
    later versions

    - A bunch of MAINTAINERS updates

    - Yet Another RTC driver. I'd held this back while a couple of
    little issues were being worked out.

    I'm expecting an intrusive-but-simple patchset from Joe Perches which
    splits up printk.c into kernel/printk/*. That will be a pig to
    maintain for two months so if it passes testing I'd like to get it
    upstream after a week or so."

    * emailed patches from Andrew Morton: (35 commits)
    printk: fix incorrect length from print_time() when seconds > 99999
    drivers/rtc/rtc-vt8500.c: fix handling of data passed in struct rtc_time
    drivers/rtc/rtc-vt8500.c: correct handling of CR_24H bitfield
    rtc: add RTC driver for TPS6586x
    MAINTAINERS: fix drivers/staging/sm7xx/
    MAINTAINERS: remove include/linux/of_pwm.h
    MAINTAINERS: remove arch/*/lib/perf_event*.c
    MAINTAINERS: remove drivers/mmc/host/imxmmc.*
    MAINTAINERS: fix Documentation/mei/
    MAINTAINERS: remove arch/x86/platform/mrst/pmu.*
    MAINTAINERS: remove firmware/isci/
    MAINTAINERS: fix drivers/ieee802154/
    MAINTAINERS: fix .../plat-mxc/include/mach/imxfb.h
    MAINTAINERS: remove drivers/video/epson1355fb.c
    MAINTAINERS: fix drivers/media/usb/dvb-usb/cxusb*
    MAINTAINERS: adjust for UAPI
    MAINTAINERS: fix drivers/media/platform/atmel-isi.c
    MAINTAINERS: fix arch/arm/mach-at91/include/mach/at_hdmac.h
    MAINTAINERS: fix drivers/rtc/rtc-vt8500.c
    MAINTAINERS: remove arch/arm/plat-s5p/
    ...

    Linus Torvalds
     

06 Jan, 2013

2 commits

  • Cleanup. And I think we need more cleanups, in particular
    __set_current_blocked() and sigprocmask() should die. Nobody should
    ever block SIGKILL or SIGSTOP.

    - Change set_current_blocked() to use __set_current_blocked()

    - Change sys_sigprocmask() to use set_current_blocked(), this way it
    should not worry about SIGKILL/SIGSTOP.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Commit 77097ae503b1 ("most of set_current_blocked() callers want
    SIGKILL/SIGSTOP removed from set") removed the initialization of newmask
    by accident, causing ltp to complain like this:

    ssetmask01 1 TFAIL : sgetmask() failed: TEST_ERRNO=???(0): Success

    Restore the proper initialization.

    Reported-and-tested-by: CAI Qian
    Signed-off-by: Oleg Nesterov
    Cc: stable@kernel.org # v3.5+
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

05 Jan, 2013

1 commit

  • print_prefix() passes a NULL buf to print_time() to get the length of
    the time prefix; when printk times are enabled, the current code just
    returns the constant 15, which matches the format "[%5lu.%06lu] " used
    to print the time value. However, this is obviously incorrect when the
    whole seconds part of the time gets beyond 5 digits (100000 seconds is a
    bit more than a day of uptime).

    The simple fix is to use snprintf(NULL, 0, ...) to calculate the actual
    length of the time prefix. This could be micro-optimized but it seems
    better to have simpler, more readable code here.
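
    The length calculation can be tried in user space. The function name
    below is hypothetical; the format string is the one quoted above.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Return how many characters the time prefix needs. snprintf() with a
 * NULL buffer and size 0 writes nothing and returns the length the
 * formatted output would have had.
 */
static size_t time_prefix_len(unsigned long sec, unsigned long usec)
{
	int len = snprintf(NULL, 0, "[%5lu.%06lu] ", sec, usec);

	return len < 0 ? 0 : (size_t)len;
}
```

    Up to 99999 seconds the result is 15, matching the old constant; from
    100000 seconds on it grows with the number of digits.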

    The bug leads to the syslog system call miscomputing which messages fit
    into the userspace buffer. If there are enough messages to fill
    log_buf_len and some have a timestamp >= 100000, dmesg may fail with:

    # dmesg
    klogctl: Bad address

    When this happens, strace shows that the failure is indeed EFAULT due to
    the kernel mistakenly accessing past the end of dmesg's buffer, since
    dmesg asks the kernel how big a buffer it needs, allocates a bit more,
    and then gets an error when it asks the kernel to fill it:

    syslog(0xa, 0, 0) = 1048576
    mmap(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa4d25d2000
    syslog(0x3, 0x7fa4d25d2010, 0x100008) = -1 EFAULT (Bad address)

    As far as I can see, the bug has been there as long as print_time(),
    which comes from commit 084681d14e42 ("printk: flush continuation lines
    immediately to console") in 3.5-rc5.

    Signed-off-by: Roland Dreier
    Signed-off-by: Greg Kroah-Hartman
    Cc: Joe Perches
    Cc: Sylvain Munaut
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland Dreier
     

03 Jan, 2013

1 commit


26 Dec, 2012

5 commits

  • It needs 64bit timespec. As it is, we end up truncating the timeout
    to whole seconds; usually it doesn't matter, but having all
    sub-second timeouts truncated to one jiffy is visibly wrong.

    Signed-off-by: Al Viro

    Al Viro
     
  • It needs 64bit rusage and 32bit siginfo. glibc never calls it with
    non-NULL rusage pointer, or we would've seen breakage already...

    Signed-off-by: Al Viro

    Al Viro
     
  • Strictly speaking, ppc64 needs it for C ABI compliance. Realistically
    I would be very surprised if e.g. passing 0xffffffff as 'options'
    argument to waitid() from 32bit task would cause problems, but yes,
    it puts us into undefined behaviour territory. ppc64 expects int
    argument to be passed in 64bit register with bits 31..63 containing
    the same value. SYSCALL_DEFINE on ppc provides a wrapper that normalizes
    the value passed from userland; so does COMPAT_SYSCALL_DEFINE. Plain
    declaration of compat_sys_something() with an int argument obviously
    doesn't. Again, for wait4 and waitid I would be extremely surprised
    if gcc started to produce code depending on that value having been
    properly sign-extended - the argument(s) in question end up passed
    blindly to sys_wait4 and sys_waitid resp. and normalization for native
    syscalls takes care of their use there. Still, better to use
    COMPAT_SYSCALL_DEFINE here than worry about nasal daemons...
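
    What "normalizing" an int argument means can be sketched in portable
    C. The function name is hypothetical and the code only models the
    sign extension, not the actual wrapper macros.

```c
#include <stdint.h>

/*
 * Model of the normalization: keep the low 32 bits of the raw 64-bit
 * register value and sign-extend them, so the upper bits all repeat
 * bit 31. Whatever userland left in the upper half is discarded.
 */
static int64_t normalize_int_arg(uint64_t raw_reg)
{
	return (int64_t)(int32_t)(uint32_t)raw_reg;
}
```

    A 32bit task passing 0xffffffff thus reaches the callee as -1 rather
    than as 4294967295.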

    Signed-off-by: Al Viro

    Al Viro
     
  • Makes sigaltstack conversion easier to split into per-architecture
    parts.

    Signed-off-by: Al Viro

    Al Viro
     
  • Oleg pointed out that in a pid namespace the sequence:
    - pid 1 becomes a zombie
    - setns(thepidns), fork,...
    - reaping pid 1.
    - The injected processes exiting.

    can lead to processes attempting to access their child reaper and
    instead following a stale pointer.

    That waitpid for init can return before all of the processes in
    the pid namespace have exited is also unfortunate.

    Avoid these problems by disabling the allocation of new pids in a pid
    namespace when init dies, instead of when the last process in a pid
    namespace is reaped.

    Pointed-out-by: Oleg Nesterov
    Reviewed-by: Oleg Nesterov
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

25 Dec, 2012

1 commit

  • The sequence:
    unshare(CLONE_NEWPID)
    clone(CLONE_THREAD|CLONE_SIGHAND|CLONE_VM)

    Creates a new process in the new pid namespace without setting
    pid_ns->child_reaper. After forking this results in a NULL
    pointer dereference.

    Avoid this and other nonsense scenarios that can show up after
    creating a new pid namespace with unshare by adding a new
    check in copy_process.

    Pointed-out-by: Oleg Nesterov
    Acked-by: Oleg Nesterov
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

21 Dec, 2012

4 commits

  • Pull filesystem notification updates from Eric Paris:
    "This pull mostly is about locking changes in the fsnotify system. By
    switching the group lock from a spin_lock() to a mutex() we can now
    hold the lock across things like iput(). This fixes a problem
    involving unmounting a fs and having inodes be busy, first pointed out
    by FAT, but reproducible with tmpfs.

    This also restores signal driven I/O for inotify, which has been
    broken since about 2.6.32."

    Ugh. I *hate* the timing of this. It was rebased after the merge
    window opened, and then left to sit with the pull request coming the day
    before the merge window closes. That's just crap. But apparently the
    patches themselves have been around for over a year, just gathering
    dust, so now it's suddenly critical.

    Fixed up semantic conflict in fs/notify/fdinfo.c as per Stephen
    Rothwell's fixes from -next.

    * 'for-next' of git://git.infradead.org/users/eparis/notify:
    inotify: automatically restart syscalls
    inotify: dont skip removal of watch descriptor if creation of ignored event failed
    fanotify: dont merge permission events
    fsnotify: make fasync generic for both inotify and fanotify
    fsnotify: change locking order
    fsnotify: dont put marks on temporary list when clearing marks by group
    fsnotify: introduce locked versions of fsnotify_add_mark() and fsnotify_remove_mark()
    fsnotify: pass group to fsnotify_destroy_mark()
    fsnotify: use a mutex instead of a spinlock to protect a groups mark list
    fanotify: add an extra flag to mark_remove_from_mask that indicates wheather a mark should be destroyed
    fsnotify: take groups mark_lock before mark lock
    fsnotify: use reference counting for groups
    fsnotify: introduce fsnotify_get_group()
    inotify, fanotify: replace fsnotify_put_group() with fsnotify_destroy_group()

    Linus Torvalds
     
  • Merge the rest of Andrew's patches for -rc1:
    "A bunch of fixes and misc missed-out-on things.

    That'll do for -rc1. I still have a batch of IPC patches which still
    have a possible bug report which I'm chasing down."

    * emailed patches from Andrew Morton: (25 commits)
    keys: use keyring_alloc() to create module signing keyring
    keys: fix unreachable code
    sendfile: allows bypassing of notifier events
    SGI-XP: handle non-fatal traps
    fat: fix incorrect function comment
    Documentation: ABI: remove testing/sysfs-devices-node
    proc: fix inconsistent lock state
    linux/kernel.h: fix DIV_ROUND_CLOSEST with unsigned divisors
    memcg: don't register hotcpu notifier from ->css_alloc()
    checkpatch: warn on uapi #includes that #include
    mm: cma: WARN if freed memory is still in use
    exec: do not leave bprm->interp on stack
    ...

    Linus Torvalds
     
  • Pull signal handling cleanups from Al Viro:
    "sigaltstack infrastructure + conversion for x86, alpha and um,
    COMPAT_SYSCALL_DEFINE infrastructure.

    Note that there are several conflicts between "unify
    SS_ONSTACK/SS_DISABLE definitions" and UAPI patches in mainline;
    resolution is trivial - just remove definitions of SS_ONSTACK and
    SS_DISABLE from arch/*/uapi/asm/signal.h; they are all identical and
    include/uapi/linux/signal.h contains the unified variant."

    Fixed up conflicts as per Al.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    alpha: switch to generic sigaltstack
    new helpers: __save_altstack/__compat_save_altstack, switch x86 and um to those
    generic compat_sys_sigaltstack()
    introduce generic sys_sigaltstack(), switch x86 and um to it
    new helper: compat_user_stack_pointer()
    new helper: restore_altstack()
    unify SS_ONSTACK/SS_DISABLE definitions
    new helper: current_user_stack_pointer()
    missing user_stack_pointer() instances
    Bury the conditionals from kernel_thread/kernel_execve series
    COMPAT_SYSCALL_DEFINE: infrastructure

    Linus Torvalds
     
  • Use keyring_alloc() to create special keyrings now that it has
    a permissions parameter rather than using key_alloc() +
    key_instantiate_and_link().

    Signed-off-by: David Howells
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells