14 Jul, 2013

1 commit

  • commit 8d8022e8aba85192e937f1f0f7450e256d66ae5c upstream.

    v3.8-rc1-5-g1fb9341 was supposed to stop parallel kvm loads exhausting
    percpu memory on large machines:

    Now we have a new state MODULE_STATE_UNFORMED, we can insert the
    module into the list (and thus guarantee its uniqueness) before we
    allocate the per-cpu region.

    In my defence, it didn't actually say the patch did this. Just that
    we "can".

    This patch actually *does* it.

    Signed-off-by: Rusty Russell
    Tested-by: Jim Hull
    Signed-off-by: Greg Kroah-Hartman

    Rusty Russell
     

30 Jun, 2013

2 commits

  • This __put_user() could be used by unprivileged processes to write into
    kernel memory. The issue here is that even if copy_siginfo_to_user()
    fails, the error code is not checked before __put_user() is executed.

    Luckily, ptrace_peek_siginfo() has been added within the 3.10-rc cycle,
    so it has not hit a stable release yet.
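
    A minimal sketch of the fixed copy-out path (assuming the 3.10-rc
    ptrace_peek_siginfo() shape; details elided). __put_user() skips the
    access_ok() validation, so it must only run after copy_siginfo_to_user()
    has succeeded:

    siginfo_t __user *uinfo = (siginfo_t __user *) data;

    if (copy_siginfo_to_user(uinfo, &info) ||
        __put_user(info.si_code, &uinfo->si_code)) {
            ret = -EFAULT;
            break;
    }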

    Signed-off-by: Mathieu Desnoyers
    Acked-by: Oleg Nesterov
    Cc: Andrey Vagin
    Cc: Roland McGrath
    Cc: Paul McKenney
    Cc: David Howells
    Cc: Dave Jones
    Cc: Pavel Emelyanov
    Cc: Pedro Alves
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathieu Desnoyers
     
  • Pull timer fix from Thomas Gleixner:
    "Correct an ordering issue in the tick broadcast code. I really wish
    we'd get compensation for pain and suffering for each line of code we
    write to work around dysfunctional timer hardware."

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    tick: Fix tick_broadcast_pending_mask not cleared

    Linus Torvalds
     

27 Jun, 2013

1 commit


22 Jun, 2013

1 commit

  • Pull x86 fixes from Peter Anvin:
    "This series fixes a couple of build failures, and fixes MTRR cleanup
    and memory setup on very specific memory maps.

    Finally, it fixes triggering backtraces on all CPUs, which was
    inadvertently disabled on x86."

    * 'x86/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/efi: Fix dummy variable buffer allocation
    x86: Fix trigger_all_cpu_backtrace() implementation
    x86: Fix section mismatch on load_ucode_ap
    x86: fix build error and kconfig for ia32_emulation and binfmt
    range: Do not add new blank slot with add_range_with_merge
    x86, mtrr: Fix original mtrr range get for mtrr_cleanup

    Linus Torvalds
     

21 Jun, 2013

5 commits

  • The recent modification in the cpuidle framework consolidated the
    timer broadcast code across the different drivers by setting a new
    flag in the idle state. It tells the cpuidle core code to enter/exit
    the broadcast mode for the cpu when entering a deep idle state. The
    broadcast timer enter/exit is no longer handled by the back-end
    driver.

    This change caused the local interrupts to be enabled *before* calling
    CLOCK_EVENT_NOTIFY_EXIT.

    On a tegra114, a four-core system, the following warning appeared when
    the flag was introduced in the driver:

    WARNING: at kernel/time/tick-broadcast.c:578 tick_broadcast_oneshot_control
    CPU: 2 PID: 0 Comm: swapper/2 Not tainted 3.10.0-rc3-next-20130529+ #15
    [] (tick_broadcast_oneshot_control+0x1a4/0x1d0) from [] (tick_notify+0x240/0x40c)
    [] (tick_notify+0x240/0x40c) from [] (notifier_call_chain+0x44/0x84)
    [] (notifier_call_chain+0x44/0x84) from [] (raw_notifier_call_chain+0x18/0x20)
    [] (raw_notifier_call_chain+0x18/0x20) from [] (clockevents_notify+0x28/0x170)
    [] (clockevents_notify+0x28/0x170) from [] (cpuidle_idle_call+0x11c/0x168)
    [] (cpuidle_idle_call+0x11c/0x168) from [] (arch_cpu_idle+0x8/0x38)
    [] (arch_cpu_idle+0x8/0x38) from [] (cpu_startup_entry+0x60/0x134)
    [] (cpu_startup_entry+0x60/0x134) from [] (0x804fe9a4)

    I don't have the hardware, so I wasn't able to reproduce the warning,
    but after looking at the code for a while I deduced the following:

    1. the CPU2 enters a deep idle state and sets the broadcast timer

    2. the timer expires, the tick_handle_oneshot_broadcast function is
    called, setting the tick_broadcast_pending_mask and waking up the
    idle cpu CPU2

    3. the CPU2 exits idle, handles the interrupt and then invokes
    tick_broadcast_oneshot_control with CLOCK_EVENT_NOTIFY_EXIT, which
    runs the following code:

    [...]
    if (dev->next_event.tv64 == KTIME_MAX)
            goto out;

    if (cpumask_test_and_clear_cpu(cpu,
                                   tick_broadcast_pending_mask))
            goto out;
    [...]

    So if there is no next event scheduled for CPU2, we fulfil the
    first condition and jump out without clearing the
    tick_broadcast_pending_mask.

    4. CPU2 goes to deep idle again and calls
    tick_broadcast_oneshot_control with CLOCK_NOTIFY_EVENT_ENTER but
    with the tick_broadcast_pending_mask set for CPU2, triggering the
    warning.

    The issue only surfaced due to the modifications of the cpuidle
    framework, which resulted in interrupts being enabled before the call
    to the clockevents code. If the call happens before interrupts have
    been enabled, the warning cannot trigger, because there is still the
    event pending which caused the broadcast timer expiry.

    Move the check for the next event below the check for the pending bit,
    so the pending bit gets cleared whether an event is scheduled on the
    cpu or not.
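
    Based on the excerpt above, the reordered checks look like this (a
    sketch, not the verbatim patch):

    if (cpumask_test_and_clear_cpu(cpu,
                                   tick_broadcast_pending_mask))
            goto out;

    /* Only now bail out if no next event is scheduled on this cpu. */
    if (dev->next_event.tv64 == KTIME_MAX)
            goto out;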

    [ tglx: Massaged changelog ]

    Signed-off-by: Daniel Lezcano
    Reported-and-tested-by: Joseph Lo
    Cc: Stephen Warren
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: linaro-kernel@lists.linaro.org
    Link: http://lkml.kernel.org/r/1371485735-31249-1-git-send-email-daniel.lezcano@linaro.org
    Signed-off-by: Thomas Gleixner

    Daniel Lezcano
     
  • Pull scheduler fixes from Ingo Molnar:
    "Two smaller fixes - plus a context tracking tracing fix that is a bit
    bigger"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    tracing/context-tracking: Add preempt_schedule_context() for tracing
    sched: Fix clear NOHZ_BALANCE_KICK
    sched/x86: Construct all sibling maps if smt

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "Four fixes. The mmap ones are unfortunately larger than desired -
    fuzzing uncovered bugs that needed perf context life time management
    changes to fix properly"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/x86: Fix broken PEBS-LL support on SNB-EP/IVB-EP
    perf: Fix mmap() accounting hole
    perf: Fix perf mmap bugs
    kprobes: Fix to free gone and unused optprobes

    Linus Torvalds
     
  • Pull cpu idle fixes from Thomas Gleixner:
    - Add a missing irq enable. Fallout of the idle conversion
    - Fix stackprotector wreckage caused by the idle conversion

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    idle: Enable interrupts in the weak arch_cpu_idle() implementation
    idle: Add the stack canary init to cpu_startup_entry()

    Linus Torvalds
     
  • Pull timer fixes from Thomas Gleixner:
    - Fix inconsistent clock usage in virtual time accounting
    - Fix a build error in KVM caused by the NOHZ work
    - Remove a pointless timekeeping duty assignment which breaks NOHZ
    - Use a proper notifier return value to avoid random behaviour

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    tick: Remove useless timekeeping duty attribution to broadcast source
    nohz: Fix notifier return val that enforce timekeeping
    kvm: Move guest entry/exit APIs to context_tracking
    vtime: Use consistent clocks among nohz accounting

    Linus Torvalds
     

20 Jun, 2013

2 commits

  • fetch_bp_busy_slots() and toggle_bp_slot() use
    for_each_online_cpu(), which is obviously wrong wrt cpu_up() or
    cpu_down(): we can over- or under-account the per-cpu numbers.

    For example:

    # echo 0 >> /sys/devices/system/cpu/cpu1/online
    # perf record -e mem:0x10 -p 1 &
    # echo 1 >> /sys/devices/system/cpu/cpu1/online
    # perf record -e mem:0x10,mem:0x10,mem:0x10,mem:0x10 -C1 -a &
    # taskset -p 0x2 1

    triggers the same WARN_ONCE("Can't find any breakpoint slot") in
    arch_install_hw_breakpoint().
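
    A sketch of the likely shape of the fix, under the assumption that the
    accounting helpers simply switch to the possible-CPU mask (exact code
    elided):

    /* fetch_bp_busy_slots() / toggle_bp_slot(), sketch: iterate over
     * possible CPUs rather than online ones, so the per-cpu counts
     * stay consistent across cpu_up()/cpu_down(). */
    for_each_possible_cpu(cpu) {
            unsigned int nr = per_cpu(nr_cpu_bp_pinned[type], cpu);
            /* ... */
    }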

    Reported-by: Vince Weaver
    Signed-off-by: Oleg Nesterov
    Acked-by: Frederic Weisbecker
    Cc:
    Link: http://lkml.kernel.org/r/20130620155009.GA6327@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • trinity fuzzer triggered WARN_ONCE("Can't find any breakpoint
    slot") in arch_install_hw_breakpoint() but the problem is not
    arch-specific.

    The problem is that task_bp_pinned(cpu) checks "cpu == iter->cpu",
    but this doesn't account for the "all cpus" events with
    iter->cpu < 0.

    This means that, say, register_user_hw_breakpoint(tsk) can
    happily create an arbitrary number (> HBP_NUM) of breakpoints
    which cannot be activated. toggle_bp_task_slot() is equally
    wrong for the same reason, and nr_task_bp_pinned[] can have
    negative entries.

    Simple test:

    # perl -e 'sleep 1 while 1' &
    # perf record -e mem:0x10,mem:0x10,mem:0x10,mem:0x10,mem:0x10 -p `pidof perl`

    Before this patch this triggers the same problem/WARN_ON(),
    after the patch it correctly fails with -ENOSPC.
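
    The corrected check, sketched (CPU-wide events with iter->cpu < 0 now
    count toward every CPU; names as in kernel/events/hw_breakpoint.c):

    /* task_bp_pinned(), sketch: */
    list_for_each_entry(iter, &bp_task_head, hw.bp_list) {
            if (iter->hw.bp_target == tsk &&
                find_slot_idx(iter) == type &&
                (iter->cpu < 0 || cpu == iter->cpu))
                    count += hw_breakpoint_weight(iter);
    }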

    Reported-by: Vince Weaver
    Signed-off-by: Oleg Nesterov
    Acked-by: Frederic Weisbecker
    Cc:
    Link: http://lkml.kernel.org/r/20130620155006.GA6324@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     

19 Jun, 2013

4 commits

  • Dave Jones hit the following bug report:

    ===============================
    [ INFO: suspicious RCU usage. ]
    3.10.0-rc2+ #1 Not tainted
    -------------------------------
    include/linux/rcupdate.h:771 rcu_read_lock() used illegally while idle!
    other info that might help us debug this:
    RCU used illegally from idle CPU! rcu_scheduler_active = 1, debug_locks = 0
    RCU used illegally from extended quiescent state!
    2 locks held by cc1/63645:
    #0: (&rq->lock){-.-.-.}, at: [] __schedule+0xed/0x9b0
    #1: (rcu_read_lock){.+.+..}, at: [] cpuacct_charge+0x5/0x1f0

    CPU: 1 PID: 63645 Comm: cc1 Not tainted 3.10.0-rc2+ #1 [loadavg: 40.57 27.55 13.39 25/277 64369]
    Hardware name: Gigabyte Technology Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H, BIOS F12a 04/23/2010
    0000000000000000 ffff88010f78fcf8 ffffffff816ae383 ffff88010f78fd28
    ffffffff810b698d ffff88011c092548 000000000023d073 ffff88011c092500
    0000000000000001 ffff88010f78fd60 ffffffff8109d7c5 ffffffff8109d645
    Call Trace:
    [] dump_stack+0x19/0x1b
    [] lockdep_rcu_suspicious+0xfd/0x130
    [] cpuacct_charge+0x185/0x1f0
    [] ? cpuacct_charge+0x5/0x1f0
    [] update_curr+0xec/0x240
    [] put_prev_task_fair+0x228/0x480
    [] __schedule+0x161/0x9b0
    [] preempt_schedule+0x51/0x80
    [] ? __cond_resched_softirq+0x60/0x60
    [] ? retint_careful+0x12/0x2e
    [] ftrace_ops_control_func+0x1dc/0x210
    [] ftrace_call+0x5/0x2f
    [] ? retint_careful+0xb/0x2e
    [] ? schedule_user+0x5/0x70
    [] ? schedule_user+0x5/0x70
    [] ? retint_careful+0x12/0x2e
    ------------[ cut here ]------------

    What happened was that the function tracer traced the schedule_user() code
    that tells RCU that the system is coming back from userspace, and adds
    the CPU back to the RCU monitoring.

    Because the function tracer does preempt_disable/enable_notrace() calls,
    the preempt_enable_notrace() checks the NEED_RESCHED flag. If it is set,
    then preempt_schedule() is called. But this is called before the user_exit()
    function can inform the kernel that the CPU is no longer in user mode and
    needs to be accounted for by RCU.

    The fix is to create a new preempt_schedule_context() that checks if
    the kernel is still in user mode and, if so, switches it to kernel mode
    before calling schedule. It also switches back to user mode coming back
    from schedule, if need be.

    The only user of this currently is the preempt_enable_notrace(), which is
    only used by the tracing subsystem.
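
    A simplified sketch of what such a preempt_schedule_context() looks like
    (this is a sketch, not the verbatim patch; the real version also has to
    use the notrace preempt helpers so the function tracer doesn't recurse):

    asmlinkage void __sched notrace preempt_schedule_context(void)
    {
            enum ctx_state prev_ctx;

            if (likely(!preemptible()))
                    return;

            /*
             * Tell the context tracking code we are back in the kernel
             * (RCU watching) around the schedule, then restore the
             * previous context state afterwards.
             */
            prev_ctx = exception_enter();
            preempt_schedule();
            exception_exit(prev_ctx);
    }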

    Signed-off-by: Steven Rostedt
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1369423420.6828.226.camel@gandalf.local.home
    Signed-off-by: Ingo Molnar

    Steven Rostedt
     
  • I have faced a sequence where the Idle Load Balance was sometimes not
    triggered for a while on my platform, in the following scenario:

    CPU 0 and CPU 1 are running tasks and CPU 2 is idle

    CPU 1 kicks the Idle Load Balance
    CPU 1 selects CPU 2 as the new Idle Load Balancer
    CPU 1 sets NOHZ_BALANCE_KICK for CPU 2
    CPU 1 sends a reschedule IPI to CPU 2

    While CPU 2 wakes up, CPU 0 or CPU 1 migrates a waking up task A to CPU 2

    CPU 2 finally wakes up, runs task A and discards the Idle Load Balance
    task A quickly goes back to sleep (before a tick occurs on CPU 2)
    CPU 2 goes back to idle with NOHZ_BALANCE_KICK set

    The next time CPU 2 is selected as the ILB, no reschedule IPI will be
    sent because NOHZ_BALANCE_KICK is already set, and no Idle Load Balance
    will be performed.

    We must then wait for the sched softirq to be raised on CPU 2 by some
    other part of the kernel for it to come back and clear NOHZ_BALANCE_KICK.

    The proposed solution clears NOHZ_BALANCE_KICK in scheduler_ipi() if we
    can't raise the sched softirq for the Idle Load Balance (see the sketch
    after the change note below).

    Change since V1:

    - move the clear of NOHZ_BALANCE_KICK in got_nohz_idle_kick if the ILB
    can't run on this CPU (as suggested by Peter)
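
    A sketch of the resulting check (per the V1->V2 note above; assuming the
    3.10-era got_nohz_idle_kick() in kernel/sched/core.c):

    static inline bool got_nohz_idle_kick(void)
    {
            int cpu = smp_processor_id();

            if (!test_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu)))
                    return false;

            if (idle_cpu(cpu) && !need_resched())
                    return true;

            /*
             * We can't run the Idle Load Balance on this CPU this time:
             * cancel it and clear NOHZ_BALANCE_KICK.
             */
            clear_bit(NOHZ_BALANCE_KICK, nohz_flags(cpu));
            return false;
    }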

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1370419991-13870-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     
  • Vince's fuzzer once again found holes. This time it spotted a leak in
    the locked page accounting.

    When an event had redirected output and its close() was the last
    reference to the buffer we didn't have a vm context to undo accounting.

    Change the code to destroy the buffer on the last munmap() and detach
    all redirected events at that time. This provides us the right context
    to undo the vm accounting.

    Reported-and-tested-by: Vince Weaver
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20130604084421.GI8923@twins.programming.kicks-ass.net
    Cc:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Joshua reported: Commit cd7b304dfaf1 (x86, range: fix missing merge
    during add range) broke mtrr cleanup on his setup in 3.9.5. The
    corresponding commit in upstream is fbe06b7bae7c.

    The reason is that add_range_with_merge could generate a blank spot.

    We can avoid that by first computing the new expanded start/end, such
    that the new range includes all connected ranges in the range array.
    Then add that expanded start/end to the range array, and move the
    leftover entries up so that no new blank slot is added to the
    range array.

    -v2: move the leftover entries up to avoid having to enhance add_range()
    -v3: include a fix from Joshua for the memmove declaration when
    DYN_DEBUG is used.

    Reported-by: Joshua Covington
    Tested-by: Joshua Covington
    Signed-off-by: Yinghai Lu
    Link: http://lkml.kernel.org/r/1371154622-8929-3-git-send-email-yinghai@kernel.org
    Cc: v3.9
    Signed-off-by: H. Peter Anvin

    Yinghai Lu
     

15 Jun, 2013

3 commits

  • Pull VFS fixes from Al Viro:
    "Several fixes + obvious cleanup (you've missed a couple of open-coded
    can_lookup() back then)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    snd_pcm_link(): fix a leak...
    use can_lookup() instead of direct checks of ->i_op->lookup
    move exit_task_namespaces() outside of exit_notify()
    fput: task_work_add() can fail if the caller has passed exit_task_work()
    ncpfs: fix rmdir returns Device or resource busy

    Linus Torvalds
     
  • exit_notify() does exit_task_namespaces() after
    forget_original_parent(). This was needed to ensure that ->nsproxy
    can't be cleared prematurely: an exiting child we are going to
    reparent can do do_notify_parent() and use the parent's (our) pid_ns.

    However, after 32084504 "pidns: use task_active_pid_ns in
    do_notify_parent" ->nsproxy != NULL is no longer needed, we rely
    on task_active_pid_ns().

    Move exit_task_namespaces() from exit_notify() to do_exit(), after
    exit_fs() and before exit_task_work().

    This solves the problem reported by Andrey, free_ipc_ns()->shm_destroy()
    does fput() which needs task_work_add().

    Note: this particular problem can be fixed if we change fput(), and
    that change makes sense anyway. But there is another reason to move
    the callsite. The original reason for calling exit_task_namespaces()
    from the middle of exit_notify() was subtle and has already gone
    away; now this just looks confusing. And the move allows us to
    simplify exit_notify(): we can avoid unlock/lock(tasklist) and we can
    use ->exit_state instead of PF_EXITING in forget_original_parent().
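
    The resulting ordering in do_exit(), sketched down to the relevant calls:

    exit_fs(tsk);
    /* ... */
    exit_task_namespaces(tsk);      /* moved here from exit_notify() */
    exit_task_work(tsk);            /* so free_ipc_ns()->shm_destroy()->fput()
                                       can still use task_work_add() */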

    Reported-by: Andrey Vagin
    Signed-off-by: Oleg Nesterov
    Acked-by: "Eric W. Biederman"
    Acked-by: Andrey Vagin
    Signed-off-by: Al Viro

    Oleg Nesterov
     
  • PARISC bootup triggers the warning at kernel/cpu/idle.c:96. That's
    caused by the weak arch_cpu_idle() implementation, which is provided
    so that architectures need not implement idle polling over and over.

    The switchover to polling mode happens in the first call of the weak
    arch_cpu_idle() implementation, but that code fails to reenable
    interrupts and therefore triggers the warning.

    Fix this by enabling interrupts in the weak arch_cpu_idle() code.
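
    A sketch of the fixed weak implementation (assuming the 3.10-era
    kernel/cpu/idle.c poll flag):

    void __weak arch_cpu_idle(void)
    {
            cpu_idle_force_poll = 1;
            local_irq_enable();     /* the previously missing irq enable */
    }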

    [ tglx: Made the changelog match the patch ]

    Signed-off-by: James Bottomley
    Reviewed-by: Srivatsa S. Bhat
    Link: http://lkml.kernel.org/r/1371236142.2726.43.camel@dabdike
    Signed-off-by: Thomas Gleixner

    James Bottomley
     

14 Jun, 2013

1 commit

  • Pull RCU fixes from Paul McKenney:
    "I must confess that this past merge window was not RCU's best showing.
    This series contains three more fixes for RCU regressions:

    1. A fix to __DECLARE_TRACE_RCU() that causes it to act as an
    interrupt from idle rather than as a task switch from idle.
    This change is needed due to the recent use of _rcuidle()
    tracepoints that can be invoked from interrupt handlers as well
    as from idle. Without this fix, invoking _rcuidle() tracepoints
    from interrupt handlers results in splats and (more seriously)
    confusion on RCU's part as to whether a given CPU is idle or not.
    This confusion can in turn result in too-short grace periods and
    therefore random memory corruption.

    2. A fix to a subtle deadlock that could result due to RCU doing
    a wakeup while holding one of its rcu_node structure's locks.
    Although the probability of occurrence is low, it really
    does happen. The fix, courtesy of Steven Rostedt, uses
    irq_work_queue() to avoid the deadlock.

    3. A fix to a silent deadlock (invisible to lockdep) due to the
    interaction of timeouts posted by RCU debug code enabled by
    CONFIG_PROVE_RCU_DELAY=y, grace-period initialization, and CPU
    hotplug operations. This will not occur in production kernels,
    but really does occur in randconfig testing. Diagnosis courtesy
    of Steven Rostedt"

    * 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
    rcu: Fix deadlock with CPU hotplug, RCU GP init, and timer migration
    rcu: Don't call wakeup() with rcu_node structure ->lock held
    trace: Allow idle-safe tracepoints to be called from irq

    Linus Torvalds
     

13 Jun, 2013

6 commits

  • Merge misc fixes from Andrew Morton:
    "Bunch of fixes and one little addition to math64.h"

    * emailed patches from Andrew Morton: (27 commits)
    include/linux/math64.h: add div64_ul()
    mm: memcontrol: fix lockless reclaim hierarchy iterator
    frontswap: fix incorrect zeroing and allocation size for frontswap_map
    kernel/audit_tree.c:audit_add_tree_rule(): protect `rule' from kill_rules()
    mm: migration: add migrate_entry_wait_huge()
    ocfs2: add missing lockres put in dlm_mig_lockres_handler
    mm/page_alloc.c: fix watermark check in __zone_watermark_ok()
    drivers/misc/sgi-gru/grufile.c: fix info leak in gru_get_config_info()
    aio: fix io_destroy() regression by using call_rcu()
    rtc-at91rm9200: use shadow IMR on at91sam9x5
    rtc-at91rm9200: add shadow interrupt mask
    rtc-at91rm9200: refactor interrupt-register handling
    rtc-at91rm9200: add configuration support
    rtc-at91rm9200: add match-table compile guard
    fs/ocfs2/namei.c: remove unecessary ERROR when removing non-empty directory
    swap: avoid read_swap_cache_async() race to deadlock while waiting on discard I/O completion
    drivers/rtc/rtc-twl.c: fix missing device_init_wakeup() when booted with device tree
    cciss: fix broken mutex usage in ioctl
    audit: wait_for_auditd() should use TASK_UNINTERRUPTIBLE
    drivers/rtc/rtc-cmos.c: fix accidentally enabling rtc channel
    ...

    Linus Torvalds
     
  • audit_add_tree_rule() must set 'rule->tree = NULL;' first, to protect
    the rule itself from being freed in kill_rules().

    The reason is that when the tree is killed, the 'rule' itself may
    already have been released; we should not access it. One example: we
    add a rule to an inode just as another task is deleting that inode.

    The work flow for adding a rule:

    audit_receive() -> (need audit_cmd_mutex lock)
      audit_receive_skb() ->
        audit_receive_msg() ->
          audit_receive_filter() ->
            audit_add_rule() ->
              audit_add_tree_rule() -> (need audit_filter_mutex lock)
                ...
                unlock audit_filter_mutex
                get_tree()
                ...
                iterate_mounts() -> (iterate all related inodes)
                  tag_mount() ->
                    tag_trunk() ->
                      create_trunk() -> (assume it is 1st rule)
                        fsnotify_add_mark() ->
                          fsnotify_add_inode_mark() -> (add mark to inode->i_fsnotify_marks)
                          ...
                          get_tree(); (each inode will get one)
                ...
                lock audit_filter_mutex

    The work flow for deleting an inode:

    __destroy_inode() ->
      fsnotify_inode_delete() ->
        __fsnotify_inode_delete() ->
          fsnotify_clear_marks_by_inode() -> (get mark from inode->i_fsnotify_marks)
            fsnotify_destroy_mark() ->
              fsnotify_destroy_mark_locked() ->
                audit_tree_freeing_mark() ->
                  evict_chunk() ->
                    ...
                    tree->goner = 1
                    ...
                    kill_rules() -> (assume current->audit_context == NULL)
                      call_rcu() -> (rule->tree != NULL)
                        audit_free_rule_rcu() ->
                          audit_free_rule()
                    ...
                    audit_schedule_prune() -> (assume current->audit_context == NULL)
                      kthread_run() -> (need audit_cmd_mutex and audit_filter_mutex lock)
                        prune_one() -> (delete it from prune_list)
                          put_tree(); (match the original get_tree above)
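
    In code, the first paragraph's fix amounts to something like this sketch
    near the top of audit_add_tree_rule() (a hypothetical simplification;
    only the NULL-ing is the point):

    struct audit_tree *seed = rule->tree, *tree;

    /*
     * Once audit_filter_mutex is dropped, a concurrent
     * evict_chunk()->kill_rules() may run; it frees rules whose
     * ->tree is still set (see the call_rcu() above), so detach
     * the rule first and keep the tree in a local variable.
     */
    rule->tree = NULL;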

    Signed-off-by: Chen Gang
    Cc: Eric Paris
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • audit_log_start() does wait_for_auditd() in a loop until
    audit_backlog_wait_time passes or audit_skb_queue has room.

    If signal_pending() is true this becomes a busy-wait loop: schedule()
    in TASK_INTERRUPTIBLE won't block.

    Thanks to Guy for fully investigating and explaining the problem.

    (akpm: that'll cause the system to lock up on a non-preemptible
    uniprocessor kernel)

    (Guy: "Our customer was in fact running a uniprocessor machine, and they
    reported a system hang.")
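
    A sketch of the fixed helper (assuming the 3.10-era wait_for_auditd();
    the key change is the task state):

    static void wait_for_auditd(unsigned long sleep_time)
    {
            DECLARE_WAITQUEUE(wait, current);

            /*
             * TASK_UNINTERRUPTIBLE: with TASK_INTERRUPTIBLE, a pending
             * signal turns schedule_timeout() into a busy-wait loop.
             */
            set_current_state(TASK_UNINTERRUPTIBLE);
            add_wait_queue(&audit_backlog_wait, &wait);

            if (audit_backlog_limit &&
                skb_queue_len(&audit_skb_queue) > audit_backlog_limit)
                    schedule_timeout(sleep_time);

            __set_current_state(TASK_RUNNING);
            remove_wait_queue(&audit_backlog_wait, &wait);
    }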

    Signed-off-by: Oleg Nesterov
    Reported-by: Guy Streeter
    Cc: Eric Paris
    Cc: Al Viro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The dmesg_restrict sysctl currently covers the syslog method for
    accessing dmesg, but /dev/kmsg isn't covered by the same protections.
    Most people haven't noticed because util-linux dmesg(1) defaulted to
    using the syslog method for access in older versions; newer versions
    of dmesg(1) default to reading directly from /dev/kmsg.

    To fix /dev/kmsg, let's compare the existing interfaces and what they
    allow:

    - /proc/kmsg allows:
      - open (SYSLOG_ACTION_OPEN) if CAP_SYSLOG since it uses a destructive
        single-reader interface (SYSLOG_ACTION_READ).
      - everything, after an open.

    - syslog syscall allows:
      - anything, if CAP_SYSLOG.
      - SYSLOG_ACTION_READ_ALL and SYSLOG_ACTION_SIZE_BUFFER, if
        dmesg_restrict==0.
      - nothing else (EPERM).

    The use-cases were:
    - dmesg(1) needs to do non-destructive SYSLOG_ACTION_READ_ALLs.
    - sysklog(1) needs to open /proc/kmsg, drop privs, and still issue the
    destructive SYSLOG_ACTION_READs.

    AIUI, dmesg(1) is moving to /dev/kmsg, and systemd-journald doesn't
    clear the ring buffer.

    Based on the comments in devkmsg_llseek, it sounds like actions besides
    reading aren't going to be supported by /dev/kmsg (i.e.
    SYSLOG_ACTION_CLEAR), so we have a strict subset of the non-destructive
    syslog syscall actions.

    To this end, move the check as Josh had done, but also rename the
    constants to reflect their new uses (SYSLOG_FROM_CALL becomes
    SYSLOG_FROM_READER, and SYSLOG_FROM_FILE becomes SYSLOG_FROM_PROC).
    SYSLOG_FROM_READER allows non-destructive actions, and SYSLOG_FROM_PROC
    allows destructive actions after a capabilities-constrained
    SYSLOG_ACTION_OPEN check.

    - /dev/kmsg allows:
      - open if CAP_SYSLOG or dmesg_restrict==0
      - reading/polling, after open
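
    In code, the resulting gate on open() looks roughly like this sketch of
    devkmsg_open() (most of the function elided; the constant is the renamed
    SYSLOG_FROM_READER described above):

    static int devkmsg_open(struct inode *inode, struct file *file)
    {
            int err;

            /*
             * Apply the same non-destructive read policy as the syslog
             * syscall: CAP_SYSLOG, or dmesg_restrict==0.
             */
            err = check_syslog_permissions(SYSLOG_ACTION_READ_ALL,
                                           SYSLOG_FROM_READER);
            if (err)
                    return err;
            /* ... */
    }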

    Addresses https://bugzilla.redhat.com/show_bug.cgi?id=903192

    [akpm@linux-foundation.org: use pr_warn_once()]
    Signed-off-by: Kees Cook
    Reported-by: Christian Kujau
    Tested-by: Josh Boyer
    Cc: Kay Sievers
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • We recently noticed that reboot of a 1024 cpu machine takes approx 16
    minutes of just stopping the cpus. The slowdown was tracked to commit
    f96972f2dc63 ("kernel/sys.c: call disable_nonboot_cpus() in
    kernel_restart()").

    The current implementation does all the work of hot removing the cpus
    before halting the system. We are switching to just migrating to the
    boot cpu and then continuing with shutdown/reboot.

    This also has the effect of not breaking x86's command line parameter
    for specifying the reboot cpu. Note, this code was shamelessly copied
    from arch/x86/kernel/reboot.c with bits removed pertaining to the
    reboot_cpu command line parameter.

    Signed-off-by: Robin Holt
    Tested-by: Shawn Guo
    Cc: "Srivatsa S. Bhat"
    Cc: H. Peter Anvin
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Russ Anderson
    Cc: Robin Holt
    Cc: Russell King
    Cc: Guan Xuetao
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
     
  • There are instances in the kernel where we would like to disable CPU
    hotplug (from sysfs) during some important operation. Today the freezer
    code depends on this and the code to do it was kinda tailor-made for
    that.

    Restructure the code and make it generic enough to be useful for other
    usecases too.

    Signed-off-by: Srivatsa S. Bhat
    Signed-off-by: Robin Holt
    Cc: H. Peter Anvin
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Russ Anderson
    Cc: Robin Holt
    Cc: Russell King
    Cc: Guan Xuetao
    Cc: Shawn Guo
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Srivatsa S. Bhat
     

12 Jun, 2013

3 commits

  • …it/rostedt/linux-trace

    Pull tracing fix from Steven Rostedt:
    "Yoshihiro Yunomae fixed a regression in the output format when using
    one of the counter clocks.

    The new multibuffer code changed the trace_clock file to update the
    trace instance's tr->clock_id, but the actual traces still used the
    value from the obsolete global variable trace_clock_id"

    * tag 'trace-fixes-v3.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: Fix outputting formats of x86-tsc and counter when use trace_clock

    Linus Torvalds
     
  • Moving x86 to the generic idle implementation (commit 7d1a9417 "x86:
    Use generic idle loop") wrecked the stack protector.

    I stupidly missed that boot_init_stack_canary() must be inlined from a
    function which never returns, but I put that call into
    arch_cpu_idle_prepare() which of course returns.

    I pondered playing tricks with arch_cpu_idle_prepare() first, but then
    I noticed that the other archs which have implemented the
    stackprotector (ARM and SH) do not initialize the canary for the
    non-boot cpus.

    So I decided to move the boot_init_stack_canary() call into
    cpu_startup_entry(), ifdeffed with CONFIG_X86 for now. This #ifdef
    is just a temporary measure, as I don't want to inflict the
    boot_init_stack_canary() call on ARM and SH that late in the cycle.

    I'll queue a patch for 3.11 which removes the #ifdef if the ARM/SH
    maintainers have no objection.
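
    The result, sketched:

    void cpu_startup_entry(enum cpuhp_state state)
    {
    #ifdef CONFIG_X86
            /*
             * cpu_startup_entry() never returns, so the canary can be
             * initialized here for the non-boot cpus. x86-only for now,
             * see above.
             */
            boot_init_stack_canary();
    #endif
            arch_cpu_idle_prepare();
            cpu_idle_loop();
    }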

    Reported-by: Wouter van Kesteren
    Cc: x86@kernel.org
    Cc: Russell King
    Cc: Paul Mundt
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • The output formats of x86-tsc and counter should be raw, but after
    applying the patch (2b6080f28c7cc3efc8625ab71495aae89aeb63a0) the
    format was changed to nanoseconds. This is because the global variable
    trace_clock_id was used. When multiple buffers are used, the clock_id
    of each sub-buffer should be used. This patch therefore uses
    tr->clock_id instead of the global variable trace_clock_id.

    [ Basically, this fixes a regression where the multibuffer code changed the
    trace_clock file to update tr->clock_id but the traces still use the old
    global trace_clock_id variable, negating the file's effect. The global
    trace_clock_id variable is obsolete and removed. - SR ]
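
    The shape of the change, as a sketch (the per-instance field replaces
    the removed global; shown for the iterator setup path in
    kernel/trace/trace.c):

    /* was: if (trace_clocks[trace_clock_id].in_ns) */
    if (trace_clocks[iter->tr->clock_id].in_ns)
            iter->iter_flags |= TRACE_FILE_TIME_IN_NS;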

    Link: http://lkml.kernel.org/r/20130423013239.22334.7394.stgit@yunodevel

    Signed-off-by: Yoshihiro YUNOMAE
    Signed-off-by: Steven Rostedt

    Yoshihiro YUNOMAE
     

11 Jun, 2013

3 commits

  • The stop machine logic can lock up if all but one of the migration
    threads make it through the disable-irq step and the one remaining
    thread gets stuck in __do_softirq. The reason __do_softirq can hang is
    that it has a bail-out based on jiffies timeout, but in the lockup case,
    jiffies itself is not incremented.

    To work around this, re-add the max_restart counter in __do_softirq and
    stop processing irqs after 10 restarts.
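
    A sketch of the restored logic (assuming the 3.9-era __do_softirq()
    structure; only the restart path is shown):

    #define MAX_SOFTIRQ_RESTART 10

    asmlinkage void __do_softirq(void)
    {
            unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
            int max_restart = MAX_SOFTIRQ_RESTART;
            __u32 pending;

    restart:
            /* ... handle the pending softirqs ... */

            pending = local_softirq_pending();
            if (pending) {
                    /*
                     * The jiffies-based bail-out alone is useless when
                     * jiffies stops ticking, so also cap the restarts.
                     */
                    if (time_before(jiffies, end) && !need_resched() &&
                        --max_restart)
                            goto restart;

                    wakeup_softirqd();
            }
    }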

    Thanks to Tejun Heo and Rusty Russell and others for helping me track
    this down.

    This was introduced in 3.9 by commit c10d73671ad3 ("softirq: reduce
    latencies").

    It may be worth looking into ath9k to see if it has issues with its irq
    handler at a later date.

    The hang stack traces look something like this:

    ------------[ cut here ]------------
    WARNING: at kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xa7()
    Watchdog detected hard LOCKUP on cpu 2
    Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc]
    Pid: 23, comm: migration/2 Tainted: G C 3.9.4+ #11
    Call Trace:
    warn_slowpath_common+0x85/0x9f
    warn_slowpath_fmt+0x46/0x48
    watchdog_overflow_callback+0x9c/0xa7
    __perf_event_overflow+0x137/0x1cb
    perf_event_overflow+0x14/0x16
    intel_pmu_handle_irq+0x2dc/0x359
    perf_event_nmi_handler+0x19/0x1b
    nmi_handle+0x7f/0xc2
    do_nmi+0xbc/0x304
    end_repeat_nmi+0x1e/0x2e
    <>
    cpu_stopper_thread+0xae/0x162
    smpboot_thread_fn+0x258/0x260
    kthread+0xc7/0xcf
    ret_from_fork+0x7c/0xb0
    ---[ end trace 4947dfa9b0a4cec3 ]---
    BUG: soft lockup - CPU#1 stuck for 22s! [migration/1:17]
    Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc]
    irq event stamp: 835637905
    hardirqs last enabled at (835637904): __do_softirq+0x9f/0x257
    hardirqs last disabled at (835637905): apic_timer_interrupt+0x6d/0x80
    softirqs last enabled at (5654720): __do_softirq+0x1ff/0x257
    softirqs last disabled at (5654725): irq_exit+0x5f/0xbb
    CPU 1
    Pid: 17, comm: migration/1 Tainted: G WC 3.9.4+ #11 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
    RIP: tasklet_hi_action+0xf0/0xf0
    Process migration/1
    Call Trace:

    __do_softirq+0x117/0x257
    irq_exit+0x5f/0xbb
    smp_apic_timer_interrupt+0x8a/0x98
    apic_timer_interrupt+0x72/0x80

    printk+0x4d/0x4f
    stop_machine_cpu_stop+0x22c/0x274
    cpu_stopper_thread+0xae/0x162
    smpboot_thread_fn+0x258/0x260
    kthread+0xc7/0xcf
    ret_from_fork+0x7c/0xb0

    Signed-off-by: Ben Greear
    Acked-by: Tejun Heo
    Acked-by: Pekka Riikonen
    Cc: Eric Dumazet
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Ben Greear
     
  • In Steven Rostedt's words:

    > I've been debugging the last couple of days why my tests have been
    > locking up. One of my tracing tests, runs all available tracers. The
    > lockup always happened with the mmiotrace, which is used to trace
    > interactions between priority drivers and the kernel. But to do this
    > easily, when the tracer gets registered, it disables all but the boot
    > CPUs. The lockup always happened after it got done disabling the CPUs.
    >
    > Then I decided to try this:
    >
    > while :; do
    >     for i in 1 2 3; do
    >         echo 0 > /sys/devices/system/cpu/cpu$i/online
    >     done
    >     for i in 1 2 3; do
    >         echo 1 > /sys/devices/system/cpu/cpu$i/online
    >     done
    > done
    >
    > Well, sure enough, that locked up too, with the same users. Doing a
    > sysrq-w (showing all blocked tasks):
    >
    > [ 2991.344562] task PC stack pid father
    > [ 2991.344562] rcu_preempt D ffff88007986fdf8 0 10 2 0x00000000
    > [ 2991.344562] ffff88007986fc98 0000000000000002 ffff88007986fc48 0000000000000908
    > [ 2991.344562] ffff88007986c280 ffff88007986ffd8 ffff88007986ffd8 00000000001d3c80
    > [ 2991.344562] ffff880079248a40 ffff88007986c280 0000000000000000 00000000fffd4295
    > [ 2991.344562] Call Trace:
    > [ 2991.344562] [] schedule+0x64/0x66
    > [ 2991.344562] [] schedule_timeout+0xbc/0xf9
    > [ 2991.344562] [] ? ftrace_call+0x5/0x2f
    > [ 2991.344562] [] ? cascade+0xa8/0xa8
    > [ 2991.344562] [] schedule_timeout_uninterruptible+0x1e/0x20
    > [ 2991.344562] [] rcu_gp_kthread+0x502/0x94b
    > [ 2991.344562] [] ? __init_waitqueue_head+0x50/0x50
    > [ 2991.344562] [] ? rcu_gp_fqs+0x64/0x64
    > [ 2991.344562] [] kthread+0xb1/0xb9
    > [ 2991.344562] [] ? lock_release_holdtime.part.23+0x4e/0x55
    > [ 2991.344562] [] ? __init_kthread_worker+0x58/0x58
    > [ 2991.344562] [] ret_from_fork+0x7c/0xb0
    > [ 2991.344562] [] ? __init_kthread_worker+0x58/0x58
    > [ 2991.344562] kworker/0:1 D ffffffff81a30680 0 47 2 0x00000000
    > [ 2991.344562] Workqueue: events cpuset_hotplug_workfn
    > [ 2991.344562] ffff880078dbbb58 0000000000000002 0000000000000006 00000000000000d8
    > [ 2991.344562] ffff880078db8100 ffff880078dbbfd8 ffff880078dbbfd8 00000000001d3c80
    > [ 2991.344562] ffff8800779ca5c0 ffff880078db8100 ffffffff81541fcf 0000000000000000
    > [ 2991.344562] Call Trace:
    > [ 2991.344562] [] ? __mutex_lock_common+0x3d4/0x609
    > [ 2991.344562] [] schedule+0x64/0x66
    > [ 2991.344562] [] schedule_preempt_disabled+0x18/0x24
    > [ 2991.344562] [] __mutex_lock_common+0x3d4/0x609
    > [ 2991.344562] [] ? get_online_cpus+0x3c/0x50
    > [ 2991.344562] [] ? get_online_cpus+0x3c/0x50
    > [ 2991.344562] [] mutex_lock_nested+0x3b/0x40
    > [ 2991.344562] [] get_online_cpus+0x3c/0x50
    > [ 2991.344562] [] rebuild_sched_domains_locked+0x6e/0x3a8
    > [ 2991.344562] [] rebuild_sched_domains+0x1c/0x2a
    > [ 2991.344562] [] cpuset_hotplug_workfn+0x1c7/0x1d3
    > [ 2991.344562] [] ? cpuset_hotplug_workfn+0x5/0x1d3
    > [ 2991.344562] [] process_one_work+0x2d4/0x4d1
    > [ 2991.344562] [] ? process_one_work+0x207/0x4d1
    > [ 2991.344562] [] worker_thread+0x2e7/0x3b5
    > [ 2991.344562] [] ? rescuer_thread+0x332/0x332
    > [ 2991.344562] [] kthread+0xb1/0xb9
    > [ 2991.344562] [] ? __init_kthread_worker+0x58/0x58
    > [ 2991.344562] [] ret_from_fork+0x7c/0xb0
    > [ 2991.344562] [] ? __init_kthread_worker+0x58/0x58
    > [ 2991.344562] bash D ffffffff81a4aa80 0 2618 2612 0x10000000
    > [ 2991.344562] ffff8800379abb58 0000000000000002 0000000000000006 0000000000000c2c
    > [ 2991.344562] ffff880077fea140 ffff8800379abfd8 ffff8800379abfd8 00000000001d3c80
    > [ 2991.344562] ffff8800779ca5c0 ffff880077fea140 ffffffff81541fcf 0000000000000000
    > [ 2991.344562] Call Trace:
    > [ 2991.344562] [] ? __mutex_lock_common+0x3d4/0x609
    > [ 2991.344562] [] schedule+0x64/0x66
    > [ 2991.344562] [] schedule_preempt_disabled+0x18/0x24
    > [ 2991.344562] [] __mutex_lock_common+0x3d4/0x609
    > [ 2991.344562] [] ? rcu_cpu_notify+0x2f5/0x86e
    > [ 2991.344562] [] ? rcu_cpu_notify+0x2f5/0x86e
    > [ 2991.344562] [] mutex_lock_nested+0x3b/0x40
    > [ 2991.344562] [] rcu_cpu_notify+0x2f5/0x86e
    > [ 2991.344562] [] ? __lock_is_held+0x32/0x53
    > [ 2991.344562] [] notifier_call_chain+0x6b/0x98
    > [ 2991.344562] [] __raw_notifier_call_chain+0xe/0x10
    > [ 2991.344562] [] __cpu_notify+0x20/0x32
    > [ 2991.344562] [] cpu_notify_nofail+0x17/0x36
    > [ 2991.344562] [] _cpu_down+0x154/0x259
    > [ 2991.344562] [] cpu_down+0x2d/0x3a
    > [ 2991.344562] [] store_online+0x4e/0xe7
    > [ 2991.344562] [] dev_attr_store+0x20/0x22
    > [ 2991.344562] [] sysfs_write_file+0x108/0x144
    > [ 2991.344562] [] vfs_write+0xfd/0x158
    > [ 2991.344562] [] SyS_write+0x5c/0x83
    > [ 2991.344562] [] tracesys+0xdd/0xe2
    >
    > As well as held locks:
    >
    > [ 3034.728033] Showing all locks held in the system:
    > [ 3034.728033] 1 lock held by rcu_preempt/10:
    > [ 3034.728033] #0: (rcu_preempt_state.onoff_mutex){+.+...}, at: [] rcu_gp_kthread+0x167/0x94b
    > [ 3034.728033] 4 locks held by kworker/0:1/47:
    > [ 3034.728033] #0: (events){.+.+.+}, at: [] process_one_work+0x207/0x4d1
    > [ 3034.728033] #1: (cpuset_hotplug_work){+.+.+.}, at: [] process_one_work+0x207/0x4d1
    > [ 3034.728033] #2: (cpuset_mutex){+.+.+.}, at: [] rebuild_sched_domains+0x17/0x2a
    > [ 3034.728033] #3: (cpu_hotplug.lock){+.+.+.}, at: [] get_online_cpus+0x3c/0x50
    > [ 3034.728033] 1 lock held by mingetty/2563:
    > [ 3034.728033] #0: (&ldata->atomic_read_lock){+.+...}, at: [] n_tty_read+0x252/0x7e8
    > [ 3034.728033] 1 lock held by mingetty/2565:
    > [ 3034.728033] #0: (&ldata->atomic_read_lock){+.+...}, at: [] n_tty_read+0x252/0x7e8
    > [ 3034.728033] 1 lock held by mingetty/2569:
    > [ 3034.728033] #0: (&ldata->atomic_read_lock){+.+...}, at: [] n_tty_read+0x252/0x7e8
    > [ 3034.728033] 1 lock held by mingetty/2572:
    > [ 3034.728033] #0: (&ldata->atomic_read_lock){+.+...}, at: [] n_tty_read+0x252/0x7e8
    > [ 3034.728033] 1 lock held by mingetty/2575:
    > [ 3034.728033] #0: (&ldata->atomic_read_lock){+.+...}, at: [] n_tty_read+0x252/0x7e8
    > [ 3034.728033] 7 locks held by bash/2618:
    > [ 3034.728033] #0: (sb_writers#5){.+.+.+}, at: [] file_start_write+0x2a/0x2c
    > [ 3034.728033] #1: (&buffer->mutex#2){+.+.+.}, at: [] sysfs_write_file+0x3c/0x144
    > [ 3034.728033] #2: (s_active#54){.+.+.+}, at: [] sysfs_write_file+0xe7/0x144
    > [ 3034.728033] #3: (x86_cpu_hotplug_driver_mutex){+.+.+.}, at: [] cpu_hotplug_driver_lock+0x17/0x19
    > [ 3034.728033] #4: (cpu_add_remove_lock){+.+.+.}, at: [] cpu_maps_update_begin+0x17/0x19
    > [ 3034.728033] #5: (cpu_hotplug.lock){+.+.+.}, at: [] cpu_hotplug_begin+0x2c/0x6d
    > [ 3034.728033] #6: (rcu_preempt_state.onoff_mutex){+.+...}, at: [] rcu_cpu_notify+0x2f5/0x86e
    > [ 3034.728033] 1 lock held by bash/2980:
    > [ 3034.728033] #0: (&ldata->atomic_read_lock){+.+...}, at: [] n_tty_read+0x252/0x7e8
    >
    > Things looked a little weird. Also, this is a deadlock that lockdep did
    > not catch. But what we have here does not look like a circular lock
    > issue:
    >
    > Bash is blocked in rcu_cpu_notify():
    >
    > 1961 /* Exclude any attempts to start a new grace period. */
    > 1962 mutex_lock(&rsp->onoff_mutex);
    >
    >
    > kworker is blocked in get_online_cpus(), which makes sense as we are
    > currently taking down a CPU.
    >
    > But rcu_preempt is not blocked on anything. It is simply sleeping in
    > rcu_gp_kthread (really rcu_gp_init) here:
    >
    > 1453 #ifdef CONFIG_PROVE_RCU_DELAY
    > 1454 if ((prandom_u32() % (rcu_num_nodes * 8)) == 0 &&
    > 1455 system_state == SYSTEM_RUNNING)
    > 1456 schedule_timeout_uninterruptible(2);
    > 1457 #endif /* #ifdef CONFIG_PROVE_RCU_DELAY */
    >
    > And it does this while holding the onoff_mutex that bash is waiting for.
    >
    > Doing a function trace, it showed me where it happened:
    >
    > [ 125.940066] rcu_pree-10 3.... 28384115273: schedule_timeout_uninterruptible [...]
    > [ 125.940066] rcu_pree-10 3d..3 28384202439: sched_switch: prev_comm=rcu_preempt prev_pid=10 prev_prio=120 prev_state=D ==> next_comm=watchdog/3 next_pid=38 next_prio=120
    >
    > The watchdog ran, and then:
    >
    > [ 125.940066] watchdog-38 3d..3 28384692863: sched_switch: prev_comm=watchdog/3 prev_pid=38 prev_prio=120 prev_state=P ==> next_comm=modprobe next_pid=2848 next_prio=118
    >
    > Not sure what modprobe was doing, but shortly after that:
    >
    > [ 125.940066] modprobe-2848 3d..3 28385041749: sched_switch: prev_comm=modprobe prev_pid=2848 prev_prio=118 prev_state=R+ ==> next_comm=migration/3 next_pid=40 next_prio=0
    >
    > Where the migration thread took down the CPU:
    >
    > [ 125.940066] migratio-40 3d..3 28389148276: sched_switch: prev_comm=migration/3 prev_pid=40 prev_prio=0 prev_state=P ==> next_comm=swapper/3 next_pid=0 next_prio=120
    >
    > which finally did:
    >
    > [ 125.940066] -0 3...1 28389282142: arch_cpu_idle_dead [ 125.940066] -0 3...1 28389282548: native_play_dead [ 125.940066] -0 3...1 28389282924: play_dead_common [ 125.940066] -0 3...1 28389283468: idle_task_exit [ 125.940066] -0 3...1 28389284644: amd_e400_remove_cpu
    >
    > CPU 3 is now offline, the rcu_preempt thread that ran on CPU 3 is still
    > doing a schedule_timeout_uninterruptible() and it registered its
    > timeout to the timer base for CPU 3. You would think that it would get
    > migrated, right? The issue here is that the timer migration happens at
    > the CPU notifier for CPU_DEAD. The problem is that the rcu notifier for
    > CPU_DOWN is blocked waiting for the onoff_mutex to be released, which is
    > held by the thread that just put itself into an uninterruptible sleep,
    > which won't wake up until the CPU_DEAD notifier of the timer
    > infrastructure is called, which won't happen until the rcu notifier
    > finishes. Here's our deadlock!

    This commit breaks this deadlock cycle by substituting a shorter udelay()
    for the previous schedule_timeout_uninterruptible(), while at the same
    time increasing the probability of the delay. This maintains the intensity
    of the testing.
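
    Against the CONFIG_PROVE_RCU_DELAY snippet quoted above, the fix
    plausibly looks like this (a busy-wait that never leaves the CPU, fired
    with higher probability):

    #ifdef CONFIG_PROVE_RCU_DELAY
            if ((prandom_u32() % (rcu_num_nodes + 1)) == 0 &&
                system_state == SYSTEM_RUNNING)
                    udelay(200);    /* no sleep -> no timer to migrate */
    #endif /* #ifdef CONFIG_PROVE_RCU_DELAY */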

    Reported-by: Steven Rostedt
    Signed-off-by: Paul E. McKenney
    Tested-by: Steven Rostedt

    Paul E. McKenney
     
  • This commit fixes a lockdep-detected deadlock by moving a wake_up()
    call out from a rnp->lock critical section. Please see below for
    the long version of this story.

    On Tue, 2013-05-28 at 16:13 -0400, Dave Jones wrote:

    > [12572.705832] ======================================================
    > [12572.750317] [ INFO: possible circular locking dependency detected ]
    > [12572.796978] 3.10.0-rc3+ #39 Not tainted
    > [12572.833381] -------------------------------------------------------
    > [12572.862233] trinity-child17/31341 is trying to acquire lock:
    > [12572.870390] (rcu_node_0){..-.-.}, at: [] rcu_read_unlock_special+0x9f/0x4c0
    > [12572.878859]
    > but task is already holding lock:
    > [12572.894894] (&ctx->lock){-.-...}, at: [] perf_lock_task_context+0x7d/0x2d0
    > [12572.903381]
    > which lock already depends on the new lock.
    >
    > [12572.927541]
    > the existing dependency chain (in reverse order) is:
    > [12572.943736]
    > -> #4 (&ctx->lock){-.-...}:
    > [12572.960032] [] lock_acquire+0x91/0x1f0
    > [12572.968337] [] _raw_spin_lock+0x40/0x80
    > [12572.976633] [] __perf_event_task_sched_out+0x2e7/0x5e0
    > [12572.984969] [] perf_event_task_sched_out+0x93/0xa0
    > [12572.993326] [] __schedule+0x2cf/0x9c0
    > [12573.001652] [] schedule_user+0x2e/0x70
    > [12573.009998] [] retint_careful+0x12/0x2e
    > [12573.018321]
    > -> #3 (&rq->lock){-.-.-.}:
    > [12573.034628] [] lock_acquire+0x91/0x1f0
    > [12573.042930] [] _raw_spin_lock+0x40/0x80
    > [12573.051248] [] wake_up_new_task+0xb7/0x260
    > [12573.059579] [] do_fork+0x105/0x470
    > [12573.067880] [] kernel_thread+0x26/0x30
    > [12573.076202] [] rest_init+0x23/0x140
    > [12573.084508] [] start_kernel+0x3f1/0x3fe
    > [12573.092852] [] x86_64_start_reservations+0x2a/0x2c
    > [12573.101233] [] x86_64_start_kernel+0xcc/0xcf
    > [12573.109528]
    > -> #2 (&p->pi_lock){-.-.-.}:
    > [12573.125675] [] lock_acquire+0x91/0x1f0
    > [12573.133829] [] _raw_spin_lock_irqsave+0x4b/0x90
    > [12573.141964] [] try_to_wake_up+0x31/0x320
    > [12573.150065] [] default_wake_function+0x12/0x20
    > [12573.158151] [] autoremove_wake_function+0x18/0x40
    > [12573.166195] [] __wake_up_common+0x58/0x90
    > [12573.174215] [] __wake_up+0x39/0x50
    > [12573.182146] [] rcu_start_gp_advanced.isra.11+0x4a/0x50
    > [12573.190119] [] rcu_start_future_gp+0x1c9/0x1f0
    > [12573.198023] [] rcu_nocb_kthread+0x114/0x930
    > [12573.205860] [] kthread+0xed/0x100
    > [12573.213656] [] ret_from_fork+0x7c/0xb0
    > [12573.221379]
    > -> #1 (&rsp->gp_wq){..-.-.}:
    > [12573.236329] [] lock_acquire+0x91/0x1f0
    > [12573.243783] [] _raw_spin_lock_irqsave+0x4b/0x90
    > [12573.251178] [] __wake_up+0x23/0x50
    > [12573.258505] [] rcu_start_gp_advanced.isra.11+0x4a/0x50
    > [12573.265891] [] rcu_start_future_gp+0x1c9/0x1f0
    > [12573.273248] [] rcu_nocb_kthread+0x114/0x930
    > [12573.280564] [] kthread+0xed/0x100
    > [12573.287807] [] ret_from_fork+0x7c/0xb0

    Notice the above call chain.

    rcu_start_future_gp() is called with the rnp->lock held. Then it calls
    rcu_start_gp_advanced(), which does a wakeup.

    You can't do wakeups while holding the rnp->lock, as that would mean
    that you could not do a rcu_read_unlock() while holding the rq lock, or
    any lock that was taken while holding the rq lock. This is because...
    (See below).

    > [12573.295067]
    > -> #0 (rcu_node_0){..-.-.}:
    > [12573.309293] [] __lock_acquire+0x1786/0x1af0
    > [12573.316568] [] lock_acquire+0x91/0x1f0
    > [12573.323825] [] _raw_spin_lock+0x40/0x80
    > [12573.331081] [] rcu_read_unlock_special+0x9f/0x4c0
    > [12573.338377] [] __rcu_read_unlock+0x96/0xa0
    > [12573.345648] [] perf_lock_task_context+0x143/0x2d0
    > [12573.352942] [] find_get_context+0x4e/0x1f0
    > [12573.360211] [] SYSC_perf_event_open+0x514/0xbd0
    > [12573.367514] [] SyS_perf_event_open+0x9/0x10
    > [12573.374816] [] tracesys+0xdd/0xe2

    Notice the above trace.

    perf took its own ctx->lock, which can be taken while holding the rq
    lock. While holding this lock, it did a rcu_read_unlock(). The
    perf_lock_task_context() basically looks like:

    rcu_read_lock();
    raw_spin_lock(ctx->lock);
    rcu_read_unlock();

    Now, what looks to have happened is that we scheduled after taking that
    first rcu_read_lock() but before taking the spin lock. When we scheduled
    back in and took the ctx->lock, the following rcu_read_unlock()
    triggered the "special" code.

    The rcu_read_unlock_special() takes the rnp->lock, which gives us a
    possible deadlock scenario.

    CPU0                    CPU1                    CPU2
    ----                    ----                    ----
                                                    rcu_nocb_kthread()
    lock(rq->lock);
                            lock(ctx->lock);
                                                    lock(rnp->lock);
                                                    wake_up();
                                                    lock(rq->lock);
                            rcu_read_unlock();
                            rcu_read_unlock_special();
                            lock(rnp->lock);
    lock(ctx->lock);

                            **** DEADLOCK ****

    > [12573.382068]
    > other info that might help us debug this:
    >
    > [12573.403229] Chain exists of:
    > rcu_node_0 --> &rq->lock --> &ctx->lock
    >
    > [12573.424471] Possible unsafe locking scenario:
    >
    > [12573.438499] CPU0 CPU1
    > [12573.445599] ---- ----
    > [12573.452691] lock(&ctx->lock);
    > [12573.459799] lock(&rq->lock);
    > [12573.467010] lock(&ctx->lock);
    > [12573.474192] lock(rcu_node_0);
    > [12573.481262]
    > *** DEADLOCK ***
    >
    > [12573.501931] 1 lock held by trinity-child17/31341:
    > [12573.508990] #0: (&ctx->lock){-.-...}, at: [] perf_lock_task_context+0x7d/0x2d0
    > [12573.516475]
    > stack backtrace:
    > [12573.530395] CPU: 1 PID: 31341 Comm: trinity-child17 Not tainted 3.10.0-rc3+ #39
    > [12573.545357] ffffffff825b4f90 ffff880219f1dbc0 ffffffff816e375b ffff880219f1dc00
    > [12573.552868] ffffffff816dfa5d ffff880219f1dc50 ffff88023ce4d1f8 ffff88023ce4ca40
    > [12573.560353] 0000000000000001 0000000000000001 ffff88023ce4d1f8 ffff880219f1dcc0
    > [12573.567856] Call Trace:
    > [12573.575011] [] dump_stack+0x19/0x1b
    > [12573.582284] [] print_circular_bug+0x200/0x20f
    > [12573.589637] [] __lock_acquire+0x1786/0x1af0
    > [12573.596982] [] ? sched_clock_cpu+0xb5/0x100
    > [12573.604344] [] lock_acquire+0x91/0x1f0
    > [12573.611652] [] ? rcu_read_unlock_special+0x9f/0x4c0
    > [12573.619030] [] _raw_spin_lock+0x40/0x80
    > [12573.626331] [] ? rcu_read_unlock_special+0x9f/0x4c0
    > [12573.633671] [] rcu_read_unlock_special+0x9f/0x4c0
    > [12573.640992] [] ? perf_lock_task_context+0x7d/0x2d0
    > [12573.648330] [] ? put_lock_stats.isra.29+0xe/0x40
    > [12573.655662] [] ? delay_tsc+0x90/0xe0
    > [12573.662964] [] __rcu_read_unlock+0x96/0xa0
    > [12573.670276] [] perf_lock_task_context+0x143/0x2d0
    > [12573.677622] [] ? __perf_event_enable+0x370/0x370
    > [12573.684981] [] find_get_context+0x4e/0x1f0
    > [12573.692358] [] SYSC_perf_event_open+0x514/0xbd0
    > [12573.699753] [] ? get_parent_ip+0xd/0x50
    > [12573.707135] [] ? trace_hardirqs_on_caller+0xfd/0x1c0
    > [12573.714599] [] SyS_perf_event_open+0x9/0x10
    > [12573.721996] [] tracesys+0xdd/0xe2

    This commit delays the wakeup via irq_work(), which is what
    perf and ftrace use to perform wakeups in critical sections.
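
    A sketch of the deferral (assuming an irq_work member added to
    rcu_state; the pattern is the standard irq_work one):

    /* Runs from the irq_work interrupt, safely outside rnp->lock. */
    static void rsp_wakeup(struct irq_work *work)
    {
            struct rcu_state *rsp = container_of(work, struct rcu_state,
                                                 wakeup_work);

            /* Wake up rcu_gp_kthread() to start the grace period. */
            wake_up(&rsp->gp_wq);
    }

    /* In rcu_start_gp_advanced(), with rnp->lock held:
     *   was:  wake_up(&rsp->gp_wq);
     *   now:  irq_work_queue(&rsp->wakeup_work);
     */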

    Reported-by: Dave Jones
    Signed-off-by: Steven Rostedt
    Signed-off-by: Paul E. McKenney

    Steven Rostedt
     

09 Jun, 2013

5 commits

  • Pull timer fixes from Thomas Gleixner:

    - Trivial: unused variable removal

    - Posix-timers: Add the clock ID to the new proc interface to make it
    useful. The interface is new and should be functional when we reach
    the final 3.10 release.

    - Cure a false positive warning in the tick code introduced by the
    overhaul in 3.10

    - Fix for a persistent clock detection regression introduced in this
    cycle

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    timekeeping: Correct run-time detection of persistent_clock.
    ntp: Remove unused variable flags in __hardpps
    posix-timers: Show clock ID in proc file
    tick: Cure broadcast false positive pending bit warning

    Linus Torvalds
     
  • Pull irqdomain bug fixes from Grant Likely:
    "This branch contains a set of straight forward bug fixes to the
    irqdomain code and to a couple of drivers that make use of it."

    * tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux:
    irqchip: Return -EPERM for reserved IRQs
    irqdomain: document the simple domain first_irq
    kernel/irq/irqdomain.c: before use 'irq_data', need check it whether valid.
    irqdomain: export irq_domain_add_simple

    Linus Torvalds
     
  • The first_irq needs to be zero to get a linear domain and that
    comes with special semantics. We want to simplify this going
    forward but some documentation never hurts.

    Signed-off-by: Linus Walleij
    Signed-off-by: Grant Likely

    Linus Walleij
     
  • irq_data may be NULL; if so, we WARN_ON() and continue. 'hwirq',
    which is derived from 'irq_data', therefore has to be initialized
    after that check, or it will cause an issue.
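
    Sketched, the reordering looks like this (assuming the 3.10-era
    irq_domain_disassociate_many() loop this commit describes):

    struct irq_data *irq_data = irq_get_irq_data(irq);
    irq_hw_number_t hwirq;

    if (WARN_ON(!irq_data || irq_data->domain != domain))
            continue;

    /* Only dereference irq_data once it is known to be valid. */
    hwirq = irq_data->hwirq;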

    Signed-off-by: Chen Gang
    Signed-off-by: Grant Likely

    Chen Gang
     
  • All other irq_domain_add_* functions are exported already, and apparently
    this one got left out by mistake, which causes build errors for ARM
    allmodconfig kernels:

    ERROR: "irq_domain_add_simple" [drivers/gpio/gpio-rcar.ko] undefined!
    ERROR: "irq_domain_add_simple" [drivers/gpio/gpio-em.ko] undefined!

    Signed-off-by: Arnd Bergmann
    Acked-by: Simon Horman
    Signed-off-by: Grant Likely

    Arnd Bergmann
     

08 Jun, 2013

1 commit

  • …l/git/rostedt/linux-trace

    Pull tracing fixes from Steven Rostedt:
    "This contains 4 fixes.

    The first two fix the case where, with full RCU debugging enabled,
    enabling function tracing causes a live lock of the system. This is
    due to the added debug checks in rcu_dereference_raw(), which is used
    by the function tracer. These checks are also traced by the function
    tracer and add enough overhead that finishing an interrupt can take
    longer than the interval until the next interrupt is triggered,
    causing a live lock from the timer interrupt.

    Talking this over with Paul McKenney, we came up with a fix that adds
    a new rcu_dereference_raw_notrace() that does not perform these added
    checks, and let the function tracer use that.

    The third commit fixes a failed compile when branch tracing is
    enabled, due to the conversion of the trace_test_buffer() selftest
    that the branch trace wasn't converted for.

    The fourth patch fixes a bug caught by the RCU lockdep code where a
    rcu_read_lock() is performed when RCU is disabled (either going to or
    from idle, or user space). This happened in the irqsoff tracer as it
    calls task_uid(). The fix was to use current_uid() where possible,
    which doesn't use RCU locking; luckily, that is always the case when
    irqsoff calls this code."

    * tag 'trace-fixes-v3.10-rc3-v3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: Use current_uid() for critical time tracing
    tracing: Fix bad parameter passed in branch selftest
    ftrace: Use the rcu _notrace variants for rcu_dereference_raw() and friends
    rcu: Add _notrace variation of rcu_dereference_raw() and hlist_for_each_entry_rcu()

    Linus Torvalds
     

07 Jun, 2013

1 commit

  • The irqsoff tracer records the max time that interrupts are disabled.
    There are hooks in the assembly code that call back into the tracer when
    interrupts are disabled or enabled.

    When they are enabled, the tracer checks if the amount of time they
    were disabled is larger than the previously recorded max interrupts-off
    time. If it is, it creates a snapshot of the currently running trace
    to store where the last largest interrupts-off time occurred and how
    it happened.

    During testing, this RCU lockdep dump appeared:

    [ 1257.829021] ===============================
    [ 1257.829021] [ INFO: suspicious RCU usage. ]
    [ 1257.829021] 3.10.0-rc1-test+ #171 Tainted: G W
    [ 1257.829021] -------------------------------
    [ 1257.829021] /home/rostedt/work/git/linux-trace.git/include/linux/rcupdate.h:780 rcu_read_lock() used illegally while idle!
    [ 1257.829021]
    [ 1257.829021] other info that might help us debug this:
    [ 1257.829021]
    [ 1257.829021]
    [ 1257.829021] RCU used illegally from idle CPU!
    [ 1257.829021] rcu_scheduler_active = 1, debug_locks = 0
    [ 1257.829021] RCU used illegally from extended quiescent state!
    [ 1257.829021] 2 locks held by trace-cmd/4831:
    [ 1257.829021] #0: (max_trace_lock){......}, at: [] stop_critical_timing+0x1a3/0x209
    [ 1257.829021] #1: (rcu_read_lock){.+.+..}, at: [] __update_max_tr+0x88/0x1ee
    [ 1257.829021]
    [ 1257.829021] stack backtrace:
    [ 1257.829021] CPU: 3 PID: 4831 Comm: trace-cmd Tainted: G W 3.10.0-rc1-test+ #171
    [ 1257.829021] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
    [ 1257.829021] 0000000000000001 ffff880065f49da8 ffffffff8153dd2b ffff880065f49dd8
    [ 1257.829021] ffffffff81092a00 ffff88006bd78680 ffff88007add7500 0000000000000003
    [ 1257.829021] ffff88006bd78680 ffff880065f49e18 ffffffff810daebf ffffffff810dae5a
    [ 1257.829021] Call Trace:
    [ 1257.829021] [] dump_stack+0x19/0x1b
    [ 1257.829021] [] lockdep_rcu_suspicious+0x109/0x112
    [ 1257.829021] [] __update_max_tr+0xed/0x1ee
    [ 1257.829021] [] ? __update_max_tr+0x88/0x1ee
    [ 1257.829021] [] ? user_enter+0xfd/0x107
    [ 1257.829021] [] update_max_tr_single+0x11d/0x12d
    [ 1257.829021] [] ? user_enter+0xfd/0x107
    [ 1257.829021] [] stop_critical_timing+0x141/0x209
    [ 1257.829021] [] ? trace_hardirqs_on+0xd/0xf
    [ 1257.829021] [] ? user_enter+0xfd/0x107
    [ 1257.829021] [] time_hardirqs_on+0x2a/0x2f
    [ 1257.829021] [] ? user_enter+0xfd/0x107
    [ 1257.829021] [] trace_hardirqs_on_caller+0x16/0x197
    [ 1257.829021] [] trace_hardirqs_on+0xd/0xf
    [ 1257.829021] [] user_enter+0xfd/0x107
    [ 1257.829021] [] do_notify_resume+0x92/0x97
    [ 1257.829021] [] int_signal+0x12/0x17

    What happened was that on entering user code, interrupts were enabled
    and a new max interrupts-off time was recorded. The trace buffer was
    saved along with various information about the task: comm, pid, uid,
    priority, etc.

    The uid is recorded with task_uid(tsk). But this is a macro that uses
    rcu_read_lock() to retrieve the data, and here it happened to run where
    RCU is blind (user_enter).

    As only the preempt and irqsoff tracers can have this happen, and they
    both only ever pass tsk == current: if tsk == current, use current_uid()
    instead of task_uid(), as current_uid() does not use RCU, since only
    current can change its uid.

    This fixes the RCU suspicious splat.
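
    A sketch of the resulting uid capture in the snapshot path
    (__update_max_tr(), per the splat above):

    /*
     * current_uid() reads current's own creds without rcu_read_lock(),
     * which is what makes it safe where RCU is blind (user_enter).
     */
    if (tsk == current)
            max_data->uid = current_uid();
    else
            max_data->uid = task_uid(tsk);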

    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     

03 Jun, 2013

1 commit

  • Pull cgroup fixes from Tejun Heo:

    - Fix for yet another xattr bug which may lead to NULL deref.

    - A subtle bug in for_each_descendant_pre(). This bug requires quite
    specific conditions to trigger and isn't too likely to actually
    happen in the wild, but maybe that just makes it that much nastier.

    - A warning message added for silly cgroup re-mount (not -o remount,
    but unmount followed by mount) behavior.

    * 'for-3.10-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: warn about mismatching options of a new mount of an existing hierarchy
    cgroup: fix a subtle bug in descendant pre-order walk
    cgroup: initialize xattr before calling d_instantiate()

    Linus Torvalds