15 Jun, 2013

2 commits

  • Pull VFS fixes from Al Viro:
    "Several fixes + obvious cleanup (you've missed a couple of open-coded
    can_lookup() back then)"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    snd_pcm_link(): fix a leak...
    use can_lookup() instead of direct checks of ->i_op->lookup
    move exit_task_namespaces() outside of exit_notify()
    fput: task_work_add() can fail if the caller has passed exit_task_work()
    ncpfs: fix rmdir returns Device or resource busy

    Linus Torvalds
     
  • exit_notify() does exit_task_namespaces() after
    forget_original_parent(). This was needed to ensure that ->nsproxy
    can't be cleared prematurely: an exiting child we are going to
    reparent can do do_notify_parent() and use the parent's (our) pid_ns.

    However, after 32084504 "pidns: use task_active_pid_ns in
    do_notify_parent", ->nsproxy != NULL is no longer needed; we rely
    on task_active_pid_ns() instead.

    Move exit_task_namespaces() from exit_notify() to do_exit(), after
    exit_fs() and before exit_task_work().

    This solves the problem reported by Andrey, free_ipc_ns()->shm_destroy()
    does fput() which needs task_work_add().

    Note: this particular problem can be fixed if we change fput(), and
    that change makes sense anyway. But there is another reason to move
    the callsite. The original reason for calling exit_task_namespaces()
    from the middle of exit_notify() was subtle and it has already gone
    away, so the current placement looks confusing. The move also allows us
    to simplify exit_notify(): we can avoid unlock/lock(tasklist) and we can
    use ->exit_state instead of PF_EXITING in forget_original_parent().
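
The ordering argument above can be tried out in a small user-space model. This is only a sketch: the `Task` class, the `run_exit` helper and the step lists are hypothetical stand-ins, and the "old" order assumes that exit_notify() (and with it exit_task_namespaces()) ran after exit_task_work().

```python
class Task:
    """Toy stand-in for an exiting task; not the kernel's task_struct."""

    def __init__(self):
        self.pending_work = []      # callbacks queued via task_work_add()
        self.work_closed = False    # set once exit_task_work() has run
        self.released_files = []

    def task_work_add(self, callback):
        # Queuing fails once the task is past exit_task_work().
        if self.work_closed:
            return False
        self.pending_work.append(callback)
        return True

    def exit_task_work(self):
        # Run everything queued so far, then refuse further additions.
        while self.pending_work:
            self.pending_work.pop(0)()
        self.work_closed = True

    def exit_task_namespaces(self):
        # Dropping the IPC namespace ends in an fput()-like deferred release.
        return self.task_work_add(
            lambda: self.released_files.append("shm file"))


def run_exit(order):
    """Run the named exit steps in order; True if the file was released."""
    task = Task()
    steps = {
        "exit_fs": lambda: None,
        "exit_task_namespaces": task.exit_task_namespaces,
        "exit_task_work": task.exit_task_work,
    }
    for name in order:
        steps[name]()
    return task.released_files == ["shm file"]


OLD_ORDER = ["exit_fs", "exit_task_work", "exit_task_namespaces"]
NEW_ORDER = ["exit_fs", "exit_task_namespaces", "exit_task_work"]
```

In this model `run_exit(NEW_ORDER)` releases the file, while `run_exit(OLD_ORDER)` leaves it unreleased because the deferred work is queued too late.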

    Reported-by: Andrey Vagin
    Signed-off-by: Oleg Nesterov
    Acked-by: "Eric W. Biederman"
    Acked-by: Andrey Vagin
    Signed-off-by: Al Viro

    Oleg Nesterov
     

14 Jun, 2013

1 commit

  • Pull RCU fixes from Paul McKenney:
    "I must confess that this past merge window was not RCU's best showing.
    This series contains three more fixes for RCU regressions:

    1. A fix to __DECLARE_TRACE_RCU() that causes it to act as an
    interrupt from idle rather than as a task switch from idle.
    This change is needed due to the recent use of _rcuidle()
    tracepoints that can be invoked from interrupt handlers as well
    as from idle. Without this fix, invoking _rcuidle() tracepoints
    from interrupt handlers results in splats and (more seriously)
    confusion on RCU's part as to whether a given CPU is idle or not.
    This confusion can in turn result in too-short grace periods and
    therefore random memory corruption.

    2. A fix to a subtle deadlock that could result due to RCU doing
    a wakeup while holding one of its rcu_node structure's locks.
    Although the probability of occurrence is low, it really
    does happen. The fix, courtesy of Steven Rostedt, uses
    irq_work_queue() to avoid the deadlock.

    3. A fix to a silent deadlock (invisible to lockdep) due to the
    interaction of timeouts posted by RCU debug code enabled by
    CONFIG_PROVE_RCU_DELAY=y, grace-period initialization, and CPU
    hotplug operations. This will not occur in production kernels,
    but really does occur in randconfig testing. Diagnosis courtesy
    of Steven Rostedt"

    * 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
    rcu: Fix deadlock with CPU hotplug, RCU GP init, and timer migration
    rcu: Don't call wakeup() with rcu_node structure ->lock held
    trace: Allow idle-safe tracepoints to be called from irq

    Linus Torvalds
     

13 Jun, 2013

6 commits

  • Merge misc fixes from Andrew Morton:
    "Bunch of fixes and one little addition to math64.h"

    * emailed patches from Andrew Morton: (27 commits)
    include/linux/math64.h: add div64_ul()
    mm: memcontrol: fix lockless reclaim hierarchy iterator
    frontswap: fix incorrect zeroing and allocation size for frontswap_map
    kernel/audit_tree.c:audit_add_tree_rule(): protect `rule' from kill_rules()
    mm: migration: add migrate_entry_wait_huge()
    ocfs2: add missing lockres put in dlm_mig_lockres_handler
    mm/page_alloc.c: fix watermark check in __zone_watermark_ok()
    drivers/misc/sgi-gru/grufile.c: fix info leak in gru_get_config_info()
    aio: fix io_destroy() regression by using call_rcu()
    rtc-at91rm9200: use shadow IMR on at91sam9x5
    rtc-at91rm9200: add shadow interrupt mask
    rtc-at91rm9200: refactor interrupt-register handling
    rtc-at91rm9200: add configuration support
    rtc-at91rm9200: add match-table compile guard
    fs/ocfs2/namei.c: remove unecessary ERROR when removing non-empty directory
    swap: avoid read_swap_cache_async() race to deadlock while waiting on discard I/O completion
    drivers/rtc/rtc-twl.c: fix missing device_init_wakeup() when booted with device tree
    cciss: fix broken mutex usage in ioctl
    audit: wait_for_auditd() should use TASK_UNINTERRUPTIBLE
    drivers/rtc/rtc-cmos.c: fix accidentally enabling rtc channel
    ...

    Linus Torvalds
     
  • audit_add_tree_rule() must set 'rule->tree = NULL;' first, to protect
    against the rule itself being freed in kill_rules().

    The reason is that once kill_rules() has run, the 'rule' itself may
    already have been released, so we must not access it. One example: we
    add a rule to an inode at the same time as another task is deleting
    that inode.

    The work flow for adding a rule:

    audit_receive() -> (need audit_cmd_mutex lock)
    audit_receive_skb() ->
    audit_receive_msg() ->
    audit_receive_filter() ->
    audit_add_rule() ->
    audit_add_tree_rule() -> (need audit_filter_mutex lock)
    ...
    unlock audit_filter_mutex
    get_tree()
    ...
    iterate_mounts() -> (iterate all related inodes)
    tag_mount() ->
    tag_trunk() ->
    create_trunk() -> (assume it is 1st rule)
    fsnotify_add_mark() ->
    fsnotify_add_inode_mark() -> (add mark to inode->i_fsnotify_marks)
    ...
    get_tree(); (each inode will get one)
    ...
    lock audit_filter_mutex

    The work flow for deleting an inode:

    __destroy_inode() ->
    fsnotify_inode_delete() ->
    __fsnotify_inode_delete() ->
    fsnotify_clear_marks_by_inode() -> (get mark from inode->i_fsnotify_marks)
    fsnotify_destroy_mark() ->
    fsnotify_destroy_mark_locked() ->
    audit_tree_freeing_mark() ->
    evict_chunk() ->
    ...
    tree->goner = 1
    ...
    kill_rules() -> (assume current->audit_context == NULL)
    call_rcu() -> (rule->tree != NULL)
    audit_free_rule_rcu() ->
    audit_free_rule()
    ...
    audit_schedule_prune() -> (assume current->audit_context == NULL)
    kthread_run() -> (need audit_cmd_mutex and audit_filter_mutex lock)
    prune_one() -> (delete it from prune_list)
    put_tree(); (match the original get_tree above)

    Signed-off-by: Chen Gang
    Cc: Eric Paris
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • audit_log_start() does wait_for_auditd() in a loop until
    audit_backlog_wait_time passes or audit_skb_queue has room.

    If signal_pending() is true this becomes a busy-wait loop, because
    schedule() in TASK_INTERRUPTIBLE won't block.
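
A toy model of why the task state matters. Everything here is an assumption made for illustration: the tick units, the `schedule_timeout` and `wait_for_room` helpers, and the rule that an iteration which does not sleep still costs one tick of CPU time.

```python
def schedule_timeout(state, signal_pending, timeout):
    """Return how many ticks the caller actually stayed asleep."""
    if state == "TASK_INTERRUPTIBLE" and signal_pending:
        return 0          # a pending signal ends an interruptible sleep at once
    return timeout        # otherwise sleep for the full timeout


def wait_for_room(state, signal_pending, wait_time=1000):
    """Loop until wait_time ticks have passed; count trips round the loop."""
    now, iterations = 0, 0
    while now < wait_time:
        slept = schedule_timeout(state, signal_pending, wait_time - now)
        # Assumption: an iteration that does not sleep still burns one tick.
        now += max(slept, 1)
        iterations += 1
    return iterations
```

With a signal pending, the interruptible variant spins through the loop once per tick, while the uninterruptible variant goes round once and sleeps for the whole wait.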

    Thanks to Guy for fully investigating and explaining the problem.

    (akpm: that'll cause the system to lock up on a non-preemptible
    uniprocessor kernel)

    (Guy: "Our customer was in fact running a uniprocessor machine, and they
    reported a system hang.")

    Signed-off-by: Oleg Nesterov
    Reported-by: Guy Streeter
    Cc: Eric Paris
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The dmesg_restrict sysctl currently covers the syslog method for
    accessing dmesg; however, /dev/kmsg isn't covered by the same
    protections. Most people haven't noticed because older versions of
    util-linux dmesg(1) default to the syslog method. Newer versions of
    util-linux dmesg(1) default to reading directly from /dev/kmsg.

    To fix /dev/kmsg, let's compare the existing interfaces and what they
    allow:

    - /proc/kmsg allows:
      - open (SYSLOG_ACTION_OPEN) if CAP_SYSLOG since it uses a destructive
        single-reader interface (SYSLOG_ACTION_READ).
      - everything, after an open.

    - syslog syscall allows:
      - anything, if CAP_SYSLOG.
      - SYSLOG_ACTION_READ_ALL and SYSLOG_ACTION_SIZE_BUFFER, if
        dmesg_restrict==0.
      - nothing else (EPERM).

    The use-cases were:
    - dmesg(1) needs to do non-destructive SYSLOG_ACTION_READ_ALLs.
    - sysklog(1) needs to open /proc/kmsg, drop privs, and still issue the
      destructive SYSLOG_ACTION_READs.

    AIUI, dmesg(1) is moving to /dev/kmsg, and systemd-journald doesn't
    clear the ring buffer.

    Based on the comments in devkmsg_llseek, it sounds like actions besides
    reading aren't going to be supported by /dev/kmsg (i.e.
    SYSLOG_ACTION_CLEAR), so we have a strict subset of the non-destructive
    syslog syscall actions.

    To this end, move the check as Josh had done, but also rename the
    constants to reflect their new uses (SYSLOG_FROM_CALL becomes
    SYSLOG_FROM_READER, and SYSLOG_FROM_FILE becomes SYSLOG_FROM_PROC).
    SYSLOG_FROM_READER allows non-destructive actions, and SYSLOG_FROM_PROC
    allows destructive actions after a capabilities-constrained
    SYSLOG_ACTION_OPEN check.

    - /dev/kmsg allows:
      - open if CAP_SYSLOG or dmesg_restrict==0
      - reading/polling, after open
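
The access rules listed in this message can be written down as a small decision function. This is a user-space sketch of the rules as stated here, not the kernel code: the function names `syslog_allowed` and `devkmsg_open_allowed` are hypothetical, and treating a /dev/kmsg open like a non-destructive SYSLOG_ACTION_READ_ALL from a reader is an assumption.

```python
READ_ALL = "SYSLOG_ACTION_READ_ALL"
SIZE_BUFFER = "SYSLOG_ACTION_SIZE_BUFFER"
OPEN = "SYSLOG_ACTION_OPEN"
READ = "SYSLOG_ACTION_READ"
CLEAR = "SYSLOG_ACTION_CLEAR"

FROM_READER = "SYSLOG_FROM_READER"   # syslog syscall and /dev/kmsg
FROM_PROC = "SYSLOG_FROM_PROC"       # /proc/kmsg


def syslog_allowed(action, source, has_cap_syslog, dmesg_restrict):
    """Return True if the action is permitted under the rules above."""
    # /proc/kmsg: only the open is checked; everything after it is allowed.
    if source == FROM_PROC and action != OPEN:
        return True
    # CAP_SYSLOG permits anything.
    if has_cap_syslog:
        return True
    # Without the capability, dmesg_restrict forbids everything.
    if dmesg_restrict:
        return False
    # Unrestricted: only the two non-destructive queries are open to all.
    return action in (READ_ALL, SIZE_BUFFER)


def devkmsg_open_allowed(has_cap_syslog, dmesg_restrict):
    """Assumption: opening /dev/kmsg is checked like a reader's READ_ALL."""
    return syslog_allowed(READ_ALL, FROM_READER, has_cap_syslog, dmesg_restrict)
```

For example, `devkmsg_open_allowed(False, 1)` is refused, which is the gap the message says /dev/kmsg was missing.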

    Addresses https://bugzilla.redhat.com/show_bug.cgi?id=903192

    [akpm@linux-foundation.org: use pr_warn_once()]
    Signed-off-by: Kees Cook
    Reported-by: Christian Kujau
    Tested-by: Josh Boyer
    Cc: Kay Sievers
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • We recently noticed that reboot of a 1024 cpu machine takes approx 16
    minutes of just stopping the cpus. The slowdown was tracked to commit
    f96972f2dc63 ("kernel/sys.c: call disable_nonboot_cpus() in
    kernel_restart()").

    The current implementation does all the work of hot removing the cpus
    before halting the system. We are switching to just migrating to the
    boot cpu and then continuing with shutdown/reboot.

    This also has the effect of not breaking x86's command line parameter
    for specifying the reboot cpu. Note, this code was shamelessly copied
    from arch/x86/kernel/reboot.c with bits removed pertaining to the
    reboot_cpu command line parameter.
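
As a rough back-of-the-envelope check of the figures quoted above, and assuming (this is not stated in the message) that the 16 minutes are spread evenly over the non-boot CPUs, the cost per CPU can be estimated as follows. The helper name is hypothetical.

```python
def per_cpu_offline_seconds(total_minutes, cpus):
    """Average seconds per non-boot CPU, assuming equal cost per CPU."""
    non_boot = cpus - 1          # the boot CPU is not taken down
    return total_minutes * 60 / non_boot
```

For 1024 CPUs and 16 minutes this comes to a little under one second per CPU, which is why skipping the hot-remove path matters on large machines.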

    Signed-off-by: Robin Holt
    Tested-by: Shawn Guo
    Cc: "Srivatsa S. Bhat"
    Cc: H. Peter Anvin
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Russ Anderson
    Cc: Robin Holt
    Cc: Russell King
    Cc: Guan Xuetao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
     
  • There are instances in the kernel where we would like to disable CPU
    hotplug (from sysfs) during some important operation. Today the freezer
    code depends on this and the code to do it was kinda tailor-made for
    that.

    Restructure the code and make it generic enough to be useful for other
    usecases too.

    Signed-off-by: Srivatsa S. Bhat
    Signed-off-by: Robin Holt
    Cc: H. Peter Anvin
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Russ Anderson
    Cc: Robin Holt
    Cc: Russell King
    Cc: Guan Xuetao
    Cc: Shawn Guo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Srivatsa S. Bhat
     

12 Jun, 2013

2 commits

  • …it/rostedt/linux-trace

    Pull tracing fix from Steven Rostedt:
    "Yoshihiro Yunomae fixed a regression in the output format when using
    one of the counter clocks.

    The new multibuffer code changed the trace_clock file to update the
    trace instances tr->clock_id but the actual traces still used the
    value from the obsolete global variable trace_clock_id"

    * tag 'trace-fixes-v3.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: Fix outputting formats of x86-tsc and counter when use trace_clock

    Linus Torvalds
     
  • Outputting formats of x86-tsc and counter should be a raw format, but after
    applying the patch(2b6080f28c7cc3efc8625ab71495aae89aeb63a0), the format was
    changed to nanosec. This is because the global variable trace_clock_id was used.
    When we use multiple buffers, clock_id of each sub-buffer should be used. Then,
    this patch uses tr->clock_id instead of the global variable trace_clock_id.

    [ Basically, this fixes a regression where the multibuffer code changed the
    trace_clock file to update tr->clock_id but the traces still use the old
    global trace_clock_id variable, negating the file's effect. The global
    trace_clock_id variable is obsolete and removed. - SR ]
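
The difference between a global clock id and a per-buffer one can be shown with a toy formatter. This is a sketch only: `TraceArray`, `format_timestamp` and the seconds.microseconds layout for nanosecond clocks are illustrative assumptions, not the kernel's output code.

```python
RAW_CLOCKS = {"counter", "x86-tsc"}   # clocks whose values are not nanoseconds


class TraceArray:
    """Toy stand-in for one trace buffer instance with its own clock."""

    def __init__(self, clock="local"):
        self.clock = clock


def format_timestamp(tr, ts):
    """Format ts according to the clock of this buffer, not a global one."""
    if tr.clock in RAW_CLOCKS:
        return str(ts)                       # raw counter value, unchanged
    secs, nsecs = divmod(ts, 1_000_000_000)  # nanoseconds -> sec.usec
    return f"{secs}.{nsecs // 1000:06d}"
```

Two instances with different clocks now format the same value differently, which a single global variable cannot express.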

    Link: http://lkml.kernel.org/r/20130423013239.22334.7394.stgit@yunodevel

    Signed-off-by: Yoshihiro YUNOMAE
    Signed-off-by: Steven Rostedt

    Yoshihiro YUNOMAE
     

11 Jun, 2013

3 commits

  • The stop machine logic can lock up if all but one of the migration
    threads make it through the disable-irq step and the one remaining
    thread gets stuck in __do_softirq. The reason __do_softirq can hang is
    that it has a bail-out based on jiffies timeout, but in the lockup case,
    jiffies itself is not incremented.

    To work around this, re-add the max_restart counter in __do_softirq and
    stop processing softirqs after 10 restarts.
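
The effect of the restart counter can be seen in a small simulation. This is a user-space sketch, not the kernel loop: `do_softirq`, its callback parameters and the `give_up_after` demo guard are hypothetical, and only the limit of 10 comes from the message above.

```python
MAX_SOFTIRQ_RESTART = 10


def do_softirq(run_pass, read_jiffies, max_restart=MAX_SOFTIRQ_RESTART,
               budget=2, give_up_after=1000):
    """Process softirqs until none are pending or a bail-out fires.

    run_pass() handles one batch and returns True if more work is pending.
    read_jiffies() returns the current tick count.
    Returns (passes, reason).
    """
    end = read_jiffies() + budget
    passes = 0
    while True:
        passes += 1
        if not run_pass():
            return passes, "done"
        if read_jiffies() >= end:          # time-based bail-out
            return passes, "deferred"
        if max_restart is not None:        # counter-based bail-out
            max_restart -= 1
            if max_restart == 0:
                return passes, "deferred"
        if passes >= give_up_after:        # demo guard against a real hang
            return passes, "stuck"
```

With jiffies frozen (`read_jiffies` always returning 0) and work always pending, the time-only variant (`max_restart=None`) only stops at the demo guard, while the counter stops it after ten passes.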

    Thanks to Tejun Heo and Rusty Russell and others for helping me track
    this down.

    This was introduced in 3.9 by commit c10d73671ad3 ("softirq: reduce
    latencies").

    It may be worth looking into ath9k to see if it has issues with its irq
    handler at a later date.

    The hang stack traces look something like this:

    ------------[ cut here ]------------
    WARNING: at kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xa7()
    Watchdog detected hard LOCKUP on cpu 2
    Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc]
    Pid: 23, comm: migration/2 Tainted: G C 3.9.4+ #11
    Call Trace:
    warn_slowpath_common+0x85/0x9f
    warn_slowpath_fmt+0x46/0x48
    watchdog_overflow_callback+0x9c/0xa7
    __perf_event_overflow+0x137/0x1cb
    perf_event_overflow+0x14/0x16
    intel_pmu_handle_irq+0x2dc/0x359
    perf_event_nmi_handler+0x19/0x1b
    nmi_handle+0x7f/0xc2
    do_nmi+0xbc/0x304
    end_repeat_nmi+0x1e/0x2e
    <>
    cpu_stopper_thread+0xae/0x162
    smpboot_thread_fn+0x258/0x260
    kthread+0xc7/0xcf
    ret_from_fork+0x7c/0xb0
    ---[ end trace 4947dfa9b0a4cec3 ]---
    BUG: soft lockup - CPU#1 stuck for 22s! [migration/1:17]
    Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc]
    irq event stamp: 835637905
    hardirqs last enabled at (835637904): __do_softirq+0x9f/0x257
    hardirqs last disabled at (835637905): apic_timer_interrupt+0x6d/0x80
    softirqs last enabled at (5654720): __do_softirq+0x1ff/0x257
    softirqs last disabled at (5654725): irq_exit+0x5f/0xbb
    CPU 1
    Pid: 17, comm: migration/1 Tainted: G WC 3.9.4+ #11 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
    RIP: tasklet_hi_action+0xf0/0xf0
    Process migration/1
    Call Trace:

    __do_softirq+0x117/0x257
    irq_exit+0x5f/0xbb
    smp_apic_timer_interrupt+0x8a/0x98
    apic_timer_interrupt+0x72/0x80

    printk+0x4d/0x4f
    stop_machine_cpu_stop+0x22c/0x274
    cpu_stopper_thread+0xae/0x162
    smpboot_thread_fn+0x258/0x260
    kthread+0xc7/0xcf
    ret_from_fork+0x7c/0xb0

    Signed-off-by: Ben Greear
    Acked-by: Tejun Heo
    Acked-by: Pekka Riikonen
    Cc: Eric Dumazet
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Ben Greear
     
  • In Steven Rostedt's words:

    > I've been debugging the last couple of days why my tests have been
    > locking up. One of my tracing tests runs all available tracers. The
    > lockup always happened with the mmiotrace, which is used to trace
    > interactions between proprietary drivers and the kernel. But to do this
    > easily, when the tracer gets registered, it disables all but the boot
    > CPU. The lockup always happened after it got done disabling the CPUs.
    >
    > Then I decided to try this:
    >
    > while :; do
    >     for i in 1 2 3; do
    >         echo 0 > /sys/devices/system/cpu/cpu$i/online
    >     done
    >     for i in 1 2 3; do
    >         echo 1 > /sys/devices/system/cpu/cpu$i/online
    >     done
    > done
    >
    > Well, sure enough, that locked up too, with the same users. Doing a
    > sysrq-w (showing all blocked tasks):
    >
    > [ 2991.344562] task PC stack pid father
    > [ 2991.344562] rcu_preempt D ffff88007986fdf8 0 10 2 0x00000000
    > [ 2991.344562] ffff88007986fc98 0000000000000002 ffff88007986fc48 0000000000000908
    > [ 2991.344562] ffff88007986c280 ffff88007986ffd8 ffff88007986ffd8 00000000001d3c80
    > [ 2991.344562] ffff880079248a40 ffff88007986c280 0000000000000000 00000000fffd4295
    > [ 2991.344562] Call Trace:
    > [ 2991.344562] [] schedule+0x64/0x66
    > [ 2991.344562] [] schedule_timeout+0xbc/0xf9
    > [ 2991.344562] [] ? ftrace_call+0x5/0x2f
    > [ 2991.344562] [] ? cascade+0xa8/0xa8
    > [ 2991.344562] [] schedule_timeout_uninterruptible+0x1e/0x20
    > [ 2991.344562] [] rcu_gp_kthread+0x502/0x94b
    > [ 2991.344562] [] ? __init_waitqueue_head+0x50/0x50
    > [ 2991.344562] [] ? rcu_gp_fqs+0x64/0x64
    > [ 2991.344562] [] kthread+0xb1/0xb9
    > [ 2991.344562] [] ? lock_release_holdtime.part.23+0x4e/0x55
    > [ 2991.344562] [] ? __init_kthread_worker+0x58/0x58
    > [ 2991.344562] [] ret_from_fork+0x7c/0xb0
    > [ 2991.344562] [] ? __init_kthread_worker+0x58/0x58
    > [ 2991.344562] kworker/0:1 D ffffffff81a30680 0 47 2 0x00000000
    > [ 2991.344562] Workqueue: events cpuset_hotplug_workfn
    > [ 2991.344562] ffff880078dbbb58 0000000000000002 0000000000000006 00000000000000d8
    > [ 2991.344562] ffff880078db8100 ffff880078dbbfd8 ffff880078dbbfd8 00000000001d3c80
    > [ 2991.344562] ffff8800779ca5c0 ffff880078db8100 ffffffff81541fcf 0000000000000000
    > [ 2991.344562] Call Trace:
    > [ 2991.344562] [] ? __mutex_lock_common+0x3d4/0x609
    > [ 2991.344562] [] schedule+0x64/0x66
    > [ 2991.344562] [] schedule_preempt_disabled+0x18/0x24
    > [ 2991.344562] [] __mutex_lock_common+0x3d4/0x609
    > [ 2991.344562] [] ? get_online_cpus+0x3c/0x50
    > [ 2991.344562] [] ? get_online_cpus+0x3c/0x50
    > [ 2991.344562] [] mutex_lock_nested+0x3b/0x40
    > [ 2991.344562] [] get_online_cpus+0x3c/0x50
    > [ 2991.344562] [] rebuild_sched_domains_locked+0x6e/0x3a8
    > [ 2991.344562] [] rebuild_sched_domains+0x1c/0x2a
    > [ 2991.344562] [] cpuset_hotplug_workfn+0x1c7/0x1d3
    > [ 2991.344562] [] ? cpuset_hotplug_workfn+0x5/0x1d3
    > [ 2991.344562] [] process_one_work+0x2d4/0x4d1
    > [ 2991.344562] [] ? process_one_work+0x207/0x4d1
    > [ 2991.344562] [] worker_thread+0x2e7/0x3b5
    > [ 2991.344562] [] ? rescuer_thread+0x332/0x332
    > [ 2991.344562] [] kthread+0xb1/0xb9
    > [ 2991.344562] [] ? __init_kthread_worker+0x58/0x58
    > [ 2991.344562] [] ret_from_fork+0x7c/0xb0
    > [ 2991.344562] [] ? __init_kthread_worker+0x58/0x58
    > [ 2991.344562] bash D ffffffff81a4aa80 0 2618 2612 0x10000000
    > [ 2991.344562] ffff8800379abb58 0000000000000002 0000000000000006 0000000000000c2c
    > [ 2991.344562] ffff880077fea140 ffff8800379abfd8 ffff8800379abfd8 00000000001d3c80
    > [ 2991.344562] ffff8800779ca5c0 ffff880077fea140 ffffffff81541fcf 0000000000000000
    > [ 2991.344562] Call Trace:
    > [ 2991.344562] [] ? __mutex_lock_common+0x3d4/0x609
    > [ 2991.344562] [] schedule+0x64/0x66
    > [ 2991.344562] [] schedule_preempt_disabled+0x18/0x24
    > [ 2991.344562] [] __mutex_lock_common+0x3d4/0x609
    > [ 2991.344562] [] ? rcu_cpu_notify+0x2f5/0x86e
    > [ 2991.344562] [] ? rcu_cpu_notify+0x2f5/0x86e
    > [ 2991.344562] [] mutex_lock_nested+0x3b/0x40
    > [ 2991.344562] [] rcu_cpu_notify+0x2f5/0x86e
    > [ 2991.344562] [] ? __lock_is_held+0x32/0x53
    > [ 2991.344562] [] notifier_call_chain+0x6b/0x98
    > [ 2991.344562] [] __raw_notifier_call_chain+0xe/0x10
    > [ 2991.344562] [] __cpu_notify+0x20/0x32
    > [ 2991.344562] [] cpu_notify_nofail+0x17/0x36
    > [ 2991.344562] [] _cpu_down+0x154/0x259
    > [ 2991.344562] [] cpu_down+0x2d/0x3a
    > [ 2991.344562] [] store_online+0x4e/0xe7
    > [ 2991.344562] [] dev_attr_store+0x20/0x22
    > [ 2991.344562] [] sysfs_write_file+0x108/0x144
    > [ 2991.344562] [] vfs_write+0xfd/0x158
    > [ 2991.344562] [] SyS_write+0x5c/0x83
    > [ 2991.344562] [] tracesys+0xdd/0xe2
    >
    > As well as held locks:
    >
    > [ 3034.728033] Showing all locks held in the system:
    > [ 3034.728033] 1 lock held by rcu_preempt/10:
    > [ 3034.728033] #0: (rcu_preempt_state.onoff_mutex){+.+...}, at: [] rcu_gp_kthread+0x167/0x94b
    > [ 3034.728033] 4 locks held by kworker/0:1/47:
    > [ 3034.728033] #0: (events){.+.+.+}, at: [] process_one_work+0x207/0x4d1
    > [ 3034.728033] #1: (cpuset_hotplug_work){+.+.+.}, at: [] process_one_work+0x207/0x4d1
    > [ 3034.728033] #2: (cpuset_mutex){+.+.+.}, at: [] rebuild_sched_domains+0x17/0x2a
    > [ 3034.728033] #3: (cpu_hotplug.lock){+.+.+.}, at: [] get_online_cpus+0x3c/0x50
    > [ 3034.728033] 1 lock held by mingetty/2563:
    > [ 3034.728033] #0: (&ldata->atomic_read_lock){+.+...}, at: [] n_tty_read+0x252/0x7e8
    > [ 3034.728033] 1 lock held by mingetty/2565:
    > [ 3034.728033] #0: (&ldata->atomic_read_lock){+.+...}, at: [] n_tty_read+0x252/0x7e8
    > [ 3034.728033] 1 lock held by mingetty/2569:
    > [ 3034.728033] #0: (&ldata->atomic_read_lock){+.+...}, at: [] n_tty_read+0x252/0x7e8
    > [ 3034.728033] 1 lock held by mingetty/2572:
    > [ 3034.728033] #0: (&ldata->atomic_read_lock){+.+...}, at: [] n_tty_read+0x252/0x7e8
    > [ 3034.728033] 1 lock held by mingetty/2575:
    > [ 3034.728033] #0: (&ldata->atomic_read_lock){+.+...}, at: [] n_tty_read+0x252/0x7e8
    > [ 3034.728033] 7 locks held by bash/2618:
    > [ 3034.728033] #0: (sb_writers#5){.+.+.+}, at: [] file_start_write+0x2a/0x2c
    > [ 3034.728033] #1: (&buffer->mutex#2){+.+.+.}, at: [] sysfs_write_file+0x3c/0x144
    > [ 3034.728033] #2: (s_active#54){.+.+.+}, at: [] sysfs_write_file+0xe7/0x144
    > [ 3034.728033] #3: (x86_cpu_hotplug_driver_mutex){+.+.+.}, at: [] cpu_hotplug_driver_lock+0x17/0x19
    > [ 3034.728033] #4: (cpu_add_remove_lock){+.+.+.}, at: [] cpu_maps_update_begin+0x17/0x19
    > [ 3034.728033] #5: (cpu_hotplug.lock){+.+.+.}, at: [] cpu_hotplug_begin+0x2c/0x6d
    > [ 3034.728033] #6: (rcu_preempt_state.onoff_mutex){+.+...}, at: [] rcu_cpu_notify+0x2f5/0x86e
    > [ 3034.728033] 1 lock held by bash/2980:
    > [ 3034.728033] #0: (&ldata->atomic_read_lock){+.+...}, at: [] n_tty_read+0x252/0x7e8
    >
    > Things looked a little weird. Also, this is a deadlock that lockdep did
    > not catch. But what we have here does not look like a circular lock
    > issue:
    >
    > Bash is blocked in rcu_cpu_notify():
    >
    > 1961 /* Exclude any attempts to start a new grace period. */
    > 1962 mutex_lock(&rsp->onoff_mutex);
    >
    >
    > kworker is blocked in get_online_cpus(), which makes sense as we are
    > currently taking down a CPU.
    >
    > But rcu_preempt is not blocked on anything. It is simply sleeping in
    > rcu_gp_kthread (really rcu_gp_init) here:
    >
    > 1453 #ifdef CONFIG_PROVE_RCU_DELAY
    > 1454 if ((prandom_u32() % (rcu_num_nodes * 8)) == 0 &&
    > 1455 system_state == SYSTEM_RUNNING)
    > 1456 schedule_timeout_uninterruptible(2);
    > 1457 #endif /* #ifdef CONFIG_PROVE_RCU_DELAY */
    >
    > And it does this while holding the onoff_mutex that bash is waiting for.
    >
    > Doing a function trace, it showed me where it happened:
    >
    > [ 125.940066] rcu_pree-10 3.... 28384115273: schedule_timeout_uninterruptible [...]
    > [ 125.940066] rcu_pree-10 3d..3 28384202439: sched_switch: prev_comm=rcu_preempt prev_pid=10 prev_prio=120 prev_state=D ==> next_comm=watchdog/3 next_pid=38 next_prio=120
    >
    > The watchdog ran, and then:
    >
    > [ 125.940066] watchdog-38 3d..3 28384692863: sched_switch: prev_comm=watchdog/3 prev_pid=38 prev_prio=120 prev_state=P ==> next_comm=modprobe next_pid=2848 next_prio=118
    >
    > Not sure what modprobe was doing, but shortly after that:
    >
    > [ 125.940066] modprobe-2848 3d..3 28385041749: sched_switch: prev_comm=modprobe prev_pid=2848 prev_prio=118 prev_state=R+ ==> next_comm=migration/3 next_pid=40 next_prio=0
    >
    > Where the migration thread took down the CPU:
    >
    > [ 125.940066] migratio-40 3d..3 28389148276: sched_switch: prev_comm=migration/3 prev_pid=40 prev_prio=0 prev_state=P ==> next_comm=swapper/3 next_pid=0 next_prio=120
    >
    > which finally did:
    >
    > [ 125.940066] -0 3...1 28389282142: arch_cpu_idle_dead
    > [ 125.940066] -0 3...1 28389282548: native_play_dead
    > [ 125.940066] -0 3...1 28389282924: play_dead_common
    > [ 125.940066] -0 3...1 28389283468: idle_task_exit
    > [ 125.940066] -0 3...1 28389284644: amd_e400_remove_cpu
    >
    > CPU 3 is now offline, the rcu_preempt thread that ran on CPU 3 is still
    > doing a schedule_timeout_uninterruptible() and it registered its
    > timeout to the timer base for CPU 3. You would think that it would get
    > migrated, right? The issue here is that the timer migration happens at
    > the CPU notifier for CPU_DEAD. The problem is that the rcu notifier for
    > CPU_DOWN is blocked waiting for the onoff_mutex to be released, which is
    > held by the thread that just put itself into an uninterruptible sleep,
    > that won't wake up until the CPU_DEAD notifier of the timer
    > infrastructure is called, which won't happen until the rcu notifier
    > finishes. Here's our deadlock!

    This commit breaks this deadlock cycle by substituting a shorter udelay()
    for the previous schedule_timeout_uninterruptible(), while at the same
    time increasing the probability of the delay. This maintains the intensity
    of the testing.
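
The cycle described in the quoted mail can be followed link by link in a small model. This is a sketch: the node labels and the `find_wait_cycle` helper are made up for illustration, and each entry simply records "X cannot proceed until Y does".

```python
def find_wait_cycle(waits_on, start):
    """Follow 'X waits on Y' links from start; return the cycle, or None."""
    seen = []
    node = start
    while node in waits_on:
        if node in seen:
            return seen[seen.index(node):]   # the chain closed on itself
        seen.append(node)
        node = waits_on[node]
    return None                              # chain ended: someone can run


# Links taken from the description above (labels are illustrative).
BEFORE_FIX = {
    "rcu CPU_DOWN notifier": "onoff_mutex",
    "onoff_mutex": "rcu_preempt sleep",
    "rcu_preempt sleep": "timer on offline CPU 3",
    "timer on offline CPU 3": "timer CPU_DEAD notifier",
    "timer CPU_DEAD notifier": "rcu CPU_DOWN notifier",
}

# With udelay() the delay is a busy-wait and depends on no timer,
# so the holder of onoff_mutex no longer waits on anything.
AFTER_FIX = {
    "rcu CPU_DOWN notifier": "onoff_mutex",
    "onoff_mutex": "rcu_preempt delay",
    "timer CPU_DEAD notifier": "rcu CPU_DOWN notifier",
}
```

Starting from the rcu CPU_DOWN notifier, the first table loops back to its start; the second ends at a thread that can make progress.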

    Reported-by: Steven Rostedt
    Signed-off-by: Paul E. McKenney
    Tested-by: Steven Rostedt

    Paul E. McKenney
     
  • This commit fixes a lockdep-detected deadlock by moving a wake_up()
    call out from a rnp->lock critical section. Please see below for
    the long version of this story.

    On Tue, 2013-05-28 at 16:13 -0400, Dave Jones wrote:

    > [12572.705832] ======================================================
    > [12572.750317] [ INFO: possible circular locking dependency detected ]
    > [12572.796978] 3.10.0-rc3+ #39 Not tainted
    > [12572.833381] -------------------------------------------------------
    > [12572.862233] trinity-child17/31341 is trying to acquire lock:
    > [12572.870390] (rcu_node_0){..-.-.}, at: [] rcu_read_unlock_special+0x9f/0x4c0
    > [12572.878859]
    > but task is already holding lock:
    > [12572.894894] (&ctx->lock){-.-...}, at: [] perf_lock_task_context+0x7d/0x2d0
    > [12572.903381]
    > which lock already depends on the new lock.
    >
    > [12572.927541]
    > the existing dependency chain (in reverse order) is:
    > [12572.943736]
    > -> #4 (&ctx->lock){-.-...}:
    > [12572.960032] [] lock_acquire+0x91/0x1f0
    > [12572.968337] [] _raw_spin_lock+0x40/0x80
    > [12572.976633] [] __perf_event_task_sched_out+0x2e7/0x5e0
    > [12572.984969] [] perf_event_task_sched_out+0x93/0xa0
    > [12572.993326] [] __schedule+0x2cf/0x9c0
    > [12573.001652] [] schedule_user+0x2e/0x70
    > [12573.009998] [] retint_careful+0x12/0x2e
    > [12573.018321]
    > -> #3 (&rq->lock){-.-.-.}:
    > [12573.034628] [] lock_acquire+0x91/0x1f0
    > [12573.042930] [] _raw_spin_lock+0x40/0x80
    > [12573.051248] [] wake_up_new_task+0xb7/0x260
    > [12573.059579] [] do_fork+0x105/0x470
    > [12573.067880] [] kernel_thread+0x26/0x30
    > [12573.076202] [] rest_init+0x23/0x140
    > [12573.084508] [] start_kernel+0x3f1/0x3fe
    > [12573.092852] [] x86_64_start_reservations+0x2a/0x2c
    > [12573.101233] [] x86_64_start_kernel+0xcc/0xcf
    > [12573.109528]
    > -> #2 (&p->pi_lock){-.-.-.}:
    > [12573.125675] [] lock_acquire+0x91/0x1f0
    > [12573.133829] [] _raw_spin_lock_irqsave+0x4b/0x90
    > [12573.141964] [] try_to_wake_up+0x31/0x320
    > [12573.150065] [] default_wake_function+0x12/0x20
    > [12573.158151] [] autoremove_wake_function+0x18/0x40
    > [12573.166195] [] __wake_up_common+0x58/0x90
    > [12573.174215] [] __wake_up+0x39/0x50
    > [12573.182146] [] rcu_start_gp_advanced.isra.11+0x4a/0x50
    > [12573.190119] [] rcu_start_future_gp+0x1c9/0x1f0
    > [12573.198023] [] rcu_nocb_kthread+0x114/0x930
    > [12573.205860] [] kthread+0xed/0x100
    > [12573.213656] [] ret_from_fork+0x7c/0xb0
    > [12573.221379]
    > -> #1 (&rsp->gp_wq){..-.-.}:
    > [12573.236329] [] lock_acquire+0x91/0x1f0
    > [12573.243783] [] _raw_spin_lock_irqsave+0x4b/0x90
    > [12573.251178] [] __wake_up+0x23/0x50
    > [12573.258505] [] rcu_start_gp_advanced.isra.11+0x4a/0x50
    > [12573.265891] [] rcu_start_future_gp+0x1c9/0x1f0
    > [12573.273248] [] rcu_nocb_kthread+0x114/0x930
    > [12573.280564] [] kthread+0xed/0x100
    > [12573.287807] [] ret_from_fork+0x7c/0xb0

    Notice the above call chain.

    rcu_start_future_gp() is called with the rnp->lock held. Then it calls
    rcu_start_gp_advanced(), which does a wakeup.

    You can't do wakeups while holding the rnp->lock, as that would mean
    that you could not do a rcu_read_unlock() while holding the rq lock, or
    any lock that was taken while holding the rq lock. This is because...
    (See below).

    > [12573.295067]
    > -> #0 (rcu_node_0){..-.-.}:
    > [12573.309293] [] __lock_acquire+0x1786/0x1af0
    > [12573.316568] [] lock_acquire+0x91/0x1f0
    > [12573.323825] [] _raw_spin_lock+0x40/0x80
    > [12573.331081] [] rcu_read_unlock_special+0x9f/0x4c0
    > [12573.338377] [] __rcu_read_unlock+0x96/0xa0
    > [12573.345648] [] perf_lock_task_context+0x143/0x2d0
    > [12573.352942] [] find_get_context+0x4e/0x1f0
    > [12573.360211] [] SYSC_perf_event_open+0x514/0xbd0
    > [12573.367514] [] SyS_perf_event_open+0x9/0x10
    > [12573.374816] [] tracesys+0xdd/0xe2

    Notice the above trace.

    perf took its own ctx->lock, which can be taken while holding the rq
    lock. While holding this lock, it did a rcu_read_unlock(). The
    perf_lock_task_context() basically looks like:

    rcu_read_lock();
    raw_spin_lock(ctx->lock);
    rcu_read_unlock();

    Now, what looks to have happened, is that we scheduled after taking that
    first rcu_read_lock() but before taking the spin lock. When we scheduled
    back in and took the ctx->lock, the following rcu_read_unlock()
    triggered the "special" code.

    The rcu_read_unlock_special() takes the rnp->lock, which gives us a
    possible deadlock scenario.

    CPU0                CPU1                         CPU2
    ----                ----                         ----

                                                     rcu_nocb_kthread()
    lock(rq->lock);
                        lock(ctx->lock);
                                                     lock(rnp->lock);

                                                     wake_up();

                                                     lock(rq->lock);

                        rcu_read_unlock();

                        rcu_read_unlock_special();

                        lock(rnp->lock);
    lock(ctx->lock);

    **** DEADLOCK ****
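
The three-CPU scenario is a cycle in the lock acquisition order. A minimal sketch of how such a cycle can be detected from per-CPU acquisition sequences follows; it is loosely in the spirit of lockdep but is not lockdep's algorithm, and the helper names are hypothetical.

```python
def order_edges(sequences):
    """Turn acquisition sequences into 'held -> wanted' ordering edges."""
    edges = set()
    for seq in sequences:
        for held, wanted in zip(seq, seq[1:]):
            edges.add((held, wanted))
    return edges


def has_cycle(edges):
    """Depth-first search for a cycle in the ordering graph."""
    graph = {}
    for held, wanted in edges:
        graph.setdefault(held, set()).add(wanted)
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True                      # back edge: ordering cycle
        visiting.add(node)
        found = any(visit(nxt) for nxt in graph.get(node, ()))
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(node) for node in list(graph))


# Acquisition order per CPU, as in the scenario above.
WAKEUP_UNDER_LOCK = [
    ["rq->lock", "ctx->lock"],     # CPU0: scheduler path
    ["ctx->lock", "rnp->lock"],    # CPU1: rcu_read_unlock_special()
    ["rnp->lock", "rq->lock"],     # CPU2: wake_up() under rnp->lock
]

# If the wakeup happens after rnp->lock is dropped, CPU2 never
# holds both locks at once.
WAKEUP_AFTER_UNLOCK = [
    ["rq->lock", "ctx->lock"],
    ["ctx->lock", "rnp->lock"],
    ["rnp->lock"],
    ["rq->lock"],
]
```

`has_cycle(order_edges(WAKEUP_UNDER_LOCK))` reports the cycle; moving the wakeup outside the critical section removes the `rnp->lock -> rq->lock` edge and with it the cycle.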

    > [12573.382068]
    > other info that might help us debug this:
    >
    > [12573.403229] Chain exists of:
    > rcu_node_0 --> &rq->lock --> &ctx->lock
    >
    > [12573.424471] Possible unsafe locking scenario:
    >
    > [12573.438499]        CPU0                    CPU1
    > [12573.445599]        ----                    ----
    > [12573.452691]   lock(&ctx->lock);
    > [12573.459799]                                lock(&rq->lock);
    > [12573.467010]                                lock(&ctx->lock);
    > [12573.474192]   lock(rcu_node_0);
    > [12573.481262]
    > *** DEADLOCK ***
    >
    > [12573.501931] 1 lock held by trinity-child17/31341:
    > [12573.508990] #0: (&ctx->lock){-.-...}, at: [] perf_lock_task_context+0x7d/0x2d0
    > [12573.516475]
    > stack backtrace:
    > [12573.530395] CPU: 1 PID: 31341 Comm: trinity-child17 Not tainted 3.10.0-rc3+ #39
    > [12573.545357] ffffffff825b4f90 ffff880219f1dbc0 ffffffff816e375b ffff880219f1dc00
    > [12573.552868] ffffffff816dfa5d ffff880219f1dc50 ffff88023ce4d1f8 ffff88023ce4ca40
    > [12573.560353] 0000000000000001 0000000000000001 ffff88023ce4d1f8 ffff880219f1dcc0
    > [12573.567856] Call Trace:
    > [12573.575011] [] dump_stack+0x19/0x1b
    > [12573.582284] [] print_circular_bug+0x200/0x20f
    > [12573.589637] [] __lock_acquire+0x1786/0x1af0
    > [12573.596982] [] ? sched_clock_cpu+0xb5/0x100
    > [12573.604344] [] lock_acquire+0x91/0x1f0
    > [12573.611652] [] ? rcu_read_unlock_special+0x9f/0x4c0
    > [12573.619030] [] _raw_spin_lock+0x40/0x80
    > [12573.626331] [] ? rcu_read_unlock_special+0x9f/0x4c0
    > [12573.633671] [] rcu_read_unlock_special+0x9f/0x4c0
    > [12573.640992] [] ? perf_lock_task_context+0x7d/0x2d0
    > [12573.648330] [] ? put_lock_stats.isra.29+0xe/0x40
    > [12573.655662] [] ? delay_tsc+0x90/0xe0
    > [12573.662964] [] __rcu_read_unlock+0x96/0xa0
    > [12573.670276] [] perf_lock_task_context+0x143/0x2d0
    > [12573.677622] [] ? __perf_event_enable+0x370/0x370
    > [12573.684981] [] find_get_context+0x4e/0x1f0
    > [12573.692358] [] SYSC_perf_event_open+0x514/0xbd0
    > [12573.699753] [] ? get_parent_ip+0xd/0x50
    > [12573.707135] [] ? trace_hardirqs_on_caller+0xfd/0x1c0
    > [12573.714599] [] SyS_perf_event_open+0x9/0x10
    > [12573.721996] [] tracesys+0xdd/0xe2

    This commit delays the wakeup via irq_work(), which is what
    perf and ftrace use to perform wakeups in critical sections.
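
    The deferral pattern can be sketched in user space as follows. This is
    only a model under stated assumptions: it does not use the kernel's
    irq_work API, and struct node_model and its helpers are hypothetical
    names. Under the lock, the code only records that a wakeup is needed;
    the wakeup itself runs once the lock has been dropped.

```c
#include <assert.h>
#include <stdbool.h>

struct node_model {
    bool locked;        /* models the rcu_node ->lock being held */
    bool wake_pending;  /* models a queued, not yet executed wakeup */
    int wakeups;        /* number of wakeups actually performed */
};

/* The wakeup takes other locks, so it must never run under the node lock. */
static void do_wakeup(struct node_model *n)
{
    assert(!n->locked);
    n->wakeups++;
}

/* Called with the lock held: only note that a wakeup is wanted. */
static void request_wakeup(struct node_model *n)
{
    assert(n->locked);
    n->wake_pending = true;
}

/* Called from a safe context: perform the wakeup if one was requested. */
static void run_deferred(struct node_model *n)
{
    if (!n->locked && n->wake_pending) {
        n->wake_pending = false;
        do_wakeup(n);
    }
}
```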

    Reported-by: Dave Jones
    Signed-off-by: Steven Rostedt
    Signed-off-by: Paul E. McKenney

    Steven Rostedt
     

09 Jun, 2013

5 commits

  • Pull timer fixes from Thomas Gleixner:

    - Trivial: unused variable removal

    - Posix-timers: Add the clock ID to the new proc interface to make it
    useful. The interface is new and should be functional when we reach
    the final 3.10 release.

    - Cure a false positive warning in the tick code introduced by the
    overhaul in 3.10

    - Fix for a persistent clock detection regression introduced in this
    cycle

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    timekeeping: Correct run-time detection of persistent_clock.
    ntp: Remove unused variable flags in __hardpps
    posix-timers: Show clock ID in proc file
    tick: Cure broadcast false positive pending bit warning

    Linus Torvalds
     
  • Pull irqdomain bug fixes from Grant Likely:
    "This branch contains a set of straight forward bug fixes to the
    irqdomain code and to a couple of drivers that make use of it."

    * tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux:
    irqchip: Return -EPERM for reserved IRQs
    irqdomain: document the simple domain first_irq
    kernel/irq/irqdomain.c: before use 'irq_data', need check it whether valid.
    irqdomain: export irq_domain_add_simple

    Linus Torvalds
     
  • The first_irq needs to be zero to get a linear domain and that
    comes with special semantics. We want to simplify this going
    forward but some documentation never hurts.

    Signed-off-by: Linus Walleij
    Signed-off-by: Grant Likely

    Linus Walleij
     
  • Since irq_data may be NULL (in which case we WARN_ON() and continue),
    'hwirq', which is derived from 'irq_data', has to be initialized after
    that check; otherwise it will cause an issue.

    Signed-off-by: Chen Gang
    Signed-off-by: Grant Likely

    Chen Gang
     
  • All other irq_domain_add_* functions are exported already, and apparently
    this one got left out by mistake, which causes build errors for ARM
    allmodconfig kernels:

    ERROR: "irq_domain_add_simple" [drivers/gpio/gpio-rcar.ko] undefined!
    ERROR: "irq_domain_add_simple" [drivers/gpio/gpio-em.ko] undefined!

    Signed-off-by: Arnd Bergmann
    Acked-by: Simon Horman
    Signed-off-by: Grant Likely

    Arnd Bergmann
     

08 Jun, 2013

1 commit

  • …l/git/rostedt/linux-trace

    Pull tracing fixes from Steven Rostedt:
    "This contains 4 fixes.

    The first two fix the case where, with full RCU debugging enabled,
    enabling function tracing causes a livelock of the system. This is
    due to the added debug checks in rcu_dereference_raw(), which is used
    by the function tracer. These checks are themselves traced by the
    function tracer, and they add enough overhead to slow the system down
    so much that finishing an interrupt can take longer than the time
    until the next interrupt triggers, causing a livelock from the timer
    interrupt.

    Talking this over with Paul McKenney, we came up with a fix that adds
    a new rcu_dereference_raw_notrace() that does not perform these added
    checks, and let the function tracer use that.

    The third commit fixes a failed compile when branch tracing is
    enabled, due to the conversion of the trace_test_buffer() selftest
    that the branch trace wasn't converted for.

    The fourth patch fixes a bug caught by the RCU lockdep code where a
    rcu_read_lock() is performed when RCU is disabled (either going to or
    from idle, or user space). This happened in the irqsoff tracer, as it
    calls task_uid(). The fix is to use current_uid() when possible, since
    it doesn't use RCU locking. Luckily, that is always the case when
    irqsoff calls this code."

    * tag 'trace-fixes-v3.10-rc3-v3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: Use current_uid() for critical time tracing
    tracing: Fix bad parameter passed in branch selftest
    ftrace: Use the rcu _notrace variants for rcu_dereference_raw() and friends
    rcu: Add _notrace variation of rcu_dereference_raw() and hlist_for_each_entry_rcu()

    Linus Torvalds
     

07 Jun, 2013

1 commit

  • The irqsoff tracer records the max time that interrupts are disabled.
    There are hooks in the assembly code that call back into the tracer when
    interrupts are disabled or enabled.

    When they are enabled, the tracer checks if the amount of time they
    were disabled is larger than the previous recorded max interrupts off
    time. If it is, it creates a snapshot of the currently running trace
    to store where the last largest interrupts off time was held and how
    it happened.

    During testing, this RCU lockdep dump appeared:

    [ 1257.829021] ===============================
    [ 1257.829021] [ INFO: suspicious RCU usage. ]
    [ 1257.829021] 3.10.0-rc1-test+ #171 Tainted: G W
    [ 1257.829021] -------------------------------
    [ 1257.829021] /home/rostedt/work/git/linux-trace.git/include/linux/rcupdate.h:780 rcu_read_lock() used illegally while idle!
    [ 1257.829021]
    [ 1257.829021] other info that might help us debug this:
    [ 1257.829021]
    [ 1257.829021]
    [ 1257.829021] RCU used illegally from idle CPU!
    [ 1257.829021] rcu_scheduler_active = 1, debug_locks = 0
    [ 1257.829021] RCU used illegally from extended quiescent state!
    [ 1257.829021] 2 locks held by trace-cmd/4831:
    [ 1257.829021] #0: (max_trace_lock){......}, at: [] stop_critical_timing+0x1a3/0x209
    [ 1257.829021] #1: (rcu_read_lock){.+.+..}, at: [] __update_max_tr+0x88/0x1ee
    [ 1257.829021]
    [ 1257.829021] stack backtrace:
    [ 1257.829021] CPU: 3 PID: 4831 Comm: trace-cmd Tainted: G W 3.10.0-rc1-test+ #171
    [ 1257.829021] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
    [ 1257.829021] 0000000000000001 ffff880065f49da8 ffffffff8153dd2b ffff880065f49dd8
    [ 1257.829021] ffffffff81092a00 ffff88006bd78680 ffff88007add7500 0000000000000003
    [ 1257.829021] ffff88006bd78680 ffff880065f49e18 ffffffff810daebf ffffffff810dae5a
    [ 1257.829021] Call Trace:
    [ 1257.829021] [] dump_stack+0x19/0x1b
    [ 1257.829021] [] lockdep_rcu_suspicious+0x109/0x112
    [ 1257.829021] [] __update_max_tr+0xed/0x1ee
    [ 1257.829021] [] ? __update_max_tr+0x88/0x1ee
    [ 1257.829021] [] ? user_enter+0xfd/0x107
    [ 1257.829021] [] update_max_tr_single+0x11d/0x12d
    [ 1257.829021] [] ? user_enter+0xfd/0x107
    [ 1257.829021] [] stop_critical_timing+0x141/0x209
    [ 1257.829021] [] ? trace_hardirqs_on+0xd/0xf
    [ 1257.829021] [] ? user_enter+0xfd/0x107
    [ 1257.829021] [] time_hardirqs_on+0x2a/0x2f
    [ 1257.829021] [] ? user_enter+0xfd/0x107
    [ 1257.829021] [] trace_hardirqs_on_caller+0x16/0x197
    [ 1257.829021] [] trace_hardirqs_on+0xd/0xf
    [ 1257.829021] [] user_enter+0xfd/0x107
    [ 1257.829021] [] do_notify_resume+0x92/0x97
    [ 1257.829021] [] int_signal+0x12/0x17

    What happened was that on entering user code, interrupts were enabled
    and a new max interrupts-off time was recorded. The trace buffer was saved
    along with various information about the task: comm, pid, uid, priority, etc.

    The uid is recorded with task_uid(tsk). But this is a macro that uses rcu_read_lock()
    to retrieve the data, and this was called where RCU is blind (user_enter).

    As only the preempt and irqs off tracers can have this happen, and they both
    only ever pass tsk == current, use current_uid() instead of task_uid() when
    tsk == current: current_uid() does not use RCU, as only current can change its uid.
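
    A user-space sketch of that selection logic follows. The types and the
    *_model helpers are hypothetical stand-ins for the kernel's task_uid()
    and current_uid(); a counter models how often an RCU read-side section
    would have been entered.

```c
#include <assert.h>
#include <stddef.h>

struct task_model {
    unsigned int uid;
};

/* Counts how many times an RCU read-side section would have been used. */
static int rcu_read_sections;

/* Models task_uid(): reading another task's uid needs RCU protection. */
static unsigned int task_uid_model(const struct task_model *tsk)
{
    rcu_read_sections++;
    return tsk->uid;
}

/* Models current_uid(): no RCU, since only current changes its own uid. */
static unsigned int current_uid_model(const struct task_model *current_task)
{
    return current_task->uid;
}

/* Pick the RCU-free path whenever the task being recorded is current. */
static unsigned int record_uid(const struct task_model *tsk,
                               const struct task_model *current_task)
{
    assert(tsk != NULL && current_task != NULL);
    if (tsk == current_task)
        return current_uid_model(current_task);
    return task_uid_model(tsk);
}
```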

    This fixes the RCU suspicious splat.

    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     

03 Jun, 2013

1 commit

  • Pull cgroup fixes from Tejun Heo:

    - Fix for yet another xattr bug which may lead to NULL deref.

    - A subtle bug in for_each_descendant_pre(). This bug requires quite
    specific conditions to trigger and isn't too likely to actually
    happen in the wild, but maybe that just makes it that much
    nastier.

    - A warning message added for silly cgroup re-mount (not -o remount,
    but unmount followed by mount) behavior.

    * 'for-3.10-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: warn about mismatching options of a new mount of an existing hierarchy
    cgroup: fix a subtle bug in descendant pre-order walk
    cgroup: initialize xattr before calling d_instantiate()

    Linus Torvalds
     

31 May, 2013

1 commit

  • Pull x86 fixes from Peter Anvin:

    - Three EFI-related fixes

    - Two early memory initialization fixes

    - build fix for older binutils

    - fix for an eager FPU performance regression -- currently we don't
    allow the use of the FPU at interrupt time *at all* in eager mode,
    which is clearly wrong.

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86: Allow FPU to be used at interrupt time even with eagerfpu
    x86, crc32-pclmul: Fix build with older binutils
    x86-64, init: Fix a possible wraparound bug in switchover in head_64.S
    x86, range: fix missing merge during add range
    x86, efi: initial the local variable of DataSize to zero
    efivar: fix oops in efivar_update_sysfs_entries() caused by memory reuse
    efivarfs: Never return ENOENT from firmware again

    Linus Torvalds
     

30 May, 2013

1 commit

  • The branch selftest calls trace_test_buffer(), but with the new code
    it expects the first parameter to be a pointer to a struct trace_buffer.
    All self tests were changed but the branch selftest was missed.

    This caused either a crash or failed test when the branch selftest was
    enabled.

    Link: http://lkml.kernel.org/r/20130529141333.GA24064@localhost

    Reported-by: Fengguang Wu
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     

29 May, 2013

6 commits

  • Thomas Gleixner
     
  • As rcu_dereference_raw() under RCU debug config options can add quite a
    bit of checks, and that tracing uses rcu_dereference_raw(), these checks
    happen with the function tracer. The function tracer also happens to trace
    these debug checks too. This added overhead can livelock the system.

    Have the function tracer use the new RCU _notrace equivalents that do
    not do the debug checks for RCU.

    Link: http://lkml.kernel.org/r/20130528184209.467603904@goodmis.org

    Acked-by: Paul E. McKenney
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • Since the new __DEVEL__sane_behavior mount option was introduced,
    if the root cgroup is alive without xattr support, mounting a
    new cgroup with xattr is rejected by design, which is
    just fine. However, if the root cgroup was not mounted with
    __DEVEL__sane_behavior, creating a new cgroup with the xattr option
    succeeds, although afterwards the EA function does not work
    as expected and setting attributes under either cgroup
    fails with ENOTSUPP, e.g.

    setfattr: /cgroup2/test: Operation not supported

    Instead of staying silent in this case, it's better to emit a
    warning-level log entry. That helps the user understand the
    reason behind the failure, and it is purely an improvement
    that does not break backward compatibility.

    With this fix, the above mount attempt keeps working as before, but
    the following line can be found in the system log:

    [ ...] cgroup: new mount options do not match the existing superblock

    tj: minor formatting / message updates.

    Signed-off-by: Jie Liu
    Reported-by: Alexey Kodanev
    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org

    Jeff Liu
     
  • Since commit 31ade30692dc9680bfc95700d794818fa3f754ac, timekeeping_init()
    checks for the presence of a persistent clock by attempting to read a non-zero
    time value. This is an issue on platforms where persistent_clock
    is implemented as a free-running counter (instead of an RTC) starting
    from zero on each boot and running during suspend. Examples are some ARM
    platforms (e.g. PandaBoard).

    An attempt to read such a clock during timekeeping_init() may return zero
    value and falsely declare persistent clock as missing. Additionally, in
    the above case suspend times may be accounted twice (once from
    timekeeping_resume() and once from rtc_resume()), resulting in a gradual
    drift of system time.

    This patch does a run-time correction of the issue by doing the same check
    during timekeeping_suspend().

    A better long-term solution would be to return an error when trying to read
    a non-existent clock and zero when trying to read an uninitialized clock, but
    that would require changing all persistent_clock implementations.

    This patch addresses the immediate breakage, for now.
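
    The run-time correction can be modelled in user space as below. This is
    a sketch under assumptions, not the patch itself: struct timespec_model
    and the *_model functions are hypothetical, and only the "repeat the
    non-zero check at suspend" idea is taken from the description above.

```c
#include <assert.h>
#include <stdbool.h>

struct timespec_model {
    long long sec;
    long nsec;
};

/* Models the flag recording whether a persistent clock was detected. */
static bool persistent_clock_exist;

static bool is_nonzero(struct timespec_model ts)
{
    /* The model expects a normalized timestamp. */
    assert(ts.nsec >= 0 && ts.nsec < 1000000000L);
    return ts.sec != 0 || ts.nsec != 0;
}

/* Boot-time check: a counter that starts at zero looks like "no clock". */
static void timekeeping_init_model(struct timespec_model now)
{
    persistent_clock_exist = is_nonzero(now);
}

/* Suspend-time check: by now a free-running counter has advanced. */
static void timekeeping_suspend_model(struct timespec_model now)
{
    if (is_nonzero(now))
        persistent_clock_exist = true;
}
```

    A platform with no persistent clock at all keeps reading zero, so the
    flag stays false in that case.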

    Cc: John Stultz
    Cc: Thomas Gleixner
    Cc: Feng Tang
    Cc: stable@vger.kernel.org
    Signed-off-by: Zoran Markovic
    [jstultz: Tweaked commit message and subject]
    Signed-off-by: John Stultz

    Zoran Markovic
     
  • kernel/time/ntp.c: In function ‘__hardpps’:
    kernel/time/ntp.c:877: warning: unused variable ‘flags’

    commit a076b2146fabb0894cae5e0189a8ba3f1502d737 ("ntp: Remove ntp_lock,
    using the timekeeping locks to protect ntp state") removed its users,
    but not the actual variable.

    Signed-off-by: Geert Uytterhoeven
    Signed-off-by: John Stultz

    Geert Uytterhoeven
     
  • …it/rostedt/linux-trace

    Pull tracing fixes from Steven Rostedt:
    "Two more fixes:

    The first one was reported by Mauro Carvalho Chehab, where if a poll()
    is done against a trace buffer for a CPU that has never been online,
    it will crash the kernel, as buffers are only created when a CPU comes
    on line, but the trace files are for all possible CPUs.

    This fix is to check if the buffer was allocated and if not return
    -EINVAL.

    That was the simple fix, the real fix is a bit more complex and not
    for a -rc release. We could have the files created when the CPUs come
    online. That would require some design changes.

    The second one was reported by Peter Zijlstra. If the kernel command
    line has ftrace=nop, it will lock up the system on boot up. This is
    because the new design for 3.10 has the nop tracer bootstrap the
    tracing subsystem. When ftrace=<trace> is defined, then when that tracer
    is registered, it starts the tracing, but uses the nop tracer to clear
    things out. What happened here was that ftrace=nop caused the
    registering of nop to start it and use nop before it was initialized.

    The only thing nop needs to have done to initialize it is to have the
    tracer point its current_tracer structure member to the nop tracer.
    Doing that before registering the nop tracer makes everything work."

    * tag 'trace-fixes-v3.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    ring-buffer: Do not poll non allocated cpu buffers
    tracing: Fix crash when ftrace=nop on the kernel command line

    Linus Torvalds
     

28 May, 2013

2 commits

  • The tracing infrastructure sets up for all possible CPUs, but since it uses
    ring buffer polling, it is possible to call the ring buffer
    polling code with a CPU that hasn't been allocated. This will cause
    a kernel oops when it accesses a ring buffer cpu buffer that is part
    of the possible cpus but hasn't been allocated yet, as the CPU has never
    been online.
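
    The guard can be sketched as follows, assuming a hypothetical user-space
    model in which per-CPU buffers are pointers that stay NULL until the CPU
    first comes online. The -EINVAL return value is the one named in the
    pull request for this fix; everything else is illustrative.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define NR_POSSIBLE_CPUS 4

struct cpu_buffer_model {
    int pending_events;
};

struct ring_buffer_model {
    /* One slot per possible CPU; NULL until that CPU has been online. */
    struct cpu_buffer_model *buffers[NR_POSSIBLE_CPUS];
};

/* Returns the number of pending events, or -EINVAL for an unusable CPU. */
static int poll_cpu_buffer(const struct ring_buffer_model *rb, int cpu)
{
    assert(rb != NULL);
    if (cpu < 0 || cpu >= NR_POSSIBLE_CPUS || rb->buffers[cpu] == NULL)
        return -EINVAL;
    return rb->buffers[cpu]->pending_events;
}
```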

    Reported-by: Mauro Carvalho Chehab
    Tested-by: Mauro Carvalho Chehab
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     
  • commit 26517f3e (tick: Avoid programming the local cpu timer if
    broadcast pending) added a warning if the cpu enters broadcast mode
    again while the pending bit is still set. Meelis reported that the
    warning triggers. There are two corner cases which have not been
    considered:

    1) cpuidle calls clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER)
    twice. That can result in the following scenario

    CPU0                        CPU1
                                cpuidle_idle_call()
                                  clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER)
                                    set cpu in tick_broadcast_oneshot_mask

    broadcast interrupt
      event expired for cpu1
      set pending bit

                                acpi_idle_enter_simple()
                                  clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER)
                                    WARN_ON(pending bit)

    Move the WARN_ON into the section where we enter broadcast mode so
    it won't provide false positives on the second call.

    2) safe_halt() enables interrupts, so a broadcast interrupt can be
    delivered before the broadcast mode is disabled. That sets the
    pending bit for the CPU which receives the broadcast
    interrupt. However, the interrupt is delivered right away from the
    broadcast handler, which leaves the pending bit stale.

    Clear the pending bit for the current cpu in the broadcast handler.
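
    Both corner cases can be modelled with two bitmasks. This is a
    simplified user-space sketch, not the tick broadcast code: the mask
    variables and function names are hypothetical, and a counter stands in
    for WARN_ON().

```c
#include <assert.h>

#define NR_CPUS_MODEL 8

static unsigned long oneshot_mask;  /* CPUs currently in broadcast mode */
static unsigned long pending_mask;  /* CPUs with an expired broadcast event */
static int warnings;                /* stands in for WARN_ON() hits */

/* Case 1: warn only on a real transition into broadcast mode. */
static void broadcast_enter(int cpu)
{
    unsigned long bit;

    assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
    bit = 1UL << cpu;
    if (!(oneshot_mask & bit)) {
        if (pending_mask & bit)
            warnings++;
        oneshot_mask |= bit;
    }
}

/* Case 2: the handling CPU gets its event directly, so drop its bit. */
static void broadcast_interrupt(int handling_cpu, unsigned long expired)
{
    assert(handling_cpu >= 0 && handling_cpu < NR_CPUS_MODEL);
    pending_mask |= expired & oneshot_mask;
    pending_mask &= ~(1UL << handling_cpu);
}
```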

    Reported-and-tested-by: Meelis Roos
    Cc: Len Brown
    Cc: Frederic Weisbecker
    Cc: Borislav Petkov
    Cc: Rafael J. Wysocki
    Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1305271841130.4220@ionos
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

25 May, 2013

2 commits

  • Fix kernel-doc warnings in kernel/auditfilter.c:

    Warning(kernel/auditfilter.c:1029): Excess function parameter 'loginuid' description in 'audit_receive_filter'
    Warning(kernel/auditfilter.c:1029): Excess function parameter 'sessionid' description in 'audit_receive_filter'
    Warning(kernel/auditfilter.c:1029): Excess function parameter 'sid' description in 'audit_receive_filter'

    Signed-off-by: Randy Dunlap
    Cc: Eric Paris
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • …it/rostedt/linux-trace

    Pull tracing fix from Steven Rostedt:
    "Masami Hiramatsu fixed another bug. This time returning a proper
    result in event_enable_func(). After checking the return status of
    try_module_get(), it returned the status of try_module_get().

    But try_module_get() returns 0 on failure, which is success for
    event_enable_func()"

    * tag 'trace-fixes-v3.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: Return -EBUSY when event_enable_func() fails to get module
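
    The mismatch between the two return conventions can be sketched as
    follows. struct module_model and both helpers are hypothetical; only
    the conventions (try_module_get() returns 0 on failure, the caller
    treats 0 as success) and the -EBUSY result come from the text above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct module_model {
    bool live;      /* false once the module is going away */
    int refcount;
};

/* Same convention as try_module_get(): true on success, false on failure. */
static bool try_get_model(struct module_model *mod)
{
    if (!mod->live)
        return false;
    mod->refcount++;
    return true;
}

/* Caller convention: 0 means success, a negative errno means failure. */
static int enable_event_model(struct module_model *mod)
{
    assert(mod != NULL);
    if (!try_get_model(mod))
        return -EBUSY;  /* passing the 0 through would read as success */
    return 0;
}
```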

    Linus Torvalds
     

24 May, 2013

1 commit

  • When cgroup_next_descendant_pre() initiates a walk, it checks whether
    the subtree root has any children and, if not, returns NULL.
    Later code assumes that the subtree isn't empty. This is broken
    because the subtree may become empty in between, which can lead to the
    traversal escaping the subtree by walking to the sibling of the
    subtree root.

    There's no reason to have the early exit path. Remove it along with
    the later assumption that the subtree isn't empty. This simplifies
    the code a bit and fixes the subtle bug.
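
    A pre-order walk without the early exit can be sketched as below. This
    is a user-space illustration with a hypothetical struct node, not the
    cgroup code: the first call pretends the root was just visited, and the
    upward walk stops at the root, so an empty subtree simply yields NULL
    without ever looking at the root's sibling.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    struct node *parent;
    struct node *first_child;
    struct node *next_sibling;
};

/*
 * Return the descendant of 'root' that follows 'pos' in pre-order, or NULL
 * when the walk is complete. Pass pos == NULL to start a walk.
 */
static struct node *next_descendant_pre(struct node *pos, struct node *root)
{
    assert(root != NULL);

    /* First iteration: behave as if the root itself was just visited. */
    if (pos == NULL)
        pos = root;

    /* Descend first. */
    if (pos->first_child != NULL)
        return pos->first_child;

    /* Otherwise climb until a sibling exists, never leaving the subtree. */
    while (pos != root) {
        if (pos->next_sibling != NULL)
            return pos->next_sibling;
        pos = pos->parent;
    }
    return NULL;
}
```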

    While at it, fix the comment of cgroup_for_each_descendant_pre() which
    was incorrectly referring to ->css_offline() instead of
    ->css_online().

    Signed-off-by: Tejun Heo
    Reviewed-by: Michal Hocko
    Cc: stable@vger.kernel.org

    Tejun Heo
     

23 May, 2013

1 commit

  • If ftrace= is on the kernel command line, when that tracer is
    registered, it will be initiated by tracing_set_tracer() to execute that
    tracer.

    The nop tracer is just a stub tracer that is used to have no tracer
    enabled. It is assigned at early bootup as it is the default tracer.

    But if ftrace=nop is on the kernel command line, the registering of the
    nop tracer will call tracing_set_tracer() which will try to execute
    the nop tracer. But it expects tr->current_trace to be assigned something
    as it usually is assigned to the nop tracer. As it hasn't been assigned
    to anything yet, it causes the system to crash.

    The simple fix is to move the tr->current_trace = nop before registering
    the nop tracer. The functionality is still the same as the nop tracer
    doesn't do anything anyway.

    Reported-by: Peter Zijlstra
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     

19 May, 2013

1 commit


18 May, 2013

1 commit

  • Christian found that v3.9 does not work on an E350 when EFI is enabled.

    [ 1.658832] Trying to unpack rootfs image as initramfs...
    [ 1.679935] BUG: unable to handle kernel paging request at ffff88006e3fd000
    [ 1.686940] IP: [] memset+0x1f/0xb0
    [ 1.692010] PGD 1f77067 PUD 1f7a067 PMD 61420067 PTE 0

    but early memtest reports that all memory can be accessed without problems.

    The early page table is set up in the following sequence:
    [ 0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
    [ 0.000000] init_memory_mapping: [mem 0x6e600000-0x6e7fffff]
    [ 0.000000] init_memory_mapping: [mem 0x6c000000-0x6e5fffff]
    [ 0.000000] init_memory_mapping: [mem 0x00100000-0x6bffffff]
    [ 0.000000] init_memory_mapping: [mem 0x6e800000-0x6ea07fff]
    but later efi_enter_virtual_mode wrongly tries to set up the mapping again:
    [ 0.010644] pid_max: default: 32768 minimum: 301
    [ 0.015302] init_memory_mapping: [mem 0x640c5000-0x6e3fcfff]
    That means the pfn_range_is_mapped check fails.

    It turns out that we have a bug in add_range_with_merge: it does not
    merge ranges properly when the newly added one fills the hole between two
    existing ranges. Here, [mem 0x00100000-0x6bffffff] is the hole between
    [mem 0x00000000-0x000fffff] and [mem 0x6c000000-0x6e7fffff].

    Fix the add_range_with_merge by calling itself recursively.
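
    The recursive idea can be sketched in user space as follows. This is an
    illustration, not the kernel function: the signature is an assumption,
    and ranges are half-open [start, end) here, whereas the log lines above
    print inclusive end addresses. Whenever the new range touches an
    existing one, that entry is removed and the widened range is added
    again, so a range that fills a hole ends up merged with both neighbours.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct range {
    unsigned long long start;
    unsigned long long end;     /* exclusive */
};

/* Adds [start, end) to r[0..n) and returns the new number of entries. */
static size_t add_range_with_merge(struct range *r, size_t cap, size_t n,
                                   unsigned long long start,
                                   unsigned long long end)
{
    size_t i;

    assert(r != NULL && n <= cap);
    if (start >= end)
        return n;

    for (i = 0; i < n; i++) {
        /* Skip entries that neither overlap nor touch the new range. */
        if (r[i].end < start || end < r[i].start)
            continue;

        /* Widen the new range to cover the entry it touches. */
        if (r[i].start < start)
            start = r[i].start;
        if (r[i].end > end)
            end = r[i].end;

        /* Drop the absorbed entry and retry with the widened range. */
        memmove(&r[i], &r[i + 1], (n - i - 1) * sizeof(r[0]));
        return add_range_with_merge(r, cap, n - 1, start, end);
    }

    if (n >= cap)
        return n;   /* table full: the range is dropped in this sketch */
    r[n].start = start;
    r[n].end = end;
    return n + 1;
}
```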

    Reported-by: "Christian König"
    Signed-off-by: Yinghai Lu
    Link: http://lkml.kernel.org/r/CAE9FiQVofGoSk7q5-0irjkBxemqK729cND4hov-1QCBJDhxpgQ@mail.gmail.com
    Cc: v3.9
    Signed-off-by: H. Peter Anvin

    Yinghai Lu
     

17 May, 2013

2 commits