28 Aug, 2014

12 commits

  • commit e6d30ab1e7d1281784672c0fc2ffa385cfb7279e upstream.

    All the callers of irq_create_of_mapping() pass the contents of a struct
    of_phandle_args structure to the function. Since all the callers already
    have an of_phandle_args pointer, why not pass it directly to
    irq_create_of_mapping()?

    Signed-off-by: Grant Likely
    Acked-by: Michal Simek
    Acked-by: Tony Lindgren
    Cc: Thomas Gleixner
    Cc: Russell King
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Shawn Guo

    Conflicts:
    arch/arm/mach-integrator/pci_v3.c
    arch/mips/pci/pci-rt3883.c
    kernel/irq/irqdomain.c

    Grant Likely
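
    As an illustration of the calling-convention change described above, a
    hedged sketch (prototypes paraphrased from the irqdomain API; the caller
    shown is a generic example, not a specific file from this series):

        /* Before: callers unpacked the of_phandle_args they already had. */
        unsigned int irq_create_of_mapping(struct device_node *controller,
                                           const u32 *intspec,
                                           unsigned int intsize);
        virq = irq_create_of_mapping(oirq.np, oirq.args, oirq.args_count);

        /* After: callers pass the structure pointer directly. */
        unsigned int irq_create_of_mapping(struct of_phandle_args *irq_data);
        virq = irq_create_of_mapping(&oirq);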
     
  • Commits 11d4616bd07f ("futex: revert back to the explicit waiter
    counting code") and 69cd9eba3886 ("futex: avoid race between requeue and
    wake") changed some of the finer details of how we think about futexes.
    One was a late fix and the other a consequence of overlooking the whole
    requeuing logic.

    The first change caused our documentation to be incorrect, and the
    second made us aware that we need to explicitly add more details to it.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Jan Stancek reported:
    "pthread_cond_broadcast/4-1.c testcase from openposix testsuite (LTP)
    occasionally fails, because some threads fail to wake up.

    Testcase creates 5 threads, which are all waiting on same condition.
    Main thread then calls pthread_cond_broadcast() without holding mutex,
    which calls:

    futex(uaddr1, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, uaddr2, ..)

    This immediately wakes up single thread A, which unlocks mutex and
    tries to wake up another thread:

    futex(uaddr2, FUTEX_WAKE_PRIVATE, 1)

    If thread A manages to call futex_wake() before any waiters are
    requeued for uaddr2, no other thread is woken up"

    The ordering constraints for the hash bucket waiter counting are that
    the waiter counts have to be incremented _before_ getting the spinlock
    (because the spinlock acts as part of the memory barrier), but the
    "requeue" operation didn't honor those rules, and nobody had even
    thought about that case.

    This fairly simple patch just increments the waiter count for the target
    hash bucket (hb2) when requeueing a futex before taking the locks. It
    then decrements them again after releasing the lock - the code that
    actually moves the futex(es) between hash buckets will do the additional
    required waiter count housekeeping.

    Reported-and-tested-by: Jan Stancek
    Acked-by: Davidlohr Bueso
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: stable@vger.kernel.org # 3.14
    Signed-off-by: Linus Torvalds

    Linus Torvalds
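
    A minimal userspace sketch of the scenario Jan describes (five waiters on
    one condition variable, broadcast without the mutex held); this is an
    illustrative reproducer, not the LTP testcase itself (link with -lpthread):

        #include <pthread.h>
        #include <stdio.h>
        #include <unistd.h>

        #define NTHREADS 5

        static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

        static void *waiter(void *arg)
        {
                pthread_mutex_lock(&lock);
                pthread_cond_wait(&cond, &lock);  /* FUTEX_WAIT inside glibc */
                pthread_mutex_unlock(&lock);
                return NULL;
        }

        int main(void)
        {
                pthread_t t[NTHREADS];
                int i;

                for (i = 0; i < NTHREADS; i++)
                        pthread_create(&t[i], NULL, waiter, NULL);
                sleep(1);               /* let all five block */

                /*
                 * Broadcast without holding the mutex: glibc issues
                 * FUTEX_CMP_REQUEUE, waking one waiter and requeueing the
                 * rest onto the mutex futex - the window the fix closes.
                 */
                pthread_cond_broadcast(&cond);

                for (i = 0; i < NTHREADS; i++)
                        pthread_join(t[i], NULL);
                puts("all threads woke up");
                return 0;
        }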
     
  • Srikar Dronamraju reports that commit b0c29f79ecea ("futexes: Avoid
    taking the hb->lock if there's nothing to wake up") causes java threads
    to get stuck on futexes when running specjbb on a power7 numa box.

    The cause appears to be that the powerpc spinlocks aren't using the same
    ticket lock model that we use on x86 (and other) architectures, which in
    turn results in the "spin_is_locked()" test in hb_waiters_pending()
    occasionally reporting an unlocked spinlock even when there are pending
    waiters.

    So this reinstates Davidlohr Bueso's original explicit waiter counting
    code, which I had convinced Davidlohr to drop in favor of figuring out
    the pending waiters by just using the existing state of the spinlock and
    the wait queue.

    Reported-and-tested-by: Srikar Dronamraju
    Original-code-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Linus Torvalds
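
    A simplified sketch of the explicit waiter counting that this change
    reinstates (helper names follow kernel/futex.c; the CONFIG_SMP guards and
    the exact barrier primitive vary by kernel version):

        static inline void hb_waiters_inc(struct futex_hash_bucket *hb)
        {
                atomic_inc(&hb->waiters);      /* hb->waiters is an atomic_t */
                smp_mb__after_atomic();        /* pairs with the wake path */
        }

        static inline void hb_waiters_dec(struct futex_hash_bucket *hb)
        {
                atomic_dec(&hb->waiters);
        }

        static inline int hb_waiters_pending(struct futex_hash_bucket *hb)
        {
                /* No spin_is_locked()/ticket-lock assumptions: read the count. */
                return atomic_read(&hb->waiters);
        }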
     
  • "futexes: Increase hash table size for better performance"
    introduces a new alloc_large_system_hash() call.

    alloc_large_system_hash() however may allocate less memory than
    requested, e.g. limited by MAX_ORDER.

    Hence pass a pointer to alloc_large_system_hash() which will
    contain the hash shift when the function returns. Afterwards
    correctly set futex_hashsize.

    Fixes a crash on s390 where the requested allocation size was
    4MB but only 1MB was allocated.

    Signed-off-by: Heiko Carstens
    Cc: Darren Hart
    Cc: Peter Zijlstra
    Cc: Paul E. McKenney
    Cc: Waiman Long
    Cc: Jason Low
    Cc: Davidlohr Bueso
    Link: http://lkml.kernel.org/r/20140116135450.GA4345@osiris
    Signed-off-by: Ingo Molnar

    Heiko Carstens
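
    A sketch of the pattern described above, assuming a futex_init()-style
    caller (argument order per the alloc_large_system_hash() prototype; flags
    and limits are simplified, and futex_queues is the global bucket array):

        static int __init futex_init(void)
        {
                unsigned int futex_shift;

                /* futex_hashsize was computed earlier (256 * ncpus, power of 2) */
                futex_queues = alloc_large_system_hash("futex",
                                                       sizeof(*futex_queues),
                                                       futex_hashsize, 0, 0,
                                                       &futex_shift, NULL,
                                                       futex_hashsize,
                                                       futex_hashsize);
                /*
                 * The allocator may have clamped the table (e.g. by MAX_ORDER),
                 * so recompute the real size from the returned shift.
                 */
                futex_hashsize = 1UL << futex_shift;

                /* per-bucket plist/spinlock init omitted in this sketch */
                return 0;
        }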
     
  • In futex_wake() there is clearly no point in taking the hb->lock
    if we know beforehand that there are no tasks to be woken. While
    the hash bucket's plist head is a cheap way of knowing this, we
    cannot rely 100% on it as there is a racy window between the
    futex_wait call and when the task is actually added to the
    plist. To this end, we couple it with the spinlock check as
    tasks trying to enter the critical region are most likely
    potential waiters that will be added to the plist, thus
    preventing tasks sleeping forever if wakers don't acknowledge
    all possible waiters.

    Furthermore, the futex ordering guarantees are preserved,
    ensuring that waiters either observe the changed user space
    value before blocking or are woken by a concurrent waker. For
    wakers, this is done by relying on the barriers in
    get_futex_key_refs() -- for archs that do not have implicit mb
    in atomic_inc(), we explicitly add them through a new
    futex_get_mm function. For waiters we rely on the fact that
    spin_lock calls already update the head counter, so spinners
    are visible even if the lock hasn't been acquired yet.

    For more details please refer to the updated comments in the
    code and related discussion:

    https://lkml.org/lkml/2013/11/26/556

    Special thanks to tglx for careful review and feedback.

    Suggested-by: Linus Torvalds
    Reviewed-by: Darren Hart
    Reviewed-by: Thomas Gleixner
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Davidlohr Bueso
    Cc: Paul E. McKenney
    Cc: Mike Galbraith
    Cc: Jeff Mahoney
    Cc: Scott Norton
    Cc: Tom Vaden
    Cc: Aswin Chandramouleeswaran
    Cc: Waiman Long
    Cc: Jason Low
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/1389569486-25487-5-git-send-email-davidlohr@hp.com
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
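
    A sketch of the resulting fast path in futex_wake(), abbreviated from the
    description above (error handling and the wake loop are omitted):

        hb = hash_futex(&key);

        /* Make sure we really have tasks to wake up before taking the lock. */
        if (!hb_waiters_pending(hb))
                goto out_put_key;

        spin_lock(&hb->lock);
        /* ... walk hb->chain and wake up to nr_wake matching waiters ... */
        spin_unlock(&hb->lock);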
     
  • That's essential, if you want to hack on futexes.

    Reviewed-by: Darren Hart
    Reviewed-by: Peter Zijlstra
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Davidlohr Bueso
    Cc: Mike Galbraith
    Cc: Jeff Mahoney
    Cc: Linus Torvalds
    Cc: Randy Dunlap
    Cc: Scott Norton
    Cc: Tom Vaden
    Cc: Aswin Chandramouleeswaran
    Cc: Waiman Long
    Cc: Jason Low
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/1389569486-25487-4-git-send-email-davidlohr@hp.com
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • Currently, the futex global hash table suffers from its fixed,
    smallish (by today's standards) size of 256 entries, as well as
    its lack of NUMA awareness. Large systems, using many futexes,
    can be prone to high amounts of collisions; where these futexes
    hash to the same bucket and lead to extra contention on the same
    hb->lock. Furthermore, cacheline bouncing is a reality when we
    have multiple hb->locks residing on the same cacheline and
    different futexes hash to adjacent buckets.

    This patch keeps the current static size of 16 entries for small
    systems, or otherwise, 256 * ncpus (or larger as we need to
    round the number to a power of 2). Note that this number of CPUs
    accounts for all CPUs that can ever be available in the system,
    taking into consideration things like hotplugging. While we do
    impose extra overhead at bootup by making the hash table larger,
    this is a one-time cost, and does not overshadow the benefits of
    this patch.

    Furthermore, as suggested by tglx, by cache aligning the hash
    buckets we can avoid access across cacheline boundaries and also
    avoid massive cache line bouncing if multiple cpus are hammering
    away at different hash buckets which happen to reside in the
    same cache line.

    Also, similar to other core kernel components (pid, dcache,
    tcp), by using alloc_large_system_hash() we benefit from its
    NUMA awareness and thus the table is distributed among the nodes
    instead of in a single one.

    For a custom microbenchmark that pounds on the uaddr hashing --
    making the wait path fail at futex_wait_setup() returning
    -EWOULDBLOCK for large amounts of futexes, we can see the
    following benefits on an 80-core, 8-socket 1 TB server:

    +---------+--------------------+------------------------+-----------------------+-------------------------------+
    | threads | baseline (ops/sec) | aligned-only (ops/sec) | large table (ops/sec) | large table+aligned (ops/sec) |
    +---------+--------------------+------------------------+-----------------------+-------------------------------+
    |     512 |              32426 | 50531  (+55.8%)        | 255274  (+687.2%)     | 292553  (+802.2%)             |
    |     256 |              65360 | 99588  (+52.3%)        | 443563  (+578.6%)     | 508088  (+677.3%)             |
    |     128 |             125635 | 200075 (+59.2%)        | 742613  (+491.1%)     | 835452  (+564.9%)             |
    |      80 |             193559 | 323425 (+67.1%)        | 1028147 (+431.1%)     | 1130304 (+483.9%)             |
    |      64 |             247667 | 443740 (+79.1%)        | 997300  (+302.6%)     | 1145494 (+362.5%)             |
    |      32 |             628412 | 721401 (+14.7%)        | 965996  (+53.7%)      | 1122115 (+78.5%)              |
    +---------+--------------------+------------------------+-----------------------+-------------------------------+

    Reviewed-by: Darren Hart
    Reviewed-by: Peter Zijlstra
    Reviewed-by: Paul E. McKenney
    Reviewed-by: Waiman Long
    Reviewed-and-tested-by: Jason Low
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Davidlohr Bueso
    Cc: Mike Galbraith
    Cc: Jeff Mahoney
    Cc: Linus Torvalds
    Cc: Scott Norton
    Cc: Tom Vaden
    Cc: Aswin Chandramouleeswaran
    Link: http://lkml.kernel.org/r/1389569486-25487-3-git-send-email-davidlohr@hp.com
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
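
    A sketch of the cache-aligned buckets and boot-time sizing described above
    (simplified from kernel/futex.c of that era; the waiter-count field added
    later is omitted, as is the alloc_large_system_hash() call itself):

        static struct futex_hash_bucket {
                spinlock_t lock;
                struct plist_head chain;
        } ____cacheline_aligned_in_smp *futex_queues;

        static unsigned long futex_hashsize;

        static int __init futex_init(void)
        {
        #if CONFIG_BASE_SMALL
                futex_hashsize = 16;
        #else
                futex_hashsize = roundup_pow_of_two(256 * num_possible_cpus());
        #endif
                /* the table itself comes from alloc_large_system_hash(),
                 * which spreads it across NUMA nodes */
                return 0;
        }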
     
  • Avoid waking up every thread sleeping in a futex_wait call during
    suspend and resume by calling a freezable blocking call. Previous
    patches modified the freezer to avoid sending wakeups to threads
    that are blocked in freezable blocking calls.

    This call was selected to be converted to a freezable call because
    it doesn't hold any locks or release any resources when interrupted
    that might be needed by another freezing task or a kernel driver
    during suspend, and is a common site where idle userspace tasks are
    blocked.

    Signed-off-by: Colin Cross
    Cc: Rafael J. Wysocki
    Cc: arve@android.com
    Cc: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Darren Hart
    Cc: Randy Dunlap
    Cc: Al Viro
    Link: http://lkml.kernel.org/r/1367458508-9133-8-git-send-email-ccross@android.com
    Signed-off-by: Thomas Gleixner

    Colin Cross
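
    The conversion itself is essentially a one-line change in
    futex_wait_queue_me(); a sketch of the relevant hunk (context abbreviated):

        /* queue_me() and hrtimer setup done above ... */
        if (likely(!plist_node_empty(&q->list))) {
                if (!timeout || timeout->task)
                        freezable_schedule();   /* was: schedule() */
        }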
     
  • - Remove unnecessary head variables.
    - Delete unused parameter in queue_unlock().

    Reviewed-by: Darren Hart
    Reviewed-by: Peter Zijlstra
    Reviewed-by: Paul E. McKenney
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Jason Low
    Signed-off-by: Davidlohr Bueso
    Cc: Mike Galbraith
    Cc: Jeff Mahoney
    Cc: Linus Torvalds
    Cc: Scott Norton
    Cc: Tom Vaden
    Cc: Aswin Chandramouleeswaran
    Cc: Waiman Long
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/1389569486-25487-2-git-send-email-davidlohr@hp.com
    Signed-off-by: Ingo Molnar

    Jason Low
     
  • When debugging the read-only hugepage case, I was confused by the fact
    that get_futex_key() did an access_ok() only for the non-shared futex
    case, since the user address checking really isn't in any way specific
    to the private key handling.

    Now, it turns out that the shared key handling does effectively do the
    equivalent checks inside get_user_pages_fast() (it doesn't actually
    check the address range on x86, but does check the page protections for
    being a user page). So it wasn't actually a bug, but the fact that we
    treat the address differently for private and shared futexes threw me
    for a loop.

    Just move the check up, so that it gets done for both cases. Also, use
    the 'rw' parameter for the type, even if it doesn't actually matter any
    more (it's a historical artifact of the old racy i386 "page faults from
    kernel space don't check write protections").

    Cc: Thomas Gleixner
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
    Some modules may need to know the CPU idle status and take actions
    before and after a CPU enters or exits idle. Add a notification
    callback chain for CPU idle entry/exit: modules only need to register
    a callback on this chain, and every time a CPU enters or exits idle
    the callback chain is executed.

    Currently only the cpufreq interactive governor uses this
    notification; to save power, its timers are enabled only while the
    CPU is not idle.

    Signed-off-by: Anson Huang

    Anson Huang
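
    The entry above does not show the interface; purely as a hypothetical
    sketch of how such a chain might be consumed (idle_notifier_register(),
    IDLE_START and IDLE_END are assumed names based on the description, not
    verified against this tree):

        static int gov_idle_notifier(struct notifier_block *nb,
                                     unsigned long action, void *data)
        {
                switch (action) {
                case IDLE_START:        /* assumed: CPU is entering idle */
                        /* e.g. stop the governor's polling timer */
                        break;
                case IDLE_END:          /* assumed: CPU has left idle */
                        /* e.g. re-arm the polling timer */
                        break;
                }
                return NOTIFY_OK;
        }

        static struct notifier_block gov_idle_nb = {
                .notifier_call = gov_idle_notifier,
        };

        /* in the governor's init path: */
        idle_notifier_register(&gov_idle_nb);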
     

08 Aug, 2014

2 commits

  • commit 504d58745c9ca28d33572e2d8a9990b43e06075d upstream.

    clockevents_increase_min_delta() calls printk() from under
    hrtimer_bases.lock. That causes lock inversion on scheduler locks because
    printk() can call into the scheduler. Lockdep puts it as:

    ======================================================
    [ INFO: possible circular locking dependency detected ]
    3.15.0-rc8-06195-g939f04b #2 Not tainted
    -------------------------------------------------------
    trinity-main/74 is trying to acquire lock:
    (&port_lock_key){-.....}, at: [] serial8250_console_write+0x8c/0x10c

    but task is already holding lock:
    (hrtimer_bases.lock){-.-...}, at: [] hrtimer_try_to_cancel+0x13/0x66

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #5 (hrtimer_bases.lock){-.-...}:
    [] lock_acquire+0x92/0x101
    [] _raw_spin_lock_irqsave+0x2e/0x3e
    [] __hrtimer_start_range_ns+0x1c/0x197
    [] perf_swevent_start_hrtimer.part.41+0x7a/0x85
    [] task_clock_event_start+0x3a/0x3f
    [] task_clock_event_add+0xd/0x14
    [] event_sched_in+0xb6/0x17a
    [] group_sched_in+0x44/0x122
    [] ctx_sched_in.isra.67+0x105/0x11f
    [] perf_event_sched_in.isra.70+0x47/0x4b
    [] __perf_install_in_context+0x8b/0xa3
    [] remote_function+0x12/0x2a
    [] smp_call_function_single+0x2d/0x53
    [] task_function_call+0x30/0x36
    [] perf_install_in_context+0x87/0xbb
    [] SYSC_perf_event_open+0x5c6/0x701
    [] SyS_perf_event_open+0x17/0x19
    [] syscall_call+0x7/0xb

    -> #4 (&ctx->lock){......}:
    [] lock_acquire+0x92/0x101
    [] _raw_spin_lock+0x21/0x30
    [] __perf_event_task_sched_out+0x1dc/0x34f
    [] __schedule+0x4c6/0x4cb
    [] schedule+0xf/0x11
    [] work_resched+0x5/0x30

    -> #3 (&rq->lock){-.-.-.}:
    [] lock_acquire+0x92/0x101
    [] _raw_spin_lock+0x21/0x30
    [] __task_rq_lock+0x33/0x3a
    [] wake_up_new_task+0x25/0xc2
    [] do_fork+0x15c/0x2a0
    [] kernel_thread+0x1a/0x1f
    [] rest_init+0x1a/0x10e
    [] start_kernel+0x303/0x308
    [] i386_start_kernel+0x79/0x7d

    -> #2 (&p->pi_lock){-.-...}:
    [] lock_acquire+0x92/0x101
    [] _raw_spin_lock_irqsave+0x2e/0x3e
    [] try_to_wake_up+0x1d/0xd6
    [] default_wake_function+0xb/0xd
    [] __wake_up_common+0x39/0x59
    [] __wake_up+0x29/0x3b
    [] tty_wakeup+0x49/0x51
    [] uart_write_wakeup+0x17/0x19
    [] serial8250_tx_chars+0xbc/0xfb
    [] serial8250_handle_irq+0x54/0x6a
    [] serial8250_default_handle_irq+0x19/0x1c
    [] serial8250_interrupt+0x38/0x9e
    [] handle_irq_event_percpu+0x5f/0x1e2
    [] handle_irq_event+0x2c/0x43
    [] handle_level_irq+0x57/0x80
    [] handle_irq+0x46/0x5c
    [] do_IRQ+0x32/0x89
    [] common_interrupt+0x2e/0x33
    [] _raw_spin_unlock_irqrestore+0x3f/0x49
    [] uart_start+0x2d/0x32
    [] uart_write+0xc7/0xd6
    [] n_tty_write+0xb8/0x35e
    [] tty_write+0x163/0x1e4
    [] redirected_tty_write+0x6d/0x75
    [] vfs_write+0x75/0xb0
    [] SyS_write+0x44/0x77
    [] syscall_call+0x7/0xb

    -> #1 (&tty->write_wait){-.....}:
    [] lock_acquire+0x92/0x101
    [] _raw_spin_lock_irqsave+0x2e/0x3e
    [] __wake_up+0x15/0x3b
    [] tty_wakeup+0x49/0x51
    [] uart_write_wakeup+0x17/0x19
    [] serial8250_tx_chars+0xbc/0xfb
    [] serial8250_handle_irq+0x54/0x6a
    [] serial8250_default_handle_irq+0x19/0x1c
    [] serial8250_interrupt+0x38/0x9e
    [] handle_irq_event_percpu+0x5f/0x1e2
    [] handle_irq_event+0x2c/0x43
    [] handle_level_irq+0x57/0x80
    [] handle_irq+0x46/0x5c
    [] do_IRQ+0x32/0x89
    [] common_interrupt+0x2e/0x33
    [] _raw_spin_unlock_irqrestore+0x3f/0x49
    [] uart_start+0x2d/0x32
    [] uart_write+0xc7/0xd6
    [] n_tty_write+0xb8/0x35e
    [] tty_write+0x163/0x1e4
    [] redirected_tty_write+0x6d/0x75
    [] vfs_write+0x75/0xb0
    [] SyS_write+0x44/0x77
    [] syscall_call+0x7/0xb

    -> #0 (&port_lock_key){-.....}:
    [] __lock_acquire+0x9ea/0xc6d
    [] lock_acquire+0x92/0x101
    [] _raw_spin_lock_irqsave+0x2e/0x3e
    [] serial8250_console_write+0x8c/0x10c
    [] call_console_drivers.constprop.31+0x87/0x118
    [] console_unlock+0x1d7/0x398
    [] vprintk_emit+0x3da/0x3e4
    [] printk+0x17/0x19
    [] clockevents_program_min_delta+0x104/0x116
    [] clockevents_program_event+0xe7/0xf3
    [] tick_program_event+0x1e/0x23
    [] hrtimer_force_reprogram+0x88/0x8f
    [] __remove_hrtimer+0x5b/0x79
    [] hrtimer_try_to_cancel+0x49/0x66
    [] hrtimer_cancel+0xd/0x18
    [] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
    [] task_clock_event_stop+0x20/0x64
    [] task_clock_event_del+0xd/0xf
    [] event_sched_out+0xab/0x11e
    [] group_sched_out+0x1d/0x66
    [] ctx_sched_out+0xaf/0xbf
    [] __perf_event_task_sched_out+0x1ed/0x34f
    [] __schedule+0x4c6/0x4cb
    [] schedule+0xf/0x11
    [] work_resched+0x5/0x30

    other info that might help us debug this:

    Chain exists of:
    &port_lock_key --> &ctx->lock --> hrtimer_bases.lock

    Possible unsafe locking scenario:

    CPU0                              CPU1
    ----                              ----
    lock(hrtimer_bases.lock);
                                      lock(&ctx->lock);
                                      lock(hrtimer_bases.lock);
    lock(&port_lock_key);

    *** DEADLOCK ***

    4 locks held by trinity-main/74:
    #0: (&rq->lock){-.-.-.}, at: [] __schedule+0xed/0x4cb
    #1: (&ctx->lock){......}, at: [] __perf_event_task_sched_out+0x1dc/0x34f
    #2: (hrtimer_bases.lock){-.-...}, at: [] hrtimer_try_to_cancel+0x13/0x66
    #3: (console_lock){+.+...}, at: [] vprintk_emit+0x3c7/0x3e4

    stack backtrace:
    CPU: 0 PID: 74 Comm: trinity-main Not tainted 3.15.0-rc8-06195-g939f04b #2
    00000000 81c3a310 8b995c14 81426f69 8b995c44 81425a99 8161f671 8161f570
    8161f538 8161f559 8161f538 8b995c78 8b142bb0 00000004 8b142fdc 8b142bb0
    8b995ca8 8104a62d 8b142fac 000016f2 81c3a310 00000001 00000001 00000003
    Call Trace:
    [] dump_stack+0x16/0x18
    [] print_circular_bug+0x18f/0x19c
    [] __lock_acquire+0x9ea/0xc6d
    [] lock_acquire+0x92/0x101
    [] ? serial8250_console_write+0x8c/0x10c
    [] ? wait_for_xmitr+0x76/0x76
    [] _raw_spin_lock_irqsave+0x2e/0x3e
    [] ? serial8250_console_write+0x8c/0x10c
    [] serial8250_console_write+0x8c/0x10c
    [] ? lock_release+0x191/0x223
    [] ? wait_for_xmitr+0x76/0x76
    [] call_console_drivers.constprop.31+0x87/0x118
    [] console_unlock+0x1d7/0x398
    [] vprintk_emit+0x3da/0x3e4
    [] printk+0x17/0x19
    [] clockevents_program_min_delta+0x104/0x116
    [] tick_program_event+0x1e/0x23
    [] hrtimer_force_reprogram+0x88/0x8f
    [] __remove_hrtimer+0x5b/0x79
    [] hrtimer_try_to_cancel+0x49/0x66
    [] hrtimer_cancel+0xd/0x18
    [] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
    [] task_clock_event_stop+0x20/0x64
    [] task_clock_event_del+0xd/0xf
    [] event_sched_out+0xab/0x11e
    [] group_sched_out+0x1d/0x66
    [] ctx_sched_out+0xaf/0xbf
    [] __perf_event_task_sched_out+0x1ed/0x34f
    [] ? __dequeue_entity+0x23/0x27
    [] ? pick_next_task_fair+0xb1/0x120
    [] __schedule+0x4c6/0x4cb
    [] ? trace_hardirqs_off_caller+0xd7/0x108
    [] ? trace_hardirqs_off+0xb/0xd
    [] ? rcu_irq_exit+0x64/0x77

    Fix the problem by using printk_deferred() which does not call into the
    scheduler.

    Reported-by: Fengguang Wu
    Signed-off-by: Jan Kara
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
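
    A sketch of the shape of the fix (the exact message text in
    clockevents_increase_min_delta() may differ; the point is the deferred
    printk variant):

        /* still running under hrtimer_bases.lock at this point */
        printk_deferred(KERN_WARNING
                        "CE: %s increased min_delta_ns to %llu nsec\n",
                        dev->name ? dev->name : "?",
                        (unsigned long long) dev->min_delta_ns);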
     
  • commit aac74dc495456412c4130a1167ce4beb6c1f0b38 upstream.

    After learning we'll need some sort of deferred printk functionality in
    the timekeeping core, Peter suggested we rename the printk_sched function
    so it can be reused by needed subsystems.

    This only changes the function name. No logic changes.

    Signed-off-by: John Stultz
    Reviewed-by: Steven Rostedt
    Cc: Jan Kara
    Cc: Peter Zijlstra
    Cc: Jiri Bohac
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    John Stultz
     

01 Aug, 2014

1 commit

  • commit 58d4e21e50ff3cc57910a8abc20d7e14375d2f61 upstream.

    The "uptime" trace clock added in:

    commit 8aacf017b065a805d27467843490c976835eb4a5
    tracing: Add "uptime" trace clock that uses jiffies

    has wraparound problems when the system has been up more
    than 1 hour 11 minutes and 34 seconds. It converts jiffies
    to nanoseconds using:
    (u64)jiffies_to_usecs(jiffy) * 1000ULL
    but since jiffies_to_usecs() only returns a 32-bit value, it
    truncates at 2^32 microseconds. An additional problem on 32-bit
    systems is that the argument is "unsigned long", so fixing the
    return value only helps until 2^32 jiffies (49.7 days on a HZ=1000
    system).

    Avoid these problems by using jiffies_64 as our basis, and
    not converting to nanoseconds (we do convert to clock_t because
    user facing API must not be dependent on internal kernel
    HZ values).

    Link: http://lkml.kernel.org/p/99d63c5bfe9b320a3b428d773825a37095bf6a51.1405708254.git.tony.luck@intel.com

    Fixes: 8aacf017b065 "tracing: Add "uptime" trace clock that uses jiffies"
    Signed-off-by: Tony Luck
    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    Tony Luck
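
    A sketch of the before/after; the "before" line is quoted in the entry
    above, while the "after" form (jiffies_64 plus jiffies_64_to_clock_t())
    is a hedged reading of the fix:

        /* before: truncates at 2^32 usecs, and at 2^32 jiffies on 32-bit */
        u64 notrace trace_clock_jiffies(void)
        {
                u64 jiffy = jiffies - INITIAL_JIFFIES;

                return (u64)jiffies_to_usecs(jiffy) * 1000ULL;
        }

        /* after: stay in 64-bit jiffies and report clock_t units */
        u64 notrace trace_clock_jiffies(void)
        {
                return jiffies_64_to_clock_t(jiffies_64 - INITIAL_JIFFIES);
        }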
     

28 Jul, 2014

7 commits

  • commit b0ab99e7736af88b8ac1b7ae50ea287fffa2badc upstream.

    proc_sched_show_task() does:

    if (nr_switches)
    do_div(avg_atom, nr_switches);

    nr_switches is unsigned long and do_div truncates it to 32 bits, which
    means it can test non-zero on e.g. x86-64 and be truncated to zero for
    division.

    Fix the problem by using div64_ul() instead.

    As a side effect calculations of avg_atom for big nr_switches are now correct.

    Signed-off-by: Mateusz Guzik
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1402750809-31991-1-git-send-email-mguzik@redhat.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Mateusz Guzik
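
    A sketch of the change in proc_sched_show_task(), per the description
    above:

        /* before: do_div() silently truncates nr_switches to 32 bits */
        if (nr_switches)
                do_div(avg_atom, nr_switches);

        /* after: proper 64-bit by unsigned long division */
        if (nr_switches)
                avg_atom = div64_ul(avg_atom, nr_switches);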
     
  • commit 4badad352a6bb202ec68afa7a574c0bb961e5ebc upstream.

    The optimistic spin code assumes regular stores and cmpxchg() play nice;
    this is found to not be true for at least: parisc, sparc32, tile32,
    metag-lock1, arc-!llsc and hexagon.

    There is further wreckage, but this in particular seemed easy to
    trigger, so blacklist this.

    Opt in for known good archs.

    Signed-off-by: Peter Zijlstra
    Reported-by: Mikulas Patocka
    Cc: David Miller
    Cc: Chris Metcalf
    Cc: James Bottomley
    Cc: Vineet Gupta
    Cc: Jason Low
    Cc: Waiman Long
    Cc: "James E.J. Bottomley"
    Cc: Paul McKenney
    Cc: John David Anglin
    Cc: James Hogan
    Cc: Linus Torvalds
    Cc: Davidlohr Bueso
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Russell King
    Cc: Will Deacon
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: sparclinux@vger.kernel.org
    Link: http://lkml.kernel.org/r/20140606175316.GV13930@laptop.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     
  • commit 4320f6b1d9db4ca912c5eb6ecb328b2e090e1586 upstream.

    The commit [247bc037: PM / Sleep: Mitigate race between the freezer
    and request_firmware()] introduced the finer state control, but it
    also leads to a new bug; for example, a bug report regarding the
    firmware loading of intel BT device at suspend/resume:
    https://bugzilla.novell.com/show_bug.cgi?id=873790

    The root cause seems to be a small window between the process resume
    and the clear of usermodehelper lock. The request_firmware() function
    checks the UMH lock and gives up when it's in UMH_DISABLE state. This
    is for avoiding the invalid f/w loading during suspend/resume phase.
    The problem is, however, that usermodehelper_enable() is called at the
    end of thaw_processes(). Thus, a thawed process in between can kick
    off the f/w loader code path (in this case, via btusb_setup_intel())
    even before the call of usermodehelper_enable(). Then
    usermodehelper_read_trylock() returns an error and request_firmware()
    spews WARN_ON() in the end.

    This oneliner patch fixes the issue just by setting to UMH_FREEZING
    state again before restarting tasks, so that the call of
    request_firmware() will be blocked until the end of this function
    instead of returning an error.

    Fixes: 247bc0374254 (PM / Sleep: Mitigate race between the freezer and request_firmware())
    Link: https://bugzilla.novell.com/show_bug.cgi?id=873790
    Signed-off-by: Takashi Iwai
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Greg Kroah-Hartman

    Takashi Iwai
     
  • commit 16927776ae757d0d132bdbfabbfe2c498342bd59 upstream.

    Sharvil noticed with the posix timer_settime interface, using the
    CLOCK_REALTIME_ALARM or CLOCK_BOOTTIME_ALARM clockid, if the users
    tried to specify a relative time timer, it would incorrectly be
    treated as absolute regardless of the state of the flags argument.

    This patch corrects this, properly checking the absolute/relative flag,
    as well as adds further error checking that no invalid flag bits are set.

    Reported-by: Sharvil Nanavati
    Signed-off-by: John Stultz
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Prarit Bhargava
    Cc: Sharvil Nanavati
    Link: http://lkml.kernel.org/r/1404767171-6902-1-git-send-email-john.stultz@linaro.org
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    John Stultz
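
    A small userspace example of the case being fixed: a relative
    timer_settime() on an alarm clockid (standard POSIX timer API, link with
    -lrt; creating an ALARM timer needs CAP_WAKE_ALARM):

        #include <signal.h>
        #include <stdio.h>
        #include <time.h>
        #include <unistd.h>

        #ifndef CLOCK_BOOTTIME_ALARM
        #define CLOCK_BOOTTIME_ALARM 9
        #endif

        int main(void)
        {
                timer_t timerid;
                struct sigevent sev = { .sigev_notify = SIGEV_SIGNAL,
                                        .sigev_signo  = SIGALRM };
                struct itimerspec its = { .it_value.tv_sec = 5 };

                if (timer_create(CLOCK_BOOTTIME_ALARM, &sev, &timerid)) {
                        perror("timer_create");
                        return 1;
                }
                /*
                 * flags == 0 requests a RELATIVE expiry (5s from now).
                 * Before the fix this was treated as absolute.
                 */
                if (timer_settime(timerid, 0, &its, NULL)) {
                        perror("timer_settime");
                        return 1;
                }
                pause();        /* default SIGALRM action ends the process */
                return 0;
        }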
     
  • commit 97b8ee845393701edc06e27ccec2876ff9596019 upstream.

    ring_buffer_poll_wait() should always add the poll_table to its wait_queue,
    even when there is immediate data available. Otherwise, the following epoll
    and read sequence will eventually hang forever:

    1. Put some data to make the trace_pipe ring_buffer read ready first
    2. epoll_ctl(efd, EPOLL_CTL_ADD, trace_pipe_fd, ee)
    3. epoll_wait()
    4. read(trace_pipe_fd) till EAGAIN
    5. Add some more data to the trace_pipe ring_buffer
    6. epoll_wait() -> this epoll_wait() will block forever

    ~ During the epoll_ctl(efd, EPOLL_CTL_ADD,...) call in step 2,
    ring_buffer_poll_wait() returns immediately without adding poll_table,
    which has poll_table->_qproc pointing to ep_poll_callback(), to its
    wait_queue.
    ~ During the epoll_wait() call in step 3 and step 6,
    ring_buffer_poll_wait() cannot add ep_poll_callback() to its wait_queue
    because the poll_table->_qproc is NULL and it is how epoll works.
    ~ When there is new data available in step 6, ring_buffer does not know
    it has to call ep_poll_callback() because it is not in its wait queue.
    Hence, block forever.

    Other poll implementations seem to call poll_wait() unconditionally as the
    very first thing they do; for example, tcp_poll() in tcp.c.

    Link: http://lkml.kernel.org/p/20140610060637.GA14045@devbig242.prn2.facebook.com

    Fixes: 2a2cc8f7c4d0 "ftrace: allow the event pipe to be polled"
    Reviewed-by: Chris Mason
    Signed-off-by: Martin Lau
    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    Martin Lau
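
    A condensed userspace sketch of the six-step sequence above (assumes the
    usual debugfs tracing path and that some trace data is being generated):

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/epoll.h>
        #include <unistd.h>

        int main(void)
        {
                char buf[4096];
                struct epoll_event ee = { .events = EPOLLIN }, out;
                int fd  = open("/sys/kernel/debug/tracing/trace_pipe",
                               O_RDONLY | O_NONBLOCK);
                int efd = epoll_create1(0);

                if (fd < 0 || efd < 0)
                        return 1;

                epoll_ctl(efd, EPOLL_CTL_ADD, fd, &ee); /* step 2: data ready */
                epoll_wait(efd, &out, 1, -1);           /* step 3 */
                while (read(fd, buf, sizeof(buf)) > 0)  /* step 4: until EAGAIN */
                        ;
                /*
                 * Steps 5-6: once new data arrives, this wait used to block
                 * forever because the ring buffer never added the epoll
                 * callback to its wait queue.
                 */
                epoll_wait(efd, &out, 1, -1);
                puts("woken up");
                return 0;
        }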
     
  • commit 8abfb8727f4a724d31f9ccfd8013fbd16d539445 upstream.

    Currently the stacktrace trace option is not honored for trace_printk()
    with a constant string argument; the reason is that
    __trace_puts()/__trace_bputs() do not call ftrace_trace_stack().

    In contrast, when trace_printk() is used with a non-constant string
    argument (which calls into __trace_printk()/__trace_bprintk()), the
    stacktrace option does work. This inconsistent behaviour confuses
    users a lot.

    Link: http://lkml.kernel.org/p/51E7A7C9.9040401@huawei.com

    Signed-off-by: zhangwei(Jovi)
    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    zhangwei(Jovi)
     
  • commit 5f8bf2d263a20b986225ae1ed7d6759dc4b93af9 upstream.

    Running my ftrace tests on PowerPC, it failed the test that checks
    if function_graph tracer is affected by the stack tracer. It was.
    Looking into this, I found that the update_function_graph_func()
    must be called even if the trampoline function is not changed.
    This is because archs like PowerPC do not support ftrace_ops being
    passed by assembly and instead uses a helper function (what the
    trampoline function points to). Since this function is not changed
    even when multiple ftrace_ops are added to the code, the test that
    falls out before calling update_function_graph_func() will miss that
    the update must still be done.

    Call update_function_graph_func() for all calls to
    update_ftrace_function().

    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Red Hat)
     

18 Jul, 2014

8 commits

  • commit 27e35715df54cbc4f2d044f681802ae30479e7fb upstream.

    When the rtmutex fast path is enabled the slow unlock function can
    create the following situation:

    spin_lock(foo->m->wait_lock);
    foo->m->owner = NULL;
                                rt_mutex_lock(foo->m);   <-- fast path
                                free = atomic_dec_and_test(foo->refcnt);
                                rt_mutex_unlock(foo->m); <-- fast path
                                if (free)
                                        kfree(foo);

    spin_unlock(foo->m->wait_lock);  <--- Use after free.

    Plug the race by changing the slow unlock to the following scheme:

    while (!rt_mutex_has_waiters(m)) {
            /* Clear the waiters bit in m->owner */
            clear_rt_mutex_waiters(m);
            owner = rt_mutex_owner(m);
            spin_unlock(m->wait_lock);
            if (cmpxchg(m->owner, owner, 0) == owner)
                    return;
            spin_lock(m->wait_lock);
    }

    So in case of a new waiter incoming while the owner tries the slow
    path unlock we have two situations:

    owner (slow path unlock)          incoming waiter (lock path)

    unlock(wait_lock);
                                      lock(wait_lock);
    cmpxchg(p, owner, 0) == owner
                                      mark_rt_mutex_waiters(lock);
                                      acquire(lock);

    Or:

    unlock(wait_lock);
                                      lock(wait_lock);
                                      mark_rt_mutex_waiters(lock);
    cmpxchg(p, owner, 0) != owner
                                      enqueue_waiter();
                                      unlock(wait_lock);
    lock(wait_lock);
    wakeup_next_waiter();
    unlock(wait_lock);
                                      lock(wait_lock);
                                      acquire(lock);

    If the fast path is disabled, then the simple

    m->owner = NULL;
    unlock(m->wait_lock);

    is sufficient as all access to m->owner is serialized via
    m->wait_lock;

    Also document and clarify the wakeup_next_waiter function as suggested
    by Oleg Nesterov.

    Reported-by: Steven Rostedt
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Steven Rostedt
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140611183852.937945560@linutronix.de
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Mike Galbraith
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 3d5c9340d1949733eb37616abd15db36aef9a57c upstream.

    Even in the case when deadlock detection is not requested by the
    caller, we can detect deadlocks. Right now the code stops the lock
    chain walk and keeps the waiter enqueued, even on itself. Silly not to
    yell when such a scenario is detected and to keep the waiter enqueued.

    Return -EDEADLK unconditionally and handle it at the call sites.

    The futex calls return -EDEADLK. The non futex ones dequeue the
    waiter, throw a warning and put the task into a schedule loop.

    Tagged for stable as it makes the code more robust.

    Signed-off-by: Thomas Gleixner
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Brad Mouring
    Link: http://lkml.kernel.org/r/20140605152801.836501969@linutronix.de
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Mike Galbraith
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 82084984383babe728e6e3c9a8e5c46278091315 upstream.

    When we walk the lock chain, we drop all locks after each step. So the
    lock chain can change under us before we reacquire the locks. That's
    harmless in principle as we just follow the wrong lock path. But it
    can lead to a false positive in the dead lock detection logic:

    T0 holds L0
    T0 blocks on L1 held by T1
    T1 blocks on L2 held by T2
    T2 blocks on L3 held by T3
    T3 blocks on L4 held by T4

    Now we walk the chain

    lock T1 -> lock L2 -> adjust L2 -> unlock T1 ->
    lock T2 -> adjust T2 -> drop locks

    T2 times out and blocks on L0

    Now we continue:

    lock T2 -> lock L0 -> deadlock detected, but it's not a deadlock at all.

    Brad tried to work around that in the deadlock detection logic itself,
    but the more I looked at it the less I liked it, because it's crystal
    ball magic after the fact.

    We actually can detect a chain change very simply:

    lock T1 -> lock L2 -> adjust L2 -> unlock T1 -> lock T2 -> adjust T2 ->

    next_lock = T2->pi_blocked_on->lock;

    drop locks

    T2 times out and blocks on L0

    Now we continue:

    lock T2 ->

    if (next_lock != T2->pi_blocked_on->lock)
    return;

    So if we detect that T2 is now blocked on a different lock we stop the
    chain walk. That's also correct in the following scenario:

    lock T1 -> lock L2 -> adjust L2 -> unlock T1 -> lock T2 -> adjust T2 ->

    next_lock = T2->pi_blocked_on->lock;

    drop locks

    T3 times out and drops L3
    T2 acquires L3 and blocks on L4 now

    Now we continue:

    lock T2 ->

    if (next_lock != T2->pi_blocked_on->lock)
    return;

    We don't have to follow up the chain at that point, because T2
    propagated our priority up to T4 already.

    [ Folded a cleanup patch from peterz ]

    Signed-off-by: Thomas Gleixner
    Reported-by: Brad Mouring
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140605152801.930031935@linutronix.de
    Signed-off-by: Mike Galbraith
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 397335f004f41e5fcf7a795e94eb3ab83411a17c upstream.

    The current deadlock detection logic does not work reliably due to the
    following early exit path:

    /*
     * Drop out, when the task has no waiters. Note,
     * top_waiter can be NULL, when we are in the deboosting
     * mode!
     */
    if (top_waiter && (!task_has_pi_waiters(task) ||
                       top_waiter != task_top_pi_waiter(task)))
            goto out_unlock_pi;

    So this not only exits when the task has no waiters, it also exits
    unconditionally when the current waiter is not the top priority waiter
    of the task.

    So in a nested locking scenario, it might abort the lock chain walk
    and therefore miss a potential deadlock.

    Simple fix: Continue the chain walk, when deadlock detection is
    enabled.

    We also avoid the whole enqueue if we detect the deadlock right away
    (A-A). It's an optimization, but it also prevents another waiter who
    comes in after the detection and before the task has undone the damage
    from observing the situation, detecting the deadlock and returning
    -EDEADLOCK, which would be wrong as the other task is not in a deadlock
    situation.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Reviewed-by: Steven Rostedt
    Cc: Lai Jiangshan
    Link: http://lkml.kernel.org/r/20140522031949.725272460@linutronix.de
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Mike Galbraith
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 8b8b36834d0fff67fc8668093f4312dd04dcf21d upstream.

    The per_cpu buffers are created one per possible CPU. But this does
    not mean that those CPUs are online, nor that they even exist.

    With the addition of the ring buffer polling, it assumes that the
    caller polls on an existing buffer. But this is not the case if
    the user reads trace_pipe from a CPU that does not exist, and this
    causes the kernel to crash.

    The simple fix is to check the cpu against the buffer bitmask to see
    whether the buffer was allocated, and to return -ENODEV if it was
    not.

    More updates were done to pass the -ENODEV back up to userspace.

    Link: http://lkml.kernel.org/r/5393DB61.6060707@oracle.com

    Reported-by: Sasha Levin
    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Red Hat)
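
    A sketch of the check described above (names follow kernel/trace/
    ring_buffer.c but are abbreviated; treat as illustrative):

        /* in ring_buffer_wait()/ring_buffer_poll_wait() */
        if (cpu == RING_BUFFER_ALL_CPUS)
                work = &buffer->irq_work;
        else {
                if (!cpumask_test_cpu(cpu, buffer->cpumask))
                        return -ENODEV;         /* buffer never allocated */
                cpu_buffer = buffer->buffers[cpu];
                work = &cpu_buffer->irq_work;
        }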
     
  • commit 5a6024f1604eef119cf3a6fa413fe0261a81a8f3 upstream.

    When hot-adding and onlining a CPU, a kernel panic occurs, showing the
    following call trace.

    BUG: unable to handle kernel paging request at 0000000000001d08
    IP: [] __alloc_pages_nodemask+0x9d/0xb10
    PGD 0
    Oops: 0000 [#1] SMP
    ...
    Call Trace:
    [] ? cpumask_next_and+0x35/0x50
    [] ? find_busiest_group+0x113/0x8f0
    [] ? deactivate_slab+0x349/0x3c0
    [] new_slab+0x91/0x300
    [] __slab_alloc+0x2bb/0x482
    [] ? copy_process.part.25+0xfc/0x14c0
    [] ? load_balance+0x218/0x890
    [] ? sched_clock+0x9/0x10
    [] ? trace_clock_local+0x9/0x10
    [] kmem_cache_alloc_node+0x8c/0x200
    [] copy_process.part.25+0xfc/0x14c0
    [] ? trace_buffer_unlock_commit+0x4d/0x60
    [] ? kthread_create_on_node+0x140/0x140
    [] do_fork+0xbc/0x360
    [] kernel_thread+0x26/0x30
    [] kthreadd+0x2c2/0x300
    [] ? kthread_create_on_cpu+0x60/0x60
    [] ret_from_fork+0x7c/0xb0
    [] ? kthread_create_on_cpu+0x60/0x60

    In my investigation, I found the root cause is wq_numa_possible_cpumask.
    All entries of wq_numa_possible_cpumask are allocated by
    alloc_cpumask_var_node() and are used without being initialized,
    so they contain garbage values.

    When hot-adding and onlining CPU, wq_update_unbound_numa() is called.
    wq_update_unbound_numa() calls alloc_unbound_pwq(). And alloc_unbound_pwq()
    calls get_unbound_pool(). In get_unbound_pool(), worker_pool->node is set
    as follow:

    /* if cpumask is contained inside a NUMA node, we belong to that node */
    if (wq_numa_enabled) {
            for_each_node(node) {
                    if (cpumask_subset(pool->attrs->cpumask,
                                       wq_numa_possible_cpumask[node])) {
                            pool->node = node;
                            break;
                    }
            }
    }

    But wq_numa_possible_cpumask[node] does not hold a correct cpumask, so the
    wrong node is selected and, as a result, the kernel panics.

    With this patch, all entries of wq_numa_possible_cpumask are allocated by
    zalloc_cpumask_var_node() so that they are initialized, and the panic
    disappears.

    Signed-off-by: Yasuaki Ishimatsu
    Reviewed-by: Lai Jiangshan
    Signed-off-by: Tejun Heo
    Fixes: bce903809ab3 ("workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]")
    Signed-off-by: Greg Kroah-Hartman

    Yasuaki Ishimatsu
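
    The fix itself is essentially a one-word change in wq_numa_init(); a
    sketch (the BUG_ON wrapper and node checks in the real code are omitted):

        /* before: the mask's contents are uninitialized garbage */
        alloc_cpumask_var_node(&tbl[node], GFP_KERNEL, node);

        /* after: the mask starts out zeroed ... */
        zalloc_cpumask_var_node(&tbl[node], GFP_KERNEL, node);

        /* ... and only then gets the real bits set */
        for_each_possible_cpu(cpu)
                cpumask_set_cpu(cpu, tbl[cpu_to_node(cpu)]);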
     
  • commit 391acf970d21219a2a5446282d3b20eace0c0d7a upstream.

    When running with kernel 3.15-rc7+, the following bug occurs:
    [ 9969.258987] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586
    [ 9969.359906] in_atomic(): 1, irqs_disabled(): 0, pid: 160655, name: python
    [ 9969.441175] INFO: lockdep is turned off.
    [ 9969.488184] CPU: 26 PID: 160655 Comm: python Tainted: G A 3.15.0-rc7+ #85
    [ 9969.581032] Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 1.39 11/16/2012
    [ 9969.706052] ffffffff81a20e60 ffff8803e941fbd0 ffffffff8162f523 ffff8803e941fd18
    [ 9969.795323] ffff8803e941fbe0 ffffffff8109995a ffff8803e941fc58 ffffffff81633e6c
    [ 9969.884710] ffffffff811ba5dc ffff880405c6b480 ffff88041fdd90a0 0000000000002000
    [ 9969.974071] Call Trace:
    [ 9970.003403] [] dump_stack+0x4d/0x66
    [ 9970.065074] [] __might_sleep+0xfa/0x130
    [ 9970.130743] [] mutex_lock_nested+0x3c/0x4f0
    [ 9970.200638] [] ? kmem_cache_alloc+0x1bc/0x210
    [ 9970.272610] [] cpuset_mems_allowed+0x27/0x140
    [ 9970.344584] [] ? __mpol_dup+0x63/0x150
    [ 9970.409282] [] __mpol_dup+0xe5/0x150
    [ 9970.471897] [] ? __mpol_dup+0x63/0x150
    [ 9970.536585] [] ? copy_process.part.23+0x606/0x1d40
    [ 9970.613763] [] ? trace_hardirqs_on+0xd/0x10
    [ 9970.683660] [] ? monotonic_to_bootbased+0x2f/0x50
    [ 9970.759795] [] copy_process.part.23+0x670/0x1d40
    [ 9970.834885] [] do_fork+0xd8/0x380
    [ 9970.894375] [] ? __audit_syscall_entry+0x9c/0xf0
    [ 9970.969470] [] SyS_clone+0x16/0x20
    [ 9971.030011] [] stub_clone+0x69/0x90
    [ 9971.091573] [] ? system_call_fastpath+0x16/0x1b

    The cause is that cpuset_mems_allowed() tries to take
    mutex_lock(&callback_mutex) under rcu_read_lock() (which is held in
    __mpol_dup()). Since the access to the cpuset in cpuset_mems_allowed()
    is already under rcu_read_lock, we can shrink the rcu_read_lock
    protection region in __mpol_dup() to cover only the cpuset access in
    current_cpuset_is_being_rebound(), and thereby avoid this bug.

    This patch is a temporary solution that just addresses the bug
    mentioned above; it cannot fix the long-standing issue of cpuset.mems
    rebinding on fork():

    "When the forker's task_struct is duplicated (which includes
    ->mems_allowed) and it races with an update to cpuset_being_rebound
    in update_tasks_nodemask() then the task's mems_allowed doesn't get
    updated. And the child task's mems_allowed can be wrong if the
    cpuset's nodemask changes before the child has been added to the
    cgroup's tasklist."

    Signed-off-by: Gu Zheng
    Acked-by: Li Zefan
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Gu Zheng
     
  • commit bddbceb688c6d0decaabc7884fede319d02f96c8 upstream.

    Uevents are suppressed during attributes registration, but never
    restored, so kobject_uevent() does nothing.

    Signed-off-by: Maxime Bizon
    Signed-off-by: Tejun Heo
    Fixes: 226223ab3c4118ddd10688cc2c131135848371ab
    Signed-off-by: Greg Kroah-Hartman

    Maxime Bizon
     

10 Jul, 2014

1 commit


07 Jul, 2014

2 commits

  • commit 4af4206be2bd1933cae20c2b6fb2058dbc887f7c upstream.

    syscall_regfunc() and syscall_unregfunc() should set/clear
    TIF_SYSCALL_TRACEPOINT system-wide, but do_each_thread() can race
    with copy_process() and miss the new child which was not added to
    the process/thread lists yet.

    Change copy_process() to update the child's TIF_SYSCALL_TRACEPOINT
    under tasklist.

    Link: http://lkml.kernel.org/p/20140413185854.GB20668@redhat.com

    Fixes: a871bd33a6c0 "tracing: Add syscall tracepoints"
    Acked-by: Frederic Weisbecker
    Acked-by: Paul E. McKenney
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    Oleg Nesterov
     
  • commit 379cfdac37923653c9d4242d10052378b7563005 upstream.

    In order to prevent the saved cmdline cache from being filled when
    tracing is not active, the comms are only recorded after a trace event
    is recorded.

    The problem is, a comm can fail to be recorded if the trace_cmdline_lock
    is held. That lock is taken via a trylock to allow it to happen from
    any context (including NMI). If the lock fails to be taken, the comm
    is skipped. No big deal, as we will try again later.

    But! Because of the code that was added to only record after an event,
    we may not try again later as the recording is made as a oneshot per
    event per CPU.

    Only disable the recording of the comm if the comm is actually recorded.

    Fixes: 7ffbd48d5cab "tracing: Cache comms only after an event occurred"
    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Red Hat)
     

01 Jul, 2014

2 commits

  • commit 1e77d0a1ed7417d2a5a52a7b8d32aea1833faa6c upstream.

    Till reported that the spurious interrupt detection of threaded
    interrupts is broken in two ways:

    - note_interrupt() is called for each action thread of a shared
    interrupt line. That's wrong, as we are only interested in whether none
    of the device drivers felt responsible for the interrupt, but by
    calling multiple times for a single interrupt line we account
    IRQ_NONE even if one of the drivers felt responsible.

    - note_interrupt() when called from the thread handler is not
    serialized. That leaves the members of irq_desc which are used for
    the spurious detection unprotected.

    To solve this we need to defer the spurious detection of a threaded
    interrupt to the next hardware interrupt context where we have
    implicit serialization.

    If note_interrupt is called with action_ret == IRQ_WAKE_THREAD, we
    check whether the previous interrupt requested a deferred check. If
    not, we request a deferred check for the next hardware interrupt and
    return.

    If set, we check whether one of the interrupt threads signaled
    success. Depending on this information we feed the result into the
    spurious detector.

    If one primary handler of a shared interrupt returns IRQ_HANDLED we
    disable the deferred check of irq threads on the same line, as we have
    found at least one device driver who cared.

    Reported-by: Till Straumann
    Signed-off-by: Thomas Gleixner
    Tested-by: Austin Schuh
    Cc: Oliver Hartkopp
    Cc: Wolfgang Grandegger
    Cc: Pavel Pisa
    Cc: Marc Kleine-Budde
    Cc: linux-can@vger.kernel.org
    Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1303071450130.22263@ionos
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 4e52365f279564cef0ddd41db5237f0471381093 upstream.

    When tracing a process in another pid namespace, it's important for fork
    event messages to contain the child's pid as seen from the tracer's pid
    namespace, not the parent's. Otherwise, the tracer won't be able to
    correlate the fork event with later SIGTRAP signals it receives from the
    child.

    We still risk a race condition if a ptracer from a different pid
    namespace attaches after we compute the pid_t value. However, sending a
    bogus fork event message in this unlikely scenario is still a vast
    improvement over the status quo where we always send bogus fork event
    messages to debuggers in a different pid namespace than the forking
    process.

    Signed-off-by: Matthew Dempsky
    Acked-by: Oleg Nesterov
    Cc: Kees Cook
    Cc: Julien Tinnes
    Cc: Roland McGrath
    Cc: Jan Kratochvil
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Matthew Dempsky
     

27 Jun, 2014

2 commits

  • commit 0e576acbc1d9600cf2d9b4a141a2554639959d50 upstream.

    If CONFIG_NO_HZ=n tick_nohz_get_sleep_length() returns NSEC_PER_SEC/HZ.

    If CONFIG_NO_HZ=y and the nohz functionality is disabled via the
    command line option "nohz=off" or not enabled due to missing hardware
    support, then tick_nohz_get_sleep_length() returns 0. That happens
    because ts->sleep_length is never set in that case.

    Set it to NSEC_PER_SEC/HZ when the NOHZ mode is inactive.

    Reported-by: Michal Hocko
    Reported-by: Borislav Petkov
    Signed-off-by: Thomas Gleixner
    Cc: Rui Xiang
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • [ Upstream commit 90f62cf30a78721641e08737bda787552428061e ]

    By passing a netlink socket to a more privileged executable and then
    fooling that executable into writing to the socket data that happens
    to be a valid netlink message, it is possible to make that privileged
    executable do something it did not intend to do.

    To keep this from happening, replace bare capable and ns_capable calls
    with netlink_capable, netlink_net_capable and netlink_ns_capable calls,
    which act the same as the previous calls except that they also verify
    that the opener of the socket had the desired permissions.

    Reported-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     

17 Jun, 2014

2 commits

  • commit a3c54931199565930d6d84f4c3456f6440aefd41 upstream.

    Fixes an easy DoS and possible information disclosure.

    This does nothing about the broken state of x32 auditing.

    eparis: If the admin has enabled auditd and has specifically loaded
    audit rules. This bug has been around since before git. Wow...

    Signed-off-by: Andy Lutomirski
    Signed-off-by: Eric Paris
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andy Lutomirski
     
  • commit 23adbe12ef7d3d4195e80800ab36b37bee28cd03 upstream.

    The kernel has no concept of capabilities with respect to inodes; inodes
    exist independently of namespaces. For example, inode_capable(inode,
    CAP_LINUX_IMMUTABLE) would be nonsense.

    This patch changes inode_capable to check for uid and gid mappings and
    renames it to capable_wrt_inode_uidgid, which should make it more
    obvious what it does.

    Fixes CVE-2014-4014.

    Cc: Theodore Ts'o
    Cc: Serge Hallyn
    Cc: "Eric W. Biederman"
    Cc: Dave Chinner
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andy Lutomirski
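
    A sketch of the renamed helper's semantics as described above (close to,
    but not necessarily identical to, the upstream implementation):

        /*
         * Like ns_capable(), but additionally require that the inode's uid
         * and gid are mapped into the caller's user namespace.
         */
        bool capable_wrt_inode_uidgid(const struct inode *inode, int cap)
        {
                struct user_namespace *ns = current_user_ns();

                return ns_capable(ns, cap) &&
                       kuid_has_mapping(ns, inode->i_uid) &&
                       kgid_has_mapping(ns, inode->i_gid);
        }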
     

12 Jun, 2014

1 commit

  • commit 723478c8a471403c53cf144999701f6e0c4bbd11 upstream.

    /proc/sys/kernel/perf_event_max_sample_rate will accept
    negative values as well as 0.

    Negative values are unreasonable, and 0 causes a
    divide by zero exception in perf_proc_update_handler.

    This patch enforces a lower limit of 1.

    Signed-off-by: Knut Petersen
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/5242DB0C.4070005@t-online.de
    Signed-off-by: Ingo Molnar
    Cc: Weng Meiling
    Signed-off-by: Greg Kroah-Hartman

    Knut Petersen