28 Aug, 2014

12 commits

  • commit e6d30ab1e7d1281784672c0fc2ffa385cfb7279e upstream.

    All the callers of irq_create_of_mapping() pass the contents of a struct
    of_phandle_args structure to the function. Since all the callers already
    have an of_phandle_args pointer, why not pass it directly to
    irq_create_of_mapping()?

    Signed-off-by: Grant Likely
    Acked-by: Michal Simek
    Acked-by: Tony Lindgren
    Cc: Thomas Gleixner
    Cc: Russell King
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Shawn Guo

    Conflicts:
    arch/arm/mach-integrator/pci_v3.c
    arch/mips/pci/pci-rt3883.c
    kernel/irq/irqdomain.c

    Grant Likely
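
    As an illustration of the calling-convention change described above, a
    hedged sketch (prototypes paraphrased from the irqdomain API; the caller
    shown is a generic example, not a specific file from this series):

        /* Before: callers unpacked the of_phandle_args they already had. */
        unsigned int irq_create_of_mapping(struct device_node *controller,
                                           const u32 *intspec,
                                           unsigned int intsize);
        virq = irq_create_of_mapping(oirq.np, oirq.args, oirq.args_count);

        /* After: callers pass the structure pointer directly. */
        unsigned int irq_create_of_mapping(struct of_phandle_args *irq_data);
        virq = irq_create_of_mapping(&oirq);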
     
  • Commits 11d4616bd07f ("futex: revert back to the explicit waiter
    counting code") and 69cd9eba3886 ("futex: avoid race between requeue and
    wake") changed some of the finer details of how we think about futexes.
    One was a late fix and the other a consequence of overlooking the whole
    requeuing logic.

    The first change caused our documentation to be incorrect, and the
    second made us aware that we need to explicitly add more details to it.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Jan Stancek reported:
    "pthread_cond_broadcast/4-1.c testcase from openposix testsuite (LTP)
    occasionally fails, because some threads fail to wake up.

    Testcase creates 5 threads, which are all waiting on same condition.
    Main thread then calls pthread_cond_broadcast() without holding mutex,
    which calls:

    futex(uaddr1, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, uaddr2, ..)

    This immediately wakes up single thread A, which unlocks mutex and
    tries to wake up another thread:

    futex(uaddr2, FUTEX_WAKE_PRIVATE, 1)

    If thread A manages to call futex_wake() before any waiters are
    requeued for uaddr2, no other thread is woken up"

    The ordering constraints for the hash bucket waiter counting are that
    the waiter counts have to be incremented _before_ getting the spinlock
    (because the spinlock acts as part of the memory barrier), but the
    "requeue" operation didn't honor those rules, and nobody had even
    thought about that case.

    This fairly simple patch just increments the waiter count for the target
    hash bucket (hb2) when requeueing a futex before taking the locks. It
    then decrements them again after releasing the lock - the code that
    actually moves the futex(es) between hash buckets will do the additional
    required waiter count housekeeping.

    Reported-and-tested-by: Jan Stancek
    Acked-by: Davidlohr Bueso
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: stable@vger.kernel.org # 3.14
    Signed-off-by: Linus Torvalds

    Linus Torvalds
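
    A minimal userspace sketch of the scenario Jan describes (five waiters on
    one condition variable, broadcast without the mutex held); this is an
    illustrative reproducer, not the LTP testcase itself (link with -lpthread):

        #include <pthread.h>
        #include <stdio.h>
        #include <unistd.h>

        #define NTHREADS 5

        static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

        static void *waiter(void *arg)
        {
                pthread_mutex_lock(&lock);
                pthread_cond_wait(&cond, &lock);  /* FUTEX_WAIT inside glibc */
                pthread_mutex_unlock(&lock);
                return NULL;
        }

        int main(void)
        {
                pthread_t t[NTHREADS];
                int i;

                for (i = 0; i < NTHREADS; i++)
                        pthread_create(&t[i], NULL, waiter, NULL);
                sleep(1);               /* let all five block */

                /*
                 * Broadcast without holding the mutex: glibc issues
                 * FUTEX_CMP_REQUEUE, waking one waiter and requeueing the
                 * rest onto the mutex futex - the window the fix closes.
                 */
                pthread_cond_broadcast(&cond);

                for (i = 0; i < NTHREADS; i++)
                        pthread_join(t[i], NULL);
                puts("all threads woke up");
                return 0;
        }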
     
  • Srikar Dronamraju reports that commit b0c29f79ecea ("futexes: Avoid
    taking the hb->lock if there's nothing to wake up") causes java threads
    to get stuck on futexes when running specjbb on a power7 numa box.

    The cause appears to be that the powerpc spinlocks aren't using the same
    ticket lock model that we use on x86 (and other) architectures, which in
    turn results in the "spin_is_locked()" test in hb_waiters_pending()
    occasionally reporting an unlocked spinlock even when there are pending
    waiters.

    So this reinstates Davidlohr Bueso's original explicit waiter counting
    code, which I had convinced Davidlohr to drop in favor of figuring out
    the pending waiters by just using the existing state of the spinlock and
    the wait queue.

    Reported-and-tested-by: Srikar Dronamraju
    Original-code-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Linus Torvalds
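
    A simplified sketch of the explicit waiter counting that this change
    reinstates (helper names follow kernel/futex.c; the CONFIG_SMP guards and
    the exact barrier primitive vary by kernel version):

        static inline void hb_waiters_inc(struct futex_hash_bucket *hb)
        {
                atomic_inc(&hb->waiters);      /* hb->waiters is an atomic_t */
                smp_mb__after_atomic();        /* pairs with the wake path */
        }

        static inline void hb_waiters_dec(struct futex_hash_bucket *hb)
        {
                atomic_dec(&hb->waiters);
        }

        static inline int hb_waiters_pending(struct futex_hash_bucket *hb)
        {
                /* No spin_is_locked()/ticket-lock assumptions: read the count. */
                return atomic_read(&hb->waiters);
        }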
     
  • "futexes: Increase hash table size for better performance"
    introduces a new alloc_large_system_hash() call.

    alloc_large_system_hash() however may allocate less memory than
    requested, e.g. limited by MAX_ORDER.

    Hence pass a pointer to alloc_large_system_hash() which will
    contain the hash shift when the function returns. Afterwards
    correctly set futex_hashsize.

    Fixes a crash on s390 where the requested allocation size was
    4MB but only 1MB was allocated.

    Signed-off-by: Heiko Carstens
    Cc: Darren Hart
    Cc: Peter Zijlstra
    Cc: Paul E. McKenney
    Cc: Waiman Long
    Cc: Jason Low
    Cc: Davidlohr Bueso
    Link: http://lkml.kernel.org/r/20140116135450.GA4345@osiris
    Signed-off-by: Ingo Molnar

    Heiko Carstens
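
    A sketch of the pattern described above, assuming a futex_init()-style
    caller (argument order per the alloc_large_system_hash() prototype; flags
    and limits are simplified, and futex_queues is the global bucket array):

        static int __init futex_init(void)
        {
                unsigned int futex_shift;

                /* futex_hashsize was computed earlier (256 * ncpus, power of 2) */
                futex_queues = alloc_large_system_hash("futex",
                                                       sizeof(*futex_queues),
                                                       futex_hashsize, 0, 0,
                                                       &futex_shift, NULL,
                                                       futex_hashsize,
                                                       futex_hashsize);
                /*
                 * The allocator may have clamped the table (e.g. by MAX_ORDER),
                 * so recompute the real size from the returned shift.
                 */
                futex_hashsize = 1UL << futex_shift;

                /* per-bucket plist/spinlock init omitted in this sketch */
                return 0;
        }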
     
  • In futex_wake() there is clearly no point in taking the hb->lock
    if we know beforehand that there are no tasks to be woken. While
    the hash bucket's plist head is a cheap way of knowing this, we
    cannot rely 100% on it as there is a racy window between the
    futex_wait call and when the task is actually added to the
    plist. To this end, we couple it with the spinlock check as
    tasks trying to enter the critical region are most likely
    potential waiters that will be added to the plist, thus
    preventing tasks sleeping forever if wakers don't acknowledge
    all possible waiters.

    Furthermore, the futex ordering guarantees are preserved,
    ensuring that waiters either observe the changed user space
    value before blocking or are woken by a concurrent waker. For
    wakers, this is done by relying on the barriers in
    get_futex_key_refs() -- for archs that do not have implicit mb
    in atomic_inc(), we explicitly add them through a new
    futex_get_mm function. For waiters we rely on the fact that
    spin_lock calls already update the head counter, so spinners
    are visible even if the lock hasn't been acquired yet.

    For more details please refer to the updated comments in the
    code and related discussion:

    https://lkml.org/lkml/2013/11/26/556

    Special thanks to tglx for careful review and feedback.

    Suggested-by: Linus Torvalds
    Reviewed-by: Darren Hart
    Reviewed-by: Thomas Gleixner
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Davidlohr Bueso
    Cc: Paul E. McKenney
    Cc: Mike Galbraith
    Cc: Jeff Mahoney
    Cc: Scott Norton
    Cc: Tom Vaden
    Cc: Aswin Chandramouleeswaran
    Cc: Waiman Long
    Cc: Jason Low
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/1389569486-25487-5-git-send-email-davidlohr@hp.com
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
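
    A sketch of the resulting fast path in futex_wake(), abbreviated from the
    description above (error handling and the wake loop are omitted):

        hb = hash_futex(&key);

        /* Make sure we really have tasks to wake up before taking the lock. */
        if (!hb_waiters_pending(hb))
                goto out_put_key;

        spin_lock(&hb->lock);
        /* ... walk hb->chain and wake up to nr_wake matching waiters ... */
        spin_unlock(&hb->lock);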
     
  • That's essential, if you want to hack on futexes.

    Reviewed-by: Darren Hart
    Reviewed-by: Peter Zijlstra
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Davidlohr Bueso
    Cc: Mike Galbraith
    Cc: Jeff Mahoney
    Cc: Linus Torvalds
    Cc: Randy Dunlap
    Cc: Scott Norton
    Cc: Tom Vaden
    Cc: Aswin Chandramouleeswaran
    Cc: Waiman Long
    Cc: Jason Low
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/1389569486-25487-4-git-send-email-davidlohr@hp.com
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • Currently, the futex global hash table suffers from its fixed,
    smallish (by today's standards) size of 256 entries, as well as
    its lack of NUMA awareness. Large systems, using many futexes,
    can be prone to high amounts of collisions; where these futexes
    hash to the same bucket and lead to extra contention on the same
    hb->lock. Furthermore, cacheline bouncing is a reality when we
    have multiple hb->locks residing on the same cacheline and
    different futexes hash to adjacent buckets.

    This patch keeps the current static size of 16 entries for small
    systems, or otherwise, 256 * ncpus (or larger as we need to
    round the number to a power of 2). Note that this number of CPUs
    accounts for all CPUs that can ever be available in the system,
    taking into consideration things like hotplugging. While we do
    impose extra overhead at bootup by making the hash table larger,
    this is a one-time cost, and does not overshadow the benefits of
    this patch.

    Furthermore, as suggested by tglx, by cache aligning the hash
    buckets we can avoid access across cacheline boundaries and also
    avoid massive cache line bouncing if multiple cpus are hammering
    away at different hash buckets which happen to reside in the
    same cache line.

    Also, similar to other core kernel components (pid, dcache,
    tcp), by using alloc_large_system_hash() we benefit from its
    NUMA awareness and thus the table is distributed among the nodes
    instead of in a single one.

    For a custom microbenchmark that pounds on the uaddr hashing --
    making the wait path fail at futex_wait_setup() returning
    -EWOULDBLOCK for large amounts of futexes, we can see the
    following benefits on an 80-core, 8-socket 1 TB server:

    +---------+--------------------+------------------------+-----------------------+-------------------------------+
    | threads | baseline (ops/sec) | aligned-only (ops/sec) | large table (ops/sec) | large table+aligned (ops/sec) |
    +---------+--------------------+------------------------+-----------------------+-------------------------------+
    |     512 |              32426 | 50531  (+55.8%)        | 255274  (+687.2%)     | 292553  (+802.2%)             |
    |     256 |              65360 | 99588  (+52.3%)        | 443563  (+578.6%)     | 508088  (+677.3%)             |
    |     128 |             125635 | 200075 (+59.2%)        | 742613  (+491.1%)     | 835452  (+564.9%)             |
    |      80 |             193559 | 323425 (+67.1%)        | 1028147 (+431.1%)     | 1130304 (+483.9%)             |
    |      64 |             247667 | 443740 (+79.1%)        | 997300  (+302.6%)     | 1145494 (+362.5%)             |
    |      32 |             628412 | 721401 (+14.7%)        | 965996  (+53.7%)      | 1122115 (+78.5%)              |
    +---------+--------------------+------------------------+-----------------------+-------------------------------+

    Reviewed-by: Darren Hart
    Reviewed-by: Peter Zijlstra
    Reviewed-by: Paul E. McKenney
    Reviewed-by: Waiman Long
    Reviewed-and-tested-by: Jason Low
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Davidlohr Bueso
    Cc: Mike Galbraith
    Cc: Jeff Mahoney
    Cc: Linus Torvalds
    Cc: Scott Norton
    Cc: Tom Vaden
    Cc: Aswin Chandramouleeswaran
    Link: http://lkml.kernel.org/r/1389569486-25487-3-git-send-email-davidlohr@hp.com
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
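
    A sketch of the cache-aligned buckets and boot-time sizing described above
    (simplified from kernel/futex.c of that era; the waiter-count field added
    later is omitted, as is the alloc_large_system_hash() call itself):

        static struct futex_hash_bucket {
                spinlock_t lock;
                struct plist_head chain;
        } ____cacheline_aligned_in_smp *futex_queues;

        static unsigned long futex_hashsize;

        static int __init futex_init(void)
        {
        #if CONFIG_BASE_SMALL
                futex_hashsize = 16;
        #else
                futex_hashsize = roundup_pow_of_two(256 * num_possible_cpus());
        #endif
                /* the table itself comes from alloc_large_system_hash(),
                 * which spreads it across NUMA nodes */
                return 0;
        }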
     
  • Avoid waking up every thread sleeping in a futex_wait call during
    suspend and resume by calling a freezable blocking call. Previous
    patches modified the freezer to avoid sending wakeups to threads
    that are blocked in freezable blocking calls.

    This call was selected to be converted to a freezable call because
    it doesn't hold any locks or release any resources when interrupted
    that might be needed by another freezing task or a kernel driver
    during suspend, and is a common site where idle userspace tasks are
    blocked.

    Signed-off-by: Colin Cross
    Cc: Rafael J. Wysocki
    Cc: arve@android.com
    Cc: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Darren Hart
    Cc: Randy Dunlap
    Cc: Al Viro
    Link: http://lkml.kernel.org/r/1367458508-9133-8-git-send-email-ccross@android.com
    Signed-off-by: Thomas Gleixner

    Colin Cross
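
    The conversion itself is essentially a one-line change in
    futex_wait_queue_me(); a sketch of the relevant hunk (context abbreviated):

        /* queue_me() and hrtimer setup done above ... */
        if (likely(!plist_node_empty(&q->list))) {
                if (!timeout || timeout->task)
                        freezable_schedule();   /* was: schedule() */
        }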
     
  • - Remove unnecessary head variables.
    - Delete unused parameter in queue_unlock().

    Reviewed-by: Darren Hart
    Reviewed-by: Peter Zijlstra
    Reviewed-by: Paul E. McKenney
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Jason Low
    Signed-off-by: Davidlohr Bueso
    Cc: Mike Galbraith
    Cc: Jeff Mahoney
    Cc: Linus Torvalds
    Cc: Scott Norton
    Cc: Tom Vaden
    Cc: Aswin Chandramouleeswaran
    Cc: Waiman Long
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/1389569486-25487-2-git-send-email-davidlohr@hp.com
    Signed-off-by: Ingo Molnar

    Jason Low
     
  • When debugging the read-only hugepage case, I was confused by the fact
    that get_futex_key() did an access_ok() only for the non-shared futex
    case, since the user address checking really isn't in any way specific
    to the private key handling.

    Now, it turns out that the shared key handling does effectively do the
    equivalent checks inside get_user_pages_fast() (it doesn't actually
    check the address range on x86, but does check the page protections for
    being a user page). So it wasn't actually a bug, but the fact that we
    treat the address differently for private and shared futexes threw me
    for a loop.

    Just move the check up, so that it gets done for both cases. Also, use
    the 'rw' parameter for the type, even if it doesn't actually matter any
    more (it's a historical artifact of the old racy i386 "page faults from
    kernel space don't check write protections").

    Cc: Thomas Gleixner
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
    Some modules may need to know the CPU idle status and take actions
    before and after a CPU enters or exits idle. Add a notification
    callback chain for CPU idle entry/exit: modules only need to register
    a callback on this chain, and every time a CPU enters or exits idle
    the callback chain is executed.

    Currently only the cpufreq interactive governor uses this
    notification; to save power, its timers are enabled only while the
    CPU is not idle.

    Signed-off-by: Anson Huang

    Anson Huang
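
    The entry above does not show the interface; purely as a hypothetical
    sketch of how such a chain might be consumed (idle_notifier_register(),
    IDLE_START and IDLE_END are assumed names based on the description, not
    verified against this tree):

        static int gov_idle_notifier(struct notifier_block *nb,
                                     unsigned long action, void *data)
        {
                switch (action) {
                case IDLE_START:        /* assumed: CPU is entering idle */
                        /* e.g. stop the governor's polling timer */
                        break;
                case IDLE_END:          /* assumed: CPU has left idle */
                        /* e.g. re-arm the polling timer */
                        break;
                }
                return NOTIFY_OK;
        }

        static struct notifier_block gov_idle_nb = {
                .notifier_call = gov_idle_notifier,
        };

        /* in the governor's init path: */
        idle_notifier_register(&gov_idle_nb);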
     

08 Aug, 2014

2 commits

  • commit 504d58745c9ca28d33572e2d8a9990b43e06075d upstream.

    clockevents_increase_min_delta() calls printk() from under
    hrtimer_bases.lock. That causes lock inversion on scheduler locks because
    printk() can call into the scheduler. Lockdep puts it as:

    ======================================================
    [ INFO: possible circular locking dependency detected ]
    3.15.0-rc8-06195-g939f04b #2 Not tainted
    -------------------------------------------------------
    trinity-main/74 is trying to acquire lock:
    (&port_lock_key){-.....}, at: [] serial8250_console_write+0x8c/0x10c

    but task is already holding lock:
    (hrtimer_bases.lock){-.-...}, at: [] hrtimer_try_to_cancel+0x13/0x66

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #5 (hrtimer_bases.lock){-.-...}:
    [] lock_acquire+0x92/0x101
    [] _raw_spin_lock_irqsave+0x2e/0x3e
    [] __hrtimer_start_range_ns+0x1c/0x197
    [] perf_swevent_start_hrtimer.part.41+0x7a/0x85
    [] task_clock_event_start+0x3a/0x3f
    [] task_clock_event_add+0xd/0x14
    [] event_sched_in+0xb6/0x17a
    [] group_sched_in+0x44/0x122
    [] ctx_sched_in.isra.67+0x105/0x11f
    [] perf_event_sched_in.isra.70+0x47/0x4b
    [] __perf_install_in_context+0x8b/0xa3
    [] remote_function+0x12/0x2a
    [] smp_call_function_single+0x2d/0x53
    [] task_function_call+0x30/0x36
    [] perf_install_in_context+0x87/0xbb
    [] SYSC_perf_event_open+0x5c6/0x701
    [] SyS_perf_event_open+0x17/0x19
    [] syscall_call+0x7/0xb

    -> #4 (&ctx->lock){......}:
    [] lock_acquire+0x92/0x101
    [] _raw_spin_lock+0x21/0x30
    [] __perf_event_task_sched_out+0x1dc/0x34f
    [] __schedule+0x4c6/0x4cb
    [] schedule+0xf/0x11
    [] work_resched+0x5/0x30

    -> #3 (&rq->lock){-.-.-.}:
    [] lock_acquire+0x92/0x101
    [] _raw_spin_lock+0x21/0x30
    [] __task_rq_lock+0x33/0x3a
    [] wake_up_new_task+0x25/0xc2
    [] do_fork+0x15c/0x2a0
    [] kernel_thread+0x1a/0x1f
    [] rest_init+0x1a/0x10e
    [] start_kernel+0x303/0x308
    [] i386_start_kernel+0x79/0x7d

    -> #2 (&p->pi_lock){-.-...}:
    [] lock_acquire+0x92/0x101
    [] _raw_spin_lock_irqsave+0x2e/0x3e
    [] try_to_wake_up+0x1d/0xd6
    [] default_wake_function+0xb/0xd
    [] __wake_up_common+0x39/0x59
    [] __wake_up+0x29/0x3b
    [] tty_wakeup+0x49/0x51
    [] uart_write_wakeup+0x17/0x19
    [] serial8250_tx_chars+0xbc/0xfb
    [] serial8250_handle_irq+0x54/0x6a
    [] serial8250_default_handle_irq+0x19/0x1c
    [] serial8250_interrupt+0x38/0x9e
    [] handle_irq_event_percpu+0x5f/0x1e2
    [] handle_irq_event+0x2c/0x43
    [] handle_level_irq+0x57/0x80
    [] handle_irq+0x46/0x5c
    [] do_IRQ+0x32/0x89
    [] common_interrupt+0x2e/0x33
    [] _raw_spin_unlock_irqrestore+0x3f/0x49
    [] uart_start+0x2d/0x32
    [] uart_write+0xc7/0xd6
    [] n_tty_write+0xb8/0x35e
    [] tty_write+0x163/0x1e4
    [] redirected_tty_write+0x6d/0x75
    [] vfs_write+0x75/0xb0
    [] SyS_write+0x44/0x77
    [] syscall_call+0x7/0xb

    -> #1 (&tty->write_wait){-.....}:
    [] lock_acquire+0x92/0x101
    [] _raw_spin_lock_irqsave+0x2e/0x3e
    [] __wake_up+0x15/0x3b
    [] tty_wakeup+0x49/0x51
    [] uart_write_wakeup+0x17/0x19
    [] serial8250_tx_chars+0xbc/0xfb
    [] serial8250_handle_irq+0x54/0x6a
    [] serial8250_default_handle_irq+0x19/0x1c
    [] serial8250_interrupt+0x38/0x9e
    [] handle_irq_event_percpu+0x5f/0x1e2
    [] handle_irq_event+0x2c/0x43
    [] handle_level_irq+0x57/0x80
    [] handle_irq+0x46/0x5c
    [] do_IRQ+0x32/0x89
    [] common_interrupt+0x2e/0x33
    [] _raw_spin_unlock_irqrestore+0x3f/0x49
    [] uart_start+0x2d/0x32
    [] uart_write+0xc7/0xd6
    [] n_tty_write+0xb8/0x35e
    [] tty_write+0x163/0x1e4
    [] redirected_tty_write+0x6d/0x75
    [] vfs_write+0x75/0xb0
    [] SyS_write+0x44/0x77
    [] syscall_call+0x7/0xb

    -> #0 (&port_lock_key){-.....}:
    [] __lock_acquire+0x9ea/0xc6d
    [] lock_acquire+0x92/0x101
    [] _raw_spin_lock_irqsave+0x2e/0x3e
    [] serial8250_console_write+0x8c/0x10c
    [] call_console_drivers.constprop.31+0x87/0x118
    [] console_unlock+0x1d7/0x398
    [] vprintk_emit+0x3da/0x3e4
    [] printk+0x17/0x19
    [] clockevents_program_min_delta+0x104/0x116
    [] clockevents_program_event+0xe7/0xf3
    [] tick_program_event+0x1e/0x23
    [] hrtimer_force_reprogram+0x88/0x8f
    [] __remove_hrtimer+0x5b/0x79
    [] hrtimer_try_to_cancel+0x49/0x66
    [] hrtimer_cancel+0xd/0x18
    [] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
    [] task_clock_event_stop+0x20/0x64
    [] task_clock_event_del+0xd/0xf
    [] event_sched_out+0xab/0x11e
    [] group_sched_out+0x1d/0x66
    [] ctx_sched_out+0xaf/0xbf
    [] __perf_event_task_sched_out+0x1ed/0x34f
    [] __schedule+0x4c6/0x4cb
    [] schedule+0xf/0x11
    [] work_resched+0x5/0x30

    other info that might help us debug this:

    Chain exists of:
    &port_lock_key --> &ctx->lock --> hrtimer_bases.lock

    Possible unsafe locking scenario:

    CPU0                              CPU1
    ----                              ----
    lock(hrtimer_bases.lock);
                                      lock(&ctx->lock);
                                      lock(hrtimer_bases.lock);
    lock(&port_lock_key);

    *** DEADLOCK ***

    4 locks held by trinity-main/74:
    #0: (&rq->lock){-.-.-.}, at: [] __schedule+0xed/0x4cb
    #1: (&ctx->lock){......}, at: [] __perf_event_task_sched_out+0x1dc/0x34f
    #2: (hrtimer_bases.lock){-.-...}, at: [] hrtimer_try_to_cancel+0x13/0x66
    #3: (console_lock){+.+...}, at: [] vprintk_emit+0x3c7/0x3e4

    stack backtrace:
    CPU: 0 PID: 74 Comm: trinity-main Not tainted 3.15.0-rc8-06195-g939f04b #2
    00000000 81c3a310 8b995c14 81426f69 8b995c44 81425a99 8161f671 8161f570
    8161f538 8161f559 8161f538 8b995c78 8b142bb0 00000004 8b142fdc 8b142bb0
    8b995ca8 8104a62d 8b142fac 000016f2 81c3a310 00000001 00000001 00000003
    Call Trace:
    [] dump_stack+0x16/0x18
    [] print_circular_bug+0x18f/0x19c
    [] __lock_acquire+0x9ea/0xc6d
    [] lock_acquire+0x92/0x101
    [] ? serial8250_console_write+0x8c/0x10c
    [] ? wait_for_xmitr+0x76/0x76
    [] _raw_spin_lock_irqsave+0x2e/0x3e
    [] ? serial8250_console_write+0x8c/0x10c
    [] serial8250_console_write+0x8c/0x10c
    [] ? lock_release+0x191/0x223
    [] ? wait_for_xmitr+0x76/0x76
    [] call_console_drivers.constprop.31+0x87/0x118
    [] console_unlock+0x1d7/0x398
    [] vprintk_emit+0x3da/0x3e4
    [] printk+0x17/0x19
    [] clockevents_program_min_delta+0x104/0x116
    [] tick_program_event+0x1e/0x23
    [] hrtimer_force_reprogram+0x88/0x8f
    [] __remove_hrtimer+0x5b/0x79
    [] hrtimer_try_to_cancel+0x49/0x66
    [] hrtimer_cancel+0xd/0x18
    [] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
    [] task_clock_event_stop+0x20/0x64
    [] task_clock_event_del+0xd/0xf
    [] event_sched_out+0xab/0x11e
    [] group_sched_out+0x1d/0x66
    [] ctx_sched_out+0xaf/0xbf
    [] __perf_event_task_sched_out+0x1ed/0x34f
    [] ? __dequeue_entity+0x23/0x27
    [] ? pick_next_task_fair+0xb1/0x120
    [] __schedule+0x4c6/0x4cb
    [] ? trace_hardirqs_off_caller+0xd7/0x108
    [] ? trace_hardirqs_off+0xb/0xd
    [] ? rcu_irq_exit+0x64/0x77

    Fix the problem by using printk_deferred() which does not call into the
    scheduler.

    Reported-by: Fengguang Wu
    Signed-off-by: Jan Kara
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
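
    A sketch of the shape of the fix (the exact message text in
    clockevents_increase_min_delta() may differ; the point is the deferred
    printk variant):

        /* still running under hrtimer_bases.lock at this point */
        printk_deferred(KERN_WARNING
                        "CE: %s increased min_delta_ns to %llu nsec\n",
                        dev->name ? dev->name : "?",
                        (unsigned long long) dev->min_delta_ns);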
     
  • commit aac74dc495456412c4130a1167ce4beb6c1f0b38 upstream.

    After learning we'll need some sort of deferred printk functionality in
    the timekeeping core, Peter suggested we rename the printk_sched function
    so it can be reused by needed subsystems.

    This only changes the function name. No logic changes.

    Signed-off-by: John Stultz
    Reviewed-by: Steven Rostedt
    Cc: Jan Kara
    Cc: Peter Zijlstra
    Cc: Jiri Bohac
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    John Stultz
     

01 Aug, 2014

1 commit

  • commit 58d4e21e50ff3cc57910a8abc20d7e14375d2f61 upstream.

    The "uptime" trace clock added in:

    commit 8aacf017b065a805d27467843490c976835eb4a5
    tracing: Add "uptime" trace clock that uses jiffies

    has wraparound problems when the system has been up more
    than 1 hour 11 minutes and 34 seconds. It converts jiffies
    to nanoseconds using:
    (u64)jiffies_to_usecs(jiffy) * 1000ULL
    but since jiffies_to_usecs() only returns a 32-bit value, it
    truncates at 2^32 microseconds. An additional problem on 32-bit
    systems is that the argument is "unsigned long", so fixing the
    return value only helps until 2^32 jiffies (49.7 days on a HZ=1000
    system).

    Avoid these problems by using jiffies_64 as our basis, and
    not converting to nanoseconds (we do convert to clock_t because
    user facing API must not be dependent on internal kernel
    HZ values).

    Link: http://lkml.kernel.org/p/99d63c5bfe9b320a3b428d773825a37095bf6a51.1405708254.git.tony.luck@intel.com

    Fixes: 8aacf017b065 "tracing: Add "uptime" trace clock that uses jiffies"
    Signed-off-by: Tony Luck
    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    Tony Luck
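
    A sketch of the before/after; the "before" line is quoted in the entry
    above, while the "after" form (jiffies_64 plus jiffies_64_to_clock_t())
    is a hedged reading of the fix:

        /* before: truncates at 2^32 usecs, and at 2^32 jiffies on 32-bit */
        u64 notrace trace_clock_jiffies(void)
        {
                u64 jiffy = jiffies - INITIAL_JIFFIES;

                return (u64)jiffies_to_usecs(jiffy) * 1000ULL;
        }

        /* after: stay in 64-bit jiffies and report clock_t units */
        u64 notrace trace_clock_jiffies(void)
        {
                return jiffies_64_to_clock_t(jiffies_64 - INITIAL_JIFFIES);
        }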
     

28 Jul, 2014

7 commits

  • commit b0ab99e7736af88b8ac1b7ae50ea287fffa2badc upstream.

    proc_sched_show_task() does:

    if (nr_switches)
    do_div(avg_atom, nr_switches);

    nr_switches is unsigned long and do_div truncates it to 32 bits, which
    means it can test non-zero on e.g. x86-64 and be truncated to zero for
    division.

    Fix the problem by using div64_ul() instead.

    As a side effect calculations of avg_atom for big nr_switches are now correct.

    Signed-off-by: Mateusz Guzik
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1402750809-31991-1-git-send-email-mguzik@redhat.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Mateusz Guzik
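
    A sketch of the change in proc_sched_show_task(), per the description
    above:

        /* before: do_div() silently truncates nr_switches to 32 bits */
        if (nr_switches)
                do_div(avg_atom, nr_switches);

        /* after: proper 64-bit by unsigned long division */
        if (nr_switches)
                avg_atom = div64_ul(avg_atom, nr_switches);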
     
  • commit 4badad352a6bb202ec68afa7a574c0bb961e5ebc upstream.

    The optimistic spin code assumes regular stores and cmpxchg() play nice;
    this is found to not be true for at least: parisc, sparc32, tile32,
    metag-lock1, arc-!llsc and hexagon.

    There is further wreckage, but this in particular seemed easy to
    trigger, so blacklist this.

    Opt in for known good archs.

    Signed-off-by: Peter Zijlstra
    Reported-by: Mikulas Patocka
    Cc: David Miller
    Cc: Chris Metcalf
    Cc: James Bottomley
    Cc: Vineet Gupta
    Cc: Jason Low
    Cc: Waiman Long
    Cc: "James E.J. Bottomley"
    Cc: Paul McKenney
    Cc: John David Anglin
    Cc: James Hogan
    Cc: Linus Torvalds
    Cc: Davidlohr Bueso
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Russell King
    Cc: Will Deacon
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: sparclinux@vger.kernel.org
    Link: http://lkml.kernel.org/r/20140606175316.GV13930@laptop.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     
  • commit 4320f6b1d9db4ca912c5eb6ecb328b2e090e1586 upstream.

    The commit [247bc037: PM / Sleep: Mitigate race between the freezer
    and request_firmware()] introduced the finer state control, but it
    also leads to a new bug; for example, a bug report regarding the
    firmware loading of intel BT device at suspend/resume:
    https://bugzilla.novell.com/show_bug.cgi?id=873790

    The root cause seems to be a small window between the process resume
    and the clear of usermodehelper lock. The request_firmware() function
    checks the UMH lock and gives up when it's in UMH_DISABLE state. This
    is for avoiding the invalid f/w loading during suspend/resume phase.
    The problem is, however, that usermodehelper_enable() is called at the
    end of thaw_processes(). Thus, a thawed process in between can kick
    off the f/w loader code path (in this case, via btusb_setup_intel())
    even before the call of usermodehelper_enable(). Then
    usermodehelper_read_trylock() returns an error and request_firmware()
    spews WARN_ON() in the end.

    This oneliner patch fixes the issue just by setting to UMH_FREEZING
    state again before restarting tasks, so that the call of
    request_firmware() will be blocked until the end of this function
    instead of returning an error.

    Fixes: 247bc0374254 (PM / Sleep: Mitigate race between the freezer and request_firmware())
    Link: https://bugzilla.novell.com/show_bug.cgi?id=873790
    Signed-off-by: Takashi Iwai
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Greg Kroah-Hartman

    Takashi Iwai
     
  • commit 16927776ae757d0d132bdbfabbfe2c498342bd59 upstream.

    Sharvil noticed with the posix timer_settime interface, using the
    CLOCK_REALTIME_ALARM or CLOCK_BOOTTIME_ALARM clockid, if the users
    tried to specify a relative time timer, it would incorrectly be
    treated as absolute regardless of the state of the flags argument.

    This patch corrects this, properly checking the absolute/relative flag,
    as well as adds further error checking that no invalid flag bits are set.

    Reported-by: Sharvil Nanavati
    Signed-off-by: John Stultz
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Prarit Bhargava
    Cc: Sharvil Nanavati
    Link: http://lkml.kernel.org/r/1404767171-6902-1-git-send-email-john.stultz@linaro.org
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    John Stultz
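
    A small userspace example of the case being fixed: a relative
    timer_settime() on an alarm clockid (standard POSIX timer API, link with
    -lrt; creating an ALARM timer needs CAP_WAKE_ALARM):

        #include <signal.h>
        #include <stdio.h>
        #include <time.h>
        #include <unistd.h>

        #ifndef CLOCK_BOOTTIME_ALARM
        #define CLOCK_BOOTTIME_ALARM 9
        #endif

        int main(void)
        {
                timer_t timerid;
                struct sigevent sev = { .sigev_notify = SIGEV_SIGNAL,
                                        .sigev_signo  = SIGALRM };
                struct itimerspec its = { .it_value.tv_sec = 5 };

                if (timer_create(CLOCK_BOOTTIME_ALARM, &sev, &timerid)) {
                        perror("timer_create");
                        return 1;
                }
                /*
                 * flags == 0 requests a RELATIVE expiry (5s from now).
                 * Before the fix this was treated as absolute.
                 */
                if (timer_settime(timerid, 0, &its, NULL)) {
                        perror("timer_settime");
                        return 1;
                }
                pause();        /* default SIGALRM action ends the process */
                return 0;
        }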
     
  • commit 97b8ee845393701edc06e27ccec2876ff9596019 upstream.

    ring_buffer_poll_wait() should always add the poll_table to its wait_queue,
    even when there is immediate data available. Otherwise, the following epoll
    and read sequence will eventually hang forever:

    1. Put some data to make the trace_pipe ring_buffer read ready first
    2. epoll_ctl(efd, EPOLL_CTL_ADD, trace_pipe_fd, ee)
    3. epoll_wait()
    4. read(trace_pipe_fd) till EAGAIN
    5. Add some more data to the trace_pipe ring_buffer
    6. epoll_wait() -> this epoll_wait() will block forever

    ~ During the epoll_ctl(efd, EPOLL_CTL_ADD,...) call in step 2,
    ring_buffer_poll_wait() returns immediately without adding poll_table,
    which has poll_table->_qproc pointing to ep_poll_callback(), to its
    wait_queue.
    ~ During the epoll_wait() call in step 3 and step 6,
    ring_buffer_poll_wait() cannot add ep_poll_callback() to its wait_queue
    because the poll_table->_qproc is NULL and it is how epoll works.
    ~ When there is new data available in step 6, ring_buffer does not know
    it has to call ep_poll_callback() because it is not in its wait queue.
    Hence, block forever.

    Other poll implementations seem to call poll_wait() unconditionally as the
    very first thing they do; for example, tcp_poll() in tcp.c.

    Link: http://lkml.kernel.org/p/20140610060637.GA14045@devbig242.prn2.facebook.com

    Fixes: 2a2cc8f7c4d0 "ftrace: allow the event pipe to be polled"
    Reviewed-by: Chris Mason
    Signed-off-by: Martin Lau
    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    Martin Lau
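
    A condensed userspace sketch of the six-step sequence above (assumes the
    usual debugfs tracing path and that some trace data is being generated):

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/epoll.h>
        #include <unistd.h>

        int main(void)
        {
                char buf[4096];
                struct epoll_event ee = { .events = EPOLLIN }, out;
                int fd  = open("/sys/kernel/debug/tracing/trace_pipe",
                               O_RDONLY | O_NONBLOCK);
                int efd = epoll_create1(0);

                if (fd < 0 || efd < 0)
                        return 1;

                epoll_ctl(efd, EPOLL_CTL_ADD, fd, &ee); /* step 2: data ready */
                epoll_wait(efd, &out, 1, -1);           /* step 3 */
                while (read(fd, buf, sizeof(buf)) > 0)  /* step 4: until EAGAIN */
                        ;
                /*
                 * Steps 5-6: once new data arrives, this wait used to block
                 * forever because the ring buffer never added the epoll
                 * callback to its wait queue.
                 */
                epoll_wait(efd, &out, 1, -1);
                puts("woken up");
                return 0;
        }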
     
  • commit 8abfb8727f4a724d31f9ccfd8013fbd16d539445 upstream.

    Currently the stacktrace trace option is not honored for trace_printk()
    with a constant string argument; the reason is that
    __trace_puts()/__trace_bputs() do not call ftrace_trace_stack().

    In contrast, when trace_printk() is used with a non-constant string
    argument (which calls into __trace_printk()/__trace_bprintk()), the
    stacktrace option does work. This inconsistent behaviour confuses
    users a lot.

    Link: http://lkml.kernel.org/p/51E7A7C9.9040401@huawei.com

    Signed-off-by: zhangwei(Jovi)
    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    zhangwei(Jovi)
     
  • commit 5f8bf2d263a20b986225ae1ed7d6759dc4b93af9 upstream.

    Running my ftrace tests on PowerPC, it failed the test that checks
    if function_graph tracer is affected by the stack tracer. It was.
    Looking into this, I found that the update_function_graph_func()
    must be called even if the trampoline function is not changed.
    This is because archs like PowerPC do not support ftrace_ops being
    passed by assembly and instead uses a helper function (what the
    trampoline function points to). Since this function is not changed
    even when multiple ftrace_ops are added to the code, the test that
    falls out before calling update_function_graph_func() will miss that
    the update must still be done.

    Call update_function_graph_func() for all calls to
    update_ftrace_function().

    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Red Hat)
     

18 Jul, 2014

8 commits

  • commit 27e35715df54cbc4f2d044f681802ae30479e7fb upstream.

    When the rtmutex fast path is enabled the slow unlock function can
    create the following situation:

    spin_lock(foo->m->wait_lock);
    foo->m->owner = NULL;
                                rt_mutex_lock(foo->m);   <-- fast path
                                free = atomic_dec_and_test(foo->refcnt);
                                rt_mutex_unlock(foo->m); <-- fast path
                                if (free)
                                        kfree(foo);

    spin_unlock(foo->m->wait_lock);  <--- Use after free.

    Plug the race by changing the slow unlock to the following scheme:

    while (!rt_mutex_has_waiters(m)) {
            /* Clear the waiters bit in m->owner */
            clear_rt_mutex_waiters(m);
            owner = rt_mutex_owner(m);
            spin_unlock(m->wait_lock);
            if (cmpxchg(m->owner, owner, 0) == owner)
                    return;
            spin_lock(m->wait_lock);
    }

    So in case of a new waiter incoming while the owner tries the slow
    path unlock we have two situations:

    owner (slow path unlock)          incoming waiter (lock path)

    unlock(wait_lock);
                                      lock(wait_lock);
    cmpxchg(p, owner, 0) == owner
                                      mark_rt_mutex_waiters(lock);
                                      acquire(lock);

    Or:

    unlock(wait_lock);
                                      lock(wait_lock);
                                      mark_rt_mutex_waiters(lock);
    cmpxchg(p, owner, 0) != owner
                                      enqueue_waiter();
                                      unlock(wait_lock);
    lock(wait_lock);
    wakeup_next_waiter();
    unlock(wait_lock);
                                      lock(wait_lock);
                                      acquire(lock);

    If the fast path is disabled, then the simple

    m->owner = NULL;
    unlock(m->wait_lock);

    is sufficient as all access to m->owner is serialized via
    m->wait_lock;

    Also document and clarify the wakeup_next_waiter function as suggested
    by Oleg Nesterov.

    Reported-by: Steven Rostedt
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Steven Rostedt
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140611183852.937945560@linutronix.de
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Mike Galbraith
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 3d5c9340d1949733eb37616abd15db36aef9a57c upstream.

    Even in the case when deadlock detection is not requested by the
    caller, we can detect deadlocks. Right now the code stops the lock
    chain walk and keeps the waiter enqueued, even on itself. Silly not to
    yell when such a scenario is detected and to keep the waiter enqueued.

    Return -EDEADLK unconditionally and handle it at the call sites.

    The futex calls return -EDEADLK. The non futex ones dequeue the
    waiter, throw a warning and put the task into a schedule loop.

    Tagged for stable as it makes the code more robust.

    Signed-off-by: Thomas Gleixner
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Brad Mouring
    Link: http://lkml.kernel.org/r/20140605152801.836501969@linutronix.de
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Mike Galbraith
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 82084984383babe728e6e3c9a8e5c46278091315 upstream.

    When we walk the lock chain, we drop all locks after each step. So the
    lock chain can change under us before we reacquire the locks. That's
    harmless in principle as we just follow the wrong lock path. But it
    can lead to a false positive in the dead lock detection logic:

    T0 holds L0
    T0 blocks on L1 held by T1
    T1 blocks on L2 held by T2
    T2 blocks on L3 held by T3
    T3 blocks on L4 held by T4

    Now we walk the chain

    lock T1 -> lock L2 -> adjust L2 -> unlock T1 ->
    lock T2 -> adjust T2 -> drop locks

    T2 times out and blocks on L0

    Now we continue:

    lock T2 -> lock L0 -> deadlock detected, but it's not a deadlock at all.

    Brad tried to work around that in the deadlock detection logic itself,
    but the more I looked at it the less I liked it, because it's crystal
    ball magic after the fact.

    We actually can detect a chain change very simply:

    lock T1 -> lock L2 -> adjust L2 -> unlock T1 -> lock T2 -> adjust T2 ->

    next_lock = T2->pi_blocked_on->lock;

    drop locks

    T2 times out and blocks on L0

    Now we continue:

    lock T2 ->

    if (next_lock != T2->pi_blocked_on->lock)
    return;

    So if we detect that T2 is now blocked on a different lock we stop the
    chain walk. That's also correct in the following scenario:

    lock T1 -> lock L2 -> adjust L2 -> unlock T1 -> lock T2 -> adjust T2 ->

    next_lock = T2->pi_blocked_on->lock;

    drop locks

    T3 times out and drops L3
    T2 acquires L3 and blocks on L4 now

    Now we continue:

    lock T2 ->

    if (next_lock != T2->pi_blocked_on->lock)
    return;

    We don't have to follow up the chain at that point, because T2
    propagated our priority up to T4 already.

    [ Folded a cleanup patch from peterz ]

    Signed-off-by: Thomas Gleixner
    Reported-by: Brad Mouring
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140605152801.930031935@linutronix.de
    Signed-off-by: Mike Galbraith
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 397335f004f41e5fcf7a795e94eb3ab83411a17c upstream.

    The current deadlock detection logic does not work reliably due to the
    following early exit path:

    /*
     * Drop out, when the task has no waiters. Note,
     * top_waiter can be NULL, when we are in the deboosting
     * mode!
     */
    if (top_waiter && (!task_has_pi_waiters(task) ||
                       top_waiter != task_top_pi_waiter(task)))
            goto out_unlock_pi;

    So this not only exits when the task has no waiters, it also exits
    unconditionally when the current waiter is not the top priority waiter
    of the task.

    So in a nested locking scenario, it might abort the lock chain walk
    and therefore miss a potential deadlock.

    Simple fix: Continue the chain walk, when deadlock detection is
    enabled.

    We also avoid the whole enqueue if we detect the deadlock right away
    (A-A). It's an optimization, but it also prevents another waiter who
    comes in after the detection and before the task has undone the damage
    from observing the situation, detecting the deadlock and returning
    -EDEADLOCK, which would be wrong as the other task is not in a deadlock
    situation.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Reviewed-by: Steven Rostedt
    Cc: Lai Jiangshan
    Link: http://lkml.kernel.org/r/20140522031949.725272460@linutronix.de
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Mike Galbraith
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 8b8b36834d0fff67fc8668093f4312dd04dcf21d upstream.

    The per_cpu buffers are created one per possible CPU. But this does
    not mean that those CPUs are online, nor that they even exist.

    With the addition of the ring buffer polling, it assumes that the
    caller polls on an existing buffer. But this is not the case if
    the user reads trace_pipe from a CPU that does not exist, and this
    causes the kernel to crash.

    The simple fix is to check the cpu against the buffer bitmask to see
    whether the buffer was allocated, and to return -ENODEV if it was
    not.

    More updates were done to pass the -ENODEV back up to userspace.

    Link: http://lkml.kernel.org/r/5393DB61.6060707@oracle.com

    Reported-by: Sasha Levin
    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Red Hat)
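
    A sketch of the check described above (names follow kernel/trace/
    ring_buffer.c but are abbreviated; treat as illustrative):

        /* in ring_buffer_wait()/ring_buffer_poll_wait() */
        if (cpu == RING_BUFFER_ALL_CPUS)
                work = &buffer->irq_work;
        else {
                if (!cpumask_test_cpu(cpu, buffer->cpumask))
                        return -ENODEV;         /* buffer never allocated */
                cpu_buffer = buffer->buffers[cpu];
                work = &cpu_buffer->irq_work;
        }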
     
  • commit 5a6024f1604eef119cf3a6fa413fe0261a81a8f3 upstream.

    When hot-adding and onlining a CPU, a kernel panic occurs, showing the
    following call trace.

    BUG: unable to handle kernel paging request at 0000000000001d08
    IP: [] __alloc_pages_nodemask+0x9d/0xb10
    PGD 0
    Oops: 0000 [#1] SMP
    ...
    Call Trace:
    [] ? cpumask_next_and+0x35/0x50
    [] ? find_busiest_group+0x113/0x8f0
    [] ? deactivate_slab+0x349/0x3c0
    [] new_slab+0x91/0x300
    [] __slab_alloc+0x2bb/0x482
    [] ? copy_process.part.25+0xfc/0x14c0
    [] ? load_balance+0x218/0x890
    [] ? sched_clock+0x9/0x10
    [] ? trace_clock_local+0x9/0x10
    [] kmem_cache_alloc_node+0x8c/0x200
    [] copy_process.part.25+0xfc/0x14c0
    [] ? trace_buffer_unlock_commit+0x4d/0x60
    [] ? kthread_create_on_node+0x140/0x140
    [] do_fork+0xbc/0x360
    [] kernel_thread+0x26/0x30
    [] kthreadd+0x2c2/0x300
    [] ? kthread_create_on_cpu+0x60/0x60
    [] ret_from_fork+0x7c/0xb0
    [] ? kthread_create_on_cpu+0x60/0x60

    In my investigation, I found the root cause is wq_numa_possible_cpumask.
    All entries of wq_numa_possible_cpumask are allocated by
    alloc_cpumask_var_node() and are used without being initialized,
    so they contain garbage values.

    When hot-adding and onlining CPU, wq_update_unbound_numa() is called.
    wq_update_unbound_numa() calls alloc_unbound_pwq(). And alloc_unbound_pwq()
    calls get_unbound_pool(). In get_unbound_pool(), worker_pool->node is set
    as follow:

    /* if cpumask is contained inside a NUMA node, we belong to that node */
    if (wq_numa_enabled) {
            for_each_node(node) {
                    if (cpumask_subset(pool->attrs->cpumask,
                                       wq_numa_possible_cpumask[node])) {
                            pool->node = node;
                            break;
                    }
            }
    }

    But wq_numa_possible_cpumask[node] does not hold a correct cpumask, so the
    wrong node is selected and, as a result, the kernel panics.

    With this patch, all entries of wq_numa_possible_cpumask are allocated by
    zalloc_cpumask_var_node() so that they are initialized, and the panic
    disappears.

    Signed-off-by: Yasuaki Ishimatsu
    Reviewed-by: Lai Jiangshan
    Signed-off-by: Tejun Heo
    Fixes: bce903809ab3 ("workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]")
    Signed-off-by: Greg Kroah-Hartman

    Yasuaki Ishimatsu
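
    The fix itself is essentially a one-word change in wq_numa_init(); a
    sketch (the BUG_ON wrapper and node checks in the real code are omitted):

        /* before: the mask's contents are uninitialized garbage */
        alloc_cpumask_var_node(&tbl[node], GFP_KERNEL, node);

        /* after: the mask starts out zeroed ... */
        zalloc_cpumask_var_node(&tbl[node], GFP_KERNEL, node);

        /* ... and only then gets the real bits set */
        for_each_possible_cpu(cpu)
                cpumask_set_cpu(cpu, tbl[cpu_to_node(cpu)]);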
     
  • commit 391acf970d21219a2a5446282d3b20eace0c0d7a upstream.

    When running with kernel 3.15-rc7+, the following bug occurs:
    [ 9969.258987] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586
    [ 9969.359906] in_atomic(): 1, irqs_disabled(): 0, pid: 160655, name: python
    [ 9969.441175] INFO: lockdep is turned off.
    [ 9969.488184] CPU: 26 PID: 160655 Comm: python Tainted: G A 3.15.0-rc7+ #85
    [ 9969.581032] Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 1.39 11/16/2012
    [ 9969.706052] ffffffff81a20e60 ffff8803e941fbd0 ffffffff8162f523 ffff8803e941fd18
    [ 9969.795323] ffff8803e941fbe0 ffffffff8109995a ffff8803e941fc58 ffffffff81633e6c
    [ 9969.884710] ffffffff811ba5dc ffff880405c6b480 ffff88041fdd90a0 0000000000002000
    [ 9969.974071] Call Trace:
    [ 9970.003403] [] dump_stack+0x4d/0x66
    [ 9970.065074] [] __might_sleep+0xfa/0x130
    [ 9970.130743] [] mutex_lock_nested+0x3c/0x4f0
    [ 9970.200638] [] ? kmem_cache_alloc+0x1bc/0x210
    [ 9970.272610] [] cpuset_mems_allowed+0x27/0x140
    [ 9970.344584] [] ? __mpol_dup+0x63/0x150
    [ 9970.409282] [] __mpol_dup+0xe5/0x150
    [ 9970.471897] [] ? __mpol_dup+0x63/0x150
    [ 9970.536585] [] ? copy_process.part.23+0x606/0x1d40
    [ 9970.613763] [] ? trace_hardirqs_on+0xd/0x10
    [ 9970.683660] [] ? monotonic_to_bootbased+0x2f/0x50
    [ 9970.759795] [] copy_process.part.23+0x670/0x1d40
    [ 9970.834885] [] do_fork+0xd8/0x380
    [ 9970.894375] [] ? __audit_syscall_entry+0x9c/0xf0
    [ 9970.969470] [] SyS_clone+0x16/0x20
    [ 9971.030011] [] stub_clone+0x69/0x90
    [ 9971.091573] [] ? system_call_fastpath+0x16/0x1b

    The cause is that cpuset_mems_allowed() tries to take
    mutex_lock(&callback_mutex) under rcu_read_lock() (which is held in
    __mpol_dup()). Since the access to the cpuset in cpuset_mems_allowed()
    is already under rcu_read_lock, we can shrink the rcu_read_lock
    protection region in __mpol_dup() to cover only the cpuset access in
    current_cpuset_is_being_rebound(), and thereby avoid this bug.

    This patch is a temporary solution that just addresses the bug
    mentioned above; it cannot fix the long-standing issue of cpuset.mems
    rebinding on fork():

    "When the forker's task_struct is duplicated (which includes
    ->mems_allowed) and it races with an update to cpuset_being_rebound
    in update_tasks_nodemask() then the task's mems_allowed doesn't get
    updated. And the child task's mems_allowed can be wrong if the
    cpuset's nodemask changes before the child has been added to the
    cgroup's tasklist."

    Signed-off-by: Gu Zheng
    Acked-by: Li Zefan
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Gu Zheng
     
  • commit bddbceb688c6d0decaabc7884fede319d02f96c8 upstream.

    Uevents are suppressed during attributes registration, but never
    restored, so kobject_uevent() does nothing.

    Signed-off-by: Maxime Bizon
    Signed-off-by: Tejun Heo
    Fixes: 226223ab3c4118ddd10688cc2c131135848371ab
    Signed-off-by: Greg Kroah-Hartman

    Maxime Bizon
     

10 Jul, 2014

1 commit


07 Jul, 2014

2 commits

  • commit 4af4206be2bd1933cae20c2b6fb2058dbc887f7c upstream.

    syscall_regfunc() and syscall_unregfunc() should set/clear
    TIF_SYSCALL_TRACEPOINT system-wide, but do_each_thread() can race
    with copy_process() and miss the new child which was not added to
    the process/thread lists yet.

    Change copy_process() to update the child's TIF_SYSCALL_TRACEPOINT
    under tasklist.

    Link: http://lkml.kernel.org/p/20140413185854.GB20668@redhat.com

    Fixes: a871bd33a6c0 "tracing: Add syscall tracepoints"
    Acked-by: Frederic Weisbecker
    Acked-by: Paul E. McKenney
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    Oleg Nesterov
     
  • commit 379cfdac37923653c9d4242d10052378b7563005 upstream.

    In order to prevent the saved cmdline cache from being filled when
    tracing is not active, the comms are only recorded after a trace event
    is recorded.

    The problem is, a comm can fail to be recorded if the trace_cmdline_lock
    is held. That lock is taken via a trylock to allow it to happen from
    any context (including NMI). If the lock fails to be taken, the comm
    is skipped. No big deal, as we will try again later.

    But! Because of the code that was added to only record after an event,
    we may not try again later as the recording is made as a oneshot per
    event per CPU.

    Only disable the recording of the comm if the comm is actually recorded.

    Fixes: 7ffbd48d5cab "tracing: Cache comms only after an event occurred"
    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Red Hat)
     

01 Jul, 2014

2 commits

  • commit 1e77d0a1ed7417d2a5a52a7b8d32aea1833faa6c upstream.

    Till reported that the spurious interrupt detection of threaded
    interrupts is broken in two ways:

    - note_interrupt() is called for each action thread of a shared
    interrupt line. That's wrong, as we are only interested in whether none
    of the device drivers felt responsible for the interrupt, but by
    calling multiple times for a single interrupt line we account
    IRQ_NONE even if one of the drivers felt responsible.

    - note_interrupt() when called from the thread handler is not
    serialized. That leaves the members of irq_desc which are used for
    the spurious detection unprotected.

    To solve this we need to defer the spurious detection of a threaded
    interrupt to the next hardware interrupt context where we have
    implicit serialization.

    If note_interrupt is called with action_ret == IRQ_WAKE_THREAD, we
    check whether the previous interrupt requested a deferred check. If
    not, we request a deferred check for the next hardware interrupt and
    return.

    If set, we check whether one of the interrupt threads signaled
    success. Depending on this information we feed the result into the
    spurious detector.

    If one primary handler of a shared interrupt returns IRQ_HANDLED we
    disable the deferred check of irq threads on the same line, as we have
    found at least one device driver who cared.

    Reported-by: Till Straumann
    Signed-off-by: Thomas Gleixner
    Tested-by: Austin Schuh
    Cc: Oliver Hartkopp
    Cc: Wolfgang Grandegger
    Cc: Pavel Pisa
    Cc: Marc Kleine-Budde
    Cc: linux-can@vger.kernel.org
    Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1303071450130.22263@ionos
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 4e52365f279564cef0ddd41db5237f0471381093 upstream.

    When tracing a process in another pid namespace, it's important for fork
    event messages to contain the child's pid as seen from the tracer's pid
    namespace, not the parent's. Otherwise, the tracer won't be able to
    correlate the fork event with later SIGTRAP signals it receives from the
    child.

    We still risk a race condition if a ptracer from a different pid
    namespace attaches after we compute the pid_t value. However, sending a
    bogus fork event message in this unlikely scenario is still a vast
    improvement over the status quo where we always send bogus fork event
    messages to debuggers in a different pid namespace than the forking
    process.

    Signed-off-by: Matthew Dempsky
    Acked-by: Oleg Nesterov
    Cc: Kees Cook
    Cc: Julien Tinnes
    Cc: Roland McGrath
    Cc: Jan Kratochvil
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Matthew Dempsky
     

27 Jun, 2014

2 commits

  • commit 0e576acbc1d9600cf2d9b4a141a2554639959d50 upstream.

    If CONFIG_NO_HZ=n tick_nohz_get_sleep_length() returns NSEC_PER_SEC/HZ.

    If CONFIG_NO_HZ=y and the nohz functionality is disabled via the
    command line option "nohz=off" or not enabled due to missing hardware
    support, then tick_nohz_get_sleep_length() returns 0. That happens
    because ts->sleep_length is never set in that case.

    Set it to NSEC_PER_SEC/HZ when the NOHZ mode is inactive.

    Reported-by: Michal Hocko
    Reported-by: Borislav Petkov
    Signed-off-by: Thomas Gleixner
    Cc: Rui Xiang
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • [ Upstream commit 90f62cf30a78721641e08737bda787552428061e ]

    By passing a netlink socket to a more privileged executable and then
    fooling that executable into writing to the socket data that happens
    to be a valid netlink message, it is possible to make that privileged
    executable do something it did not intend to do.

    To keep this from happening, replace bare capable and ns_capable calls
    with netlink_capable, netlink_net_capable and netlink_ns_capable calls,
    which act the same as the previous calls except that they also verify
    that the opener of the socket had the desired permissions.

    Reported-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     

17 Jun, 2014

2 commits

  • commit a3c54931199565930d6d84f4c3456f6440aefd41 upstream.

    Fixes an easy DoS and possible information disclosure.

    This does nothing about the broken state of x32 auditing.

    eparis: If the admin has enabled auditd and has specifically loaded
    audit rules. This bug has been around since before git. Wow...

    Signed-off-by: Andy Lutomirski
    Signed-off-by: Eric Paris
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andy Lutomirski
     
  • commit 23adbe12ef7d3d4195e80800ab36b37bee28cd03 upstream.

    The kernel has no concept of capabilities with respect to inodes; inodes
    exist independently of namespaces. For example, inode_capable(inode,
    CAP_LINUX_IMMUTABLE) would be nonsense.

    This patch changes inode_capable to check for uid and gid mappings and
    renames it to capable_wrt_inode_uidgid, which should make it more
    obvious what it does.

    Fixes CVE-2014-4014.

    Cc: Theodore Ts'o
    Cc: Serge Hallyn
    Cc: "Eric W. Biederman"
    Cc: Dave Chinner
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andy Lutomirski
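
    A sketch of the renamed helper's semantics as described above (close to,
    but not necessarily identical to, the upstream implementation):

        /*
         * Like ns_capable(), but additionally require that the inode's uid
         * and gid are mapped into the caller's user namespace.
         */
        bool capable_wrt_inode_uidgid(const struct inode *inode, int cap)
        {
                struct user_namespace *ns = current_user_ns();

                return ns_capable(ns, cap) &&
                       kuid_has_mapping(ns, inode->i_uid) &&
                       kgid_has_mapping(ns, inode->i_gid);
        }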
     

12 Jun, 2014

1 commit

  • commit 723478c8a471403c53cf144999701f6e0c4bbd11 upstream.

    /proc/sys/kernel/perf_event_max_sample_rate will accept
    negative values as well as 0.

    Negative values are unreasonable, and 0 causes a
    divide by zero exception in perf_proc_update_handler.

    This patch enforces a lower limit of 1.

    Signed-off-by: Knut Petersen
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/5242DB0C.4070005@t-online.de
    Signed-off-by: Ingo Molnar
    Cc: Weng Meiling
    Signed-off-by: Greg Kroah-Hartman

    Knut Petersen