28 Aug, 2014
12 commits
-
commit e6d30ab1e7d1281784672c0fc2ffa385cfb7279e upstream.
All the callers of irq_create_of_mapping() pass the contents of a struct
of_phandle_args structure to the function. Since all the callers already
have an of_phandle_args pointer, why not pass it directly to
irq_create_of_mapping()?
Signed-off-by: Grant Likely
Acked-by: Michal Simek
Acked-by: Tony Lindgren
Cc: Thomas Gleixner
Cc: Russell King
Cc: Ralf Baechle
Cc: Benjamin Herrenschmidt
Signed-off-by: Shawn Guo
Conflicts:
arch/arm/mach-integrator/pci_v3.c
arch/mips/pci/pci-rt3883.c
kernel/irq/irqdomain.c
-
Commits 11d4616bd07f ("futex: revert back to the explicit waiter
counting code") and 69cd9eba3886 ("futex: avoid race between requeue and
wake") changed some of the finer details of how we think about futexes.
One was a late fix and the other a consequence of overlooking the whole
requeuing logic.
The first change caused our documentation to be incorrect, and the
second made us aware that we need to explicitly add more details to it.
Signed-off-by: Davidlohr Bueso
Signed-off-by: Linus Torvalds
-
Jan Stancek reported:
"pthread_cond_broadcast/4-1.c testcase from openposix testsuite (LTP)
occasionally fails, because some threads fail to wake up.
Testcase creates 5 threads, which are all waiting on the same condition.
Main thread then calls pthread_cond_broadcast() without holding mutex,
which calls:
futex(uaddr1, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, uaddr2, ..)
This immediately wakes up single thread A, which unlocks mutex and
tries to wake up another thread:
futex(uaddr2, FUTEX_WAKE_PRIVATE, 1)
If thread A manages to call futex_wake() before any waiters are
requeued for uaddr2, no other thread is woken up"
The ordering constraints for the hash bucket waiter counting are that
the waiter counts have to be incremented _before_ getting the spinlock
(because the spinlock acts as part of the memory barrier), but the
"requeue" operation didn't honor those rules, and nobody had even
thought about that case.
This fairly simple patch just increments the waiter count for the target
hash bucket (hb2) when requeueing a futex before taking the locks. It
then decrements them again after releasing the lock - the code that
actually moves the futex(es) between hash buckets will do the additional
required waiter count housekeeping.
Reported-and-tested-by: Jan Stancek
Acked-by: Davidlohr Bueso
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: stable@vger.kernel.org # 3.14
Signed-off-by: Linus Torvalds
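A sketch of the shape of this fix, assuming the hb_waiters_inc()/hb_waiters_dec() and double_lock_hb() helpers from kernel/futex.c of that era:

    /* in futex_requeue(), before taking both bucket locks */
    hb_waiters_inc(hb2);          /* count the incoming waiters early */
    double_lock_hb(hb1, hb2);

    /* ... move futex_q entries from hb1 to hb2, adjusting counts ... */

    double_unlock_hb(hb1, hb2);
    hb_waiters_dec(hb2);          /* drop the temporary count again */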
-
Srikar Dronamraju reports that commit b0c29f79ecea ("futexes: Avoid
taking the hb->lock if there's nothing to wake up") causes java threads
getting stuck on futexes when running specjbb on a power7 numa box.
The cause appears to be that the powerpc spinlocks aren't using the same
ticket lock model that we use on x86 (and other) architectures, which in
turn results in the "spin_is_locked()" test in hb_waiters_pending()
occasionally reporting an unlocked spinlock even when there are pending
waiters.
So this reinstates Davidlohr Bueso's original explicit waiter counting
code, which I had convinced Davidlohr to drop in favor of figuring out
the pending waiters by just using the existing state of the spinlock and
the wait queue.
Reported-and-tested-by: Srikar Dronamraju
Original-code-by: Davidlohr Bueso
Signed-off-by: Linus Torvalds
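The reinstated counting is architecture-neutral; a minimal sketch, assuming the 3.14-era kernel/futex.c layout (the barrier was spelled smp_mb__after_atomic_inc() in 3.14 and smp_mb__after_atomic() in later kernels):

    struct futex_hash_bucket {
            atomic_t waiters;       /* explicit count of queued waiters */
            spinlock_t lock;
            struct plist_head chain;
    } ____cacheline_aligned_in_smp;

    static inline void hb_waiters_inc(struct futex_hash_bucket *hb)
    {
            atomic_inc(&hb->waiters);
            smp_mb__after_atomic(); /* full barrier, pairs with the waker side */
    }

    static inline void hb_waiters_dec(struct futex_hash_bucket *hb)
    {
            atomic_dec(&hb->waiters);
    }

    static inline int hb_waiters_pending(struct futex_hash_bucket *hb)
    {
            /* no spin_is_locked() heuristics: works for any lock model */
            return atomic_read(&hb->waiters);
    }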
"futexes: Increase hash table size for better performance"
introduces a new alloc_large_system_hash() call.alloc_large_system_hash() however may allocate less memory than
requested, e.g. limited by MAX_ORDER.Hence pass a pointer to alloc_large_system_hash() which will
contain the hash shift when the function returns. Afterwards
correctly set futex_hashsize.
Fixes a crash on s390 where the requested allocation size was
4MB but only 1MB was allocated.
Signed-off-by: Heiko Carstens
Cc: Darren Hart
Cc: Peter Zijlstra
Cc: Paul E. McKenney
Cc: Waiman Long
Cc: Jason Low
Cc: Davidlohr Bueso
Link: http://lkml.kernel.org/r/20140116135450.GA4345@osiris
Signed-off-by: Ingo Molnar
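A sketch of the corrected initialization: alloc_large_system_hash() reports what it actually allocated through its hash-shift out-parameter, so futex_hashsize is recomputed from that afterwards:

    /* in futex_init() */
    futex_queues = alloc_large_system_hash("futex", sizeof(*futex_queues),
                                           futex_hashsize, 0,
                                           futex_hashsize < 256 ? HASH_SMALL : 0,
                                           &futex_shift, NULL,
                                           futex_hashsize, futex_hashsize);
    /* the allocation may have been clamped, e.g. by MAX_ORDER */
    futex_hashsize = 1UL << futex_shift;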
-
In futex_wake() there is clearly no point in taking the hb->lock
if we know beforehand that there are no tasks to be woken. While
the hash bucket's plist head is a cheap way of knowing this, we
cannot rely 100% on it as there is a racy window between the
futex_wait call and when the task is actually added to the
plist. To this end, we couple it with the spinlock check as
tasks trying to enter the critical region are most likely
potential waiters that will be added to the plist, thus
preventing tasks sleeping forever if wakers don't acknowledge
all possible waiters.
Furthermore, the futex ordering guarantees are preserved,
ensuring that waiters either observe the changed user space
value before blocking or are woken by a concurrent waker. For
wakers, this is done by relying on the barriers in
get_futex_key_refs() -- for archs that do not have implicit mb
in atomic_inc(), we explicitly add them through a new
futex_get_mm function. For waiters we rely on the fact that
spin_lock calls already update the head counter, so spinners
are visible even if the lock hasn't been acquired yet.
For more details please refer to the updated comments in the
code and related discussion:
https://lkml.org/lkml/2013/11/26/556
Special thanks to tglx for careful review and feedback.
Suggested-by: Linus Torvalds
Reviewed-by: Darren Hart
Reviewed-by: Thomas Gleixner
Reviewed-by: Peter Zijlstra
Signed-off-by: Davidlohr Bueso
Cc: Paul E. McKenney
Cc: Mike Galbraith
Cc: Jeff Mahoney
Cc: Scott Norton
Cc: Tom Vaden
Cc: Aswin Chandramouleeswaran
Cc: Waiman Long
Cc: Jason Low
Cc: Andrew Morton
Link: http://lkml.kernel.org/r/1389569486-25487-5-git-send-email-davidlohr@hp.com
Signed-off-by: Ingo Molnar
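For the waker-side barrier the commit describes, the shape is roughly the following (a sketch; 3.14 spells the primitive smp_mb__after_atomic_inc(), later kernels smp_mb__after_atomic()):

    static inline void futex_get_mm(union futex_key *key)
    {
            atomic_inc(&key->private.mm->mm_count);
            /*
             * Full barrier: pairs with the barrier implied by the
             * waiter's spin_lock(&hb->lock), so a waker observes either
             * the changed user space value or the queued waiter.
             */
            smp_mb__after_atomic();
    }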
-
That's essential if you want to hack on futexes.
Reviewed-by: Darren Hart
Reviewed-by: Peter Zijlstra
Reviewed-by: Paul E. McKenney
Signed-off-by: Thomas Gleixner
Signed-off-by: Davidlohr Bueso
Cc: Mike Galbraith
Cc: Jeff Mahoney
Cc: Linus Torvalds
Cc: Randy Dunlap
Cc: Scott Norton
Cc: Tom Vaden
Cc: Aswin Chandramouleeswaran
Cc: Waiman Long
Cc: Jason Low
Cc: Andrew Morton
Link: http://lkml.kernel.org/r/1389569486-25487-4-git-send-email-davidlohr@hp.com
Signed-off-by: Ingo Molnar
-
Currently, the futex global hash table suffers from its fixed,
smallish (by today's standards) size of 256 entries, as well as
its lack of NUMA awareness. Large systems, using many futexes,
can be prone to high amounts of collisions; where these futexes
hash to the same bucket and lead to extra contention on the same
hb->lock. Furthermore, cacheline bouncing is a reality when we
have multiple hb->locks residing on the same cacheline and
different futexes hash to adjacent buckets.
This patch keeps the current static size of 16 entries for small
systems, or otherwise, 256 * ncpus (or larger as we need to
round the number to a power of 2). Note that this number of CPUs
accounts for all CPUs that can ever be available in the system,
taking into consideration things like hotplugging. While we do
impose extra overhead at bootup by making the hash table larger,
this is a one time thing, and does not shadow the benefits of
this patch.
Furthermore, as suggested by tglx, by cache aligning the hash
buckets we can avoid access across cacheline boundaries and also
avoid massive cache line bouncing if multiple cpus are hammering
away at different hash buckets which happen to reside in the
same cache line.
Also, similar to other core kernel components (pid, dcache,
tcp), by using alloc_large_system_hash() we benefit from its
NUMA awareness and thus the table is distributed among the nodes
instead of in a single one.
For a custom microbenchmark that pounds on the uaddr hashing --
making the wait path fail at futex_wait_setup() returning
-EWOULDBLOCK for large amounts of futexes, we can see the
following benefits on an 80-core, 8-socket 1TB server:
+---------+--------------------+------------------------+-----------------------+-------------------------------+
| threads | baseline (ops/sec) | aligned-only (ops/sec) | large table (ops/sec) | large table+aligned (ops/sec) |
+---------+--------------------+------------------------+-----------------------+-------------------------------+
| 512 | 32426 | 50531 (+55.8%) | 255274 (+687.2%) | 292553 (+802.2%) |
| 256 | 65360 | 99588 (+52.3%) | 443563 (+578.6%) | 508088 (+677.3%) |
| 128 | 125635 | 200075 (+59.2%) | 742613 (+491.1%) | 835452 (+564.9%) |
| 80 | 193559 | 323425 (+67.1%) | 1028147 (+431.1%) | 1130304 (+483.9%) |
| 64 | 247667 | 443740 (+79.1%) | 997300 (+302.6%) | 1145494 (+362.5%) |
| 32 | 628412 | 721401 (+14.7%) | 965996 (+53.7%) | 1122115 (+78.5%) |
+---------+--------------------+------------------------+-----------------------+-------------------------------+
Reviewed-by: Darren Hart
Reviewed-by: Peter Zijlstra
Reviewed-by: Paul E. McKenney
Reviewed-by: Waiman Long
Reviewed-and-tested-by: Jason Low
Reviewed-by: Thomas Gleixner
Signed-off-by: Davidlohr Bueso
Cc: Mike Galbraith
Cc: Jeff Mahoney
Cc: Linus Torvalds
Cc: Scott Norton
Cc: Tom Vaden
Cc: Aswin Chandramouleeswaran
Link: http://lkml.kernel.org/r/1389569486-25487-3-git-send-email-davidlohr@hp.com
Signed-off-by: Ingo Molnar
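The two ingredients are easy to sketch (CONFIG_BASE_SMALL keeps the old tiny table; the alignment annotation gives each bucket its own cacheline; the NUMA-aware allocation call itself is sketched under the s390 follow-up fix above):

    struct futex_hash_bucket {
            spinlock_t lock;
            struct plist_head chain;
    } ____cacheline_aligned_in_smp;   /* one bucket per cacheline */

    /* in futex_init() */
    #if CONFIG_BASE_SMALL
            futex_hashsize = 16;
    #else
            futex_hashsize = roundup_pow_of_two(256 * num_possible_cpus());
    #endif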
-
Avoid waking up every thread sleeping in a futex_wait call during
suspend and resume by calling a freezable blocking call. Previous
patches modified the freezer to avoid sending wakeups to threads
that are blocked in freezable blocking calls.
This call was selected to be converted to a freezable call because
it doesn't hold any locks or release any resources when interrupted
that might be needed by another freezing task or a kernel driver
during suspend, and is a common site where idle userspace tasks are
blocked.
Signed-off-by: Colin Cross
Cc: Rafael J. Wysocki
Cc: arve@android.com
Cc: Tejun Heo
Cc: Oleg Nesterov
Cc: Darren Hart
Cc: Randy Dunlap
Cc: Al Viro
Link: http://lkml.kernel.org/r/1367458508-9133-8-git-send-email-ccross@android.com
Signed-off-by: Thomas Gleixner
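The conversion itself is a one-line change in futex_wait_queue_me(); a sketch:

    /* in futex_wait_queue_me(), kernel/futex.c */
    if (likely(!plist_node_empty(&q->list))) {
            /*
             * Only sleep if there is no timeout, or if the timeout
             * has yet to expire.
             */
            if (!timeout || timeout->task)
                    freezable_schedule();   /* was: schedule() */
    }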
-
- Remove unnecessary head variables.
- Delete unused parameter in queue_unlock().
Reviewed-by: Darren Hart
Reviewed-by: Peter Zijlstra
Reviewed-by: Paul E. McKenney
Reviewed-by: Thomas Gleixner
Signed-off-by: Jason Low
Signed-off-by: Davidlohr Bueso
Cc: Mike Galbraith
Cc: Jeff Mahoney
Cc: Linus Torvalds
Cc: Scott Norton
Cc: Tom Vaden
Cc: Aswin Chandramouleeswaran
Cc: Waiman Long
Cc: Andrew Morton
Link: http://lkml.kernel.org/r/1389569486-25487-2-git-send-email-davidlohr@hp.com
Signed-off-by: Ingo Molnar
-
When debugging the read-only hugepage case, I was confused by the fact
that get_futex_key() did an access_ok() only for the non-shared futex
case, since the user address checking really isn't in any way specific
to the private key handling.
Now, it turns out that the shared key handling does effectively do the
equivalent checks inside get_user_pages_fast() (it doesn't actually
check the address range on x86, but does check the page protections for
being a user page). So it wasn't actually a bug, but the fact that we
treat the address differently for private and shared futexes threw me
for a loop.
Just move the check up, so that it gets done for both cases. Also, use
the 'rw' parameter for the type, even if it doesn't actually matter any
more (it's a historical artifact of the old racy i386 "page faults from
kernel space don't check write protections").
Cc: Thomas Gleixner
Signed-off-by: Linus Torvalds
-
Some modules may need to know the cpu idle status and take
actions before and after the cpu enters or exits idle, so add a
notification callback chain for cpu idle entry/exit. Modules then
only need to register a notification callback, and every time the
cpu enters or exits idle, the callback chain will be executed.
Currently only the cpufreq interactive governor uses this
notification: as it wants to save power, the timers of the
interactive governor are only enabled when the cpu is not idle.
Signed-off-by: Anson Huang
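A sketch of the pattern, modeled on the classic idle-notifier API; the exact names here are assumptions, not dictated by the commit:

    static ATOMIC_NOTIFIER_HEAD(idle_notifier);

    void idle_notifier_register(struct notifier_block *n)
    {
            atomic_notifier_chain_register(&idle_notifier, n);
    }

    void idle_notifier_unregister(struct notifier_block *n)
    {
            atomic_notifier_chain_unregister(&idle_notifier, n);
    }

    /* called from the cpuidle path with IDLE_START / IDLE_END */
    void idle_notifier_call_chain(unsigned long val)
    {
            atomic_notifier_call_chain(&idle_notifier, val, NULL);
    }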
08 Aug, 2014
2 commits
-
commit 504d58745c9ca28d33572e2d8a9990b43e06075d upstream.
clockevents_increase_min_delta() calls printk() from under
hrtimer_bases.lock. That causes lock inversion on scheduler locks because
printk() can call into the scheduler. Lockdep puts it as:
======================================================
[ INFO: possible circular locking dependency detected ]
3.15.0-rc8-06195-g939f04b #2 Not tainted
-------------------------------------------------------
trinity-main/74 is trying to acquire lock:
(&port_lock_key){-.....}, at: [] serial8250_console_write+0x8c/0x10c
but task is already holding lock:
(hrtimer_bases.lock){-.-...}, at: [] hrtimer_try_to_cancel+0x13/0x66
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #5 (hrtimer_bases.lock){-.-...}:
[] lock_acquire+0x92/0x101
[] _raw_spin_lock_irqsave+0x2e/0x3e
[] __hrtimer_start_range_ns+0x1c/0x197
[] perf_swevent_start_hrtimer.part.41+0x7a/0x85
[] task_clock_event_start+0x3a/0x3f
[] task_clock_event_add+0xd/0x14
[] event_sched_in+0xb6/0x17a
[] group_sched_in+0x44/0x122
[] ctx_sched_in.isra.67+0x105/0x11f
[] perf_event_sched_in.isra.70+0x47/0x4b
[] __perf_install_in_context+0x8b/0xa3
[] remote_function+0x12/0x2a
[] smp_call_function_single+0x2d/0x53
[] task_function_call+0x30/0x36
[] perf_install_in_context+0x87/0xbb
[] SYSC_perf_event_open+0x5c6/0x701
[] SyS_perf_event_open+0x17/0x19
[] syscall_call+0x7/0xb
-> #4 (&ctx->lock){......}:
[] lock_acquire+0x92/0x101
[] _raw_spin_lock+0x21/0x30
[] __perf_event_task_sched_out+0x1dc/0x34f
[] __schedule+0x4c6/0x4cb
[] schedule+0xf/0x11
[] work_resched+0x5/0x30
-> #3 (&rq->lock){-.-.-.}:
[] lock_acquire+0x92/0x101
[] _raw_spin_lock+0x21/0x30
[] __task_rq_lock+0x33/0x3a
[] wake_up_new_task+0x25/0xc2
[] do_fork+0x15c/0x2a0
[] kernel_thread+0x1a/0x1f
[] rest_init+0x1a/0x10e
[] start_kernel+0x303/0x308
[] i386_start_kernel+0x79/0x7d
-> #2 (&p->pi_lock){-.-...}:
[] lock_acquire+0x92/0x101
[] _raw_spin_lock_irqsave+0x2e/0x3e
[] try_to_wake_up+0x1d/0xd6
[] default_wake_function+0xb/0xd
[] __wake_up_common+0x39/0x59
[] __wake_up+0x29/0x3b
[] tty_wakeup+0x49/0x51
[] uart_write_wakeup+0x17/0x19
[] serial8250_tx_chars+0xbc/0xfb
[] serial8250_handle_irq+0x54/0x6a
[] serial8250_default_handle_irq+0x19/0x1c
[] serial8250_interrupt+0x38/0x9e
[] handle_irq_event_percpu+0x5f/0x1e2
[] handle_irq_event+0x2c/0x43
[] handle_level_irq+0x57/0x80
[] handle_irq+0x46/0x5c
[] do_IRQ+0x32/0x89
[] common_interrupt+0x2e/0x33
[] _raw_spin_unlock_irqrestore+0x3f/0x49
[] uart_start+0x2d/0x32
[] uart_write+0xc7/0xd6
[] n_tty_write+0xb8/0x35e
[] tty_write+0x163/0x1e4
[] redirected_tty_write+0x6d/0x75
[] vfs_write+0x75/0xb0
[] SyS_write+0x44/0x77
[] syscall_call+0x7/0xb
-> #1 (&tty->write_wait){-.....}:
[] lock_acquire+0x92/0x101
[] _raw_spin_lock_irqsave+0x2e/0x3e
[] __wake_up+0x15/0x3b
[] tty_wakeup+0x49/0x51
[] uart_write_wakeup+0x17/0x19
[] serial8250_tx_chars+0xbc/0xfb
[] serial8250_handle_irq+0x54/0x6a
[] serial8250_default_handle_irq+0x19/0x1c
[] serial8250_interrupt+0x38/0x9e
[] handle_irq_event_percpu+0x5f/0x1e2
[] handle_irq_event+0x2c/0x43
[] handle_level_irq+0x57/0x80
[] handle_irq+0x46/0x5c
[] do_IRQ+0x32/0x89
[] common_interrupt+0x2e/0x33
[] _raw_spin_unlock_irqrestore+0x3f/0x49
[] uart_start+0x2d/0x32
[] uart_write+0xc7/0xd6
[] n_tty_write+0xb8/0x35e
[] tty_write+0x163/0x1e4
[] redirected_tty_write+0x6d/0x75
[] vfs_write+0x75/0xb0
[] SyS_write+0x44/0x77
[] syscall_call+0x7/0xb
-> #0 (&port_lock_key){-.....}:
[] __lock_acquire+0x9ea/0xc6d
[] lock_acquire+0x92/0x101
[] _raw_spin_lock_irqsave+0x2e/0x3e
[] serial8250_console_write+0x8c/0x10c
[] call_console_drivers.constprop.31+0x87/0x118
[] console_unlock+0x1d7/0x398
[] vprintk_emit+0x3da/0x3e4
[] printk+0x17/0x19
[] clockevents_program_min_delta+0x104/0x116
[] clockevents_program_event+0xe7/0xf3
[] tick_program_event+0x1e/0x23
[] hrtimer_force_reprogram+0x88/0x8f
[] __remove_hrtimer+0x5b/0x79
[] hrtimer_try_to_cancel+0x49/0x66
[] hrtimer_cancel+0xd/0x18
[] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
[] task_clock_event_stop+0x20/0x64
[] task_clock_event_del+0xd/0xf
[] event_sched_out+0xab/0x11e
[] group_sched_out+0x1d/0x66
[] ctx_sched_out+0xaf/0xbf
[] __perf_event_task_sched_out+0x1ed/0x34f
[] __schedule+0x4c6/0x4cb
[] schedule+0xf/0x11
[] work_resched+0x5/0x30
other info that might help us debug this:
Chain exists of:
&port_lock_key --> &ctx->lock --> hrtimer_bases.lock
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(hrtimer_bases.lock);
lock(&ctx->lock);
lock(hrtimer_bases.lock);
lock(&port_lock_key);
*** DEADLOCK ***
4 locks held by trinity-main/74:
#0: (&rq->lock){-.-.-.}, at: [] __schedule+0xed/0x4cb
#1: (&ctx->lock){......}, at: [] __perf_event_task_sched_out+0x1dc/0x34f
#2: (hrtimer_bases.lock){-.-...}, at: [] hrtimer_try_to_cancel+0x13/0x66
#3: (console_lock){+.+...}, at: [] vprintk_emit+0x3c7/0x3e4
stack backtrace:
CPU: 0 PID: 74 Comm: trinity-main Not tainted 3.15.0-rc8-06195-g939f04b #2
00000000 81c3a310 8b995c14 81426f69 8b995c44 81425a99 8161f671 8161f570
8161f538 8161f559 8161f538 8b995c78 8b142bb0 00000004 8b142fdc 8b142bb0
8b995ca8 8104a62d 8b142fac 000016f2 81c3a310 00000001 00000001 00000003
Call Trace:
[] dump_stack+0x16/0x18
[] print_circular_bug+0x18f/0x19c
[] __lock_acquire+0x9ea/0xc6d
[] lock_acquire+0x92/0x101
[] ? serial8250_console_write+0x8c/0x10c
[] ? wait_for_xmitr+0x76/0x76
[] _raw_spin_lock_irqsave+0x2e/0x3e
[] ? serial8250_console_write+0x8c/0x10c
[] serial8250_console_write+0x8c/0x10c
[] ? lock_release+0x191/0x223
[] ? wait_for_xmitr+0x76/0x76
[] call_console_drivers.constprop.31+0x87/0x118
[] console_unlock+0x1d7/0x398
[] vprintk_emit+0x3da/0x3e4
[] printk+0x17/0x19
[] clockevents_program_min_delta+0x104/0x116
[] tick_program_event+0x1e/0x23
[] hrtimer_force_reprogram+0x88/0x8f
[] __remove_hrtimer+0x5b/0x79
[] hrtimer_try_to_cancel+0x49/0x66
[] hrtimer_cancel+0xd/0x18
[] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
[] task_clock_event_stop+0x20/0x64
[] task_clock_event_del+0xd/0xf
[] event_sched_out+0xab/0x11e
[] group_sched_out+0x1d/0x66
[] ctx_sched_out+0xaf/0xbf
[] __perf_event_task_sched_out+0x1ed/0x34f
[] ? __dequeue_entity+0x23/0x27
[] ? pick_next_task_fair+0xb1/0x120
[] __schedule+0x4c6/0x4cb
[] ? trace_hardirqs_off_caller+0xd7/0x108
[] ? trace_hardirqs_off+0xb/0xd
[] ? rcu_irq_exit+0x64/0x77
Fix the problem by using printk_deferred() which does not call into the
scheduler.
Reported-by: Fengguang Wu
Signed-off-by: Jan Kara
Signed-off-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman
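The fix itself is mechanical; a sketch of the call site in kernel/time/clockevents.c (message text approximate):

    /* runs under hrtimer_bases.lock; printk() could recurse into the
     * scheduler here, printk_deferred() hands the output to irq_work */
    printk_deferred(KERN_WARNING
                    "CE: %s increased min_delta_ns to %llu nsec\n",
                    dev->name ? dev->name : "?",
                    (unsigned long long) dev->min_delta_ns);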
-
commit aac74dc495456412c4130a1167ce4beb6c1f0b38 upstream.
After learning we'll need some sort of deferred printk functionality in
the timekeeping core, Peter suggested we rename the printk_sched function
so it can be reused by needed subsystems.
This only changes the function name. No logic changes.
Signed-off-by: John Stultz
Reviewed-by: Steven Rostedt
Cc: Jan Kara
Cc: Peter Zijlstra
Cc: Jiri Bohac
Cc: Thomas Gleixner
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
01 Aug, 2014
1 commit
-
commit 58d4e21e50ff3cc57910a8abc20d7e14375d2f61 upstream.
The "uptime" trace clock added in:
commit 8aacf017b065a805d27467843490c976835eb4a5
tracing: Add "uptime" trace clock that uses jiffieshas wraparound problems when the system has been up more
than 1 hour 11 minutes and 34 seconds. It converts jiffies
to nanoseconds using:
(u64)jiffies_to_usecs(jiffy) * 1000ULL
but since jiffies_to_usecs() only returns a 32-bit value, it
truncates at 2^32 microseconds. An additional problem on 32-bit
systems is that the argument is "unsigned long", so fixing the
return value only helps until 2^32 jiffies (49.7 days on a HZ=1000
system).
Avoid these problems by using jiffies_64 as our basis, and
not converting to nanoseconds (we do convert to clock_t because
user facing API must not be dependent on internal kernel
HZ values).
Link: http://lkml.kernel.org/p/99d63c5bfe9b320a3b428d773825a37095bf6a51.1405708254.git.tony.luck@intel.com
Fixes: 8aacf017b065 "tracing: Add "uptime" trace clock that uses jiffies"
Signed-off-by: Tony Luck
Signed-off-by: Steven Rostedt
Signed-off-by: Greg Kroah-Hartman
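A sketch of the repaired callback (kernel/trace/trace_clock.c): jiffies_64 avoids the 32-bit jiffies wrap, and clock_t keeps the user-visible units independent of HZ:

    u64 notrace trace_clock_jiffies(void)
    {
            return jiffies_64_to_clock_t(jiffies_64 - INITIAL_JIFFIES);
    }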
28 Jul, 2014
7 commits
-
commit b0ab99e7736af88b8ac1b7ae50ea287fffa2badc upstream.
proc_sched_show_task() does:
if (nr_switches)
do_div(avg_atom, nr_switches);
nr_switches is unsigned long and do_div truncates it to 32 bits, which
means it can test non-zero on e.g. x86-64 and be truncated to zero for
division.
Fix the problem by using div64_ul() instead.
As a side effect calculations of avg_atom for big nr_switches are now correct.
Signed-off-by: Mateusz Guzik
Signed-off-by: Peter Zijlstra
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/1402750809-31991-1-git-send-email-mguzik@redhat.com
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman
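The fix in kernel/sched/debug.c, sketched:

    if (nr_switches)
            avg_atom = div64_ul(avg_atom, nr_switches); /* was: do_div() */
    else
            avg_atom = -1LL;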
-
commit 4badad352a6bb202ec68afa7a574c0bb961e5ebc upstream.
The optimistic spin code assumes regular stores and cmpxchg() play nice;
this is found to not be true for at least: parisc, sparc32, tile32,
metag-lock1, arc-!llsc and hexagon.
There is further wreckage, but this in particular seemed easy to
trigger, so blacklist this.
Opt in for known good archs.
Signed-off-by: Peter Zijlstra
Reported-by: Mikulas Patocka
Cc: David Miller
Cc: Chris Metcalf
Cc: James Bottomley
Cc: Vineet Gupta
Cc: Jason Low
Cc: Waiman Long
Cc: "James E.J. Bottomley"
Cc: Paul McKenney
Cc: John David Anglin
Cc: James Hogan
Cc: Linus Torvalds
Cc: Davidlohr Bueso
Cc: Benjamin Herrenschmidt
Cc: Catalin Marinas
Cc: Russell King
Cc: Will Deacon
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: sparclinux@vger.kernel.org
Link: http://lkml.kernel.org/r/20140606175316.GV13930@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman
-
commit 4320f6b1d9db4ca912c5eb6ecb328b2e090e1586 upstream.
The commit [247bc037: PM / Sleep: Mitigate race between the freezer
and request_firmware()] introduced the finer state control, but it
also leads to a new bug; for example, a bug report regarding the
firmware loading of intel BT device at suspend/resume:
https://bugzilla.novell.com/show_bug.cgi?id=873790
The root cause seems to be a small window between the process resume
and the clear of usermodehelper lock. The request_firmware() function
checks the UMH lock and gives up when it's in UMH_DISABLE state. This
is for avoiding the invalid f/w loading during suspend/resume phase.
The problem is, however, that usermodehelper_enable() is called at the
end of thaw_processes(). Thus, a thawed process in between can kick
off the f/w loader code path (in this case, via btusb_setup_intel())
even before the call of usermodehelper_enable(). Then
usermodehelper_read_trylock() returns an error and request_firmware()
spews WARN_ON() in the end.
This oneliner patch fixes the issue just by setting to UMH_FREEZING
state again before restarting tasks, so that the call of
request_firmware() will be blocked until the end of this function
instead of returning an error.
Fixes: 247bc0374254 (PM / Sleep: Mitigate race between the freezer and request_firmware())
Link: https://bugzilla.novell.com/show_bug.cgi?id=873790
Signed-off-by: Takashi Iwai
Signed-off-by: Rafael J. Wysocki
Signed-off-by: Greg Kroah-Hartman
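The one-liner, sketched in context (kernel/power/process.c; surrounding code abbreviated):

    void thaw_processes(void)
    {
            ...
            /* close the window: keep usermodehelpers blocked until
             * every task is actually thawed */
            __usermodehelper_set_disable_depth(UMH_FREEZING);
            thaw_workqueues();

            /* ... thaw all tasks ... */

            usermodehelper_enable();
            ...
    }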
-
commit 16927776ae757d0d132bdbfabbfe2c498342bd59 upstream.
Sharvil noticed with the posix timer_settime interface, using the
CLOCK_REALTIME_ALARM or CLOCK_BOOTTIME_ALARM clockid, if the users
tried to specify a relative time timer, it would incorrectly be
treated as absolute regardless of the state of the flags argument.
This patch corrects this, properly checking the absolute/relative flag,
and adds further error checking that no invalid flag bits are set.
Reported-by: Sharvil Nanavati
Signed-off-by: John Stultz
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Prarit Bhargava
Cc: Sharvil Nanavati
Link: http://lkml.kernel.org/r/1404767171-6902-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman
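A sketch of the corrected handling in kernel/time/alarmtimer.c (surrounding code elided):

    /* reject unknown flag bits */
    if (flags & ~TIMER_ABSTIME)
            return -EINVAL;

    exp = timespec_to_ktime(new_setting->it_value);
    /* convert a relative expiry to absolute time */
    if (flags != TIMER_ABSTIME) {
            ktime_t now;

            now = alarm_bases[timr->it.alarm.alarmtimer.type].gettime();
            exp = ktime_add(now, exp);
    }
    alarm_start(&timr->it.alarm.alarmtimer, exp);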
-
commit 97b8ee845393701edc06e27ccec2876ff9596019 upstream.
ring_buffer_poll_wait() should always put the poll_table to its wait_queue
even when there is immediate data available. Otherwise, the following epoll and
read sequence will eventually hang forever:
1. Put some data to make the trace_pipe ring_buffer read ready first
2. epoll_ctl(efd, EPOLL_CTL_ADD, trace_pipe_fd, ee)
3. epoll_wait()
4. read(trace_pipe_fd) till EAGAIN
5. Add some more data to the trace_pipe ring_buffer
6. epoll_wait() -> this epoll_wait() will block forever
~ During the epoll_ctl(efd, EPOLL_CTL_ADD,...) call in step 2,
ring_buffer_poll_wait() returns immediately without adding poll_table,
which has poll_table->_qproc pointing to ep_poll_callback(), to its
wait_queue.
~ During the epoll_wait() call in step 3 and step 6,
ring_buffer_poll_wait() cannot add ep_poll_callback() to its wait_queue
because the poll_table->_qproc is NULL and it is how epoll works.
~ When there is new data available in step 6, ring_buffer does not know
it has to call ep_poll_callback() because it is not in its wait queue.
Hence, block forever.
Other poll implementations seem to call poll_wait() unconditionally as the very
first thing to do. For example, tcp_poll() in tcp.c.
Link: http://lkml.kernel.org/p/20140610060637.GA14045@devbig242.prn2.facebook.com
Fixes: 2a2cc8f7c4d0 "ftrace: allow the event pipe to be polled"
Reviewed-by: Chris Mason
Signed-off-by: Martin Lau
Signed-off-by: Steven Rostedt
Signed-off-by: Greg Kroah-Hartman
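Sketch of the fixed ordering in ring_buffer_poll_wait() (kernel/trace/ring_buffer.c; the per-cpu variant and wait-source selection are elided):

    /* register with the poll_table FIRST, unconditionally ... */
    poll_wait(filp, &work->waiters, poll_table);

    /* ... and only then report any immediately-available data */
    if (!ring_buffer_empty(buffer))
            return POLLIN | POLLRDNORM;
    return 0;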
-
commit 8abfb8727f4a724d31f9ccfd8013fbd16d539445 upstream.
Currently the trace option stacktrace is not applicable for
trace_printk with a constant string argument; the reason is
that the ftrace_trace_stack call is missing in __trace_puts/__trace_bputs.
In contrast, when using trace_printk with a non-constant string
argument (which calls into __trace_printk/__trace_bprintk),
the trace option stacktrace works. This inconsistent behavior
confuses users a lot.
Link: http://lkml.kernel.org/p/51E7A7C9.9040401@huawei.com
Signed-off-by: zhangwei(Jovi)
Signed-off-by: Steven Rostedt
Signed-off-by: Greg Kroah-Hartman
-
commit 5f8bf2d263a20b986225ae1ed7d6759dc4b93af9 upstream.
Running my ftrace tests on PowerPC, it failed the test that checks
if function_graph tracer is affected by the stack tracer. It was.
Looking into this, I found that the update_function_graph_func()
must be called even if the trampoline function is not changed.
This is because archs like PowerPC do not support ftrace_ops being
passed by assembly and instead uses a helper function (what the
trampoline function points to). Since this function is not changed
even when multiple ftrace_ops are added to the code, the test that
falls out before calling update_function_graph_func() will miss that
the update must still be done.
Call update_function_graph_func() for all calls to
update_ftrace_function().
Signed-off-by: Steven Rostedt
Signed-off-by: Greg Kroah-Hartman
18 Jul, 2014
8 commits
-
commit 27e35715df54cbc4f2d044f681802ae30479e7fb upstream.
When the rtmutex fast path is enabled the slow unlock function can
create the following situation:

spin_lock(foo->m->wait_lock);
foo->m->owner = NULL;
                        rt_mutex_lock(foo->m); <-- fast path
                        free = atomic_dec_and_test(foo->refcnt);
                        rt_mutex_unlock(foo->m); <-- fast path
                        if (free)
                                kfree(foo);
spin_unlock(foo->m->wait_lock); <--- Use after free.

Plug the race by changing the slow unlock to the following scheme:

while (!rt_mutex_has_waiters(m)) {
        /* Clear the waiters bit in m->owner */
        clear_rt_mutex_waiters(m);
        owner = rt_mutex_owner(m);
        spin_unlock(m->wait_lock);
        if (cmpxchg(m->owner, owner, 0) == owner)
                return;
        spin_lock(m->wait_lock);
}

So in case of a new waiter incoming while the owner tries the slow
path unlock we have two situations:

unlock(wait_lock);
                        lock(wait_lock);
cmpxchg(p, owner, 0) == owner
                        mark_rt_mutex_waiters(lock);
                        acquire(lock);

Or:

unlock(wait_lock);
                        lock(wait_lock);
mark_rt_mutex_waiters(lock);
                        cmpxchg(p, owner, 0) != owner
                        enqueue_waiter();
                        unlock(wait_lock);
lock(wait_lock);
wakeup_next_waiter();
unlock(wait_lock);
                        lock(wait_lock);
                        acquire(lock);

If the fast path is disabled, then the simple

m->owner = NULL;
unlock(m->wait_lock);

is sufficient as all access to m->owner is serialized via
m->wait_lock.

Also document and clarify the wakeup_next_waiter function as suggested
by Oleg Nesterov.
Reported-by: Steven Rostedt
Signed-off-by: Thomas Gleixner
Reviewed-by: Steven Rostedt
Cc: Peter Zijlstra
Link: http://lkml.kernel.org/r/20140611183852.937945560@linutronix.de
Signed-off-by: Thomas Gleixner
Signed-off-by: Mike Galbraith
Signed-off-by: Greg Kroah-Hartman
-
commit 3d5c9340d1949733eb37616abd15db36aef9a57c upstream.
Even in the case when deadlock detection is not requested by the
caller, we can detect deadlocks. Right now the code stops the lock
chain walk and keeps the waiter enqueued, even on itself. Silly not to
yell when such a scenario is detected and to keep the waiter enqueued.
Return -EDEADLK unconditionally and handle it at the call sites.
The futex calls return -EDEADLK. The non futex ones dequeue the
waiter, throw a warning and put the task into a schedule loop.
Tagged for stable as it makes the code more robust.
Signed-off-by: Thomas Gleixner
Cc: Steven Rostedt
Cc: Peter Zijlstra
Cc: Brad Mouring
Link: http://lkml.kernel.org/r/20140605152801.836501969@linutronix.de
Signed-off-by: Thomas Gleixner
Signed-off-by: Mike Galbraith
Signed-off-by: Greg Kroah-Hartman
-
commit 82084984383babe728e6e3c9a8e5c46278091315 upstream.
When we walk the lock chain, we drop all locks after each step. So the
lock chain can change under us before we reacquire the locks. That's
harmless in principle as we just follow the wrong lock path. But it
can lead to a false positive in the deadlock detection logic:
T0 holds L0
T0 blocks on L1 held by T1
T1 blocks on L2 held by T2
T2 blocks on L3 held by T3
T3 blocks on L4 held by T4
Now we walk the chain
lock T1 -> lock L2 -> adjust L2 -> unlock T1 ->
lock T2 -> adjust T2 -> drop locks
T2 times out and blocks on L0
Now we continue:
lock T2 -> lock L0 -> deadlock detected, but it's not a deadlock at all.
Brad tried to work around that in the deadlock detection logic itself,
but the more I looked at it the less I liked it, because it's crystal
ball magic after the fact.
We actually can detect a chain change very simply:
lock T1 -> lock L2 -> adjust L2 -> unlock T1 -> lock T2 -> adjust T2 ->
next_lock = T2->pi_blocked_on->lock;
drop locks
T2 times out and blocks on L0
Now we continue:
lock T2 ->
if (next_lock != T2->pi_blocked_on->lock)
return;
So if we detect that T2 is now blocked on a different lock we stop the
chain walk. That's also correct in the following scenario:
lock T1 -> lock L2 -> adjust L2 -> unlock T1 -> lock T2 -> adjust T2 ->
next_lock = T2->pi_blocked_on->lock;
drop locks
T3 times out and drops L3
T2 acquires L3 and blocks on L4 now
Now we continue:
lock T2 ->
if (next_lock != T2->pi_blocked_on->lock)
return;
We don't have to follow up the chain at that point, because T2
propagated our priority up to T4 already.
[ Folded a cleanup patch from peterz ]
Signed-off-by: Thomas Gleixner
Reported-by: Brad Mouring
Cc: Steven Rostedt
Cc: Peter Zijlstra
Link: http://lkml.kernel.org/r/20140605152801.930031935@linutronix.de
Signed-off-by: Mike Galbraith
Signed-off-by: Greg Kroah-Hartman
-
commit 397335f004f41e5fcf7a795e94eb3ab83411a17c upstream.
The current deadlock detection logic does not work reliably due to the
following early exit path:
/*
* Drop out, when the task has no waiters. Note,
* top_waiter can be NULL, when we are in the deboosting
* mode!
*/
if (top_waiter && (!task_has_pi_waiters(task) ||
top_waiter != task_top_pi_waiter(task)))
goto out_unlock_pi;
So this not only exits when the task has no waiters, it also exits
unconditionally when the current waiter is not the top priority waiter
of the task.
So in a nested locking scenario, it might abort the lock chain walk
and therefore miss a potential deadlock.
Simple fix: Continue the chain walk, when deadlock detection is
enabled.
We also avoid the whole enqueue, if we detect the deadlock right away
(A-A). It's an optimization, but also prevents that another waiter who
comes in after the detection and before the task has undone the damage
observes the situation and detects the deadlock and returns
-EDEADLOCK, which is wrong as the other task is not in a deadlock
situation.
Signed-off-by: Thomas Gleixner
Cc: Peter Zijlstra
Reviewed-by: Steven Rostedt
Cc: Lai Jiangshan
Link: http://lkml.kernel.org/r/20140522031949.725272460@linutronix.de
Signed-off-by: Thomas Gleixner
Signed-off-by: Mike Galbraith
Signed-off-by: Greg Kroah-Hartman
-
commit 8b8b36834d0fff67fc8668093f4312dd04dcf21d upstream.
The per_cpu buffers are created one per possible CPU. But these do
not mean that those CPUs are online, nor do they even exist.
With the addition of the ring buffer polling, it assumes that the
caller polls on an existing buffer. But this is not the case if
the user reads trace_pipe from a CPU that does not exist, and this
causes the kernel to crash.
Simple fix is to check the cpu against the buffer bitmask to see
if the buffer was allocated or not and return -ENODEV if it is
not.
More updates were done to pass the -ENODEV back up to userspace.
Link: http://lkml.kernel.org/r/5393DB61.6060707@oracle.com
Reported-by: Sasha Levin
Signed-off-by: Steven Rostedt
Signed-off-by: Greg Kroah-Hartman
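The check itself, sketched:

    /* in ring_buffer_wait()/ring_buffer_poll_wait() for a specific cpu */
    if (cpu != RING_BUFFER_ALL_CPUS &&
        !cpumask_test_cpu(cpu, buffer->cpumask))
            return -ENODEV; /* that per-cpu buffer was never allocated */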
-
commit 5a6024f1604eef119cf3a6fa413fe0261a81a8f3 upstream.
When hot-adding and onlining a CPU, a kernel panic occurs, showing the
following call trace:
BUG: unable to handle kernel paging request at 0000000000001d08
IP: [] __alloc_pages_nodemask+0x9d/0xb10
PGD 0
Oops: 0000 [#1] SMP
...
Call Trace:
[] ? cpumask_next_and+0x35/0x50
[] ? find_busiest_group+0x113/0x8f0
[] ? deactivate_slab+0x349/0x3c0
[] new_slab+0x91/0x300
[] __slab_alloc+0x2bb/0x482
[] ? copy_process.part.25+0xfc/0x14c0
[] ? load_balance+0x218/0x890
[] ? sched_clock+0x9/0x10
[] ? trace_clock_local+0x9/0x10
[] kmem_cache_alloc_node+0x8c/0x200
[] copy_process.part.25+0xfc/0x14c0
[] ? trace_buffer_unlock_commit+0x4d/0x60
[] ? kthread_create_on_node+0x140/0x140
[] do_fork+0xbc/0x360
[] kernel_thread+0x26/0x30
[] kthreadd+0x2c2/0x300
[] ? kthread_create_on_cpu+0x60/0x60
[] ret_from_fork+0x7c/0xb0
[] ? kthread_create_on_cpu+0x60/0x60
In my investigation, I found the root cause is wq_numa_possible_cpumask.
All entries of wq_numa_possible_cpumask are allocated by
alloc_cpumask_var_node(), and these entries are used without being
initialized, so they have wrong values.
When hot-adding and onlining a CPU, wq_update_unbound_numa() is called.
wq_update_unbound_numa() calls alloc_unbound_pwq(). And alloc_unbound_pwq()
calls get_unbound_pool(). In get_unbound_pool(), worker_pool->node is set
as follows:

/* if cpumask is contained inside a NUMA node, we belong to that node */
if (wq_numa_enabled) {
        for_each_node(node) {
                if (cpumask_subset(pool->attrs->cpumask,
                                   wq_numa_possible_cpumask[node])) {
                        pool->node = node;
                        break;
                }
        }
}

But wq_numa_possible_cpumask[node] does not have a correct cpumask, so the
wrong node is selected. As a result, the kernel panic occurs.
By this patch, all entries of wq_numa_possible_cpumask are allocated by
zalloc_cpumask_var_node() to initialize them. And the panic disappeared.
Signed-off-by: Yasuaki Ishimatsu
Reviewed-by: Lai Jiangshan
Signed-off-by: Tejun Heo
Fixes: bce903809ab3 ("workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]")
Signed-off-by: Greg Kroah-Hartman
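The change, sketched (wq_numa_init() in kernel/workqueue.c):

    for_each_node(node)
            BUG_ON(!zalloc_cpumask_var_node(&tbl[node], GFP_KERNEL,
                            node_online(node) ? node : NUMA_NO_NODE));
    /* was alloc_cpumask_var_node(): the mask started out uninitialized */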
-
commit 391acf970d21219a2a5446282d3b20eace0c0d7a upstream.
When running with the kernel (3.15-rc7+), the following bug occurs:
[ 9969.258987] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586
[ 9969.359906] in_atomic(): 1, irqs_disabled(): 0, pid: 160655, name: python
[ 9969.441175] INFO: lockdep is turned off.
[ 9969.488184] CPU: 26 PID: 160655 Comm: python Tainted: G A 3.15.0-rc7+ #85
[ 9969.581032] Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 1.39 11/16/2012
[ 9969.706052] ffffffff81a20e60 ffff8803e941fbd0 ffffffff8162f523 ffff8803e941fd18
[ 9969.795323] ffff8803e941fbe0 ffffffff8109995a ffff8803e941fc58 ffffffff81633e6c
[ 9969.884710] ffffffff811ba5dc ffff880405c6b480 ffff88041fdd90a0 0000000000002000
[ 9969.974071] Call Trace:
[ 9970.003403] [] dump_stack+0x4d/0x66
[ 9970.065074] [] __might_sleep+0xfa/0x130
[ 9970.130743] [] mutex_lock_nested+0x3c/0x4f0
[ 9970.200638] [] ? kmem_cache_alloc+0x1bc/0x210
[ 9970.272610] [] cpuset_mems_allowed+0x27/0x140
[ 9970.344584] [] ? __mpol_dup+0x63/0x150
[ 9970.409282] [] __mpol_dup+0xe5/0x150
[ 9970.471897] [] ? __mpol_dup+0x63/0x150
[ 9970.536585] [] ? copy_process.part.23+0x606/0x1d40
[ 9970.613763] [] ? trace_hardirqs_on+0xd/0x10
[ 9970.683660] [] ? monotonic_to_bootbased+0x2f/0x50
[ 9970.759795] [] copy_process.part.23+0x670/0x1d40
[ 9970.834885] [] do_fork+0xd8/0x380
[ 9970.894375] [] ? __audit_syscall_entry+0x9c/0xf0
[ 9970.969470] [] SyS_clone+0x16/0x20
[ 9971.030011] [] stub_clone+0x69/0x90
[ 9971.091573] [] ? system_call_fastpath+0x16/0x1b
The cause is that cpuset_mems_allowed() tries to take
mutex_lock(&callback_mutex) under rcu_read_lock (which was held in
__mpol_dup()). And in cpuset_mems_allowed(), the access to cpuset is
under rcu_read_lock, so in __mpol_dup, we can reduce the rcu_read_lock
protection region to protect the access to cpuset only in
current_cpuset_is_being_rebound(), so that we can avoid this bug.
This patch is a temporary solution that just addresses the bug
mentioned above, but cannot fix the long-standing issue about cpuset.mems
rebinding on fork():
"When the forker's task_struct is duplicated (which includes
->mems_allowed) and it races with an update to cpuset_being_rebound
in update_tasks_nodemask() then the task's mems_allowed doesn't get
updated. And the child task's mems_allowed can be wrong if the
cpuset's nodemask changes before the child has been added to the
cgroup's tasklist."Signed-off-by: Gu Zheng
Acked-by: Li Zefan
Signed-off-by: Tejun Heo
Signed-off-by: Greg Kroah-Hartman
-
commit bddbceb688c6d0decaabc7884fede319d02f96c8 upstream.
Uevents are suppressed during attributes registration, but never
restored, so kobject_uevent() does nothing.
Signed-off-by: Maxime Bizon
Signed-off-by: Tejun Heo
Fixes: 226223ab3c4118ddd10688cc2c131135848371ab
Signed-off-by: Greg Kroah-Hartman
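The fix, sketched (workqueue_sysfs_register() in kernel/workqueue.c):

    /* attribute registration is done; stop suppressing uevents
     * before announcing the device */
    dev_set_uevent_suppress(&wq_dev->dev, false);
    kobject_uevent(&wq_dev->dev.kobj, KOBJ_ADD);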
10 Jul, 2014
1 commit
-
commit 099ed151675cd1d2dbeae1dac697975f6a68716d upstream.
Disabling reading and writing to the trace file should not be able to
disable all function tracing callbacks. There's other users today
(like kprobes and perf). Reading a trace file should not stop those
from happening.
Reviewed-by: Masami Hiramatsu
Signed-off-by: Steven Rostedt
Signed-off-by: Greg Kroah-Hartman
07 Jul, 2014
2 commits
-
commit 4af4206be2bd1933cae20c2b6fb2058dbc887f7c upstream.
syscall_regfunc() and syscall_unregfunc() should set/clear
TIF_SYSCALL_TRACEPOINT system-wide, but do_each_thread() can race
with copy_process() and miss the new child which was not added to
the process/thread lists yet.
Change copy_process() to update the child's TIF_SYSCALL_TRACEPOINT
under tasklist.
Link: http://lkml.kernel.org/p/20140413185854.GB20668@redhat.com
Fixes: a871bd33a6c0 "tracing: Add syscall tracepoints"
Acked-by: Frederic Weisbecker
Acked-by: Paul E. McKenney
Signed-off-by: Oleg Nesterov
Signed-off-by: Steven Rostedt
Signed-off-by: Greg Kroah-Hartman
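Sketch of the copy_process() side; because it runs with tasklist_lock write-held, it cannot race with syscall_regfunc()'s do_each_thread() walk:

    static inline void syscall_tracepoint_update(struct task_struct *p)
    {
            if (test_thread_flag(TIF_SYSCALL_TRACEPOINT))
                    set_tsk_thread_flag(p, TIF_SYSCALL_TRACEPOINT);
            else
                    clear_tsk_thread_flag(p, TIF_SYSCALL_TRACEPOINT);
    }

    /* in copy_process(), under write_lock_irq(&tasklist_lock) */
    syscall_tracepoint_update(p);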
-
commit 379cfdac37923653c9d4242d10052378b7563005 upstream.
In order to prevent the saved cmdline cache from being filled when
tracing is not active, the comms are only recorded after a trace event
is recorded.
The problem is, a comm can fail to be recorded if the trace_cmdline_lock
is held. That lock is taken via a trylock to allow it to happen from
any context (including NMI). If the lock fails to be taken, the comm
is skipped. No big deal, as we will try again later.
But! Because of the code that was added to only record after an event,
we may not try again later as the recording is made as a oneshot per
event per CPU.
Only disable the recording of the comm if the comm is actually recorded.
Fixes: 7ffbd48d5cab "tracing: Cache comms only after an event occurred"
Signed-off-by: Steven Rostedt
Signed-off-by: Greg Kroah-Hartman
01 Jul, 2014
2 commits
-
commit 1e77d0a1ed7417d2a5a52a7b8d32aea1833faa6c upstream.
Till reported that the spurious interrupt detection of threaded
interrupts is broken in two ways:
- note_interrupt() is called for each action thread of a shared
interrupt line. That's wrong as we are only interested whether none
of the device drivers felt responsible for the interrupt, but by
calling multiple times for a single interrupt line we account
IRQ_NONE even if one of the drivers felt responsible.
- note_interrupt() when called from the thread handler is not
serialized. That leaves the members of irq_desc which are used for
the spurious detection unprotected.
To solve this we need to defer the spurious detection of a threaded
interrupt to the next hardware interrupt context where we have
implicit serialization.
If note_interrupt is called with action_ret == IRQ_WAKE_THREAD, we
check whether the previous interrupt requested a deferred check. If
not, we request a deferred check for the next hardware interrupt and
return.
If set, we check whether one of the interrupt threads signaled
success. Depending on this information we feed the result into the
spurious detector.
If one primary handler of a shared interrupt returns IRQ_HANDLED we
disable the deferred check of irq threads on the same line, as we have
found at least one device driver who cared.
Reported-by: Till Straumann
Signed-off-by: Thomas Gleixner
Tested-by: Austin Schuh
Cc: Oliver Hartkopp
Cc: Wolfgang Grandegger
Cc: Pavel Pisa
Cc: Marc Kleine-Budde
Cc: linux-can@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1303071450130.22263@ionos
Signed-off-by: Greg Kroah-Hartman
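A simplified sketch of the deferral logic in note_interrupt(); a high bit of the handled bookkeeping marks "deferred check armed" (details abbreviated from the upstream change):

    #define SPURIOUS_DEFERRED       0x80000000

    if (action_ret == IRQ_WAKE_THREAD) {
            int handled;

            /* first hard interrupt: arm the deferred check and bail */
            if (!(desc->threads_handled_last & SPURIOUS_DEFERRED)) {
                    desc->threads_handled_last |= SPURIOUS_DEFERRED;
                    return;
            }
            /* deferred check: did any thread handle something since
             * the last hard interrupt? */
            handled = atomic_read(&desc->threads_handled);
            handled |= SPURIOUS_DEFERRED;
            if (handled != desc->threads_handled_last) {
                    action_ret = IRQ_HANDLED;
                    desc->threads_handled_last = handled;
            } else {
                    action_ret = IRQ_NONE;
            }
    }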
-
commit 4e52365f279564cef0ddd41db5237f0471381093 upstream.
When tracing a process in another pid namespace, it's important for fork
event messages to contain the child's pid as seen from the tracer's pid
namespace, not the parent's. Otherwise, the tracer won't be able to
correlate the fork event with later SIGTRAP signals it receives from the
child.
We still risk a race condition if a ptracer from a different pid
namespace attaches after we compute the pid_t value. However, sending a
bogus fork event message in this unlikely scenario is still a vast
improvement over the status quo where we always send bogus fork event
messages to debuggers in a different pid namespace than the forking
process.
Signed-off-by: Matthew Dempsky
Acked-by: Oleg Nesterov
Cc: Kees Cook
Cc: Julien Tinnes
Cc: Roland McGrath
Cc: Jan Kratochvil
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
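The helper the fix introduces, sketched (translating the child's pid into the ptracer's namespace before emitting the event):

    static void ptrace_event_pid(int event, struct pid *pid)
    {
            unsigned long message = 0;
            struct pid_namespace *ns;

            rcu_read_lock();
            ns = task_active_pid_ns(rcu_dereference(current->parent));
            if (ns)
                    message = pid_nr_ns(pid, ns);
            rcu_read_unlock();

            ptrace_event(event, message);
    }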
27 Jun, 2014
2 commits
-
commit 0e576acbc1d9600cf2d9b4a141a2554639959d50 upstream.
If CONFIG_NO_HZ=n tick_nohz_get_sleep_length() returns NSEC_PER_SEC/HZ.
If CONFIG_NO_HZ=y and the nohz functionality is disabled via the
command line option "nohz=off" or not enabled due to missing hardware
support, then tick_nohz_get_sleep_length() returns 0. That happens
because ts->sleep_length is never set in that case.
Set it to NSEC_PER_SEC/HZ when the NOHZ mode is inactive.
Reported-by: Michal Hocko
Reported-by: Borislav Petkov
Signed-off-by: Thomas Gleixner
Cc: Rui Xiang
Signed-off-by: Greg Kroah-Hartman
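The fix, sketched (kernel/time/tick-sched.c; ktime initializer per the 3.x-era API):

    /* in can_stop_idle_tick() */
    if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE)) {
            ts->sleep_length = (ktime_t) { .tv64 = NSEC_PER_SEC/HZ };
            return false;
    }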
-
[ Upstream commit 90f62cf30a78721641e08737bda787552428061e ]
It is possible by passing a netlink socket to a more privileged
executable and then to fool that executable into writing to the socket
data that happens to be valid netlink message to do something that
privileged executable did not intend to do.
To keep this from happening replace bare capable and ns_capable calls
with netlink_capable, netlink_net_capable and netlink_ns_capable calls,
which act the same as the previous calls except they verify that the
opener of the socket had the desired permissions as well.
Reported-by: Andy Lutomirski
Signed-off-by: "Eric W. Biederman"
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
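Usage-wise the conversion looks like this (a sketch; CAP_NET_ADMIN chosen as an example):

    /* before: trusts whoever managed to write to the socket */
    if (!capable(CAP_NET_ADMIN))
            return -EPERM;

    /* after: additionally requires that the socket's opener
     * had the capability */
    if (!netlink_capable(skb, CAP_NET_ADMIN))
            return -EPERM;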
17 Jun, 2014
2 commits
-
commit a3c54931199565930d6d84f4c3456f6440aefd41 upstream.
Fixes an easy DoS and possible information disclosure.
This does nothing about the broken state of x32 auditing.
eparis: If the admin has enabled auditd and has specifically loaded
audit rules. This bug has been around since before git. Wow...
Signed-off-by: Andy Lutomirski
Signed-off-by: Eric Paris
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
-
commit 23adbe12ef7d3d4195e80800ab36b37bee28cd03 upstream.
The kernel has no concept of capabilities with respect to inodes; inodes
exist independently of namespaces. For example, inode_capable(inode,
CAP_LINUX_IMMUTABLE) would be nonsense.
This patch changes inode_capable to check for uid and gid mappings and
renames it to capable_wrt_inode_uidgid, which should make it more
obvious what it does.
Fixes CVE-2014-4014.
Cc: Theodore Ts'o
Cc: Serge Hallyn
Cc: "Eric W. Biederman"
Cc: Dave Chinner
Signed-off-by: Andy Lutomirski
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
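The renamed helper, sketched (kernel/capability.c):

    bool capable_wrt_inode_uidgid(const struct inode *inode, int cap)
    {
            struct user_namespace *ns = current_user_ns();

            return ns_capable(ns, cap) &&
                   kuid_has_mapping(ns, inode->i_uid) &&
                   kgid_has_mapping(ns, inode->i_gid);
    }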
12 Jun, 2014
1 commit
-
commit 723478c8a471403c53cf144999701f6e0c4bbd11 upstream.
/proc/sys/kernel/perf_event_max_sample_rate will accept
negative values as well as 0.
Negative values are unreasonable, and 0 causes a
divide by zero exception in perf_proc_update_handler.
This patch enforces a lower limit of 1.
Signed-off-by: Knut Petersen
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/5242DB0C.4070005@t-online.de
Signed-off-by: Ingo Molnar
Cc: Weng Meiling
Signed-off-by: Greg Kroah-Hartman
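The limit is enforced through the sysctl table's minimum; a sketch of the kernel/sysctl.c entry (the handler delegates to proc_dointvec_minmax, which honors extra1):

    static int one = 1;

    {
            .procname       = "perf_event_max_sample_rate",
            .data           = &sysctl_perf_event_sample_rate,
            .maxlen         = sizeof(sysctl_perf_event_sample_rate),
            .mode           = 0644,
            .proc_handler   = perf_proc_update_handler,
            .extra1         = &one, /* reject 0 and negative values */
    },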