26 Sep, 2014

3 commits

  • commit 664eeddeef6539247691197c1ac124d4aa872ab6 upstream.

    If cpusets are not in use then we still check a global variable on every
    page allocation. Use jump labels to avoid the overhead.
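
    As an illustrative sketch of the jump-label pattern (the key and helper
    names are assumptions, not necessarily the exact hunk):

        #include <linux/jump_label.h>

        extern struct static_key cpusets_enabled_key;

        static inline bool cpusets_enabled(void)
        {
                /* Compiles to a patched no-op while the key is unset, so the
                 * page allocator pays nothing when cpusets are unused. */
                return static_key_false(&cpusets_enabled_key);
        }

        /* Flipped once the first real cpuset is created, e.g.:
         *      static_key_slow_inc(&cpusets_enabled_key);
         */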

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit 615d6e8756c87149f2d4c1b93d471bca002bd849 upstream.

    This patch is a continuation of efforts trying to optimize find_vma(),
    avoiding potentially expensive rbtree walks to locate a vma upon faults.
    The original approach (https://lkml.org/lkml/2013/11/1/410), where the
    largest vma was also cached, ended up being too specific and random, so
    further comparison with other approaches was needed. There are two
    things to consider here: the cache hit rate and the latency of
    find_vma(). Improving the hit rate does not necessarily translate into
    finding the vma any faster, as the overhead of a fancy caching scheme
    can be too high to be worthwhile.

    We currently cache the last used vma for the whole address space, which
    provides a nice optimization, reducing the total cycles in find_vma() by
    up to 250%, for workloads with good locality. On the other hand, this
    simple scheme is pretty much useless for workloads with poor locality.
    Analyzing ebizzy runs shows that, no matter how many threads are
    running, the mmap_cache hit rate is less than 2%, and in many situations
    below 1%.

    The proposed approach is to replace this scheme with a small per-thread
    cache, maximizing hit rates at a very low maintenance cost.
    Invalidations are performed by simply bumping up a 32-bit sequence
    number. The only expensive operation is in the rare case of a seq
    number overflow, where all caches that share the same address space are
    flushed. Upon a miss, the proposed replacement policy is based on the
    page number that contains the virtual address in question. Concretely,
    the following results are seen on an 80 core, 8 socket x86-64 box:

    1) System bootup: Most programs are single threaded, so the per-thread
    scheme does improve ~50% hit rate by just adding a few more slots to
    the cache.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       | 50.61%   | 19.90            |
    | patched        | 73.45%   | 13.58            |
    +----------------+----------+------------------+

    2) Kernel build: This one is already pretty good with the current
    approach as we're dealing with good locality.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       | 75.28%   | 11.03            |
    | patched        | 88.09%   | 9.31             |
    +----------------+----------+------------------+

    3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       | 70.66%   | 17.14            |
    | patched        | 91.15%   | 12.57            |
    +----------------+----------+------------------+

    4) Ebizzy: There's a fair amount of variation from run to run, but this
    approach always shows nearly perfect hit rates, while baseline is just
    about non-existent. The amounts of cycles can fluctuate between
    anywhere from ~60 to ~116 for the baseline scheme, but this approach
    reduces it considerably. For instance, with 80 threads:

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       | 1.06%    | 91.54            |
    | patched        | 99.97%   | 14.18            |
    +----------------+----------+------------------+
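
    As a rough sketch of the per-thread cache scheme described above (the
    size, field and helper names here are illustrative, not the exact
    upstream code):

        #define VMACACHE_SIZE 4         /* a few slots per thread */

        /* Pick a slot from the page number containing the address. */
        static inline int vmacache_hash(unsigned long addr)
        {
                return (addr >> PAGE_SHIFT) & (VMACACHE_SIZE - 1);
        }

        /* Invalidation: bump the mm-wide 32-bit sequence number; a thread's
         * cached entries are only trusted while its private copy matches.
         * The rare overflow case flushes every cache sharing this mm. */
        static inline void vmacache_invalidate(struct mm_struct *mm)
        {
                mm->vmacache_seqnum++;
        }

        static inline bool vmacache_valid(struct mm_struct *mm)
        {
                return current->mm == mm &&
                       current->vmacache_seqnum == mm->vmacache_seqnum;
        }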

    [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
    [akpm@linux-foundation.org: document vmacache_valid() logic]
    [akpm@linux-foundation.org: attempt to untangle header files]
    [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
    [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: adjust and enhance comments]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Acked-by: Linus Torvalds
    Reviewed-by: Michel Lespinasse
    Cc: Oleg Nesterov
    Tested-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Davidlohr Bueso
     
  • commit d26914d11751b23ca2e8747725f2cae10c2f2c1b upstream.

    Since put_mems_allowed() is strictly optional (it is a seqcount retry),
    we don't need to evaluate the function if the allocation was in fact
    successful, saving an smp_rmb(), some loads and comparisons on some
    relatively fast paths.

    Since the naming of get/put_mems_allowed() suggests a mandatory pairing,
    rename the interface, as suggested by Mel, to resemble the seqcount
    interface.

    This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(),
    where it is important to note that the return value of the latter call
    is inverted from its previous incarnation.
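
    A minimal sketch of the resulting usage pattern (try_to_allocate() is a
    stand-in for the real allocation fast path):

        unsigned int cpuset_mems_cookie;
        struct page *page;

        do {
                cpuset_mems_cookie = read_mems_allowed_begin();
                page = try_to_allocate(gfp_mask, order);    /* stand-in */
                /* Only retry when the allocation failed *and* the allowed
                 * mems changed underneath us (note the inverted return). */
        } while (!page && read_mems_allowed_retry(cpuset_mems_cookie));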

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     

17 Sep, 2014

4 commits

  • commit 021de3d904b88b1771a3a2cfc5b75023c391e646 upstream.

    After writing a test to try to trigger the bug that caused the
    ring buffer iterator to become corrupted, I hit another bug:

    WARNING: CPU: 1 PID: 5281 at kernel/trace/ring_buffer.c:3766 rb_iter_peek+0x113/0x238()
    Modules linked in: ipt_MASQUERADE sunrpc [...]
    CPU: 1 PID: 5281 Comm: grep Tainted: G W 3.16.0-rc3-test+ #143
    Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
    0000000000000000 ffffffff81809a80 ffffffff81503fb0 0000000000000000
    ffffffff81040ca1 ffff8800796d6010 ffffffff810c138d ffff8800796d6010
    ffff880077438c80 ffff8800796d6010 ffff88007abbe600 0000000000000003
    Call Trace:
    [] ? dump_stack+0x4a/0x75
    [] ? warn_slowpath_common+0x7e/0x97
    [] ? rb_iter_peek+0x113/0x238
    [] ? rb_iter_peek+0x113/0x238
    [] ? ring_buffer_iter_peek+0x2d/0x5c
    [] ? tracing_iter_reset+0x6e/0x96
    [] ? s_start+0xd7/0x17b
    [] ? kmem_cache_alloc_trace+0xda/0xea
    [] ? seq_read+0x148/0x361
    [] ? vfs_read+0x93/0xf1
    [] ? SyS_read+0x60/0x8e
    [] ? tracesys+0xdd/0xe2

    Debugging this bug, which triggers when the rb_iter_peek() loops too
    many times (more than 2 times), I discovered there's a case that can
    cause that function to legitimately loop 3 times!

    rb_iter_peek() is different from rb_buffer_peek(), as rb_buffer_peek()
    only deals with the reader page (it's for consuming reads).
    rb_iter_peek() is for traversing the buffer without consuming it, and as
    such, it can loop for one more reason: if we hit the end of the reader
    page, or of any page, it will go to the next page and try again.

    That is, we have this:

    1. iter->head > iter->head_page->page->commit
    (rb_inc_iter() which moves the iter to the next page)
    try again

    2. event = rb_iter_head_event()
    event->type_len == RINGBUF_TYPE_TIME_EXTEND
    rb_advance_iter()
    try again

    3. read the event.

    But we never get to 3, because the count is greater than 2 and we
    cause the WARNING and return NULL.

    Up the counter to 3.

    Fixes: 69d1b839f7ee "ring-buffer: Bind time extend and data events together"
    Signed-off-by: Steven Rostedt
    Signed-off-by: Jiri Slaby

    Steven Rostedt (Red Hat)
     
  • commit 651e22f2701b4113989237c3048d17337dd2185c upstream.

    When performing a consuming read, the ring buffer swaps out a
    page from the ring buffer with an empty page, and the page that
    was swapped out becomes the new reader page. The reader page
    is owned by the reader, and since it was swapped out of the ring
    buffer, writers do not have access to it (there's an exception
    to that rule, but it's out of scope for this commit).

    When reading the "trace" file, it is a non-consuming read, which
    means that the data in the ring buffer will not be modified.
    When the trace file is opened, a ring buffer iterator is allocated
    and writes to the ring buffer are disabled, so that the iterator
    will not have issues iterating over the data.

    Although the ring buffer has writes disabled, it does not disable other
    reads, or even consuming reads. If a consuming read happens, then
    the iterator is reset and starts reading from the beginning again.

    My tests would sometimes trigger this bug on my i386 box:

    WARNING: CPU: 0 PID: 5175 at kernel/trace/trace.c:1527 __trace_find_cmdline+0x66/0xaa()
    Modules linked in:
    CPU: 0 PID: 5175 Comm: grep Not tainted 3.16.0-rc3-test+ #8
    Hardware name: /DG965MQ, BIOS MQ96510J.86A.0372.2006.0605.1717 06/05/2006
    00000000 00000000 f09c9e1c c18796b3 c1b5d74c f09c9e4c c103a0e3 c1b5154b
    f09c9e78 00001437 c1b5d74c 000005f7 c10bd85a c10bd85a c1cac57c f09c9eb0
    ed0e0000 f09c9e64 c103a185 00000009 f09c9e5c c1b5154b f09c9e78 f09c9e80^M
    Call Trace:
    [] dump_stack+0x4b/0x75
    [] warn_slowpath_common+0x7e/0x95
    [] ? __trace_find_cmdline+0x66/0xaa
    [] ? __trace_find_cmdline+0x66/0xaa
    [] warn_slowpath_fmt+0x33/0x35
    [] __trace_find_cmdline+0x66/0xaa^M
    [] trace_find_cmdline+0x40/0x64
    [] trace_print_context+0x27/0xec
    [] ? trace_seq_printf+0x37/0x5b
    [] print_trace_line+0x319/0x39b
    [] ? ring_buffer_read+0x47/0x50
    [] s_show+0x192/0x1ab
    [] ? s_next+0x5a/0x7c
    [] seq_read+0x267/0x34c
    [] vfs_read+0x8c/0xef
    [] ? seq_lseek+0x154/0x154
    [] SyS_read+0x54/0x7f
    [] syscall_call+0x7/0xb
    ---[ end trace 3f507febd6b4cc83 ]---
    >>>> ##### CPU 1 buffer started ####

    Which was the __trace_find_cmdline() function complaining about the pid
    in the event record being negative.

    After adding more test cases, this would trigger more often. Strangely
    enough, it would never trigger on a single test, but instead would trigger
    only when running all the tests. I believe that was the case because it
    required one of the tests to be shutting down via delayed instances while
    a new test started up.

    After spending several days debugging this, I found that it was caused by
    the iterator becoming corrupted. Debugging further, I found out why
    the iterator became corrupted: it happened in rb_iter_reset().

    As consuming reads may read only part of the reader page rather than all
    of it, there's a "read" field to record where the last read took place.
    The iterator must also start at that read position. In the rb_iter_reset()
    code, if the reader page was disconnected from the ring buffer, the iterator
    would start at the head page within the ring buffer (where writes still
    happen). The mistake there was that it still used the "read" field
    to start the iterator on the head page, where it should always start
    at zero, because readers never read from within the ring buffer where
    writes occur.

    I originally wrote a patch to have it set the iter->head to 0 instead
    of iter->head_page->read, but then I questioned why it wasn't always
    setting the iter to point to the reader page, as the reader page is
    still valid. The list_empty(reader_page->list) just means that it was
    successful in swapping out. But the reader_page may still have data.

    There was a bug report a long time ago, which was not reproducible, about
    trace_pipe (consuming read) not matching trace (iterator read). This may
    explain why that happened.

    Anyway, the correct answer to this bug is to always use the reader page
    and not reset the iterator to inside the writable ring buffer.
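
    A hedged sketch of the resulting reset (field names follow the commit
    text above; the exact hunk may differ):

        /* Always start the iterator on the reader page, at the position the
         * last consuming read left off; never point it into the part of the
         * ring buffer that writers still touch. */
        iter->head_page = cpu_buffer->reader_page;
        iter->head = cpu_buffer->reader_page->read;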

    Fixes: d769041f8653 "ring_buffer: implement new locking"
    Signed-off-by: Steven Rostedt
    Signed-off-by: Jiri Slaby

    Steven Rostedt (Red Hat)
     
  • commit 618fde872163e782183ce574c77f1123e2be8887 upstream.

    The rarely-executed memory-allocation-failed callback path generates a
    WARN_ON_ONCE() when smp_call_function_single() succeeds. Presumably
    it's supposed to warn on failures.
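
    In other words, the warning should key off a non-zero return, since
    smp_call_function_single() returns 0 on success (an illustrative sketch,
    not the exact hunk):

        /* warn if the cross-CPU call could not be issued, not when it worked */
        WARN_ON_ONCE(smp_call_function_single(cpu, func, info, wait));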

    Signed-off-by: Sasha Levin
    Cc: Christoph Lameter
    Cc: Gilad Ben-Yossef
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Sasha Levin
     
  • commit 7d8b6c63751cfbbe5eef81a48c22978b3407a3ad upstream.

    This is effectively a revert of 7b9a7ec565505699f503b4fcf61500dceb36e744
    plus fixing it a different way...

    We found, when trying to run an application from an application which
    had dropped privs, that the kernel does security checks on undefined
    capability bits. This was ESPECIALLY difficult to debug as those
    undefined bits are hidden from /proc/$PID/status.

    Consider a root application which drops all capabilities from ALL 4
    capability sets. We assume that, since the application is going to set
    eff/perm/inh from an array, it will clear not only the defined caps
    less than CAP_LAST_CAP, but also the higher 28-ish bits which are
    undefined future capabilities.

    The BSET gets cleared differently. Instead it is cleared one bit at a
    time. The problem here is that in security/commoncap.c::cap_task_prctl()
    we actually check the validity of a capability being read. So any task
    which attempts to 'read all things set in bset' followed by 'unset all
    things set in bset' will not even attempt to unset the undefined bits
    higher than CAP_LAST_CAP.

    So the 'parent' will look something like:
    CapInh: 0000000000000000
    CapPrm: 0000000000000000
    CapEff: 0000000000000000
    CapBnd: ffffffc000000000

    All of this 'should' be fine. Given that these are undefined bits that
    aren't supposed to have anything to do with permissions. But they do...

    So let's now consider a task which cleared the eff/perm/inh completely
    and cleared all of the valid caps in the bset (but not the invalid caps
    it couldn't read out of the kernel). We know that this is exactly what
    the libcap-ng library does and what the go capabilities library does.
    They both leave you in that above situation if you try to clear all of
    your capabilities from all 4 sets. If that root task calls execve()
    the child task will pick up all caps not blocked by the bset. The bset
    however does not block bits higher than CAP_LAST_CAP. So now the child
    task has bits in eff which are not in the parent. These are
    'meaningless' undefined bits, but still bits which the parent doesn't
    have.

    The problem is now in cred_cap_issubset() (or any operation which does a
    subset test), as the child, while a subset for valid cap bits, is not a
    subset for invalid cap bits! So now we mark, during commit_creds, that
    the child is not dumpable, given it is 'more priv' than its parent. It
    also means the parent cannot ptrace the child, and other stupidity.

    The solution here:
    1) stop hiding capability bits in status
    This makes debugging easier!

    2) stop giving any task undefined capability bits. It's simple: if you
    don't put those invalid bits in CAP_FULL_SET you won't get them in init
    and you won't get them in any other task either.
    This fixes the cap_issubset() tests and resulting fallout (which
    made the init task in a docker container untraceable among other
    things)

    3) mask out undefined bits when sys_capset() is called as it might use
    ~0, ~0 to denote 'all capabilities' for backward/forward compatibility.
    This lets 'capsh --caps="all=eip" -- -c /bin/bash' run.

    4) mask out undefined bits when we read a file capability off of disk, as
    again likely all bits are set in the xattr for forward/backward
    compatibility.
    This lets 'setcap all+pe /bin/bash; /bin/bash' run
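
    A small, runnable illustration of the masking idea in points 3) and 4)
    (the CAP_LAST_CAP value is a stand-in; the kernel header defines the
    real one):

        #include <stdint.h>
        #include <stdio.h>

        #define CAP_LAST_CAP 36   /* stand-in value for illustration */

        int main(void)
        {
                uint64_t caps = ~0ULL;  /* "all capabilities" from userspace */
                uint64_t defined = (1ULL << (CAP_LAST_CAP + 1)) - 1;

                caps &= defined;        /* drop undefined future bits */
                printf("masked caps: %#llx\n", (unsigned long long)caps);
                return 0;
        }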

    Signed-off-by: Eric Paris
    Reviewed-by: Kees Cook
    Cc: Andrew Vagin
    Cc: Andrew G. Morgan
    Cc: Serge E. Hallyn
    Cc: Kees Cook
    Cc: Steve Grubb
    Cc: Dan Walsh
    Signed-off-by: James Morris
    Signed-off-by: Jiri Slaby

    Eric Paris
     

16 Sep, 2014

1 commit

  • commit 13c42c2f43b19aab3195f2d357db00d1e885eaa8 upstream.

    futex_wait_requeue_pi() calls futex_wait_setup(). If
    futex_wait_setup() succeeds it returns with hb->lock held and
    preemption disabled. Now the sanity check after this does:

        if (match_futex(&q.key, &key2)) {
                ret = -EINVAL;
                goto out_put_keys;
        }

    which releases the keys but does not release hb->lock.

    So we happily return to user space with hb->lock held and therefore
    preemption disabled.

    Unlock hb->lock before taking the exit route.
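
    A hedged sketch of the shape of the fix (futex.c may use an unlock
    helper rather than a raw spin_unlock on hb->lock):

        if (match_futex(&q.key, &key2)) {
                /* drop the hash-bucket lock before bailing out so we do not
                 * return to user space with hb->lock held */
                spin_unlock(&hb->lock);
                ret = -EINVAL;
                goto out_put_keys;
        }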

    Reported-by: Dave "Trinity" Jones
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Darren Hart
    Reviewed-by: Davidlohr Bueso
    Cc: Peter Zijlstra
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1409112318500.4178@nanos
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Jiri Slaby

    Thomas Gleixner
     

02 Sep, 2014

1 commit

  • commit 84c91b7ae07c62cf6dee7fde3277f4be21331f85 upstream.

    When the machine does not keep the e820 map consistent across a
    hibernation resume, a page fault may occur when writing the image to the
    snapshot buffer:

    [ 17.929495] BUG: unable to handle kernel paging request at ffff880069d4f000
    [ 17.933469] IP: [] load_image_lzo+0x810/0xe40
    [ 17.933469] PGD 2194067 PUD 77ffff067 PMD 2197067 PTE 0
    [ 17.933469] Oops: 0002 [#1] SMP
    ...

    The ffff880069d4f000 page is in an e820 reserved region of the resume
    boot kernel:

    [ 0.000000] BIOS-e820: [mem 0x0000000069d4f000-0x0000000069e12fff] reserved
    ...
    [ 0.000000] PM: Registered nosave memory: [mem 0x69d4f000-0x69e12fff]

    So snapshot.c marks the pfn in the forbidden pages map. But this
    page is also in the memory bitmap of the snapshot image, because it is an
    original page used by the image kernel, so it will also be marked as an
    unsafe (free) page in prepare_image().

    That means the page in e820 is marked both "forbidden" and "free" when
    resuming, which causes get_buffer() to treat it as an allocated unsafe
    page. snapshot_write_next() then returns this page to load_image, which
    writes content to this address, but the page was never really allocated.
    So we get a page fault.

    Although the root cause is in the BIOS, an aggressive check and a clear
    message from the kernel are better than a page fault for tracking the
    issue down, especially when a serial console is unavailable.

    This patch adds code to mark_unsafe_pages() to check whether any free
    page falls in a nosave region. If so, it prints a message and returns a
    fault to stop the whole S4 resume process:

    [ 8.166004] PM: Image loading progress: 0%
    [ 8.658717] PM: 0x6796c000 in e820 nosave region: [mem 0x6796c000-0x6796cfff]
    [ 8.918737] PM: Read 2511940 kbytes in 1.04 seconds (2415.32 MB/s)
    [ 8.926633] PM: Error -14 resuming
    [ 8.933534] PM: Failed to load hibernation image, recovering.

    Reviewed-by: Takashi Iwai
    Acked-by: Pavel Machek
    Signed-off-by: Lee, Chun-Yi
    [rjw: Subject]
    Signed-off-by: Rafael J. Wysocki

    Signed-off-by: Jiri Slaby

    Lee, Chun-Yi
     

19 Aug, 2014

2 commits

  • commit 504d58745c9ca28d33572e2d8a9990b43e06075d upstream.

    clockevents_increase_min_delta() calls printk() from under
    hrtimer_bases.lock. That causes lock inversion on scheduler locks because
    printk() can call into the scheduler. Lockdep puts it as:

    ======================================================
    [ INFO: possible circular locking dependency detected ]
    3.15.0-rc8-06195-g939f04b #2 Not tainted
    -------------------------------------------------------
    trinity-main/74 is trying to acquire lock:
    (&port_lock_key){-.....}, at: [] serial8250_console_write+0x8c/0x10c

    but task is already holding lock:
    (hrtimer_bases.lock){-.-...}, at: [] hrtimer_try_to_cancel+0x13/0x66

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #5 (hrtimer_bases.lock){-.-...}:
    [] lock_acquire+0x92/0x101
    [] _raw_spin_lock_irqsave+0x2e/0x3e
    [] __hrtimer_start_range_ns+0x1c/0x197
    [] perf_swevent_start_hrtimer.part.41+0x7a/0x85
    [] task_clock_event_start+0x3a/0x3f
    [] task_clock_event_add+0xd/0x14
    [] event_sched_in+0xb6/0x17a
    [] group_sched_in+0x44/0x122
    [] ctx_sched_in.isra.67+0x105/0x11f
    [] perf_event_sched_in.isra.70+0x47/0x4b
    [] __perf_install_in_context+0x8b/0xa3
    [] remote_function+0x12/0x2a
    [] smp_call_function_single+0x2d/0x53
    [] task_function_call+0x30/0x36
    [] perf_install_in_context+0x87/0xbb
    [] SYSC_perf_event_open+0x5c6/0x701
    [] SyS_perf_event_open+0x17/0x19
    [] syscall_call+0x7/0xb

    -> #4 (&ctx->lock){......}:
    [] lock_acquire+0x92/0x101
    [] _raw_spin_lock+0x21/0x30
    [] __perf_event_task_sched_out+0x1dc/0x34f
    [] __schedule+0x4c6/0x4cb
    [] schedule+0xf/0x11
    [] work_resched+0x5/0x30

    -> #3 (&rq->lock){-.-.-.}:
    [] lock_acquire+0x92/0x101
    [] _raw_spin_lock+0x21/0x30
    [] __task_rq_lock+0x33/0x3a
    [] wake_up_new_task+0x25/0xc2
    [] do_fork+0x15c/0x2a0
    [] kernel_thread+0x1a/0x1f
    [] rest_init+0x1a/0x10e
    [] start_kernel+0x303/0x308
    [] i386_start_kernel+0x79/0x7d

    -> #2 (&p->pi_lock){-.-...}:
    [] lock_acquire+0x92/0x101
    [] _raw_spin_lock_irqsave+0x2e/0x3e
    [] try_to_wake_up+0x1d/0xd6
    [] default_wake_function+0xb/0xd
    [] __wake_up_common+0x39/0x59
    [] __wake_up+0x29/0x3b
    [] tty_wakeup+0x49/0x51
    [] uart_write_wakeup+0x17/0x19
    [] serial8250_tx_chars+0xbc/0xfb
    [] serial8250_handle_irq+0x54/0x6a
    [] serial8250_default_handle_irq+0x19/0x1c
    [] serial8250_interrupt+0x38/0x9e
    [] handle_irq_event_percpu+0x5f/0x1e2
    [] handle_irq_event+0x2c/0x43
    [] handle_level_irq+0x57/0x80
    [] handle_irq+0x46/0x5c
    [] do_IRQ+0x32/0x89
    [] common_interrupt+0x2e/0x33
    [] _raw_spin_unlock_irqrestore+0x3f/0x49
    [] uart_start+0x2d/0x32
    [] uart_write+0xc7/0xd6
    [] n_tty_write+0xb8/0x35e
    [] tty_write+0x163/0x1e4
    [] redirected_tty_write+0x6d/0x75
    [] vfs_write+0x75/0xb0
    [] SyS_write+0x44/0x77
    [] syscall_call+0x7/0xb

    -> #1 (&tty->write_wait){-.....}:
    [] lock_acquire+0x92/0x101
    [] _raw_spin_lock_irqsave+0x2e/0x3e
    [] __wake_up+0x15/0x3b
    [] tty_wakeup+0x49/0x51
    [] uart_write_wakeup+0x17/0x19
    [] serial8250_tx_chars+0xbc/0xfb
    [] serial8250_handle_irq+0x54/0x6a
    [] serial8250_default_handle_irq+0x19/0x1c
    [] serial8250_interrupt+0x38/0x9e
    [] handle_irq_event_percpu+0x5f/0x1e2
    [] handle_irq_event+0x2c/0x43
    [] handle_level_irq+0x57/0x80
    [] handle_irq+0x46/0x5c
    [] do_IRQ+0x32/0x89
    [] common_interrupt+0x2e/0x33
    [] _raw_spin_unlock_irqrestore+0x3f/0x49
    [] uart_start+0x2d/0x32
    [] uart_write+0xc7/0xd6
    [] n_tty_write+0xb8/0x35e
    [] tty_write+0x163/0x1e4
    [] redirected_tty_write+0x6d/0x75
    [] vfs_write+0x75/0xb0
    [] SyS_write+0x44/0x77
    [] syscall_call+0x7/0xb

    -> #0 (&port_lock_key){-.....}:
    [] __lock_acquire+0x9ea/0xc6d
    [] lock_acquire+0x92/0x101
    [] _raw_spin_lock_irqsave+0x2e/0x3e
    [] serial8250_console_write+0x8c/0x10c
    [] call_console_drivers.constprop.31+0x87/0x118
    [] console_unlock+0x1d7/0x398
    [] vprintk_emit+0x3da/0x3e4
    [] printk+0x17/0x19
    [] clockevents_program_min_delta+0x104/0x116
    [] clockevents_program_event+0xe7/0xf3
    [] tick_program_event+0x1e/0x23
    [] hrtimer_force_reprogram+0x88/0x8f
    [] __remove_hrtimer+0x5b/0x79
    [] hrtimer_try_to_cancel+0x49/0x66
    [] hrtimer_cancel+0xd/0x18
    [] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
    [] task_clock_event_stop+0x20/0x64
    [] task_clock_event_del+0xd/0xf
    [] event_sched_out+0xab/0x11e
    [] group_sched_out+0x1d/0x66
    [] ctx_sched_out+0xaf/0xbf
    [] __perf_event_task_sched_out+0x1ed/0x34f
    [] __schedule+0x4c6/0x4cb
    [] schedule+0xf/0x11
    [] work_resched+0x5/0x30

    other info that might help us debug this:

    Chain exists of:
    &port_lock_key --> &ctx->lock --> hrtimer_bases.lock

    Possible unsafe locking scenario:

           CPU0                         CPU1
           ----                         ----
      lock(hrtimer_bases.lock);
                                        lock(&ctx->lock);
                                        lock(hrtimer_bases.lock);
      lock(&port_lock_key);

    *** DEADLOCK ***

    4 locks held by trinity-main/74:
    #0: (&rq->lock){-.-.-.}, at: [] __schedule+0xed/0x4cb
    #1: (&ctx->lock){......}, at: [] __perf_event_task_sched_out+0x1dc/0x34f
    #2: (hrtimer_bases.lock){-.-...}, at: [] hrtimer_try_to_cancel+0x13/0x66
    #3: (console_lock){+.+...}, at: [] vprintk_emit+0x3c7/0x3e4

    stack backtrace:
    CPU: 0 PID: 74 Comm: trinity-main Not tainted 3.15.0-rc8-06195-g939f04b #2
    00000000 81c3a310 8b995c14 81426f69 8b995c44 81425a99 8161f671 8161f570
    8161f538 8161f559 8161f538 8b995c78 8b142bb0 00000004 8b142fdc 8b142bb0
    8b995ca8 8104a62d 8b142fac 000016f2 81c3a310 00000001 00000001 00000003
    Call Trace:
    [] dump_stack+0x16/0x18
    [] print_circular_bug+0x18f/0x19c
    [] __lock_acquire+0x9ea/0xc6d
    [] lock_acquire+0x92/0x101
    [] ? serial8250_console_write+0x8c/0x10c
    [] ? wait_for_xmitr+0x76/0x76
    [] _raw_spin_lock_irqsave+0x2e/0x3e
    [] ? serial8250_console_write+0x8c/0x10c
    [] serial8250_console_write+0x8c/0x10c
    [] ? lock_release+0x191/0x223
    [] ? wait_for_xmitr+0x76/0x76
    [] call_console_drivers.constprop.31+0x87/0x118
    [] console_unlock+0x1d7/0x398
    [] vprintk_emit+0x3da/0x3e4
    [] printk+0x17/0x19
    [] clockevents_program_min_delta+0x104/0x116
    [] tick_program_event+0x1e/0x23
    [] hrtimer_force_reprogram+0x88/0x8f
    [] __remove_hrtimer+0x5b/0x79
    [] hrtimer_try_to_cancel+0x49/0x66
    [] hrtimer_cancel+0xd/0x18
    [] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
    [] task_clock_event_stop+0x20/0x64
    [] task_clock_event_del+0xd/0xf
    [] event_sched_out+0xab/0x11e
    [] group_sched_out+0x1d/0x66
    [] ctx_sched_out+0xaf/0xbf
    [] __perf_event_task_sched_out+0x1ed/0x34f
    [] ? __dequeue_entity+0x23/0x27
    [] ? pick_next_task_fair+0xb1/0x120
    [] __schedule+0x4c6/0x4cb
    [] ? trace_hardirqs_off_caller+0xd7/0x108
    [] ? trace_hardirqs_off+0xb/0xd
    [] ? rcu_irq_exit+0x64/0x77

    Fix the problem by using printk_deferred() which does not call into the
    scheduler.
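
    Illustratively, the call site changes along these lines (the message
    text here is a placeholder, not the exact string in the kernel):

        /* safe under hrtimer_bases.lock: the message is flushed later,
         * outside the scheduler and timer locks */
        printk_deferred(KERN_WARNING "CE: increased min_delta_ns to %llu nsec\n",
                        (unsigned long long)dev->min_delta_ns);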

    Reported-by: Fengguang Wu
    Signed-off-by: Jan Kara
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Jiri Slaby

    Jan Kara
     
  • commit aac74dc495456412c4130a1167ce4beb6c1f0b38 upstream.

    After learning we'll need some sort of deferred printk functionality in
    the timekeeping core, Peter suggested we rename the printk_sched function
    so it can be reused by the subsystems that need it.

    This only changes the function name. No logic changes.

    Signed-off-by: John Stultz
    Reviewed-by: Steven Rostedt
    Cc: Jan Kara
    Cc: Peter Zijlstra
    Cc: Jiri Bohac
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    John Stultz
     

31 Jul, 2014

1 commit

  • commit 58d4e21e50ff3cc57910a8abc20d7e14375d2f61 upstream.

    The "uptime" trace clock added in:

    commit 8aacf017b065a805d27467843490c976835eb4a5
    tracing: Add "uptime" trace clock that uses jiffies

    has wraparound problems when the system has been up more
    than 1 hour 11 minutes and 34 seconds. It converts jiffies
    to nanoseconds using:
    (u64)jiffies_to_usecs(jiffy) * 1000ULL
    but since jiffies_to_usecs() only returns a 32-bit value, it
    truncates at 2^32 microseconds. An additional problem on 32-bit
    systems is that the argument is "unsigned long", so fixing the
    return value only helps until 2^32 jiffies (49.7 days on a HZ=1000
    system).

    Avoid these problems by using jiffies_64 as our basis, and
    not converting to nanoseconds (we do convert to clock_t because
    user facing API must not be dependent on internal kernel
    HZ values).
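
    A small, standalone illustration of why the old conversion wraps where
    it does (plain arithmetic, not kernel code):

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
                /* jiffies_to_usecs() returns a 32-bit value, so the usec
                 * count wraps at 2^32 regardless of the later (u64) cast. */
                uint64_t wrap_us = 1ULL << 32;
                uint64_t secs = wrap_us / 1000000;

                printf("wraps after %llu s = %lluh %llum %llus\n",
                       (unsigned long long)secs,
                       (unsigned long long)(secs / 3600),
                       (unsigned long long)(secs % 3600 / 60),
                       (unsigned long long)(secs % 60));
                return 0;
        }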

    Link: http://lkml.kernel.org/p/99d63c5bfe9b320a3b428d773825a37095bf6a51.1405708254.git.tony.luck@intel.com

    Fixes: 8aacf017b065 "tracing: Add "uptime" trace clock that uses jiffies"
    Signed-off-by: Tony Luck
    Signed-off-by: Steven Rostedt
    Signed-off-by: Jiri Slaby

    Tony Luck
     

29 Jul, 2014

8 commits

  • commit b0ab99e7736af88b8ac1b7ae50ea287fffa2badc upstream.

    proc_sched_show_task() does:

        if (nr_switches)
                do_div(avg_atom, nr_switches);

    nr_switches is unsigned long and do_div truncates it to 32 bits, which
    means it can test non-zero on e.g. x86-64 and be truncated to zero for
    division.

    Fix the problem by using div64_ul() instead.

    As a side effect calculations of avg_atom for big nr_switches are now correct.
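
    Illustratively, the division then looks like this (a sketch, not the
    exact hunk):

        if (nr_switches)
                /* 64-bit dividend, unsigned long divisor: no truncation */
                avg_atom = div64_ul(avg_atom, nr_switches);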

    Signed-off-by: Mateusz Guzik
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1402750809-31991-1-git-send-email-mguzik@redhat.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Jiri Slaby

    Mateusz Guzik
     
  • commit 4badad352a6bb202ec68afa7a574c0bb961e5ebc upstream.

    The optimistic spin code assumes regular stores and cmpxchg() play nice;
    this is found to not be true for at least: parisc, sparc32, tile32,
    metag-lock1, arc-!llsc and hexagon.

    There is further wreckage, but this in particular seemed easy to
    trigger, so blacklist this.

    Opt in for known good archs.

    Signed-off-by: Peter Zijlstra
    Reported-by: Mikulas Patocka
    Cc: David Miller
    Cc: Chris Metcalf
    Cc: James Bottomley
    Cc: Vineet Gupta
    Cc: Jason Low
    Cc: Waiman Long
    Cc: "James E.J. Bottomley"
    Cc: Paul McKenney
    Cc: John David Anglin
    Cc: James Hogan
    Cc: Linus Torvalds
    Cc: Davidlohr Bueso
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Russell King
    Cc: Will Deacon
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: sparclinux@vger.kernel.org
    Link: http://lkml.kernel.org/r/20140606175316.GV13930@laptop.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar
    Signed-off-by: Jiri Slaby

    Peter Zijlstra
     
  • commit 4320f6b1d9db4ca912c5eb6ecb328b2e090e1586 upstream.

    The commit [247bc037: PM / Sleep: Mitigate race between the freezer
    and request_firmware()] introduced the finer state control, but it
    also leads to a new bug; for example, a bug report regarding the
    firmware loading of intel BT device at suspend/resume:
    https://bugzilla.novell.com/show_bug.cgi?id=873790

    The root cause seems to be a small window between the process resume
    and the clear of usermodehelper lock. The request_firmware() function
    checks the UMH lock and gives up when it's in UMH_DISABLE state. This
    is for avoiding the invalid f/w loading during suspend/resume phase.
    The problem is, however, that usermodehelper_enable() is called at the
    end of thaw_processes(). Thus, a thawed process in between can kick
    off the f/w loader code path (in this case, via btusb_setup_intel())
    even before the call of usermodehelper_enable(). Then
    usermodehelper_read_trylock() returns an error and request_firmware()
    spews WARN_ON() in the end.

    This one-liner patch fixes the issue by simply setting the state back to
    UMH_FREEZING before restarting tasks, so that a call to
    request_firmware() will be blocked until the end of this function
    instead of returning an error.

    Fixes: 247bc0374254 (PM / Sleep: Mitigate race between the freezer and request_firmware())
    Link: https://bugzilla.novell.com/show_bug.cgi?id=873790
    Signed-off-by: Takashi Iwai
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Jiri Slaby

    Takashi Iwai
     
  • commit 16927776ae757d0d132bdbfabbfe2c498342bd59 upstream.

    Sharvil noticed that with the posix timer_settime() interface, using the
    CLOCK_REALTIME_ALARM or CLOCK_BOOTTIME_ALARM clockid, if the user
    tried to specify a relative timer, it would incorrectly be treated as
    absolute regardless of the state of the flags argument.

    This patch corrects this, properly checking the absolute/relative flag,
    and adds further error checking that no invalid flag bits are set.
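
    For reference, a small userspace example of the case that went wrong:
    with flags == 0 the timer below must fire 5 seconds from now, not at an
    absolute time (CLOCK_REALTIME is used so the example builds everywhere;
    the affected clocks were CLOCK_REALTIME_ALARM and CLOCK_BOOTTIME_ALARM):

        #include <signal.h>
        #include <stdio.h>
        #include <time.h>
        #include <unistd.h>

        int main(void)
        {
                timer_t tid;
                struct sigevent sev = { .sigev_notify = SIGEV_SIGNAL,
                                        .sigev_signo  = SIGALRM };
                struct itimerspec its = { .it_value = { .tv_sec = 5 } };

                if (timer_create(CLOCK_REALTIME, &sev, &tid) == -1) {
                        perror("timer_create");
                        return 1;
                }
                /* flags == 0: relative.  TIMER_ABSTIME would make it_value
                 * an absolute expiry time instead. */
                if (timer_settime(tid, 0, &its, NULL) == -1) {
                        perror("timer_settime");
                        return 1;
                }
                pause();        /* SIGALRM ends the process after ~5 s */
                return 0;
        }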

    Reported-by: Sharvil Nanavati
    Signed-off-by: John Stultz
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Prarit Bhargava
    Cc: Sharvil Nanavati
    Link: http://lkml.kernel.org/r/1404767171-6902-1-git-send-email-john.stultz@linaro.org
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Jiri Slaby

    John Stultz
     
  • commit 97b8ee845393701edc06e27ccec2876ff9596019 upstream.

    ring_buffer_poll_wait() should always add the poll_table to its wait_queue
    even when there is immediate data available. Otherwise, the following epoll
    and read sequence will eventually hang forever:

    1. Put some data to make the trace_pipe ring_buffer read ready first
    2. epoll_ctl(efd, EPOLL_CTL_ADD, trace_pipe_fd, ee)
    3. epoll_wait()
    4. read(trace_pipe_fd) till EAGAIN
    5. Add some more data to the trace_pipe ring_buffer
    6. epoll_wait() -> this epoll_wait() will block forever

    ~ During the epoll_ctl(efd, EPOLL_CTL_ADD,...) call in step 2,
    ring_buffer_poll_wait() returns immediately without adding poll_table,
    which has poll_table->_qproc pointing to ep_poll_callback(), to its
    wait_queue.
    ~ During the epoll_wait() call in step 3 and step 6,
    ring_buffer_poll_wait() cannot add ep_poll_callback() to its wait_queue
    because the poll_table->_qproc is NULL and it is how epoll works.
    ~ When there is new data available in step 6, the ring_buffer does not know
    it has to call ep_poll_callback() because it is not in its wait queue.
    Hence, it blocks forever.

    Other poll implementations seem to call poll_wait() unconditionally as the
    very first thing they do; for example, tcp_poll() in tcp.c.
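
    The usual shape of such a poll handler (an illustrative sketch;
    example_waitqueue and example_data_ready() are placeholders, not the
    tracing code itself):

        #include <linux/poll.h>

        static DECLARE_WAIT_QUEUE_HEAD(example_waitqueue);

        static bool example_data_ready(void);   /* placeholder */

        static unsigned int example_poll(struct file *filp, poll_table *wait)
        {
                unsigned int mask = 0;

                /* Register with the wait queue first, even if data is ready
                 * right now, so a later wakeup reaches ep_poll_callback(). */
                poll_wait(filp, &example_waitqueue, wait);

                if (example_data_ready())
                        mask |= POLLIN | POLLRDNORM;
                return mask;
        }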

    Link: http://lkml.kernel.org/p/20140610060637.GA14045@devbig242.prn2.facebook.com

    Fixes: 2a2cc8f7c4d0 "ftrace: allow the event pipe to be polled"
    Reviewed-by: Chris Mason
    Signed-off-by: Martin Lau
    Signed-off-by: Steven Rostedt
    Signed-off-by: Jiri Slaby

    Martin Lau
     
  • commit f0160a5a2912267c02cfe692eac955c360de5fdf upstream.

    The TRACE_ITER_PRINTK check in __trace_puts/__trace_bputs is missing,
    so add it, to be consistent with __trace_printk/__trace_bprintk.
    Those functions are all called by the same function: trace_printk().
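
    The added check is essentially of this shape (a sketch; TRACE_ITER_PRINTK
    is named above, while trace_flags is assumed to be the tracer's option
    bitmask):

        /* mirror __trace_printk(): do nothing when the printk option is off */
        if (!(trace_flags & TRACE_ITER_PRINTK))
                return 0;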

    Link: http://lkml.kernel.org/p/51E7A7D6.8090900@huawei.com

    Signed-off-by: zhangwei(Jovi)
    Signed-off-by: Steven Rostedt
    Signed-off-by: Jiri Slaby

    zhangwei(Jovi)
     
  • commit 8abfb8727f4a724d31f9ccfd8013fbd16d539445 upstream.

    Currently the stacktrace trace option is not applicable for
    trace_printk() with a constant string argument; the reason is that the
    ftrace_trace_stack() call is missing in __trace_puts/__trace_bputs.

    In contrast, when using trace_printk() with a non-constant string
    argument (which calls into __trace_printk/__trace_bprintk), the
    stacktrace trace option works. This inconsistent behavior confuses
    users a lot.

    Link: http://lkml.kernel.org/p/51E7A7C9.9040401@huawei.com

    Signed-off-by: zhangwei(Jovi)
    Signed-off-by: Steven Rostedt
    Signed-off-by: Jiri Slaby

    zhangwei(Jovi)
     
  • commit 5f8bf2d263a20b986225ae1ed7d6759dc4b93af9 upstream.

    Running my ftrace tests on PowerPC, it failed the test that checks
    if function_graph tracer is affected by the stack tracer. It was.
    Looking into this, I found that the update_function_graph_func()
    must be called even if the trampoline function is not changed.
    This is because archs like PowerPC do not support ftrace_ops being
    passed by assembly and instead use a helper function (what the
    trampoline function points to). Since this function is not changed
    even when multiple ftrace_ops are added to the code, the test that
    falls out before calling update_function_graph_func() will miss that
    the update must still be done.

    Call update_function_graph_func() for all calls to
    update_ftrace_function().

    Signed-off-by: Steven Rostedt
    Signed-off-by: Jiri Slaby

    Steven Rostedt (Red Hat)
     

18 Jul, 2014

6 commits

  • commit 8b8b36834d0fff67fc8668093f4312dd04dcf21d upstream.

    The per_cpu buffers are created one per possible CPU. But that does not
    mean those CPUs are online, or that they even exist.

    With the addition of the ring buffer polling, it assumes that the
    caller polls on an existing buffer. But this is not the case if
    the user reads trace_pipe from a CPU that does not exist, and this
    causes the kernel to crash.

    The simple fix is to check the cpu against the buffer bitmask to see
    whether the buffer was allocated or not, and to return -ENODEV if it is
    not.

    More updates were done to pass the -ENODEV back up to userspace.
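
    The check is essentially of this shape (a sketch, assuming the ring
    buffer keeps a cpumask of the CPUs it allocated buffers for):

        /* no buffer was ever allocated for this CPU: nothing to poll */
        if (!cpumask_test_cpu(cpu, buffer->cpumask))
                return -ENODEV;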

    Link: http://lkml.kernel.org/r/5393DB61.6060707@oracle.com

    Reported-by: Sasha Levin
    Signed-off-by: Steven Rostedt
    Signed-off-by: Jiri Slaby

    Steven Rostedt (Red Hat)
     
  • commit 5a6024f1604eef119cf3a6fa413fe0261a81a8f3 upstream.

    When hot-adding and onlining a CPU, a kernel panic occurs, showing the
    following call trace.

    BUG: unable to handle kernel paging request at 0000000000001d08
    IP: [] __alloc_pages_nodemask+0x9d/0xb10
    PGD 0
    Oops: 0000 [#1] SMP
    ...
    Call Trace:
    [] ? cpumask_next_and+0x35/0x50
    [] ? find_busiest_group+0x113/0x8f0
    [] ? deactivate_slab+0x349/0x3c0
    [] new_slab+0x91/0x300
    [] __slab_alloc+0x2bb/0x482
    [] ? copy_process.part.25+0xfc/0x14c0
    [] ? load_balance+0x218/0x890
    [] ? sched_clock+0x9/0x10
    [] ? trace_clock_local+0x9/0x10
    [] kmem_cache_alloc_node+0x8c/0x200
    [] copy_process.part.25+0xfc/0x14c0
    [] ? trace_buffer_unlock_commit+0x4d/0x60
    [] ? kthread_create_on_node+0x140/0x140
    [] do_fork+0xbc/0x360
    [] kernel_thread+0x26/0x30
    [] kthreadd+0x2c2/0x300
    [] ? kthread_create_on_cpu+0x60/0x60
    [] ret_from_fork+0x7c/0xb0
    [] ? kthread_create_on_cpu+0x60/0x60

    In my investigation, I found the root cause to be wq_numa_possible_cpumask.
    All entries of wq_numa_possible_cpumask are allocated by
    alloc_cpumask_var_node(), and these entries are used without being
    initialized, so they contain garbage values.

    When hot-adding and onlining a CPU, wq_update_unbound_numa() is called.
    wq_update_unbound_numa() calls alloc_unbound_pwq(), and alloc_unbound_pwq()
    calls get_unbound_pool(). In get_unbound_pool(), worker_pool->node is set
    as follows:

        /* if cpumask is contained inside a NUMA node, we belong to that node */
        if (wq_numa_enabled) {
                for_each_node(node) {
                        if (cpumask_subset(pool->attrs->cpumask,
                                           wq_numa_possible_cpumask[node])) {
                                pool->node = node;
                                break;
                        }
                }
        }

    But wq_numa_possible_cpumask[node] does not hold a correct cpumask, so the
    wrong node is selected. As a result, a kernel panic occurs.

    With this patch, all entries of wq_numa_possible_cpumask are allocated by
    zalloc_cpumask_var_node() so that they are initialized, and the panic
    disappears.
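
    Illustratively, the allocation changes along these lines (a sketch, not
    the exact hunk):

        /* zalloc_cpumask_var_node() hands back a zero-filled mask, so bits
         * never explicitly set later read as 0 instead of garbage */
        if (!zalloc_cpumask_var_node(&wq_numa_possible_cpumask[node],
                                     GFP_KERNEL, node))
                goto fail;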

    Signed-off-by: Yasuaki Ishimatsu
    Reviewed-by: Lai Jiangshan
    Signed-off-by: Tejun Heo
    Fixes: bce903809ab3 ("workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]")
    Signed-off-by: Jiri Slaby

    Yasuaki Ishimatsu
     
  • commit 391acf970d21219a2a5446282d3b20eace0c0d7a upstream.

    When running with the kernel (3.15-rc7+), the following bug occurs:
    [ 9969.258987] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586
    [ 9969.359906] in_atomic(): 1, irqs_disabled(): 0, pid: 160655, name: python
    [ 9969.441175] INFO: lockdep is turned off.
    [ 9969.488184] CPU: 26 PID: 160655 Comm: python Tainted: G A 3.15.0-rc7+ #85
    [ 9969.581032] Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 1.39 11/16/2012
    [ 9969.706052] ffffffff81a20e60 ffff8803e941fbd0 ffffffff8162f523 ffff8803e941fd18
    [ 9969.795323] ffff8803e941fbe0 ffffffff8109995a ffff8803e941fc58 ffffffff81633e6c
    [ 9969.884710] ffffffff811ba5dc ffff880405c6b480 ffff88041fdd90a0 0000000000002000
    [ 9969.974071] Call Trace:
    [ 9970.003403] [] dump_stack+0x4d/0x66
    [ 9970.065074] [] __might_sleep+0xfa/0x130
    [ 9970.130743] [] mutex_lock_nested+0x3c/0x4f0
    [ 9970.200638] [] ? kmem_cache_alloc+0x1bc/0x210
    [ 9970.272610] [] cpuset_mems_allowed+0x27/0x140
    [ 9970.344584] [] ? __mpol_dup+0x63/0x150
    [ 9970.409282] [] __mpol_dup+0xe5/0x150
    [ 9970.471897] [] ? __mpol_dup+0x63/0x150
    [ 9970.536585] [] ? copy_process.part.23+0x606/0x1d40
    [ 9970.613763] [] ? trace_hardirqs_on+0xd/0x10
    [ 9970.683660] [] ? monotonic_to_bootbased+0x2f/0x50
    [ 9970.759795] [] copy_process.part.23+0x670/0x1d40
    [ 9970.834885] [] do_fork+0xd8/0x380
    [ 9970.894375] [] ? __audit_syscall_entry+0x9c/0xf0
    [ 9970.969470] [] SyS_clone+0x16/0x20
    [ 9971.030011] [] stub_clone+0x69/0x90
    [ 9971.091573] [] ? system_call_fastpath+0x16/0x1b

    The cause is that cpuset_mems_allowed() tries to take
    mutex_lock(&callback_mutex) under rcu_read_lock (which was held in
    __mpol_dup()). And since in cpuset_mems_allowed() the access to the cpuset
    is already under rcu_read_lock, in __mpol_dup we can shrink the
    rcu_read_lock protection region so that it only protects the access to the
    cpuset in current_cpuset_is_being_rebound(). That way we can avoid this
    bug.

    This patch is a temporary solution that just addresses the bug
    mentioned above; it cannot fix the long-standing issue of cpuset.mems
    rebinding on fork():

    "When the forker's task_struct is duplicated (which includes
    ->mems_allowed) and it races with an update to cpuset_being_rebound
    in update_tasks_nodemask() then the task's mems_allowed doesn't get
    updated. And the child task's mems_allowed can be wrong if the
    cpuset's nodemask changes before the child has been added to the
    cgroup's tasklist."

    Signed-off-by: Gu Zheng
    Acked-by: Li Zefan
    Signed-off-by: Tejun Heo
    Signed-off-by: Jiri Slaby

    Gu Zheng
     
  • commit bddbceb688c6d0decaabc7884fede319d02f96c8 upstream.

    Uevents are suppressed during attribute registration, but never
    re-enabled, so kobject_uevent() does nothing.
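
    A sketch of the standard way to address this (dev here stands for the
    workqueue's struct device; names are illustrative):

        /* attributes are in place: stop suppressing uevents and announce
         * the device so userspace actually hears about it */
        dev_set_uevent_suppress(dev, false);
        kobject_uevent(&dev->kobj, KOBJ_ADD);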

    Signed-off-by: Maxime Bizon
    Signed-off-by: Tejun Heo
    Fixes: 226223ab3c4118ddd10688cc2c131135848371ab
    Signed-off-by: Jiri Slaby

    Maxime Bizon
     
  • commit 099ed151675cd1d2dbeae1dac697975f6a68716d upstream.

    Disabling reading and writing to the trace file should not be able to
    disable all function tracing callbacks. There are other users today
    (like kprobes and perf). Reading a trace file should not stop those
    from happening.

    Reviewed-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt
    Signed-off-by: Jiri Slaby

    Steven Rostedt (Red Hat)
     
  • commit 7cd2b0a34ab8e4db971920eef8982f985441adfb upstream.

    Oleg reports a division by zero error on zero-length write() to the
    percpu_pagelist_fraction sysctl:

    divide error: 0000 [#1] SMP DEBUG_PAGEALLOC
    CPU: 1 PID: 9142 Comm: badarea_io Not tainted 3.15.0-rc2-vm-nfs+ #19
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    task: ffff8800d5aeb6e0 ti: ffff8800d87a2000 task.ti: ffff8800d87a2000
    RIP: 0010: percpu_pagelist_fraction_sysctl_handler+0x84/0x120
    RSP: 0018:ffff8800d87a3e78 EFLAGS: 00010246
    RAX: 0000000000000f89 RBX: ffff88011f7fd000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000010
    RBP: ffff8800d87a3e98 R08: ffffffff81d002c8 R09: ffff8800d87a3f50
    R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000060
    R13: ffffffff81c3c3e0 R14: ffffffff81cfddf8 R15: ffff8801193b0800
    FS: 00007f614f1e9740(0000) GS:ffff88011f440000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007f614f1fa000 CR3: 00000000d9291000 CR4: 00000000000006e0
    Call Trace:
    proc_sys_call_handler+0xb3/0xc0
    proc_sys_write+0x14/0x20
    vfs_write+0xba/0x1e0
    SyS_write+0x46/0xb0
    tracesys+0xe1/0xe6

    However, if the percpu_pagelist_fraction sysctl is set by the user, it
    is also impossible to restore it to the kernel default since the user
    cannot write 0 to the sysctl.

    This patch allows the user to write 0 to restore the default behavior.
    It still requires a fraction equal to or larger than 8, however, as
    stated by the documentation for sanity. If a value in the range [1, 7]
    is written, the sysctl will return EINVAL.

    This successfully solves the divide by zero issue at the same time.

    Signed-off-by: David Rientjes
    Reported-by: Oleg Drokin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    David Rientjes
     

17 Jul, 2014

2 commits

  • commit 4af4206be2bd1933cae20c2b6fb2058dbc887f7c upstream.

    syscall_regfunc() and syscall_unregfunc() should set/clear
    TIF_SYSCALL_TRACEPOINT system-wide, but do_each_thread() can race
    with copy_process() and miss the new child which was not added to
    the process/thread lists yet.

    Change copy_process() to update the child's TIF_SYSCALL_TRACEPOINT
    under tasklist.

    Link: http://lkml.kernel.org/p/20140413185854.GB20668@redhat.com

    Fixes: a871bd33a6c0 "tracing: Add syscall tracepoints"
    Acked-by: Frederic Weisbecker
    Acked-by: Paul E. McKenney
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Steven Rostedt
    Signed-off-by: Jiri Slaby

    Oleg Nesterov
     
  • commit 379cfdac37923653c9d4242d10052378b7563005 upstream.

    In order to prevent the saved cmdline cache from being filled when
    tracing is not active, the comms are only recorded after a trace event
    is recorded.

    The problem is, a comm can fail to be recorded if the trace_cmdline_lock
    is held. That lock is taken via a trylock to allow it to happen from
    any context (including NMI). If the lock fails to be taken, the comm
    is skipped. No big deal, as we will try again later.

    But! Because of the code that was added to only record after an event,
    we may not try again later as the recording is made as a oneshot per
    event per CPU.

    Only disable the recording of the comm if the comm is actually recorded.

    Fixes: 7ffbd48d5cab "tracing: Cache comms only after an event occurred"
    Signed-off-by: Steven Rostedt
    Signed-off-by: Jiri Slaby

    Steven Rostedt (Red Hat)
     

16 Jul, 2014

4 commits

  • commit 27e35715df54cbc4f2d044f681802ae30479e7fb upstream.

    When the rtmutex fast path is enabled the slow unlock function can
    create the following situation:

        spin_lock(foo->m->wait_lock);
        foo->m->owner = NULL;
                                        rt_mutex_lock(foo->m);   <-- fast path
                                        free = atomic_dec_and_test(foo->refcnt);
                                        rt_mutex_unlock(foo->m); <-- fast path
                                        if (free)
                                                kfree(foo);

        spin_unlock(foo->m->wait_lock); <-- use after free

    To plug this race, the slow unlock path uses the following scheme:

        while (!rt_mutex_has_waiters(m)) {
                /* Clear the waiters bit in m->owner */
                clear_rt_mutex_waiters(m);
                owner = rt_mutex_owner(m);
                spin_unlock(m->wait_lock);
                if (cmpxchg(m->owner, owner, 0) == owner)
                        return;
                spin_lock(m->wait_lock);
        }

    So in case of a new waiter incoming while the owner tries the slow
    path unlock we have two situations:

        owner (slow path unlock)        incoming waiter

        unlock(wait_lock);
                                        lock(wait_lock);
        cmpxchg(p, owner, 0) == owner
                                        mark_rt_mutex_waiters(lock);
                                        acquire(lock);

    Or:

        unlock(wait_lock);
                                        lock(wait_lock);
                                        mark_rt_mutex_waiters(lock);
        cmpxchg(p, owner, 0) != owner
                                        enqueue_waiter();
                                        unlock(wait_lock);
        lock(wait_lock);
        wakeup_next waiter();
        unlock(wait_lock);
                                        lock(wait_lock);
                                        acquire(lock);

    If the fast path is disabled, then the simple

    m->owner = NULL;
    unlock(m->wait_lock);

    is sufficient as all access to m->owner is serialized via
    m->wait_lock;

    Also document and clarify the wakeup_next_waiter function as suggested
    by Oleg Nesterov.

    Reported-by: Steven Rostedt
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Steven Rostedt
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140611183852.937945560@linutronix.de
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Mike Galbraith
    Signed-off-by: Jiri Slaby

    Thomas Gleixner
     
  • commit 3d5c9340d1949733eb37616abd15db36aef9a57c upstream.

    Even in the case when deadlock detection is not requested by the
    caller, we can detect deadlocks. Right now the code stops the lock
    chain walk and keeps the waiter enqueued, even on itself. Silly not to
    yell when such a scenario is detected and to keep the waiter enqueued.

    Return -EDEADLK unconditionally and handle it at the call sites.

    The futex calls return -EDEADLK. The non futex ones dequeue the
    waiter, throw a warning and put the task into a schedule loop.

    Tagged for stable as it makes the code more robust.

    Signed-off-by: Thomas Gleixner
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Brad Mouring
    Link: http://lkml.kernel.org/r/20140605152801.836501969@linutronix.de
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Mike Galbraith
    Signed-off-by: Jiri Slaby

    Thomas Gleixner
     
  • commit 82084984383babe728e6e3c9a8e5c46278091315 upstream.

    When we walk the lock chain, we drop all locks after each step. So the
    lock chain can change under us before we reacquire the locks. That's
    harmless in principle as we just follow the wrong lock path. But it
    can lead to a false positive in the deadlock detection logic:

    T0 holds L0
    T0 blocks on L1 held by T1
    T1 blocks on L2 held by T2
    T2 blocks on L3 held by T3
    T3 blocks on L4 held by T4

    Now we walk the chain

    lock T1 -> lock L2 -> adjust L2 -> unlock T1 ->
    lock T2 -> adjust T2 -> drop locks

    T2 times out and blocks on L0

    Now we continue:

    lock T2 -> lock L0 -> deadlock detected, but it's not a deadlock at all.

    Brad tried to work around that in the deadlock detection logic itself,
    but the more I looked at it the less I liked it, because it's crystal
    ball magic after the fact.

    We can actually detect a chain change very simply:

    lock T1 -> lock L2 -> adjust L2 -> unlock T1 -> lock T2 -> adjust T2 ->

    next_lock = T2->pi_blocked_on->lock;

    drop locks

    T2 times out and blocks on L0

    Now we continue:

    lock T2 ->

    if (next_lock != T2->pi_blocked_on->lock)
    return;

    So if we detect that T2 is now blocked on a different lock we stop the
    chain walk. That's also correct in the following scenario:

    lock T1 -> lock L2 -> adjust L2 -> unlock T1 -> lock T2 -> adjust T2 ->

    next_lock = T2->pi_blocked_on->lock;

    drop locks

    T3 times out and drops L3
    T2 acquires L3 and blocks on L4 now

    Now we continue:

    lock T2 ->

    if (next_lock != T2->pi_blocked_on->lock)
    return;

    We don't have to follow up the chain at that point, because T2
    propagated our priority up to T4 already.

    [ Folded a cleanup patch from peterz ]

    Signed-off-by: Thomas Gleixner
    Reported-by: Brad Mouring
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140605152801.930031935@linutronix.de
    Signed-off-by: Mike Galbraith
    Signed-off-by: Jiri Slaby

    Thomas Gleixner
     
  • commit 397335f004f41e5fcf7a795e94eb3ab83411a17c upstream.

    The current deadlock detection logic does not work reliably due to the
    following early exit path:

        /*
         * Drop out, when the task has no waiters. Note,
         * top_waiter can be NULL, when we are in the deboosting
         * mode!
         */
        if (top_waiter && (!task_has_pi_waiters(task) ||
                           top_waiter != task_top_pi_waiter(task)))
                goto out_unlock_pi;

    So this not only exits when the task has no waiters, it also exits
    unconditionally when the current waiter is not the top priority waiter
    of the task.

    So in a nested locking scenario, it might abort the lock chain walk
    and therefore miss a potential deadlock.

    Simple fix: Continue the chain walk, when deadlock detection is
    enabled.

    We also avoid the whole enqueue if we detect the deadlock right away
    (A-A). It's an optimization, but it also prevents the case where another
    waiter, who comes in after the detection and before the task has undone
    the damage, observes the situation, detects the deadlock, and returns
    -EDEADLOCK, which is wrong, as the other task is not in a deadlock
    situation.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Reviewed-by: Steven Rostedt
    Cc: Lai Jiangshan
    Link: http://lkml.kernel.org/r/20140522031949.725272460@linutronix.de
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Mike Galbraith
    Signed-off-by: Jiri Slaby

    Thomas Gleixner
     

02 Jul, 2014

2 commits

  • commit 1e77d0a1ed7417d2a5a52a7b8d32aea1833faa6c upstream.

    Till reported that the spurious interrupt detection of threaded
    interrupts is broken in two ways:

    - note_interrupt() is called for each action thread of a shared
    interrupt line. That's wrong, as we are only interested in whether none
    of the device drivers felt responsible for the interrupt; by calling it
    multiple times for a single interrupt line we account IRQ_NONE even if
    one of the drivers felt responsible.

    - note_interrupt() when called from the thread handler is not
    serialized. That leaves the members of irq_desc which are used for
    the spurious detection unprotected.

    To solve this we need to defer the spurious detection of a threaded
    interrupt to the next hardware interrupt context where we have
    implicit serialization.

    If note_interrupt is called with action_ret == IRQ_WAKE_THREAD, we
    check whether the previous interrupt requested a deferred check. If
    not, we request a deferred check for the next hardware interrupt and
    return.

    If set, we check whether one of the interrupt threads signaled
    success. Depending on this information we feed the result into the
    spurious detector.

    If one primary handler of a shared interrupt returns IRQ_HANDLED we
    disable the deferred check of irq threads on the same line, as we have
    found at least one device driver who cared.

    Reported-by: Till Straumann
    Signed-off-by: Thomas Gleixner
    Tested-by: Austin Schuh
    Cc: Oliver Hartkopp
    Cc: Wolfgang Grandegger
    Cc: Pavel Pisa
    Cc: Marc Kleine-Budde
    Cc: linux-can@vger.kernel.org
    Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1303071450130.22263@ionos
    Signed-off-by: Jiri Slaby

    Thomas Gleixner
     
  • commit 4e52365f279564cef0ddd41db5237f0471381093 upstream.

    When tracing a process in another pid namespace, it's important for fork
    event messages to contain the child's pid as seen from the tracer's pid
    namespace, not the parent's. Otherwise, the tracer won't be able to
    correlate the fork event with later SIGTRAP signals it receives from the
    child.

    We still risk a race condition if a ptracer from a different pid
    namespace attaches after we compute the pid_t value. However, sending a
    bogus fork event message in this unlikely scenario is still a vast
    improvement over the status quo where we always send bogus fork event
    messages to debuggers in a different pid namespace than the forking
    process.

    Signed-off-by: Matthew Dempsky
    Acked-by: Oleg Nesterov
    Cc: Kees Cook
    Cc: Julien Tinnes
    Cc: Roland McGrath
    Cc: Jan Kratochvil
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Matthew Dempsky
     

27 Jun, 2014

2 commits

  • commit cadefd3d6cc914d95163ba1eda766bfe7ce1e5b7 upstream.

    Mike reported that, while unlikely, it's entirely possible for
    scale_rt_power() to see the time go backwards. This yields rather
    'interesting' results.

    So, like all other sites that deal with clocks, make this one ignore
    backward clock movement too.
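
    The usual guard (an illustrative sketch with placeholder variable names,
    not the exact hunk) clamps a negative delta to zero:

        s64 delta = now - last;         /* 'now' and 'last' are placeholders */

        if (unlikely(delta < 0))
                delta = 0;              /* clock went backwards: ignore it */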

    Reported-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140227094035.GZ9987@twins.programming.kicks-ass.net
    Cc: Linus Torvalds
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Jiri Slaby

    Peter Zijlstra
     
  • commit 27630532ef5ead28b98cfe28d8f95222ef91c2b7 upstream.

    Since commit d689fe222 (NOHZ: Check for nohz active instead of nohz
    enabled) the tick_nohz_switch_to_nohz() function returns early because it
    checks the tick_nohz_active flag, which cannot be set yet, because the
    function itself is what sets it.

    Undo the change in tick_nohz_switch_to_nohz().

    Signed-off-by: Viresh Kumar
    Cc: linaro-kernel@lists.linaro.org
    Cc: fweisbec@gmail.com
    Cc: Arvind.Chauhan@arm.com
    Cc: linaro-networking@linaro.org
    Cc: # 3.13+
    Link: http://lkml.kernel.org/r/40939c05f2d65d781b92b20302b02243d0654224.1397537987.git.viresh.kumar@linaro.org
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Jiri Slaby

    Viresh Kumar
     

23 Jun, 2014

2 commits

  • commit 9390675af0835ae1d654d33bfcf16096028550ad upstream.

    This reverts commit 282cf499f03ec1754b6c8c945c9674b02631fb0f.

    With the current implementation, the load average statistics of a sched
    entity change according to other activity on the CPU, even if this
    activity happens between the running windows of the sched entity and has
    no influence on the running duration of the task.

    When a task wakes up on the same CPU, we currently update last_runnable_update
    with the return of __synchronize_entity_decay without updating the
    runnable_avg_sum and runnable_avg_period accordingly. In fact, we have to sync
    the load_contrib of the se with the rq's blocked_load_contrib before removing
    it from the latter (with __synchronize_entity_decay) but we must keep
    last_runnable_update unchanged for updating runnable_avg_sum/period during the
    next update_entity_load_avg.

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra
    Reviewed-by: Ben Segall
    Cc: pjt@google.com
    Cc: alex.shi@linaro.org
    Link: http://lkml.kernel.org/r/1390376734-6800-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Jiri Slaby

    Vincent Guittot
     
  • [ Upstream commit 90f62cf30a78721641e08737bda787552428061e ]

    It is possible, by passing a netlink socket to a more privileged
    executable and then fooling that executable into writing to the socket
    data that happens to be a valid netlink message, to make that privileged
    executable do something it did not intend to do.

    To keep this from happening, replace bare capable and ns_capable calls
    with netlink_capable, netlink_net_capable and netlink_ns_capable calls,
    which act the same as the previous calls except that they verify that the
    opener of the socket had the desired permissions as well.

    Reported-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller
    Signed-off-by: Jiri Slaby

    Eric W. Biederman
     

20 Jun, 2014

2 commits

  • commit a3c54931199565930d6d84f4c3456f6440aefd41 upstream.

    Fixes an easy DoS and possible information disclosure.

    This does nothing about the broken state of x32 auditing.

    eparis: This only matters if the admin has enabled auditd and has
    specifically loaded audit rules. This bug has been around since before
    git. Wow...

    Signed-off-by: Andy Lutomirski
    Signed-off-by: Eric Paris
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Andy Lutomirski
     
  • commit 23adbe12ef7d3d4195e80800ab36b37bee28cd03 upstream.

    The kernel has no concept of capabilities with respect to inodes; inodes
    exist independently of namespaces. For example, inode_capable(inode,
    CAP_LINUX_IMMUTABLE) would be nonsense.

    This patch changes inode_capable to check for uid and gid mappings and
    renames it to capable_wrt_inode_uidgid, which should make it more
    obvious what it does.
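
    A hedged sketch of what the renamed helper checks, based on the
    description above (the exact upstream code may differ):

        bool capable_wrt_inode_uidgid(const struct inode *inode, int cap)
        {
                struct user_namespace *ns = current_user_ns();

                /* the capability only applies if the inode's owner and group
                 * are actually mapped into the caller's user namespace */
                return ns_capable(ns, cap) &&
                       kuid_has_mapping(ns, inode->i_uid) &&
                       kgid_has_mapping(ns, inode->i_gid);
        }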

    Fixes CVE-2014-4014.

    Cc: Theodore Ts'o
    Cc: Serge Hallyn
    Cc: "Eric W. Biederman"
    Cc: Dave Chinner
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Andy Lutomirski