26 Sep, 2014

3 commits

  • commit 664eeddeef6539247691197c1ac124d4aa872ab6 upstream.

    If cpusets are not in use then we still check a global variable on every
    page allocation. Use jump labels to avoid the overhead.
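
    As an illustrative sketch of the jump-label pattern (the key and helper
    names are assumptions, not necessarily the exact hunk):

        #include <linux/jump_label.h>

        extern struct static_key cpusets_enabled_key;

        static inline bool cpusets_enabled(void)
        {
                /* Compiles to a patched no-op while the key is unset, so the
                 * page allocator pays nothing when cpusets are unused. */
                return static_key_false(&cpusets_enabled_key);
        }

        /* Flipped once the first real cpuset is created, e.g.:
         *      static_key_slow_inc(&cpusets_enabled_key);
         */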

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit 615d6e8756c87149f2d4c1b93d471bca002bd849 upstream.

    This patch is a continuation of efforts trying to optimize find_vma(),
    avoiding potentially expensive rbtree walks to locate a vma upon faults.
    The original approach (https://lkml.org/lkml/2013/11/1/410), where the
    largest vma was also cached, ended up being too specific and random, so
    further comparison with other approaches was needed. There are two
    things to consider here: the cache hit rate and the latency of
    find_vma(). Improving the hit rate does not necessarily translate into
    finding the vma any faster, as the overhead of a fancy caching scheme
    can be too high to be worthwhile.

    We currently cache the last used vma for the whole address space, which
    provides a nice optimization, reducing the total cycles in find_vma() by
    up to 250%, for workloads with good locality. On the other hand, this
    simple scheme is pretty much useless for workloads with poor locality.
    Analyzing ebizzy runs shows that, no matter how many threads are
    running, the mmap_cache hit rate is less than 2%, and in many situations
    below 1%.

    The proposed approach is to replace this scheme with a small per-thread
    cache, maximizing hit rates at a very low maintenance cost.
    Invalidations are performed by simply bumping up a 32-bit sequence
    number. The only expensive operation is in the rare case of a seq
    number overflow, where all caches that share the same address space are
    flushed. Upon a miss, the proposed replacement policy is based on the
    page number that contains the virtual address in question. Concretely,
    the following results are seen on an 80 core, 8 socket x86-64 box:

    1) System bootup: Most programs are single threaded, so the per-thread
    scheme does improve ~50% hit rate by just adding a few more slots to
    the cache.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       | 50.61%   | 19.90            |
    | patched        | 73.45%   | 13.58            |
    +----------------+----------+------------------+

    2) Kernel build: This one is already pretty good with the current
    approach as we're dealing with good locality.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       | 75.28%   | 11.03            |
    | patched        | 88.09%   | 9.31             |
    +----------------+----------+------------------+

    3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       | 70.66%   | 17.14            |
    | patched        | 91.15%   | 12.57            |
    +----------------+----------+------------------+

    4) Ebizzy: There's a fair amount of variation from run to run, but this
    approach always shows nearly perfect hit rates, while baseline is just
    about non-existent. The amounts of cycles can fluctuate between
    anywhere from ~60 to ~116 for the baseline scheme, but this approach
    reduces it considerably. For instance, with 80 threads:

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       | 1.06%    | 91.54            |
    | patched        | 99.97%   | 14.18            |
    +----------------+----------+------------------+
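
    As a rough sketch of the per-thread cache scheme described above (the
    size, field and helper names here are illustrative, not the exact
    upstream code):

        #define VMACACHE_SIZE 4         /* a few slots per thread */

        /* Pick a slot from the page number containing the address. */
        static inline int vmacache_hash(unsigned long addr)
        {
                return (addr >> PAGE_SHIFT) & (VMACACHE_SIZE - 1);
        }

        /* Invalidation: bump the mm-wide 32-bit sequence number; a thread's
         * cached entries are only trusted while its private copy matches.
         * The rare overflow case flushes every cache sharing this mm. */
        static inline void vmacache_invalidate(struct mm_struct *mm)
        {
                mm->vmacache_seqnum++;
        }

        static inline bool vmacache_valid(struct mm_struct *mm)
        {
                return current->mm == mm &&
                       current->vmacache_seqnum == mm->vmacache_seqnum;
        }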

    [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
    [akpm@linux-foundation.org: document vmacache_valid() logic]
    [akpm@linux-foundation.org: attempt to untangle header files]
    [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
    [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: adjust and enhance comments]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Acked-by: Linus Torvalds
    Reviewed-by: Michel Lespinasse
    Cc: Oleg Nesterov
    Tested-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Davidlohr Bueso
     
  • commit d26914d11751b23ca2e8747725f2cae10c2f2c1b upstream.

    Since put_mems_allowed() is strictly optional (it is a seqcount retry),
    we don't need to evaluate the function if the allocation was in fact
    successful, saving an smp_rmb(), some loads and comparisons on some
    relatively fast paths.

    Since the naming of get/put_mems_allowed() suggests a mandatory pairing,
    rename the interface, as suggested by Mel, to resemble the seqcount
    interface.

    This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(),
    where it is important to note that the return value of the latter call
    is inverted from its previous incarnation.
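
    A minimal sketch of the resulting usage pattern (try_to_allocate() is a
    stand-in for the real allocation fast path):

        unsigned int cpuset_mems_cookie;
        struct page *page;

        do {
                cpuset_mems_cookie = read_mems_allowed_begin();
                page = try_to_allocate(gfp_mask, order);    /* stand-in */
                /* Only retry when the allocation failed *and* the allowed
                 * mems changed underneath us (note the inverted return). */
        } while (!page && read_mems_allowed_retry(cpuset_mems_cookie));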

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     

17 Sep, 2014

4 commits

  • commit 021de3d904b88b1771a3a2cfc5b75023c391e646 upstream.

    After writing a test to try to trigger the bug that caused the
    ring buffer iterator to become corrupted, I hit another bug:

    WARNING: CPU: 1 PID: 5281 at kernel/trace/ring_buffer.c:3766 rb_iter_peek+0x113/0x238()
    Modules linked in: ipt_MASQUERADE sunrpc [...]
    CPU: 1 PID: 5281 Comm: grep Tainted: G W 3.16.0-rc3-test+ #143
    Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
    0000000000000000 ffffffff81809a80 ffffffff81503fb0 0000000000000000
    ffffffff81040ca1 ffff8800796d6010 ffffffff810c138d ffff8800796d6010
    ffff880077438c80 ffff8800796d6010 ffff88007abbe600 0000000000000003
    Call Trace:
    [] ? dump_stack+0x4a/0x75
    [] ? warn_slowpath_common+0x7e/0x97
    [] ? rb_iter_peek+0x113/0x238
    [] ? rb_iter_peek+0x113/0x238
    [] ? ring_buffer_iter_peek+0x2d/0x5c
    [] ? tracing_iter_reset+0x6e/0x96
    [] ? s_start+0xd7/0x17b
    [] ? kmem_cache_alloc_trace+0xda/0xea
    [] ? seq_read+0x148/0x361
    [] ? vfs_read+0x93/0xf1
    [] ? SyS_read+0x60/0x8e
    [] ? tracesys+0xdd/0xe2

    Debugging this bug, which triggers when the rb_iter_peek() loops too
    many times (more than 2 times), I discovered there's a case that can
    cause that function to legitimately loop 3 times!

    rb_iter_peek() is different from rb_buffer_peek(), as rb_buffer_peek()
    only deals with the reader page (it's for consuming reads).
    rb_iter_peek() is for traversing the buffer without consuming it, and as
    such, it can loop for one more reason: if we hit the end of the reader
    page, or of any page, it will go to the next page and try again.

    That is, we have this:

    1. iter->head > iter->head_page->page->commit
    (rb_inc_iter() which moves the iter to the next page)
    try again

    2. event = rb_iter_head_event()
    event->type_len == RINGBUF_TYPE_TIME_EXTEND
    rb_advance_iter()
    try again

    3. read the event.

    But we never get to 3, because the count is greater than 2 and we
    cause the WARNING and return NULL.

    Up the counter to 3.

    Fixes: 69d1b839f7ee "ring-buffer: Bind time extend and data events together"
    Signed-off-by: Steven Rostedt
    Signed-off-by: Jiri Slaby

    Steven Rostedt (Red Hat)
     
  • commit 651e22f2701b4113989237c3048d17337dd2185c upstream.

    When performing a consuming read, the ring buffer swaps out a
    page from the ring buffer with an empty page, and the page that
    was swapped out becomes the new reader page. The reader page
    is owned by the reader, and since it was swapped out of the ring
    buffer, writers do not have access to it (there's an exception
    to that rule, but it's out of scope for this commit).

    When reading the "trace" file, it is a non-consuming read, which
    means that the data in the ring buffer will not be modified.
    When the trace file is opened, a ring buffer iterator is allocated
    and writes to the ring buffer are disabled, so that the iterator
    will not have issues iterating over the data.

    Although the ring buffer has writes disabled, it does not disable other
    reads, or even consuming reads. If a consuming read happens, then
    the iterator is reset and starts reading from the beginning again.

    My tests would sometimes trigger this bug on my i386 box:

    WARNING: CPU: 0 PID: 5175 at kernel/trace/trace.c:1527 __trace_find_cmdline+0x66/0xaa()
    Modules linked in:
    CPU: 0 PID: 5175 Comm: grep Not tainted 3.16.0-rc3-test+ #8
    Hardware name: /DG965MQ, BIOS MQ96510J.86A.0372.2006.0605.1717 06/05/2006
    00000000 00000000 f09c9e1c c18796b3 c1b5d74c f09c9e4c c103a0e3 c1b5154b
    f09c9e78 00001437 c1b5d74c 000005f7 c10bd85a c10bd85a c1cac57c f09c9eb0
    ed0e0000 f09c9e64 c103a185 00000009 f09c9e5c c1b5154b f09c9e78 f09c9e80^M
    Call Trace:
    [] dump_stack+0x4b/0x75
    [] warn_slowpath_common+0x7e/0x95
    [] ? __trace_find_cmdline+0x66/0xaa
    [] ? __trace_find_cmdline+0x66/0xaa
    [] warn_slowpath_fmt+0x33/0x35
    [] __trace_find_cmdline+0x66/0xaa^M
    [] trace_find_cmdline+0x40/0x64
    [] trace_print_context+0x27/0xec
    [] ? trace_seq_printf+0x37/0x5b
    [] print_trace_line+0x319/0x39b
    [] ? ring_buffer_read+0x47/0x50
    [] s_show+0x192/0x1ab
    [] ? s_next+0x5a/0x7c
    [] seq_read+0x267/0x34c
    [] vfs_read+0x8c/0xef
    [] ? seq_lseek+0x154/0x154
    [] SyS_read+0x54/0x7f
    [] syscall_call+0x7/0xb
    ---[ end trace 3f507febd6b4cc83 ]---
    >>>> ##### CPU 1 buffer started ####

    Which was the __trace_find_cmdline() function complaining about the pid
    in the event record being negative.

    After adding more test cases, this would trigger more often. Strangely
    enough, it would never trigger on a single test, but instead would trigger
    only when running all the tests. I believe that was the case because it
    required one of the tests to be shutting down via delayed instances while
    a new test started up.

    After spending several days debugging this, I found that it was caused by
    the iterator becoming corrupted. Debugging further, I found out why
    the iterator became corrupted: it happened in rb_iter_reset().

    As consuming reads may read only part of the reader page rather than all
    of it, there's a "read" field to record where the last read took place.
    The iterator must also start at that read position. In the rb_iter_reset()
    code, if the reader page was disconnected from the ring buffer, the iterator
    would start at the head page within the ring buffer (where writes still
    happen). The mistake there was that it still used the "read" field
    to start the iterator on the head page, where it should always start
    at zero, because readers never read from within the ring buffer where
    writes occur.

    I originally wrote a patch to have it set the iter->head to 0 instead
    of iter->head_page->read, but then I questioned why it wasn't always
    setting the iter to point to the reader page, as the reader page is
    still valid. The list_empty(reader_page->list) just means that it was
    successful in swapping out. But the reader_page may still have data.

    There was a bug report a long time ago, which was not reproducible, about
    trace_pipe (consuming read) not matching trace (iterator read). This may
    explain why that happened.

    Anyway, the correct answer to this bug is to always use the reader page
    and not reset the iterator to inside the writable ring buffer.
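
    A hedged sketch of the resulting reset (field names follow the commit
    text above; the exact hunk may differ):

        /* Always start the iterator on the reader page, at the position the
         * last consuming read left off; never point it into the part of the
         * ring buffer that writers still touch. */
        iter->head_page = cpu_buffer->reader_page;
        iter->head = cpu_buffer->reader_page->read;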

    Fixes: d769041f8653 "ring_buffer: implement new locking"
    Signed-off-by: Steven Rostedt
    Signed-off-by: Jiri Slaby

    Steven Rostedt (Red Hat)
     
  • commit 618fde872163e782183ce574c77f1123e2be8887 upstream.

    The rarely-executed memory-allocation-failed callback path generates a
    WARN_ON_ONCE() when smp_call_function_single() succeeds. Presumably
    it's supposed to warn on failures.
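
    In other words, the warning should key off a non-zero return, since
    smp_call_function_single() returns 0 on success (an illustrative sketch,
    not the exact hunk):

        /* warn if the cross-CPU call could not be issued, not when it worked */
        WARN_ON_ONCE(smp_call_function_single(cpu, func, info, wait));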

    Signed-off-by: Sasha Levin
    Cc: Christoph Lameter
    Cc: Gilad Ben-Yossef
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Sasha Levin
     
  • commit 7d8b6c63751cfbbe5eef81a48c22978b3407a3ad upstream.

    This is effectively a revert of 7b9a7ec565505699f503b4fcf61500dceb36e744
    plus fixing it a different way...

    We found, when trying to run an application from an application which
    had dropped privs, that the kernel does security checks on undefined
    capability bits. This was ESPECIALLY difficult to debug as those
    undefined bits are hidden from /proc/$PID/status.

    Consider a root application which drops all capabilities from ALL 4
    capability sets. We assume that, since the application is going to set
    eff/perm/inh from an array, it will clear not only the defined caps
    less than CAP_LAST_CAP, but also the higher 28-ish bits which are
    undefined future capabilities.

    The BSET gets cleared differently. Instead it is cleared one bit at a
    time. The problem here is that in security/commoncap.c::cap_task_prctl()
    we actually check the validity of a capability being read. So any task
    which attempts to 'read all things set in bset' followed by 'unset all
    things set in bset' will not even attempt to unset the undefined bits
    higher than CAP_LAST_CAP.

    So the 'parent' will look something like:
    CapInh: 0000000000000000
    CapPrm: 0000000000000000
    CapEff: 0000000000000000
    CapBnd: ffffffc000000000

    All of this 'should' be fine. Given that these are undefined bits that
    aren't supposed to have anything to do with permissions. But they do...

    So let's now consider a task which cleared the eff/perm/inh completely
    and cleared all of the valid caps in the bset (but not the invalid caps
    it couldn't read out of the kernel). We know that this is exactly what
    the libcap-ng library does and what the go capabilities library does.
    They both leave you in that above situation if you try to clear all of
    your capabilities from all 4 sets. If that root task calls execve()
    the child task will pick up all caps not blocked by the bset. The bset
    however does not block bits higher than CAP_LAST_CAP. So now the child
    task has bits in eff which are not in the parent. These are
    'meaningless' undefined bits, but still bits which the parent doesn't
    have.

    The problem is now in cred_cap_issubset() (or any operation which does a
    subset test), as the child, while a subset for valid cap bits, is not a
    subset for invalid cap bits! So now we mark, during commit_creds, that
    the child is not dumpable, given it is 'more priv' than its parent. It
    also means the parent cannot ptrace the child, and other stupidity.

    The solution here:
    1) stop hiding capability bits in status
    This makes debugging easier!

    2) stop giving any task undefined capability bits. It's simple: if you
    don't put those invalid bits in CAP_FULL_SET you won't get them in init
    and you won't get them in any other task either.
    This fixes the cap_issubset() tests and resulting fallout (which
    made the init task in a docker container untraceable among other
    things)

    3) mask out undefined bits when sys_capset() is called as it might use
    ~0, ~0 to denote 'all capabilities' for backward/forward compatibility.
    This lets 'capsh --caps="all=eip" -- -c /bin/bash' run.

    4) mask out undefined bits when we read a file capability off of disk, as
    again likely all bits are set in the xattr for forward/backward
    compatibility.
    This lets 'setcap all+pe /bin/bash; /bin/bash' run
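
    A small, runnable illustration of the masking idea in points 3) and 4)
    (the CAP_LAST_CAP value is a stand-in; the kernel header defines the
    real one):

        #include <stdint.h>
        #include <stdio.h>

        #define CAP_LAST_CAP 36   /* stand-in value for illustration */

        int main(void)
        {
                uint64_t caps = ~0ULL;  /* "all capabilities" from userspace */
                uint64_t defined = (1ULL << (CAP_LAST_CAP + 1)) - 1;

                caps &= defined;        /* drop undefined future bits */
                printf("masked caps: %#llx\n", (unsigned long long)caps);
                return 0;
        }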

    Signed-off-by: Eric Paris
    Reviewed-by: Kees Cook
    Cc: Andrew Vagin
    Cc: Andrew G. Morgan
    Cc: Serge E. Hallyn
    Cc: Kees Cook
    Cc: Steve Grubb
    Cc: Dan Walsh
    Signed-off-by: James Morris
    Signed-off-by: Jiri Slaby

    Eric Paris
     

16 Sep, 2014

1 commit

  • commit 13c42c2f43b19aab3195f2d357db00d1e885eaa8 upstream.

    futex_wait_requeue_pi() calls futex_wait_setup(). If
    futex_wait_setup() succeeds it returns with hb->lock held and
    preemption disabled. Now the sanity check after this does:

        if (match_futex(&q.key, &key2)) {
                ret = -EINVAL;
                goto out_put_keys;
        }

    which releases the keys but does not release hb->lock.

    So we happily return to user space with hb->lock held and therefore
    preemption disabled.

    Unlock hb->lock before taking the exit route.
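
    A hedged sketch of the shape of the fix (futex.c may use an unlock
    helper rather than a raw spin_unlock on hb->lock):

        if (match_futex(&q.key, &key2)) {
                /* drop the hash-bucket lock before bailing out so we do not
                 * return to user space with hb->lock held */
                spin_unlock(&hb->lock);
                ret = -EINVAL;
                goto out_put_keys;
        }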

    Reported-by: Dave "Trinity" Jones
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Darren Hart
    Reviewed-by: Davidlohr Bueso
    Cc: Peter Zijlstra
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1409112318500.4178@nanos
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Jiri Slaby

    Thomas Gleixner
     

02 Sep, 2014

1 commit

  • commit 84c91b7ae07c62cf6dee7fde3277f4be21331f85 upstream.

    When the machine does not keep the e820 map consistent across a
    hibernation resume, a page fault may occur when writing the image to the
    snapshot buffer:

    [ 17.929495] BUG: unable to handle kernel paging request at ffff880069d4f000
    [ 17.933469] IP: [] load_image_lzo+0x810/0xe40
    [ 17.933469] PGD 2194067 PUD 77ffff067 PMD 2197067 PTE 0
    [ 17.933469] Oops: 0002 [#1] SMP
    ...

    The ffff880069d4f000 page is in an e820 reserved region of the resume
    boot kernel:

    [ 0.000000] BIOS-e820: [mem 0x0000000069d4f000-0x0000000069e12fff] reserved
    ...
    [ 0.000000] PM: Registered nosave memory: [mem 0x69d4f000-0x69e12fff]

    So snapshot.c marks the pfn in the forbidden pages map. But this
    page is also in the memory bitmap of the snapshot image, because it is an
    original page used by the image kernel, so it will also be marked as an
    unsafe (free) page in prepare_image().

    That means the page in e820 is marked both "forbidden" and "free" when
    resuming, which causes get_buffer() to treat it as an allocated unsafe
    page. snapshot_write_next() then returns this page to load_image, which
    writes content to this address, but the page was never really allocated.
    So we get a page fault.

    Although the root cause is in the BIOS, an aggressive check and a clear
    message from the kernel are better than a page fault for tracking the
    issue down, especially when a serial console is unavailable.

    This patch adds code to mark_unsafe_pages() to check whether any free
    page falls in a nosave region. If so, it prints a message and returns a
    fault to stop the whole S4 resume process:

    [ 8.166004] PM: Image loading progress: 0%
    [ 8.658717] PM: 0x6796c000 in e820 nosave region: [mem 0x6796c000-0x6796cfff]
    [ 8.918737] PM: Read 2511940 kbytes in 1.04 seconds (2415.32 MB/s)
    [ 8.926633] PM: Error -14 resuming
    [ 8.933534] PM: Failed to load hibernation image, recovering.

    Reviewed-by: Takashi Iwai
    Acked-by: Pavel Machek
    Signed-off-by: Lee, Chun-Yi
    [rjw: Subject]
    Signed-off-by: Rafael J. Wysocki

    Signed-off-by: Jiri Slaby

    Lee, Chun-Yi
     

19 Aug, 2014

2 commits

  • commit 504d58745c9ca28d33572e2d8a9990b43e06075d upstream.

    clockevents_increase_min_delta() calls printk() from under
    hrtimer_bases.lock. That causes lock inversion on scheduler locks because
    printk() can call into the scheduler. Lockdep puts it as:

    ======================================================
    [ INFO: possible circular locking dependency detected ]
    3.15.0-rc8-06195-g939f04b #2 Not tainted
    -------------------------------------------------------
    trinity-main/74 is trying to acquire lock:
    (&port_lock_key){-.....}, at: [] serial8250_console_write+0x8c/0x10c

    but task is already holding lock:
    (hrtimer_bases.lock){-.-...}, at: [] hrtimer_try_to_cancel+0x13/0x66

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #5 (hrtimer_bases.lock){-.-...}:
    [] lock_acquire+0x92/0x101
    [] _raw_spin_lock_irqsave+0x2e/0x3e
    [] __hrtimer_start_range_ns+0x1c/0x197
    [] perf_swevent_start_hrtimer.part.41+0x7a/0x85
    [] task_clock_event_start+0x3a/0x3f
    [] task_clock_event_add+0xd/0x14
    [] event_sched_in+0xb6/0x17a
    [] group_sched_in+0x44/0x122
    [] ctx_sched_in.isra.67+0x105/0x11f
    [] perf_event_sched_in.isra.70+0x47/0x4b
    [] __perf_install_in_context+0x8b/0xa3
    [] remote_function+0x12/0x2a
    [] smp_call_function_single+0x2d/0x53
    [] task_function_call+0x30/0x36
    [] perf_install_in_context+0x87/0xbb
    [] SYSC_perf_event_open+0x5c6/0x701
    [] SyS_perf_event_open+0x17/0x19
    [] syscall_call+0x7/0xb

    -> #4 (&ctx->lock){......}:
    [] lock_acquire+0x92/0x101
    [] _raw_spin_lock+0x21/0x30
    [] __perf_event_task_sched_out+0x1dc/0x34f
    [] __schedule+0x4c6/0x4cb
    [] schedule+0xf/0x11
    [] work_resched+0x5/0x30

    -> #3 (&rq->lock){-.-.-.}:
    [] lock_acquire+0x92/0x101
    [] _raw_spin_lock+0x21/0x30
    [] __task_rq_lock+0x33/0x3a
    [] wake_up_new_task+0x25/0xc2
    [] do_fork+0x15c/0x2a0
    [] kernel_thread+0x1a/0x1f
    [] rest_init+0x1a/0x10e
    [] start_kernel+0x303/0x308
    [] i386_start_kernel+0x79/0x7d

    -> #2 (&p->pi_lock){-.-...}:
    [] lock_acquire+0x92/0x101
    [] _raw_spin_lock_irqsave+0x2e/0x3e
    [] try_to_wake_up+0x1d/0xd6
    [] default_wake_function+0xb/0xd
    [] __wake_up_common+0x39/0x59
    [] __wake_up+0x29/0x3b
    [] tty_wakeup+0x49/0x51
    [] uart_write_wakeup+0x17/0x19
    [] serial8250_tx_chars+0xbc/0xfb
    [] serial8250_handle_irq+0x54/0x6a
    [] serial8250_default_handle_irq+0x19/0x1c
    [] serial8250_interrupt+0x38/0x9e
    [] handle_irq_event_percpu+0x5f/0x1e2
    [] handle_irq_event+0x2c/0x43
    [] handle_level_irq+0x57/0x80
    [] handle_irq+0x46/0x5c
    [] do_IRQ+0x32/0x89
    [] common_interrupt+0x2e/0x33
    [] _raw_spin_unlock_irqrestore+0x3f/0x49
    [] uart_start+0x2d/0x32
    [] uart_write+0xc7/0xd6
    [] n_tty_write+0xb8/0x35e
    [] tty_write+0x163/0x1e4
    [] redirected_tty_write+0x6d/0x75
    [] vfs_write+0x75/0xb0
    [] SyS_write+0x44/0x77
    [] syscall_call+0x7/0xb

    -> #1 (&tty->write_wait){-.....}:
    [] lock_acquire+0x92/0x101
    [] _raw_spin_lock_irqsave+0x2e/0x3e
    [] __wake_up+0x15/0x3b
    [] tty_wakeup+0x49/0x51
    [] uart_write_wakeup+0x17/0x19
    [] serial8250_tx_chars+0xbc/0xfb
    [] serial8250_handle_irq+0x54/0x6a
    [] serial8250_default_handle_irq+0x19/0x1c
    [] serial8250_interrupt+0x38/0x9e
    [] handle_irq_event_percpu+0x5f/0x1e2
    [] handle_irq_event+0x2c/0x43
    [] handle_level_irq+0x57/0x80
    [] handle_irq+0x46/0x5c
    [] do_IRQ+0x32/0x89
    [] common_interrupt+0x2e/0x33
    [] _raw_spin_unlock_irqrestore+0x3f/0x49
    [] uart_start+0x2d/0x32
    [] uart_write+0xc7/0xd6
    [] n_tty_write+0xb8/0x35e
    [] tty_write+0x163/0x1e4
    [] redirected_tty_write+0x6d/0x75
    [] vfs_write+0x75/0xb0
    [] SyS_write+0x44/0x77
    [] syscall_call+0x7/0xb

    -> #0 (&port_lock_key){-.....}:
    [] __lock_acquire+0x9ea/0xc6d
    [] lock_acquire+0x92/0x101
    [] _raw_spin_lock_irqsave+0x2e/0x3e
    [] serial8250_console_write+0x8c/0x10c
    [] call_console_drivers.constprop.31+0x87/0x118
    [] console_unlock+0x1d7/0x398
    [] vprintk_emit+0x3da/0x3e4
    [] printk+0x17/0x19
    [] clockevents_program_min_delta+0x104/0x116
    [] clockevents_program_event+0xe7/0xf3
    [] tick_program_event+0x1e/0x23
    [] hrtimer_force_reprogram+0x88/0x8f
    [] __remove_hrtimer+0x5b/0x79
    [] hrtimer_try_to_cancel+0x49/0x66
    [] hrtimer_cancel+0xd/0x18
    [] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
    [] task_clock_event_stop+0x20/0x64
    [] task_clock_event_del+0xd/0xf
    [] event_sched_out+0xab/0x11e
    [] group_sched_out+0x1d/0x66
    [] ctx_sched_out+0xaf/0xbf
    [] __perf_event_task_sched_out+0x1ed/0x34f
    [] __schedule+0x4c6/0x4cb
    [] schedule+0xf/0x11
    [] work_resched+0x5/0x30

    other info that might help us debug this:

    Chain exists of:
    &port_lock_key --> &ctx->lock --> hrtimer_bases.lock

    Possible unsafe locking scenario:

           CPU0                         CPU1
           ----                         ----
      lock(hrtimer_bases.lock);
                                        lock(&ctx->lock);
                                        lock(hrtimer_bases.lock);
      lock(&port_lock_key);

    *** DEADLOCK ***

    4 locks held by trinity-main/74:
    #0: (&rq->lock){-.-.-.}, at: [] __schedule+0xed/0x4cb
    #1: (&ctx->lock){......}, at: [] __perf_event_task_sched_out+0x1dc/0x34f
    #2: (hrtimer_bases.lock){-.-...}, at: [] hrtimer_try_to_cancel+0x13/0x66
    #3: (console_lock){+.+...}, at: [] vprintk_emit+0x3c7/0x3e4

    stack backtrace:
    CPU: 0 PID: 74 Comm: trinity-main Not tainted 3.15.0-rc8-06195-g939f04b #2
    00000000 81c3a310 8b995c14 81426f69 8b995c44 81425a99 8161f671 8161f570
    8161f538 8161f559 8161f538 8b995c78 8b142bb0 00000004 8b142fdc 8b142bb0
    8b995ca8 8104a62d 8b142fac 000016f2 81c3a310 00000001 00000001 00000003
    Call Trace:
    [] dump_stack+0x16/0x18
    [] print_circular_bug+0x18f/0x19c
    [] __lock_acquire+0x9ea/0xc6d
    [] lock_acquire+0x92/0x101
    [] ? serial8250_console_write+0x8c/0x10c
    [] ? wait_for_xmitr+0x76/0x76
    [] _raw_spin_lock_irqsave+0x2e/0x3e
    [] ? serial8250_console_write+0x8c/0x10c
    [] serial8250_console_write+0x8c/0x10c
    [] ? lock_release+0x191/0x223
    [] ? wait_for_xmitr+0x76/0x76
    [] call_console_drivers.constprop.31+0x87/0x118
    [] console_unlock+0x1d7/0x398
    [] vprintk_emit+0x3da/0x3e4
    [] printk+0x17/0x19
    [] clockevents_program_min_delta+0x104/0x116
    [] tick_program_event+0x1e/0x23
    [] hrtimer_force_reprogram+0x88/0x8f
    [] __remove_hrtimer+0x5b/0x79
    [] hrtimer_try_to_cancel+0x49/0x66
    [] hrtimer_cancel+0xd/0x18
    [] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
    [] task_clock_event_stop+0x20/0x64
    [] task_clock_event_del+0xd/0xf
    [] event_sched_out+0xab/0x11e
    [] group_sched_out+0x1d/0x66
    [] ctx_sched_out+0xaf/0xbf
    [] __perf_event_task_sched_out+0x1ed/0x34f
    [] ? __dequeue_entity+0x23/0x27
    [] ? pick_next_task_fair+0xb1/0x120
    [] __schedule+0x4c6/0x4cb
    [] ? trace_hardirqs_off_caller+0xd7/0x108
    [] ? trace_hardirqs_off+0xb/0xd
    [] ? rcu_irq_exit+0x64/0x77

    Fix the problem by using printk_deferred() which does not call into the
    scheduler.
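
    Illustratively, the call site changes along these lines (the message
    text here is a placeholder, not the exact string in the kernel):

        /* safe under hrtimer_bases.lock: the message is flushed later,
         * outside the scheduler and timer locks */
        printk_deferred(KERN_WARNING "CE: increased min_delta_ns to %llu nsec\n",
                        (unsigned long long)dev->min_delta_ns);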

    Reported-by: Fengguang Wu
    Signed-off-by: Jan Kara
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Jiri Slaby

    Jan Kara
     
  • commit aac74dc495456412c4130a1167ce4beb6c1f0b38 upstream.

    After learning we'll need some sort of deferred printk functionality in
    the timekeeping core, Peter suggested we rename the printk_sched function
    so it can be reused by the subsystems that need it.

    This only changes the function name. No logic changes.

    Signed-off-by: John Stultz
    Reviewed-by: Steven Rostedt
    Cc: Jan Kara
    Cc: Peter Zijlstra
    Cc: Jiri Bohac
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    John Stultz
     

31 Jul, 2014

1 commit

  • commit 58d4e21e50ff3cc57910a8abc20d7e14375d2f61 upstream.

    The "uptime" trace clock added in:

    commit 8aacf017b065a805d27467843490c976835eb4a5
    tracing: Add "uptime" trace clock that uses jiffies

    has wraparound problems when the system has been up more
    than 1 hour 11 minutes and 34 seconds. It converts jiffies
    to nanoseconds using:
    (u64)jiffies_to_usecs(jiffy) * 1000ULL
    but since jiffies_to_usecs() only returns a 32-bit value, it
    truncates at 2^32 microseconds. An additional problem on 32-bit
    systems is that the argument is "unsigned long", so fixing the
    return value only helps until 2^32 jiffies (49.7 days on a HZ=1000
    system).

    Avoid these problems by using jiffies_64 as our basis, and
    not converting to nanoseconds (we do convert to clock_t because
    user facing API must not be dependent on internal kernel
    HZ values).
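
    A small, standalone illustration of why the old conversion wraps where
    it does (plain arithmetic, not kernel code):

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
                /* jiffies_to_usecs() returns a 32-bit value, so the usec
                 * count wraps at 2^32 regardless of the later (u64) cast. */
                uint64_t wrap_us = 1ULL << 32;
                uint64_t secs = wrap_us / 1000000;

                printf("wraps after %llu s = %lluh %llum %llus\n",
                       (unsigned long long)secs,
                       (unsigned long long)(secs / 3600),
                       (unsigned long long)(secs % 3600 / 60),
                       (unsigned long long)(secs % 60));
                return 0;
        }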

    Link: http://lkml.kernel.org/p/99d63c5bfe9b320a3b428d773825a37095bf6a51.1405708254.git.tony.luck@intel.com

    Fixes: 8aacf017b065 "tracing: Add "uptime" trace clock that uses jiffies"
    Signed-off-by: Tony Luck
    Signed-off-by: Steven Rostedt
    Signed-off-by: Jiri Slaby

    Tony Luck
     

29 Jul, 2014

8 commits

  • commit b0ab99e7736af88b8ac1b7ae50ea287fffa2badc upstream.

    proc_sched_show_task() does:

        if (nr_switches)
                do_div(avg_atom, nr_switches);

    nr_switches is unsigned long and do_div truncates it to 32 bits, which
    means it can test non-zero on e.g. x86-64 and be truncated to zero for
    division.

    Fix the problem by using div64_ul() instead.

    As a side effect calculations of avg_atom for big nr_switches are now correct.
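
    Illustratively, the division then looks like this (a sketch, not the
    exact hunk):

        if (nr_switches)
                /* 64-bit dividend, unsigned long divisor: no truncation */
                avg_atom = div64_ul(avg_atom, nr_switches);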

    Signed-off-by: Mateusz Guzik
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1402750809-31991-1-git-send-email-mguzik@redhat.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Jiri Slaby

    Mateusz Guzik
     
  • commit 4badad352a6bb202ec68afa7a574c0bb961e5ebc upstream.

    The optimistic spin code assumes regular stores and cmpxchg() play nice;
    this is found to not be true for at least: parisc, sparc32, tile32,
    metag-lock1, arc-!llsc and hexagon.

    There is further wreckage, but this in particular seemed easy to
    trigger, so blacklist this.

    Opt in for known good archs.

    Signed-off-by: Peter Zijlstra
    Reported-by: Mikulas Patocka
    Cc: David Miller
    Cc: Chris Metcalf
    Cc: James Bottomley
    Cc: Vineet Gupta
    Cc: Jason Low
    Cc: Waiman Long
    Cc: "James E.J. Bottomley"
    Cc: Paul McKenney
    Cc: John David Anglin
    Cc: James Hogan
    Cc: Linus Torvalds
    Cc: Davidlohr Bueso
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Russell King
    Cc: Will Deacon
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: sparclinux@vger.kernel.org
    Link: http://lkml.kernel.org/r/20140606175316.GV13930@laptop.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar
    Signed-off-by: Jiri Slaby

    Peter Zijlstra
     
  • commit 4320f6b1d9db4ca912c5eb6ecb328b2e090e1586 upstream.

    The commit [247bc037: PM / Sleep: Mitigate race between the freezer
    and request_firmware()] introduced the finer state control, but it
    also leads to a new bug; for example, a bug report regarding the
    firmware loading of intel BT device at suspend/resume:
    https://bugzilla.novell.com/show_bug.cgi?id=873790

    The root cause seems to be a small window between the process resume
    and the clear of usermodehelper lock. The request_firmware() function
    checks the UMH lock and gives up when it's in UMH_DISABLE state. This
    is for avoiding the invalid f/w loading during suspend/resume phase.
    The problem is, however, that usermodehelper_enable() is called at the
    end of thaw_processes(). Thus, a thawed process in between can kick
    off the f/w loader code path (in this case, via btusb_setup_intel())
    even before the call of usermodehelper_enable(). Then
    usermodehelper_read_trylock() returns an error and request_firmware()
    spews WARN_ON() in the end.

    This one-liner patch fixes the issue by simply setting the state back to
    UMH_FREEZING before restarting tasks, so that a call to
    request_firmware() will be blocked until the end of this function
    instead of returning an error.

    Fixes: 247bc0374254 (PM / Sleep: Mitigate race between the freezer and request_firmware())
    Link: https://bugzilla.novell.com/show_bug.cgi?id=873790
    Signed-off-by: Takashi Iwai
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Jiri Slaby

    Takashi Iwai
     
  • commit 16927776ae757d0d132bdbfabbfe2c498342bd59 upstream.

    Sharvil noticed that with the posix timer_settime() interface, using the
    CLOCK_REALTIME_ALARM or CLOCK_BOOTTIME_ALARM clockid, if the user
    tried to specify a relative timer, it would incorrectly be treated as
    absolute regardless of the state of the flags argument.

    This patch corrects this, properly checking the absolute/relative flag,
    and adds further error checking that no invalid flag bits are set.
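
    For reference, a small userspace example of the case that went wrong:
    with flags == 0 the timer below must fire 5 seconds from now, not at an
    absolute time (CLOCK_REALTIME is used so the example builds everywhere;
    the affected clocks were CLOCK_REALTIME_ALARM and CLOCK_BOOTTIME_ALARM):

        #include <signal.h>
        #include <stdio.h>
        #include <time.h>
        #include <unistd.h>

        int main(void)
        {
                timer_t tid;
                struct sigevent sev = { .sigev_notify = SIGEV_SIGNAL,
                                        .sigev_signo  = SIGALRM };
                struct itimerspec its = { .it_value = { .tv_sec = 5 } };

                if (timer_create(CLOCK_REALTIME, &sev, &tid) == -1) {
                        perror("timer_create");
                        return 1;
                }
                /* flags == 0: relative.  TIMER_ABSTIME would make it_value
                 * an absolute expiry time instead. */
                if (timer_settime(tid, 0, &its, NULL) == -1) {
                        perror("timer_settime");
                        return 1;
                }
                pause();        /* SIGALRM ends the process after ~5 s */
                return 0;
        }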

    Reported-by: Sharvil Nanavati
    Signed-off-by: John Stultz
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Prarit Bhargava
    Cc: Sharvil Nanavati
    Link: http://lkml.kernel.org/r/1404767171-6902-1-git-send-email-john.stultz@linaro.org
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Jiri Slaby

    John Stultz
     
  • commit 97b8ee845393701edc06e27ccec2876ff9596019 upstream.

    ring_buffer_poll_wait() should always add the poll_table to its wait_queue
    even when there is immediate data available. Otherwise, the following epoll
    and read sequence will eventually hang forever:

    1. Put some data to make the trace_pipe ring_buffer read ready first
    2. epoll_ctl(efd, EPOLL_CTL_ADD, trace_pipe_fd, ee)
    3. epoll_wait()
    4. read(trace_pipe_fd) till EAGAIN
    5. Add some more data to the trace_pipe ring_buffer
    6. epoll_wait() -> this epoll_wait() will block forever

    ~ During the epoll_ctl(efd, EPOLL_CTL_ADD,...) call in step 2,
    ring_buffer_poll_wait() returns immediately without adding poll_table,
    which has poll_table->_qproc pointing to ep_poll_callback(), to its
    wait_queue.
    ~ During the epoll_wait() call in step 3 and step 6,
    ring_buffer_poll_wait() cannot add ep_poll_callback() to its wait_queue
    because the poll_table->_qproc is NULL and it is how epoll works.
    ~ When there is new data available in step 6, the ring_buffer does not know
    it has to call ep_poll_callback() because it is not in its wait queue.
    Hence, it blocks forever.

    Other poll implementations seem to call poll_wait() unconditionally as the
    very first thing they do; for example, tcp_poll() in tcp.c.
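
    The usual shape of such a poll handler (an illustrative sketch;
    example_waitqueue and example_data_ready() are placeholders, not the
    tracing code itself):

        #include <linux/poll.h>

        static DECLARE_WAIT_QUEUE_HEAD(example_waitqueue);

        static bool example_data_ready(void);   /* placeholder */

        static unsigned int example_poll(struct file *filp, poll_table *wait)
        {
                unsigned int mask = 0;

                /* Register with the wait queue first, even if data is ready
                 * right now, so a later wakeup reaches ep_poll_callback(). */
                poll_wait(filp, &example_waitqueue, wait);

                if (example_data_ready())
                        mask |= POLLIN | POLLRDNORM;
                return mask;
        }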

    Link: http://lkml.kernel.org/p/20140610060637.GA14045@devbig242.prn2.facebook.com

    Fixes: 2a2cc8f7c4d0 "ftrace: allow the event pipe to be polled"
    Reviewed-by: Chris Mason
    Signed-off-by: Martin Lau
    Signed-off-by: Steven Rostedt
    Signed-off-by: Jiri Slaby

    Martin Lau
     
  • commit f0160a5a2912267c02cfe692eac955c360de5fdf upstream.

    The TRACE_ITER_PRINTK check in __trace_puts/__trace_bputs is missing,
    so add it, to be consistent with __trace_printk/__trace_bprintk.
    Those functions are all called by the same function: trace_printk().
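
    The added check is essentially of this shape (a sketch; TRACE_ITER_PRINTK
    is named above, while trace_flags is assumed to be the tracer's option
    bitmask):

        /* mirror __trace_printk(): do nothing when the printk option is off */
        if (!(trace_flags & TRACE_ITER_PRINTK))
                return 0;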

    Link: http://lkml.kernel.org/p/51E7A7D6.8090900@huawei.com

    Signed-off-by: zhangwei(Jovi)
    Signed-off-by: Steven Rostedt
    Signed-off-by: Jiri Slaby

    zhangwei(Jovi)
     
  • commit 8abfb8727f4a724d31f9ccfd8013fbd16d539445 upstream.

    Currently the stacktrace trace option is not applicable for
    trace_printk() with a constant string argument; the reason is that the
    ftrace_trace_stack() call is missing in __trace_puts/__trace_bputs.

    In contrast, when using trace_printk() with a non-constant string
    argument (which calls into __trace_printk/__trace_bprintk), the
    stacktrace trace option works. This inconsistent behavior confuses
    users a lot.

    Link: http://lkml.kernel.org/p/51E7A7C9.9040401@huawei.com

    Signed-off-by: zhangwei(Jovi)
    Signed-off-by: Steven Rostedt
    Signed-off-by: Jiri Slaby

    zhangwei(Jovi)
     
  • commit 5f8bf2d263a20b986225ae1ed7d6759dc4b93af9 upstream.

    Running my ftrace tests on PowerPC, it failed the test that checks
    if function_graph tracer is affected by the stack tracer. It was.
    Looking into this, I found that the update_function_graph_func()
    must be called even if the trampoline function is not changed.
    This is because archs like PowerPC do not support ftrace_ops being
    passed by assembly and instead use a helper function (what the
    trampoline function points to). Since this function is not changed
    even when multiple ftrace_ops are added to the code, the test that
    falls out before calling update_function_graph_func() will miss that
    the update must still be done.

    Call update_function_graph_func() for all calls to
    update_ftrace_function().

    Signed-off-by: Steven Rostedt
    Signed-off-by: Jiri Slaby

    Steven Rostedt (Red Hat)
     

18 Jul, 2014

6 commits

  • commit 8b8b36834d0fff67fc8668093f4312dd04dcf21d upstream.

    The per_cpu buffers are created one per possible CPU. But that does not
    mean those CPUs are online, or that they even exist.

    With the addition of the ring buffer polling, it assumes that the
    caller polls on an existing buffer. But this is not the case if
    the user reads trace_pipe from a CPU that does not exist, and this
    causes the kernel to crash.

    The simple fix is to check the cpu against the buffer bitmask to see
    whether the buffer was allocated or not, and to return -ENODEV if it is
    not.

    More updates were done to pass the -ENODEV back up to userspace.
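
    The check is essentially of this shape (a sketch, assuming the ring
    buffer keeps a cpumask of the CPUs it allocated buffers for):

        /* no buffer was ever allocated for this CPU: nothing to poll */
        if (!cpumask_test_cpu(cpu, buffer->cpumask))
                return -ENODEV;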

    Link: http://lkml.kernel.org/r/5393DB61.6060707@oracle.com

    Reported-by: Sasha Levin
    Signed-off-by: Steven Rostedt
    Signed-off-by: Jiri Slaby

    Steven Rostedt (Red Hat)
     
  • commit 5a6024f1604eef119cf3a6fa413fe0261a81a8f3 upstream.

    When hot-adding and onlining a CPU, a kernel panic occurs, showing the
    following call trace.

    BUG: unable to handle kernel paging request at 0000000000001d08
    IP: [] __alloc_pages_nodemask+0x9d/0xb10
    PGD 0
    Oops: 0000 [#1] SMP
    ...
    Call Trace:
    [] ? cpumask_next_and+0x35/0x50
    [] ? find_busiest_group+0x113/0x8f0
    [] ? deactivate_slab+0x349/0x3c0
    [] new_slab+0x91/0x300
    [] __slab_alloc+0x2bb/0x482
    [] ? copy_process.part.25+0xfc/0x14c0
    [] ? load_balance+0x218/0x890
    [] ? sched_clock+0x9/0x10
    [] ? trace_clock_local+0x9/0x10
    [] kmem_cache_alloc_node+0x8c/0x200
    [] copy_process.part.25+0xfc/0x14c0
    [] ? trace_buffer_unlock_commit+0x4d/0x60
    [] ? kthread_create_on_node+0x140/0x140
    [] do_fork+0xbc/0x360
    [] kernel_thread+0x26/0x30
    [] kthreadd+0x2c2/0x300
    [] ? kthread_create_on_cpu+0x60/0x60
    [] ret_from_fork+0x7c/0xb0
    [] ? kthread_create_on_cpu+0x60/0x60

    In my investigation, I found the root cause to be wq_numa_possible_cpumask.
    All entries of wq_numa_possible_cpumask are allocated by
    alloc_cpumask_var_node(), and these entries are used without being
    initialized, so they contain garbage values.

    When hot-adding and onlining a CPU, wq_update_unbound_numa() is called.
    wq_update_unbound_numa() calls alloc_unbound_pwq(), and alloc_unbound_pwq()
    calls get_unbound_pool(). In get_unbound_pool(), worker_pool->node is set
    as follows:

        /* if cpumask is contained inside a NUMA node, we belong to that node */
        if (wq_numa_enabled) {
                for_each_node(node) {
                        if (cpumask_subset(pool->attrs->cpumask,
                                           wq_numa_possible_cpumask[node])) {
                                pool->node = node;
                                break;
                        }
                }
        }

    But wq_numa_possible_cpumask[node] does not hold a correct cpumask, so the
    wrong node is selected. As a result, a kernel panic occurs.

    With this patch, all entries of wq_numa_possible_cpumask are allocated by
    zalloc_cpumask_var_node() so that they are initialized, and the panic
    disappears.
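
    Illustratively, the allocation changes along these lines (a sketch, not
    the exact hunk):

        /* zalloc_cpumask_var_node() hands back a zero-filled mask, so bits
         * never explicitly set later read as 0 instead of garbage */
        if (!zalloc_cpumask_var_node(&wq_numa_possible_cpumask[node],
                                     GFP_KERNEL, node))
                goto fail;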

    Signed-off-by: Yasuaki Ishimatsu
    Reviewed-by: Lai Jiangshan
    Signed-off-by: Tejun Heo
    Fixes: bce903809ab3 ("workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]")
    Signed-off-by: Jiri Slaby

    Yasuaki Ishimatsu
     
  • commit 391acf970d21219a2a5446282d3b20eace0c0d7a upstream.

    When running with the kernel (3.15-rc7+), the following bug occurs:
    [ 9969.258987] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586
    [ 9969.359906] in_atomic(): 1, irqs_disabled(): 0, pid: 160655, name: python
    [ 9969.441175] INFO: lockdep is turned off.
    [ 9969.488184] CPU: 26 PID: 160655 Comm: python Tainted: G A 3.15.0-rc7+ #85
    [ 9969.581032] Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 1.39 11/16/2012
    [ 9969.706052] ffffffff81a20e60 ffff8803e941fbd0 ffffffff8162f523 ffff8803e941fd18
    [ 9969.795323] ffff8803e941fbe0 ffffffff8109995a ffff8803e941fc58 ffffffff81633e6c
    [ 9969.884710] ffffffff811ba5dc ffff880405c6b480 ffff88041fdd90a0 0000000000002000
    [ 9969.974071] Call Trace:
    [ 9970.003403] [] dump_stack+0x4d/0x66
    [ 9970.065074] [] __might_sleep+0xfa/0x130
    [ 9970.130743] [] mutex_lock_nested+0x3c/0x4f0
    [ 9970.200638] [] ? kmem_cache_alloc+0x1bc/0x210
    [ 9970.272610] [] cpuset_mems_allowed+0x27/0x140
    [ 9970.344584] [] ? __mpol_dup+0x63/0x150
    [ 9970.409282] [] __mpol_dup+0xe5/0x150
    [ 9970.471897] [] ? __mpol_dup+0x63/0x150
    [ 9970.536585] [] ? copy_process.part.23+0x606/0x1d40
    [ 9970.613763] [] ? trace_hardirqs_on+0xd/0x10
    [ 9970.683660] [] ? monotonic_to_bootbased+0x2f/0x50
    [ 9970.759795] [] copy_process.part.23+0x670/0x1d40
    [ 9970.834885] [] do_fork+0xd8/0x380
    [ 9970.894375] [] ? __audit_syscall_entry+0x9c/0xf0
    [ 9970.969470] [] SyS_clone+0x16/0x20
    [ 9971.030011] [] stub_clone+0x69/0x90
    [ 9971.091573] [] ? system_call_fastpath+0x16/0x1b

    The cause is that cpuset_mems_allowed() tries to take
    mutex_lock(&callback_mutex) under rcu_read_lock (which was held in
    __mpol_dup()). And since in cpuset_mems_allowed() the access to the cpuset
    is already under rcu_read_lock, in __mpol_dup we can shrink the
    rcu_read_lock protection region so that it only protects the access to the
    cpuset in current_cpuset_is_being_rebound(). That way we can avoid this
    bug.

    This patch is a temporary solution that just addresses the bug
    mentioned above; it cannot fix the long-standing issue of cpuset.mems
    rebinding on fork():

    "When the forker's task_struct is duplicated (which includes
    ->mems_allowed) and it races with an update to cpuset_being_rebound
    in update_tasks_nodemask() then the task's mems_allowed doesn't get
    updated. And the child task's mems_allowed can be wrong if the
    cpuset's nodemask changes before the child has been added to the
    cgroup's tasklist."

    Signed-off-by: Gu Zheng
    Acked-by: Li Zefan
    Signed-off-by: Tejun Heo
    Signed-off-by: Jiri Slaby

    Gu Zheng
     
  • commit bddbceb688c6d0decaabc7884fede319d02f96c8 upstream.

    Uevents are suppressed during attribute registration, but never
    re-enabled, so kobject_uevent() does nothing.
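
    A sketch of the standard way to address this (dev here stands for the
    workqueue's struct device; names are illustrative):

        /* attributes are in place: stop suppressing uevents and announce
         * the device so userspace actually hears about it */
        dev_set_uevent_suppress(dev, false);
        kobject_uevent(&dev->kobj, KOBJ_ADD);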

    Signed-off-by: Maxime Bizon
    Signed-off-by: Tejun Heo
    Fixes: 226223ab3c4118ddd10688cc2c131135848371ab
    Signed-off-by: Jiri Slaby

    Maxime Bizon
     
  • commit 099ed151675cd1d2dbeae1dac697975f6a68716d upstream.

    Disabling reading and writing to the trace file should not be able to
    disable all function tracing callbacks. There are other users today
    (like kprobes and perf). Reading a trace file should not stop those
    from happening.

    Reviewed-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt
    Signed-off-by: Jiri Slaby

    Steven Rostedt (Red Hat)
     
  • commit 7cd2b0a34ab8e4db971920eef8982f985441adfb upstream.

    Oleg reports a division by zero error on zero-length write() to the
    percpu_pagelist_fraction sysctl:

    divide error: 0000 [#1] SMP DEBUG_PAGEALLOC
    CPU: 1 PID: 9142 Comm: badarea_io Not tainted 3.15.0-rc2-vm-nfs+ #19
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    task: ffff8800d5aeb6e0 ti: ffff8800d87a2000 task.ti: ffff8800d87a2000
    RIP: 0010: percpu_pagelist_fraction_sysctl_handler+0x84/0x120
    RSP: 0018:ffff8800d87a3e78 EFLAGS: 00010246
    RAX: 0000000000000f89 RBX: ffff88011f7fd000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000010
    RBP: ffff8800d87a3e98 R08: ffffffff81d002c8 R09: ffff8800d87a3f50
    R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000060
    R13: ffffffff81c3c3e0 R14: ffffffff81cfddf8 R15: ffff8801193b0800
    FS: 00007f614f1e9740(0000) GS:ffff88011f440000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007f614f1fa000 CR3: 00000000d9291000 CR4: 00000000000006e0
    Call Trace:
    proc_sys_call_handler+0xb3/0xc0
    proc_sys_write+0x14/0x20
    vfs_write+0xba/0x1e0
    SyS_write+0x46/0xb0
    tracesys+0xe1/0xe6

    However, if the percpu_pagelist_fraction sysctl is set by the user, it
    is also impossible to restore it to the kernel default since the user
    cannot write 0 to the sysctl.

    This patch allows the user to write 0 to restore the default behavior.
    It still requires a fraction equal to or larger than 8, however, as
    stated by the documentation for sanity. If a value in the range [1, 7]
    is written, the sysctl will return EINVAL.

    This successfully solves the divide by zero issue at the same time.

    Signed-off-by: David Rientjes
    Reported-by: Oleg Drokin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    David Rientjes
     

17 Jul, 2014

2 commits

  • commit 4af4206be2bd1933cae20c2b6fb2058dbc887f7c upstream.

    syscall_regfunc() and syscall_unregfunc() should set/clear
    TIF_SYSCALL_TRACEPOINT system-wide, but do_each_thread() can race
    with copy_process() and miss the new child which was not added to
    the process/thread lists yet.

    Change copy_process() to update the child's TIF_SYSCALL_TRACEPOINT
    under tasklist.

    Link: http://lkml.kernel.org/p/20140413185854.GB20668@redhat.com

    Fixes: a871bd33a6c0 "tracing: Add syscall tracepoints"
    Acked-by: Frederic Weisbecker
    Acked-by: Paul E. McKenney
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Steven Rostedt
    Signed-off-by: Jiri Slaby

    Oleg Nesterov
     
  • commit 379cfdac37923653c9d4242d10052378b7563005 upstream.

    In order to prevent the saved cmdline cache from being filled when
    tracing is not active, the comms are only recorded after a trace event
    is recorded.

    The problem is, a comm can fail to be recorded if the trace_cmdline_lock
    is held. That lock is taken via a trylock to allow it to happen from
    any context (including NMI). If the lock fails to be taken, the comm
    is skipped. No big deal, as we will try again later.

    But! Because of the code that was added to only record after an event,
    we may not try again later as the recording is made as a oneshot per
    event per CPU.

    Only disable the recording of the comm if the comm is actually recorded.

    Fixes: 7ffbd48d5cab "tracing: Cache comms only after an event occurred"
    Signed-off-by: Steven Rostedt
    Signed-off-by: Jiri Slaby

    Steven Rostedt (Red Hat)
     

16 Jul, 2014

4 commits

  • commit 27e35715df54cbc4f2d044f681802ae30479e7fb upstream.

    When the rtmutex fast path is enabled the slow unlock function can
    create the following situation:

        spin_lock(foo->m->wait_lock);
        foo->m->owner = NULL;
                                        rt_mutex_lock(foo->m);   <-- fast path
                                        free = atomic_dec_and_test(foo->refcnt);
                                        rt_mutex_unlock(foo->m); <-- fast path
                                        if (free)
                                                kfree(foo);

        spin_unlock(foo->m->wait_lock); <-- use after free

    To plug this race, the slow unlock path uses the following scheme:

        while (!rt_mutex_has_waiters(m)) {
                /* Clear the waiters bit in m->owner */
                clear_rt_mutex_waiters(m);
                owner = rt_mutex_owner(m);
                spin_unlock(m->wait_lock);
                if (cmpxchg(m->owner, owner, 0) == owner)
                        return;
                spin_lock(m->wait_lock);
        }

    So in case of a new waiter incoming while the owner tries the slow
    path unlock we have two situations:

        owner (slow path unlock)        incoming waiter

        unlock(wait_lock);
                                        lock(wait_lock);
        cmpxchg(p, owner, 0) == owner
                                        mark_rt_mutex_waiters(lock);
                                        acquire(lock);

    Or:

        unlock(wait_lock);
                                        lock(wait_lock);
                                        mark_rt_mutex_waiters(lock);
        cmpxchg(p, owner, 0) != owner
                                        enqueue_waiter();
                                        unlock(wait_lock);
        lock(wait_lock);
        wakeup_next waiter();
        unlock(wait_lock);
                                        lock(wait_lock);
                                        acquire(lock);

    If the fast path is disabled, then the simple

    m->owner = NULL;
    unlock(m->wait_lock);

    is sufficient as all access to m->owner is serialized via
    m->wait_lock;

    Also document and clarify the wakeup_next_waiter function as suggested
    by Oleg Nesterov.

    Reported-by: Steven Rostedt
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Steven Rostedt
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140611183852.937945560@linutronix.de
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Mike Galbraith
    Signed-off-by: Jiri Slaby

    Thomas Gleixner
     
  • commit 3d5c9340d1949733eb37616abd15db36aef9a57c upstream.

    Even in the case when deadlock detection is not requested by the
    caller, we can detect deadlocks. Right now the code stops the lock
    chain walk and keeps the waiter enqueued, even on itself. Silly not to
    yell when such a scenario is detected and to keep the waiter enqueued.

    Return -EDEADLK unconditionally and handle it at the call sites.

    The futex calls return -EDEADLK. The non futex ones dequeue the
    waiter, throw a warning and put the task into a schedule loop.

    Tagged for stable as it makes the code more robust.

    Signed-off-by: Thomas Gleixner
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Brad Mouring
    Link: http://lkml.kernel.org/r/20140605152801.836501969@linutronix.de
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Mike Galbraith
    Signed-off-by: Jiri Slaby

    Thomas Gleixner
     
  • commit 82084984383babe728e6e3c9a8e5c46278091315 upstream.

    When we walk the lock chain, we drop all locks after each step. So the
    lock chain can change under us before we reacquire the locks. That's
    harmless in principle as we just follow the wrong lock path. But it
    can lead to a false positive in the deadlock detection logic:

    T0 holds L0
    T0 blocks on L1 held by T1
    T1 blocks on L2 held by T2
    T2 blocks on L3 held by T3
    T3 blocks on L4 held by T4

    Now we walk the chain

    lock T1 -> lock L2 -> adjust L2 -> unlock T1 ->
    lock T2 -> adjust T2 -> drop locks

    T2 times out and blocks on L0

    Now we continue:

    lock T2 -> lock L0 -> deadlock detected, but it's not a deadlock at all.

    Brad tried to work around that in the deadlock detection logic itself,
    but the more I looked at it the less I liked it, because it's crystal
    ball magic after the fact.

    We can actually detect a chain change very simply:

    lock T1 -> lock L2 -> adjust L2 -> unlock T1 -> lock T2 -> adjust T2 ->

    next_lock = T2->pi_blocked_on->lock;

    drop locks

    T2 times out and blocks on L0

    Now we continue:

    lock T2 ->

    if (next_lock != T2->pi_blocked_on->lock)
    return;

    So if we detect that T2 is now blocked on a different lock we stop the
    chain walk. That's also correct in the following scenario:

    lock T1 -> lock L2 -> adjust L2 -> unlock T1 -> lock T2 -> adjust T2 ->

    next_lock = T2->pi_blocked_on->lock;

    drop locks

    T3 times out and drops L3
    T2 acquires L3 and blocks on L4 now

    Now we continue:

    lock T2 ->

    if (next_lock != T2->pi_blocked_on->lock)
    return;

    We don't have to follow up the chain at that point, because T2
    propagated our priority up to T4 already.

    [ Folded a cleanup patch from peterz ]

    Signed-off-by: Thomas Gleixner
    Reported-by: Brad Mouring
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140605152801.930031935@linutronix.de
    Signed-off-by: Mike Galbraith
    Signed-off-by: Jiri Slaby

    Thomas Gleixner
     
  • commit 397335f004f41e5fcf7a795e94eb3ab83411a17c upstream.

    The current deadlock detection logic does not work reliably due to the
    following early exit path:

        /*
         * Drop out, when the task has no waiters. Note,
         * top_waiter can be NULL, when we are in the deboosting
         * mode!
         */
        if (top_waiter && (!task_has_pi_waiters(task) ||
                           top_waiter != task_top_pi_waiter(task)))
                goto out_unlock_pi;

    So this not only exits when the task has no waiters, it also exits
    unconditionally when the current waiter is not the top priority waiter
    of the task.

    So in a nested locking scenario, it might abort the lock chain walk
    and therefore miss a potential deadlock.

    Simple fix: Continue the chain walk, when deadlock detection is
    enabled.

    We also avoid the whole enqueue if we detect the deadlock right away
    (A-A). It's an optimization, but it also prevents the case where another
    waiter, who comes in after the detection and before the task has undone
    the damage, observes the situation, detects the deadlock, and returns
    -EDEADLOCK, which is wrong, as the other task is not in a deadlock
    situation.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Reviewed-by: Steven Rostedt
    Cc: Lai Jiangshan
    Link: http://lkml.kernel.org/r/20140522031949.725272460@linutronix.de
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Mike Galbraith
    Signed-off-by: Jiri Slaby

    Thomas Gleixner
     

02 Jul, 2014

2 commits

  • commit 1e77d0a1ed7417d2a5a52a7b8d32aea1833faa6c upstream.

    Till reported that the spurious interrupt detection of threaded
    interrupts is broken in two ways:

    - note_interrupt() is called for each action thread of a shared
    interrupt line. That's wrong, as we are only interested in whether none
    of the device drivers felt responsible for the interrupt; by calling it
    multiple times for a single interrupt line we account IRQ_NONE even if
    one of the drivers felt responsible.

    - note_interrupt() when called from the thread handler is not
    serialized. That leaves the members of irq_desc which are used for
    the spurious detection unprotected.

    To solve this we need to defer the spurious detection of a threaded
    interrupt to the next hardware interrupt context where we have
    implicit serialization.

    If note_interrupt is called with action_ret == IRQ_WAKE_THREAD, we
    check whether the previous interrupt requested a deferred check. If
    not, we request a deferred check for the next hardware interrupt and
    return.

    If set, we check whether one of the interrupt threads signaled
    success. Depending on this information we feed the result into the
    spurious detector.

    If one primary handler of a shared interrupt returns IRQ_HANDLED we
    disable the deferred check of irq threads on the same line, as we have
    found at least one device driver who cared.

    Reported-by: Till Straumann
    Signed-off-by: Thomas Gleixner
    Tested-by: Austin Schuh
    Cc: Oliver Hartkopp
    Cc: Wolfgang Grandegger
    Cc: Pavel Pisa
    Cc: Marc Kleine-Budde
    Cc: linux-can@vger.kernel.org
    Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1303071450130.22263@ionos
    Signed-off-by: Jiri Slaby

    Thomas Gleixner
     
  • commit 4e52365f279564cef0ddd41db5237f0471381093 upstream.

    When tracing a process in another pid namespace, it's important for fork
    event messages to contain the child's pid as seen from the tracer's pid
    namespace, not the parent's. Otherwise, the tracer won't be able to
    correlate the fork event with later SIGTRAP signals it receives from the
    child.

    We still risk a race condition if a ptracer from a different pid
    namespace attaches after we compute the pid_t value. However, sending a
    bogus fork event message in this unlikely scenario is still a vast
    improvement over the status quo where we always send bogus fork event
    messages to debuggers in a different pid namespace than the forking
    process.

    Signed-off-by: Matthew Dempsky
    Acked-by: Oleg Nesterov
    Cc: Kees Cook
    Cc: Julien Tinnes
    Cc: Roland McGrath
    Cc: Jan Kratochvil
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Matthew Dempsky
     

27 Jun, 2014

2 commits

  • commit cadefd3d6cc914d95163ba1eda766bfe7ce1e5b7 upstream.

    Mike reported that, while unlikely, it's entirely possible for
    scale_rt_power() to see the time go backwards. This yields rather
    'interesting' results.

    So, like all other sites that deal with clocks, make this one ignore
    backward clock movement too.
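
    The usual guard (an illustrative sketch with placeholder variable names,
    not the exact hunk) clamps a negative delta to zero:

        s64 delta = now - last;         /* 'now' and 'last' are placeholders */

        if (unlikely(delta < 0))
                delta = 0;              /* clock went backwards: ignore it */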

    Reported-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140227094035.GZ9987@twins.programming.kicks-ass.net
    Cc: Linus Torvalds
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Jiri Slaby

    Peter Zijlstra
     
  • commit 27630532ef5ead28b98cfe28d8f95222ef91c2b7 upstream.

    Since commit d689fe222 (NOHZ: Check for nohz active instead of nohz
    enabled) the tick_nohz_switch_to_nohz() function returns early because it
    checks the tick_nohz_active flag, which cannot be set yet, because the
    function itself is what sets it.

    Undo the change in tick_nohz_switch_to_nohz().

    Signed-off-by: Viresh Kumar
    Cc: linaro-kernel@lists.linaro.org
    Cc: fweisbec@gmail.com
    Cc: Arvind.Chauhan@arm.com
    Cc: linaro-networking@linaro.org
    Cc: # 3.13+
    Link: http://lkml.kernel.org/r/40939c05f2d65d781b92b20302b02243d0654224.1397537987.git.viresh.kumar@linaro.org
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Jiri Slaby

    Viresh Kumar
     

23 Jun, 2014

2 commits

  • commit 9390675af0835ae1d654d33bfcf16096028550ad upstream.

    This reverts commit 282cf499f03ec1754b6c8c945c9674b02631fb0f.

    With the current implementation, the load average statistics of a sched
    entity change according to other activity on the CPU, even if this
    activity happens between the running windows of the sched entity and has
    no influence on the running duration of the task.

    When a task wakes up on the same CPU, we currently update last_runnable_update
    with the return of __synchronize_entity_decay without updating the
    runnable_avg_sum and runnable_avg_period accordingly. In fact, we have to sync
    the load_contrib of the se with the rq's blocked_load_contrib before removing
    it from the latter (with __synchronize_entity_decay) but we must keep
    last_runnable_update unchanged for updating runnable_avg_sum/period during the
    next update_entity_load_avg.

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra
    Reviewed-by: Ben Segall
    Cc: pjt@google.com
    Cc: alex.shi@linaro.org
    Link: http://lkml.kernel.org/r/1390376734-6800-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Jiri Slaby

    Vincent Guittot
     
  • [ Upstream commit 90f62cf30a78721641e08737bda787552428061e ]

    It is possible, by passing a netlink socket to a more privileged
    executable and then fooling that executable into writing to the socket
    data that happens to be a valid netlink message, to make that privileged
    executable do something it did not intend to do.

    To keep this from happening, replace bare capable and ns_capable calls
    with netlink_capable, netlink_net_capable and netlink_ns_capable calls,
    which act the same as the previous calls except that they verify that the
    opener of the socket had the desired permissions as well.

    Reported-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller
    Signed-off-by: Jiri Slaby

    Eric W. Biederman
     

20 Jun, 2014

2 commits

  • commit a3c54931199565930d6d84f4c3456f6440aefd41 upstream.

    Fixes an easy DoS and possible information disclosure.

    This does nothing about the broken state of x32 auditing.

    eparis: This only matters if the admin has enabled auditd and has
    specifically loaded audit rules. This bug has been around since before
    git. Wow...

    Signed-off-by: Andy Lutomirski
    Signed-off-by: Eric Paris
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Andy Lutomirski
     
  • commit 23adbe12ef7d3d4195e80800ab36b37bee28cd03 upstream.

    The kernel has no concept of capabilities with respect to inodes; inodes
    exist independently of namespaces. For example, inode_capable(inode,
    CAP_LINUX_IMMUTABLE) would be nonsense.

    This patch changes inode_capable to check for uid and gid mappings and
    renames it to capable_wrt_inode_uidgid, which should make it more
    obvious what it does.
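
    A hedged sketch of what the renamed helper checks, based on the
    description above (the exact upstream code may differ):

        bool capable_wrt_inode_uidgid(const struct inode *inode, int cap)
        {
                struct user_namespace *ns = current_user_ns();

                /* the capability only applies if the inode's owner and group
                 * are actually mapped into the caller's user namespace */
                return ns_capable(ns, cap) &&
                       kuid_has_mapping(ns, inode->i_uid) &&
                       kgid_has_mapping(ns, inode->i_gid);
        }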

    Fixes CVE-2014-4014.

    Cc: Theodore Ts'o
    Cc: Serge Hallyn
    Cc: "Eric W. Biederman"
    Cc: Dave Chinner
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jiri Slaby

    Andy Lutomirski