02 Jun, 2016

4 commits

  • commit 20878232c52329f92423d27a60e48b6a6389e0dd upstream.

    Systems show a minimal load average of 0.00, 0.01, 0.05 even when they
    have no load at all.

    Uptime and /proc/loadavg on all systems with kernels released during the
    last five years up until kernel version 4.6-rc5, show a 5- and 15-minute
    minimum loadavg of 0.01 and 0.05 respectively. This should be 0.00 on
    idle systems, but the way the kernel calculates this value prevents it
    from getting lower than the mentioned values.

    Likewise, but not as obviously noticeable, a fully loaded system with no
    processes waiting shows a maximum 1/5/15 loadavg of 1.00, 0.99, 0.95
    (multiplied by the number of cores).

    Once the (old) load becomes 93 or higher, it mathematically can never
    get lower than 93, even when the active (load) remains 0 forever.
    This results in the strange 0.00, 0.01, 0.05 uptime values on idle
    systems. Note: 93/2048 = 0.0454..., which rounds up to 0.05.

    It is not correct to add a 0.5 rounding (=1024/2048) here, since the
    result from this function is fed back into the next iteration again,
    so the result of that +0.5 rounding value then gets multiplied by
    (2048-2037), and then rounded again, so there is a virtual "ghost"
    load created, next to the old and active load terms.

    By changing the way the internally kept value is rounded, that internal
    value can now reach 0.00 on idle and 1.00 on full load. While the load is
    increasing, the internally kept load value is rounded up; while the load
    is decreasing, it is rounded down.
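
    A standalone sketch of the two rounding schemes (a userspace
    approximation using the kernel's fixed-point constants, FSHIFT = 11 and
    the 15-minute decay factor 2037; the function bodies are illustrative,
    not the patch itself) shows the old variant settling at 93/2048 while
    the new one reaches 0:

    #include <stdio.h>

    #define FSHIFT   11
    #define FIXED_1  (1UL << FSHIFT)    /* 2048: fixed-point 1.0 */
    #define EXP_15   2037               /* 15-minute decay factor */

    /* Pre-patch behaviour: unconditional +0.5 rounding every iteration. */
    static unsigned long calc_load_old(unsigned long load, unsigned long exp,
                                       unsigned long active)
    {
        load *= exp;
        load += active * (FIXED_1 - exp);
        load += 1UL << (FSHIFT - 1);
        return load >> FSHIFT;
    }

    /* Patched behaviour: round up only while the load is rising,
     * round down while it is falling. */
    static unsigned long calc_load_new(unsigned long load, unsigned long exp,
                                       unsigned long active)
    {
        unsigned long newload = load * exp + active * (FIXED_1 - exp);

        if (active >= load)
            newload += FIXED_1 - 1;
        return newload / FIXED_1;
    }

    int main(void)
    {
        unsigned long old_avg = FIXED_1, new_avg = FIXED_1; /* start at 1.00 */
        int i;

        for (i = 0; i < 2000; i++) {      /* fully idle: active == 0 */
            old_avg = calc_load_old(old_avg, EXP_15, 0);
            new_avg = calc_load_new(new_avg, EXP_15, 0);
        }
        printf("old: %lu/2048 = %.2f\n", old_avg, (double)old_avg / FIXED_1);
        printf("new: %lu/2048 = %.2f\n", new_avg, (double)new_avg / FIXED_1);
        return 0;
    }

    Built with any C compiler this prints 0.05 for the old scheme and 0.00
    for the new one, matching the behaviour described above.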

    The modified code was tested on nohz=off and nohz kernels. It was tested
    on vanilla kernel 4.6-rc5 and on centos 7.1 kernel 3.10.0-327. It was
    tested on single, dual, and octal cores system. It was tested on virtual
    hosts and bare hardware. No unwanted effects have been observed, and the
    problems that the patch intended to fix were indeed gone.

    Tested-by: Damien Wyart
    Signed-off-by: Vik Heyndrickx
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Doug Smythies
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 0f004f5a696a ("sched: Cure more NO_HZ load average woes")
    Link: http://lkml.kernel.org/r/e8d32bff-d544-7748-72b5-3c86cc71f09f@veribox.net
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Vik Heyndrickx
     
  • commit 59643d1535eb220668692a5359de22545af579f6 upstream.

    If the size passed to ring_buffer_resize() is greater than MAX_LONG - BUF_PAGE_SIZE
    then the DIV_ROUND_UP() will return zero.

    Here are the details:

    # echo 18014398509481980 > /sys/kernel/debug/tracing/buffer_size_kb

    tracing_entries_write() processes this and converts kb to bytes.

    18014398509481980 << 10 = 18446744073709547520

    and this is passed to ring_buffer_resize() as unsigned long size.

    size = DIV_ROUND_UP(size, BUF_PAGE_SIZE);

    Where DIV_ROUND_UP(a, b) is (a + b - 1)/b

    BUF_PAGE_SIZE is 4080 and here

    18446744073709547520 + 4080 - 1 = 18446744073709551599

    where 18446744073709551599 is still smaller than 2^64

    2^64 - 18446744073709551599 = 17

    But now 18446744073709551599 / 4080 = 4521260802379792

    and size = size * 4080 = 18446744073709551360

    This is checked to make sure it's still greater than 2 * 4080,
    which it is.

    Then we convert to the number of buffer pages needed.

    nr_page = DIV_ROUND_UP(size, BUF_PAGE_SIZE)

    but this time size is 18446744073709551360 and

    2^64 - (18446744073709551360 + 4080 - 1) = -3823

    Thus it overflows and the resulting number is less than 4080, which makes

    3823 / 4080 = 0

    and nr_pages is set to this. As we already checked against the minimum that
    nr_pages may be, this causes the logic to fail as well, and we crash the
    kernel.
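
    The arithmetic can be reproduced in a few lines of userspace C (a
    sketch assuming a 64-bit unsigned long; BUF_PAGE_SIZE and DIV_ROUND_UP
    are spelled out here rather than taken from kernel headers):

    #include <stdio.h>

    #define BUF_PAGE_SIZE       4080UL
    #define DIV_ROUND_UP(n, d)  (((n) + (d) - 1) / (d))

    int main(void)
    {
        /* value from the report, converted from KB to bytes */
        unsigned long size = 18014398509481980UL << 10;

        /* first rounding + multiply, as in the old two-step code */
        size = DIV_ROUND_UP(size, BUF_PAGE_SIZE) * BUF_PAGE_SIZE;

        /* second DIV_ROUND_UP: size + 4079 wraps past 2^64 to 3823 */
        unsigned long nr_pages = DIV_ROUND_UP(size, BUF_PAGE_SIZE);

        printf("size     = %lu\n", size);      /* 18446744073709551360 */
        printf("nr_pages = %lu\n", nr_pages);  /* 0 */
        return 0;
    }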

    There's no reason to have the two DIV_ROUND_UP() (that's just result of
    historical code changes), clean up the code and fix this bug.

    Fixes: 83f40318dab00 ("ring-buffer: Make removal of ring buffer pages atomic")
    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Red Hat)
     
  • commit 9b94a8fba501f38368aef6ac1b30e7335252a220 upstream.

    The size variable to change the ring buffer in ftrace is a long. The
    nr_pages used to update the ring buffer based on the size is an int. On 64-bit
    machines this can cause an overflow problem.

    For example, the following will cause the ring buffer to crash:

    # cd /sys/kernel/debug/tracing
    # echo 10 > buffer_size_kb
    # echo 8556384240 > buffer_size_kb

    Then you get the warning of:

    WARNING: CPU: 1 PID: 318 at kernel/trace/ring_buffer.c:1527 rb_update_pages+0x22f/0x260

    Which is:

    RB_WARN_ON(cpu_buffer, nr_removed);

    Note each ring buffer page holds 4080 bytes.

    This is because:

    1) 10 causes the ring buffer to have 3 pages.
    (10 KB requires 3 pages of 4080 bytes each to hold it)

    2) (2^31 / 2^10 + 1) * 4080 = 8556384240
    The value written into buffer_size_kb is shifted by 10 and then passed
    to ring_buffer_resize(). 8556384240 * 2^10 = 8761737461760

    3) The size passed to ring_buffer_resize() is then divided by BUF_PAGE_SIZE
    which is 4080. 8761737461760 / 4080 = 2147484672

    4) nr_pages is subtracted from the current nr_pages (3) and we get:
    2147484669. This value is saved in a signed integer nr_pages_to_update

    5) 2147484669 is greater than 2^31 but smaller than 2^32, so when stored
    in a signed int it becomes -2147482627 (the sketch after this list
    reproduces this truncation)

    6) As the value is a negative number, in update_pages_handler() it is
    negated and passed to rb_remove_pages() and 2147482627 pages will
    be removed, which is much larger than 3 and it causes the warning
    because not all the pages asked to be removed were removed.
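
    A minimal illustration of the truncation in steps 4 and 5 (assuming the
    usual 32-bit int and 64-bit long; the variable name follows the
    description above, not the kernel source):

    #include <stdio.h>

    int main(void)
    {
        unsigned long wanted = 2147484672UL;  /* pages needed, from step 3 */
        int nr_pages_to_update = wanted - 3;  /* steps 4/5: stored in an int,
                                                 typically wraps on LP64 */

        printf("nr_pages_to_update = %d\n", nr_pages_to_update); /* -2147482627 */
        return 0;
    }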

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=118001

    Fixes: 7a8e76a3829f1 ("tracing: unified trace buffer")
    Reported-by: Hao Qin
    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Red Hat)
     
  • commit 79c9ce57eb2d5f1497546a3946b4ae21b6fdc438 upstream.

    Jann reported that the ptrace_may_access() check in
    find_lively_task_by_vpid() is racy against exec().

    Specifically:

    perf_event_open()                  execve()

      ptrace_may_access()
                                         commit_creds()
      ...                                if (get_dumpable() != SUID_DUMP_USER)
                                           perf_event_exit_task();
      perf_install_in_context()

    would result in installing a counter across the creds boundary.

    Fix this by wrapping lots of perf_event_open() in cred_guard_mutex.
    This should be fine as perf_event_exit_task() is already called with
    cred_guard_mutex held, so all perf locks already nest inside it.

    Reported-by: Jann Horn
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Signed-off-by: Ingo Molnar
    Signed-off-by: He Kuang
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     

19 May, 2016

6 commits

  • commit f7c17d26f43d5cc1b7a6b896cd2fa24a079739b9 upstream.

    ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 16 at kernel/workqueue.c:4559 rebind_workers+0x1c0/0x1d0
    Modules linked in:
    CPU: 0 PID: 16 Comm: cpuhp/0 Not tainted 4.6.0-rc4+ #31
    Hardware name: IBM IBM System x3550 M4 Server -[7914IUW]-/00Y8603, BIOS -[D7E128FUS-1.40]- 07/23/2013
    0000000000000000 ffff881037babb58 ffffffff8139d885 0000000000000010
    0000000000000000 0000000000000000 0000000000000000 ffff881037babba8
    ffffffff8108505d ffff881037ba0000 000011cf3e7d6e60 0000000000000046
    Call Trace:
    dump_stack+0x89/0xd4
    __warn+0xfd/0x120
    warn_slowpath_null+0x1d/0x20
    rebind_workers+0x1c0/0x1d0
    workqueue_cpu_up_callback+0xf5/0x1d0
    notifier_call_chain+0x64/0x90
    ? trace_hardirqs_on_caller+0xf2/0x220
    ? notify_prepare+0x80/0x80
    __raw_notifier_call_chain+0xe/0x10
    __cpu_notify+0x35/0x50
    notify_down_prepare+0x5e/0x80
    ? notify_prepare+0x80/0x80
    cpuhp_invoke_callback+0x73/0x330
    ? __schedule+0x33e/0x8a0
    cpuhp_down_callbacks+0x51/0xc0
    cpuhp_thread_fun+0xc1/0xf0
    smpboot_thread_fn+0x159/0x2a0
    ? smpboot_create_threads+0x80/0x80
    kthread+0xef/0x110
    ? wait_for_completion+0xf0/0x120
    ? schedule_tail+0x35/0xf0
    ret_from_fork+0x22/0x50
    ? __init_kthread_worker+0x70/0x70
    ---[ end trace eb12ae47d2382d8f ]---
    notify_down_prepare: attempt to take down CPU 0 failed

    This bug can be reproduced by below config w/ nohz_full= all cpus:

    CONFIG_BOOTPARAM_HOTPLUG_CPU0=y
    CONFIG_DEBUG_HOTPLUG_CPU0=y
    CONFIG_NO_HZ_FULL=y

    As Thomas pointed out:

    | If a down prepare callback fails, then DOWN_FAILED is invoked for all
    | callbacks which have successfully executed DOWN_PREPARE.
    |
    | But, workqueue has actually two notifiers. One which handles
    | UP/DOWN_FAILED/ONLINE and one which handles DOWN_PREPARE.
    |
    | Now look at the priorities of those callbacks:
    |
    | CPU_PRI_WORKQUEUE_UP = 5
    | CPU_PRI_WORKQUEUE_DOWN = -5
    |
    | So the call order on DOWN_PREPARE is:
    |
    | CB 1
    | CB ...
    | CB workqueue_up() -> Ignores DOWN_PREPARE
    | CB ...
    | CB X ---> Fails
    |
    | So we call up to CB X with DOWN_FAILED
    |
    | CB 1
    | CB ...
    | CB workqueue_up() -> Handles DOWN_FAILED
    | CB ...
    | CB X-1
    |
    | So the problem is that the workqueue stuff handles DOWN_FAILED in the up
    | callback, while it should do it in the down callback. Which is not a good idea
    | either because it wants to be called early on rollback...
    |
    | Brilliant stuff, isn't it? The hotplug rework will solve this problem because
    | the callbacks become symetric, but for the existing mess, we need some
    | workaround in the workqueue code.

    The boot CPU handles housekeeping duty (unbound timers, workqueues,
    timekeeping, ...) on behalf of full dynticks CPUs. It must remain
    online when nohz full is enabled. Every notifier_block has a priority,
    giving the order:

    workqueue_cpu_up > tick_nohz_cpu_down > workqueue_cpu_down

    So the tick_nohz_cpu_down callback fails during the down-prepare of
    CPU 0, and the notifier_blocks behind tick_nohz_cpu_down are not called
    any more, which means the workers are never actually unbound. The
    hotplug state machine then falls back to undo and brings CPU 0 online
    again. The workers are rebound unconditionally even though they were
    never unbound, which triggers the warning in this process.

    This patch fixes it by checking for !DISASSOCIATED, so that workers
    that are still bound are not rebound.

    Cc: Tejun Heo
    Cc: Lai Jiangshan
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Frédéric Weisbecker
    Suggested-by: Lai Jiangshan
    Signed-off-by: Wanpeng Li
    Signed-off-by: Greg Kroah-Hartman

    Wanpeng Li
     
  • commit 9f448cd3cbcec8995935e60b27802ae56aac8cc0 upstream.

    When the PMU driver reports a truncated AUX record, it effectively means
    that there is no more usable room in the event's AUX buffer (even though
    there may still be some room, so that perf_aux_output_begin() doesn't take
    action). At this point the consumer still has to be woken up and the event
    has to be disabled, otherwise the event will just keep spinning between
    perf_aux_output_begin() and perf_aux_output_end() until its context gets
    unscheduled.

    Again, for cpu-wide events this means never, so once in this condition,
    they will be forever losing data.

    Fix this by disabling the event and waking up the consumer in case of a
    truncated AUX record.

    Reported-by: Markus Metzger
    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Borislav Petkov
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: vince@deater.net
    Link: http://lkml.kernel.org/r/1462886313-13660-3-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Alexander Shishkin
     
  • [ Upstream commit 6aff67c85c9e5a4bc99e5211c1bac547936626ca ]

    The commit 35578d798400 ("bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter")
    introduced a clever way to check bpf_helper <-> map_type compatibility.
    Later on, commit a43eec304259 ("bpf: introduce bpf_perf_event_output() helper") adjusted
    the logic and inadvertently broke it.
    Get rid of the clever bool compare and go back to a two-way check,
    from the map perspective and from the helper perspective.
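
    The idea of the two-way check can be sketched in a self-contained form
    (the enum values and helper names below are simplified stand-ins, not
    the kernel's verifier code):

    #include <stdio.h>

    enum map_type { MAP_PROG_ARRAY, MAP_PERF_EVENT_ARRAY, MAP_HASH };
    enum func_id  { FUNC_tail_call, FUNC_perf_event_read, FUNC_map_lookup };

    static int check_map_func_compat(enum map_type map, enum func_id func)
    {
        /* First direction: which helpers does this map type allow? */
        switch (map) {
        case MAP_PROG_ARRAY:
            if (func != FUNC_tail_call)
                return -1;
            break;
        case MAP_PERF_EVENT_ARRAY:
            if (func != FUNC_perf_event_read)
                return -1;
            break;
        default:
            break;
        }

        /* Second direction: which map type does this helper require? */
        switch (func) {
        case FUNC_tail_call:
            if (map != MAP_PROG_ARRAY)
                return -1;
            break;
        case FUNC_perf_event_read:
            if (map != MAP_PERF_EVENT_ARRAY)
                return -1;
            break;
        default:
            break;
        }
        return 0;
    }

    int main(void)
    {
        printf("%d\n", check_map_func_compat(MAP_HASH, FUNC_tail_call));       /* -1 */
        printf("%d\n", check_map_func_compat(MAP_PROG_ARRAY, FUNC_tail_call)); /*  0 */
        return 0;
    }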

    Fixes: a43eec304259 ("bpf: introduce bpf_perf_event_output() helper")
    Reported-by: Jann Horn
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Alexei Starovoitov
     
  • [ Upstream commit 92117d8443bc5afacc8d5ba82e541946310f106e ]

    On a system with >32 GB of physical memory and infinite RLIMIT_MEMLOCK,
    a malicious application may overflow the 32-bit bpf program refcnt.
    It's also possible to overflow the map refcnt on a 1 TB system.
    Impose a 32k hard limit, which means that the same bpf program or
    map cannot be shared by more than 32k processes.
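
    A self-contained sketch of the capped-refcount pattern (the 32768 limit
    mirrors the changelog above; the real fix operates on the refcounts
    inside the bpf program and map objects, this is only an illustration):

    #include <stdatomic.h>
    #include <stdio.h>

    #define MAX_REFCNT 32768   /* the 32k hard limit described above */

    /* Bump the counter; if the new value exceeds the cap, undo the
     * increment and report failure instead of letting the count creep
     * toward an eventual 32-bit overflow. */
    static int ref_get_checked(atomic_int *refcnt)
    {
        if (atomic_fetch_add(refcnt, 1) + 1 > MAX_REFCNT) {
            atomic_fetch_sub(refcnt, 1);
            return -1;   /* sharing attempt is refused */
        }
        return 0;
    }

    int main(void)
    {
        atomic_int refcnt = MAX_REFCNT;   /* already at the limit */

        printf("%d\n", ref_get_checked(&refcnt));   /* -1: refused */
        return 0;
    }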

    Fixes: 1be7f75d1668 ("bpf: enable non-root eBPF programs")
    Reported-by: Jann Horn
    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Alexei Starovoitov
     
  • [ Upstream commit 8358b02bf67d3a5d8a825070e1aa73f25fb2e4c7 ]

    When bpf(BPF_PROG_LOAD, ...) was invoked with a BPF program whose bytecode
    references a non-map file descriptor as a map file descriptor, the error
    handling code called fdput() twice instead of once (in __bpf_map_get() and
    in replace_map_fd_with_map_ptr()). If the file descriptor table of the
    current task is shared, this causes f_count to be decremented too much,
    allowing the struct file to be freed while it is still in use
    (use-after-free). This can be exploited to gain root privileges by an
    unprivileged user.

    This bug was introduced in
    commit 0246e64d9a5f ("bpf: handle pseudo BPF_LD_IMM64 insn"), but is only
    exploitable since
    commit 1be7f75d1668 ("bpf: enable non-root eBPF programs") because
    previously, CAP_SYS_ADMIN was required to reach the vulnerable code.

    (posted publicly according to request by maintainer)

    Signed-off-by: Jann Horn
    Signed-off-by: Linus Torvalds
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     
  • [ Upstream commit d82bccc69041a51f7b7b9b4a36db0772f4cdba21 ]

    verifier must check for reserved size bits in instruction opcode and
    reject BPF_LD | BPF_ABS | BPF_DW and BPF_LD | BPF_IND | BPF_DW instructions,
    otherwise interpreter will WARN_RATELIMIT on them during execution.

    Fixes: ddd872bc3098 ("bpf: verifier: add checks for BPF_ABS | BPF_IND instructions")
    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Alexei Starovoitov
     

11 May, 2016

1 commit

  • commit 854145e0a8e9a05f7366d240e2f99d9c1ca6d6dd upstream.

    Currently, register functions for events are called directly through
    the 'reg' field of the event class, without any check, when setting up
    triggers.

    Triggers for events that don't support registration through
    debugfs (events under events/ftrace are for trace-cmd to
    read the event format, and most of them don't have a register
    function except events/ftrace/functionx) can't be enabled
    at all, and an oops is hit when setting up a trigger
    for those events, so simply not creating them is an easy way
    to avoid the oops.

    Link: http://lkml.kernel.org/r/1462275274-3911-1-git-send-email-chuhu@redhat.com

    Fixes: 85f2b08268c01 ("tracing: Add basic event trigger framework")
    Signed-off-by: Chunyu Hu
    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    Chunyu Hu
     

05 May, 2016

7 commits

  • commit 920c720aa5aa3900a7f1689228fdfc2580a91e7e upstream.

    Similar to commit b4b29f94856a ("locking/osq: Fix ordering of node
    initialisation in osq_lock") the use of xchg_acquire() is
    fundamentally broken with MCS like constructs.

    Furthermore, it turns out we rely on the global transitivity of this
    operation because the unlock path observes the pointer with a
    READ_ONCE(), not an smp_load_acquire().

    This is non-critical because the MCS code isn't actually used and
    mostly serves as documentation, a stepping stone to the more complex
    things we've built on top of the idea.

    Reported-by: Andrea Parri
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Fixes: 3552a07a9c4a ("locking/mcs: Use acquire/release semantics")
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     
  • commit 8bb5ef79bc0f4016ecf79e8dce6096a3c63603e4 upstream.

    There are three subsystem callbacks in css shutdown path -
    css_offline(), css_released() and css_free(). Except for
    css_released(), cgroup core didn't guarantee the order of invocation.
    css_offline() or css_free() could be called on a parent css before its
    children. This behavior is unexpected and led to bugs in cpu and
    memory controller.

    The previous patch updated ordering for css_offline() which fixes the
    cpu controller issue. While there currently isn't a known bug caused
    by misordering of css_free() invocations, let's fix it too for
    consistency.

    css_free() ordering can be trivially fixed by moving putting of the
    parent css below css_free() invocation.

    Signed-off-by: Tejun Heo
    Cc: Peter Zijlstra
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • commit 5cf1cacb49aee39c3e02ae87068fc3c6430659b0 upstream.

    Since e93ad19d0564 ("cpuset: make mm migration asynchronous"), cpuset
    kicks off asynchronous NUMA node migration if necessary during task
    migration and flushes it from cpuset_post_attach_flush() which is
    called at the end of __cgroup_procs_write(). This is to avoid
    performing migration with cgroup_threadgroup_rwsem write-locked which
    can lead to deadlock through dependency on kworker creation.

    memcg has a similar issue with charge moving, so let's convert it to
    an official callback rather than the current one-off cpuset specific
    function. This patch adds cgroup_subsys->post_attach callback and
    makes cpuset register cpuset_post_attach_flush() as its ->post_attach.

    The conversion is mostly one-to-one except that the new callback is
    called under cgroup_mutex. This is to guarantee that no other
    migration operations are started before ->post_attach callbacks are
    finished. cgroup_mutex is one of the outermost mutexes in the system
    and has never been, and shouldn't be, a problem. We can add specialized
    synchronization around __cgroup_procs_write() but I don't think
    there's any noticeable benefit.

    Signed-off-by: Tejun Heo
    Cc: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • commit 346c09f80459a3ad97df1816d6d606169a51001a upstream.

    The bug in a workqueue leads to a stalled IO request in MQ ctx->rq_list
    with the following backtrace:

    [ 601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds.
    [ 601.347574] Tainted: G O 4.4.5-1-storage+ #6
    [ 601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 601.348142] kworker/u129:5 D ffff880803077988 0 1636 2 0x00000000
    [ 601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server]
    [ 601.348999] ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000
    [ 601.349662] ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0
    [ 601.350333] ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38
    [ 601.350965] Call Trace:
    [ 601.351203] [] ? bit_wait+0x60/0x60
    [ 601.351444] [] schedule+0x35/0x80
    [ 601.351709] [] schedule_timeout+0x192/0x230
    [ 601.351958] [] ? blk_flush_plug_list+0xc7/0x220
    [ 601.352208] [] ? ktime_get+0x37/0xa0
    [ 601.352446] [] ? bit_wait+0x60/0x60
    [ 601.352688] [] io_schedule_timeout+0xa4/0x110
    [ 601.352951] [] ? _raw_spin_unlock_irqrestore+0xe/0x10
    [ 601.353196] [] bit_wait_io+0x1b/0x70
    [ 601.353440] [] __wait_on_bit+0x5d/0x90
    [ 601.353689] [] wait_on_page_bit+0xc0/0xd0
    [ 601.353958] [] ? autoremove_wake_function+0x40/0x40
    [ 601.354200] [] __filemap_fdatawait_range+0xe4/0x140
    [ 601.354441] [] filemap_fdatawait_range+0x14/0x30
    [ 601.354688] [] filemap_write_and_wait_range+0x3f/0x70
    [ 601.354932] [] blkdev_fsync+0x1b/0x50
    [ 601.355193] [] vfs_fsync_range+0x49/0xa0
    [ 601.355432] [] blkdev_write_iter+0xca/0x100
    [ 601.355679] [] __vfs_write+0xaa/0xe0
    [ 601.355925] [] vfs_write+0xa9/0x1a0
    [ 601.356164] [] kernel_write+0x38/0x50

    The underlying device is a null_blk, with default parameters:

    queue_mode = MQ
    submit_queues = 1

    Verification that nullb0 has something inflight:

    root@pserver8:~# cat /sys/block/nullb0/inflight
    0 1
    root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \;
    ...
    /sys/block/nullb0/mq/0/cpu2/rq_list
    CTX pending:
    ffff8838038e2400
    ...

    During debugging it became clear that the stalled request is always
    inserted into the rq_list from the following path:

    save_stack_trace_tsk + 34
    blk_mq_insert_requests + 231
    blk_mq_flush_plug_list + 281
    blk_flush_plug_list + 199
    wait_on_page_bit + 192
    __filemap_fdatawait_range + 228
    filemap_fdatawait_range + 20
    filemap_write_and_wait_range + 63
    blkdev_fsync + 27
    vfs_fsync_range + 73
    blkdev_write_iter + 202
    __vfs_write + 170
    vfs_write + 169
    kernel_write + 56

    So blk_flush_plug_list() was called with from_schedule == true.

    If from_schedule is true, that means that finally blk_mq_insert_requests()
    offloads execution of __blk_mq_run_hw_queue() and uses kblockd workqueue,
    i.e. it calls kblockd_schedule_delayed_work_on().

    That means, that we race with another CPU, which is about to execute
    __blk_mq_run_hw_queue() work.

    Further debugging shows the following traces from different CPUs:

    CPU#0                                    CPU#1
    ----------------------------------       -------------------------------
    request A inserted
    STORE hctx->ctx_map[0] bit marked
    kblockd_schedule...() returns 1

                                             request B inserted
                                             STORE hctx->ctx_map[1] bit marked
                                             kblockd_schedule...() returns 0
    *** WORK PENDING bit is cleared ***
    flush_busy_ctxs() is executed, but
    bit 1, set by CPU#1, is not observed

    As a result request B pended forever.

    This behaviour can be explained by speculative LOAD of hctx->ctx_map on
    CPU#0, which is reordered with clear of PENDING bit and executed _before_
    actual STORE of bit 1 on CPU#1.

    The proper fix is an explicit full barrier, which guarantees
    that clear of PENDING bit is to be executed before all possible
    speculative LOADS or STORES inside actual work function.
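
    A userspace sketch of the required ordering, using C11 atomics in place
    of the kernel primitives (the names are illustrative stand-ins for the
    workqueue internals, not the actual fix):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    static atomic_bool work_pending = true;
    static atomic_ulong ctx_map;

    static void flush_busy_ctxs(void)
    {
        /* the loads in here must not be speculated before the clear below */
        printf("observed ctx_map = %lx\n", atomic_load(&ctx_map));
    }

    static void run_work(void (*fn)(void))
    {
        atomic_store_explicit(&work_pending, false, memory_order_relaxed);

        /* Full barrier: pairs with the barrier on the CPU that sets a
         * ctx_map bit and then re-tests the pending flag. */
        atomic_thread_fence(memory_order_seq_cst);

        fn();
    }

    int main(void)
    {
        run_work(flush_busy_ctxs);
        return 0;
    }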

    Signed-off-by: Roman Pen
    Cc: Gioh Kim
    Cc: Michael Wang
    Cc: Tejun Heo
    Cc: Jens Axboe
    Cc: linux-block@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Roman Pen
     
  • commit fe1bce9e2107ba3a8faffe572483b6974201a0e6 upstream.

    Otherwise an incoming waker on the dest hash bucket can miss
    the waiter adding itself to the plist during the lockless
    check optimization (small window but still the correct way
    of doing this); similarly to the decrement counterpart.

    Suggested-by: Peter Zijlstra
    Signed-off-by: Davidlohr Bueso
    Cc: Davidlohr Bueso
    Cc: bigeasy@linutronix.de
    Cc: dvhart@infradead.org
    Link: http://lkml.kernel.org/r/1461208164-29150-1-git-send-email-dave@stgolabs.net
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Davidlohr Bueso
     
  • commit 89e9e66ba1b3bde9d8ea90566c2aee20697ad681 upstream.

    If userspace calls UNLOCK_PI unconditionally without trying the TID -> 0
    transition in user space first then the user space value might not have the
    waiters bit set. This opens the following race:

    CPU0                              CPU1
    uval = get_user(futex)
                                      lock(hb)
    lock(hb)
                                      futex |= FUTEX_WAITERS
                                      ....
                                      unlock(hb)

    cmpxchg(futex, uval, newval)

    So the cmpxchg fails and returns -EINVAL to user space, which is wrong because
    the futex value is valid.

    To handle this (yes, yet another) corner case gracefully, check for a flag
    change and retry.

    [ tglx: Massaged changelog and slightly reworked implementation ]

    Fixes: ccf9e6a80d9e ("futex: Make unlock_pi more robust")
    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Davidlohr Bueso
    Cc: Darren Hart
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1460723739-5195-1-git-send-email-bigeasy@linutronix.de
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Sebastian Andrzej Siewior
     
  • commit 2f5177f0fd7e531b26d54633be62d1d4cb94621c upstream.

    The CPU controller hasn't kept up with the various changes in the whole
    cgroup initialization / destruction sequence, and commit:

    2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")

    caused it to explode.

    The reason for this is that zombies do not inhibit css_offline() from
    being called, but do stall css_released(). Now we tear down the cfs_rq
    structures on css_offline() but zombies can run after that, leading to
    use-after-free issues.

    The solution is to move the tear-down to css_released(), which
    guarantees nobody (including no zombies) is still using our cgroup.

    Furthermore, a few simple cleanups are possible too. There doesn't
    appear to be any point in us using css_online() (anymore?), so fold that
    into css_alloc().

    And since cgroup code guarantees an RCU grace period between
    css_released() and css_free() we can forgo using call_rcu() and free the
    stuff immediately.

    Suggested-by: Tejun Heo
    Reported-by: Kazuki Yamaguchi
    Reported-by: Niklas Cassel
    Tested-by: Niklas Cassel
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Tejun Heo
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")
    Link: http://lkml.kernel.org/r/20160316152245.GY6344@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     

20 Apr, 2016

3 commits

  • commit 28a967c3a2f99fa3b5f762f25cb2a319d933571b upstream.

    Because event_sched_out() checks event->pending_disable _before_
    actually disabling the event, it can happen that the event fires after
    it checks but before it gets disabled.

    This would leave event->pending_disable set and the queued irq_work
    will try and process it.

    However, if the event trigger was during schedule(), the event might
    have been de-scheduled by the time the irq_work runs, and
    perf_event_disable_local() will fail.

    Fix this by checking event->pending_disable _after_ we call
    event->pmu->del(). This depends on the latter being a compiler
    barrier, such that the compiler does not lift the load and re-creates
    the problem.

    Tested-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dvyukov@google.com
    Cc: eranian@google.com
    Cc: oleg@redhat.com
    Cc: panand@redhat.com
    Cc: sasha.levin@oracle.com
    Cc: vince@deater.net
    Link: http://lkml.kernel.org/r/20160224174948.040469884@infradead.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     
  • commit 130056275ade730e7a79c110212c8815202773ee upstream.

    In case of: err_file: fput(event_file), we'll end up calling
    perf_release() which in turn will free the event.

    Do not then free the event _again_.

    Tested-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dvyukov@google.com
    Cc: eranian@google.com
    Cc: oleg@redhat.com
    Cc: panand@redhat.com
    Cc: sasha.levin@oracle.com
    Cc: vince@deater.net
    Link: http://lkml.kernel.org/r/20160224174947.697350349@infradead.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     
  • [ Upstream commit cdc4e47da8f4c32eeb6b2061a8a834f4362a12b7 ]

    Lots of places in the kernel use memcpy(buf, comm, TASK_COMM_LEN); but
    the result is typically passed to print("%s", buf) and extra bytes
    after the terminating zero don't cause any harm.
    In bpf the result of bpf_get_current_comm() is used as part of a
    map key and was causing spurious hash map mismatches.
    Use strlcpy() to guarantee a zero-terminated string.
    The bpf verifier checks that the output buffer is zero-initialized,
    so even for short task names the output buffer doesn't contain junk bytes.
    Note it's not a security concern, since kprobe+bpf is root only.
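
    A small userspace illustration of why the trailing bytes matter once the
    buffer is used as a fixed-size map key (snprintf() stands in for
    strlcpy(), which is not available in every libc):

    #include <stdio.h>
    #include <string.h>

    #define TASK_COMM_LEN 16

    int main(void)
    {
        char key_a[TASK_COMM_LEN] = { 0 };
        char key_b[TASK_COMM_LEN] = { 0 };

        /* memcpy keeps whatever sits after the terminating zero */
        memcpy(key_a, "bash\0junkjunkjun", TASK_COMM_LEN);
        /* bounded, NUL-terminated copy into a zeroed buffer */
        snprintf(key_b, sizeof(key_b), "%s", "bash");

        printf("string equal: %d\n", strcmp(key_a, key_b) == 0);                 /* 1 */
        printf("map key equal: %d\n", memcmp(key_a, key_b, TASK_COMM_LEN) == 0); /* 0 */
        return 0;
    }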

    Fixes: ffeedafbf023 ("bpf: introduce current->pid, tgid, uid, gid, comm accessors")
    Reported-by: Tobias Waldekranz
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Alexei Starovoitov
     

13 Apr, 2016

10 commits

  • commit e9532e69b8d1d1284e8ecf8d2586de34aec61244 upstream.

    On CPU hotplug the steal time accounting can keep a stale rq->prev_steal_time
    value over CPU down and up. So after the CPU comes up again the delta
    calculation in steal_account_process_tick() wrecks itself due to the
    unsigned math:

    u64 steal = paravirt_steal_clock(smp_processor_id());

    steal -= this_rq()->prev_steal_time;

    So if steal is smaller than rq->prev_steal_time we end up with an insanely large
    value which then gets added to rq->prev_steal_time, resulting in a permanent
    wreckage of the accounting. As a consequence the per CPU stats in /proc/stat
    become stale.
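
    The wrap-around is easy to reproduce in userspace (the numbers below are
    made up for illustration only):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* stale value kept across the CPU down/up cycle */
        uint64_t prev_steal_time = 5000000000ULL;
        /* freshly (re)started steal clock after the CPU came back */
        uint64_t steal = 1000000ULL;

        steal -= prev_steal_time;   /* unsigned math: wraps to a huge value */
        printf("steal delta = %llu ns\n", (unsigned long long)steal);
        return 0;
    }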

    Nice trick to tell the world how idle the system is (100%) while the CPU is
    100% busy running tasks. Though we prefer realistic numbers.

    None of the accounting values which use a previous value to account for
    fractions is reset at CPU hotplug time. update_rq_clock_task() has a sanity
    check for prev_irq_time and prev_steal_time_rq, but that sanity check solely
    deals with clock warps and limits the /proc/stat visible wreckage. The
    prev_time values are still wrong.

    Solution is simple: Reset rq->prev_*_time when the CPU is plugged in again.

    Signed-off-by: Thomas Gleixner
    Acked-by: Rik van Riel
    Cc: Frederic Weisbecker
    Cc: Glauber Costa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Fixes: commit 095c0aa83e52 "sched: adjust scheduler cpu power for stolen time"
    Fixes: commit aa483808516c "sched: Remove irq time from available CPU power"
    Fixes: commit e6e6685accfa "KVM guest: Steal time accounting"
    Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1603041539490.3686@nanos
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 276142730c39c9839465a36a90e5674a8c34e839 upstream.

    When suspending to RAM, waking up and later suspending to disk,
    we gratuitously runtime resume devices after the thaw phase.
    This does not occur if we always suspend to RAM or always to disk.

    pm_complete_with_resume_check(), which gets called from
    pci_pm_complete() among others, schedules a runtime resume
    if PM_SUSPEND_FLAG_FW_RESUME is set. The flag is set during
    a suspend-to-RAM cycle. It is cleared at the beginning of
    the suspend-to-RAM cycle but not afterwards and it is not
    cleared during a suspend-to-disk cycle at all. Fix it.

    Fixes: ef25ba047601 (PM / sleep: Add flags to indicate platform firmware involvement)
    Signed-off-by: Lukas Wunner
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Greg Kroah-Hartman

    Lukas Wunner
     
  • commit 3debb0a9ddb16526de8b456491b7db60114f7b5e upstream.

    The trace_printk() code will allocate extra buffers if the compile detects
    that a trace_printk() is used. To do this, the format of the trace_printk()
    is saved to the __trace_printk_fmt section, and if that section is bigger
    than zero, the buffers are allocated (along with a message that this has
    happened).

    If trace_printk() uses a format that is not a constant, and thus something
    not guaranteed to be around when the print happens, the compiler optimizes
    the fmt out, as it is not used, and the __trace_printk_fmt section is not
    filled. This means the kernel will not allocate the special buffers needed
    for the trace_printk() and the trace_printk() will not write anything to the
    tracing buffer.

    Adding a "__used" to the variable in the __trace_printk_fmt section will
    keep it around, even though it is set to NULL. This will keep the string
    from being printed in the debugfs/tracing/printk_formats section as it is
    not needed.
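
    A minimal standalone example of the attribute (GCC syntax written out;
    the kernel spells it through its __used macro):

    #include <stdio.h>

    /* Kept in the object even though nothing references it, thanks to the
     * "used" attribute; without it, the compiler may discard the variable
     * and the section stays empty -- the failure mode described above. */
    static const char *fmt
        __attribute__((used, section("__trace_printk_fmt"))) = NULL;

    int main(void)
    {
        puts("inspect with: objdump -t ./a.out | grep fmt");
        return 0;
    }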

    Reported-by: Vlastimil Babka
    Fixes: 07d777fe8c398 "tracing: Add percpu buffers for trace_printk()"
    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Red Hat)
     
  • commit a29054d9478d0435ab01b7544da4f674ab13f533 upstream.

    If tracing contains data and the trace_pipe file is read with sendfile(),
    then it can trigger a NULL pointer dereference and various BUG_ON within the
    VM code.

    There's a patch to fix this in the splice_to_pipe() code, but it's also a
    good idea to not let that happen from trace_pipe either.

    Link: http://lkml.kernel.org/r/1457641146-9068-1-git-send-email-rabin@rab.in

    Reported-by: Rabin Vincent
    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Red Hat)
     
  • commit cb86e05390debcc084cfdb0a71ed4c5dbbec517d upstream.

    Joel Fernandes reported that the function tracing of preempt disabled
    sections was not being reported when running either the preemptirqsoff or
    preemptoff tracers. This was due to the fact that the function tracer
    callback for those tracers checked if irqs were disabled before tracing. But
    this fails when we want to trace preempt off locations as well.

    Joel explained that he wanted to see functions where interrupts are enabled
    but preemption is disabled. The expected output he wanted was:

    -2265  1d.h1  3419us : preempt_count_sub
    -2265  1d..1  3419us : __do_softirq
    -2265  1d..1  3419us : msecs_to_jiffies
    -2265  1d..1  3420us : irqtime_account_irq
    -2265  1d..1  3420us : __local_bh_disable_ip
    -2265  1..s1  3421us : run_timer_softirq
    -2265  1..s1  3421us : hrtimer_run_pending
    -2265  1..s1  3421us : _raw_spin_lock_irq
    -2265  1d.s1  3422us : preempt_count_add
    -2265  1d.s2  3422us : _raw_spin_unlock_irq
    -2265  1..s2  3422us : preempt_count_sub
    -2265  1..s1  3423us : rcu_bh_qs
    -2265  1d.s1  3423us : irqtime_account_irq
    -2265  1d.s1  3423us : __local_bh_enable

    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Red Hat)
     
  • commit 378c6520e7d29280f400ef2ceaf155c86f05a71a upstream.

    This commit fixes the following security hole affecting systems where
    all of the following conditions are fulfilled:

    - The fs.suid_dumpable sysctl is set to 2.
    - The kernel.core_pattern sysctl's value starts with "/". (Systems
    where kernel.core_pattern starts with "|/" are not affected.)
    - Unprivileged user namespace creation is permitted. (This is
    true on Linux >=3.8, but some distributions disallow it by
    default using a distro patch.)

    Under these conditions, if a program executes under secure exec rules,
    causing it to run with the SUID_DUMP_ROOT flag, then unshares its user
    namespace, changes its root directory and crashes, the coredump will be
    written using fsuid=0 and a path derived from kernel.core_pattern - but
    this path is interpreted relative to the root directory of the process,
    allowing the attacker to control where a coredump will be written with
    root privileges.

    To fix the security issue, always interpret core_pattern for dumps that
    are written under SUID_DUMP_ROOT relative to the root directory of init.

    Signed-off-by: Jann Horn
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Andy Lutomirski
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     
  • commit 2b021cbf3cb6208f0d40fd2f1869f237934340ed upstream.

    Before 2e91fa7f6d45 ("cgroup: keep zombies associated with their
    original cgroups"), all dead tasks were associated with init_css_set.
    If a zombie task is requested for migration, while migration prep
    operations would still be performed on init_css_set, the actual
    migration would ignore zombie tasks. As init_css_set is always valid,
    this worked fine.

    However, after 2e91fa7f6d45, zombie tasks stay with the css_set they were
    associated with at the time of death. Let's say a task T is associated
    with cgroup A on hierarchy H-1 and cgroup B on hierarchy H-2. After T
    becomes a zombie, it would still remain associated with A and B. If A
    only contains zombie tasks, it can be removed. On removal, A gets
    marked offline but stays pinned until all zombies are drained. At
    this point, if migration is initiated on T to a cgroup C on hierarchy
    H-2, migration path would try to prepare T's css_set for migration and
    trigger the following.

    WARNING: CPU: 0 PID: 1576 at kernel/cgroup.c:474 cgroup_get+0x121/0x160()
    CPU: 0 PID: 1576 Comm: bash Not tainted 4.4.0-work+ #289
    ...
    Call Trace:
    [] dump_stack+0x4e/0x82
    [] warn_slowpath_common+0x78/0xb0
    [] warn_slowpath_null+0x15/0x20
    [] cgroup_get+0x121/0x160
    [] link_css_set+0x7b/0x90
    [] find_css_set+0x3bc/0x5e0
    [] cgroup_migrate_prepare_dst+0x89/0x1f0
    [] cgroup_attach_task+0x157/0x230
    [] __cgroup_procs_write+0x2b7/0x470
    [] cgroup_tasks_write+0xc/0x10
    [] cgroup_file_write+0x30/0x1b0
    [] kernfs_fop_write+0x13c/0x180
    [] __vfs_write+0x23/0xe0
    [] vfs_write+0xa4/0x1a0
    [] SyS_write+0x44/0xa0
    [] entry_SYSCALL_64_fastpath+0x12/0x6f

    It doesn't make sense to prepare migration for css_sets pointing to
    dead cgroups as they are guaranteed to contain only zombies which are
    ignored later during migration. This patch makes cgroup destruction
    path mark all affected css_sets as dead and updates the migration path
    to ignore them during preparation.

    Signed-off-by: Tejun Heo
    Fixes: 2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups")
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • commit a1ee1932aa6bea0bb074f5e3ced112664e4637ed upstream.

    While working on a script to restore all sysctl params before a series of
    tests I found that writing any value into the
    /proc/sys/kernel/{nmi_watchdog,soft_watchdog,watchdog,watchdog_thresh}
    causes them to call proc_watchdog_update().

    NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
    NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
    NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
    NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.

    There doesn't appear to be a reason for doing this work every time a write
    occurs, so only do it when the values change.

    Signed-off-by: Josh Hunt
    Acked-by: Don Zickus
    Reviewed-by: Aaron Tomlin
    Cc: Ulrich Obergfell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Joshua Hunt
     
  • commit f9c904b7613b8b4c85b10cd6b33ad41b2843fa9d upstream.

    The callers of steal_account_process_tick() expect it to return
    whether a jiffy should be considered stolen or not.

    Currently the return value of steal_account_process_tick() is in
    units of cputime, which vary between either jiffies or nsecs
    depending on CONFIG_VIRT_CPU_ACCOUNTING_GEN.

    If cputime has nsecs granularity and there is a tiny amount of
    stolen time (a few nsecs, say) then we will consider the entire
    tick stolen and will not account the tick on user/system/idle,
    causing /proc/stats to show invalid data.

    The fix is to change steal_account_process_tick() to accumulate
    the stolen time and only account it once it's worth a jiffy.

    (Thanks to Frederic Weisbecker for suggestions to fix a bug in my
    first version of the patch.)
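
    A sketch of the accumulate-then-account idea under an assumed 10 ms
    jiffy (illustrative names, not the kernel implementation):

    #include <stdio.h>

    #define NSEC_PER_JIFFY 10000000ULL   /* 10 ms tick, for illustration */

    static unsigned long long steal_carry;

    /* Carry small steal deltas over until a full jiffy's worth has piled
     * up, instead of declaring the whole tick stolen. */
    static unsigned int account_steal(unsigned long long steal_delta_ns)
    {
        unsigned int jiffies_stolen;

        steal_carry += steal_delta_ns;
        jiffies_stolen = steal_carry / NSEC_PER_JIFFY;
        steal_carry -= (unsigned long long)jiffies_stolen * NSEC_PER_JIFFY;
        return jiffies_stolen;
    }

    int main(void)
    {
        printf("%u\n", account_steal(3000));      /* a few us: 0 jiffies stolen */
        printf("%u\n", account_steal(25000000));  /* 25 ms: 2 jiffies stolen */
        return 0;
    }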

    Signed-off-by: Chris Friesen
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Thomas Gleixner
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/56DBBDB8.40305@mail.usask.ca
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Chris Friesen
     
  • commit 927a5570855836e5d5859a80ce7e91e963545e8f upstream.

    The error path in perf_event_open() is such that asking for a sampling
    event on a PMU that doesn't generate interrupts will end up in dropping
    the perf_sched_count even though it hasn't been incremented for this
    event yet.

    Given a sufficient amount of these calls, we'll end up disabling
    scheduler's jump label even though we'd still have active events in the
    system, thereby facilitating the arrival of the infernal regions upon us.

    I'm fixing this by moving account_event() inside perf_event_alloc().

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: vince@deater.net
    Link: http://lkml.kernel.org/r/1456917854-29427-1-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Alexander Shishkin
     

10 Mar, 2016

2 commits

  • commit 8244062ef1e54502ef55f54cced659913f244c3e upstream.

    For CONFIG_KALLSYMS, we keep two symbol tables and two string tables.
    There's one full copy, marked SHF_ALLOC and laid out at the end of the
    module's init section. There's also a cut-down version that only
    contains core symbols and strings, and lives in the module's core
    section.

    After module init (and before we free the module memory), we switch
    the mod->symtab, mod->num_symtab and mod->strtab to point to the core
    versions. We do this under the module_mutex.

    However, kallsyms doesn't take the module_mutex: it uses
    preempt_disable() and rcu tricks to walk through the modules, because
    it's used in the oops path. It's also used in /proc/kallsyms.
    There's nothing atomic about the change of these variables, so we can
    get the old (larger!) num_symtab and the new symtab pointer; in fact
    this is what I saw when trying to reproduce.

    By grouping these variables together, we can use a
    carefully-dereferenced pointer to ensure we always get one or the
    other (the free of the module init section is already done in an RCU
    callback, so that's safe). We allocate the init one at the end of the
    module init section, and keep the core one inside the struct module
    itself (it could also have been allocated at the end of the module
    core, but that's probably overkill).
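
    A userspace analogue of the "group the fields and swap one pointer"
    idea, using C11 atomics in place of RCU (struct and field names are
    illustrative, not the kernel's):

    #include <stdatomic.h>
    #include <stddef.h>
    #include <stdio.h>

    struct kallsyms_view {
        const char *strtab;
        size_t num_symtab;
        /* ... symtab pointer etc. ... */
    };

    static struct kallsyms_view init_view = { "init strings", 100 };
    static struct kallsyms_view core_view = { "core strings", 40 };

    static _Atomic(struct kallsyms_view *) kallsyms = &init_view;

    int main(void)
    {
        /* writer (module init complete): publish the core copy atomically */
        atomic_store(&kallsyms, &core_view);

        /* reader: one dereference yields a matching num_symtab/strtab pair */
        struct kallsyms_view *v = atomic_load(&kallsyms);
        printf("%zu symbols, strtab=%s\n", v->num_symtab, v->strtab);
        return 0;
    }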

    [ Rebased for 4.4-stable and older, because the following changes aren't
    in the older trees:
    - e0224418516b4d8a6c2160574bac18447c354ef0: adds arg to is_core_symbol
    - 7523e4dc5057e157212b4741abd6256e03404cf1: module_init/module_core/init_size/core_size
    become init_layout.base/core_layout.base/init_layout.size/core_layout.size.
    ]

    Reported-by: Weilong Chen
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=111541
    Cc: stable@kernel.org
    Signed-off-by: Rusty Russell
    Signed-off-by: Greg Kroah-Hartman

    Rusty Russell
     
  • commit e57cbaf0eb006eaa207395f3bfd7ce52c1b5539c upstream.

    Commit 9f61668073a8d "tracing: Allow triggers to filter for CPU ids and
    process names" added a 'comm' filter that will filter events based on the
    current tasks struct 'comm'. But this now hides the ability to filter events
    that have a 'comm' field too. For example, sched_migrate_task trace event.
    That has a 'comm' field of the task to be migrated.

    echo 'comm == "bash"' > events/sched_migrate_task/filter

    will now filter all sched_migrate_task events for tasks named "bash" that
    migrate other tasks (in interrupt context), instead of seeing when "bash"
    itself gets migrated.

    This fix requires a couple of changes.

    1) Change the look up order for filter predicates to look at the events
    fields before looking at the generic filters.

    2) Instead of basing the filter function off of the "comm" name, have the
    generic "comm" filter have its own filter_type (FILTER_COMM). Test
    against the type instead of the name to assign the filter function.

    3) Add a new "COMM" filter that works just like "comm" but will filter based
    on the current task, even if the trace event contains a "comm" field.

    Do the same for "cpu" field, adding a FILTER_CPU and a filter "CPU".

    Fixes: 9f61668073a8d "tracing: Allow triggers to filter for CPU ids and process names"
    Reported-by: Matt Fleming
    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Red Hat)
     

04 Mar, 2016

7 commits

  • commit 59ceeaaf355fa0fb16558ef7c24413c804932ada upstream.

    In __request_region, if a conflict with a BUSY and MUXED resource is
    detected, then the caller goes to sleep and waits for the resource to be
    released. A pointer on the conflicting resource is kept. At wake-up
    this pointer is used as a parent to retry to request the region.

    A first problem is that this pointer might well be invalid (if for
    example the conflicting resource have already been freed). Another
    problem is that the next call to __request_region() fails to detect a
    remaining conflict. The previously conflicting resource is passed as a
    parameter and __request_region() will look for a conflict among the
    children of this resource and not at the resource itself. It is likely
    to succeed anyway, even if there is still a conflict.

    Instead, the parent of the conflicting resource should be passed to
    __request_region().

    As a fix, this patch doesn't update the parent resource pointer in the
    case where we have to wait for a muxed region right after.

    Reported-and-tested-by: Vincent Pelletier
    Signed-off-by: Simon Guinot
    Tested-by: Vincent Donnefort
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Simon Guinot
     
  • commit d045437a169f899dfb0f6f7ede24cc042543ced9 upstream.

    The ftrace:function event is only displayed for parsing the function tracer
    data. It is not used to enable function tracing, and does not include an
    "enable" file in its event directory.

    Originally, this event was kept separate from other events because it did
    not have a ->reg parameter. But perf added a "reg" parameter for its use
    which caused issues, because it made the event available to functions where
    it was not compatible for.

    Commit 9b63776fa3ca9 "tracing: Do not enable function event with enable"
    added a TRACE_EVENT_FL_IGNORE_ENABLE flag that prevented the function event
    from being enabled by normal trace events. But this commit missed keeping
    the function event from being displayed by the "available_events" directory,
    which is used to show what events can be enabled by set_event.

    One documented way to enable all events is to:

    cat available_events > set_event

    But because the function event is displayed in the available_events, this
    now causes an INVALID error:

    cat: write error: Invalid argument

    Reported-by: Chunyu Hu
    Fixes: 9b63776fa3ca9 "tracing: Do not enable function event with enable"
    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Red Hat)
     
  • commit aa226ff4a1ce79f229c6b7a4c0a14e17fececd01 upstream.

    There are three subsystem callbacks in css shutdown path -
    css_offline(), css_released() and css_free(). Except for
    css_released(), cgroup core didn't guarantee the order of invocation.
    css_offline() or css_free() could be called on a parent css before its
    children. This behavior is unexpected and led to bugs in cpu and
    memory controller.

    This patch updates offline path so that a parent css is never offlined
    before its children. Each css keeps online_cnt which reaches zero iff
    itself and all its children are offline and offline_css() is invoked
    only after online_cnt reaches zero.

    This fixes the memory controller bug and allows the fix for cpu
    controller.

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: Christian Borntraeger
    Reported-by: Brian Christiansen
    Link: http://lkml.kernel.org/g/5698A023.9070703@de.ibm.com
    Link: http://lkml.kernel.org/g/CAKB58ikDkzc8REt31WBkD99+hxNzjK4+FBmhkgS+NVrC9vjMSg@mail.gmail.com
    Cc: Heiko Carstens
    Cc: Peter Zijlstra
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • commit e93ad19d05648397ef3bcb838d26aec06c245dc0 upstream.

    If "cpuset.memory_migrate" is set, when a process is moved from one
    cpuset to another with a different memory node mask, pages in use by
    the process are migrated to the new set of nodes. This was performed
    synchronously in the ->attach() callback, which is synchronized
    against process management. Recently, the synchronization was changed
    from per-process rwsem to global percpu rwsem for simplicity and
    optimization.

    Combined with the synchronous mm migration, this led to deadlocks
    because mm migration could schedule a work item which may in turn try
    to create a new worker blocking on the process management lock held
    from cgroup process migration path.

    This heavy an operation shouldn't be performed synchronously from that
    deep inside cgroup migration in the first place. This patch punts the
    actual migration to an ordered workqueue and updates cgroup process
    migration and cpuset config update paths to flush the workqueue after
    all locks are released. This way, the operations still seem
    synchronous to userland without entangling mm migration with process
    management synchronization. CPU hotplug can also invoke mm migration
    but there's no reason for it to wait for mm migrations and thus
    doesn't synchronize against their completions.

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: Christian Borntraeger
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • commit 041bd12e272c53a35c54c13875839bcb98c999ce upstream.

    This reverts commit 874bbfe600a660cba9c776b3957b1ce393151b76.

    Workqueue used to implicitly guarantee that work items queued without
    an explicit CPU specified are put on the local CPU. Recent changes in
    timer broke the guarantee and led to vmstat breakage which was fixed
    by 176bed1de5bf ("vmstat: explicitly schedule per-cpu work on the CPU
    we need it to run on").

    vmstat is the most likely to expose the issue and it's quite possible
    that there are other similar problems which are a lot more difficult
    to trigger. As a preventive measure, 874bbfe600a6 ("workqueue: make
    sure delayed work run in local cpu") was applied to restore the local
    CPU guarantee. Unfortunately, the change exposed a bug in timer code
    which got fixed by 22b886dd1018 ("timers: Use proper base migration in
    add_timer_on()"). Due to code restructuring, the commit couldn't be
    backported beyond certain point and stable kernels which only had
    874bbfe600a6 started crashing.

    The local CPU guarantee was accidental more than anything else and we
    want to get rid of it anyway. As, with the vmstat case fixed,
    874bbfe600a6 is causing more problems than it's fixing, it has been
    decided to take the chance and officially break the guarantee by
    reverting the commit. A debug feature will be added to force foreign
    CPU assignment to expose cases relying on the guarantee and fixes for
    the individual cases will be backported to stable as necessary.

    Signed-off-by: Tejun Heo
    Fixes: 874bbfe600a6 ("workqueue: make sure delayed work run in local cpu")
    Link: http://lkml.kernel.org/g/20160120211926.GJ10810@quack.suse.cz
    Cc: Mike Galbraith
    Cc: Henrique de Moraes Holschuh
    Cc: Daniel Bilik
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Sasha Levin
    Cc: Ben Hutchings
    Cc: Thomas Gleixner
    Cc: Daniel Bilik
    Cc: Jiri Slaby
    Cc: Michal Hocko
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • commit d6e022f1d207a161cd88e08ef0371554680ffc46 upstream.

    When looking up the pool_workqueue to use for an unbound workqueue,
    workqueue assumes that the target CPU is always bound to a valid NUMA
    node. However, currently, when a CPU goes offline, the mapping is
    destroyed and cpu_to_node() returns NUMA_NO_NODE.

    This has always been broken but hasn't triggered often enough before
    874bbfe600a6 ("workqueue: make sure delayed work run in local cpu").
    After the commit, workqueue forcefully assigns the local CPU for
    delayed work items without an explicit target CPU to fix a different
    issue. This widens the window where a CPU can go offline while a
    delayed work item is pending, causing delayed work items to be dispatched
    with the target CPU set to an already offlined CPU. The resulting
    NUMA_NO_NODE mapping makes workqueue try to queue the work item on a
    NULL pool_workqueue and thus crash.

    While 874bbfe600a6 has been reverted for a different reason making the
    bug less visible again, it can still happen. Fix it by mapping
    NUMA_NO_NODE to the default pool_workqueue from unbound_pwq_by_node().
    This is a temporary workaround. The long term solution is keeping CPU
    -> NODE mapping stable across CPU off/online cycles which is being
    worked on.

    Signed-off-by: Tejun Heo
    Reported-by: Mike Galbraith
    Cc: Tang Chen
    Cc: Rafael J. Wysocki
    Cc: Len Brown
    Link: http://lkml.kernel.org/g/1454424264.11183.46.camel@gmail.com
    Link: http://lkml.kernel.org/g/1453702100-2597-1-git-send-email-tangchen@cn.fujitsu.com
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • commit 1ca8ec532fc2d986f1f4a319857bb18e0c9739b4 upstream.

    commit 0ff53d096422 sets the next tick interrupt to the last jiffies update,
    i.e. in the past, because the forward operation is invoked before the set
    operation. There is no resulting damage (yet), but we get an extra pointless
    tick interrupt.

    Revert the order so we get the next tick interrupt in the future.

    Fixes: commit 0ff53d096422 "tick: sched: Force tick interrupt and get rid of softirq magic"
    Signed-off-by: Wanpeng Li
    Cc: Peter Zijlstra
    Cc: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/1453893967-3458-1-git-send-email-wanpeng.li@hotmail.com
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Wanpeng Li