06 Apr, 2019

1 commit

  • [ Upstream commit 31b265b3baaf55f209229888b7ffea523ddab366 ]

    As reported back in 2016-11 [1], the "ftdump" kdb command triggers a
    BUG for "sleeping function called from invalid context".

    kdb's "ftdump" command wants to call ring_buffer_read_prepare() in
    atomic context. A very simple solution for this is to add allocation
    flags to ring_buffer_read_prepare() so kdb can call it without
    triggering the allocation error. This patch does that.
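
    A minimal sketch of the idea (hedged; the upstream prototype may
    differ in detail): ring_buffer_read_prepare() gains a gfp_t parameter
    that is passed down to its internal allocation, so each caller can
    pick the right allocation context:

    struct ring_buffer_iter *
    ring_buffer_read_prepare(struct ring_buffer *buffer, int cpu, gfp_t flags)
    {
            struct ring_buffer_iter *iter;

            /* the allocation that used to be hard-coded GFP_KERNEL */
            iter = kmalloc(sizeof(*iter), flags);
            if (!iter)
                    return NULL;
            /* ... set up the iterator as before ... */
            return iter;
    }

    /* trace.c: ordinary readers may sleep */
    iter = ring_buffer_read_prepare(buffer, cpu, GFP_KERNEL);

    /* trace_kdb.c: ftdump runs in atomic context */
    iter = ring_buffer_read_prepare(buffer, cpu, GFP_ATOMIC);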

    Note that in the original email thread about this, it was suggested
    that perhaps the solution for kdb was to either preallocate the buffer
    ahead of time or create our own iterator. I'm hoping that this
    alternative of adding allocation flags to ring_buffer_read_prepare()
    can be considered since it means I don't need to duplicate more of the
    core trace code into "trace_kdb.c" (for either creating my own
    iterator or re-preparing a ring allocator whose memory was already
    allocated).

    NOTE: another option for kdb is to actually figure out how to make it
    reuse the existing ftrace_dump() function and totally eliminate the
    duplication. This sounds very appealing and actually works (the "sr
    z" command can be seen to properly dump the ftrace buffer). The
    downside here is that ftrace_dump() fully consumes the trace buffer.
    Unless that is changed I'd rather not use it because it means "ftdump
    | grep xyz" won't be very useful to search the ftrace buffer since it
    will throw away the whole trace on the first grep. A future patch to
    dump only the last few lines of the buffer will also be hard to
    implement.

    [1] https://lkml.kernel.org/r/20161117191605.GA21459@google.com

    Link: http://lkml.kernel.org/r/20190308193205.213659-1-dianders@chromium.org

    Reported-by: Brian Norris
    Signed-off-by: Douglas Anderson
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Sasha Levin

    Douglas Anderson
     

24 Mar, 2019

3 commits

  • commit 83540fbc8812a580b6ad8f93f4c29e62e417687e upstream.

    The first version of this method was missing the check for
    `ret == PATH_MAX`; then such a check was added, but it didn't call kfree()
    on error, so there was still a small memory leak in the error case.
    Fix it by using strndup_user() instead of open-coding it.
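
    A hedged sketch of the shape of the fix (not the literal diff; the
    'path'/'uprobe_path' names are illustrative):

    /* before: open-coded copy; the ret == PATH_MAX branch returned
     * without freeing 'path' */
    path = kzalloc(PATH_MAX, GFP_KERNEL);
    if (!path)
            return -ENOMEM;
    ret = strncpy_from_user(path, u64_to_user_ptr(uprobe_path), PATH_MAX);
    if (ret == PATH_MAX)
            return -E2BIG;          /* leaked 'path' */

    /* after: one helper does allocation, copy and error handling */
    path = strndup_user(u64_to_user_ptr(uprobe_path), PATH_MAX);
    if (IS_ERR(path))
            return PTR_ERR(path);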

    Link: http://lkml.kernel.org/r/20190220165443.152385-1-jannh@google.com

    Cc: Ingo Molnar
    Cc: stable@vger.kernel.org
    Fixes: 0eadcc7a7bc0 ("perf/core: Fix perf_uprobe_init()")
    Reviewed-by: Masami Hiramatsu
    Acked-by: Song Liu
    Signed-off-by: Jann Horn
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     
  • commit e7f0c424d0806b05d6f47be9f202b037eb701707 upstream.

    Commit d716ff71dd12 ("tracing: Remove taking of trace_types_lock in
    pipe files") switched tracing_open_pipe() to use the current tracer
    instead of a private copy, but it forgot to remove the corresponding
    free in the error path.

    That error path could call kfree(iter->trace) after iter->trace had
    been set to point at tr->current_trace, which must not be freed.
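
    The fix is essentially the removal of one kfree() in
    tracing_open_pipe()'s error path (hedged sketch; surrounding context
    abbreviated):

    fail:
            /* kfree(iter->trace) was removed here: iter->trace now
             * points at tr->current_trace, which must stay alive */
            kfree(iter);
            __trace_array_put(tr);
            return ret;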

    Link: http://lkml.kernel.org/r/1550060946-45984-1-git-send-email-yi.zhang@huawei.com

    Cc: stable@vger.kernel.org
    Fixes: d716ff71dd12 ("tracing: Remove taking of trace_types_lock in pipe files")
    Signed-off-by: zhangyi (F)
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    zhangyi (F)
     
  • commit 9f0bbf3115ca9f91f43b7c74e9ac7d79f47fc6c2 upstream.

    Because there may be random garbage beyond a string's null terminator,
    it's not correct to copy the complete character array for use as a
    hist trigger key. Doing so results in multiple histogram entries for
    the 'same' string key.

    So, in the case of a string key, use strncpy instead of memcpy to
    avoid copying in the extra bytes.

    Before, using the gdbus entries in the following hist trigger as an
    example:

    # echo 'hist:key=comm' > /sys/kernel/debug/tracing/events/sched/sched_waking/trigger
    # cat /sys/kernel/debug/tracing/events/sched/sched_waking/hist

    ...

    { comm: ImgDecoder #4 } hitcount: 203
    { comm: gmain } hitcount: 213
    { comm: gmain } hitcount: 216
    { comm: StreamTrans #73 } hitcount: 221
    { comm: mozStorage #3 } hitcount: 230
    { comm: gdbus } hitcount: 233
    { comm: StyleThread#5 } hitcount: 253
    { comm: gdbus } hitcount: 256
    { comm: gdbus } hitcount: 260
    { comm: StyleThread#4 } hitcount: 271

    ...

    # cat /sys/kernel/debug/tracing/events/sched/sched_waking/hist | egrep gdbus | wc -l
    51

    After:

    # cat /sys/kernel/debug/tracing/events/sched/sched_waking/hist | egrep gdbus | wc -l
    1
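
    A hedged sketch of the key-copy logic after the fix (field names are
    taken from the changelog and kernel conventions, not necessarily the
    exact upstream hunk):

    if (key_field->flags & HIST_FIELD_FL_STRING)
            /* stop at the NUL: the bytes past it are uninitialized
             * garbage and would make identical strings hash differently */
            strncpy(compound_key + key_field->offset, (char *)key, size);
    else
            memcpy(compound_key + key_field->offset, key, size);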

    Link: http://lkml.kernel.org/r/50c35ae1267d64eee975b8125e151e600071d4dc.1549309756.git.tom.zanussi@linux.intel.com

    Cc: Namhyung Kim
    Cc: stable@vger.kernel.org
    Fixes: 79e577cbce4c4 ("tracing: Support string type key properly")
    Signed-off-by: Tom Zanussi
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Tom Zanussi
     

14 Mar, 2019

1 commit

  • [ Upstream commit e16ec34039c701594d55d08a5aa49ee3e1abc821 ]

    Lockdep found a potential deadlock between cpu_hotplug_lock, bpf_event_mutex, and cpuctx_mutex:
    [ 13.007000] WARNING: possible circular locking dependency detected
    [ 13.007587] 5.0.0-rc3-00018-g2fa53f892422-dirty #477 Not tainted
    [ 13.008124] ------------------------------------------------------
    [ 13.008624] test_progs/246 is trying to acquire lock:
    [ 13.009030] 0000000094160d1d (tracepoints_mutex){+.+.}, at: tracepoint_probe_register_prio+0x2d/0x300
    [ 13.009770]
    [ 13.009770] but task is already holding lock:
    [ 13.010239] 00000000d663ef86 (bpf_event_mutex){+.+.}, at: bpf_probe_register+0x1d/0x60
    [ 13.010877]
    [ 13.010877] which lock already depends on the new lock.
    [ 13.010877]
    [ 13.011532]
    [ 13.011532] the existing dependency chain (in reverse order) is:
    [ 13.012129]
    [ 13.012129] -> #4 (bpf_event_mutex){+.+.}:
    [ 13.012582] perf_event_query_prog_array+0x9b/0x130
    [ 13.013016] _perf_ioctl+0x3aa/0x830
    [ 13.013354] perf_ioctl+0x2e/0x50
    [ 13.013668] do_vfs_ioctl+0x8f/0x6a0
    [ 13.014003] ksys_ioctl+0x70/0x80
    [ 13.014320] __x64_sys_ioctl+0x16/0x20
    [ 13.014668] do_syscall_64+0x4a/0x180
    [ 13.015007] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 13.015469]
    [ 13.015469] -> #3 (&cpuctx_mutex){+.+.}:
    [ 13.015910] perf_event_init_cpu+0x5a/0x90
    [ 13.016291] perf_event_init+0x1b2/0x1de
    [ 13.016654] start_kernel+0x2b8/0x42a
    [ 13.016995] secondary_startup_64+0xa4/0xb0
    [ 13.017382]
    [ 13.017382] -> #2 (pmus_lock){+.+.}:
    [ 13.017794] perf_event_init_cpu+0x21/0x90
    [ 13.018172] cpuhp_invoke_callback+0xb3/0x960
    [ 13.018573] _cpu_up+0xa7/0x140
    [ 13.018871] do_cpu_up+0xa4/0xc0
    [ 13.019178] smp_init+0xcd/0xd2
    [ 13.019483] kernel_init_freeable+0x123/0x24f
    [ 13.019878] kernel_init+0xa/0x110
    [ 13.020201] ret_from_fork+0x24/0x30
    [ 13.020541]
    [ 13.020541] -> #1 (cpu_hotplug_lock.rw_sem){++++}:
    [ 13.021051] static_key_slow_inc+0xe/0x20
    [ 13.021424] tracepoint_probe_register_prio+0x28c/0x300
    [ 13.021891] perf_trace_event_init+0x11f/0x250
    [ 13.022297] perf_trace_init+0x6b/0xa0
    [ 13.022644] perf_tp_event_init+0x25/0x40
    [ 13.023011] perf_try_init_event+0x6b/0x90
    [ 13.023386] perf_event_alloc+0x9a8/0xc40
    [ 13.023754] __do_sys_perf_event_open+0x1dd/0xd30
    [ 13.024173] do_syscall_64+0x4a/0x180
    [ 13.024519] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 13.024968]
    [ 13.024968] -> #0 (tracepoints_mutex){+.+.}:
    [ 13.025434] __mutex_lock+0x86/0x970
    [ 13.025764] tracepoint_probe_register_prio+0x2d/0x300
    [ 13.026215] bpf_probe_register+0x40/0x60
    [ 13.026584] bpf_raw_tracepoint_open.isra.34+0xa4/0x130
    [ 13.027042] __do_sys_bpf+0x94f/0x1a90
    [ 13.027389] do_syscall_64+0x4a/0x180
    [ 13.027727] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 13.028171]
    [ 13.028171] other info that might help us debug this:
    [ 13.028171]
    [ 13.028807] Chain exists of:
    [ 13.028807] tracepoints_mutex --> &cpuctx_mutex --> bpf_event_mutex
    [ 13.028807]
    [ 13.029666] Possible unsafe locking scenario:
    [ 13.029666]
    [ 13.030140] CPU0                      CPU1
    [ 13.030510] ----                      ----
    [ 13.030875] lock(bpf_event_mutex);
    [ 13.031166]                           lock(&cpuctx_mutex);
    [ 13.031645]                           lock(bpf_event_mutex);
    [ 13.032135] lock(tracepoints_mutex);
    [ 13.032441]
    [ 13.032441] *** DEADLOCK ***
    [ 13.032441]
    [ 13.032911] 1 lock held by test_progs/246:
    [ 13.033239] #0: 00000000d663ef86 (bpf_event_mutex){+.+.}, at: bpf_probe_register+0x1d/0x60
    [ 13.033909]
    [ 13.033909] stack backtrace:
    [ 13.034258] CPU: 1 PID: 246 Comm: test_progs Not tainted 5.0.0-rc3-00018-g2fa53f892422-dirty #477
    [ 13.034964] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
    [ 13.035657] Call Trace:
    [ 13.035859] dump_stack+0x5f/0x8b
    [ 13.036130] print_circular_bug.isra.37+0x1ce/0x1db
    [ 13.036526] __lock_acquire+0x1158/0x1350
    [ 13.036852] ? lock_acquire+0x98/0x190
    [ 13.037154] lock_acquire+0x98/0x190
    [ 13.037447] ? tracepoint_probe_register_prio+0x2d/0x300
    [ 13.037876] __mutex_lock+0x86/0x970
    [ 13.038167] ? tracepoint_probe_register_prio+0x2d/0x300
    [ 13.038600] ? tracepoint_probe_register_prio+0x2d/0x300
    [ 13.039028] ? __mutex_lock+0x86/0x970
    [ 13.039337] ? __mutex_lock+0x24a/0x970
    [ 13.039649] ? bpf_probe_register+0x1d/0x60
    [ 13.039992] ? __bpf_trace_sched_wake_idle_without_ipi+0x10/0x10
    [ 13.040478] ? tracepoint_probe_register_prio+0x2d/0x300
    [ 13.040906] tracepoint_probe_register_prio+0x2d/0x300
    [ 13.041325] bpf_probe_register+0x40/0x60
    [ 13.041649] bpf_raw_tracepoint_open.isra.34+0xa4/0x130
    [ 13.042068] ? __might_fault+0x3e/0x90
    [ 13.042374] __do_sys_bpf+0x94f/0x1a90
    [ 13.042678] do_syscall_64+0x4a/0x180
    [ 13.042975] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 13.043382] RIP: 0033:0x7f23b10a07f9
    [ 13.045155] RSP: 002b:00007ffdef42fdd8 EFLAGS: 00000202 ORIG_RAX: 0000000000000141
    [ 13.045759] RAX: ffffffffffffffda RBX: 00007ffdef42ff70 RCX: 00007f23b10a07f9
    [ 13.046326] RDX: 0000000000000070 RSI: 00007ffdef42fe10 RDI: 0000000000000011
    [ 13.046893] RBP: 00007ffdef42fdf0 R08: 0000000000000038 R09: 00007ffdef42fe10
    [ 13.047462] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
    [ 13.048029] R13: 0000000000000016 R14: 00007f23b1db4690 R15: 0000000000000000

    Since tracepoints_mutex will be taken in tracepoint_probe_register/unregister()
    anyway, there is no need to take bpf_event_mutex too.
    bpf_event_mutex protects modifications to the prog array used by
    kprobe/perf bpf programs; bpf_raw_tracepoints don't need to take it.
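
    The fix drops the mutex from the raw tracepoint attach path; a hedged
    sketch (the unregister side is analogous):

    int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *prog)
    {
            /* tracepoint_probe_register() already serializes on
             * tracepoints_mutex; bpf_event_mutex only guards the
             * kprobe/perf prog arrays, which are not used here */
            return __bpf_probe_register(btp, prog);
    }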

    Fixes: c4f6699dfcb8 ("bpf: introduce BPF_RAW_TRACEPOINT")
    Acked-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Sasha Levin

    Alexei Starovoitov
     

10 Mar, 2019

1 commit

  • commit 6a072128d262d2b98d31626906a96700d1fc11eb upstream.

    When tracing a syscall exit event it is extremely useful to filter on
    exit codes equal to some negative value, to react only to the required
    errors. But negative numbers do not work:

    [root@snorch sys_exit_read]# echo "ret == -1" > filter
    bash: echo: write error: Invalid argument
    [root@snorch sys_exit_read]# cat filter
    ret == -1
    ^
    parse_error: Invalid value (did you forget quotes)?

    Similar thing happens when setting triggers.

    This is a regression in v4.17 introduced by the commit mentioned below;
    testing without that commit shows no problem with negative numbers.
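
    The essence of the fix (a hedged sketch of the operand check in
    parse_pred() of trace_events_filter.c; surrounding context
    abbreviated):

    } else if (isdigit(str[i]) || str[i] == '-') {
            /* a numeric operand: the number parser called later
             * (kstrtol()/kstrtoul()) handles the leading minus sign */
            ...
    }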

    Link: http://lkml.kernel.org/r/20180823102534.7642-1-ptikhomirov@virtuozzo.com

    Cc: stable@vger.kernel.org
    Fixes: 80765597bc58 ("tracing: Rewrite filter logic to be simpler and faster")
    Signed-off-by: Pavel Tikhomirov
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Pavel Tikhomirov
     

27 Feb, 2019

1 commit

  • commit 9e7382153f80ba45a0bbcd540fb77d4b15f6e966 upstream.

    The following commit

    441dae8f2f29 ("tracing: Add support for display of tgid in trace output")

    removed the call to print_event_info() from print_func_help_header_irq(),
    which results in the ftrace header not reporting the number of entries
    written in the buffer. As this wasn't the original intent of the patch,
    re-introduce the call to print_event_info() to restore the original
    behaviour.

    Link: http://lkml.kernel.org/r/20190214152950.4179-1-quentin.perret@arm.com

    Acked-by: Joel Fernandes
    Cc: stable@vger.kernel.org
    Fixes: 441dae8f2f29 ("tracing: Add support for display of tgid in trace output")
    Signed-off-by: Quentin Perret
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Quentin Perret
     

20 Feb, 2019

1 commit

  • commit 0722069a5374b904ec1a67f91249f90e1cfae259 upstream.

    When printing multiple uprobe arguments as strings, the output for the
    earlier arguments would also include all later string arguments.

    This is best explained in an example:

    Consider adding a uprobe to a function at offset 0xa0 in strlib.so
    that receives two strings as parameters, where we want to print both
    parameters when the uprobe is hit (on x86_64):

    $ echo 'p:func /lib/strlib.so:0xa0 +0(%di):string +0(%si):string' > \
    /sys/kernel/debug/tracing/uprobe_events

    When the function is called as func("foo", "bar") and we hit the probe,
    the trace file shows a line like the following:

    [...] func: (0x7f7e683706a0) arg1="foobar" arg2="bar"

    Note the extra "bar" printed as part of arg1. This behaviour stacks up
    for additional string arguments.

    The strings are stored in a dynamically growing part of the uprobe
    buffer by fetch_store_string() after copying them from userspace via
    strncpy_from_user(). The return value of strncpy_from_user() is then
    directly used as the required size for the string. However, this does
    not take the terminating null byte into account as the documentation
    for strncpy_from_user() clearly states that it "[...] returns the
    length of the string (not including the trailing NUL)" even though the
    null byte will be copied to the destination.

    Therefore, subsequent calls to fetch_store_string() will overwrite
    the terminating null byte of the most recently fetched string with
    the first character of the current string, leading to the
    "accumulation" of strings in earlier arguments in the output.

    Fix this by incrementing the return value of strncpy_from_user() by
    one if we did not hit the maximum buffer size.
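
    The resulting logic looks roughly like this (a sketch reconstructed
    from the changelog, not guaranteed to match the upstream hunk
    byte-for-byte):

    ret = strncpy_from_user(dst, src, maxlen);
    if (ret == maxlen)
            dst[ret - 1] = '\0';    /* truncated: force NUL termination */
    else if (ret >= 0)
            ret++;                  /* count the trailing NUL that
                                     * strncpy_from_user() copied but did
                                     * not include in its return value */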

    Link: http://lkml.kernel.org/r/20190116141629.5752-1-andreas.ziegler@fau.de

    Cc: Ingo Molnar
    Cc: stable@vger.kernel.org
    Fixes: 5baaa59ef09e ("tracing/probes: Implement 'memory' fetch method for uprobes")
    Acked-by: Masami Hiramatsu
    Signed-off-by: Andreas Ziegler
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Greg Kroah-Hartman

    Andreas Ziegler
     

15 Feb, 2019

1 commit

  • commit ea6eb5e7d15e1838de335609994b4546e2abcaaf upstream.

    The subsystem-specific message prefix for uprobes was also
    "trace_kprobe: " instead of "trace_uprobe: " as described in
    the original commit message.

    Link: http://lkml.kernel.org/r/20190117133023.19292-1-andreas.ziegler@fau.de

    Cc: Ingo Molnar
    Cc: stable@vger.kernel.org
    Acked-by: Masami Hiramatsu
    Fixes: 7257634135c24 ("tracing/probe: Show subsystem name in messages")
    Signed-off-by: Andreas Ziegler
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Andreas Ziegler
     

20 Dec, 2018

3 commits

  • commit 2840f84f74035e5a535959d5f17269c69fa6edc5 upstream.

    The following commands will cause a memory leak:

    # cd /sys/kernel/tracing
    # mkdir instances/foo
    # echo schedule > instances/foo/set_ftrace_filter
    # rmdir instances/foo

    The reason is that the hashes that hold the filters to set_ftrace_filter and
    set_ftrace_notrace are not freed if they contain any data on the instance
    and the instance is removed.

    Found by kmemleak detector.

    Cc: stable@vger.kernel.org
    Fixes: 591dffdade9f ("ftrace: Allow for function tracing instance to filter functions")
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (VMware)
     
  • commit 3cec638b3d793b7cacdec5b8072364b41caeb0e1 upstream.

    When create_event_filter() fails in set_trigger_filter(), the filter may
    still be allocated and needs to be freed. The caller expects
    data->filter to be updated with the new filter, even if creating the
    new filter failed (we could add an error message by setting the set_str
    parameter of create_event_filter(), but that's another update).

    But because the error path would just exit, the filter was left hanging
    and nothing could free it.

    Found by kmemleak detector.

    Cc: stable@vger.kernel.org
    Fixes: bac5fb97a173a ("tracing: Add and use generic set_trigger_filter() implementation")
    Reviewed-by: Tom Zanussi
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (VMware)
     
  • commit b61c19209c2c35ea2a2fe502d484703686eba98c upstream.

    create_filter() calls create_filter_start(), which allocates a
    "parse_error" descriptor, but it fails to call create_filter_finish(),
    which frees it.

    The op_stack and inverts arrays in predicate_parse() were also not freed.
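
    A hedged sketch of the two leaks being plugged (reconstructed from
    the changelog; not the literal hunks):

    /* create_filter(): release the parse-error descriptor allocated by
     * create_filter_start() on every exit path */
    err = process_preds(call, filter_string, *filterp, pe);
    if (err && set_str)
            append_filter_err(pe, *filterp);
    create_filter_finish(pe);               /* was missing */

    /* predicate_parse(): the op_stack and inverts work arrays must also
     * be freed on the success path, not just on error */
    kfree(op_stack);
    kfree(inverts);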

    Found by kmemleak detector.

    Cc: stable@vger.kernel.org
    Fixes: 80765597bc587 ("tracing: Rewrite filter logic to be simpler and faster")
    Reviewed-by: Tom Zanussi
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (VMware)
     

17 Dec, 2018

1 commit

  • [ Upstream commit 1efb6ee3edea57f57f9fb05dba8dcb3f7333f61f ]

    A format string consisting of "%p" or "%s" followed by an invalid
    specifier (e.g. "%p%\n" or "%s%") could pass the check, which would
    then cause format_decode() (lib/vsprintf.c) to warn.

    Fixes: 9c959c863f82 ("tracing: Allow BPF programs to call bpf_trace_printk()")
    Reported-by: syzbot+1ec5c5ec949c4adaa0c4@syzkaller.appspotmail.com
    Signed-off-by: Martynas Pumputis
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Sasha Levin

    Martynas Pumputis
     

08 Dec, 2018

1 commit

  • commit 5cf99a0f3161bc3ae2391269d134d6bf7e26f00e upstream.

    The tracefs file set_graph_function is used to make the function graph
    tracer trace only the functions that are listed in that file (or all
    functions if the file is empty). The way this is implemented is that
    the function graph tracer looks at every function, and if the current
    depth is zero and the function matches something in the file then it
    will trace that function. When other functions are called, the depth
    will be greater than zero (because the original function will be at
    depth zero), and all functions will be traced where the depth is
    greater than zero.

    The issue is that when a function is first entered, and the handler
    that checks this logic is called, the depth is set to zero. If an
    interrupt comes in and a function in the interrupt handler is traced,
    its depth will be greater than zero and it will automatically be
    traced, even if the original function was not. Because the logic only
    looks at the depth, it may trace interrupts when it should not.

    The recent design change of the function graph tracer to fix other bugs
    caused the depth to be zero for a longer time while the function graph
    callback handler is being called, widening the window for this race.
    The bug was actually there for much longer, but because the race window
    was so small it seldom happened. The Fixes tag below is for the commit
    that widened the race window, because that commit belongs to a series
    that will also help fix the original bug.

    Cc: stable@kernel.org
    Fixes: 39eb456dacb5 ("function_graph: Use new curr_ret_depth to manage depth instead of curr_ret_stack")
    Reported-by: Joe Lawrence
    Tested-by: Joe Lawrence
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (VMware)
     

06 Dec, 2018

6 commits

  • commit 7c6ea35ef50810aa12ab26f21cb858d980881576 upstream.

    The function graph profiler uses the ret_stack to store the "subtime",
    which is reused by nested functions and on return. But the current
    logic has the profiler callback called before the ret_stack is updated,
    so it modifies a ret_stack entry that will only later be allocated
    (it's just lucky that the "subtime" is not touched when it is
    allocated).

    This could also cause a crash if we are at the end of the ret_stack when
    this happens.

    By reversing the order, allocating the ret_stack entry first and then
    calling the callbacks attached to the function being traced, the
    ret_stack entry is no longer used before it is allocated.

    Cc: stable@kernel.org
    Fixes: 03274a3ffb449 ("tracing/fgraph: Adjust fgraph depth before calling trace return callback")
    Reviewed-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (VMware)
     
  • commit 552701dd0fa7c3d448142e87210590ba424694a0 upstream.

    In the past, curr_ret_stack served two functions. One was to denote the
    depth of the call graph, the other to keep track of where on the
    ret_stack the data in use is located. Although the two are slightly
    related, there are two cases where they need to be used differently.

    The one case is that it keeps the ret_stack data from being corrupted by an
    interrupt coming in and overwriting the data still in use. The other is just
    to know where the depth of the stack currently is.

    The function profiler uses the ret_stack to save a "subtime" variable that
    is part of the data on the ret_stack. If curr_ret_stack is modified too
    early, then this variable can be corrupted.

    The "max_depth" option, when set to 1, will record the first functions going
    into the kernel. To see all top functions (when dealing with timings), the
    depth variable needs to be lowered before calling the return hook. But by
    lowering the curr_ret_stack, it makes the data on the ret_stack still being
    used by the return hook susceptible to being overwritten.

    Now that there are two variables to handle both cases (curr_ret_stack
    and curr_ret_depth), each can be updated at the location appropriate
    to its role.

    Cc: stable@kernel.org
    Fixes: 03274a3ffb449 ("tracing/fgraph: Adjust fgraph depth before calling trace return callback")
    Reviewed-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (VMware)
     
  • commit b1b35f2e218a5b57d03bbc3b0667d5064570dc60 upstream.

    The profiler uses trace->depth to find its entry on the ret_stack, but
    the depth may not match the actual location of its entry (if an
    interrupt preempts the processing of the profiler for another function,
    the depth and curr_ret_stack will differ).

    Have it use curr_ret_stack as the index to find its ret_stack entry
    instead of the depth variable, as the depth is no longer guaranteed to
    be the same.

    Cc: stable@kernel.org
    Fixes: 03274a3ffb449 ("tracing/fgraph: Adjust fgraph depth before calling trace return callback")
    Reviewed-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (VMware)
     
  • commit 39eb456dacb543de90d3bc6a8e0ac5cf51ac475e upstream.

    Currently, the depth of the ret_stack is determined by the curr_ret_stack index.
    The issue is that there's a race between setting of the curr_ret_stack and
    calling of the callback attached to the return of the function.

    Commit 03274a3ffb44 ("tracing/fgraph: Adjust fgraph depth before calling
    trace return callback") moved the calling of the callback to after the
    setting of the curr_ret_stack, even stating that it was safe to do so, when
    in fact, it was the reason there was a barrier() there (yes, I should have
    commented that barrier()).

    Not only does the curr_ret_stack keep track of the current call graph depth,
    it also keeps the ret_stack content from being overwritten by new data.

    The function profiler uses the "subtime" variable of the ret_stack
    structure; moving curr_ret_stack early allows interrupts to reuse the
    very structure still in use, corrupting the data and breaking the
    profiler.

    To fix this, there needs to be two variables to handle the call stack depth
    and the pointer to where the ret_stack is being used, as they need to change
    at two different locations.
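
    Conceptually, the split looks like this (a hedged sketch, not the
    literal patch; the real series spreads these updates across several
    functions):

    /* entry: reserve the ret_stack slot and bump the depth together */
    current->curr_ret_stack++;      /* protects the entry from reuse */
    current->curr_ret_depth++;      /* tracks call-graph depth only */

    /* exit: the depth may be lowered before the return hook runs, but
     * the ret_stack slot is released only after the hook has finished
     * reading its data */
    current->curr_ret_depth--;
    /* ... return callbacks run here, still reading the ret_stack ... */
    barrier();
    current->curr_ret_stack--;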

    Cc: stable@kernel.org
    Fixes: 03274a3ffb449 ("tracing/fgraph: Adjust fgraph depth before calling trace return callback")
    Reviewed-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (VMware)
     
  • commit d125f3f866df88da5a85df00291f88f0baa89f7c upstream.

    As all architectures now call function_graph_enter() to do the entry work,
    no architecture should ever call ftrace_push_return_trace(). Make it static.

    This is needed to prepare for a fix of a design bug on how the curr_ret_stack
    is used.

    Cc: stable@kernel.org
    Fixes: 03274a3ffb449 ("tracing/fgraph: Adjust fgraph depth before calling trace return callback")
    Reviewed-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (VMware)
     
  • commit 8114865ff82e200b383e46821c25cb0625b842b5 upstream.

    Currently all the architectures do basically the same thing in preparing the
    function graph tracer on entry to a function. This code can be pulled into a
    generic location and then this will allow the function graph tracer to be
    fixed, as well as extended.

    Create a new function graph helper function_graph_enter() that will call the
    hook function (ftrace_graph_entry) and the shadow stack operation
    (ftrace_push_return_trace), and remove the need of the architecture code to
    manage the shadow stack.
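
    Its shape is roughly as follows (a hedged sketch; argument lists and
    error handling abbreviated from the upstream helper):

    int function_graph_enter(unsigned long ret, unsigned long func,
                             unsigned long frame_pointer, unsigned long *retp)
    {
            struct ftrace_graph_ent trace;

            trace.func = func;
            trace.depth = current->curr_ret_stack + 1;

            /* only trace if the calling function expects to */
            if (!ftrace_graph_entry(&trace))
                    return -EBUSY;

            /* push the shadow-stack entry used on the return side */
            return ftrace_push_return_trace(ret, func, frame_pointer, retp);
    }

    Architectures then call function_graph_enter() from their mcount entry
    code instead of open-coding the two steps.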

    This is needed to prepare for a fix of a design bug on how the curr_ret_stack
    is used.

    Cc: stable@kernel.org
    Fixes: 03274a3ffb449 ("tracing/fgraph: Adjust fgraph depth before calling trace return callback")
    Reviewed-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (VMware)
     

21 Nov, 2018

1 commit

  • [ Upstream commit 59158ec4aef7d44be51a6f3e7e17fc64c32604eb ]

    The current kprobe event code doesn't correctly check whether the
    given event is on an unloaded module or not. It just checks whether
    the event name has a ":" in it.

    That is not enough, because if we define a probe on a non-existent
    symbol on a loaded module, it is still allowed (with a warning
    message).

    To check this correctly, search for the module name in the loaded
    module list and only allow the definition if the module is not found
    (the event will become available when the target module is loaded).
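
    A hedged sketch of the check (reconstructed from the changelog;
    helper and field names may differ from the upstream patch):

    static bool trace_kprobe_module_exist(struct trace_kprobe *tk)
    {
            char *p;
            bool ret;

            p = strchr(tk->symbol, ':');
            if (!p)
                    return true;            /* not a "module:symbol" probe */

            /* temporarily cut the string to isolate the module name */
            *p = '\0';
            mutex_lock(&module_mutex);
            ret = !!find_module(tk->symbol);
            mutex_unlock(&module_mutex);
            *p = ':';

            return ret;
    }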

    Link: http://lkml.kernel.org/r/153547309528.26502.8300278470528281328.stgit@devbox

    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Masami Hiramatsu
     

14 Nov, 2018

1 commit

  • commit 18858511fd8a877303cc34c06efa461b26a0e070 upstream.

    Return an -ENOENT error if there is no matching synthetic event.
    This reports the operation failure to the user, as below:

    # echo 'wakeup_latency u64 lat; pid_t pid;' > synthetic_events
    # echo '!wakeup' >> synthetic_events
    sh: write error: No such file or directory
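
    A hedged sketch of the change in the event-deletion path (names
    reconstructed from the changelog, not the literal hunk):

    /* "!wakeup": delete the named synthetic event */
    event = find_synth_event(name);
    if (!event) {
            ret = -ENOENT;          /* previously fell through as success */
            goto out;
    }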

    Link: http://lkml.kernel.org/r/154013449986.25576.9487131386597290172.stgit@devbox

    Acked-by: Tom Zanussi
    Tested-by: Tom Zanussi
    Cc: Shuah Khan
    Cc: Rajvi Jingar
    Cc: stable@vger.kernel.org
    Fixes: 4b147936fa50 ('tracing: Add support for 'synthetic' events')
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Masami Hiramatsu
     

20 Oct, 2018

2 commits

  • Fix the synthetic event parser to allow an independent semicolon at
    the end.

    The synthetic_events interface accepts a semicolon after the last
    word if there is no space:

    # echo "myevent u64 var;" >> synthetic_events

    But if there is a space, it returns an error.

    # echo "myevent u64 var ;" > synthetic_events
    sh: write error: Invalid argument

    This behavior is hard for users to understand. Let's allow a trailing
    independent semicolon too.

    Link: http://lkml.kernel.org/r/153986835420.18251.2191216690677025744.stgit@devbox

    Cc: Shuah Khan
    Cc: Tom Zanussi
    Cc: stable@vger.kernel.org
    Fixes: commit 4b147936fa50 ("tracing: Add support for 'synthetic' events")
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt (VMware)

    Masami Hiramatsu
     
  • Fix synthetic events to accept the unsigned modifier for field types
    correctly.

    Currently, the synthetic_events interface returns an error for
    "unsigned" modifiers, as below:

    # echo "myevent unsigned long var" >> synthetic_events
    sh: write error: Invalid argument

    This is because argv_split() breaks "unsigned long" into "unsigned"
    and "long", but parse_synth_field() doesn't expect that.

    With this fix, synthetic_events handles "unsigned long" correctly,
    as below:

    # echo "myevent unsigned long var" >> synthetic_events
    # cat synthetic_events
    myevent unsigned long var

    Link: http://lkml.kernel.org/r/153986832571.18251.8448135724590496531.stgit@devbox

    Cc: Shuah Khan
    Cc: Tom Zanussi
    Cc: stable@vger.kernel.org
    Fixes: commit 4b147936fa50 ("tracing: Add support for 'synthetic' events")
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt (VMware)

    Masami Hiramatsu
     

18 Oct, 2018

1 commit

  • The preemptirq_delay_test module is used by the ftrace selftests that
    test the latency tracers. The problem is that it uses ktime for the
    delay loop and then checks whether the tracer caught the delay, but
    the tracer uses trace_clock_local(), which may use various other
    clocks to measure the latency. As ktime uses clock cycles that the
    code then converts to nanoseconds, rounding errors creep in, and the
    preemptirq latency tests fail by being off by one (they expect to see
    a delay of 500000 us, but the measured delay is only 499999 us). This
    is a rounding error in ktime (which is totally legit). The purpose of
    the test is to see if the tracer can catch the delay, not to test the
    accuracy between trace_clock_local() and ktime_get(). Best to compare
    apples to apples, and have the delay loop use the same clock as the
    latency tracer does.
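
    The fix makes the delay loop spin on the tracer's own clock; roughly
    (a sketch close to, but not guaranteed identical to, the upstream
    patch):

    static void busy_wait(ulong time)
    {
            u64 start, end;

            start = trace_clock_local();
            do {
                    end = trace_clock_local();
                    if (kthread_should_stop())
                            break;
            } while ((end - start) < (time * 1000));    /* time in usecs */
    }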

    Cc: stable@vger.kernel.org
    Fixes: f96e8577da102 ("lib: Add module for testing preemptoff/irqsoff latency tracers")
    Acked-by: Joel Fernandes (Google)
    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (VMware)
     

18 Sep, 2018

1 commit

  • When reducing the ring buffer size, pages are removed by scheduling a
    work item on each CPU for the corresponding per-CPU ring buffer. After
    the pages are removed from the ring buffer's linked list, they are
    freed in a tight loop. The loop does not give up the CPU until all
    pages are freed, so in the worst case, when a lot of pages are to be
    freed, it can stall the system.

    Once the pages are off the list, the freeing can happen while the work
    is rescheduled. Call cond_resched() in the loop to prevent the system
    hangup.
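
    A hedged sketch of the page-removal loop with the added
    cond_resched() (context abbreviated from rb_remove_pages()):

    do {
            cond_resched();         /* let other work run between pages */

            to_remove_page = tmp_iter_page;
            rb_inc_page(cpu_buffer, &tmp_iter_page);
            /* ... unlink to_remove_page and free_buffer_page() it ... */
    } while (to_remove_page != last_page);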

    Link: http://lkml.kernel.org/r/20180907223129.71994-1-vnagarnaik@google.com

    Cc: stable@vger.kernel.org
    Fixes: 83f40318dab00 ("ring-buffer: Make removal of ring buffer pages atomic")
    Reported-by: Jason Behmer
    Signed-off-by: Vaibhav Nagarnaik
    Signed-off-by: Steven Rostedt (VMware)

    Vaibhav Nagarnaik
     

24 Aug, 2018

1 commit

  • Pull tracing fixes from Steven Rostedt:
    "Masami found an off by one bug in the code that keeps "notrace"
    functions from being traced by kprobes. During my testing, I found
    that there's places that we may want to add kprobes to notrace, thus
    we may end up changing this code before 4.19 is released.

    The history behind this change is that we found that adding kprobes to
    various notrace functions caused the kernel to crashed. We took the
    safe route and decided not to allow kprobes to trace any notrace
    function.

    But because notrace is added to functions that just cause weird side
    effects to the function tracer, but are still safe, preventing kprobes
    for all notrace functios may be too much of a big hammer.

    One such place is __schedule() is marked notrace, to keep function
    tracer from doing strange recursive loops when it gets traced with
    NEED_RESCHED set. With this change, one can not add kprobes to the
    scheduler.

    Masami also added code to use gcov on ftrace"

    * tag 'trace-v4.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing/kprobes: Fix to check notrace function with correct range
    tracing: Allow gcov profiling on only ftrace subsystem

    Linus Torvalds
     

23 Aug, 2018

1 commit

  • Pull more block updates from Jens Axboe:

    - Set of bcache fixes and changes (Coly)

    - The flush warn fix (me)

    - Small series of BFQ fixes (Paolo)

    - wbt hang fix (Ming)

    - blktrace fix (Steven)

    - blk-mq hardware queue count update fix (Jianchao)

    - Various little fixes

    * tag 'for-4.19/post-20180822' of git://git.kernel.dk/linux-block: (31 commits)
    block/DAC960.c: make some arrays static const, shrinks object size
    blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter
    blk-mq: init hctx sched after update ctx and hctx mapping
    block: remove duplicate initialization
    tracing/blktrace: Fix to allow setting same value
    pktcdvd: fix setting of 'ret' error return for a few cases
    block: change return type to bool
    block, bfq: return nbytes and not zero from struct cftype .write() method
    block, bfq: improve code of bfq_bfqq_charge_time
    block, bfq: reduce write overcharge
    block, bfq: always update the budget of an entity when needed
    block, bfq: readd missing reset of parent-entity service
    blk-wbt: fix IO hang in wbt_wait()
    block: don't warn for flush on read-only device
    bcache: add the missing comments for smp_mb()/smp_wmb()
    bcache: remove unnecessary space before ioctl function pointer arguments
    bcache: add missing SPDX header
    bcache: move open brace at end of function definitions to next line
    bcache: add static const prefix to char * array declarations
    bcache: fix code comments style
    ...

    Linus Torvalds
     

21 Aug, 2018

3 commits

  • Fix within_notrace_func() to check notrace functions correctly.

    Since the ftrace_location_range(start, end) function checks the range
    inclusively (start <= addr <= end), the end of the checked range must
    be the address of the last byte of the target function (start + size
    - 1), not the address of the next function (start + size).
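
    A hedged sketch of the corrected check (reconstructed, not the
    literal hunk):

    /* make the end of the checked range inclusive */
    return !ftrace_location_range(addr, addr + size - 1);
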
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt (VMware)

    Masami Hiramatsu
     
  • Add GCOV_PROFILE_FTRACE to allow gcov profiling on only files in ftrace
    subsystem. This config option will be used for checking kselftest/ftrace
    coverage.

    Link: http://lkml.kernel.org/r/153483647755.32472.4746349899604275441.stgit@devbox

    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt (VMware)

    Masami Hiramatsu
     
  • Pull tracing updates from Steven Rostedt:

    - Restructure of lockdep and latency tracers

    This is the biggest change. Joel Fernandes restructured the hooks
    for irq and preemption disabling and enabling. He got rid of a lot
    of the preprocessor #ifdef mess that they caused.

    He turned both lockdep and the latency tracers to use trace events
    inserted in the preempt/irqs disabling paths. But unfortunately,
    these started to cause issues in corner cases. Thus, parts of the
    code were reverted back to where lockdep and the latency tracers just
    get called directly (without using the trace events). But because the
    original change cleaned up the code very nicely we kept that, as well
    as the trace events for preempt and irqs disabling, but they are
    limited to not being called in NMIs.

    - Have trace events use SRCU for "rcu idle" calls. This was required
    for the preempt/irqs off trace events. But it also had to not allow
    them to be called in NMI context. Waiting till Paul makes an NMI safe
    SRCU API.

    - New notrace SRCU API to allow trace events to use SRCU.

    - Addition of mcount-nop option support

    - SPDX headers replacing GPL templates.

    - Various other fixes and clean ups.

    - Some fixes are marked for stable, but were not fully tested before
    the merge window opened.

    * tag 'trace-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (44 commits)
    tracing: Fix SPDX format headers to use C++ style comments
    tracing: Add SPDX License format tags to tracing files
    tracing: Add SPDX License format to bpf_trace.c
    blktrace: Add SPDX License format header
    s390/ftrace: Add -mfentry and -mnop-mcount support
    tracing: Add -mcount-nop option support
    tracing: Avoid calling cc-option -mrecord-mcount for every Makefile
    tracing: Handle CC_FLAGS_FTRACE more accurately
    Uprobe: Additional argument arch_uprobe to uprobe_write_opcode()
    Uprobes: Simplify uprobe_register() body
    tracepoints: Free early tracepoints after RCU is initialized
    uprobes: Use synchronize_rcu() not synchronize_sched()
    tracing: Fix synchronizing to event changes with tracepoint_synchronize_unregister()
    ftrace: Remove unused pointer ftrace_swapper_pid
    tracing: More reverting of "tracing: Centralize preemptirq tracepoints and unify their usage"
    tracing/irqsoff: Handle preempt_count for different configs
    tracing: Partial revert of "tracing: Centralize preemptirq tracepoints and unify their usage"
    tracing: irqsoff: Account for additional preempt_disable
    trace: Use rcu_dereference_raw for hooks from trace-event subsystem
    tracing/kprobes: Fix within_notrace_func() to check only notrace functions
    ...

    Linus Torvalds
     

17 Aug, 2018

5 commits

  • The Linux kernel adopted the SPDX License format headers to ease license
    compliance management, and uses the C++ '//' style comments for the SPDX
    header tags. Some files in the tracing directory used the C style /* */
    comments for them. To be consistent across all files, replace the /* */
    C style SPDX tags with the C++ // SPDX tags.

    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (VMware)
     
  • Add the SPDX License header to ease license compliance management.

    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (VMware)
     
  • Add the SPDX License header to ease license compliance management.

    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (VMware)
     
  • Masami Hiramatsu reported:

    The current trace/enable attribute in sysfs returns an error
    if the user writes the same setting value as the current one,
    e.g.

    # cat /sys/block/sda/trace/enable
    0
    # echo 0 > /sys/block/sda/trace/enable
    bash: echo: write error: Invalid argument
    # echo 1 > /sys/block/sda/trace/enable
    # echo 1 > /sys/block/sda/trace/enable
    bash: echo: write error: Device or resource busy

    But this is not the preferred behavior; writing the same value as
    the current setting should simply be ignored. This fixes the
    problem, as below.

    # cat /sys/block/sda/trace/enable
    0
    # echo 0 > /sys/block/sda/trace/enable
    # echo 1 > /sys/block/sda/trace/enable
    # echo 1 > /sys/block/sda/trace/enable

    Link: http://lkml.kernel.org/r/20180816103802.08678002@gandalf.local.home

    Cc: Ingo Molnar
    Cc: Jens Axboe
    Cc: linux-block@vger.kernel.org
    Cc: stable@vger.kernel.org
    Fixes: cd649b8bb830d ("blktrace: remove sysfs_blk_trace_enable_show/store()")
    Reported-by: Masami Hiramatsu
    Tested-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Jens Axboe

    Steven Rostedt (VMware)
     
  • Add the SPDX License header to ease license compliance management.

    Acked-by: Jens Axboe
    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (VMware)
     

16 Aug, 2018

2 commits

  • The -mcount-nop gcc option generates the calls to the profiling
    functions as nops, which avoids having to patch the mcount call sites
    with NOP instructions initially.

    The -mcount-nop option is activated if the platform selects
    HAVE_NOP_MCOUNT and gcc actually supports it.
    In addition, CC_USING_NOP_MCOUNT is defined and can be used by
    architectures to adapt their ftrace patching behavior.

    Link: http://lkml.kernel.org/r/patch-3.thread-aa7b8d.git-e02ed2dc082b.your-ad-here.call-01533557518-ext-9465@work.hours

    Signed-off-by: Vasily Gorbik
    Signed-off-by: Steven Rostedt (VMware)

    Vasily Gorbik
     
  • Pull printk updates from Petr Mladek:

    - Different vendors have different expectations about console
    quietness. Make it configurable to reduce bike-shedding about the
    upstream default

    - Decide about message visibility when the message is stored. This
    avoids races caused by delayed console handling

    - Always store printk() messages into the per-CPU buffers again in NMI.
    The only exception is when flushing the trace log in panic(). There
    the risk of losing messages is worth an eventual reordering

    - Handle invalid %pO printf modifiers correctly

    - Better handle %p printf modifier tests before crng is initialized

    - Some clean up

    * tag 'printk-for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
    lib/vsprintf: Do not handle %pO[^F] as %px
    printk: Fix warning about unused suppress_message_printing
    printk/nmi: Prevent deadlock when accessing the main log buffer in NMI
    printk: Create helper function to queue deferred console handling
    printk: Split the code for storing a message into the log buffer
    printk: Clean up syslog_print_all()
    printk: Remove unnecessary kmalloc() from syslog during clear
    printk: Make CONSOLE_LOGLEVEL_QUIET configurable
    printk: make sure to print log on console.
    lib/test_printf.c: accept "ptrval" as valid result for plain 'p' tests

    Linus Torvalds
     

15 Aug, 2018

2 commits

  • Pull documentation update from Jonathan Corbet:
    "This was a moderately busy cycle for docs, with the usual collection
    of small fixes and updates.

    We also have new ktime_get_*() docs from Arnd, some kernel-doc fixes,
    a new set of Italian translations (I don't know if it's worth it, but
    it doesn't hurt - let's hope for the best), and some extensive early memory-management
    documentation improvements from Mike Rapoport"

    * tag 'docs-4.19' of git://git.lwn.net/linux: (52 commits)
    Documentation: corrections to console/console.txt
    Documentation: add ioctl number entry for v4l2-subdev.h
    Remove gendered language from management style documentation
    scripts/kernel-doc: Escape all literal braces in regexes
    docs/mm: add description of boot time memory management
    docs/mm: memblock: add overview documentation
    docs/mm: memblock: add kernel-doc description for memblock types
    docs/mm: memblock: add kernel-doc comments for memblock_add[_node]
    docs/mm: memblock: update kernel-doc comments
    mm/memblock: add a name for memblock flags enumeration
    docs/mm: bootmem: add overview documentation
    docs/mm: bootmem: add kernel-doc description of 'struct bootmem_data'
    docs/mm: bootmem: fix kernel-doc warnings
    docs/mm: nobootmem: fixup kernel-doc comments
    mm/bootmem: drop duplicated kernel-doc comments
    Documentation: vm.txt: Adding 'nr_hugepages_mempolicy' parameter description.
    doc:it_IT: translation for kernel-hacking
    docs: Fix the reference labels in Locking.rst
    doc: tracing: Fix a typo of trace_stat
    mm: Introduce new type vm_fault_t
    ...

    Linus Torvalds
     
  • Pull block updates from Jens Axboe:
    "First pull request for this merge window, there will also be a
    followup request with some stragglers.

    This pull request contains:

    - Fix for a thundering herd issue in the wbt block code (Anchal
    Agarwal)

    - A few NVMe pull requests:
    * Improved tracepoints (Keith)
    * Larger inline data support for RDMA (Steve Wise)
    * RDMA setup/teardown fixes (Sagi)
    * Effects log support for NVMe target (Chaitanya Kulkarni)
    * Buffered IO support for NVMe target (Chaitanya Kulkarni)
    * TP4004 (ANA) support (Christoph)
    * Various NVMe fixes

    - Block io-latency controller support. Much needed support for
    properly containing block devices. (Josef)

    - Series improving how we handle sense information on the stack
    (Kees)

    - Lightnvm fixes and updates/improvements (Mathias/Javier et al)

    - Zoned device support for null_blk (Matias)

    - AIX partition fixes (Mauricio Faria de Oliveira)

    - DIF checksum code made generic (Max Gurtovoy)

    - Add support for discard in iostats (Michael Callahan / Tejun)

    - Set of updates for BFQ (Paolo)

    - Removal of async write support for bsg (Christoph)

    - Bio page dirtying and clone fixups (Christoph)

    - Set of bcache fix/changes (via Coly)

    - Series improving blk-mq queue setup/teardown speed (Ming)

    - Series improving merging performance on blk-mq (Ming)

    - Lots of other fixes and cleanups from a slew of folks"

    * tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
    blkcg: Make blkg_root_lookup() work for queues in bypass mode
    bcache: fix error setting writeback_rate through sysfs interface
    null_blk: add lock drop/acquire annotation
    Blk-throttle: reduce tail io latency when iops limit is enforced
    block: paride: pd: mark expected switch fall-throughs
    block: Ensure that a request queue is dissociated from the cgroup controller
    block: Introduce blk_exit_queue()
    blkcg: Introduce blkg_root_lookup()
    block: Remove two superfluous #include directives
    blk-mq: count the hctx as active before allocating tag
    block: bvec_nr_vecs() returns value for wrong slab
    bcache: trivial - remove tailing backslash in macro BTREE_FLAG
    bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
    bcache: set max writeback rate when I/O request is idle
    bcache: add code comments for bset.c
    bcache: fix mistaken comments in request.c
    bcache: fix mistaken code comments in bcache.h
    bcache: add a comment in super.c
    bcache: avoid unncessary cache prefetch bch_btree_node_get()
    bcache: display rate debug parameters to 0 when writeback is not running
    ...

    Linus Torvalds