21 Jul, 2014

1 commit

  • The "uptime" trace clock added in:

    commit 8aacf017b065a805d27467843490c976835eb4a5
    tracing: Add "uptime" trace clock that uses jiffies

    has wraparound problems when the system has been up more
    than 1 hour 11 minutes and 34 seconds. It converts jiffies
    to nanoseconds using:
    (u64)jiffies_to_usecs(jiffy) * 1000ULL
    but since jiffies_to_usecs() only returns a 32-bit value, it
    truncates at 2^32 microseconds. An additional problem on 32-bit
    systems is that the argument is "unsigned long", so fixing the
    return value only helps until 2^32 jiffies (49.7 days on a HZ=1000
    system).
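
    As a minimal userspace model of that truncation (HZ, the helper name and
    the numbers below are assumptions for illustration; this is not the
    kernel code):

    #include <stdio.h>
    #include <stdint.h>

    #define HZ 1000

    /* Models the old behavior: jiffies_to_usecs() returns only 32 bits. */
    static uint32_t jiffies_to_usecs_32(unsigned long j)
    {
            return (uint32_t)(j * (1000000 / HZ));
    }

    int main(void)
    {
            /* ~4295 seconds of jiffies at HZ=1000: just past 2^32 microseconds */
            unsigned long jiffy = 4295UL * 1000;
            uint64_t ns_truncated = (uint64_t)jiffies_to_usecs_32(jiffy) * 1000ULL;
            uint64_t ns_expected  = (uint64_t)jiffy * (1000000000ULL / HZ);

            printf("truncated: %llu ns\n", (unsigned long long)ns_truncated);
            printf("expected : %llu ns\n", (unsigned long long)ns_expected);
            return 0;
    }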

    Avoid these problems by using jiffies_64 as our basis, and
    not converting to nanoseconds (we do convert to clock_t because
    the user-facing API must not depend on internal kernel
    HZ values).

    Link: http://lkml.kernel.org/p/99d63c5bfe9b320a3b428d773825a37095bf6a51.1405708254.git.tony.luck@intel.com

    Cc: stable@vger.kernel.org # 3.10+
    Fixes: 8aacf017b065 "tracing: Add "uptime" trace clock that uses jiffies"
    Signed-off-by: Tony Luck
    Signed-off-by: Steven Rostedt

    Tony Luck
     

16 Jul, 2014

1 commit

  • ring_buffer_poll_wait() should always add the poll_table to its wait_queue,
    even if there is immediate data available. Otherwise, the following epoll and
    read sequence will eventually hang forever:

    1. Put some data to make the trace_pipe ring_buffer read ready first
    2. epoll_ctl(efd, EPOLL_CTL_ADD, trace_pipe_fd, ee)
    3. epoll_wait()
    4. read(trace_pipe_fd) till EAGAIN
    5. Add some more data to the trace_pipe ring_buffer
    6. epoll_wait() -> this epoll_wait() will block forever

    ~ During the epoll_ctl(efd, EPOLL_CTL_ADD,...) call in step 2,
    ring_buffer_poll_wait() returns immediately without adding poll_table,
    which has poll_table->_qproc pointing to ep_poll_callback(), to its
    wait_queue.
    ~ During the epoll_wait() call in step 3 and step 6,
    ring_buffer_poll_wait() cannot add ep_poll_callback() to its wait_queue
    because the poll_table->_qproc is NULL, which is how epoll works.
    ~ When there is new data available in step 6, the ring_buffer does not know
    it has to call ep_poll_callback() because it is not in its wait queue.
    Hence, it blocks forever.

    Other poll implementations seem to call poll_wait() unconditionally as the very
    first thing they do; for example, tcp_poll() in tcp.c.
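
    A minimal userspace reproducer of the sequence above might look like the
    sketch below (error handling omitted; it assumes debugfs is mounted at
    /sys/kernel/debug and that something keeps writing into the trace buffer):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/epoll.h>
    #include <unistd.h>

    int main(void)
    {
            char buf[4096];
            struct epoll_event ee = { .events = EPOLLIN };
            int tfd = open("/sys/kernel/debug/tracing/trace_pipe",
                           O_RDONLY | O_NONBLOCK);
            int efd = epoll_create1(0);

            epoll_ctl(efd, EPOLL_CTL_ADD, tfd, &ee);    /* step 2 */
            epoll_wait(efd, &ee, 1, -1);                /* step 3: data ready */

            while (read(tfd, buf, sizeof(buf)) > 0)     /* step 4: drain to EAGAIN */
                    ;

            /* Steps 5-6: once new data arrives this should wake up, but with
             * the bug it blocks forever because ep_poll_callback() was never
             * added to the ring buffer's wait queue. */
            epoll_wait(efd, &ee, 1, -1);
            printf("woke up\n");
            return 0;
    }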

    Link: http://lkml.kernel.org/p/20140610060637.GA14045@devbig242.prn2.facebook.com

    Cc: stable@vger.kernel.org # 2.6.27
    Fixes: 2a2cc8f7c4d0 "ftrace: allow the event pipe to be polled"
    Reviewed-by: Chris Mason
    Signed-off-by: Martin Lau
    Signed-off-by: Steven Rostedt

    Martin Lau
     

15 Jul, 2014

4 commits

  • The TRACE_ITER_PRINTK check in __trace_puts/__trace_bputs is missing,
    so add it, to be consistent with __trace_printk/__trace_bprintk.
    Those functions are all called by the same function: trace_printk().
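
    A standalone sketch of the idea, as a userspace model (the flag bit and
    printf() are stand-ins for the real kernel internals):

    #include <stdio.h>

    #define TRACE_ITER_PRINTK (1 << 0)              /* assumed flag bit for the demo */
    static unsigned long trace_flags = TRACE_ITER_PRINTK;

    /* Mirrors the fixed __trace_puts(): bail out unless the option is set. */
    static int __trace_puts(unsigned long ip, const char *str, int size)
    {
            if (!(trace_flags & TRACE_ITER_PRINTK)) /* the previously missing check */
                    return 0;
            return printf("0x%lx: %.*s\n", ip, size, str);
    }

    int main(void)
    {
            __trace_puts(0x1234, "hello", 5);
            return 0;
    }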

    Link: http://lkml.kernel.org/p/51E7A7D6.8090900@huawei.com

    Cc: stable@vger.kernel.org # 3.11+
    Signed-off-by: zhangwei(Jovi)
    Signed-off-by: Steven Rostedt

    zhangwei(Jovi)
     
  • Running my ftrace tests on PowerPC, I found that the test which checks
    whether the function_graph tracer is affected by the stack tracer failed:
    it was affected. Looking into this, I found that update_function_graph_func()
    must be called even if the trampoline function is not changed.
    This is because archs like PowerPC do not support ftrace_ops being
    passed by assembly and instead use a helper function (what the
    trampoline function points to). Since this function is not changed
    even when multiple ftrace_ops are added to the code, the check that
    returns early before calling update_function_graph_func() will miss that
    the update must still be done.

    Call update_function_graph_func() for all calls to
    update_ftrace_function().

    Cc: stable@vger.kernel.org # 3.3+
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     
  • Currently the trace option stacktrace is not applicable to
    trace_printk() with a constant string argument; the reason is that
    ftrace_trace_stack() is missing from __trace_puts()/__trace_bputs().

    In contrast, when trace_printk() is used with a non-constant string
    argument (which calls into __trace_printk()/__trace_bprintk()), the
    stacktrace trace option works. This inconsistent behavior
    confuses users a lot.

    Link: http://lkml.kernel.org/p/51E7A7C9.9040401@huawei.com

    Cc: stable@vger.kernel.org # 3.10+
    Signed-off-by: zhangwei(Jovi)
    Signed-off-by: Steven Rostedt

    zhangwei(Jovi)
     
  • The instance_rmdir() path destroys the event files but forgets to free
    file->filter. Change remove_event_file_dir() to call free_event_filter().

    Link: http://lkml.kernel.org/p/20140711190638.GA19517@redhat.com

    Cc: Masami Hiramatsu
    Cc: Namhyung Kim
    Cc: Srikar Dronamraju
    Cc: Tom Zanussi
    Cc: "zhangwei(Jovi)"
    Cc: stable@vger.kernel.org # 3.11+
    Fixes: f6a84bdc75b5 "tracing: Introduce remove_event_file_dir()"
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Steven Rostedt

    Oleg Nesterov
     

01 Jul, 2014

3 commits

  • The usage of uprobe_buffer_enable() added by dcad1a20 is very wrong:

    1. uprobe_buffer_enable() and uprobe_buffer_disable() are not balanced,
    _enable() should be called only if !enabled.

    2. If uprobe_buffer_enable() fails probe_event_enable() should clear
    tp.flags and free event_file_link.

    3. If uprobe_register() fails it should do uprobe_buffer_disable().

    Link: http://lkml.kernel.org/p/20140627170146.GA18332@redhat.com

    Acked-by: Namhyung Kim
    Acked-by: Srikar Dronamraju
    Reviewed-by: Masami Hiramatsu
    Fixes: dcad1a204f72 "tracing/uprobes: Fetch args before reserving a ring buffer"
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Steven Rostedt

    Oleg Nesterov
     
  • I do not know why dd9fa555d7bb "tracing/uprobes: Move argument fetching
    to uprobe_dispatcher()" added the UPROBE_HANDLER_REMOVE, but it looks
    wrong.

    OK, perhaps it makes sense to avoid store_trace_args() if the tracee is
    nacked by uprobe_perf_filter(). But then we should kill the same code
    in uprobe_perf_func() and unify the TRACE/PROFILE filtering (we need to
    do this anyway to mix perf/ftrace). Until then this code actually adds
    a pessimization, because uprobe_perf_filter() will be called twice and
    return true in the likely case.

    Link: http://lkml.kernel.org/p/20140627170143.GA18329@redhat.com

    Acked-by: Namhyung Kim
    Acked-by: Srikar Dronamraju
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Steven Rostedt

    Oleg Nesterov
     
  • This reverts commit 43fe98913c9f67e3b523615ee3316f9520a623e0.

    This patch is very wrong. Firstly, this change leads to unbalanced
    uprobe_unregister(). Just for example,

    # perf probe -x /lib/libc.so.6 syscall
    # echo 1 >> /sys/kernel/debug/tracing/events/probe_libc/enable
    # perf record -e probe_libc:syscall whatever

    after that uprobe is dead (unregistered) but the user of ftrace/perf
    can't know this, and it looks as if nobody hits this probe.

    This would be easy to fix, but there are other reasons why it is not
    simple to mix ftrace and perf. If nothing else, they can't share the
    same ->consumer.filter. This is fixable too, but probably we need to
    fix the poorly designed uprobe_register() interface first. At least
    "register" and "apply" should be clearly separated.

    Link: http://lkml.kernel.org/p/20140627170136.GA18319@redhat.com

    Cc: Tom Zanussi
    Cc: "zhangwei(Jovi)"
    Cc: stable@vger.kernel.org # v3.14
    Acked-by: Namhyung Kim
    Acked-by: Srikar Dronamraju
    Reviewed-by: Masami Hiramatsu
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Steven Rostedt

    Oleg Nesterov
     

13 Jun, 2014

2 commits

  • Pull tracing cleanups and bugfixes from Steven Rostedt:
    "One bug fix that goes back to 3.10. Accessing a non existent buffer
    if "possible cpus" is greater than actual CPUs (including offline
    CPUs).

    Namhyung Kim did some reviews of the patches I sent this merge window
    and found a memory leak and had a few clean ups"

    * tag 'trace-3.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: Fix check of ftrace_trace_arrays list_empty() check
    tracing: Fix leak of per cpu max data in instances
    tracing: Cleanup saved_cmdlines_size changes
    ring-buffer: Check if buffer exists before polling

    Linus Torvalds
     
  • Pull more perf updates from Ingo Molnar:
    "A second round of perf updates:

    - wide reaching kprobes sanitization and robustization, with the hope
    of fixing all 'probe this function crashes the kernel' bugs, by
    Masami Hiramatsu.

    - uprobes updates from Oleg Nesterov: tmpfs support, corner case
    fixes and robustization work.

    - perf tooling updates and fixes from Jiri Olsa, Namhyung Kim, Arnaldo
    et al:
    * Add support to accumulate hist periods (Namhyung Kim)
    * various fixes, refactorings and enhancements"

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits)
    perf: Differentiate exec() and non-exec() comm events
    perf: Fix perf_event_comm() vs. exec() assumption
    uprobes/x86: Rename arch_uprobe->def to ->defparam, minor comment updates
    perf/documentation: Add description for conditional branch filter
    perf/x86: Add conditional branch filtering support
    perf/tool: Add conditional branch filter 'cond' to perf record
    perf: Add new conditional branch filter 'PERF_SAMPLE_BRANCH_COND'
    uprobes: Teach copy_insn() to support tmpfs
    uprobes: Shift ->readpage check from __copy_insn() to uprobe_register()
    perf/x86: Use common PMU interrupt disabled code
    perf/ARM: Use common PMU interrupt disabled code
    perf: Disable sampled events if no PMU interrupt
    perf: Fix use after free in perf_remove_from_context()
    perf tools: Fix 'make help' message error
    perf record: Fix poll return value propagation
    perf tools: Move elide bool into perf_hpp_fmt struct
    perf tools: Remove elide setup for SORT_MODE__MEMORY mode
    perf tools: Fix "==" into "=" in ui_browser__warning assignment
    perf tools: Allow overriding sysfs and proc finding with env var
    perf tools: Consider header files outside perf directory in tags target
    ...

    Linus Torvalds
     

11 Jun, 2014

2 commits

  • The check that tests if ftrace_trace_arrays is empty in
    top_trace_array(), uses the .prev pointer:

    if (list_empty(ftrace_trace_arrays.prev))

    instead of testing the variable itself:

    if (list_empty(&ftrace_trace_arrays))

    Although it is technically correct, it is awkward and confusing.
    Use the proper method.

    Link: http://lkml.kernel.org/r/87oay1bas8.fsf@sejong.aot.lge.com

    Reported-by: Namhyung Kim
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     
  • When max data is configured for an instance, per-cpu data structures are
    created for it. But these are not freed when the instance
    is deleted, which causes a memory leak.

    A new helper function is added that frees the individual buffers within a
    trace array, instead of duplicating the code. This way changes made for one
    are applied to the other (normal buffer vs max buffer).

    Link: http://lkml.kernel.org/r/87k38pbake.fsf@sejong.aot.lge.com

    Reported-by: Namhyung Kim
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     

10 Jun, 2014

3 commits

  • The recent addition of the saved_cmdlines_size file had some remaining
    (minor, mostly coding style) issues. Fix them by passing the pointer name
    to sizeof() and using scnprintf().

    Link: http://lkml.kernel.org/p/1402384295-23680-1-git-send-email-namhyung@kernel.org

    Cc: Namhyung Kim
    Cc: Ingo Molnar
    Cc: Yoshihiro YUNOMAE
    Signed-off-by: Namhyung Kim
    Signed-off-by: Steven Rostedt

    Namhyung Kim
     
  • The per_cpu buffers are created one per possible CPU. But that does not
    mean those CPUs are online, or that they even exist.

    With the addition of ring buffer polling, the code assumes that the
    caller polls on an existing buffer. But this is not the case if
    the user reads trace_pipe for a CPU that does not exist, and this
    causes the kernel to crash.

    The simple fix is to check the cpu against the buffer bitmask to see
    if the buffer was allocated or not, and return -ENODEV if it is
    not.

    More updates were done to pass the -ENODEV back up to userspace.
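
    A rough userspace model of the added check (the bitmask layout, CPU count
    and function name below are assumptions for illustration):

    #include <errno.h>
    #include <stdio.h>

    #define NR_POSSIBLE_CPUS 8

    static unsigned long buffer_cpumask = 0x3;      /* buffers exist for CPUs 0-1 only */

    /* Refuse to poll a per-cpu buffer that was never allocated. */
    static int ring_buffer_poll_check(int cpu)
    {
            if (cpu < 0 || cpu >= NR_POSSIBLE_CPUS ||
                !(buffer_cpumask & (1UL << cpu)))
                    return -ENODEV;
            return 0;
    }

    int main(void)
    {
            printf("cpu 1: %d\n", ring_buffer_poll_check(1));   /* 0       */
            printf("cpu 5: %d\n", ring_buffer_poll_check(5));   /* -ENODEV */
            return 0;
    }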

    Link: http://lkml.kernel.org/r/5393DB61.6060707@oracle.com

    Reported-by: Sasha Levin
    Cc: stable@vger.kernel.org # 3.10+
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     
  • Pull tracing updates from Steven Rostedt:
    "Lots of tweaks, small fixes, optimizations, and some helper functions
    to help out the rest of the kernel to ease their use of trace events.

    The big change for this release is the allowing of other tracers, such
    as the latency tracers, to be used in the trace instances and allow
    for function or function graph tracing to be in the top level
    simultaneously"

    * tag 'trace-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (44 commits)
    tracing: Fix memory leak on instance deletion
    tracing: Fix leak of ring buffer data when new instances creation fails
    tracing/kprobes: Avoid self tests if tracing is disabled on boot up
    tracing: Return error if ftrace_trace_arrays list is empty
    tracing: Only calculate stats of tracepoint benchmarks for 2^32 times
    tracing: Convert stddev into u64 in tracepoint benchmark
    tracing: Introduce saved_cmdlines_size file
    tracing: Add __get_dynamic_array_len() macro for trace events
    tracing: Remove unused variable in trace_benchmark
    tracing: Eliminate double free on failure of allocation on boot up
    ftrace/x86: Call text_ip_addr() instead of the duplicated code
    tracing: Print max callstack on stacktrace bug
    tracing: Move locking of trace_cmdline_lock into start/stop seq calls
    tracing: Try again for saved cmdline if failed due to locking
    tracing: Have saved_cmdlines use the seq_read infrastructure
    tracing: Add tracepoint benchmark tracepoint
    tracing: Print nasty banner when trace_printk() is in use
    tracing: Add funcgraph_tail option to print function name after closing braces
    tracing: Eliminate duplicate TRACE_GRAPH_PRINT_xx defines
    tracing: Add __bitmask() macro to trace events to cpumasks and other bitmasks
    ...

    Linus Torvalds
     

07 Jun, 2014

1 commit

  • When an instance is created, it also gets a snapshot ring buffer
    allocated (with a minimum of pages). But when the instance is deleted, the
    snapshot buffer is not freed. A helper function was added to match the
    allocation of these ring buffers with a way to free them, but it wasn't used
    by the deletion of an instance. Using that helper function solves this
    memory leak.

    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     

06 Jun, 2014

6 commits

  • Yoshihiro Yunomae reported that the ring buffer data for a trace
    instance does not get properly cleaned up when its creation fails. He proposed
    a patch that manually cleaned the data up and added a bunch of labels.
    The labels are not needed because the whole trace array is allocated with
    kzalloc(), which initializes it to 0, and all the kfree()s can take a NULL
    pointer and will ignore it.

    Add a new helper function, free_trace_buffers(), that can also take
    NULL buffers, to free the buffers that were allocated by
    allocate_trace_buffers().

    Link: http://lkml.kernel.org/r/20140605223522.32311.31664.stgit@yunodevel

    Reported-by: Yoshihiro YUNOMAE
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     
  • If tracing is disabled on boot up, the kernel should not execute tracing
    self tests. The kernel should check whether tracing is disabled or not
    before executing any of the tracing self tests.

    Link: http://lkml.kernel.org/p/20140605223520.32311.56097.stgit@yunodevel

    Acked-by: Masami Hiramatsu
    Signed-off-by: Yoshihiro YUNOMAE
    Signed-off-by: Steven Rostedt

    Yoshihiro YUNOMAE
     
  • ftrace_trace_arrays links global_trace.list. However, global_trace
    is not added to ftrace_trace_arrays if trace_alloc_buffers() failed.
    As a result, ftrace_trace_arrays becomes an empty list. If
    ftrace_trace_arrays is an empty list, the current top_trace_array() returns
    an invalid pointer. As a result, the kernel can run into memory corruption
    or panic.

    The current implementation does not check whether ftrace_trace_arrays is an
    empty list or not. So, in this patch, if ftrace_trace_arrays is an empty list,
    top_trace_array() returns NULL. Moreover, this patch makes all functions
    that call top_trace_array() handle it appropriately.

    Link: http://lkml.kernel.org/p/20140605223517.32311.99233.stgit@yunodevel

    Signed-off-by: Yoshihiro YUNOMAE
    Signed-off-by: Steven Rostedt

    Yoshihiro YUNOMAE
     
  • When calculating the average and standard deviation, it is required that
    the count be less than UINT_MAX, otherwise the do_div() will get
    undefined results. After 2^32 counts of data, the average and standard
    deviation should pretty much be set anyway.

    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     
  • I've been told that do_div() expects an unsigned 64-bit number, and
    is undefined if a signed one is used. This gave a warning on the MIPS
    build. I'm not sure if a signed 64-bit dividend is really an issue
    or not, but the calculation this is used for is standard deviation,
    and that isn't going to be negative. We can just convert it to
    unsigned and be safe.

    Reported-by: David Daney
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     
  • Introduce a saved_cmdlines_size file for changing the number of saved pid-comms.
    saved_cmdlines currently stores 128 command names using SAVED_CMDLINES, but
    the names of processes that no longer exist are often already lost from
    saved_cmdlines by the time we read the trace data. So, by introducing the
    saved_cmdlines_size file, we can now raise the 128 saved command names to
    something much larger if needed.

    When we write a value to saved_cmdlines_size, that many entries will
    be stored in the pid-comm list:

    # echo 1024 > /sys/kernel/debug/tracing/saved_cmdlines_size

    Here, 1024 command names can be stored. The default number is 128 and the maximum
    number is PID_MAX_DEFAULT (=32768 if CONFIG_BASE_SMALL is not set). So, if we
    want to avoid losing any command names, we need to set saved_cmdlines_size
    to 32768.

    We can read the current size of the list:

    # cat /sys/kernel/debug/tracing/saved_cmdlines_size
    128

    Link: http://lkml.kernel.org/p/20140605012427.22115.16173.stgit@yunodevel

    Signed-off-by: Yoshihiro YUNOMAE
    Signed-off-by: Steven Rostedt

    Yoshihiro YUNOMAE
     

04 Jun, 2014

1 commit

  • If allocation of the max_buffer fails on boot up, the error path will
    free the per_cpu data structures of both buffers. With the new redesign
    of the code, those structures are already freed if an allocation fails. That is,
    the helper function that allocates the buffers will free the per-cpu data
    on failure, so there is no need to do it again. In fact, the second free will
    cause a bug, as the code cannot handle a double free.

    Link: http://lkml.kernel.org/p/20140603042803.27308.30956.stgit@yunodevel

    Signed-off-by: Yoshihiro YUNOMAE
    Signed-off-by: Steven Rostedt

    Yoshihiro YUNOMAE
     

03 Jun, 2014

1 commit

  • While I was playing with my own feature (e.g., something on the way to reclaim),
    the kernel would easily oops. I guessed that the reason had to do with
    stack overflow and wanted to prove it.

    I discovered the stack tracer, which proved to be very useful for me, but
    the kernel would oops before my user program could gather the information via
    "watch cat /sys/kernel/debug/tracing/stack_trace", so I couldn't get any
    message from it. What I needed was to have the stack tracer emit the
    kernel stack usage before the oops happens, so I could find what was
    hogging the stack.

    This patch shows the callstack of the maximum stack usage right before an
    oops, so we can find the culprit.

    The result is as follows.

    [ 1116.522206] init: lightdm main process (1246) terminated with status 1
    [ 1119.922916] init: failsafe-x main process (1272) terminated with status 1
    [ 3887.728131] kworker/u24:1 (6637) used greatest stack depth: 256 bytes left
    [ 6397.629227] cc1 (9554) used greatest stack depth: 128 bytes left
    [ 7174.467392] Depth Size Location (47 entries)
    [ 7174.467392] ----- ---- --------
    [ 7174.467785] 0) 7248 256 get_page_from_freelist+0xa7/0x920
    [ 7174.468506] 1) 6992 352 __alloc_pages_nodemask+0x1cd/0xb20
    [ 7174.469224] 2) 6640 8 alloc_pages_current+0x10f/0x1f0
    [ 7174.469413] 3) 6632 168 new_slab+0x2c5/0x370
    [ 7174.469413] 4) 6464 8 __slab_alloc+0x3a9/0x501
    [ 7174.469413] 5) 6456 80 __kmalloc+0x1cb/0x200
    [ 7174.469413] 6) 6376 376 vring_add_indirect+0x36/0x200
    [ 7174.469413] 7) 6000 144 virtqueue_add_sgs+0x2e2/0x320
    [ 7174.469413] 8) 5856 288 __virtblk_add_req+0xda/0x1b0
    [ 7174.469413] 9) 5568 96 virtio_queue_rq+0xd3/0x1d0
    [ 7174.469413] 10) 5472 128 __blk_mq_run_hw_queue+0x1ef/0x440
    [ 7174.469413] 11) 5344 16 blk_mq_run_hw_queue+0x35/0x40
    [ 7174.469413] 12) 5328 96 blk_mq_insert_requests+0xdb/0x160
    [ 7174.469413] 13) 5232 112 blk_mq_flush_plug_list+0x12b/0x140
    [ 7174.469413] 14) 5120 112 blk_flush_plug_list+0xc7/0x220
    [ 7174.469413] 15) 5008 64 io_schedule_timeout+0x88/0x100
    [ 7174.469413] 16) 4944 128 mempool_alloc+0x145/0x170
    [ 7174.469413] 17) 4816 96 bio_alloc_bioset+0x10b/0x1d0
    [ 7174.469413] 18) 4720 48 get_swap_bio+0x30/0x90
    [ 7174.469413] 19) 4672 160 __swap_writepage+0x150/0x230
    [ 7174.469413] 20) 4512 32 swap_writepage+0x42/0x90
    [ 7174.469413] 21) 4480 320 shrink_page_list+0x676/0xa80
    [ 7174.469413] 22) 4160 208 shrink_inactive_list+0x262/0x4e0
    [ 7174.469413] 23) 3952 304 shrink_lruvec+0x3e1/0x6a0
    [ 7174.469413] 24) 3648 80 shrink_zone+0x3f/0x110
    [ 7174.469413] 25) 3568 128 do_try_to_free_pages+0x156/0x4c0
    [ 7174.469413] 26) 3440 208 try_to_free_pages+0xf7/0x1e0
    [ 7174.469413] 27) 3232 352 __alloc_pages_nodemask+0x783/0xb20
    [ 7174.469413] 28) 2880 8 alloc_pages_current+0x10f/0x1f0
    [ 7174.469413] 29) 2872 200 __page_cache_alloc+0x13f/0x160
    [ 7174.469413] 30) 2672 80 find_or_create_page+0x4c/0xb0
    [ 7174.469413] 31) 2592 80 ext4_mb_load_buddy+0x1e9/0x370
    [ 7174.469413] 32) 2512 176 ext4_mb_regular_allocator+0x1b7/0x460
    [ 7174.469413] 33) 2336 128 ext4_mb_new_blocks+0x458/0x5f0
    [ 7174.469413] 34) 2208 256 ext4_ext_map_blocks+0x70b/0x1010
    [ 7174.469413] 35) 1952 160 ext4_map_blocks+0x325/0x530
    [ 7174.469413] 36) 1792 384 ext4_writepages+0x6d1/0xce0
    [ 7174.469413] 37) 1408 16 do_writepages+0x23/0x40
    [ 7174.469413] 38) 1392 96 __writeback_single_inode+0x45/0x2e0
    [ 7174.469413] 39) 1296 176 writeback_sb_inodes+0x2ad/0x500
    [ 7174.469413] 40) 1120 80 __writeback_inodes_wb+0x9e/0xd0
    [ 7174.469413] 41) 1040 160 wb_writeback+0x29b/0x350
    [ 7174.469413] 42) 880 208 bdi_writeback_workfn+0x11c/0x480
    [ 7174.469413] 43) 672 144 process_one_work+0x1d2/0x570
    [ 7174.469413] 44) 528 112 worker_thread+0x116/0x370
    [ 7174.469413] 45) 416 240 kthread+0xf3/0x110
    [ 7174.469413] 46) 176 176 ret_from_fork+0x7c/0xb0
    [ 7174.469413] ------------[ cut here ]------------
    [ 7174.469413] kernel BUG at kernel/trace/trace_stack.c:174!
    [ 7174.469413] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
    [ 7174.469413] Dumping ftrace buffer:
    [ 7174.469413] (ftrace buffer empty)
    [ 7174.469413] Modules linked in:
    [ 7174.469413] CPU: 0 PID: 440 Comm: kworker/u24:0 Not tainted 3.14.0+ #212
    [ 7174.469413] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    [ 7174.469413] Workqueue: writeback bdi_writeback_workfn (flush-253:0)
    [ 7174.469413] task: ffff880034170000 ti: ffff880029518000 task.ti: ffff880029518000
    [ 7174.469413] RIP: 0010:[] [] stack_trace_call+0x2de/0x340
    [ 7174.469413] RSP: 0000:ffff880029518290 EFLAGS: 00010046
    [ 7174.469413] RAX: 0000000000000030 RBX: 000000000000002f RCX: 0000000000000000
    [ 7174.469413] RDX: 0000000000000000 RSI: 000000000000002f RDI: ffffffff810b7159
    [ 7174.469413] RBP: ffff8800295182f0 R08: ffffffffffffffff R09: 0000000000000000
    [ 7174.469413] R10: 0000000000000001 R11: 0000000000000001 R12: ffffffff82768dfc
    [ 7174.469413] R13: 000000000000f2e8 R14: ffff8800295182b8 R15: 00000000000000f8
    [ 7174.469413] FS: 0000000000000000(0000) GS:ffff880037c00000(0000) knlGS:0000000000000000
    [ 7174.469413] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [ 7174.469413] CR2: 00002acd0b994000 CR3: 0000000001c0b000 CR4: 00000000000006f0
    [ 7174.469413] Stack:
    [ 7174.469413] 0000000000000000 ffffffff8114fdb7 0000000000000087 0000000000001c50
    [ 7174.469413] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
    [ 7174.469413] 0000000000000002 ffff880034170000 ffff880034171028 0000000000000000
    [ 7174.469413] Call Trace:
    [ 7174.469413] [] ? get_page_from_freelist+0xa7/0x920
    [ 7174.469413] [] ftrace_call+0x5/0x2f
    [ 7174.469413] [] ? next_zones_zonelist+0x5/0x70
    [ 7174.469413] [] ? __bfs+0x11a/0x270
    [ 7174.469413] [] ? next_zones_zonelist+0x5/0x70
    [ 7174.469413] [] ? get_page_from_freelist+0xa7/0x920
    [ 7174.469413] [] ? alloc_pages_current+0x10f/0x1f0
    [ 7174.469413] [] __alloc_pages_nodemask+0x1cd/0xb20
    [ 7174.469413] [] ? check_irq_usage+0x96/0xe0
    [ 7174.469413] [] ? ftrace_call+0x5/0x2f
    [ 7174.469413] [] alloc_pages_current+0x10f/0x1f0
    [ 7174.469413] [] ? new_slab+0x2c5/0x370
    [ 7174.469413] [] new_slab+0x2c5/0x370
    [ 7174.469413] [] ? ftrace_call+0x5/0x2f
    [ 7174.469413] [] __slab_alloc+0x3a9/0x501
    [ 7174.469413] [] ? __kmalloc+0x1cb/0x200
    [ 7174.469413] [] ? vring_add_indirect+0x36/0x200
    [ 7174.469413] [] ? vring_add_indirect+0x36/0x200
    [ 7174.469413] [] ? vring_add_indirect+0x36/0x200
    [ 7174.469413] [] __kmalloc+0x1cb/0x200
    [ 7174.469413] [] ? vring_add_indirect+0x200/0x200
    [ 7174.469413] [] vring_add_indirect+0x36/0x200
    [ 7174.469413] [] virtqueue_add_sgs+0x2e2/0x320
    [ 7174.469413] [] __virtblk_add_req+0xda/0x1b0
    [ 7174.469413] [] virtio_queue_rq+0xd3/0x1d0
    [ 7174.469413] [] __blk_mq_run_hw_queue+0x1ef/0x440
    [ 7174.469413] [] blk_mq_run_hw_queue+0x35/0x40
    [ 7174.469413] [] blk_mq_insert_requests+0xdb/0x160
    [ 7174.469413] [] blk_mq_flush_plug_list+0x12b/0x140
    [ 7174.469413] [] blk_flush_plug_list+0xc7/0x220
    [ 7174.469413] [] ? _raw_spin_unlock_irqrestore+0x3f/0x70
    [ 7174.469413] [] io_schedule_timeout+0x88/0x100
    [ 7174.469413] [] ? io_schedule_timeout+0x5/0x100
    [ 7174.469413] [] mempool_alloc+0x145/0x170
    [ 7174.469413] [] ? __init_waitqueue_head+0x60/0x60
    [ 7174.469413] [] bio_alloc_bioset+0x10b/0x1d0
    [ 7174.469413] [] ? end_swap_bio_read+0xc0/0xc0
    [ 7174.469413] [] ? end_swap_bio_read+0xc0/0xc0
    [ 7174.469413] [] get_swap_bio+0x30/0x90
    [ 7174.469413] [] ? end_swap_bio_read+0xc0/0xc0
    [ 7174.469413] [] __swap_writepage+0x150/0x230
    [ 7174.469413] [] ? do_raw_spin_unlock+0x5/0xa0
    [ 7174.469413] [] ? end_swap_bio_read+0xc0/0xc0
    [ 7174.469413] [] ? __swap_writepage+0x5/0x230
    [ 7174.469413] [] swap_writepage+0x42/0x90
    [ 7174.469413] [] shrink_page_list+0x676/0xa80
    [ 7174.469413] [] ? ftrace_call+0x5/0x2f
    [ 7174.469413] [] shrink_inactive_list+0x262/0x4e0
    [ 7174.469413] [] shrink_lruvec+0x3e1/0x6a0
    [ 7174.469413] [] shrink_zone+0x3f/0x110
    [ 7174.469413] [] ? ftrace_call+0x5/0x2f
    [ 7174.469413] [] do_try_to_free_pages+0x156/0x4c0
    [ 7174.469413] [] try_to_free_pages+0xf7/0x1e0
    [ 7174.469413] [] __alloc_pages_nodemask+0x783/0xb20
    [ 7174.469413] [] alloc_pages_current+0x10f/0x1f0
    [ 7174.469413] [] ? __page_cache_alloc+0x13f/0x160
    [ 7174.469413] [] __page_cache_alloc+0x13f/0x160
    [ 7174.469413] [] find_or_create_page+0x4c/0xb0
    [ 7174.469413] [] ? find_get_page+0x5/0x130
    [ 7174.469413] [] ext4_mb_load_buddy+0x1e9/0x370
    [ 7174.469413] [] ext4_mb_regular_allocator+0x1b7/0x460
    [ 7174.469413] [] ? ext4_mb_use_preallocated+0x40/0x360
    [ 7174.469413] [] ? ftrace_call+0x5/0x2f
    [ 7174.469413] [] ext4_mb_new_blocks+0x458/0x5f0
    [ 7174.469413] [] ext4_ext_map_blocks+0x70b/0x1010
    [ 7174.469413] [] ext4_map_blocks+0x325/0x530
    [ 7174.469413] [] ext4_writepages+0x6d1/0xce0
    [ 7174.469413] [] ? ext4_journalled_write_end+0x330/0x330
    [ 7174.469413] [] do_writepages+0x23/0x40
    [ 7174.469413] [] __writeback_single_inode+0x45/0x2e0
    [ 7174.469413] [] writeback_sb_inodes+0x2ad/0x500
    [ 7174.469413] [] __writeback_inodes_wb+0x9e/0xd0
    [ 7174.469413] [] wb_writeback+0x29b/0x350
    [ 7174.469413] [] ? __local_bh_enable_ip+0x6d/0xd0
    [ 7174.469413] [] bdi_writeback_workfn+0x11c/0x480
    [ 7174.469413] [] ? process_one_work+0x170/0x570
    [ 7174.469413] [] process_one_work+0x1d2/0x570
    [ 7174.469413] [] ? process_one_work+0x170/0x570
    [ 7174.469413] [] worker_thread+0x116/0x370
    [ 7174.469413] [] ? manage_workers.isra.19+0x2e0/0x2e0
    [ 7174.469413] [] kthread+0xf3/0x110
    [ 7174.469413] [] ? flush_kthread_worker+0x150/0x150
    [ 7174.469413] [] ret_from_fork+0x7c/0xb0
    [ 7174.469413] [] ? flush_kthread_worker+0x150/0x150
    [ 7174.469413] Code: c0 49 bc fc 8d 76 82 ff ff ff ff e8 44 5a 5b 00 31 f6 8b 05 95 2b b3 00 48 39 c6 7d 0e 4c 8b 04 f5 20 5f c5 81 49 83 f8 ff 75 11 0b 48 63 05 71 5a 64 01 48 29 c3 e9 d0 fd ff ff 48 8d 5e 01
    [ 7174.469413] RIP [] stack_trace_call+0x2de/0x340
    [ 7174.469413] RSP
    [ 7174.469413] ---[ end trace c97d325b36b718f3 ]---

    Link: http://lkml.kernel.org/p/1401683592-1651-1-git-send-email-minchan@kernel.org

    Signed-off-by: Minchan Kim
    Signed-off-by: Steven Rostedt

    Minchan Kim
     

31 May, 2014

1 commit

  • With the conversion of the saved_cmdlines output to use seq_read, there
    is now a race between accessing the values of the saved_cmdlines and
    the writing to them. The trace_cmdline_lock needs to be taken at
    the start and stop of the seq calls.

    A new __trace_find_cmdline() call is created to allow for the look up
    to happen without taking the lock.

    Fixes: 42584c81c5ad "tracing: Have saved_cmdlines use the seq_read infrastructure"
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     

30 May, 2014

4 commits

  • In order to prevent the saved cmdline cache from being filled when
    tracing is not active, the comms are only recorded after a trace event
    is recorded.

    The problem is, a comm can fail to be recorded if the trace_cmdline_lock
    is held. That lock is taken via a trylock to allow it to happen from
    any context (including NMI). If the lock fails to be taken, the comm
    is skipped. No big deal, as we will try again later.

    But! Because of the code that was added to only record after an event,
    we may not try again later as the recording is made as a oneshot per
    event per CPU.

    Only disable the recording of the comm if the comm is actually recorded.

    Fixes: 7ffbd48d5cab "tracing: Cache comms only after an event occurred"
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     
  • The current tracing_saved_cmdlines_read() implementation is naive; it allocates
    a large buffer, constructs the output data in that buffer for each read
    operation, and then copies a portion of the buffer to the user-space
    buffer. This has several issues, such as slow memory allocation, high
    CPU usage, and even corruption of the output data.

    The seq_read infrastructure is made to handle this type of work.
    By converting it to use seq_read(), the code becomes smaller and simpler,
    as well as correct.

    Link: http://lkml.kernel.org/p/20140220084431.3839.51793.stgit@yunodevel

    Signed-off-by: Hidehiro Kawai
    Cc: Namhyung Kim
    Cc: Masami Hiramatsu
    Signed-off-by: Yoshihiro YUNOMAE
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Steven Rostedt

    Yoshihiro YUNOMAE
     
  • In order to help benchmark the time tracepoints take, a new config
    option is added called CONFIG_TRACEPOINT_BENCHMARK. When this option
    is set, a tracepoint is created called "benchmark:benchmark_event".
    When the tracepoint is enabled, it kicks off a kernel thread that
    goes into an infinite loop (calling cond_resched() to let other tasks
    run), and calls the tracepoint. Each iteration will record the time
    it took to write to the tracepoint, and on the next iteration that
    data will be passed to the tracepoint itself. That is, the tracepoint
    will report the time it took to do the previous tracepoint.
    The string written to the tracepoint is a static string of 128 bytes
    to keep the time the same. The initial string is simply a write of
    "START". The second string records the cold cache time of the first
    write, which is not added to the rest of the calculations.

    As it is a tight loop, it benchmarks as hot cache. That's fine because
    we care most about hot paths that are probably in cache already.
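
    A userspace sketch of that measurement loop (clock_gettime() stands in
    for the kernel's trace clock, and printf() for the tracepoint write):

    #include <stdio.h>
    #include <time.h>

    static unsigned long long now_ns(void)
    {
            struct timespec ts;

            clock_gettime(CLOCK_MONOTONIC, &ts);
            return (unsigned long long)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
    }

    int main(void)
    {
            char msg[128] = "START";
            unsigned long long start, delta;
            int i;

            for (i = 0; i < 10; i++) {
                    start = now_ns();
                    printf("%s\n", msg);            /* the "tracepoint" write */
                    delta = now_ns() - start;
                    /* the next iteration reports how long this write took */
                    snprintf(msg, sizeof(msg), "last=%llu", delta);
            }
            return 0;
    }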

    An example of the output:

    START
    first=3672 [COLD CACHED]
    last=632 first=3672 max=632 min=632 avg=316 std=446 std^2=199712
    last=278 first=3672 max=632 min=278 avg=303 std=316 std^2=100337
    last=277 first=3672 max=632 min=277 avg=296 std=258 std^2=67064
    last=273 first=3672 max=632 min=273 avg=292 std=224 std^2=50411
    last=273 first=3672 max=632 min=273 avg=288 std=200 std^2=40389
    last=281 first=3672 max=632 min=273 avg=287 std=183 std^2=33666

    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     
  • trace_printk() is used to debug fast paths within the kernel, places
    that get called in any context (interrupt or NMI) or thousands of
    times a second. That is something you do not want to do with a printk().

    In order to make it completely lockless, as it needs a temporary buffer
    to handle some of the string formatting, a page is created per cpu for
    every context (four per cpu: normal, softirq, irq, NMI).

    Since trace_printk() should only be used for debugging purposes,
    there's no reason to waste memory on these buffers on a production
    system. That means, trace_printk() should never be used unless a
    developer is debugging their kernel. There's macro magic to allocate
    the buffers if trace_printk() is used anywhere in the kernel.

    To help enforce that trace_printk() isn't used outside of development,
    when it is used, a nasty banner is displayed on bootup (or when a module
    is loaded that uses trace_printk() and the kernel core does not).

    Here's the banner:

    **********************************************************
    **   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **
    **                                                      **
    ** trace_printk() being used. Allocating extra memory.  **
    **                                                      **
    ** This means that this is a DEBUG kernel and it is     **
    ** unsafe for production use.                           **
    **                                                      **
    ** If you see this message and you are not debugging    **
    ** the kernel, report this immediately to your vendor!  **
    **                                                      **
    **   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **
    **********************************************************

    That should hopefully keep developers from trying to sneak in a
    trace_printk() or two.

    Link: http://lkml.kernel.org/p/20140528131440.2283213c@gandalf.local.home

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

21 May, 2014

2 commits

  • In the function-graph tracer, add a funcgraph_tail option
    to print the function name on all } lines, not just
    functions whose first line is no longer in the trace
    buffer.

    If a function calls other traced functions, its total
    time appears on its } line. This change allows grep
    to be used to determine the function to which the
    line corresponds.

    Update Documentation/trace/ftrace.txt to describe
    this new option.

    Link: http://lkml.kernel.org/p/20140520221041.8359.6782.stgit@beardog.cce.hp.com

    Signed-off-by: Robert Elliott
    Signed-off-by: Steven Rostedt

    Robert Elliott
     
  • Eliminate duplicate TRACE_GRAPH_PRINT_xx defines
    in trace_functions_graph.c that are already in
    trace.h.

    Add TRACE_GRAPH_PRINT_IRQS to trace.h, which is
    the only one that is missing.

    Link: http://lkml.kernel.org/p/20140520221031.8359.24733.stgit@beardog.cce.hp.com

    Signed-off-by: Robert Elliott
    Signed-off-by: Steven Rostedt

    Robert Elliott
     

15 May, 2014

1 commit

  • Being able to show a cpumask of events can be useful, as some events
    may affect only some CPUs. There is no standard way to record the
    cpumask, and converting it to a string is rather expensive during
    the trace, as traces happen in hot paths. It would be better to record
    the raw event mask and be able to parse it at print time.

    The following macros were added for use with the TRACE_EVENT() macro:

    __bitmask()
    __assign_bitmask()
    __get_bitmask()

    To test this, I added this to the sched_migrate_task event, which
    looked like this:

    TRACE_EVENT(sched_migrate_task,

            TP_PROTO(struct task_struct *p, int dest_cpu, const struct cpumask *cpus),

            TP_ARGS(p, dest_cpu, cpus),

            TP_STRUCT__entry(
                    __array(   char,  comm,  TASK_COMM_LEN  )
                    __field(   pid_t, pid                   )
                    __field(   int,   prio                  )
                    __field(   int,   orig_cpu              )
                    __field(   int,   dest_cpu              )
                    __bitmask( cpumask, num_possible_cpus() )
            ),

            TP_fast_assign(
                    memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
                    __entry->pid      = p->pid;
                    __entry->prio     = p->prio;
                    __entry->orig_cpu = task_cpu(p);
                    __entry->dest_cpu = dest_cpu;
                    __assign_bitmask(cpumask, cpumask_bits(cpus), num_possible_cpus());
            ),

            TP_printk("comm=%s pid=%d prio=%d orig_cpu=%d dest_cpu=%d cpumask=%s",
                      __entry->comm, __entry->pid, __entry->prio,
                      __entry->orig_cpu, __entry->dest_cpu,
                      __get_bitmask(cpumask))
    );

    With the output of:

    ksmtuned-3613 [003] d..2 485.220508: sched_migrate_task: comm=ksmtuned pid=3615 prio=120 orig_cpu=3 dest_cpu=2 cpumask=00000000,0000000f
    migration/1-13 [001] d..5 485.221202: sched_migrate_task: comm=ksmtuned pid=3614 prio=120 orig_cpu=1 dest_cpu=0 cpumask=00000000,0000000f
    awk-3615 [002] d.H5 485.221747: sched_migrate_task: comm=rcu_preempt pid=7 prio=120 orig_cpu=0 dest_cpu=1 cpumask=00000000,000000ff
    migration/2-18 [002] d..5 485.222062: sched_migrate_task: comm=ksmtuned pid=3615 prio=120 orig_cpu=2 dest_cpu=3 cpumask=00000000,0000000f

    Link: http://lkml.kernel.org/r/1399377998-14870-6-git-send-email-javi.merino@arm.com
    Link: http://lkml.kernel.org/r/20140506132238.22e136d1@gandalf.local.home

    Suggested-by: Javi Merino
    Tested-by: Javi Merino
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     

14 May, 2014

3 commits

  • As the decision of what needs to be done (converting a call to
    ftrace_caller into ftrace_caller_regs, or converting from ftrace_caller_regs
    back to ftrace_caller) can easily be determined from the rec->flags
    FTRACE_FL_REGS and FTRACE_FL_REGS_EN, there's no need to have
    ftrace_check_record() return either a UPDATE_MODIFY_CALL_REGS or a
    UPDATE_MODIFY_CALL. Just the latter is enough. This added flag causes
    more complexity than is required. Remove it.

    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     
  • With the moving of the functions that determine what the mcount call site
    should be replaced with into the generic code, there are a few places
    in the generic code that can use them instead of hard coding it as it
    currently does.

    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     
  • Move and rename get_ftrace_addr() and get_ftrace_addr_old() to
    ftrace_get_addr_new() and ftrace_get_addr_curr() respectively.

    This moves these two helper functions in the generic code out from
    the arch specific code, and renames them to have a better generic
    name. This will allow other archs to use them as well as makes it
    a bit easier to work on getting separate trampolines for different
    functions.

    ftrace_get_addr_new() returns the trampoline address that the mcount
    call address will be converted to.

    ftrace_get_addr_curr() returns the trampoline address of what the
    mcount call address currently jumps to.

    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)