04 Aug, 2016

1 commit

  • Pull tracing fixes from Steven Rostedt:
    "A few updates and fixes:

    - move the suppression of the __builtin_return_address(>0) warning to
    the tracing directory only.

    - metag recordmcount fix for newer versions of glibc

    - two tracing histogram fixes that were reported by KASAN"

    * tag 'trace-v4.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing: Fix use-after-free in hist_register_trigger()
    tracing: Fix use-after-free in hist_unreg_all/hist_enable_unreg_all
    Makefile: Mute warning for __builtin_return_address(>0) for tracing only
    ftrace/recordmcount: Work around for addition of metag magic but not relocations

    Linus Torvalds
     

03 Aug, 2016

3 commits

  • This fixes a use-after-free case flagged by KASAN; make sure the test
    happens before the potential free in this case.

    Link: http://lkml.kernel.org/r/48fd74ab61bebd7dca9714386bb47d7c5ccd6a7b.1467247517.git.tom.zanussi@linux.intel.com

    Signed-off-by: Tom Zanussi
    Signed-off-by: Steven Rostedt

    Tom Zanussi
     
  • While running the tools/testing/selftests test suite with KASAN, Dmitry
    Vyukov hit the following use-after-free report:

    ==================================================================
    BUG: KASAN: use-after-free in hist_unreg_all+0x1a1/0x1d0 at addr
    ffff880031632cc0
    Read of size 8 by task ftracetest/7413
    ==================================================================
    BUG kmalloc-128 (Not tainted): kasan: bad access detected
    ------------------------------------------------------------------

    This fixes the problem, along with the same problem in
    hist_enable_unreg_all().
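
    The fix follows the usual shape for this class of bug. A hedged sketch
    (helper and field names as in the 4.8-era trigger code; details may
    differ from the exact patch):

    static void hist_unreg_all(struct trace_event_file *file)
    {
            struct event_trigger_data *test, *n;

            /* The _safe variant keeps a lookahead cursor, so freeing
             * the current entry no longer poisons the next iteration. */
            list_for_each_entry_safe(test, n, &file->triggers, list) {
                    if (test->cmd_ops->trigger_type == ETT_EVENT_HIST) {
                            list_del_rcu(&test->list);
                            trace_event_trigger_enable_disable(file, 0);
                            update_cond_flag(file);
                            if (test->ops->free)
                                    test->ops->free(test->ops, test);
                    }
            }
    }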

    Link: http://lkml.kernel.org/r/c3d05b79e42555b6e36a3a99aae0e37315ee5304.1467247517.git.tom.zanussi@linux.intel.com

    Cc: Dmitry Vyukov
    [Copied Steve's hist_enable_unreg_all() fix to hist_unreg_all()]
    Signed-off-by: Tom Zanussi
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • Recent gcc compilers give a warning if the __builtin_return_address()
    parameter is greater than 0. That is because if it is used by a function
    called by a top level function (or, in the case of the kernel, by
    assembly), it can try to access stack frames outside the stack and crash
    the system.

    The tracing system uses __builtin_return_address() with values of up to
    2! But it is well aware of the dangers that this may have, and has even
    added precautions to protect against it (see the thunk code in
    arch/x86/entry/thunk*.S).

    Linus originally added KBUILD_CFLAGS that would suppress the warning for the
    entire kernel, as simply adding KBUILD_CFLAGS to the tracing directory
    wouldn't work. The tracing directory plays a bit with the CFLAGS and
    requires a little more logic.

    This adds that special logic to only suppress the warning for the tracing
    directory. If it is used anywhere else outside of tracing, the warning will
    still be triggered.
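
    For illustration, a minimal userspace program that trips the same
    warning; with gcc 6+ and -Wall (which enables -Wframe-address), any
    nonzero argument warns, because walking past the current frame may read
    outside the stack:

    #include <stdio.h>

    __attribute__((noinline)) static void *two_frames_up(void)
    {
            /* warning: calling '__builtin_return_address' with a nonzero
             * argument is unsafe [-Wframe-address] */
            return __builtin_return_address(1);
    }

    __attribute__((noinline)) static void middle(void)
    {
            printf("caller of middle(): %p\n", two_frames_up());
    }

    int main(void)
    {
            middle();
            return 0;
    }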

    Link: http://lkml.kernel.org/r/20160728223043.51996267@grimm.local.home

    Tested-by: Linus Torvalds
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

29 Jul, 2016

1 commit

  • Pull tracing updates from Steven Rostedt:
    "This is mostly clean ups and small fixes. Some of the more visible
    changes are:

    - The function pid code uses the event pid filtering logic
    - [ku]probe events have access to current->comm
    - trace_printk now has sample code
    - PCI devices now trace physical addresses
    - stack tracing has fewer unnecessary functions traced"

    * tag 'trace-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    printk, tracing: Avoiding unneeded blank lines
    tracing: Use __get_str() when manipulating strings
    tracing, RAS: Cleanup on __get_str() usage
    tracing: Use outer () on __get_str() definition
    ftrace: Reduce size of function graph entries
    tracing: Have HIST_TRIGGERS select TRACING
    tracing: Using for_each_set_bit() to simplify trace_pid_write()
    ftrace: Move toplevel init out of ftrace_init_tracefs()
    tracing/function_graph: Fix filters for function_graph threshold
    tracing: Skip more functions when doing stack tracing of events
    tracing: Expose CPU physical addresses (resource values) for PCI devices
    tracing: Show the preempt count of when the event was called
    tracing: Add trace_printk sample code
    tracing: Choose static tp_printk buffer by explicit nesting count
    tracing: expose current->comm to [ku]probe events
    ftrace: Have set_ftrace_pid use the bitmap like events do
    tracing: Move pid_list write processing into its own function
    tracing: Move the pid_list seq_file functions to be global
    tracing: Move filtered_pid helper functions into trace.c
    tracing: Make the pid filtering helper functions global

    Linus Torvalds
     

28 Jul, 2016

1 commit

  • Pull networking updates from David Miller:

    1) Unified UDP encapsulation offload methods for drivers, from
    Alexander Duyck.

    2) Make DSA binding more sane, from Andrew Lunn.

    3) Support QCA9888 chips in ath10k, from Anilkumar Kolli.

    4) Several workqueue usage cleanups, from Bhaktipriya Shridhar.

    5) Add XDP (eXpress Data Path), essentially running BPF programs on RX
    packets as soon as the device sees them, with the option to mirror
    the packet on TX via the same interface. From Brenden Blanco and
    others.

    6) Allow qdisc/class stats dumps to run lockless, from Eric Dumazet.

    7) Add VLAN support to b53 and bcm_sf2, from Florian Fainelli.

    8) Simplify netlink conntrack entry layout, from Florian Westphal.

    9) Add ipv4 forwarding support to mlxsw spectrum driver, from Ido
    Schimmel, Yotam Gigi, and Jiri Pirko.

    10) Add SKB array infrastructure and convert tun and macvtap over to it.
    From Michael S Tsirkin and Jason Wang.

    11) Support qdisc packet injection in pktgen, from John Fastabend.

    12) Add neighbour monitoring framework to TIPC, from Jon Paul Maloy.

    13) Add NV congestion control support to TCP, from Lawrence Brakmo.

    14) Add GSO support to SCTP, from Marcelo Ricardo Leitner.

    15) Allow GRO and RPS to function on macsec devices, from Paolo Abeni.

    16) Support MPLS over IPV4, from Simon Horman.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits)
    xgene: Fix build warning with ACPI disabled.
    be2net: perform temperature query in adapter regardless of its interface state
    l2tp: Correctly return -EBADF from pppol2tp_getname.
    net/mlx5_core/health: Remove deprecated create_singlethread_workqueue
    net: ipmr/ip6mr: update lastuse on entry change
    macsec: ensure rx_sa is set when validation is disabled
    tipc: dump monitor attributes
    tipc: add a function to get the bearer name
    tipc: get monitor threshold for the cluster
    tipc: make cluster size threshold for monitoring configurable
    tipc: introduce constants for tipc address validation
    net: neigh: disallow transition to NUD_STALE if lladdr is unchanged in neigh_update()
    MAINTAINERS: xgene: Add driver and documentation path
    Documentation: dtb: xgene: Add MDIO node
    dtb: xgene: Add MDIO node
    drivers: net: xgene: ethtool: Use phy_ethtool_gset and sset
    drivers: net: xgene: Use exported functions
    drivers: net: xgene: Enable MDIO driver
    drivers: net: xgene: Add backward compatibility
    drivers: net: phy: xgene: Add MDIO driver
    ...

    Linus Torvalds
     

27 Jul, 2016

2 commits

  • Pull block driver updates from Jens Axboe:
    "This branch also contains core changes. I've come to the conclusion
    that from 4.9 and forward, I'll be doing just a single branch. We
    often have dependencies between core and drivers, and it's hard to
    always split them up appropriately without pulling core into drivers
    when that happens.

    That said, this contains:

    - separate secure erase type for the core block layer, from
    Christoph.

    - set of discard fixes, from Christoph.

    - bio shrinking fixes from Christoph, as a followup to the
    op/flags change in the core branch.

    - map and append request fixes from Christoph.

    - NVMeF (NVMe over Fabrics) code from Christoph. This is pretty
    exciting!

    - nvme-loop fixes from Arnd.

    - removal of ->driverfs_dev from Dan, after providing a
    device_add_disk() helper.

    - bcache fixes from Bhaktipriya and Yijing.

    - cdrom subchannel read fix from Vchannaiah.

    - set of lightnvm updates from Wenwei, Matias, Johannes, and Javier.

    - set of drbd updates and fixes from Fabian, Lars, and Philipp.

    - mg_disk error path fix from Bart.

    - user notification for failed device add for loop, from Minfei.

    - NVMe in general:
    + NVMe delay quirk from Guilherme.
    + SR-IOV support and command retry limits from Keith.
    + fix for memory-less NUMA node from Masayoshi.
    + use UINT_MAX for discard sectors, from Minfei.
    + cancel IO fixes from Ming.
    + don't allocate unused major, from Neil.
    + error code fixup from Dan.
    + use constants for PSDT/FUSE from James.
    + variable init fix from Jay.
    + fabrics fixes from Ming, Sagi, and Wei.
    + various fixes"

    * 'for-4.8/drivers' of git://git.kernel.dk/linux-block: (115 commits)
    nvme/pci: Provide SR-IOV support
    nvme: initialize variable before logical OR'ing it
    block: unexport various bio mapping helpers
    scsi/osd: open code blk_make_request
    target: stop using blk_make_request
    block: simplify and export blk_rq_append_bio
    block: ensure bios return from blk_get_request are properly initialized
    virtio_blk: use blk_rq_map_kern
    memstick: don't allow REQ_TYPE_BLOCK_PC requests
    block: shrink bio size again
    block: simplify and cleanup bvec pool handling
    block: get rid of bio_rw and READA
    block: don't ignore -EOPNOTSUPP blkdev_issue_write_same
    block: introduce BLKDEV_DISCARD_ZERO to fix zeroout
    NVMe: don't allocate unused nvme_major
    nvme: avoid crashes when node 0 is memoryless node.
    nvme: Limit command retries
    loop: Make user notify for adding loop device failed
    nvme-loop: fix nvme-loop Kconfig dependencies
    nvmet: fix return value check in nvmet_subsys_alloc()
    ...

    Linus Torvalds
     
  • Pull core block updates from Jens Axboe:

    - the big change is the cleanup from Mike Christie, cleaning up our
    uses of command types and modified flags. This is what will throw
    some merge conflicts

    - regression fix for the above for btrfs, from Vincent

    - following up to the above, better packing of struct request from
    Christoph

    - a 2038 fix for blktrace from Arnd

    - a few trivial/spelling fixes from Bart Van Assche

    - a front merge check fix from Damien, which could cause issues on
    SMR drives

    - Atari partition fix from Gabriel

    - convert cfq to highres timers, since jiffies isn't granular enough
    for some devices these days. From Jan and Jeff

    - CFQ priority boost fix for idle classes, from me

    - cleanup series from Ming, improving our bio/bvec iteration

    - a direct issue fix for blk-mq from Omar

    - fix for plug merging not involving the IO scheduler, like we do for
    other types of merges. From Tahsin

    - expose DAX type internally and through sysfs. From Toshi and Yigal

    * 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits)
    block: Fix front merge check
    block: do not merge requests without consulting with io scheduler
    block: Fix spelling in a source code comment
    block: expose QUEUE_FLAG_DAX in sysfs
    block: add QUEUE_FLAG_DAX for devices to advertise their DAX support
    Btrfs: fix comparison in __btrfs_map_block()
    block: atari: Return early for unsupported sector size
    Doc: block: Fix a typo in queue-sysfs.txt
    cfq-iosched: Charge at least 1 jiffie instead of 1 ns
    cfq-iosched: Fix regression in bonnie++ rewrite performance
    cfq-iosched: Convert slice_resid from u64 to s64
    block: Convert fifo_time from ulong to u64
    blktrace: avoid using timespec
    block/blk-cgroup.c: Declare local symbols static
    block/bio-integrity.c: Add #include "blk.h"
    block/partition-generic.c: Remove a set-but-not-used variable
    block: bio: kill BIO_MAX_SIZE
    cfq-iosched: temporarily boost queue priority for idle classes
    block: drbd: avoid to use BIO_MAX_SIZE
    block: bio: remove BIO_MAX_SECTORS
    ...

    Linus Torvalds
     

26 Jul, 2016

1 commit

  • This allows user memory to be written to during the course of a kprobe.
    It shouldn't be used to implement any kind of security mechanism
    because of TOCTOU (time-of-check to time-of-use) attacks, but rather to
    debug, divert, and manipulate execution of semi-cooperative processes.

    Although it uses probe_kernel_write, we limit the address space
    the probe can write into by checking the space with access_ok.
    We do this as opposed to calling copy_to_user directly, in order
    to avoid sleeping. In addition we ensure the thread's current fs /
    segment is USER_DS and the thread isn't exiting nor a kernel thread.

    Given this feature is meant for experiments, and it has a risk of
    crashing the system and running programs, we print a warning, along
    with the pid and process name, when a program that attempts to use
    this helper is installed.
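
    A hedged sketch of that guard sequence (it mirrors the 4.8-era helper,
    but is illustrative rather than the verbatim upstream code):

    static u64 bpf_probe_write_user(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
    {
            void *unsafe_ptr = (void *)(long) r1;
            void *src = (void *)(long) r2;
            int size = (int) r3;

            /* No kernel threads or exiting tasks: their mm is unreliable. */
            if (unlikely(in_interrupt() ||
                         current->flags & (PF_KTHREAD | PF_EXITING)))
                    return -EPERM;
            /* Only a plain user segment may be written through. */
            if (unlikely(segment_eq(get_fs(), KERNEL_DS)))
                    return -EPERM;
            /* Bound the destination to the user address range. */
            if (!access_ok(VERIFY_WRITE, unsafe_ptr, size))
                    return -EPERM;

            /* probe_kernel_write() never sleeps, unlike copy_to_user(). */
            return probe_kernel_write(unsafe_ptr, src, size);
    }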

    Signed-off-by: Sargun Dhillon
    Cc: Alexei Starovoitov
    Cc: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Sargun Dhillon
     

20 Jul, 2016

1 commit

  • kernel/trace/bpf_trace.c: In function 'bpf_event_output':
    kernel/trace/bpf_trace.c:312: error: unknown field 'next' specified in initializer
    kernel/trace/bpf_trace.c:312: warning: missing braces around initializer
    kernel/trace/bpf_trace.c:312: warning: (near initialization for 'raw.frag.')
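
    A standalone illustration of the portability issue, assuming the failing
    line used a designated initializer for a member of an anonymous union
    (older gcc releases reject naming such members at the enclosing struct
    level):

    struct frag {
            union {
                    struct frag *next;
                    unsigned long pad;
            };
            void *data;
    };

    /* old gcc: "error: unknown field 'next' specified in initializer" */
    /* struct frag bad = { .next = 0, .data = 0 }; */

    /* portable form: brace the anonymous union, initialize positionally */
    struct frag good = { { 0 }, 0 };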

    Fixes: 555c8a8623a3a87 ("bpf: avoid stack copy and use skb ctx for event output")
    Acked-by: Daniel Borkmann
    Cc: Alexei Starovoitov
    Cc: David S. Miller
    Signed-off-by: Andrew Morton
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Andrew Morton
     

16 Jul, 2016

3 commits

  • This work addresses a couple of issues bpf_skb_event_output()
    helper currently has: i) We need two copies instead of just a
    single one for the skb data when it should be part of a sample.
    The data can be non-linear and thus needs to be extracted via
    bpf_skb_load_bytes() helper first, and then copied once again
    into the ring buffer slot. ii) Since bpf_skb_load_bytes()
    currently needs to be used first, the helper needs to see a
    constant size on the passed stack buffer to make sure BPF
    verifier can do sanity checks on it during verification time.
    Thus, just passing skb->len (or any other non-constant value)
    wouldn't work, but changing bpf_skb_load_bytes() is also not
    the proper solution, since the two copies are generally still
    needed. iii) bpf_skb_load_bytes() is just for rather small
    buffers like headers, since they need to sit on the limited
    BPF stack anyway. Instead of working around in bpf_skb_load_bytes(),
    this work improves the bpf_skb_event_output() helper to address
    all 3 at once.

    We can make use of the passed in skb context that we have in
    the helper anyway, and use some of the reserved flag bits as
    a length argument. The helper will use the new __output_custom()
    facility from perf side with bpf_skb_copy() as callback helper
    to walk and extract the data. It will pass the data for setup
    to bpf_event_output(), which generates and pushes the raw record
    with an additional frag part. The linear data used in the first
    frag of the record serves as programmatically defined meta data
    passed along with the appended sample.
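
    A hedged sketch of the copy callback shape: walk possibly non-linear skb
    data straight into the ring-buffer slot (names as in the 4.8-era code,
    shown for illustration):

    static unsigned long bpf_skb_copy(void *dst_buff, const void *skb,
                                      unsigned long off, unsigned long len)
    {
            void *ptr = skb_header_pointer(skb, off, len, dst_buff);

            if (unlikely(!ptr))
                    return len;     /* report bytes we failed to copy */
            if (ptr != dst_buff)    /* linear part: copy it over */
                    memcpy(dst_buff, ptr, len);

            return 0;
    }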

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Split the bpf_perf_event_output() helper as a preparation into
    two parts. The new bpf_perf_event_output() will prepare the raw
    record itself and test for unknown flags from BPF trace context,
    where the __bpf_perf_event_output() does the core work. The
    latter will be reused later on from bpf_event_output() directly.
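
    A hedged sketch of the resulting outer helper (close to the 4.8-era
    code; the inner __bpf_perf_event_output() does the event lookup and the
    actual output):

    static u64 bpf_perf_event_output(u64 r1, u64 r2, u64 flags, u64 r4,
                                     u64 size)
    {
            struct pt_regs *regs = (struct pt_regs *)(long) r1;
            struct bpf_map *map  = (struct bpf_map *)(long) r2;
            void *data = (void *)(long) r4;
            struct perf_raw_record raw = {
                    .frag = {
                            .size = size,
                            .data = data,
                    },
            };

            /* unknown flag bits are rejected before any work is done */
            if (unlikely(flags & ~(BPF_F_INDEX_MASK)))
                    return -EINVAL;

            return __bpf_perf_event_output(regs, map, flags, &raw);
    }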

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • This patch adds support for non-linear data on raw records. It
    extends raw records to have one or multiple fragments that will
    be written linearly into the ring slot, where each fragment can
    optionally have a custom callback handler to walk and extract
    complex, possibly non-linear data.

    If a callback handler is provided for a fragment, then the new
    __output_custom() will be used instead of __output_copy() for
    the perf_output_sample() part. perf_prepare_sample() does all
    the size calculation only once, so perf_output_sample() doesn't
    need to redo the same work anymore, meaning real_size and padding
    will be cached in the raw record. The raw record becomes 32 bytes
    in size without holes; to not increase it further and to avoid
    doing unnecessary recalculations in fast-path, we can reuse
    next pointer of the last fragment, idea here is borrowed from
    ZERO_OR_NULL_PTR(), which should keep the perf_output_sample()
    path for PERF_SAMPLE_RAW minimal.

    This facility is needed for BPF's event output helper as a first
    user that will, in a follow-up, add an additional perf_raw_frag
    to its perf_raw_record in order to be able to more efficiently
    dump skb context after a linear head meta data related to it.
    skbs can be non-linear and thus need a custom output function to
    dump buffers. Currently, the skb data needs to be copied twice;
    with the help of __output_custom() this work only needs to be
    done once. Future users could be things like XDP/BPF programs
    that work on different context though and would thus also have
    a different callback function.

    The few users of raw records are adapted to initialize their frag
    data from the raw record itself, no change in behavior for them.
    The code is based upon a PoC diff provided by Peter Zijlstra [1].

    [1] http://thread.gmane.org/gmane.linux.network/421294
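
    A hedged sketch of the resulting layout (as in the 4.8-era
    include/linux/perf_event.h); the tail 'next' pointer is the reused field
    mentioned above:

    struct perf_raw_frag {
            union {
                    struct perf_raw_frag    *next;  /* chained fragment */
                    unsigned long           pad;    /* tail reuse, as in
                                                     * ZERO_OR_NULL_PTR() */
            };
            perf_copy_f                     copy;   /* optional custom walker */
            void                            *data;
            u32                             size;
    } __packed;

    struct perf_raw_record {
            struct perf_raw_frag            frag;
            u32                             size;   /* cached real_size */
    };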

    Suggested-by: Peter Zijlstra
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

09 Jul, 2016

1 commit

  • Over time there were multiple requests to access different data
    structures and fields of task_struct 'current', so finally add
    the helper to access 'current' as-is. Tracing bpf programs will do
    the rest of the pointer walking via bpf_probe_read().
    Note that 'current' can be NULL and the bpf program has to deal with it,
    but even naively passing NULL into bpf_probe_read() is still safe.
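
    The kernel side is essentially a one-liner; a hedged sketch with an
    illustrative program-side field walk (offsets are resolved from kernel
    headers, and the read degrades safely on NULL):

    static u64 bpf_get_current_task(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
    {
            return (long) current;
    }

    /* program side */
    struct task_struct *task = (struct task_struct *) bpf_get_current_task();
    u32 pid = 0;
    bpf_probe_read(&pid, sizeof(pid), &task->pid);  /* fails cleanly on NULL */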

    Suggested-by: Brendan Gregg
    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

06 Jul, 2016

2 commits

  • Currently the ftrace_graph_ent{,_entry} and ftrace_graph_ret{,_entry}
    structs can have padding bytes at the end due to the alignment of 64-bit
    data types. As these entries are recorded very frequently, that padding
    wastes non-negligible space. As the ring buffer maintains alignment
    properly for each architecture, just remove the extra padding using the
    'packed' attribute.

    ftrace_graph_ent_entry: 24 -> 20
    ftrace_graph_ret_entry: 48 -> 44

    Also I moved the 'overrun' field in struct ftrace_graph_ret to minimize
    the padding in the middle.

    Tested on x86_64 only.
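
    A standalone demonstration of where the bytes go on x86_64; the struct
    names are illustrative stand-ins, the mechanism is the same:

    #include <stdio.h>

    struct ent_padded { unsigned long func; int depth; };
    struct ent_packed { unsigned long func; int depth; } __attribute__((packed));

    int main(void)
    {
            printf("padded: %zu bytes, packed: %zu bytes\n",
                   sizeof(struct ent_padded),   /* 16: 4 bytes tail padding */
                   sizeof(struct ent_packed));  /* 12 */
            return 0;
    }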

    Link: http://lkml.kernel.org/r/1467197808-13578-1-git-send-email-namhyung@kernel.org

    Cc: Ingo Molnar
    Cc: linux-arch@vger.kernel.org
    Signed-off-by: Namhyung Kim
    Signed-off-by: Steven Rostedt

    Namhyung Kim
     
  • The kbuild test robot reported a compile error if HIST_TRIGGERS was
    enabled but nothing else that selected TRACING was configured in.

    HIST_TRIGGERS should directly select it and not rely on anything else
    to do it.

    Link: http://lkml.kernel.org/r/57791866.8080505@linux.intel.com

    Reported-by: kbuild test robot
    Fixes: 7ef224d1d0e3a ("tracing: Add 'hist' event trigger command")
    Signed-off-by: Tom Zanussi
    Signed-off-by: Steven Rostedt

    Tom Zanussi
     

05 Jul, 2016

2 commits

  • Use for_each_set_bit() to simplify the code.
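
    A hedged before/after sketch of the change in trace_pid_write()
    (variable names as in the 4.8-era code):

    /* before: open-coded scan over every possible pid */
    for (pid = 0; pid < filtered_pids->pid_max; pid++) {
            if (!test_bit(pid, filtered_pids->pids))
                    continue;
            set_bit(pid, pid_list->pids);
            nr_pids++;
    }

    /* after: the iterator skips the clear bits for us */
    for_each_set_bit(pid, filtered_pids->pids, filtered_pids->pid_max) {
            set_bit(pid, pid_list->pids);
            nr_pids++;
    }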

    Link: http://lkml.kernel.org/r/1467645004-11169-1-git-send-email-weiyj_lk@163.com

    Signed-off-by: Wei Yongjun
    Signed-off-by: Steven Rostedt

    Wei Yongjun
     
  • Commit 345ddcc882d8 ("ftrace: Have set_ftrace_pid use the bitmap like events
    do") placed ftrace_init_tracefs into the instance creation, and encapsulated
    the top level updating with an if conditional, as the top level only gets
    updated at boot up. Unfortunately, this triggers section mismatch errors as
    the init functions are called from a function that can be called later, and
    the section mismatch logic is unaware of the if conditional that would
    prevent it from happening at run time.

    To make everyone happy, create a separate ftrace_init_tracefs_toplevel()
    routine that only gets called by init functions, and this will be what calls
    other init functions for the toplevel directory.
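
    A hedged sketch of the shape of the fix: the toplevel-only calls live in
    a routine that is itself __init, so modpost no longer sees init code
    reachable from a function callable after boot (callee names as in the
    4.8-era ftrace code):

    static __init void ftrace_init_tracefs_toplevel(struct trace_array *tr,
                                                    struct dentry *d_tracer)
    {
            /* only the top level instance has these files */
            ftrace_init_dyn_tracefs(d_tracer);
            ftrace_profile_tracefs(d_tracer);
    }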

    Link: http://lkml.kernel.org/r/20160704102139.19cbc0d9@gandalf.local.home

    Reported-by: kbuild test robot
    Reported-by: Arnd Bergmann
    Fixes: 345ddcc882d8 ("ftrace: Have set_ftrace_pid use the bitmap like events do")
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     

30 Jun, 2016

5 commits

  • Follow-up commit to 1e33759c788c ("bpf, trace: add BPF_F_CURRENT_CPU
    flag for bpf_perf_event_output") to add the same functionality into
    bpf_perf_event_read() helper. The split of index into flags and index
    component is also safe here, since such large maps are rejected during
    map allocation time.
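
    A hedged fragment of the added handling, mirroring what
    bpf_perf_event_output() already does; the upper bits of the former index
    argument now carry flags:

    u64 index = flags & BPF_F_INDEX_MASK;

    if (unlikely(flags & ~(BPF_F_INDEX_MASK)))
            return -EINVAL;         /* reject unknown flag bits */
    if (index == BPF_F_CURRENT_CPU)
            index = smp_processor_id();
    /* index is then used for the per-cpu array lookup as before */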

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • We currently look up the current CPU twice, which is unnecessary. Fetch
    it only once and use the smp_processor_id() variant, so we also get
    preemption checks along with it when DEBUG_PREEMPT is set.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Some minor cleanups: i) Remove the unlikely() from fd array map lookups
    and let the CPU branch predictor do its job; scenarios where there is
    not always a map entry are perfectly valid. ii) Move the attribute type
    check in the bpf_perf_event_read() helper a bit earlier so it's
    consistent with the checks in the bpf_perf_event_output() helper.
    iii) Remove some comments that are self-documenting in
    kprobe_prog_is_valid_access() and thereby make it consistent with
    tp_prog_is_valid_access() as well.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Several cases of overlapping changes, except the packet scheduler
    conflicts which deal with the addition of the free list parameter
    to qdisc_enqueue().

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Pull networking fixes from David Miller:
    "I've been traveling so this accumulates more than week or so of bug
    fixing. It perhaps looks a little worse than it really is.

    1) Fix deadlock in ath10k driver, from Ben Greear.

    2) Increase scan timeout in iwlwifi, from Luca Coelho.

    3) Unbreak STP by properly reinjecting STP packets back into the
    stack. Regression fix from Ido Schimmel.

    4) Mediatek driver fixes (missing malloc failure checks, leaking of
    scratch memory, wrong indexing when mapping TX buffers, etc.) from
    John Crispin.

    5) Fix endianness bug in icmpv6_err() handler, from Hannes Frederic
    Sowa.

    6) Fix hashing of flows in UDP in the reuseport case, from Xuemin Su.

    7) Fix netlink notifications in ovs for tunnels, delete link messages
    are never emitted because of how the device registry state is
    handled. From Nicolas Dichtel.

    8) Conntrack module leaks kmemcache on unload, from Florian Westphal.

    9) Prevent endless jump loops in nft rules, from Liping Zhang and
    Pablo Neira Ayuso.

    10) Not early enough spinlock initialization in mlx4, from Eric
    Dumazet.

    11) Bind refcount leak in act_ipt, from Cong WANG.

    12) Missing RCU locking in HTB scheduler, from Florian Westphal.

    13) Several small MACSEC bug fixes from Sabrina Dubroca (missing RCU
    barrier, using heap for SG and IV, and erroneous use of async flag
    when allocating AEAD context.)

    14) RCU handling fix in TIPC, from Ying Xue.

    15) Pass correct protocol down into ipv4_{update_pmtu,redirect}() in
    SIT driver, from Simon Horman.

    16) Socket timer deadlock fix in TIPC from Jon Paul Maloy.

    17) Fix potential deadlock in team enslave, from Ido Schimmel.

    18) Memory leak in KCM procfs handling, from Jiri Slaby.

    19) ESN generation fix in ipv4 ESP, from Herbert Xu.

    20) Fix GFP_KERNEL allocations with locks held in act_ife, from Cong
    WANG.

    21) Use after free in netem, from Eric Dumazet.

    22) Uninitialized last assert time in multicast router code, from Tom
    Goff.

    23) Skip raw sockets in sock_diag destruction broadcast, from Willem
    de Bruijn.

    24) Fix link status reporting in thunderx, from Sunil Goutham.

    25) Limit resegmentation of retransmit queue so that we do not
    retransmit too large GSO frames. From Eric Dumazet.

    26) Delay bpf program release after grace period, from Daniel
    Borkmann"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (141 commits)
    openvswitch: fix conntrack netlink event delivery
    qed: Protect the doorbell BAR with the write barriers.
    neigh: Explicitly declare RCU-bh read side critical section in neigh_xmit()
    e1000e: keep VLAN interfaces functional after rxvlan off
    cfg80211: fix proto in ieee80211_data_to_8023 for frames without LLC header
    qlcnic: use the correct ring in qlcnic_83xx_process_rcv_ring_diag()
    bpf, perf: delay release of BPF prog after grace period
    net: bridge: fix vlan stats continue counter
    tcp: do not send too big packets at retransmit time
    ibmvnic: fix to use list_for_each_safe() when delete items
    net: thunderx: Fix TL4 configuration for secondary Qsets
    net: thunderx: Fix link status reporting
    net/mlx5e: Reorganize ethtool statistics
    net/mlx5e: Fix number of PFC counters reported to ethtool
    net/mlx5e: Prevent adding the same vxlan port
    net/mlx5e: Check for BlueFlame capability before allocating SQ uar
    net/mlx5e: Change enum to better reflect usage
    net/mlx5: Add ConnectX-5 PCIe 4.0 to list of supported devices
    net/mlx5: Update command strings
    net: marvell: Add separate config ANEG function for Marvell 88E1111
    ...

    Linus Torvalds
     

28 Jun, 2016

1 commit

  • The function graph tracer currently ignores filters if tracing_thresh
    is set. For example, even if set_ftrace_pid is set, it is ignored if
    tracing_thresh is set, resulting in all processes being traced.

    To fix this, we reuse the same entry function as when tracing_thresh is not
    set and do everything as in the regular case except for writing the function entry
    to the ring buffer.
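
    A hedged fragment of trace_graph_entry() after the fix (names as in the
    4.8-era trace_functions_graph.c); the filter checks above this point run
    unconditionally, and the threshold only skips the ring-buffer write:

    if (tracing_thresh)
            ret = 1;        /* still traced; entry recorded only on return */
    else
            ret = __trace_graph_entry(tr, trace, flags, pc);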

    Link: http://lkml.kernel.org/r/1466228694-2677-1-git-send-email-agnel.joel@gmail.com

    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Signed-off-by: Joel Fernandes
    Signed-off-by: Steven Rostedt

    Joel Fernandes
     

24 Jun, 2016

1 commit

  • # echo 1 > options/stacktrace
    # echo 1 > events/sched/sched_switch/enable
    # cat trace
    <idle>-0 [002] d..2 1982.525169:
    => save_stack_trace
    => __ftrace_trace_stack
    => trace_buffer_unlock_commit_regs
    => event_trigger_unlock_commit
    => trace_event_buffer_commit
    => trace_event_raw_event_sched_switch
    => __schedule
    => schedule
    => schedule_preempt_disabled
    => cpu_startup_entry
    => start_secondary

    The above shows that we are seeing 6 functions before ever making it to the
    caller of the sched_switch event.

    # echo stacktrace > events/sched/sched_switch/trigger
    # cat trace
    <idle>-0 [002] d..3 2146.335208:
    => trace_event_buffer_commit
    => trace_event_raw_event_sched_switch
    => __schedule
    => schedule
    => schedule_preempt_disabled
    => cpu_startup_entry
    => start_secondary

    The stacktrace trigger isn't as bad, because it adds its own skip to the
    stacktracing, but it still shows two extra functions.

    One issue is that if the stacktrace passes its own "regs" then there should
    be no addition to the skip, as the regs will not include the functions being
    called. This was an issue that was fixed by commit 7717c6be6999
    ("tracing: Fix stacktrace skip depth in trace_buffer_unlock_commit_regs()"),
    as adding the skip number for kprobes made the probes not have any stack
    at all.

    But since this is only an issue when regs is being used, a skip should be
    added if regs is NULL. Now we have:

    # echo 1 > options/stacktrace
    # echo 1 > events/sched/sched_switch/enable
    # cat trace
    <idle>-0 [000] d..2 1297.676333:
    => __schedule
    => schedule
    => schedule_preempt_disabled
    => cpu_startup_entry
    => rest_init
    => start_kernel
    => x86_64_start_reservations
    => x86_64_start_kernel

    # echo stacktrace > events/sched/sched_switch/trigger
    # cat trace
    <idle>-0 [002] d..3 1370.759745:
    => __schedule
    => schedule
    => schedule_preempt_disabled
    => cpu_startup_entry
    => start_secondary

    And kprobes are not touched.
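
    A hedged fragment of the resulting call in
    trace_buffer_unlock_commit_regs(): the four commit-path callers are
    skipped only when no regs are passed in:

    ftrace_trace_stack(tr, buffer, flags, regs ? 0 : 4, pc, regs);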

    Reported-by: Peter Zijlstra
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     

20 Jun, 2016

10 commits

  • Previously, mmio_print_pcidev() put "user" addresses in the trace buffer.
    On most architectures, these are the same as CPU physical addresses, but on
    microblaze, mips, powerpc, and sparc, they may be something else, typically
    a raw BAR value (a bus address as opposed to a CPU address).

    Always expose the CPU physical address to avoid this arch-dependent
    behavior.

    This change should have no user-visible effect because this file currently
    depends on CONFIG_HAVE_MMIOTRACE_SUPPORT, which is only defined for x86,
    and pci_resource_to_user() is a no-op on x86.
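
    A hedged contrast of the two views of BAR i (pci_resource_to_user() is a
    long-standing PCI core function; the fragment is illustrative):

    struct resource *res = &dev->resource[i];
    resource_size_t user_start, user_end;

    pci_resource_to_user(dev, i, res, &user_start, &user_end);
    /* res->start:  CPU physical address, the same on every arch  */
    /* user_start:  may be a raw BAR/bus value on sparc, powerpc, ... */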

    Link: http://lkml.kernel.org/r/20160511190657.5898.4248.stgit@bhelgaas-glaptop2.roam.corp.google.com

    Signed-off-by: Bjorn Helgaas
    Signed-off-by: Steven Rostedt

    Bjorn Helgaas
     
  • Because tracepoint callbacks are invoked through
    rcu_read_lock_sched_notrace() in __DO_TRACE(), trace events are always
    called with preemption disabled, even when the tracepoint itself was hit
    with preemption enabled. This causes the preempt count shown in the
    recorded trace event to be inaccurate: it is always one more than what
    the preempt_count was when the tracepoint was called.

    If CONFIG_PREEMPT is enabled, subtract 1 from the preempt_count before
    recording it in the trace buffer.
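
    A hedged fragment of the compensation at event-reserve time:

    fbuffer->pc = preempt_count();
    /* __DO_TRACE() wrapped us in rcu_read_lock_sched_notrace(), which adds
     * one to the count under CONFIG_PREEMPT; undo it so the record reflects
     * the caller's context. */
    if (IS_ENABLED(CONFIG_PREEMPT))
            fbuffer->pc--;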

    Link: http://lkml.kernel.org/r/20160525132537.GA10808@linutronix.de

    Reported-by: Sebastian Andrzej Siewior
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     
  • Currently, the trace_printk code chooses which static buffer to use based
    on what type of atomic context (NMI, IRQ, etc) it's in. Simplify the
    code and make it more robust: simply count the nesting depth and choose
    a buffer based on the current nesting depth.

    The new code will only drop an event if we nest more than 4 deep,
    and the old code was guaranteed to malfunction if that happened.
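
    A hedged sketch of the per-CPU nesting scheme, close to the 4.8-era
    get_trace_buf() but simplified for illustration:

    struct trace_buffer_struct {
            int nesting;
            char buffer[4][TRACE_BUF_SIZE];
    };
    static struct trace_buffer_struct *trace_percpu_buffer;

    static char *get_trace_buf(void)
    {
            struct trace_buffer_struct *buffer = this_cpu_ptr(trace_percpu_buffer);

            /* an interrupting trace_printk() simply takes the next buffer */
            if (!buffer || buffer->nesting >= 4)
                    return NULL;    /* nested more than 4 deep: drop it */

            return &buffer->buffer[buffer->nesting++][0];
    }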

    Link: http://lkml.kernel.org/r/07ab03aecfba25fcce8f9a211b14c9c5e2865c58.1464289095.git.luto@kernel.org

    Acked-by: Namhyung Kim
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Steven Rostedt

    Andy Lutomirski
     
  • ftrace is very quick to give up on saving the task command line (see
    `trace_save_cmdline()`). The workaround for events which really care
    about the command line is to explicitly assign it as part of the entry.
    However, this doesn't work for kprobe events, as there's no
    straightforward way to get access to current->comm. Add a kprobe/uprobe
    event variable $comm which provides exactly that.
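
    A usage sketch in the style of the examples elsewhere in this log (the
    probe point is chosen arbitrarily for illustration):

    # echo 'p:myopen do_sys_open comm=$comm' > kprobe_events
    # echo 1 > events/kprobes/myopen/enable
    # cat trace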

    Link: http://lkml.kernel.org/r/f59b472033b943a370f5f48d0af37698f409108f.1465435894.git.osandov@fb.com

    Acked-by: Masami Hiramatsu
    Signed-off-by: Omar Sandoval
    Signed-off-by: Steven Rostedt

    Omar Sandoval
     
  • Convert set_ftrace_pid to use the bitmap like set_event_pid does. This
    allows for instances to use the pid filtering as well, and will allow for
    function-fork option to set if the children of a traced function should be
    traced or not.

    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     
  • The addition of PIDs into a pid_list via the write operation of
    set_event_pid is a bit complex. The same operation will be needed for
    function tracing pids. Move the code into its own generic function in
    trace.c, so that we can avoid duplication of this code.

    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     
  • To allow other aspects of ftrace to use the pid_list logic, we need to reuse
    the seq_file functions. Making the generic part into functions that can be
    called by other files will help in this regard.

    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     
  • As the filtered_pid functions are going to be used by function tracer as
    well as trace_events, move the code into the generic trace.c file.

    The functions moved are:

    trace_find_filtered_pid()
    trace_ignore_this_task()
    trace_filter_add_remove_task()

    Kernel Doc text was also added.

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • Make the functions used for pid filtering global for tracing, such that the
    function tracer can use the pid code as well.

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • If a task uses a non-constant string for the format parameter in
    trace_printk(), then the trace_printk_fmt variable is set to NULL. This
    variable is then saved in the __trace_printk_fmt section.

    The function hold_module_trace_bprintk_format() checks to see if duplicate
    formats are used by modules, and reuses them if so (saves them to the list
    if it is new). But this function calls lookup_format() that does a strcmp()
    to the value (which is now NULL) and can cause a kernel oops.

    This wasn't an issue until commit 3debb0a9ddb ("tracing: Fix
    trace_printk() to print when not using bprintk()"), which added "__used"
    to the trace_printk_fmt variable; before that, the kernel simply
    optimized it out (no NULL value was saved).

    The fix is simply to handle the NULL pointer in lookup_format() and have the
    caller ignore the value if it was NULL.
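
    A hedged sketch of the shape of the fix: refuse NULL before the
    strcmp(), and let the caller treat the error as "ignore this format":

    static struct trace_bprintk_fmt *lookup_format(const char *fmt)
    {
            struct trace_bprintk_fmt *pos;

            if (!fmt)
                    return ERR_PTR(-EINVAL);  /* saved non-constant format */

            list_for_each_entry(pos, &trace_bprintk_fmt_list, list) {
                    if (!strcmp(pos->fmt, fmt))
                            return pos;
            }
            return NULL;
    }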

    Link: http://lkml.kernel.org/r/1464769870-18344-1-git-send-email-zhengjun.xing@intel.com

    Reported-by: xingzhen
    Acked-by: Namhyung Kim
    Fixes: 3debb0a9ddb ("tracing: Fix trace_printk() to print when not using bprintk()")
    Cc: stable@vger.kernel.org # v3.5+
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     

18 Jun, 2016

1 commit

  • The blktrace code stores the current time in a 32-bit word in its
    user interface. This is a bad idea because 32-bit seconds overflow
    at some point.

    We probably have until 2106 before this one overflows, as it seems
    to use an 'unsigned' variable, but we should confirm that user
    space treats it the same way.

    Aside from this, we want to stop using 'struct timespec' here,
    so I'm adding a comment about the overflow and changing the code
    to use timespec64 instead, to make the loss of range more obvious.
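
    A hedged sketch of the conversion in blktrace's time note: the same
    32-bit words still go to user space, but the kernel-side type is
    year-2038 safe and the truncation is now explicit:

    struct timespec64 now;
    u32 words[2];

    getnstimeofday64(&now);
    words[0] = (u32)now.tv_sec;     /* wraps in 2106 as an unsigned value */
    words[1] = now.tv_nsec;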

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Jens Axboe

    Arnd Bergmann
     

16 Jun, 2016

3 commits

  • The behavior of perf event arrays is quite different from all
    others, as they are tightly coupled to perf event fds, e.g. shown
    recently by commit e03e7ee34fdd ("perf/bpf: Convert perf_event_array
    to use struct file") to make refcounting on perf events more robust.
    A remaining issue in the current code is that additions to the perf
    event array take a reference on the struct file via perf_event_get(),
    and that reference is only released via fput() (which eventually
    cleans up the perf event via perf_event_release_kernel()) when the
    element is either manually removed from the map from user space or
    automatically when the last reference on the perf event map is
    dropped. However, this leads to dangling struct file references when
    the map gets pinned after the application owning the perf event
    descriptor exits: since the struct file reference will in such a case
    only be dropped manually or via pinned file removal, the perf event
    lives longer than necessary, needlessly consuming resources for that
    time.

    Relations between perf event fds and bpf perf event map fds can be
    rather complex. For example, maps can act as demuxers among different
    perf event fds that can possibly be owned by different threads, and
    based on the index selection from the program, events get dispatched
    to one of the per-cpu fd endpoints. One perf event fd (or rather, a
    per-cpu set of them) can also live in multiple perf event maps at the
    same time, listening for events. Another requirement is that perf
    event fds can get closed from the application side after they have
    been attached to the perf event map, so that on exit the perf event
    map will take care of eventually dropping their references. Likewise,
    when such maps are pinned, the intended behavior is that a user
    application does bpf_obj_get(), puts its fds into the map, and on
    exit, when an fd is released, it is dropped from the map again; the
    map thus acts rather as a connector endpoint. This also makes perf
    event maps inherently different from program arrays, as described in
    more detail in commit c9da161c6517 ("bpf: fix clearing on persistent
    program array maps").

    To tackle this, map entries are marked with the map struct file that
    added the element to the map. When the last reference to that map
    struct file is released from user space, the tracked entries are
    purged from the map. This is okay, because new map struct file
    instances, i.e. frontends to the anon inode, are provided via
    bpf_map_new_fd(), which is called when we invoke bpf_obj_get_user()
    to retrieve a pinned map, but also when an initial instance is
    created via map_create(). The rest is resolved automatically for us
    by the vfs layer keeping a reference count on the map's struct file.
    Any concurrent updates on the map slot are fine as well; it just
    means that perf_event_fd_array_release() needs to delete fewer of its
    own entries.
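
    A hedged sketch of the release hook, close to the 4.8-era code: only the
    entries that this map-file instance installed are purged:

    static void perf_event_fd_array_release(struct bpf_map *map,
                                            struct file *map_file)
    {
            struct bpf_array *array = container_of(map, struct bpf_array, map);
            struct bpf_event_entry *ee;
            int i;

            rcu_read_lock();
            for (i = 0; i < array->map.max_entries; i++) {
                    ee = READ_ONCE(array->ptrs[i]);
                    if (ee && ee->map_file == map_file)
                            fd_array_map_delete_elem(map, &i);
            }
            rcu_read_unlock();
    }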

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Similar to bpf_perf_event_output(), the bpf_perf_event_read() helper
    needs to check the type of the perf_event before reading the counter.
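
    A hedged sketch of the added guard, assuming the read helper accepts
    hardware and raw events in that tree:

    if (unlikely(event->attr.type != PERF_TYPE_HARDWARE &&
                 event->attr.type != PERF_TYPE_RAW))
            return -EINVAL;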

    Fixes: a43eec304259 ("bpf: introduce bpf_perf_event_output() helper")
    Reported-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • The ctx structure passed into bpf programs is different depending on bpf
    program type. The verifier incorrectly marked ctx->data and ctx->data_end
    access based on ctx offset only. That caused loads in tracing programs,
    e.g. int bpf_prog(struct pt_regs *ctx) { .. ctx->ax .. },
    to be incorrectly marked as PTR_TO_PACKET, which later caused the
    verifier to reject a program that was actually valid in a tracing
    context. Fix this by doing program-type-specific matching of ctx offsets.
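
    A hedged, illustrative sketch of the direction of the fix (not the
    verbatim verifier diff): whether an offset may yield packet pointers now
    depends on the program type, not just the offset:

    static bool may_access_direct_pkt_data(enum bpf_prog_type type)
    {
            switch (type) {
            case BPF_PROG_TYPE_SCHED_CLS:
            case BPF_PROG_TYPE_SCHED_ACT:
                    return true;    /* ctx is struct __sk_buff */
            default:
                    return false;   /* e.g. a kprobe ctx is pt_regs */
            }
    }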

    Fixes: 969bf05eb3ce ("bpf: direct packet access")
    Reported-by: Sasha Goldshtein
    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

09 Jun, 2016

1 commit