11 Sep, 2015

1 commit

  • There are two kexec load syscalls: kexec_load and kexec_file_load.
    kexec_file_load has been split out into kernel/kexec_file.c. In this patch I
    split the kexec_load syscall code out into kernel/kexec.c.

    Also add a new kconfig option, KEXEC_CORE, so we can disable kexec_load and
    use kexec_file_load only, or vice versa.

    The original requirement is from Ted Ts'o: he wants the kexec kernel
    signature to be checked with CONFIG_KEXEC_VERIFY_SIG enabled. But
    kexec-tools, using the kexec_load syscall, can bypass that check.

    Vivek Goyal proposed to create a common kconfig option so users can compile
    in only one syscall for loading the kexec kernel. KEXEC/KEXEC_FILE select
    KEXEC_CORE so that old config files still work.

    Because some generic code needs CONFIG_KEXEC_CORE, I updated all the
    architecture Kconfigs with the new KEXEC_CORE option and made KEXEC select
    KEXEC_CORE in the arch Kconfigs. I also updated the generic kernel code
    related to the kexec_load syscall.
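
    As a rough illustration (not part of the patch; the function below is
    hypothetical), generic code can now key off the shared option with
    IS_ENABLED():

        /* hypothetical helper: report which kexec pieces are built in */
        static void report_kexec_support(void)
        {
            if (IS_ENABLED(CONFIG_KEXEC_CORE))  /* selected by either syscall */
                pr_info("kexec core available\n");
            if (IS_ENABLED(CONFIG_KEXEC))
                pr_info("kexec_load(2) compiled in\n");
            if (IS_ENABLED(CONFIG_KEXEC_FILE))
                pr_info("kexec_file_load(2) compiled in\n");
        }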

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Dave Young
    Cc: Eric W. Biederman
    Cc: Vivek Goyal
    Cc: Petr Tesarik
    Cc: Theodore Ts'o
    Cc: Josh Boyer
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Young
     

03 Sep, 2015

1 commit

  • Pull networking updates from David Miller:
    "Another merge window, another set of networking changes. I've heard
    rumblings that the lightweight tunnels infrastructure has been voted
    networking change of the year. But what do I know?

    1) Add conntrack support to openvswitch, from Joe Stringer.

    2) Initial support for VRF (Virtual Routing and Forwarding), which
    allows the segmentation of routing paths without using multiple
    devices. There are some semantic kinks to work out still, but
    this is a reasonably strong foundation. From David Ahern.

    3) Remove spinlock from act_bpf fast path, from Alexei Starovoitov.

    4) Ignore route nexthops with a link down state in ipv6, just like
    ipv4. From Andy Gospodarek.

    5) Remove spinlock from fast path of act_gact and act_mirred, from
    Eric Dumazet.

    6) Document the DSA layer, from Florian Fainelli.

    7) Add netconsole support to bcmgenet, systemport, and DSA. Also
    from Florian Fainelli.

    8) Add Mellanox Switch Driver and core infrastructure, from Jiri
    Pirko.

    9) Add support for "light weight tunnels", which allow for
    encapsulation and decapsulation without bearing the overhead of a
    full blown netdevice. From Thomas Graf, Jiri Benc, and a cast of
    others.

    10) Add Identifier Locator Addressing support for ipv6, from Tom
    Herbert.

    11) Support fragmented SKBs in iwlwifi, from Johannes Berg.

    12) Allow perf PMUs to be accessed from eBPF programs, from Kaixu Xia.

    13) Add BQL support to 3c59x driver, from Loganaden Velvindron.

    14) Stop using a zero TX queue length to mean that a device shouldn't
    have a qdisc attached, use an explicit flag instead. From Phil
    Sutter.

    15) Use generic geneve netdevice infrastructure in openvswitch, from
    Pravin B Shelar.

    16) Add infrastructure to avoid re-forwarding a packet in software
    that was already forwarded by a hardware switch. From Scott
    Feldman.

    17) Allow AF_PACKET fanout function to be implemented in a bpf
    program, from Willem de Bruijn"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1458 commits)
    netfilter: nf_conntrack: make nf_ct_zone_dflt built-in
    netfilter: nf_dup{4, 6}: fix build error when nf_conntrack disabled
    net: fec: clear receive interrupts before processing a packet
    ipv6: fix exthdrs offload registration in out_rt path
    xen-netback: add support for multicast control
    bgmac: Update fixed_phy_register()
    sock, diag: fix panic in sock_diag_put_filterinfo
    flow_dissector: Use 'const' where possible.
    flow_dissector: Fix function argument ordering dependency
    ixgbe: Resolve "initialized field overwritten" warnings
    ixgbe: Remove bimodal SR-IOV disabling
    ixgbe: Add support for reporting 2.5G link speed
    ixgbe: fix bounds checking in ixgbe_setup_tc for 82598
    ixgbe: support for ethtool set_rxfh
    ixgbe: Avoid needless PHY access on copper phys
    ixgbe: cleanup to use cached mask value
    ixgbe: Remove second instance of lan_id variable
    ixgbe: use kzalloc for allocating one thing
    flow: Move __get_hash_from_flowi{4,6} into flow_dissector.c
    ixgbe: Remove unused PCI bus types
    ...

    Linus Torvalds
     

22 Aug, 2015

1 commit


12 Aug, 2015

4 commits

  • A question [1] was raised about the use of page::private in AUX buffer
    allocations, so let's add a clarification about its intended use.

    The private field and flag are used by perf's rb_alloc_aux() path to
    tell the pmu driver the size of each high-order allocation, so that the
    driver can program those appropriately into its hardware. This only
    matters for PMUs that don't support hardware scatter tables. Otherwise,
    every page in the buffer is just a page.
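
    A minimal sketch of that path, assuming the stock page helpers (the
    surrounding allocation logic here is illustrative, not the exact
    rb_alloc_aux() code):

        struct page *page;

        page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, order);
        if (page && order) {
            /* tell the PMU driver how big this high-order chunk is */
            SetPagePrivate(page);
            set_page_private(page, order);
        }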

    This patch adds a comment about the private field to the AUX buffer
    allocation path.

    [1] http://marc.info/?l=linux-kernel&m=143803696607968

    Reported-by: Mathieu Poirier
    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1438063204-665-1-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • I ran the perf fuzzer, which triggered some WARN()s which are due to
    trying to stop/restart an event on the wrong CPU.

    Use the normal IPI pattern to ensure we run the code on the correct CPU.
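
    A sketch of that pattern (handler name and payload are assumed):

        static void __perf_event_period(void *info)
        {
            struct perf_event *event = info;

            /* stop the event, write the new period, restart -- all on
             * the CPU the event runs on, so PMU state stays coherent */
        }

        /* cross-call instead of poking the PMU from the wrong CPU: */
        smp_call_function_single(event->oncpu, __perf_event_period, event, 1);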

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Vince Weaver
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: bad7192b842c ("perf: Fix PERF_EVENT_IOC_PERIOD to force-reset the period")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • If rb->aux_refcount is decremented to zero before rb->refcount,
    __rb_free_aux() may be called twice, resulting in a double free of
    rb->aux_pages. Fix this by adding a check to __rb_free_aux().
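
    A sketch of the guarded free, with helper and field names partly
    assumed from the changelog:

        static void __rb_free_aux(struct ring_buffer *rb)
        {
            int pg;

            if (rb->aux_nr_pages) {        /* a second call becomes a no-op */
                for (pg = 0; pg < rb->aux_nr_pages; pg++)
                    rb_free_aux_page(rb, pg);

                kfree(rb->aux_pages);
                rb->aux_nr_pages = 0;      /* mark the pages as freed */
            }
        }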

    Signed-off-by: Ben Hutchings
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: stable@vger.kernel.org
    Fixes: 57ffc5ca679f ("perf: Fix AUX buffer refcounting")
    Link: http://lkml.kernel.org/r/1437953468.12842.17.camel@decadent.org.uk
    Signed-off-by: Ingo Molnar

    Ben Hutchings
     

10 Aug, 2015

1 commit

  • This patch adds three core perf APIs:
    - perf_event_attrs(): export the struct perf_event_attr from a struct
    perf_event;
    - perf_event_get(): get the struct perf_event from the given fd;
    - perf_event_read_local(): read the event counters active on the
    current CPU.
    These APIs are needed when accessing event counters in eBPF programs.

    The API perf_event_read_local() comes from Peter, and I added the
    corresponding SOB.
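
    A hypothetical in-kernel caller, sketching the three APIs as described
    (error handling simplified):

        static int read_counter_by_fd(unsigned int fd, u64 *val)
        {
            struct perf_event *event = perf_event_get(fd); /* takes a ref */
            struct perf_event_attr *attr;

            if (IS_ERR(event))
                return PTR_ERR(event);

            attr = perf_event_attrs(event);       /* the event's attr */
            if (IS_ERR(attr))
                return PTR_ERR(attr);

            *val = perf_event_read_local(event);  /* count on this CPU */
            return 0;
        }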

    Signed-off-by: Kaixu Xia
    Signed-off-by: Peter Zijlstra
    Signed-off-by: David S. Miller

    Kaixu Xia
     

07 Aug, 2015

1 commit

  • By copying the BPF-related operations to the uprobe processing path, this
    patch allows users to attach BPF programs to uprobes just as they already
    can with kprobes.

    After this patch, users are allowed to use PERF_EVENT_IOC_SET_BPF on a
    uprobe perf event, which makes it possible to profile user space programs
    and kernel events together using BPF.

    Because of this patch, CONFIG_BPF_EVENTS should be selected by
    CONFIG_UPROBE_EVENT to ensure trace_call_bpf() is compiled even if
    KPROBE_EVENT is not set.
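
    From userspace the flow looks roughly like this ('attr' describes the
    uprobe event and 'bpf_fd' is a loaded BPF program; their setup is
    omitted):

        #include <sys/ioctl.h>
        #include <sys/syscall.h>
        #include <unistd.h>
        #include <linux/perf_event.h>

        static int attach_bpf_to_uprobe(struct perf_event_attr *attr, int bpf_fd)
        {
            int evt = syscall(__NR_perf_event_open, attr, getpid(), -1, -1, 0);

            if (evt < 0 || ioctl(evt, PERF_EVENT_IOC_SET_BPF, bpf_fd) < 0)
                return -1;
            return evt;    /* event fd, now with the BPF program attached */
        }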

    Signed-off-by: Wang Nan
    Acked-by: Alexei Starovoitov
    Cc: Brendan Gregg
    Cc: Daniel Borkmann
    Cc: David Ahern
    Cc: He Kuang
    Cc: Jiri Olsa
    Cc: Kaixu Xia
    Cc: Masami Hiramatsu
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Zefan Li
    Cc: pi3orama@163.com
    Link: http://lkml.kernel.org/r/1435716878-189507-3-git-send-email-wangnan0@huawei.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Wang Nan
     

04 Aug, 2015

2 commits

  • Currently, the PT driver zeroes out the status register every time before
    starting the event. However, all the writable bits are already taken care
    of in the pt_handle_status() function, except the new PacketByteCnt field,
    which in new versions of PT contains the number of packet bytes written
    since the last sync (PSB) packet. Zeroing it out before enabling PT forces
    a sync packet to be written. This means that, with the existing code, a
    sync packet (PSB and PSBEND, 18 bytes in total) will be generated every
    time a PT event is scheduled in.

    To avoid these unnecessary syncs and save a WRMSR in the fast path, this
    patch changes the default behavior to not clear PacketByteCnt field, so
    that the sync packets will be generated with the period specified as
    "psb_period" attribute config field. This has little impact on the trace
    data as the other packets that are normally sent within PSB+ (between PSB
    and PSBEND) have their own generation scenarios which do not depend on the
    sync packets.

    One exception is when tracing starts: there we do need to force a PSB, so
    that the decoder has a clear sync point in the trace. For this purpose we
    already have the hw::itrace_started flag, which we are currently using to
    output PERF_RECORD_ITRACE_START. This patch moves the setting of
    itrace_started from the perf core to pmu::start, where it should still be
    0 on the very first run.
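
    A sketch of the resulting logic in the driver's configuration path
    (per the description above; surrounding code omitted):

        if (!event->hw.itrace_started) {
            event->hw.itrace_started = 1;
            /* first schedule-in: force one PSB so the decoder can sync */
            wrmsrl(MSR_IA32_RTIT_STATUS, 0);
        }
        /* otherwise leave PacketByteCnt alone; PSBs follow psb_period */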

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: acme@infradead.org
    Cc: adrian.hunter@intel.com
    Cc: hpa@zytor.com
    Link: http://lkml.kernel.org/r/1438264104-16189-1-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • Vince reported that the fasync signal stuff doesn't work properly for
    inherited events. So fix that.

    Installing fasync allocates memory and sets filp->f_flags |= FASYNC,
    which upon the demise of the file descriptor ensures the allocation is
    freed and state is updated.

    Now for perf, we can have the events stick around for a while after the
    original FD is dead because of references from child events. So we
    cannot copy the fasync pointer around. We can however consistently use
    the parent's fasync, as that will be updated.
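
    The shape of the fix, per the description (a sketch; the helper name
    is assumed):

        static struct fasync_struct **perf_event_fasync(struct perf_event *event)
        {
            /* inherited (child) events signal through the parent's
             * fasync, which stays in sync with the original fd */
            if (event->parent)
                event = event->parent;
            return &event->fasync;
        }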

    Reported-and-Tested-by: Vince Weaver
    Signed-off-by: Peter Zijlstra (Intel)
    Cc:
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: eranian@google.com
    Link: http://lkml.kernel.org/r/1434011521.1495.71.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

31 Jul, 2015

15 commits

  • The xol_free_insn_slot()->waitqueue_active() check is buggy. We
    need mb() after we set the condition for wait_event(), or
    xol_take_insn_slot() can miss the wakeup.
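
    A sketch of the required ordering on the wake-up side (assuming an
    atomic slot count; the barrier pairs with the condition check in
    wait_event()):

        clear_bit(slot_nr, area->bitmap);  /* the wait_event() condition */
        atomic_dec(&area->slot_count);
        smp_mb__after_atomic();            /* order vs. the waiter's check */
        if (waitqueue_active(&area->wq))
            wake_up(&area->wq);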

    Signed-off-by: Oleg Nesterov
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Pratyush Anand
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150721134036.GA4799@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Change xol_add_vma() to use _install_special_mapping(); this way
    we can name the vma installed by uprobes. Currently it looks
    like a private anonymous mapping, which is confusing and
    complicates debugging. With this change /proc/$pid/maps
    reports "[uprobes]".

    As a side effect this will cause core dumps to include the XOL vma
    and I think this is good; this can help to debug the problem if
    the app crashed because it was probed.
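
    Roughly, the change looks like this (a sketch; the page bookkeeping is
    simplified):

        static struct vm_special_mapping xol_mapping = {
            .name = "[uprobes]",    /* what /proc/$pid/maps will show */
        };

        /* in xol_add_vma(): */
        xol_mapping.pages = area->pages;
        vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE,
                    VM_EXEC | VM_MAYEXEC | VM_DONTCOPY | VM_IO,
                    &xol_mapping);
        if (IS_ERR(vma))
            ret = PTR_ERR(vma);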

    Signed-off-by: Oleg Nesterov
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Pratyush Anand
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150721134033.GA4796@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • install_special_mapping(pages) expects that "pages" is a zero-
    terminated array, while xol_add_vma() passes &area->page; this
    means that special_mapping_fault() can wrongly use the next
    member in xol_area (vaddr) as a "struct page *".

    Fortunately, this area is not expandable so pgoff != 0 isn't
    possible (modulo bugs in special_mapping_vmops), but still this
    does not look good.
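
    A minimal sketch of the fix's idea: hand install_special_mapping() a
    properly NULL-terminated array:

        struct page *pages[2] = { area->page, NULL };  /* explicit terminator */

        ret = install_special_mapping(mm, area->vaddr, PAGE_SIZE,
                    VM_EXEC | VM_MAYEXEC | VM_DONTCOPY | VM_IO,
                    pages);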

    Signed-off-by: Oleg Nesterov
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Pratyush Anand
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150721134031.GA4789@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • The previous change documents that cleanup_return_instances()
    can't always detect the dead frames; the stack can grow. But
    there is one special case which imho is worth fixing:
    arch_uretprobe_is_alive() can return true when the stack didn't
    actually grow, but the next "call" insn uses the already
    invalidated frame.

    Test-case:

    #include <stdio.h>
    #include <setjmp.h>

    jmp_buf jmp;
    int nr = 1024;

    void func_2(void)
    {
        if (--nr == 0)
            return;
        longjmp(jmp, 1);
    }

    void func_1(void)
    {
        setjmp(jmp);
        func_2();
    }

    int main(void)
    {
        func_1();
        return 0;
    }

    If you ret-probe func_1() and func_2(), prepare_uretprobe() hits
    the MAX_URETPROBE_DEPTH limit and the "return" from func_2() is not
    reported.

    When we know that the new call is not chained, we can do a
    stricter check. In this case "sp" points to the new ret-addr,
    so every frame which uses the same "sp" must be dead. The only
    complication is that arch_uretprobe_is_alive() needs to know
    whether it was chained or not, so we add the new RP_CHECK_CHAIN_CALL
    enum value and change prepare_uretprobe() to pass RP_CHECK_CALL
    only if !chained.

    Note: arch_uretprobe_is_alive() could also re-read *sp and check
    if this word is still trampoline_vaddr. This could obviously
    improve the logic, but I would like to avoid another
    copy_from_user() especially in the case when we can't avoid the
    false "alive == T" positives.

    Tested-by: Pratyush Anand
    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju
    Acked-by: Anton Arapov
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150721134028.GA4786@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • arch/x86 doesn't care (so far), but as Pratyush Anand pointed
    out, other architectures might want to know why
    arch_uretprobe_is_alive() was called and use different checks
    depending on the context. Add the new argument to distinguish
    the 2 callers.
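
    The argument is an enum along these lines (sketched from the
    changelog; the RP_CHECK_CHAIN_CALL value is added by the later patch
    above):

        enum rp_check {
            RP_CHECK_CALL,    /* from prepare_uretprobe() */
            RP_CHECK_RET,     /* from the trampoline/cleanup paths */
        };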

    Tested-by: Pratyush Anand
    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju
    Acked-by: Anton Arapov
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150721134026.GA4779@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Change prepare_uretprobe() to flush the !arch_uretprobe_is_alive()
    return_instances. This is not needed correctness-wise, but it can
    help to avoid the failure caused by MAX_URETPROBE_DEPTH.

    Note: in this case arch_uretprobe_is_alive() can be a false
    positive; the stack can grow after longjmp(). Unfortunately, the
    kernel can't 100% solve this problem, but see the next patch.

    Tested-by: Pratyush Anand
    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju
    Acked-by: Anton Arapov
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150721134023.GA4776@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Test-case:

    #include <stdio.h>
    #include <setjmp.h>

    jmp_buf jmp;

    void func_2(void)
    {
        longjmp(jmp, 1);
    }

    void func_1(void)
    {
        if (setjmp(jmp))
            return;
        func_2();
        printf("ERR!! I am running on the caller's stack\n");
    }

    int main(void)
    {
        func_1();
        return 0;
    }

    fails if you probe func_1() and func_2(), because
    handle_trampoline() assumes that the probed function must
    return and hit the bp installed by prepare_uretprobe(). But in
    this case func_2() does not return, so when func_1() returns the
    kernel uses the no-longer-valid return_instance of func_2().

    Change handle_trampoline() to unwind ->return_instances until we
    know that the next chain is alive or NULL, this ensures that the
    current chain is the last we need to report and free.

    Alternatively, every return_instance could use a unique
    trampoline_vaddr; in this case we could use it as a key. And
    this could solve the problem with sigaltstack() automatically.

    But this approach needs more changes, and it puts a "hard"
    limit on MAX_URETPROBE_DEPTH. Plus it cannot solve another
    problem partially fixed by the next patch.

    Note: this change has no effect on !x86; the arch-agnostic
    version of arch_uretprobe_is_alive() just returns "true".

    TODO: as documented by the previous change, arch_uretprobe_is_alive()
    can be fooled by sigaltstack/etc.

    Tested-by: Pratyush Anand
    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju
    Acked-by: Anton Arapov
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150721134021.GA4773@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Add the x86 specific version of arch_uretprobe_is_alive()
    helper. It returns true if the stack frame mangled by
    prepare_uretprobe() is still on stack. So if it returns false,
    we know that the probed function has already returned.

    We add the new return_instance->stack member and change the
    generic code to initialize it in prepare_uretprobe, but it
    should be equally useful for other architectures.

    TODO: this assumes that the probed application can't use
    multiple stacks (say sigaltstack). We will try to improve
    this logic later.
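
    A sketch of the x86 helper following that description (the stack
    grows down, so the frame is gone once sp moves above the recorded
    value):

        bool arch_uretprobe_is_alive(struct return_instance *ret,
                                     struct pt_regs *regs)
        {
            /* true while the mangled frame is still at or below sp */
            return regs->sp <= ret->stack;
        }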

    Tested-by: Pratyush Anand
    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju
    Acked-by: Anton Arapov
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150721134018.GA4766@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Add the new "weak" helper, arch_uretprobe_is_alive(), used by
    the next patches. It should return true if this return_instance
    is still valid. The arch agnostic version just always returns
    true.

    The patch exports "struct return_instance" for the architectures
    which want to override this hook. We can also cleanup
    prepare_uretprobe() if we pass the new return_instance to
    arch_uretprobe_hijack_return_addr().
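
    The arch-agnostic default can be sketched as:

        bool __weak arch_uretprobe_is_alive(struct return_instance *ret,
                                            struct pt_regs *regs)
        {
            return true;  /* overridden by arches that can do better */
        }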

    Tested-by: Pratyush Anand
    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju
    Acked-by: Anton Arapov
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150721134016.GA4762@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • No functional changes, preparation.

    Add the new helper, find_next_ret_chain(), which finds the first
    !chained entry and returns its ->next. Yes, it is suboptimal. We
    probably want to turn ->chained into a ->start_of_this_chain
    pointer and avoid another loop. But this needs the boring
    changes in dup_utask(), so let's do this later.

    Change the main loop in handle_trampoline() to unwind the stack
    until ri is equal to the pointer returned by this new helper.
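
    A sketch matching that description:

        static struct return_instance *find_next_ret_chain(struct return_instance *ri)
        {
            bool chained;

            do {
                chained = ri->chained;
                ri = ri->next;  /* the first frame is never chained */
            } while (chained);

            return ri;
        }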

    Tested-by: Pratyush Anand
    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju
    Acked-by: Anton Arapov
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150721134013.GA4755@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Turn the last pr_warn() in uprobes.c into uprobe_warn().

    While at it:

    - s/kzalloc/kmalloc, we initialize every member of 'ri'

    - remove the pointless comment above the obvious code

    Tested-by: Pratyush Anand
    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju
    Acked-by: Anton Arapov
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150721134010.GA4752@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • 1. It doesn't make sense to continue if handle_trampoline()
    fails; change handle_swbp() to always return after this call.

    2. Turn pr_warn() into uprobe_warn(), and change
    handle_trampoline() to send SIGILL on failure. It is pointless to
    return to user mode with the corrupted instruction_pointer() which
    we can't restore.

    Tested-by: Pratyush Anand
    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju
    Acked-by: Anton Arapov
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150721134008.GA4745@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • We can simplify uprobe_free_utask() and handle_uretprobe_chain()
    if we add a simple helper which does put_uprobe/kfree and
    returns the ->next return_instance.
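
    The helper can be sketched as (name assumed):

        static struct return_instance *free_ret_instance(struct return_instance *ri)
        {
            struct return_instance *next = ri->next;

            put_uprobe(ri->uprobe);
            kfree(ri);
            return next;
        }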

    Tested-by: Pratyush Anand
    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju
    Acked-by: Anton Arapov
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150721134006.GA4740@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Cosmetic. Add the new trivial helper, get_uprobe(). It matches
    put_uprobe() we already have and we can simplify a couple of its
    users.
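
    A sketch, assuming the atomic_t refcount uprobes used at the time:

        static struct uprobe *get_uprobe(struct uprobe *uprobe)
        {
            atomic_inc(&uprobe->ref);
            return uprobe;
        }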

    Tested-by: Pratyush Anand
    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju
    Acked-by: Anton Arapov
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150721134003.GA4736@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     

27 Jul, 2015

1 commit

  • A recent fix to the shadow timestamp inadvertently broke the running
    time accounting.

    We must not update the running timestamp if we fail to schedule the
    event; the event will not have run. This can (and did) result in
    negative total runtime, because the stopped timestamp was before the
    running timestamp (we 'started' but never stopped the event -- because
    it never really started, we didn't have to stop it either).
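
    A sketch of the corrected ordering (control flow and names
    approximated from the description):

        if (event->pmu->add(event, PERF_EF_START)) {
            /* ->add() failed: the event never ran, so the running
             * timestamp must stay untouched */
            event->state = PERF_EVENT_STATE_INACTIVE;
            goto out;
        }

        /* only a successful add counts as "running": */
        event->tstamp_running += tstamp - event->tstamp_stopped;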

    Reported-and-Tested-by: Vince Weaver
    Fixes: 72f669c0086f ("perf: Update shadow timestamp before add event")
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: stable@vger.kernel.org # 4.1
    Cc: Shaohua Li
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     

24 Jul, 2015

1 commit

  • There are already two events for context switches, namely the tracepoint
    sched:sched_switch and the software event context_switches.
    Unfortunately neither are suitable for use by non-privileged users for
    the purpose of synchronizing hardware trace data (e.g. Intel PT) to the
    context switch.

    Tracepoints are no good at all for non-privileged users because they
    need either CAP_SYS_ADMIN or /proc/sys/kernel/perf_event_paranoid <= -1.

    Acked-by: Peter Zijlstra (Intel)
    Tested-by: Jiri Olsa
    Cc: Andi Kleen
    Cc: Mathieu Poirier
    Cc: Pawel Moll
    Cc: Stephane Eranian
    Link: http://lkml.kernel.org/r/1437471846-26995-2-git-send-email-adrian.hunter@intel.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Adrian Hunter
     

06 Jul, 2015

1 commit

  • It's currently possible to drop the last refcount to the aux buffer
    from NMI context, which results in the expected fireworks.

    The refcounting needs a bigger overhaul, but to cure the immediate
    problem, delay the freeing by using an irq_work.
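
    The deferral pattern, sketched (handler and field names assumed):

        static void rb_irq_work(struct irq_work *work)
        {
            /* runs in hard-IRQ context, where the final free is safe */
        }

        /* setup, once: */
        init_irq_work(&rb->irq_work, rb_irq_work);

        /* NMI-safe release path: */
        if (atomic_dec_and_test(&rb->aux_refcount))
            irq_work_queue(&rb->irq_work);  /* defer the free out of NMI */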

    Reviewed-and-tested-by: Alexander Shishkin
    Reported-by: Vince Weaver
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150618103249.GK19282@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

05 Jul, 2015

1 commit

  • Pull more vfs updates from Al Viro:
    "Assorted VFS fixes and related cleanups (IMO the most interesting in
    that part are f_path-related things and Eric's descriptor-related
    stuff). UFS regression fixes (it got broken last cycle). 9P fixes.
    fs-cache series, DAX patches, Jan's file_remove_suid() work"

    [ I'd say this is much more than "fixes and related cleanups". The
    file_table locking rule change by Eric Dumazet is a rather big and
    fundamental update even if the patch isn't huge. - Linus ]

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
    9p: cope with bogus responses from server in p9_client_{read,write}
    p9_client_write(): avoid double p9_free_req()
    9p: forgetting to cancel request on interrupted zero-copy RPC
    dax: bdev_direct_access() may sleep
    block: Add support for DAX reads/writes to block devices
    dax: Use copy_from_iter_nocache
    dax: Add block size note to documentation
    fs/file.c: __fget() and dup2() atomicity rules
    fs/file.c: don't acquire files->file_lock in fd_install()
    fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
    vfs: avoid creation of inode number 0 in get_next_ino
    namei: make set_root_rcu() return void
    make simple_positive() public
    ufs: use dir_pages instead of ufs_dir_pages()
    pagemap.h: move dir_pages() over there
    remove the pointless include of lglock.h
    fs: cleanup slight list_entry abuse
    xfs: Correctly lock inode when removing suid and file capabilities
    fs: Call security_ops->inode_killpriv on truncate
    fs: Provide function telling whether file_remove_privs() will do anything
    ...

    Linus Torvalds
     

27 Jun, 2015

2 commits

  • Pull tracing updates from Steven Rostedt:
    "This patch series contains several clean ups and even a new trace
    clock "monitonic raw". Also some enhancements to make the ring buffer
    even faster. But the biggest and most noticeable change is the
    renaming of the ftrace* files, structures and variables that have to
    deal with trace events.

    Over the years I've had several developers tell me about their
    confusion with what ftrace is compared to events. Technically,
    "ftrace" is the infrastructure to do the function hooks, which include
    tracing and also helps with live kernel patching. But the trace
    events are a separate entity altogether, and the files that affect the
    trace events should not be named "ftrace". These include:

    include/trace/ftrace.h -> include/trace/trace_events.h
    include/linux/ftrace_event.h -> include/linux/trace_events.h

    Also, functions that are specific for trace events have also been renamed:

    ftrace_print_*() -> trace_print_*()
    (un)register_ftrace_event() -> (un)register_trace_event()
    ftrace_event_name() -> trace_event_name()
    ftrace_trigger_soft_disabled() -> trace_trigger_soft_disabled()
    ftrace_define_fields_##call() -> trace_define_fields_##call()
    ftrace_get_offsets_##call() -> trace_get_offsets_##call()

    Structures have been renamed:

    ftrace_event_file -> trace_event_file
    ftrace_event_{call,class} -> trace_event_{call,class}
    ftrace_event_buffer -> trace_event_buffer
    ftrace_subsystem_dir -> trace_subsystem_dir
    ftrace_event_raw_##call -> trace_event_raw_##call
    ftrace_event_data_offset_##call-> trace_event_data_offset_##call
    ftrace_event_type_funcs_##call -> trace_event_type_funcs_##call

    And a few various variables and flags have also been updated.

    This has been sitting in linux-next for some time, and I have not
    heard a single complaint about this rename breaking anything. Mostly
    because these functions, variables and structures are mostly internal
    to the tracing system and are seldom (if ever) used by anything
    external to that"

    * tag 'trace-v4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (33 commits)
    ring_buffer: Allow to exit the ring buffer benchmark immediately
    ring-buffer-benchmark: Fix the wrong type
    ring-buffer-benchmark: Fix the wrong param in module_param
    ring-buffer: Add enum names for the context levels
    ring-buffer: Remove useless unused tracing_off_permanent()
    ring-buffer: Give NMIs a chance to lock the reader_lock
    ring-buffer: Add trace_recursive checks to ring_buffer_write()
    ring-buffer: Allways do the trace_recursive checks
    ring-buffer: Move recursive check to per_cpu descriptor
    ring-buffer: Add unlikelys to make fast path the default
    tracing: Rename ftrace_get_offsets_##call() to trace_event_get_offsets_##call()
    tracing: Rename ftrace_define_fields_##call() to trace_event_define_fields_##call()
    tracing: Rename ftrace_event_type_funcs_##call to trace_event_type_funcs_##call
    tracing: Rename ftrace_data_offset_##call to trace_event_data_offset_##call
    tracing: Rename ftrace_raw_##call event structures to trace_event_raw_##call
    tracing: Rename ftrace_trigger_soft_disabled() to trace_trigger_soft_disabled()
    tracing: Rename FTRACE_EVENT_FL_* flags to EVENT_FILE_FL_*
    tracing: Rename struct ftrace_subsystem_dir to trace_subsystem_dir
    tracing: Rename ftrace_event_name() to trace_event_name()
    tracing: Rename FTRACE_MAX_EVENT to TRACE_EVENT_TYPE_MAX
    ...

    Linus Torvalds
     
  • Pull ARM updates from Russell King:
    "Bigger items included in this update are:

    - A series of updates from Arnd for ARM randconfig build failures
    - Updates from Dmitry for StrongARM SA-1100 to move IRQ handling to
    drivers/irqchip/
    - Move ARMs SP804 timer to drivers/clocksource/
    - Perf updates from Mark Rutland in preparation to move the ARM perf
    code into drivers/ so it can be shared with ARM64.
    - MCPM updates from Nicolas
    - Add support for taking platform serial number from DT
    - Re-implement Keystone2 physical address space switch to conform to
    architecture requirements
    - Clean up ARMv7 LPAE code, which goes in hand with the Keystone2
    changes.
    - L2C cleanups to avoid unlocking caches if we're prevented by the
    secure support to unlock.
    - Avoid cleaning a potentially dirty cache containing stale data on
    CPU initialisation
    - Add ARM-only entry point for secondary startup (for machines that
    can only call into a Thumb kernel in ARM mode). Same thing is also
    done for the resume entry point.
    - Provide arch_irqs_disabled via asm-generic
    - Enlarge ARMv7M vector table
    - Always use BFD linker for VDSO, as gold doesn't accept some of the
    options we need.
    - Fix an incorrect BSYM (for Thumb symbols) usage, and convert all
    BSYM compiler macros to a "badr" (for branch address).
    - Shut up compiler warnings provoked by our cmpxchg() implementation.
    - Ensure bad xchg sizes fail to link"

    * 'for-linus' of git://ftp.arm.linux.org.uk/~rmk/linux-arm: (75 commits)
    ARM: Fix build if CLKDEV_LOOKUP is not configured
    ARM: fix new BSYM() usage introduced via for-arm-soc branch
    ARM: 8383/1: nommu: avoid deprecated source register on mov
    ARM: 8391/1: l2c: add options to overwrite prefetching behavior
    ARM: 8390/1: irqflags: Get arch_irqs_disabled from asm-generic
    ARM: 8387/1: arm/mm/dma-mapping.c: Add arm_coherent_dma_mmap
    ARM: 8388/1: tcm: Don't crash when TCM banks are protected by TrustZone
    ARM: 8384/1: VDSO: force use of BFD linker
    ARM: 8385/1: VDSO: group link options
    ARM: cmpxchg: avoid warnings from macro-ized cmpxchg() implementations
    ARM: remove __bad_xchg definition
    ARM: 8369/1: ARMv7M: define size of vector table for Vybrid
    ARM: 8382/1: clocksource: make ARM_TIMER_SP804 depend on GENERIC_SCHED_CLOCK
    ARM: 8366/1: move Dual-Timer SP804 driver to drivers/clocksource
    ARM: 8365/1: introduce sp804_timer_disable and remove arm_timer.h inclusion
    ARM: 8364/1: fix BE32 module loading
    ARM: 8360/1: add secondary_startup_arm prototype in header file
    ARM: 8359/1: correct secondary_startup_arm mode
    ARM: proc-v7: sanitise and document registers around errata
    ARM: proc-v7: clean up MIDR access
    ...

    Linus Torvalds
     

24 Jun, 2015

1 commit


23 Jun, 2015

4 commits

  • Pull timer updates from Thomas Gleixner:
    "A rather largish update for everything time and timer related:

    - Cache footprint optimizations for both hrtimers and timer wheel

    - Lower the NOHZ impact on systems which have NOHZ or timer migration
    disabled at runtime.

    - Optimize run time overhead of hrtimer interrupt by making the clock
    offset updates smarter

    - hrtimer cleanups and removal of restrictions to tackle some
    problems in sched/perf

    - Some more leap second tweaks

    - Another round of changes addressing the 2038 problem

    - First step to change the internals of clock event devices by
    introducing the necessary infrastructure

    - Allow constant folding for usecs/msecs_to_jiffies()

    - The usual pile of clockevent/clocksource driver updates

    The hrtimer changes contain updates to sched, perf and x86 as they
    depend on them plus changes all over the tree to cleanup API changes
    and redundant code, which got copied all over the place. The y2038
    changes touch s390 to remove the last non 2038 safe code related to
    boot/persistent clock"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
    clocksource: Increase dependencies of timer-stm32 to limit build wreckage
    timer: Minimize nohz off overhead
    timer: Reduce timer migration overhead if disabled
    timer: Stats: Simplify the flags handling
    timer: Replace timer base by a cpu index
    timer: Use hlist for the timer wheel hash buckets
    timer: Remove FIFO "guarantee"
    timers: Sanitize catchup_timer_jiffies() usage
    hrtimer: Allow hrtimer::function() to free the timer
    seqcount: Introduce raw_write_seqcount_barrier()
    seqcount: Rename write_seqcount_barrier()
    hrtimer: Fix hrtimer_is_queued() hole
    hrtimer: Remove HRTIMER_STATE_MIGRATE
    selftest: Timers: Avoid signal deadlock in leap-a-day
    timekeeping: Copy the shadow-timekeeper over the real timekeeper last
    clockevents: Check state instead of mode in suspend/resume path
    selftests: timers: Add leap-second timer edge testing to leap-a-day.c
    ntp: Do leapsecond adjustment in adjtimex read path
    time: Prevent early expiry of hrtimers[CLOCK_REALTIME] at the leap second edge
    ntp: Introduce and use SECS_PER_DAY macro instead of 86400
    ...

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "These are the left over fixes from the v4.1 cycle"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf tools: Fix build breakage if prefix= is specified
    perf/x86: Honor the architectural performance monitoring version
    perf/x86/intel: Fix PMI handling for Intel PT
    perf/x86/intel/bts: Fix DS area sharing with x86_pmu events
    perf/x86: Add more Broadwell model numbers
    perf: Fix ring_buffer_attach() RCU sync, again

    Linus Torvalds
     
  • Pull perf updates from Ingo Molnar:
    "Kernel side changes mostly consist of work on x86 PMU drivers:

    - x86 Intel PT (hardware CPU tracer) improvements (Alexander
    Shishkin)

    - x86 Intel CQM (cache quality monitoring) improvements (Thomas
    Gleixner)

    - x86 Intel PEBSv3 support (Peter Zijlstra)

    - x86 Intel PEBS interrupt batching support for lower overhead
    sampling (Zheng Yan, Kan Liang)

    - x86 PMU scheduler fixes and improvements (Peter Zijlstra)

    There are too many tooling improvements to list them all - here are a
    few select highlights:

    'perf bench':

    - Introduce new 'perf bench futex' benchmark: 'wake-parallel', to
    measure parallel waker threads generating contention for kernel
    locks (hb->lock). (Davidlohr Bueso)

    'perf top', 'perf report':

    - Allow disabling/enabling events dynamically in 'perf top':
    a 'perf top' session can instantly become a 'perf report'
    one, i.e. going from dynamic analysis to a static one;
    returning to a dynamic one is possible. To toggle the
    modes, just press 'f' to 'freeze/unfreeze' the sampling. (Arnaldo Carvalho de Melo)

    - Make Ctrl-C stop processing on TUI, allowing interrupting the load of big
    perf.data files (Namhyung Kim)

    'perf probe': (Masami Hiramatsu)

    - Support glob wildcards for function name
    - Support $params special probe argument: Collect all function arguments
    - Make --line checks validate C-style function name.
    - Add --no-inlines option to avoid searching inline functions
    - Greatly speed up 'perf probe --list' by caching debuginfo.
    - Improve --filter support for 'perf probe', allowing the use of its
    arguments in other commands, such as --add, --del, etc.

    'perf sched':

    - Add option in 'perf sched' to merge like comms to lat output (Josef Bacik)

    Plus tons of infrastructure work - in particular preparation for
    upcoming threaded perf report support, but also lots of other work -
    and fixes and other improvements. See (much) more details in the
    shortlog and in the git log"

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (305 commits)
    perf tools: Configurable per thread proc map processing time out
    perf tools: Add time out to force stop proc map processing
    perf report: Fix sort__sym_cmp to also compare end of symbol
    perf hists browser: React to unassigned hotkey pressing
    perf top: Tell the user how to unfreeze events after pressing 'f'
    perf hists browser: Honour the help line provided by builtin-{top,report}.c
    perf hists browser: Do not exit when 'f' is pressed in 'report' mode
    perf top: Replace CTRL+z with 'f' as hotkey for enable/disable events
    perf annotate: Rename source_line_percent to source_line_samples
    perf annotate: Display total number of samples with --show-total-period
    perf tools: Ensure thread-stack is flushed
    perf top: Allow disabling/enabling events dynamicly
    perf evlist: Add toggle_enable() method
    perf trace: Fix race condition at the end of started workloads
    perf probe: Speed up perf probe --list by caching debuginfo
    perf probe: Show usage even if the last event is skipped
    perf tools: Move libtraceevent dynamic list to separated LDFLAGS variable
    perf tools: Fix a problem when opening old perf.data with different byte order
    perf tools: Ignore .config-detected in .gitignore
    perf probe: Fix to return error if no probe is added
    ...

    Linus Torvalds
     
  • Pull RCU updates from Ingo Molnar:

    - Continued initialization/Kconfig updates: hide most Kconfig options
    from unsuspecting users.

    There's now a single high level configuration option:

    *
    * RCU Subsystem
    *
    Make expert-level adjustments to RCU configuration (RCU_EXPERT) [N/y/?] (NEW)

    Which if answered in the negative, leaves us with a single
    interactive configuration option:

    Offload RCU callback processing from boot-selected CPUs (RCU_NOCB_CPU) [N/y/?] (NEW)

    All the rest of the RCU options are configured automatically. Later
    on we'll remove this single leftover configuration option as well.

    - Remove all uses of RCU-protected array indexes: replace the
    rcu_[access|dereference]_index_check() APIs with READ_ONCE() and
    rcu_lockdep_assert()

    - RCU CPU-hotplug cleanups

    - Updates to Tiny RCU: a race fix and further code shrinkage.

    - RCU torture-testing updates: fixes, speedups, cleanups and
    documentation updates.

    - Miscellaneous fixes

    - Documentation updates

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
    rcutorture: Allow repetition factors in Kconfig-fragment lists
    rcutorture: Display "make oldconfig" errors
    rcutorture: Update TREE_RCU-kconfig.txt
    rcutorture: Make rcutorture scripts force RCU_EXPERT
    rcutorture: Update configuration fragments for rcutree.rcu_fanout_exact
    rcutorture: TASKS_RCU set directly, so don't explicitly set it
    rcutorture: Test SRCU cleanup code path
    rcutorture: Replace barriers with smp_store_release() and smp_load_acquire()
    locktorture: Change longdelay_us to longdelay_ms
    rcutorture: Allow negative values of nreaders to oversubscribe
    rcutorture: Exchange TREE03 and TREE08 NR_CPUS, speed up CPU hotplug
    rcutorture: Exchange TREE03 and TREE04 geometries
    locktorture: fix deadlock in 'rw_lock_irq' type
    rcu: Correctly handle non-empty Tiny RCU callback list with none ready
    rcutorture: Test both RCU-sched and RCU-bh for Tiny RCU
    rcu: Further shrink Tiny RCU by making empty functions static inlines
    rcu: Conditionally compile RCU's eqs warnings
    rcu: Remove prompt for RCU implementation
    rcu: Make RCU able to tolerate undefined CONFIG_RCU_KTHREAD_PRIO
    rcu: Make RCU able to tolerate undefined CONFIG_RCU_FANOUT_LEAF
    ...

    Linus Torvalds
     

19 Jun, 2015

1 commit

  • While looking for other users of get_state/cond_sync, I found
    ring_buffer_attach(), and it looks obviously buggy.

    Don't we need to ensure that we "synchronize" _between_
    list_del() and list_add()?

    IOW, suppose that ring_buffer_attach() is preempted right after
    get_state_synchronize_rcu() and the gp completes before spin_lock().

    In this case cond_synchronize_rcu() does nothing and we reuse
    ->rb_entry without waiting for the gp in between.

    It also moves the ->rcu_pending check under "if (rb)", to make it
    more readable imo.
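
    A sketch of the corrected shape (locking and field names follow the
    commit being fixed; details approximated):

        if (event->rb) {
            spin_lock_irqsave(&event->rb->event_lock, flags);
            list_del_rcu(&event->rb_entry);
            spin_unlock_irqrestore(&event->rb->event_lock, flags);

            event->rcu_batches = get_state_synchronize_rcu();
            event->rcu_pending = 1;
        }

        if (rb) {
            if (event->rcu_pending) {
                /* wait out the gp *between* list_del and list_add */
                cond_synchronize_rcu(event->rcu_batches);
                event->rcu_pending = 0;
            }
            spin_lock_irqsave(&rb->event_lock, flags);
            list_add_rcu(&event->rb_entry, &rb->event_list);
            spin_unlock_irqrestore(&rb->event_lock, flags);
        }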

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dave@stgolabs.net
    Cc: der.herr@hofr.at
    Cc: josh@joshtriplett.org
    Cc: tj@kernel.org
    Fixes: b69cf53640da ("perf: Fix a race between ring_buffer_detach() and ring_buffer_attach()")
    Link: http://lkml.kernel.org/r/20150530200425.GA15748@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     

07 Jun, 2015

2 commits

  • After enlarging the PEBS interrupt threshold, there may be some mixed up
    PEBS samples which are discarded by the kernel.

    This patch makes the kernel emit a PERF_RECORD_LOST_SAMPLES record with
    the number of possible discarded records when it is impossible to demux
    the samples.

    It makes sure the user is not left in the dark about such discards.
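
    For orientation, the record's shape is along these lines (struct name
    illustrative; see the perf ABI headers for the authoritative layout):

        struct lost_samples_event {
            struct perf_event_header header;  /* PERF_RECORD_LOST_SAMPLES */
            __u64 lost;                       /* number of discarded records */
            /* followed by the usual sample_id trailer */
        };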

    Signed-off-by: Kan Liang
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: acme@infradead.org
    Cc: eranian@google.com
    Link: http://lkml.kernel.org/r/1431285195-14269-8-git-send-email-kan.liang@intel.com
    Signed-off-by: Ingo Molnar

    Kan Liang
     
  • When the PEBS interrupt threshold is larger than one record and the
    machine supports multiple PEBS events, the records of these events are
    mixed up and we need to demultiplex them.

    Demuxing the records is hard because the hardware is deficient. The
    hardware has two issues that, when combined, create impossible
    scenarios to demux.

    The first issue is that the 'status' field of the PEBS record is a copy
    of the GLOBAL_STATUS MSR at PEBS assist time. To see why this is a
    problem let us first describe the regular PEBS cycle:

    A) the CTRn value reaches 0:
    - the corresponding bit in GLOBAL_STATUS gets set
    - we start arming the hardware assist
    < some unspecified amount of time later -- this could cover multiple
    events of interest >

    B) the hardware assist is armed, any next event will trigger it

    C) a matching event happens:
    - the hardware assist triggers and generates a PEBS record
    this includes a copy of GLOBAL_STATUS at this moment
    - if we auto-reload we (re)set CTRn
    - we clear the relevant bit in GLOBAL_STATUS

    Now consider the following chain of events:

    A0, B0, A1, C0

    The event generated for counter 0 will include a status with counter 1
    set, even though it's not at all related to the record. A similar thing
    can happen with a !PEBS event if it just happens to overflow at the
    right moment.

    The second issue is that the hardware will only emit one record for two
    or more counters if the event that triggers the assist is 'close'. The
    'close' can be several cycles. In some cases even the complete assist,
    if the event is something that doesn't need retirement.

    For instance, consider this chain of events:

    A0, B0, A1, B1, C01

    Where C01 is an event that triggers both hardware assists, we will
    generate but a single record, but again with both counters listed in the
    status field.

    This time the record pertains to both events.

    Note that these two cases are different but indistinguishable from the
    data as generated. Therefore demuxing records with multiple PEBS bits
    (we can safely ignore status bits for !PEBS counters) is impossible.

    Furthermore we cannot emit the record to both events because that might
    cause a data leak -- the events might not have the same privileges -- so
    what this patch does is discard such events.

    The assumption/hope is that such discards will be rare.

    Here are some possible ways you may get a high discard rate.

    - when you count the same thing multiple times. But it is not a useful
    configuration.
    - you can be unfortunate if you measure with a userspace only PEBS
    event along with either a kernel or unrestricted PEBS event. Imagine
    the event triggering and setting the overflow flag right before
    entering the kernel. Then all kernel side events will end up with
    multiple bits set.
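
    The demux rule then reduces to something like this inside the drain
    loop (mask and counter names assumed):

        u64 pebs_status = p->status & cpuc->pebs_enabled;  /* drop !PEBS bits */

        if (hweight64(pebs_status) > 1) {
            /* two or more PEBS counters in one record: attribution
             * could leak data across events, so discard it as lost */
            discard++;
            continue;
        }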

    Signed-off-by: Yan, Zheng
    Signed-off-by: Kan Liang
    [ Changelog improvements. ]
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: acme@infradead.org
    Cc: eranian@google.com
    Link: http://lkml.kernel.org/r/1430940834-8964-4-git-send-email-kan.liang@intel.com
    Signed-off-by: Ingo Molnar

    Yan, Zheng