Eric Lee / smarc-fsl-linux-kernel

11 Sep, 2015

1 commit

2965faa5e kexec: split kexec_load syscall from kexec core code ... Browse Code »

There are two kexec load syscalls, kexec_load another and kexec_file_load.
kexec_file_load has been splited as kernel/kexec_file.c. In this patch I
split kexec_load syscall code to kernel/kexec.c.

And add a new kconfig option KEXEC_CORE, so we can disable kexec_load and
use kexec_file_load only, or vice verse.

The original requirement is from Ted Ts'o, he want kexec kernel signature
being checked with CONFIG_KEXEC_VERIFY_SIG enabled. But kexec-tools use
kexec_load syscall can bypass the checking.

Vivek Goyal proposed to create a common kconfig option so user can compile
in only one syscall for loading kexec kernel. KEXEC/KEXEC_FILE selects
KEXEC_CORE so that old config files still work.

Because there's general code need CONFIG_KEXEC_CORE, so I updated all the
architecture Kconfig with a new option KEXEC_CORE, and let KEXEC selects
KEXEC_CORE in arch Kconfig. Also updated general kernel code with to
kexec_load syscall.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Dave Young
Cc: Eric W. Biederman
Cc: Vivek Goyal
Cc: Petr Tesarik
Cc: Theodore Ts'o
Cc: Josh Boyer
Cc: David Howells
Cc: Geert Uytterhoeven
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Dave Young
2015-09-11 04:29:01 +0800

03 Sep, 2015

1 commit

dd5cdb48e Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next ... Browse Code »

Pull networking updates from David Miller:
"Another merge window, another set of networking changes. I've heard
rumblings that the lightweight tunnels infrastructure has been voted
networking change of the year. But what do I know?

1) Add conntrack support to openvswitch, from Joe Stringer.

2) Initial support for VRF (Virtual Routing and Forwarding), which
allows the segmentation of routing paths without using multiple
devices. There are some semantic kinks to work out still, but
this is a reasonably strong foundation. From David Ahern.

3) Remove spinlock fro act_bpf fast path, from Alexei Starovoitov.

4) Ignore route nexthops with a link down state in ipv6, just like
ipv4. From Andy Gospodarek.

5) Remove spinlock from fast path of act_gact and act_mirred, from
Eric Dumazet.

6) Document the DSA layer, from Florian Fainelli.

7) Add netconsole support to bcmgenet, systemport, and DSA. Also
from Florian Fainelli.

8) Add Mellanox Switch Driver and core infrastructure, from Jiri
Pirko.

9) Add support for "light weight tunnels", which allow for
encapsulation and decapsulation without bearing the overhead of a
full blown netdevice. From Thomas Graf, Jiri Benc, and a cast of
others.

10) Add Identifier Locator Addressing support for ipv6, from Tom
Herbert.

11) Support fragmented SKBs in iwlwifi, from Johannes Berg.

12) Allow perf PMUs to be accessed from eBPF programs, from Kaixu Xia.

13) Add BQL support to 3c59x driver, from Loganaden Velvindron.

14) Stop using a zero TX queue length to mean that a device shouldn't
have a qdisc attached, use an explicit flag instead. From Phil
Sutter.

15) Use generic geneve netdevice infrastructure in openvswitch, from
Pravin B Shelar.

16) Add infrastructure to avoid re-forwarding a packet in software
that was already forwarded by a hardware switch. From Scott
Feldman.

17) Allow AF_PACKET fanout function to be implemented in a bpf
program, from Willem de Bruijn"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1458 commits)
netfilter: nf_conntrack: make nf_ct_zone_dflt built-in
netfilter: nf_dup{4, 6}: fix build error when nf_conntrack disabled
net: fec: clear receive interrupts before processing a packet
ipv6: fix exthdrs offload registration in out_rt path
xen-netback: add support for multicast control
bgmac: Update fixed_phy_register()
sock, diag: fix panic in sock_diag_put_filterinfo
flow_dissector: Use 'const' where possible.
flow_dissector: Fix function argument ordering dependency
ixgbe: Resolve "initialized field overwritten" warnings
ixgbe: Remove bimodal SR-IOV disabling
ixgbe: Add support for reporting 2.5G link speed
ixgbe: fix bounds checking in ixgbe_setup_tc for 82598
ixgbe: support for ethtool set_rxfh
ixgbe: Avoid needless PHY access on copper phys
ixgbe: cleanup to use cached mask value
ixgbe: Remove second instance of lan_id variable
ixgbe: use kzalloc for allocating one thing
flow: Move __get_hash_from_flowi{4,6} into flow_dissector.c
ixgbe: Remove unused PCI bus types
...

Linus Torvalds
2015-09-03 23:08:17 +0800

22 Aug, 2015

1 commit

dc25b2589 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Conflicts:
drivers/net/usb/qmi_wwan.c

Overlapping additions of new device IDs to qmi_wwan.c

Signed-off-by: David S. Miller

David S. Miller
2015-08-22 02:44:04 +0800

12 Aug, 2015

4 commits

c2ad6b51e perf/ring-buffer: Clarify the use of page::private for high-order AUX allocations ... Browse Code »

A question [1] was raised about the use of page::private in AUX buffer
allocations, so let's add a clarification about its intended use.

The private field and flag are used by perf's rb_alloc_aux() path to
tell the pmu driver the size of each high-order allocation, so that the
driver can program those appropriately into its hardware. This only
matters for PMUs that don't support hardware scatter tables. Otherwise,
every page in the buffer is just a page.

This patch adds a comment about the private field to the AUX buffer
allocation path.

[1] http://marc.info/?l=linux-kernel&m=143803696607968

Reported-by: Mathieu Poirier
Signed-off-by: Alexander Shishkin
Signed-off-by: Peter Zijlstra (Intel)
Cc: Arnaldo Carvalho de Melo
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1438063204-665-1-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar

Alexander Shishkin
2015-08-12 17:43:20 +0800
3d325bf0d Merge branch 'perf/urgent' into perf/core, to pick up fixes before applying new changes ... Browse Code »

Signed-off-by: Ingo Molnar

Ingo Molnar
2015-08-12 17:39:19 +0800
c7999c6f3 perf: Fix PERF_EVENT_IOC_PERIOD migration race ... Browse Code »

I ran the perf fuzzer, which triggered some WARN()s which are due to
trying to stop/restart an event on the wrong CPU.

Use the normal IPI pattern to ensure we run the code on the correct CPU.

Signed-off-by: Peter Zijlstra (Intel)
Cc: Vince Weaver
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Fixes: bad7192b842c ("perf: Fix PERF_EVENT_IOC_PERIOD to force-reset the period")
Signed-off-by: Ingo Molnar

Peter Zijlstra
2015-08-12 17:37:22 +0800
ee9397a6f perf: Fix double-free of the AUX buffer ... Browse Code »

If rb->aux_refcount is decremented to zero before rb->refcount,
__rb_free_aux() may be called twice resulting in a double free of
rb->aux_pages. Fix this by adding a check to __rb_free_aux().

Signed-off-by: Ben Hutchings
Signed-off-by: Peter Zijlstra (Intel)
Cc: Alexander Shishkin
Cc: Arnaldo Carvalho de Melo
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: stable@vger.kernel.org
Fixes: 57ffc5ca679f ("perf: Fix AUX buffer refcounting")
Link: http://lkml.kernel.org/r/1437953468.12842.17.camel@decadent.org.uk
Signed-off-by: Ingo Molnar

Ben Hutchings
2015-08-12 17:37:21 +0800

10 Aug, 2015

1 commit

ffe8690c8 perf: add the necessary core perf APIs when accessing events counters in eBPF programs ... Browse Code »

This patch add three core perf APIs:
- perf_event_attrs(): export the struct perf_event_attr from struct
perf_event;
- perf_event_get(): get the struct perf_event from the given fd;
- perf_event_read_local(): read the events counters active on the
current CPU;
These APIs are needed when accessing events counters in eBPF programs.

The API perf_event_read_local() comes from Peter and I add the
corresponding SOB.

Signed-off-by: Kaixu Xia
Signed-off-by: Peter Zijlstra
Signed-off-by: David S. Miller

Kaixu Xia
2015-08-10 13:50:05 +0800

07 Aug, 2015

1 commit

04a22fae4 tracing, perf: Implement BPF programs attached to uprobes ... Browse Code »

By copying BPF related operation to uprobe processing path, this patch
allow users attach BPF programs to uprobes like what they are already
doing on kprobes.

After this patch, users are allowed to use PERF_EVENT_IOC_SET_BPF on a
uprobe perf event. Which make it possible to profile user space programs
and kernel events together using BPF.

Because of this patch, CONFIG_BPF_EVENTS should be selected by
CONFIG_UPROBE_EVENT to ensure trace_call_bpf() is compiled even if
KPROBE_EVENT is not set.

Signed-off-by: Wang Nan
Acked-by: Alexei Starovoitov
Cc: Brendan Gregg
Cc: Daniel Borkmann
Cc: David Ahern
Cc: He Kuang
Cc: Jiri Olsa
Cc: Kaixu Xia
Cc: Masami Hiramatsu
Cc: Namhyung Kim
Cc: Peter Zijlstra
Cc: Zefan Li
Cc: pi3orama@163.com
Link: http://lkml.kernel.org/r/1435716878-189507-3-git-send-email-wangnan0@huawei.com
Signed-off-by: Arnaldo Carvalho de Melo

Wang Nan
2015-08-07 02:29:14 +0800

04 Aug, 2015

2 commits

9a6694cfa perf/x86/intel/pt: Do not force sync packets on every schedule-in ... Browse Code »

Currently, the PT driver zeroes out the status register every time before
starting the event. However, all the writable bits are already taken care
of in pt_handle_status() function, except the new PacketByteCnt field,
which in new versions of PT contains the number of packet bytes written
since the last sync (PSB) packet. Zeroing it out before enabling PT forces
a sync packet to be written. This means that, with the existing code, a
sync packet (PSB and PSBEND, 18 bytes in total) will be generated every
time a PT event is scheduled in.

To avoid these unnecessary syncs and save a WRMSR in the fast path, this
patch changes the default behavior to not clear PacketByteCnt field, so
that the sync packets will be generated with the period specified as
"psb_period" attribute config field. This has little impact on the trace
data as the other packets that are normally sent within PSB+ (between PSB
and PSBEND) have their own generation scenarios which do not depend on the
sync packets.

One exception where we do need to force PSB like this when tracing starts,
so that the decoder has a clear sync point in the trace. For this purpose
we aready have hw::itrace_started flag, which we are currently using to
output PERF_RECORD_ITRACE_START. This patch moves setting itrace_started
from perf core to the pmu::start, where it should still be 0 on the very
first run.

Signed-off-by: Alexander Shishkin
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: hpa@zytor.com
Link: http://lkml.kernel.org/r/1438264104-16189-1-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar

Alexander Shishkin
2015-08-04 16:16:55 +0800
fed66e2cd perf: Fix fasync handling on inherited events ... Browse Code »

Vince reported that the fasync signal stuff doesn't work proper for
inherited events. So fix that.

Installing fasync allocates memory and sets filp->f_flags |= FASYNC,
which upon the demise of the file descriptor ensures the allocation is
freed and state is updated.

Now for perf, we can have the events stick around for a while after the
original FD is dead because of references from child events. So we
cannot copy the fasync pointer around. We can however consistently use
the parent's fasync, as that will be updated.

Reported-and-Tested-by: Vince Weaver
Signed-off-by: Peter Zijlstra (Intel)
Cc:
Cc: Arnaldo Carvalho deMelo
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1434011521.1495.71.camel@twins
Signed-off-by: Ingo Molnar

Peter Zijlstra
2015-08-04 15:57:52 +0800

31 Jul, 2015

15 commits

2a742cedc uprobes: Fix the waitqueue_active() check in xol_free_insn_slot() ... Browse Code »

The xol_free_insn_slot()->waitqueue_active() check is buggy. We
need mb() after we set the conditon for wait_event(), or
xol_take_insn_slot() can miss the wakeup.

Signed-off-by: Oleg Nesterov
Cc: Andy Lutomirski
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Pratyush Anand
Cc: Srikar Dronamraju
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/20150721134036.GA4799@redhat.com
Signed-off-by: Ingo Molnar

Oleg Nesterov
2015-07-31 16:38:07 +0800
704bde3cc uprobes: Use vm_special_mapping to name the XOL vma ... Browse Code »

Change xol_add_vma() to use _install_special_mapping(), this way
we can name the vma installed by uprobes. Currently it looks
like private anonymous mapping, this is confusing and
complicates the debugging. With this change /proc/$pid/maps
reports "[uprobes]".

As a side effect this will cause core dumps to include the XOL vma
and I think this is good; this can help to debug the problem if
the app crashed because it was probed.

Signed-off-by: Oleg Nesterov
Cc: Andy Lutomirski
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Pratyush Anand
Cc: Srikar Dronamraju
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/20150721134033.GA4796@redhat.com
Signed-off-by: Ingo Molnar

Oleg Nesterov
2015-07-31 16:38:06 +0800
f58bea2fe uprobes: Fix the usage of install_special_mapping() ... Browse Code »

install_special_mapping(pages) expects that "pages" is the zero-
terminated array while xol_add_vma() passes &area->page, this
means that special_mapping_fault() can wrongly use the next
member in xol_area (vaddr) as "struct page *".

Fortunately, this area is not expandable so pgoff != 0 isn't
possible (modulo bugs in special_mapping_vmops), but still this
does not look good.

Signed-off-by: Oleg Nesterov
Cc: Andy Lutomirski
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Pratyush Anand
Cc: Srikar Dronamraju
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/20150721134031.GA4789@redhat.com
Signed-off-by: Ingo Molnar

Oleg Nesterov
2015-07-31 16:38:06 +0800
db087ef69 uprobes/x86: Make arch_uretprobe_is_alive(RP_CHECK_CALL) more clever ... Browse Code »

The previous change documents that cleanup_return_instances()
can't always detect the dead frames, the stack can grow. But
there is one special case which imho worth fixing:
arch_uretprobe_is_alive() can return true when the stack didn't
actually grow, but the next "call" insn uses the already
invalidated frame.

Test-case:

#include
#include

jmp_buf jmp;
int nr = 1024;

void func_2(void)
{
if (--nr == 0)
return;
longjmp(jmp, 1);
}

void func_1(void)
{
setjmp(jmp);
func_2();
}

int main(void)
{
func_1();
return 0;
}

If you ret-probe func_1() and func_2() prepare_uretprobe() hits
the MAX_URETPROBE_DEPTH limit and "return" from func_2() is not
reported.

When we know that the new call is not chained, we can do the
more strict check. In this case "sp" points to the new ret-addr,
so every frame which uses the same "sp" must be dead. The only
complication is that arch_uretprobe_is_alive() needs to know was
it chained or not, so we add the new RP_CHECK_CHAIN_CALL enum
and change prepare_uretprobe() to pass RP_CHECK_CALL only if
!chained.

Note: arch_uretprobe_is_alive() could also re-read *sp and check
if this word is still trampoline_vaddr. This could obviously
improve the logic, but I would like to avoid another
copy_from_user() especially in the case when we can't avoid the
false "alive == T" positives.

Tested-by: Pratyush Anand
Signed-off-by: Oleg Nesterov
Acked-by: Srikar Dronamraju
Acked-by: Anton Arapov
Cc: Andy Lutomirski
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/20150721134028.GA4786@redhat.com
Signed-off-by: Ingo Molnar

Oleg Nesterov
2015-07-31 16:38:06 +0800
86dcb702e uprobes: Add the "enum rp_check ctx" arg to arch_uretprobe_is_alive() ... Browse Code »

arch/x86 doesn't care (so far), but as Pratyush Anand pointed
out other architectures might want why arch_uretprobe_is_alive()
was called and use different checks depending on the context.
Add the new argument to distinguish 2 callers.

Tested-by: Pratyush Anand
Signed-off-by: Oleg Nesterov
Acked-by: Srikar Dronamraju
Acked-by: Anton Arapov
Cc: Andy Lutomirski
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/20150721134026.GA4779@redhat.com
Signed-off-by: Ingo Molnar

Oleg Nesterov
2015-07-31 16:38:06 +0800
a5b7e1a89 uprobes: Change prepare_uretprobe() to (try to) flush the dead frames ... Browse Code »

Change prepare_uretprobe() to flush the !arch_uretprobe_is_alive()
return_instance's. This is not needed correctness-wise, but can help
to avoid the failure caused by MAX_URETPROBE_DEPTH.

Note: in this case arch_uretprobe_is_alive() can be false
positive, the stack can grow after longjmp(). Unfortunately, the
kernel can't 100% solve this problem, but see the next patch.

Tested-by: Pratyush Anand
Signed-off-by: Oleg Nesterov
Acked-by: Srikar Dronamraju
Acked-by: Anton Arapov
Cc: Andy Lutomirski
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/20150721134023.GA4776@redhat.com
Signed-off-by: Ingo Molnar

Oleg Nesterov
2015-07-31 16:38:05 +0800
5eeb50de4 uprobes: Change handle_trampoline() to flush the frames invalidated by longjmp() ... Browse Code »

Test-case:

#include
#include

jmp_buf jmp;

void func_2(void)
{
longjmp(jmp, 1);
}

void func_1(void)
{
if (setjmp(jmp))
return;
func_2();
printf("ERR!! I am running on the caller's stack\n");
}

int main(void)
{
func_1();
return 0;
}

fails if you probe func_1() and func_2() because
handle_trampoline() assumes that the probed function should must
return and hit the bp installed be prepare_uretprobe(). But in
this case func_2() does not return, so when func_1() returns the
kernel uses the no longer valid return_instance of func_2().

Change handle_trampoline() to unwind ->return_instances until we
know that the next chain is alive or NULL, this ensures that the
current chain is the last we need to report and free.

Alternatively, every return_instance could use unique
trampoline_vaddr, in this case we could use it as a key. And
this could solve the problem with sigaltstack() automatically.

But this approach needs more changes, and it puts the "hard"
limit on MAX_URETPROBE_DEPTH. Plus it can not solve another
problem partially fixed by the next patch.

Note: this change has no effect on !x86, the arch-agnostic
version of arch_uretprobe_is_alive() just returns "true".

TODO: as documented by the previous change, arch_uretprobe_is_alive()
can be fooled by sigaltstack/etc.

Tested-by: Pratyush Anand
Signed-off-by: Oleg Nesterov
Acked-by: Srikar Dronamraju
Acked-by: Anton Arapov
Cc: Andy Lutomirski
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/20150721134021.GA4773@redhat.com
Signed-off-by: Ingo Molnar

Oleg Nesterov
2015-07-31 16:38:05 +0800
7b868e480 uprobes/x86: Reimplement arch_uretprobe_is_alive() ... Browse Code »

Add the x86 specific version of arch_uretprobe_is_alive()
helper. It returns true if the stack frame mangled by
prepare_uretprobe() is still on stack. So if it returns false,
we know that the probed function has already returned.

We add the new return_instance->stack member and change the
generic code to initialize it in prepare_uretprobe, but it
should be equally useful for other architectures.

TODO: this assumes that the probed application can't use
multiple stacks (say sigaltstack). We will try to improve
this logic later.

Tested-by: Pratyush Anand
Signed-off-by: Oleg Nesterov
Acked-by: Srikar Dronamraju
Acked-by: Anton Arapov
Cc: Andy Lutomirski
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/20150721134018.GA4766@redhat.com
Signed-off-by: Ingo Molnar

Oleg Nesterov
2015-07-31 16:38:05 +0800
97da89767 uprobes: Export 'struct return_instance', introduce arch_uretprobe_is_alive() ... Browse Code »

Add the new "weak" helper, arch_uretprobe_is_alive(), used by
the next patches. It should return true if this return_instance
is still valid. The arch agnostic version just always returns
true.

The patch exports "struct return_instance" for the architectures
which want to override this hook. We can also cleanup
prepare_uretprobe() if we pass the new return_instance to
arch_uretprobe_hijack_return_addr().

Tested-by: Pratyush Anand
Signed-off-by: Oleg Nesterov
Acked-by: Srikar Dronamraju
Acked-by: Anton Arapov
Cc: Andy Lutomirski
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/20150721134016.GA4762@redhat.com
Signed-off-by: Ingo Molnar

Oleg Nesterov
2015-07-31 16:38:04 +0800
a83cfeb92 uprobes: Change handle_trampoline() to find the next chain beforehand ... Browse Code »

No functional changes, preparation.

Add the new helper, find_next_ret_chain(), which finds the first
!chained entry and returns its ->next. Yes, it is suboptimal. We
probably want to turn ->chained into ->start_of_this_chain
pointer and avoid another loop. But this needs the boring
changes in dup_utask(), so lets do this later.

Change the main loop in handle_trampoline() to unwind the stack
until ri is equal to the pointer returned by this new helper.

Tested-by: Pratyush Anand
Signed-off-by: Oleg Nesterov
Acked-by: Srikar Dronamraju
Acked-by: Anton Arapov
Cc: Andy Lutomirski
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/20150721134013.GA4755@redhat.com
Signed-off-by: Ingo Molnar

Oleg Nesterov
2015-07-31 16:38:04 +0800
6c58d0e4c uprobes: Change prepare_uretprobe() to use uprobe_warn() ... Browse Code »

Turn the last pr_warn() in uprobes.c into uprobe_warn().

While at it:

- s/kzalloc/kmalloc, we initialize every member of 'ri'

- remove the pointless comment above the obvious code

Tested-by: Pratyush Anand
Signed-off-by: Oleg Nesterov
Acked-by: Srikar Dronamraju
Acked-by: Anton Arapov
Cc: Andy Lutomirski
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/20150721134010.GA4752@redhat.com
Signed-off-by: Ingo Molnar

Oleg Nesterov
2015-07-31 16:38:04 +0800
0b5256c7f uprobes: Send SIGILL if handle_trampoline() fails ... Browse Code »

1. It doesn't make sense to continue if handle_trampoline()
fails, change handle_swbp() to always return after this call.

2. Turn pr_warn() into uprobe_warn(), and change
handle_trampoline() to send SIGILL on failure. It is pointless to
return to user mode with the corrupted instruction_pointer() which
we can't restore.

Tested-by: Pratyush Anand
Signed-off-by: Oleg Nesterov
Acked-by: Srikar Dronamraju
Acked-by: Anton Arapov
Cc: Andy Lutomirski
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/20150721134008.GA4745@redhat.com
Signed-off-by: Ingo Molnar

Oleg Nesterov
2015-07-31 16:38:03 +0800
2bb5e840e uprobes: Introduce free_ret_instance() ... Browse Code »

We can simplify uprobe_free_utask() and handle_uretprobe_chain()
if we add a simple helper which does put_uprobe/kfree and
returns the ->next return_instance.

Tested-by: Pratyush Anand
Signed-off-by: Oleg Nesterov
Acked-by: Srikar Dronamraju
Acked-by: Anton Arapov
Cc: Andy Lutomirski
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/20150721134006.GA4740@redhat.com
Signed-off-by: Ingo Molnar

Oleg Nesterov
2015-07-31 16:38:03 +0800
f231722a2 uprobes: Introduce get_uprobe() ... Browse Code »

Cosmetic. Add the new trivial helper, get_uprobe(). It matches
put_uprobe() we already have and we can simplify a couple of its
users.

Tested-by: Pratyush Anand
Signed-off-by: Oleg Nesterov
Acked-by: Srikar Dronamraju
Acked-by: Anton Arapov
Cc: Andy Lutomirski
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/20150721134003.GA4736@redhat.com
Signed-off-by: Ingo Molnar

Oleg Nesterov
2015-07-31 16:38:03 +0800
acd632eb6 Merge branch 'perf/urgent' into perf/core, to merge fixes before pulling more changes ... Browse Code »

Signed-off-by: Ingo Molnar

Ingo Molnar
2015-07-31 15:59:28 +0800

27 Jul, 2015

1 commit

00a2916f7 perf: Fix running time accounting ... Browse Code »

A recent fix to the shadow timestamp inadvertly broke the running time
accounting.

We must not update the running timestamp if we fail to schedule the
event, the event will not have ran. This can (and did) result in
negative total runtime because the stopped timestamp was before the
running timestamp (we 'started' but never stopped the event -- because
it never really started we didn't have to stop it either).

Reported-and-Tested-by: Vince Weaver
Fixes: 72f669c0086f ("perf: Update shadow timestamp before add event")
Signed-off-by: Peter Zijlstra (Intel)
Cc: stable@vger.kernel.org # 4.1
Cc: Shaohua Li
Signed-off-by: Thomas Gleixner

Peter Zijlstra
2015-07-27 19:52:19 +0800

24 Jul, 2015

1 commit

45ac1403f perf: Add PERF_RECORD_SWITCH to indicate context switches ... Browse Code »

There are already two events for context switches, namely the tracepoint
sched:sched_switch and the software event context_switches.
Unfortunately neither are suitable for use by non-privileged users for
the purpose of synchronizing hardware trace data (e.g. Intel PT) to the
context switch.

Tracepoints are no good at all for non-privileged users because they
need either CAP_SYS_ADMIN or /proc/sys/kernel/perf_event_paranoid
Acked-by: Peter Zijlstra (Intel)
Tested-by: Jiri Olsa
Cc: Andi Kleen
Cc: Mathieu Poirier
Cc: Pawel Moll
Cc: Stephane Eranian
Link: http://lkml.kernel.org/r/1437471846-26995-2-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo

Adrian Hunter
2015-07-24 09:51:12 +0800

06 Jul, 2015

1 commit

57ffc5ca6 perf: Fix AUX buffer refcounting ... Browse Code »

Its currently possible to drop the last refcount to the aux buffer
from NMI context, which results in the expected fireworks.

The refcounting needs a bigger overhaul, but to cure the immediate
problem, delay the freeing by using an irq_work.

Reviewed-and-tested-by: Alexander Shishkin
Reported-by: Vince Weaver
Signed-off-by: Peter Zijlstra (Intel)
Cc: Arnaldo Carvalho de Melo
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Stephane Eranian
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/20150618103249.GK19282@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar

Peter Zijlstra
2015-07-06 20:08:30 +0800

05 Jul, 2015

1 commit

1dc51b828 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull more vfs updates from Al Viro:
"Assorted VFS fixes and related cleanups (IMO the most interesting in
that part are f_path-related things and Eric's descriptor-related
stuff). UFS regression fixes (it got broken last cycle). 9P fixes.
fs-cache series, DAX patches, Jan's file_remove_suid() work"

[ I'd say this is much more than "fixes and related cleanups". The
file_table locking rule change by Eric Dumazet is a rather big and
fundamental update even if the patch isn't huge. - Linus ]

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
9p: cope with bogus responses from server in p9_client_{read,write}
p9_client_write(): avoid double p9_free_req()
9p: forgetting to cancel request on interrupted zero-copy RPC
dax: bdev_direct_access() may sleep
block: Add support for DAX reads/writes to block devices
dax: Use copy_from_iter_nocache
dax: Add block size note to documentation
fs/file.c: __fget() and dup2() atomicity rules
fs/file.c: don't acquire files->file_lock in fd_install()
fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
vfs: avoid creation of inode number 0 in get_next_ino
namei: make set_root_rcu() return void
make simple_positive() public
ufs: use dir_pages instead of ufs_dir_pages()
pagemap.h: move dir_pages() over there
remove the pointless include of lglock.h
fs: cleanup slight list_entry abuse
xfs: Correctly lock inode when removing suid and file capabilities
fs: Call security_ops->inode_killpriv on truncate
fs: Provide function telling whether file_remove_privs() will do anything
...

Linus Torvalds
2015-07-05 10:36:06 +0800

27 Jun, 2015

2 commits

e38260825 Merge tag 'trace-v4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace ... Browse Code »

Pull tracing updates from Steven Rostedt:
"This patch series contains several clean ups and even a new trace
clock "monitonic raw". Also some enhancements to make the ring buffer
even faster. But the biggest and most noticeable change is the
renaming of the ftrace* files, structures and variables that have to
deal with trace events.

Over the years I've had several developers tell me about their
confusion with what ftrace is compared to events. Technically,
"ftrace" is the infrastructure to do the function hooks, which include
tracing and also helps with live kernel patching. But the trace
events are a separate entity altogether, and the files that affect the
trace events should not be named "ftrace". These include:

include/trace/ftrace.h -> include/trace/trace_events.h
include/linux/ftrace_event.h -> include/linux/trace_events.h

Also, functions that are specific for trace events have also been renamed:

ftrace_print_*() -> trace_print_*()
(un)register_ftrace_event() -> (un)register_trace_event()
ftrace_event_name() -> trace_event_name()
ftrace_trigger_soft_disabled() -> trace_trigger_soft_disabled()
ftrace_define_fields_##call() -> trace_define_fields_##call()
ftrace_get_offsets_##call() -> trace_get_offsets_##call()

Structures have been renamed:

ftrace_event_file -> trace_event_file
ftrace_event_{call,class} -> trace_event_{call,class}
ftrace_event_buffer -> trace_event_buffer
ftrace_subsystem_dir -> trace_subsystem_dir
ftrace_event_raw_##call -> trace_event_raw_##call
ftrace_event_data_offset_##call-> trace_event_data_offset_##call
ftrace_event_type_funcs_##call -> trace_event_type_funcs_##call

And a few various variables and flags have also been updated.

This has been sitting in linux-next for some time, and I have not
heard a single complaint about this rename breaking anything. Mostly
because these functions, variables and structures are mostly internal
to the tracing system and are seldom (if ever) used by anything
external to that"

* tag 'trace-v4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (33 commits)
ring_buffer: Allow to exit the ring buffer benchmark immediately
ring-buffer-benchmark: Fix the wrong type
ring-buffer-benchmark: Fix the wrong param in module_param
ring-buffer: Add enum names for the context levels
ring-buffer: Remove useless unused tracing_off_permanent()
ring-buffer: Give NMIs a chance to lock the reader_lock
ring-buffer: Add trace_recursive checks to ring_buffer_write()
ring-buffer: Allways do the trace_recursive checks
ring-buffer: Move recursive check to per_cpu descriptor
ring-buffer: Add unlikelys to make fast path the default
tracing: Rename ftrace_get_offsets_##call() to trace_event_get_offsets_##call()
tracing: Rename ftrace_define_fields_##call() to trace_event_define_fields_##call()
tracing: Rename ftrace_event_type_funcs_##call to trace_event_type_funcs_##call
tracing: Rename ftrace_data_offset_##call to trace_event_data_offset_##call
tracing: Rename ftrace_raw_##call event structures to trace_event_raw_##call
tracing: Rename ftrace_trigger_soft_disabled() to trace_trigger_soft_disabled()
tracing: Rename FTRACE_EVENT_FL_* flags to EVENT_FILE_FL_*
tracing: Rename struct ftrace_subsystem_dir to trace_subsystem_dir
tracing: Rename ftrace_event_name() to trace_event_name()
tracing: Rename FTRACE_MAX_EVENT to TRACE_EVENT_TYPE_MAX
...

Linus Torvalds
2015-06-27 05:02:43 +0800
e8a0b37d2 Merge branch 'for-linus' of git://ftp.arm.linux.org.uk/~rmk/linux-arm ... Browse Code »

Pull ARM updates from Russell King:
"Bigger items included in this update are:

- A series of updates from Arnd for ARM randconfig build failures
- Updates from Dmitry for StrongARM SA-1100 to move IRQ handling to
drivers/irqchip/
- Move ARMs SP804 timer to drivers/clocksource/
- Perf updates from Mark Rutland in preparation to move the ARM perf
code into drivers/ so it can be shared with ARM64.
- MCPM updates from Nicolas
- Add support for taking platform serial number from DT
- Re-implement Keystone2 physical address space switch to conform to
architecture requirements
- Clean up ARMv7 LPAE code, which goes in hand with the Keystone2
changes.
- L2C cleanups to avoid unlocking caches if we're prevented by the
secure support to unlock.
- Avoid cleaning a potentially dirty cache containing stale data on
CPU initialisation
- Add ARM-only entry point for secondary startup (for machines that
can only call into a Thumb kernel in ARM mode). Same thing is also
done for the resume entry point.
- Provide arch_irqs_disabled via asm-generic
- Enlarge ARMv7M vector table
- Always use BFD linker for VDSO, as gold doesn't accept some of the
options we need.
- Fix an incorrect BSYM (for Thumb symbols) usage, and convert all
BSYM compiler macros to a "badr" (for branch address).
- Shut up compiler warnings provoked by our cmpxchg() implementation.
- Ensure bad xchg sizes fail to link"

* 'for-linus' of git://ftp.arm.linux.org.uk/~rmk/linux-arm: (75 commits)
ARM: Fix build if CLKDEV_LOOKUP is not configured
ARM: fix new BSYM() usage introduced via for-arm-soc branch
ARM: 8383/1: nommu: avoid deprecated source register on mov
ARM: 8391/1: l2c: add options to overwrite prefetching behavior
ARM: 8390/1: irqflags: Get arch_irqs_disabled from asm-generic
ARM: 8387/1: arm/mm/dma-mapping.c: Add arm_coherent_dma_mmap
ARM: 8388/1: tcm: Don't crash when TCM banks are protected by TrustZone
ARM: 8384/1: VDSO: force use of BFD linker
ARM: 8385/1: VDSO: group link options
ARM: cmpxchg: avoid warnings from macro-ized cmpxchg() implementations
ARM: remove __bad_xchg definition
ARM: 8369/1: ARMv7M: define size of vector table for Vybrid
ARM: 8382/1: clocksource: make ARM_TIMER_SP804 depend on GENERIC_SCHED_CLOCK
ARM: 8366/1: move Dual-Timer SP804 driver to drivers/clocksource
ARM: 8365/1: introduce sp804_timer_disable and remove arm_timer.h inclusion
ARM: 8364/1: fix BE32 module loading
ARM: 8360/1: add secondary_startup_arm prototype in header file
ARM: 8359/1: correct secondary_startup_arm mode
ARM: proc-v7: sanitise and document registers around errata
ARM: proc-v7: clean up MIDR access
...

Linus Torvalds
2015-06-27 03:20:00 +0800

24 Jun, 2015

1 commit

9bf39ab2a vfs: add file_path() helper ... Browse Code »

Turn
d_path(&file->f_path, ...);
into
file_path(file, ...);

Signed-off-by: Miklos Szeredi
Signed-off-by: Al Viro

Miklos Szeredi
2015-06-24 06:00:05 +0800

23 Jun, 2015

4 commits

43224b96a Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull timer updates from Thomas Gleixner:
"A rather largish update for everything time and timer related:

- Cache footprint optimizations for both hrtimers and timer wheel

- Lower the NOHZ impact on systems which have NOHZ or timer migration
disabled at runtime.

- Optimize run time overhead of hrtimer interrupt by making the clock
offset updates smarter

- hrtimer cleanups and removal of restrictions to tackle some
problems in sched/perf

- Some more leap second tweaks

- Another round of changes addressing the 2038 problem

- First step to change the internals of clock event devices by
introducing the necessary infrastructure

- Allow constant folding for usecs/msecs_to_jiffies()

- The usual pile of clockevent/clocksource driver updates

The hrtimer changes contain updates to sched, perf and x86 as they
depend on them plus changes all over the tree to cleanup API changes
and redundant code, which got copied all over the place. The y2038
changes touch s390 to remove the last non 2038 safe code related to
boot/persistant clock"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
clocksource: Increase dependencies of timer-stm32 to limit build wreckage
timer: Minimize nohz off overhead
timer: Reduce timer migration overhead if disabled
timer: Stats: Simplify the flags handling
timer: Replace timer base by a cpu index
timer: Use hlist for the timer wheel hash buckets
timer: Remove FIFO "guarantee"
timers: Sanitize catchup_timer_jiffies() usage
hrtimer: Allow hrtimer::function() to free the timer
seqcount: Introduce raw_write_seqcount_barrier()
seqcount: Rename write_seqcount_barrier()
hrtimer: Fix hrtimer_is_queued() hole
hrtimer: Remove HRTIMER_STATE_MIGRATE
selftest: Timers: Avoid signal deadlock in leap-a-day
timekeeping: Copy the shadow-timekeeper over the real timekeeper last
clockevents: Check state instead of mode in suspend/resume path
selftests: timers: Add leap-second timer edge testing to leap-a-day.c
ntp: Do leapsecond adjustment in adjtimex read path
time: Prevent early expiry of hrtimers[CLOCK_REALTIME] at the leap second edge
ntp: Introduce and use SECS_PER_DAY macro instead of 86400
...

Linus Torvalds
2015-06-23 09:57:44 +0800
6bc4c3ad3 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull perf fixes from Ingo Molnar:
"These are the left over fixes from the v4.1 cycle"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf tools: Fix build breakage if prefix= is specified
perf/x86: Honor the architectural performance monitoring version
perf/x86/intel: Fix PMI handling for Intel PT
perf/x86/intel/bts: Fix DS area sharing with x86_pmu events
perf/x86: Add more Broadwell model numbers
perf: Fix ring_buffer_attach() RCU sync, again

Linus Torvalds
2015-06-23 06:45:41 +0800
c58267e9f Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull perf updates from Ingo Molnar:
"Kernel side changes mostly consist of work on x86 PMU drivers:

- x86 Intel PT (hardware CPU tracer) improvements (Alexander
Shishkin)

- x86 Intel CQM (cache quality monitoring) improvements (Thomas
Gleixner)

- x86 Intel PEBSv3 support (Peter Zijlstra)

- x86 Intel PEBS interrupt batching support for lower overhead
sampling (Zheng Yan, Kan Liang)

- x86 PMU scheduler fixes and improvements (Peter Zijlstra)

There's too many tooling improvements to list them all - here are a
few select highlights:

'perf bench':

- Introduce new 'perf bench futex' benchmark: 'wake-parallel', to
measure parallel waker threads generating contention for kernel
locks (hb->lock). (Davidlohr Bueso)

'perf top', 'perf report':

- Allow disabling/enabling events dynamicaly in 'perf top':
a 'perf top' session can instantly become a 'perf report'
one, i.e. going from dynamic analysis to a static one,
returning to a dynamic one is possible, to toogle the
modes, just press 'f' to 'freeze/unfreeze' the sampling. (Arnaldo Carvalho de Melo)

- Make Ctrl-C stop processing on TUI, allowing interrupting the load of big
perf.data files (Namhyung Kim)

'perf probe': (Masami Hiramatsu)

- Support glob wildcards for function name
- Support $params special probe argument: Collect all function arguments
- Make --line checks validate C-style function name.
- Add --no-inlines option to avoid searching inline functions
- Greatly speed up 'perf probe --list' by caching debuginfo.
- Improve --filter support for 'perf probe', allowing using its arguments
on other commands, as --add, --del, etc.

'perf sched':

- Add option in 'perf sched' to merge like comms to lat output (Josef Bacik)

Plus tons of infrastructure work - in particular preparation for
upcoming threaded perf report support, but also lots of other work -
and fixes and other improvements. See (much) more details in the
shortlog and in the git log"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (305 commits)
perf tools: Configurable per thread proc map processing time out
perf tools: Add time out to force stop proc map processing
perf report: Fix sort__sym_cmp to also compare end of symbol
perf hists browser: React to unassigned hotkey pressing
perf top: Tell the user how to unfreeze events after pressing 'f'
perf hists browser: Honour the help line provided by builtin-{top,report}.c
perf hists browser: Do not exit when 'f' is pressed in 'report' mode
perf top: Replace CTRL+z with 'f' as hotkey for enable/disable events
perf annotate: Rename source_line_percent to source_line_samples
perf annotate: Display total number of samples with --show-total-period
perf tools: Ensure thread-stack is flushed
perf top: Allow disabling/enabling events dynamicly
perf evlist: Add toggle_enable() method
perf trace: Fix race condition at the end of started workloads
perf probe: Speed up perf probe --list by caching debuginfo
perf probe: Show usage even if the last event is skipped
perf tools: Move libtraceevent dynamic list to separated LDFLAGS variable
perf tools: Fix a problem when opening old perf.data with different byte order
perf tools: Ignore .config-detected in .gitignore
perf probe: Fix to return error if no probe is added
...

Linus Torvalds
2015-06-23 06:19:21 +0800
fc934d401 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip ... Browse Code »

Pull RCU updates from Ingo Molnar:

- Continued initialization/Kconfig updates: hide most Kconfig options
from unsuspecting users.

There's now a single high level configuration option:

*
* RCU Subsystem
*
Make expert-level adjustments to RCU configuration (RCU_EXPERT) [N/y/?] (NEW)

Which if answered in the negative, leaves us with a single
interactive configuration option:

Offload RCU callback processing from boot-selected CPUs (RCU_NOCB_CPU) [N/y/?] (NEW)

All the rest of the RCU options are configured automatically. Later
on we'll remove this single leftover configuration option as well.

- Remove all uses of RCU-protected array indexes: replace the
rcu_[access|dereference]_index_check() APIs with READ_ONCE() and
rcu_lockdep_assert()

- RCU CPU-hotplug cleanups

- Updates to Tiny RCU: a race fix and further code shrinkage.

- RCU torture-testing updates: fixes, speedups, cleanups and
documentation updates.

- Miscellaneous fixes

- Documentation updates

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
rcutorture: Allow repetition factors in Kconfig-fragment lists
rcutorture: Display "make oldconfig" errors
rcutorture: Update TREE_RCU-kconfig.txt
rcutorture: Make rcutorture scripts force RCU_EXPERT
rcutorture: Update configuration fragments for rcutree.rcu_fanout_exact
rcutorture: TASKS_RCU set directly, so don't explicitly set it
rcutorture: Test SRCU cleanup code path
rcutorture: Replace barriers with smp_store_release() and smp_load_acquire()
locktorture: Change longdelay_us to longdelay_ms
rcutorture: Allow negative values of nreaders to oversubscribe
rcutorture: Exchange TREE03 and TREE08 NR_CPUS, speed up CPU hotplug
rcutorture: Exchange TREE03 and TREE04 geometries
locktorture: fix deadlock in 'rw_lock_irq' type
rcu: Correctly handle non-empty Tiny RCU callback list with none ready
rcutorture: Test both RCU-sched and RCU-bh for Tiny RCU
rcu: Further shrink Tiny RCU by making empty functions static inlines
rcu: Conditionally compile RCU's eqs warnings
rcu: Remove prompt for RCU implementation
rcu: Make RCU able to tolerate undefined CONFIG_RCU_KTHREAD_PRIO
rcu: Make RCU able to tolerate undefined CONFIG_RCU_FANOUT_LEAF
...

Linus Torvalds
2015-06-23 05:01:01 +0800

19 Jun, 2015

1 commit

2f993cf09 perf: Fix ring_buffer_attach() RCU sync, again ... Browse Code »

While looking for other users of get_state/cond_sync. I Found
ring_buffer_attach() and it looks obviously buggy?

Don't we need to ensure that we have "synchronize" _between_
list_del() and list_add() ?

IOW. Suppose that ring_buffer_attach() preempts right_after
get_state_synchronize_rcu() and gp completes before spin_lock().

In this case cond_synchronize_rcu() does nothing and we reuse
->rb_entry without waiting for gp in between?

It also moves the ->rcu_pending check under "if (rb)", to make it
more readable imo.

Signed-off-by: Oleg Nesterov
Signed-off-by: Peter Zijlstra (Intel)
Cc: Alexander Shishkin
Cc: Andrew Morton
Cc: Andy Lutomirski
Cc: Borislav Petkov
Cc: Brian Gerst
Cc: Denys Vlasenko
Cc: H. Peter Anvin
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: dave@stgolabs.net
Cc: der.herr@hofr.at
Cc: josh@joshtriplett.org
Cc: tj@kernel.org
Fixes: b69cf53640da ("perf: Fix a race between ring_buffer_detach() and ring_buffer_attach()")
Link: http://lkml.kernel.org/r/20150530200425.GA15748@redhat.com
Signed-off-by: Ingo Molnar

Oleg Nesterov
2015-06-19 15:38:45 +0800

07 Jun, 2015

2 commits

f38b0dbb4 perf/x86/intel: Introduce PERF_RECORD_LOST_SAMPLES ... Browse Code »

After enlarging the PEBS interrupt threshold, there may be some mixed up
PEBS samples which are discarded by the kernel.

This patch makes the kernel emit a PERF_RECORD_LOST_SAMPLES record with
the number of possible discarded records when it is impossible to demux
the samples.

It makes sure the user is not left in the dark about such discards.

Signed-off-by: Kan Liang
Signed-off-by: Peter Zijlstra (Intel)
Cc: Andrew Morton
Cc: H. Peter Anvin
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: acme@infradead.org
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1431285195-14269-8-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar

Kan Liang
2015-06-07 22:09:02 +0800
21509084f perf/x86/intel: Handle multiple records in the PEBS buffer ... Browse Code »

When the PEBS interrupt threshold is larger than one record and the
machine supports multiple PEBS events, the records of these events are
mixed up and we need to demultiplex them.

Demuxing the records is hard because the hardware is deficient. The
hardware has two issues that, when combined, create impossible
scenarios to demux.

The first issue is that the 'status' field of the PEBS record is a copy
of the GLOBAL_STATUS MSR at PEBS assist time. To see why this is a
problem let us first describe the regular PEBS cycle:

A) the CTRn value reaches 0:
- the corresponding bit in GLOBAL_STATUS gets set
- we start arming the hardware assist
< some unspecified amount of time later -- this could cover multiple
events of interest >

B) the hardware assist is armed, any next event will trigger it

C) a matching event happens:
- the hardware assist triggers and generates a PEBS record
this includes a copy of GLOBAL_STATUS at this moment
- if we auto-reload we (re)set CTRn
- we clear the relevant bit in GLOBAL_STATUS

Now consider the following chain of events:

A0, B0, A1, C0

The event generated for counter 0 will include a status with counter 1
set, even though its not at all related to the record. A similar thing
can happen with a !PEBS event if it just happens to overflow at the
right moment.

The second issue is that the hardware will only emit one record for two
or more counters if the event that triggers the assist is 'close'. The
'close' can be several cycles. In some cases even the complete assist,
if the event is something that doesn't need retirement.

For instance, consider this chain of events:

A0, B0, A1, B1, C01

Where C01 is an event that triggers both hardware assists, we will
generate but a single record, but again with both counters listed in the
status field.

This time the record pertains to both events.

Note that these two cases are different but undistinguishable with the
data as generated. Therefore demuxing records with multiple PEBS bits
(we can safely ignore status bits for !PEBS counters) is impossible.

Furthermore we cannot emit the record to both events because that might
cause a data leak -- the events might not have the same privileges -- so
what this patch does is discard such events.

The assumption/hope is that such discards will be rare.

Here lists some possible ways you may get high discard rate.

- when you count the same thing multiple times. But it is not a useful
configuration.
- you can be unfortunate if you measure with a userspace only PEBS
event along with either a kernel or unrestricted PEBS event. Imagine
the event triggering and setting the overflow flag right before
entering the kernel. Then all kernel side events will end up with
multiple bits set.

Signed-off-by: Yan, Zheng
Signed-off-by: Kan Liang
[ Changelog improvements. ]
Signed-off-by: Peter Zijlstra (Intel)
Cc: Andrew Morton
Cc: H. Peter Anvin
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: acme@infradead.org
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1430940834-8964-4-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar

Signed-off-by: Ingo Molnar

Yan, Zheng
2015-06-07 22:08:45 +0800