25 Mar, 2016

1 commit

  • Pull perf fixes from Ingo Molnar:
    "This tree contains various perf fixes on the kernel side, plus three
    hw/event-enablement late additions:

    - Intel Memory Bandwidth Monitoring events and handling
    - the AMD Accumulated Power Mechanism reporting facility
    - more IOMMU events

    ... and a final round of perf tooling updates/fixes"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
    perf llvm: Use strerror_r instead of the thread unsafe strerror one
    perf llvm: Use realpath to canonicalize paths
    perf tools: Unexport some methods unused outside strbuf.c
    perf probe: No need to use formatting strbuf method
    perf help: Use asprintf instead of adhoc equivalents
    perf tools: Remove unused perf_pathdup, xstrdup functions
    perf tools: Do not include stringify.h from the kernel sources
    tools include: Copy linux/stringify.h from the kernel
    tools lib traceevent: Remove redundant CPU output
    perf tools: Remove needless 'extern' from function prototypes
    perf tools: Simplify die() mechanism
    perf tools: Remove unused DIE_IF macro
    perf script: Remove lots of unused arguments
    perf thread: Rename perf_event__preprocess_sample_addr to thread__resolve
    perf machine: Rename perf_event__preprocess_sample to machine__resolve
    perf tools: Add cpumode to struct perf_sample
    perf tests: Forward the perf_sample in the dwarf unwind test
    perf tools: Remove misplaced __maybe_unused
    perf list: Fix documentation of :ppp
    perf bench numa: Fix assertion for nodes bitfield
    ...

    Linus Torvalds
     

21 Mar, 2016

5 commits

  • Document some of the hotplug notifier usage.

    Requested-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Andy Lutomirski
    Cc: Arnaldo Carvalho de Melo
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: David Ahern
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Sasha reported:

    [ 3494.030114] UBSAN: Undefined behaviour in kernel/events/ring_buffer.c:685:22
    [ 3494.030647] shift exponent -1 is negative

    Andrey spotted that this is because:

    It happens if nr_pages = 0:

        rb->page_order = ilog2(nr_pages);

    Fix it by making both assignments conditional on nr_pages: otherwise they
    should both be 0 anyway, and will be, because the structure is allocated
    with kzalloc().
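
    A minimal sketch of that guard, with the field names taken from the
    message above (the exact spot in the ring-buffer allocation path is an
    assumption):

        /*
         * Sketch only: make the order computation (and its companion
         * assignment) conditional; rb came from kzalloc(), so both
         * fields are already 0 when nr_pages == 0.
         */
        if (nr_pages) {
                rb->nr_pages   = 1;
                rb->page_order = ilog2(nr_pages);
        }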

    Reported-by: Sasha Levin
    Reported-by: Andrey Ryabinin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Andy Lutomirski
    Cc: Arnaldo Carvalho de Melo
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: David Ahern
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/20160129141751.GA407@worktop
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • There were two problems with the dynamic interrupt throttle mechanism,
    both triggered by the same action.

    When you (or perf_fuzzer) write a huge value into
    /proc/sys/kernel/perf_event_max_sample_rate the computed
    perf_sample_allowed_ns becomes 0. This effectively disables the whole
    dynamic throttle.

    This is fixed by ensuring update_perf_cpu_limits() never sets the
    value to 0. However, we allow disabling of the dynamic throttle by
    writing 100 to /proc/sys/kernel/perf_cpu_time_max_percent. This will
    generate a warning in dmesg.

    The second problem is that by setting the max_sample_rate to a huge
    number, the adaptive process can take a few tries, since it halves the
    limit each time. Change that to directly compute a new value based on
    the observed duration.
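
    A hedged sketch of the first fix; perf_sample_allowed_ns and the sysctl
    names come from the text above, while the function body itself is
    illustrative rather than the exact kernel hunk:

        static void update_perf_cpu_limits(void)
        {
                u64 tmp = perf_sample_period_ns;

                tmp *= sysctl_perf_cpu_time_max_percent;
                do_div(tmp, 100);
                if (!tmp)
                        tmp = 1;  /* never 0: 0 would disable the throttle */

                WRITE_ONCE(perf_sample_allowed_ns, tmp);
        }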

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Andy Lutomirski
    Cc: Arnaldo Carvalho de Melo
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: David Ahern
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • It's possible to issue IOC_PERIOD while the event is throttled; this
    would re-start the event, and the next tick would then try to unthrottle
    it, only to find the event wasn't actually stopped anymore.

    This would tickle a WARN in the x86 PMU code, which isn't expecting to
    start a !stopped event.
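
    A sketch of the kind of fix described for the period update path
    (PERF_EF_* and the hw.interrupts bookkeeping are standard perf
    internals, but their exact use here is an assumption):

        if (event->state == PERF_EVENT_STATE_ACTIVE) {
                /*
                 * If the event is currently throttled, clear that state
                 * before stopping/restarting it, so the next tick's
                 * unthrottle path does not try to start a running event.
                 */
                if (event->hw.interrupts == MAX_INTERRUPTS) {
                        event->hw.interrupts = 0;
                        perf_log_throttle(event, 1);
                }
                event->pmu->stop(event, PERF_EF_UPDATE);
        }

        /* ...update the sample period/frequency here... */

        if (event->state == PERF_EVENT_STATE_ACTIVE)
                event->pmu->start(event, PERF_EF_RELOAD);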

    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Alexander Shishkin
    Cc: Andy Lutomirski
    Cc: Arnaldo Carvalho de Melo
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: David Ahern
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: dvyukov@google.com
    Cc: oleg@redhat.com
    Cc: panand@redhat.com
    Cc: sasha.levin@oracle.com
    Cc: vince@deater.net
    Link: http://lkml.kernel.org/r/20160310143924.GR6356@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Pull x86 protection key support from Ingo Molnar:
    "This tree adds support for a new memory protection hardware feature
    that is available in upcoming Intel CPUs: 'protection keys' (pkeys).

    There's a background article at LWN.net:

    https://lwn.net/Articles/643797/

    The gist is that protection keys allow the encoding of
    user-controllable permission masks in the pte. So instead of having a
    fixed protection mask in the pte (which needs a system call to change
    and works on a per page basis), the user can map a (handful of)
    protection mask variants and can change the masks runtime relatively
    cheaply, without having to change every single page in the affected
    virtual memory range.

    This allows the dynamic switching of the protection bits of large
    amounts of virtual memory, via user-space instructions. It also
    allows more precise control of MMU permission bits: for example the
    executable bit is separate from the read bit (see more about that
    below).

    This tree adds the MM infrastructure and low level x86 glue needed for
    that, plus it adds a high level API to make use of protection keys -
    if a user-space application calls:

    mmap(..., PROT_EXEC);

    or

    mprotect(ptr, sz, PROT_EXEC);

    (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
    this special case, and will set a special protection key on this
    memory range. It also sets the appropriate bits in the Protection
    Keys User Rights (PKRU) register so that the memory becomes unreadable
    and unwritable.

    So using protection keys the kernel is able to implement 'true'
    PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
    PROT_READ as well. Unreadable executable mappings have security
    advantages: they cannot be read via information leaks to figure out
    ASLR details, nor can they be scanned for ROP gadgets - and they
    cannot be used by exploits for data purposes either.

    We know of no user-space code that relies on pure PROT_EXEC
    mappings today, but binary loaders could start making use of this new
    feature to map binaries and libraries in a more secure fashion.

    There is other pending pkeys work that offers more high level system
    call APIs to manage protection keys - but those are not part of this
    pull request.

    Right now there's a Kconfig that controls this feature
    (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
    (like most x86 CPU feature enablement code that has no runtime
    overhead), but it's not user-configurable at the moment. If there's
    any serious problem with this then we can make it configurable and/or
    flip the default"
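
    For illustration, a minimal user-space sketch of such an execute-only
    mapping (this is an assumption about usage, not part of the pull
    request; it needs a pkeys-capable CPU and kernel to actually become
    unreadable):

        #include <stdio.h>
        #include <sys/mman.h>

        int main(void)
        {
                /* PROT_EXEC only, no PROT_READ/PROT_WRITE */
                void *p = mmap(NULL, 4096, PROT_EXEC,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (p == MAP_FAILED) {
                        perror("mmap");
                        return 1;
                }
                /*
                 * On pkeys-capable hardware the kernel assigns an
                 * execute-only protection key to this range, so a data
                 * read through p would fault.
                 */
                munmap(p, 4096);
                return 0;
        }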

    * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
    x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
    mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
    x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
    mm/core, x86/mm/pkeys: Add execute-only protection keys support
    x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
    x86/mm/pkeys: Allow kernel to modify user pkey rights register
    x86/fpu: Allow setting of XSAVE state
    x86/mm: Factor out LDT init from context init
    mm/core, x86/mm/pkeys: Add arch_validate_pkey()
    mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
    x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
    x86/mm/pkeys: Add Kconfig prompt to existing config option
    x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
    x86/mm/pkeys: Dump PKRU with other kernel registers
    mm/core, x86/mm/pkeys: Differentiate instruction fetches
    x86/mm/pkeys: Optimize fault handling in access_error()
    mm/core: Do not enforce PKEY permissions on remote mm access
    um, pkeys: Add UML arch_*_access_permitted() methods
    mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
    x86/mm/gup: Simplify get_user_pages() PTE bit handling
    ...

    Linus Torvalds
     

20 Mar, 2016

1 commit

  • Pull networking updates from David Miller:
    "Highlights:

    1) Support more Realtek wireless chips, from Jes Sorenson.

    2) New BPF types for per-cpu hash and array maps, from Alexei
    Starovoitov.

    3) Make several TCP sysctls per-namespace, from Nikolay Borisov.

    4) Allow the use of SO_REUSEPORT in order to do per-thread processing
    of incoming TCP/UDP connections. The muxing can be done using a
    BPF program which hashes the incoming packet. From Craig Gallek.

    5) Add a multiplexer for TCP streams, to provide a messaged based
    interface. BPF programs can be used to determine the message
    boundaries. From Tom Herbert.

    6) Add 802.1AE MACSEC support, from Sabrina Dubroca.

    7) Avoid factorial complexity when taking down an inetdev interface
    with lots of configured addresses. We were doing things like
    traversing the entire address list for each address removed, and
    flushing the entire netfilter conntrack table for every address as
    well.

    8) Add and use SKB bulk free infrastructure, from Jesper Brouer.

    9) Allow offloading u32 classifiers to hardware, and implement for
    ixgbe, from John Fastabend.

    10) Allow configuring IRQ coalescing parameters on a per-queue basis,
    from Kan Liang.

    11) Extend ethtool so that larger link mode masks can be supported.
    From David Decotigny.

    12) Introduce devlink, which can be used to configure port link types
    (ethernet vs Infiniband, etc.), port splitting, and switch device
    level attributes as a whole. From Jiri Pirko.

    13) Hardware offload support for flower classifiers, from Amir Vadai.

    14) Add "Local Checksum Offload". Basically, for a tunneled packet
    the checksum of the outer header is 'constant' (because with the
    checksum field filled into the inner protocol header, the payload
    of the outer frame checksums to 'zero'), and we can take advantage
    of that in various ways. From Edward Cree"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1548 commits)
    bonding: fix bond_get_stats()
    net: bcmgenet: fix dma api length mismatch
    net/mlx4_core: Fix backward compatibility on VFs
    phy: mdio-thunder: Fix some Kconfig typos
    lan78xx: add ndo_get_stats64
    lan78xx: handle statistics counter rollover
    RDS: TCP: Remove unused constant
    RDS: TCP: Add sysctl tunables for sndbuf/rcvbuf on rds-tcp socket
    net: smc911x: convert pxa dma to dmaengine
    team: remove duplicate set of flag IFF_MULTICAST
    bonding: remove duplicate set of flag IFF_MULTICAST
    net: fix a comment typo
    ethernet: micrel: fix some error codes
    ip_tunnels, bpf: define IP_TUNNEL_OPTS_MAX and use it
    bpf, dst: add and use dst_tclassid helper
    bpf: make skb->tc_classid also readable
    net: mvneta: bm: clarify dependencies
    cls_bpf: reset class and reuse major in da
    ldmvsw: Checkpatch sunvnet.c and sunvnet_common.c
    ldmvsw: Add ldmvsw.c driver code
    ...

    Linus Torvalds
     

16 Mar, 2016

1 commit

  • Pull x86 asm updates from Ingo Molnar:
    "This is another big update. Main changes are:

    - lots of x86 system call (and other traps/exceptions) entry code
    enhancements. In particular the complex parts of the 64-bit entry
    code have been migrated to C code as well, and a number of dusty
    corners have been refreshed. (Andy Lutomirski)

    - vDSO special mapping robustification and general cleanups (Andy
    Lutomirski)

    - cpufeature refactoring, cleanups and speedups (Borislav Petkov)

    - lots of other changes ..."

    * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
    x86/cpufeature: Enable new AVX-512 features
    x86/entry/traps: Show unhandled signal for i386 in do_trap()
    x86/entry: Call enter_from_user_mode() with IRQs off
    x86/entry/32: Change INT80 to be an interrupt gate
    x86/entry: Improve system call entry comments
    x86/entry: Remove TIF_SINGLESTEP entry work
    x86/entry/32: Add and check a stack canary for the SYSENTER stack
    x86/entry/32: Simplify and fix up the SYSENTER stack #DB/NMI fixup
    x86/entry: Only allocate space for tss_struct::SYSENTER_stack if needed
    x86/entry: Vastly simplify SYSENTER TF (single-step) handling
    x86/entry/traps: Clear DR6 early in do_debug() and improve the comment
    x86/entry/traps: Clear TIF_BLOCKSTEP on all debug exceptions
    x86/entry/32: Restore FLAGS on SYSEXIT
    x86/entry/32: Filter NT and speed up AC filtering in SYSENTER
    x86/entry/compat: In SYSENTER, sink AC clearing below the existing FLAGS test
    selftests/x86: In syscall_nt, test NT|TF as well
    x86/asm-offsets: Remove PARAVIRT_enabled
    x86/entry/32: Introduce and use X86_BUG_ESPFIX instead of paravirt_enabled
    uprobes: __create_xol_area() must nullify xol_mapping.fault
    x86/cpufeature: Create a new synthetic cpu capability for machine check recovery
    ...

    Linus Torvalds
     

15 Mar, 2016

1 commit

  • Pull NOHZ updates from Ingo Molnar:
    "NOHZ enhancements, by Frederic Weisbecker, which reorganizes/refactors
    the NOHZ 'can the tick be stopped?' infrastructure and related code to
    be data driven, and harmonizes the naming and handling of all the
    various properties"

    [ This makes the ugly "fetch_or()" macro that the scheduler used
    internally a new generic helper, and does a bad job at it.

    I'm pulling it, but I've asked Ingo and Frederic to get this
    fixed up ]

    * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched-clock: Migrate to use new tick dependency mask model
    posix-cpu-timers: Migrate to use new tick dependency mask model
    sched: Migrate sched to use new tick dependency mask model
    sched: Account rr tasks
    perf: Migrate perf to use new tick dependency mask model
    nohz: Use enum code for tick stop failure tracing message
    nohz: New tick dependency mask
    nohz: Implement wide kick on top of irq work
    atomic: Export fetch_or()

    Linus Torvalds
     

09 Mar, 2016

1 commit


08 Mar, 2016

2 commits

  • …rederic/linux-dynticks into timers/nohz

    Pull nohz enhancements from Frederic Weisbecker:

    "Currently in nohz full configs, the tick dependency is checked
    asynchronously by nohz code from interrupt and context switch for each
    concerned subsystem with a set of functions provided by these subsystems.
    Such functions are made of many conditions and details that can be
    heavyweight, as they are called on the fast path: sched_can_stop_tick(),
    posix_cpu_timer_can_stop_tick(), perf_event_can_stop_tick()...

    Thomas suggested a few months ago to make that tick dependency check
    synchronous. Instead of checking subsystems details from each interrupt
    to guess if the tick can be stopped, every subsystem that may have a tick
    dependency should set itself a flag specifying the state of that
    dependency. This way we can verify whether the tick can be stopped with a
    single lightweight mask check on the fast path.

    This conversion from a pull to a push model to implement tick dependency
    is the core feature of this patchset that is split into:

    * Nohz wide kick simplification
    * Improve nohz tracing
    * Introduce tick dependency mask
    * Migrate scheduler, posix timers, perf events and sched clock tick
    dependencies to the tick dependency mask."
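
    A sketch of the push model in terms of the dependency-mask API this
    series introduces (bit and helper names as in the series; the final
    check is paraphrased):

        /* A subsystem flags its tick dependency... */
        tick_dep_set(TICK_DEP_BIT_POSIX_TIMER);           /* system-wide */
        tick_dep_set_cpu(cpu, TICK_DEP_BIT_PERF_EVENTS);  /* one CPU */
        tick_dep_set_task(task, TICK_DEP_BIT_SCHED);      /* one task */

        /* ...and clears it when the dependency goes away. */
        tick_dep_clear(TICK_DEP_BIT_POSIX_TIMER);

        /* The nohz code then only needs a lightweight mask test, roughly: */
        if (!tick_dep_mask)
                /* safe to stop the tick */;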

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     
  • The error path in perf_event_open() is such that asking for a sampling
    event on a PMU that doesn't generate interrupts will end up dropping
    the perf_sched_count even though it hasn't been incremented for this
    event yet.

    Given a sufficient amount of these calls, we'll end up disabling
    scheduler's jump label even though we'd still have active events in the
    system, thereby facilitating the arrival of the infernal regions upon us.

    I'm fixing this by moving account_event() inside perf_event_alloc().

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: vince@deater.net
    Link: http://lkml.kernel.org/r/1456917854-29427-1-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     

07 Mar, 2016

1 commit


02 Mar, 2016

1 commit

  • Instead of providing asynchronous checks for the nohz subsystem to verify
    perf event tick dependency, migrate perf to the new mask.

    Perf needs the tick for two situations:

    1) Freq events. We could set the tick dependency when those are
    installed on a CPU context. But setting a global dependency on top of
    the global freq events accounting is much easier. If people want that
    to be optimized, we can still refine that on the per-CPU tick dependency
    level. This patch doesn't change the current behaviour anyway.

    2) Throttled events: this is a per-cpu dependency.
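
    A sketch of those two uses with the new mask API (exact call sites are
    assumptions based on the description above):

        /* 1) Freq events: one global dependency while any freq event exists. */
        if (atomic_inc_return(&nr_freq_events) == 1)
                tick_dep_set(TICK_DEP_BIT_PERF_EVENTS);

        /* 2) Throttled events: a per-CPU dependency while throttled... */
        tick_dep_set_cpu(smp_processor_id(), TICK_DEP_BIT_PERF_EVENTS);

        /* ...cleared again once the events on that CPU are unthrottled. */
        tick_dep_clear_cpu(smp_processor_id(), TICK_DEP_BIT_PERF_EVENTS);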

    Reviewed-by: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Chris Metcalf
    Cc: Ingo Molnar
    Cc: Luiz Capitulino
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Cc: Viresh Kumar
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     

29 Feb, 2016

3 commits

  • Required to use it in modular perf drivers.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andi Kleen
    Cc: Arnaldo Carvalho de Melo
    Cc: Borislav Petkov
    Cc: Harish Chegondi
    Cc: Jacob Pan
    Cc: Jiri Olsa
    Cc: Kan Liang
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Vince Weaver
    Cc: linux-kernel@vger.kernel.org
    Link: http://lkml.kernel.org/r/20160222221012.930735780@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • As Jiri pointed out, this recent commit:

    f872f5400cc0 ("mm: Add a vm_special_mapping.fault() method")

    breaks uprobes: __create_xol_area() doesn't initialize the new ->fault()
    method and this obviously leads to kernel crash when the application
    tries to execute the probed insn after bp hit.

    We probably want to add uprobes_special_mapping_fault(); this would allow
    turning xol_area->xol_mapping into a single instance of vm_special_mapping.
    But we need a simple fix, so let's change __create_xol_area() to nullify
    the new member, as Jiri suggests.
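
    A minimal sketch of that fix (the placement inside __create_xol_area()
    and the neighbouring assignments are assumptions):

        /*
         * xol_area is not zero-initialized, so the newly added member
         * must be nullified explicitly; otherwise the special-mapping
         * fault path dereferences a garbage pointer.
         */
        area->xol_mapping.name  = "[uprobes]";
        area->xol_mapping.fault = NULL;
        area->xol_mapping.pages = area->pages;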

    Suggested-by: Jiri Olsa
    Reported-by: Jiri Olsa
    Signed-off-by: Oleg Nesterov
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Pratyush Anand
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20160227221128.GA29565@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     

25 Feb, 2016

12 commits

  • Since there is no serialization between task_function_call() doing
    task_curr() and the other CPU doing context switches, we could end
    up not sending an IPI even if we had to.

    And I'm not sure I still buy my own argument we're OK.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dvyukov@google.com
    Cc: eranian@google.com
    Cc: oleg@redhat.com
    Cc: panand@redhat.com
    Cc: sasha.levin@oracle.com
    Cc: vince@deater.net
    Link: http://lkml.kernel.org/r/20160224174948.340031200@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Completely reworks perf_install_in_context() (again!) in order to
    ensure that there will be no ctx time hole between add_event_to_ctx()
    and any potential ctx_sched_in().

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dvyukov@google.com
    Cc: eranian@google.com
    Cc: oleg@redhat.com
    Cc: panand@redhat.com
    Cc: sasha.levin@oracle.com
    Cc: vince@deater.net
    Link: http://lkml.kernel.org/r/20160224174948.279399438@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Similar to the perf_enable_on_exec(), ensure that event timings are
    consistent across perf_event_enable().

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dvyukov@google.com
    Cc: eranian@google.com
    Cc: oleg@redhat.com
    Cc: panand@redhat.com
    Cc: sasha.levin@oracle.com
    Cc: vince@deater.net
    Link: http://lkml.kernel.org/r/20160224174948.218288698@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The recent commit 3e349507d12d ("perf: Fix perf_enable_on_exec() event
    scheduling") introduced a problem by moving task_ctx_sched_out() from
    before __perf_event_mask_enable() to after it.

    The overlooked consequence of that change is that task_ctx_sched_out()
    would update the ctx time fields, and now __perf_event_mask_enable()
    uses stale time.

    In order to fix this, explicitly stop our context's time before
    enabling the event(s).

    Reported-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dvyukov@google.com
    Cc: eranian@google.com
    Cc: panand@redhat.com
    Cc: sasha.levin@oracle.com
    Cc: vince@deater.net
    Fixes: 3e349507d12d ("perf: Fix perf_enable_on_exec() event scheduling")
    Link: http://lkml.kernel.org/r/20160224174948.159242158@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Currently any ctx_sched_in() call will re-start the ctx time tracking;
    this means that calls like:

        ctx_sched_in(.event_type = EVENT_PINNED);
        ctx_sched_in(.event_type = EVENT_FLEXIBLE);

    will have a hole in their ctx time tracking. This is likely harmless
    but can confuse things a little. By adding EVENT_TIME, we can have the
    first ctx_sched_in() (is_active: 0 -> !0) start the time and any
    further ctx_sched_in() will leave the timestamps alone.

    Secondly, this allows for an early disable like:

        ctx_sched_out(.event_type = EVENT_TIME);

    which would update the ctx time (if the ctx is active) and any further
    calls to ctx_sched_out() would not further modify the ctx time.

    For ctx_sched_in() any 0 -> !0 transition will automatically include
    EVENT_TIME.

    For ctx_sched_out(), any transition that clears EVENT_ALL will
    automatically clear EVENT_TIME.

    These two rules ensure that under normal circumstances we need not
    bother with EVENT_TIME and get natural ctx time behaviour.
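
    Sketched in terms of the is_active bookkeeping, the two rules read
    roughly as follows (illustrative, not the exact hunks):

        /* ctx_sched_in(): a 0 -> !0 transition also starts time tracking. */
        ctx->is_active |= (event_type | EVENT_TIME);

        /* ctx_sched_out(): clearing the last of EVENT_ALL also stops time. */
        ctx->is_active &= ~event_type;
        if (!(ctx->is_active & EVENT_ALL))
                ctx->is_active &= ~EVENT_TIME;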

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dvyukov@google.com
    Cc: eranian@google.com
    Cc: oleg@redhat.com
    Cc: panand@redhat.com
    Cc: sasha.levin@oracle.com
    Cc: vince@deater.net
    Link: http://lkml.kernel.org/r/20160224174948.100446561@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Because event_sched_out() checks event->pending_disable _before_
    actually disabling the event, it can happen that the event fires after
    it checks but before it gets disabled.

    This would leave event->pending_disable set and the queued irq_work
    will try and process it.

    However, if the event trigger was during schedule(), the event might
    have been de-scheduled by the time the irq_work runs, and
    perf_event_disable_local() will fail.

    Fix this by checking event->pending_disable _after_ we call
    event->pmu->del(). This depends on the latter being a compiler
    barrier, such that the compiler does not lift the load and re-creates
    the problem.
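
    A simplified sketch of the reordering inside event_sched_out() (state
    handling condensed; treat as illustrative):

        /*
         * Stop the event first; ->del() is a function call and therefore
         * a compiler barrier, so the pending_disable load below cannot be
         * hoisted above it.
         */
        event->pmu->del(event, 0);
        event->state = PERF_EVENT_STATE_INACTIVE;

        /* Only now look at pending_disable: the event can no longer fire. */
        if (event->pending_disable) {
                event->pending_disable = 0;
                event->state = PERF_EVENT_STATE_OFF;
        }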

    Tested-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dvyukov@google.com
    Cc: eranian@google.com
    Cc: oleg@redhat.com
    Cc: panand@redhat.com
    Cc: sasha.levin@oracle.com
    Cc: vince@deater.net
    Link: http://lkml.kernel.org/r/20160224174948.040469884@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • perf_install_in_context() relies upon the context switch hooks to have
    scheduled in events when the IPI misses its target -- after all, if
    the task has moved from the CPU (or wasn't running at all), it will
    have to context switch to run elsewhere.

    This however doesn't appear to be happening.

    It is possible for the IPI to not happen (task wasn't running) only to
    later observe the task running with an inactive context.

    The only possible explanation is that the context switch hooks are not
    called. Therefore put in a sync_sched() after toggling the jump_label
    to guarantee all CPUs will have them enabled before we install an
    event.

    A simple if (0->1) sync_sched() will not in fact work, because any
    further increment can race and complete before the sync_sched().
    Therefore we must jump through some hoops.
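
    A sketch of those hoops (simplified; perf_sched_mutex and the exact
    jump-label helper are assumptions):

        if (atomic_inc_not_zero(&perf_sched_count))
                goto enabled;                 /* fast path, label already on */

        mutex_lock(&perf_sched_mutex);
        if (!atomic_read(&perf_sched_count)) {
                /* enable the perf_sched_events jump label here, then... */
                synchronize_sched();          /* ...wait for all CPUs to see it */
        }
        atomic_inc(&perf_sched_count);        /* under the mutex: cannot race */
        mutex_unlock(&perf_sched_mutex);

        enabled:
                /* context switch hooks are guaranteed active from here on */
                ;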

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dvyukov@google.com
    Cc: eranian@google.com
    Cc: oleg@redhat.com
    Cc: panand@redhat.com
    Cc: sasha.levin@oracle.com
    Cc: vince@deater.net
    Link: http://lkml.kernel.org/r/20160224174947.980211985@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Alexander reported that when the 'original' context gets destroyed, no
    new clones happen.

    This can happen irrespective of the ctx switch optimization, any task
    can die, even the parent, and we want to continue monitoring the task
    hierarchy until we either close the event or no tasks are left in the
    hierarchy.

    perf_event_init_context() will attempt to pin the 'parent' context
    during clone(). At that point current is the parent, and since current
    cannot have exited while executing clone(), its context cannot have
    passed through perf_event_exit_task_context(). Therefore
    perf_pin_task_context() cannot observe ctx->task == TASK_TOMBSTONE.

    However, since inherit_event() does:

        if (parent_event->parent)
                parent_event = parent_event->parent;

    it looks at the 'original' event when it does is_orphaned_event().
    This can return true if the context that contains this event has
    passed through perf_event_exit_task_context(), and thus we'll fail to
    clone the perf context.

    Fix this by adding a new state: STATE_DEAD, which is set by
    perf_release() to indicate that the filedesc (or kernel reference) is
    dead and there are no observers for our data left.

    Only for STATE_DEAD will is_orphaned_event() be true and inhibit
    cloning.

    STATE_EXIT is otherwise preserved such that is_event_hup() remains
    functional and will report when the observed task hierarchy becomes
    empty.
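
    In code, the resulting rule is roughly (illustrative):

        /*
         * Only a dead event (no filedesc or kernel reference left)
         * inhibits cloning into a new child context; STATE_EXIT does not.
         */
        static bool is_orphaned_event(struct perf_event *event)
        {
                return event->state == PERF_EVENT_STATE_DEAD;
        }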

    Reported-by: Alexander Shishkin
    Tested-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dvyukov@google.com
    Cc: eranian@google.com
    Cc: oleg@redhat.com
    Cc: panand@redhat.com
    Cc: sasha.levin@oracle.com
    Cc: vince@deater.net
    Fixes: c6e5b73242d2 ("perf: Synchronously clean up child events")
    Link: http://lkml.kernel.org/r/20160224174947.919845295@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dvyukov@google.com
    Cc: eranian@google.com
    Cc: oleg@redhat.com
    Cc: panand@redhat.com
    Cc: sasha.levin@oracle.com
    Cc: vince@deater.net
    Link: http://lkml.kernel.org/r/20160224174947.860690919@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • In the err_file: fput(event_file) case, the event will not yet have
    been attached to a context. However perf_release() does assume it has
    been. Cure this.

    Tested-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dvyukov@google.com
    Cc: eranian@google.com
    Cc: oleg@redhat.com
    Cc: panand@redhat.com
    Cc: sasha.levin@oracle.com
    Cc: vince@deater.net
    Link: http://lkml.kernel.org/r/20160224174947.793996260@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • In case of: err_file: fput(event_file), we'll end up calling
    perf_release() which in turn will free the event.

    Do not then free the event _again_.

    Tested-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dvyukov@google.com
    Cc: eranian@google.com
    Cc: oleg@redhat.com
    Cc: panand@redhat.com
    Cc: sasha.levin@oracle.com
    Cc: vince@deater.net
    Link: http://lkml.kernel.org/r/20160224174947.697350349@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Consider the following scenario:

    CPU0                                      CPU1

    ctx = find_get_ctx();
                                              perf_event_exit_task_context()
    mutex_lock(&ctx->mutex);
    perf_install_in_context(ctx, ...);
      /* NO-OP */
    mutex_unlock(&ctx->mutex);

    ...

    perf_release()
      WARN_ON_ONCE(event->state != STATE_EXIT);

    Since the event doesn't pass through perf_remove_from_context() (because
    perf_install_in_context() NO-OPs on the dead ctx), and
    perf_event_exit_task_context() will not observe the event because it's
    not attached yet, the event->state will not be set.

    Solve this by revalidating ctx->task after we acquire ctx->mutex and
    failing the event creation as a whole.
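
    A sketch of that revalidation in perf_event_open(), once ctx->mutex is
    held for a task context (the error label is an assumption):

        mutex_lock(&ctx->mutex);
        if (ctx->task == TASK_TOMBSTONE) {
                /*
                 * The target task raced us through
                 * perf_event_exit_task_context(); fail the whole
                 * perf_event_open() instead of installing into a dead ctx.
                 */
                err = -ESRCH;
                goto err_locked;
        }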

    Tested-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dvyukov@google.com
    Cc: eranian@google.com
    Cc: oleg@redhat.com
    Cc: panand@redhat.com
    Cc: sasha.levin@oracle.com
    Cc: vince@deater.net
    Link: http://lkml.kernel.org/r/20160224174947.626853419@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

23 Feb, 2016

1 commit


20 Feb, 2016

1 commit


17 Feb, 2016

4 commits

  • No functional change, just less confusing to read.

    Signed-off-by: Thomas Gleixner
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/20160209201007.921540566@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • If CPU_UP_PREPARE is called, it is not guaranteed that a previously
    allocated and assigned hash has already been freed, but perf_event_init_cpu()
    unconditionally allocates and assigns a new hash if the swhash is referenced.
    By overwriting the pointer, the existing hash is no longer accessible.

    Verify that there is no hash assigned on this CPU before allocating and
    assigning a new one.
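
    A sketch of that check in perf_event_init_cpu() (field and helper names
    are plausible but assumed):

        mutex_lock(&swhash->hlist_mutex);
        if (swhash->hlist_refcount > 0 &&
            !rcu_access_pointer(swhash->swevent_hlist)) {
                struct swevent_hlist *hlist;

                /* allocate only when no hash is assigned on this CPU yet */
                hlist = kzalloc_node(sizeof(*hlist), GFP_KERNEL,
                                     cpu_to_node(cpu));
                WARN_ON_ONCE(!hlist);
                rcu_assign_pointer(swhash->swevent_hlist, hlist);
        }
        mutex_unlock(&swhash->hlist_mutex);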

    Signed-off-by: Thomas Gleixner
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/20160209201007.843269966@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • If CPU_DOWN_PREPARE fails the perf hotplug notifier is called for
    CPU_DOWN_FAILED and calls perf_event_init_cpu(), which checks whether the
    swhash is referenced. If so, it allocates a new hash and stores the pointer
    in the per-CPU data structure.

    But at this point the CPU is still online, so there must already be a valid
    hash. By overwriting the pointer, the existing hash is no longer accessible.

    Remove the CPU_DOWN_FAILED state, as there is nothing to (re)allocate.

    Signed-off-by: Thomas Gleixner
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/20160209201007.763417379@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • If CPU_UP_PREPARE fails, the perf hotplug code calls perf_event_exit_cpu(),
    which is a pointless exercise: the CPU is not online, so the SMP function
    calls return -ENXIO, and the result is a list walk that calls no-ops.

    Remove it.

    Signed-off-by: Thomas Gleixner
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/20160209201007.682184765@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

16 Feb, 2016

1 commit

  • For protection keys, we need to understand whether protections
    should be enforced in software or not. In general, we enforce
    protections when working on our own task, but not when on others.
    We call these "current" and "remote" operations.

    This patch introduces a new get_user_pages() variant:

    get_user_pages_remote()

    which is a replacement for get_user_pages() when it is called on a
    non-current tsk/mm.

    We also introduce a new gup flag: FOLL_REMOTE which can be used
    for the "__" gup variants to get this new behavior.

    The uprobes is_trap_at_addr() location holds mmap_sem and
    calls get_user_pages(current->mm) on an instruction address. This
    makes it a pretty unique gup caller. Being an instruction access
    and also really originating from the kernel (vs. the app), I opted
    to consider this a 'remote' access where protection keys will not
    be enforced.

    Without protection keys, this patch should not change any behavior.
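
    A hedged sketch of the split described above; the remote variant's
    argument list is shown as introduced by this series (it has changed in
    later kernels):

        /*
         * Access to another task's address space: pkeys are not enforced
         * for this "remote" access (FOLL_REMOTE is used internally).
         */
        ret = get_user_pages_remote(tsk, mm,     /* non-current task/mm */
                                    addr, 1,     /* start, one page     */
                                    0, 0,        /* write=0, force=0    */
                                    &page, NULL);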

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Kirill A. Shutemov
    Cc: Linus Torvalds
    Cc: Naoya Horiguchi
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Srikar Dronamraju
    Cc: Vlastimil Babka
    Cc: jack@suse.cz
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210154.3F0E51EA@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

01 Feb, 2016

1 commit

  • Pull perf fixes from Thomas Gleixner:
    "This is much bigger than typical fixes, but Peter found a category of
    races that spurred more fixes and more debugging enhancements. Work
    started before the merge window, but got finished only now.

    Aside of that this contains the usual small fixes to perf and tools.
    Nothing particular exciting"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
    perf: Remove/simplify lockdep annotation
    perf: Synchronously clean up child events
    perf: Untangle 'owner' confusion
    perf: Add flags argument to perf_remove_from_context()
    perf: Clean up sync_child_event()
    perf: Robustify event->owner usage and SMP ordering
    perf: Fix STATE_EXIT usage
    perf: Update locking order
    perf: Remove __free_event()
    perf/bpf: Convert perf_event_array to use struct file
    perf: Fix NULL deref
    perf/x86: De-obfuscate code
    perf/x86: Fix uninitialized value usage
    perf: Fix race in perf_event_exit_task_context()
    perf: Fix orphan hole
    perf stat: Do not clean event's private stats
    perf hists: Fix HISTC_MEM_DCACHELINE width setting
    perf annotate browser: Fix behaviour of Shift-Tab with nothing focussed
    perf tests: Remove wrong semicolon in while loop in CQM test
    perf: Synchronously free aux pages in case of allocation failure
    ...

    Linus Torvalds
     

29 Jan, 2016

3 commits

  • Now that the perf_event_ctx_lock_nested() call has moved from
    put_event() into perf_event_release_kernel() the first reason is no
    longer valid as that can no longer happen.

    The second reason seems to have been invalidated when Al Viro made fput()
    unconditionally async in the following commit:

    4a9d4b024a31 ("switch fput to task_work_add")

    such that munmap()->fput()->release()->perf_release() would no longer happen.

    Therefore, remove the annotation. This should increase the efficiency
    of lockdep coverage of perf locking.

    Suggested-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: David Ahern
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The orphan cleanup workqueue doesn't always catch orphans, for example,
    if they never schedule after they are orphaned. IOW, the event leak is
    still very real. It also wouldn't work for kernel counters.

    Doing it synchronously is a little hairy due to lock inversion issues,
    but is made to work.

    Patch based on work by Alexander Shishkin.

    Suggested-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: David Ahern
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: vince@deater.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • There are two concepts of 'owner' with respect to an event:

    - event::owner / event::owner_list,
    used by prctl(.option = PR_TASK_PERF_EVENTS_{EN,DIS}ABLE).

    - the 'owner' of the event object, typically the file descriptor.

    Currently these two concepts are conflated, which gives trouble with
    scm_rights passing of file descriptors: passing the event and then
    closing the creating task would render the event 'orphan' and would
    have it cleared out, which is unlikely to be what is expected.

    This patch untangles these two concepts by using PERF_EVENT_STATE_EXIT
    to denote the second type.

    Reported-by: Alexei Starovoitov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: David Ahern
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Signed-off-by: Ingo Molnar

    Peter Zijlstra