16 Apr, 2015

2 commits

  • When MAP_HUGETLB memory is unmapped, the length must be hugepage aligned,
    otherwise it fails with -EINVAL.

    All tests currently behave correctly, but it's better to explicitly test
    the return value for completeness and document the requirement, especially
    if users copy map_hugetlb.c as a sample implementation.
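
    A minimal sketch of the alignment requirement described above, assuming
    the huge page size is a power of two (2 MiB is common on x86_64, but the
    value is system dependent). The helper name hugepage_align() is
    hypothetical and is not taken from map_hugetlb.c:

```c
#include <stddef.h>

/*
 * Round len up to the next multiple of huge_sz.
 * Assumption: huge_sz is a power of two, e.g. 2 MiB.
 */
static size_t hugepage_align(size_t len, size_t huge_sz)
{
	return (len + huge_sz - 1) & ~(huge_sz - 1);
}
```

    The rounded length would then be passed to both mmap() and munmap(), and
    the munmap() return value checked, since an unaligned length is rejected
    with EINVAL.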

    Signed-off-by: David Rientjes
    Cc: Jonathan Corbet
    Cc: Davide Libenzi
    Cc: Luiz Capitulino
    Cc: Shuah Khan
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Joern Engel
    Cc: Jianguo Wu
    Cc: Eric B Munson
    Acked-by: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Pull networking updates from David Miller:

    1) Add BQL support to via-rhine, from Tino Reichardt.

    2) Integrate SWITCHDEV layer support into the DSA layer, so DSA drivers
    can support hw switch offloading. From Florian Fainelli.

    3) Allow 'ip address' commands to initiate multicast group join/leave,
    from Madhu Challa.

    4) Many ipv4 FIB lookup optimizations from Alexander Duyck.

    5) Support EBPF in cls_bpf classifier and act_bpf action, from Daniel
    Borkmann.

    6) Remove the ugly compat support in ARP for ugly layers like ax25,
    rose, etc. And use this to clean up the neigh layer, then use it to
    implement MPLS support. All from Eric Biederman.

    7) Support L3 forwarding offloading in switches, from Scott Feldman.

    8) Collapse the LOCAL and MAIN ipv4 FIB tables when possible, to speed
    up route lookups even further. From Alexander Duyck.

    9) Many improvements and bug fixes to the rhashtable implementation,
    from Herbert Xu and Thomas Graf. In particular, in the case where
    an rhashtable user bulk adds a large number of items into an empty
    table, we expand the table much more sanely.

    10) Don't make the tcp_metrics hash table per-namespace, from Eric
    Biederman.

    11) Extend EBPF to access SKB fields, from Alexei Starovoitov.

    12) Split out new connection request sockets so that they can be
    established in the main hash table. Much less false sharing since
    hash lookups go direct to the request sockets instead of having to
    go first to the listener then to the request socks hashed
    underneath. From Eric Dumazet.

    13) Add async I/O support for crypto AF_ALG sockets, from Tadeusz Struk.

    14) Support stable privacy address generation for RFC7217 in IPV6. From
    Hannes Frederic Sowa.

    15) Hash network namespace into IP frag IDs, also from Hannes Frederic
    Sowa.

    16) Convert PTP get/set methods to use 64-bit time, from Richard
    Cochran.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1816 commits)
    fm10k: Bump driver version to 0.15.2
    fm10k: corrected VF multicast update
    fm10k: mbx_update_max_size does not drop all oversized messages
    fm10k: reset head instead of calling update_max_size
    fm10k: renamed mbx_tx_dropped to mbx_tx_oversized
    fm10k: update xcast mode before synchronizing multicast addresses
    fm10k: start service timer on probe
    fm10k: fix function header comment
    fm10k: comment next_vf_mbx flow
    fm10k: don't handle mailbox events in iov_event path and always process mailbox
    fm10k: use separate workqueue for fm10k driver
    fm10k: Set PF queues to unlimited bandwidth during virtualization
    fm10k: expose tx_timeout_count as an ethtool stat
    fm10k: only increment tx_timeout_count in Tx hang path
    fm10k: remove extraneous "Reset interface" message
    fm10k: separate PF only stats so that VF does not display them
    fm10k: use hw->mac.max_queues for stats
    fm10k: only show actual queues, not the maximum in hardware
    fm10k: allow creation of VLAN on default vid
    fm10k: fix unused warnings
    ...

    Linus Torvalds
     

15 Apr, 2015

4 commits

  • Pull perf changes from Ingo Molnar:
    "Core kernel changes:

    - One of the more interesting features in this cycle is the ability
    to attach eBPF programs (user-defined, sandboxed bytecode executed
    by the kernel) to kprobes.

    This allows user-defined instrumentation on a live kernel image
    that can never crash, hang or interfere with the kernel negatively.
    (Right now it's limited to root-only, but in the future we might
    allow unprivileged use as well.)

    (Alexei Starovoitov)

    - Another non-trivial feature is per event clockid support: this
    allows, amongst other things, the selection of different clock
    sources for event timestamps traced via perf.

    This feature is sought by people who'd like to merge perf generated
    events with external events that were measured with different
    clocks:

    - cluster wide profiling

    - for system wide tracing with user-space events,

    - JIT profiling events

    etc. Matching perf tooling support is added as well, available via
    the -k, --clockid parameter to perf record et al.

    (Peter Zijlstra)

    Hardware enablement kernel changes:

    - x86 Intel Processor Trace (PT) support: which is a hardware tracer
    on steroids, available on Broadwell CPUs.

    The hardware trace stream is directly output into the user-space
    ring-buffer, using the 'AUX' data format extension that was added
    to the perf core to support hardware constraints such as the
    necessity to have the tracing buffer physically contiguous.

    This patch-set was developed for two years and this is the result.
    A simple way to make use of this is to use BTS tracing, the PT
    driver emulates BTS output - available via the 'intel_bts' PMU.
    More explicit PT specific tooling support is in the works as well -
    will probably be ready by 4.2.

    (Alexander Shishkin, Peter Zijlstra)

    - x86 Intel Cache QoS Monitoring (CQM) support: this is a hardware
    feature of Intel Xeon CPUs that allows the measurement and
    allocation/partitioning of caches to individual workloads.

    These kernel changes expose the measurement side as a new PMU
    driver, which exposes various QoS related PMU events. (The
    partitioning change is work in progress and is planned to be merged
    as a cgroup extension.)

    (Matt Fleming, Peter Zijlstra; CPU feature detection by Peter P
    Waskiewicz Jr)

    - x86 Intel Haswell LBR call stack support: this is a new Haswell
    feature that allows the hardware recording of call chains, plus
    tooling support. To activate this feature you have to enable it
    via the new 'lbr' call-graph recording option:

    perf record --call-graph lbr
    perf report

    or:

    perf top --call-graph lbr

    This hardware feature is a lot faster than stack walk or dwarf
    based unwinding, but has some limitations:

    - It reuses the current LBR facility, so LBR call stack and
    branch record can not be enabled at the same time.

    - It is only available for user-space callchains.

    (Yan, Zheng)

    - x86 Intel Broadwell CPU support and various event constraints and
    event table fixes for earlier models.

    (Andi Kleen)

    - x86 Intel HT CPUs event scheduling workarounds. This is a complex
    CPU bug affecting the SNB,IVB,HSW families that results in counter
    value corruption. The mitigation code is automatically enabled and
    is transparent.

    (Maria Dimakopoulou, Stephane Eranian)

    The perf tooling side had a ton of changes in this cycle as well, so
    I'm only able to list the user visible changes here, in addition to
    the tooling changes outlined above:

    User visible changes affecting all tools:

    - Improve support of compressed kernel modules (Jiri Olsa)
    - Save DSO loading errno to better report errors (Arnaldo Carvalho de Melo)
    - Bash completion for subcommands (Yunlong Song)
    - Add 'I' event modifier for perf_event_attr.exclude_idle bit (Jiri Olsa)
    - Support missing -f to override perf.data file ownership. (Yunlong Song)
    - Show the first event with an invalid filter (David Ahern, Arnaldo Carvalho de Melo)

    User visible changes in individual tools:

    'perf data':

    New tool for converting perf.data to other formats, initially
    for the CTF (Common Trace Format) from LTTng (Jiri Olsa,
    Sebastian Siewior)

    'perf diff':

    Add --kallsyms option (David Ahern)

    'perf list':

    Allow listing events with 'tracepoint' prefix (Yunlong Song)

    Sort the output of the command (Yunlong Song)

    'perf kmem':

    Respect -i option (Jiri Olsa)

    Print big numbers using thousands' group (Namhyung Kim)

    Allow -v option (Namhyung Kim)

    Fix alignment of slab result table (Namhyung Kim)

    'perf probe':

    Support multiple probes on different binaries on the same command line (Masami Hiramatsu)

    Support unnamed union/structure members data collection. (Masami Hiramatsu)

    Check kprobes blacklist when adding new events. (Masami Hiramatsu)

    'perf record':

    Teach 'perf record' about perf_event_attr.clockid (Peter Zijlstra)

    Support recording running/enabled time (Andi Kleen)

    'perf sched':

    Improve the performance of 'perf sched replay' on high CPU core count machines (Yunlong Song)

    'perf report' and 'perf top':

    Allow annotating entries in callchains in the hists browser (Arnaldo Carvalho de Melo)

    Indicate which callchain entries are annotated in the
    TUI hists browser (Arnaldo Carvalho de Melo)

    Add pid/tid filtering to 'report' and 'script' commands (David Ahern)

    Consider PERF_RECORD_ events with cpumode == 0 in 'perf top', removing one
    cause of long term memory usage buildup, i.e. not processing PERF_RECORD_EXIT
    events (Arnaldo Carvalho de Melo)

    'perf stat':

    Report unsupported events properly (Suzuki K. Poulose)

    Output running time and run/enabled ratio in CSV mode (Andi Kleen)

    'perf trace':

    Handle legacy syscalls tracepoints (David Ahern, Arnaldo Carvalho de Melo)

    Only insert blank duration bracket when tracing syscalls (Arnaldo Carvalho de Melo)

    Filter out the trace pid when no threads are specified (Arnaldo Carvalho de Melo)

    Dump stack on segfaults (Arnaldo Carvalho de Melo)

    No need to explicitly enable evsels for workload started from perf, let it
    be enabled via perf_event_attr.enable_on_exec, removing some events that take
    place in the 'perf trace' before a workload is really started by it.
    (Arnaldo Carvalho de Melo)

    Allow mixing with tracepoints and suppressing plain syscalls. (Arnaldo Carvalho de Melo)

    There's also been a ton of infrastructure work done, such as the
    split-out of perf's build system into tools/build/ and other changes -
    see the shortlog and changelog for details"

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (358 commits)
    perf/x86/intel/pt: Clean up the control flow in pt_pmu_hw_init()
    perf evlist: Fix type for references to data_head/tail
    perf probe: Check the orphaned -x option
    perf probe: Support multiple probes on different binaries
    perf buildid-list: Fix segfault when show DSOs with hits
    perf tools: Fix cross-endian analysis
    perf tools: Fix error path to do closedir() when synthesizing threads
    perf tools: Fix synthesizing fork_event.ppid for non-main thread
    perf tools: Add 'I' event modifier for exclude_idle bit
    perf report: Don't call map__kmap if map is NULL.
    perf tests: Fix attr tests
    perf probe: Fix ARM 32 building error
    perf tools: Merge all perf_event_attr print functions
    perf record: Add clockid parameter
    perf sched replay: Use replay_repeat to calculate the runavg of cpu usage instead of the default value 10
    perf sched replay: Support using -f to override perf.data file ownership
    perf sched replay: Fix the EMFILE error caused by the limitation of the maximum open files
    perf sched replay: Handle the dead halt of sem_wait when create_tasks() fails for any task
    perf sched replay: Fix the segmentation fault problem caused by pr_err in threads
    perf sched replay: Realloc the memory of pid_to_task stepwise to adapt to the different pid_max configurations
    ...

    Linus Torvalds
     
  • Pull RCU changes from Ingo Molnar:
    "The main changes in this cycle were:

    - changes permitting use of call_rcu() and friends very early in
    boot, for example, before rcu_init() is invoked.

    - add in-kernel API to enable and disable expediting of normal RCU
    grace periods.

    - improve RCU's handling of (hotplug-) outgoing CPUs.

    - NO_HZ_FULL_SYSIDLE fixes.

    - tiny-RCU updates to make it more tiny.

    - documentation updates.

    - miscellaneous fixes"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits)
    cpu: Provide smpboot_thread_init() on !CONFIG_SMP kernels as well
    cpu: Defer smpboot kthread unparking until CPU known to scheduler
    rcu: Associate quiescent-state reports with grace period
    rcu: Yet another fix for preemption and CPU hotplug
    rcu: Add diagnostics to grace-period cleanup
    rcutorture: Default to grace-period-initialization delays
    rcu: Handle outgoing CPUs on exit from idle loop
    cpu: Make CPU-offline idle-loop transition point more precise
    rcu: Eliminate ->onoff_mutex from rcu_node structure
    rcu: Process offlining and onlining only at grace-period start
    rcu: Move rcu_report_unblock_qs_rnp() to common code
    rcu: Rework preemptible expedited bitmask handling
    rcu: Remove event tracing from rcu_cpu_notify(), used by offline CPUs
    rcutorture: Enable slow grace-period initializations
    rcu: Provide diagnostic option to slow down grace-period initialization
    rcu: Detect stalls caused by failure to propagate up rcu_node tree
    rcu: Eliminate empty HOTPLUG_CPU ifdef
    rcu: Simplify sync_rcu_preempt_exp_init()
    rcu: Put all orphan-callback-related code under same comment
    rcu: Consolidate offline-CPU callback initialization
    ...

    Linus Torvalds
     
  • Pull tracing updates from Steven Rostedt:
    "Some clean ups and small fixes, but the biggest change is the addition
    of the TRACE_DEFINE_ENUM() macro that can be used by tracepoints.

    Tracepoints have helper functions for the TP_printk() called
    __print_symbolic() and __print_flags() that let a numeric value be
    displayed as human comprehensible text. What is placed in the
    TP_printk() is also shown in the tracepoint format file such that user
    space tools like perf and trace-cmd can parse the binary data and
    express the values too. Unfortunately, the way the TRACE_EVENT()
    macro works, anything placed in the TP_printk() will be shown pretty
    much exactly as is. The problem arises when enums are used. That's
    because unlike macros, enums will not be changed into their values by
    the C pre-processor. Thus, the enum string is exported to the format
    file, and this makes it useless for user space tools.

    The TRACE_DEFINE_ENUM() solves this by converting the enum strings in
    the TP_printk() format into their number, and that is what is shown to
    user space. For example, the tracepoint tlb_flush currently has this
    in its format file:

    __print_symbolic(REC->reason,
    { TLB_FLUSH_ON_TASK_SWITCH, "flush on task switch" },
    { TLB_REMOTE_SHOOTDOWN, "remote shootdown" },
    { TLB_LOCAL_SHOOTDOWN, "local shootdown" },
    { TLB_LOCAL_MM_SHOOTDOWN, "local mm shootdown" })

    After adding:

    TRACE_DEFINE_ENUM(TLB_FLUSH_ON_TASK_SWITCH);
    TRACE_DEFINE_ENUM(TLB_REMOTE_SHOOTDOWN);
    TRACE_DEFINE_ENUM(TLB_LOCAL_SHOOTDOWN);
    TRACE_DEFINE_ENUM(TLB_LOCAL_MM_SHOOTDOWN);

    Its format file will contain this:

    __print_symbolic(REC->reason,
    { 0, "flush on task switch" },
    { 1, "remote shootdown" },
    { 2, "local shootdown" },
    { 3, "local mm shootdown" })"
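
    The preprocessor behaviour described above can be reproduced in plain
    user-space C. This is only an illustration with hypothetical names
    (TLB_MACRO_REASON, TLB_ENUM_REASON, STRINGIFY); it does not use the
    kernel's TRACE_EVENT() machinery:

```c
#include <string.h>

#define TLB_MACRO_REASON 1             /* hypothetical macro constant */
enum { TLB_ENUM_REASON = 1 };          /* hypothetical enum, same value */

#define STRINGIFY_RAW(x) #x
#define STRINGIFY(x) STRINGIFY_RAW(x)  /* expand x first, then stringify */

/* The macro is replaced by its value before stringification. */
static const char *macro_text(void)
{
	return STRINGIFY(TLB_MACRO_REASON);
}

/* The enum name is not a macro, so the identifier itself is stringified. */
static const char *enum_text(void)
{
	return STRINGIFY(TLB_ENUM_REASON);
}

/* Returns 1 if the text is the numeric value a user-space parser could use. */
static int expands_to_number(const char *text)
{
	return strcmp(text, "1") == 0;
}
```

    Here macro_text() yields the number as text while enum_text() yields the
    identifier, which is the situation a format file ends up in when an enum
    is used without a mapping to its value.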

    * tag 'trace-v4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (27 commits)
    tracing: Add enum_map file to show enums that have been mapped
    writeback: Export enums used by tracepoint to user space
    v4l: Export enums used by tracepoints to user space
    SUNRPC: Export enums in tracepoints to user space
    mm: tracing: Export enums in tracepoints to user space
    irq/tracing: Export enums in tracepoints to user space
    f2fs: Export the enums in the tracepoints to userspace
    net/9p/tracing: Export enums in tracepoints to userspace
    x86/tlb/trace: Export enums in used by tlb_flush tracepoint
    tracing/samples: Update the trace-event-sample.h with TRACE_DEFINE_ENUM()
    tracing: Allow for modules to convert their enums to values
    tracing: Add TRACE_DEFINE_ENUM() macro to map enums to their values
    tracing: Update trace-event-sample with TRACE_SYSTEM_VAR documentation
    tracing: Give system name a pointer
    brcmsmac: Move each system tracepoints to their own header
    iwlwifi: Move each system tracepoints to their own header
    mac80211: Move message tracepoints to their own header
    tracing: Add TRACE_SYSTEM_VAR to xhci-hcd
    tracing: Add TRACE_SYSTEM_VAR to kvm-s390
    tracing: Add TRACE_SYSTEM_VAR to intel-sst
    ...

    Linus Torvalds
     
  • Pull trivial tree from Jiri Kosina:
    "Usual trivial tree updates. Nothing outstanding -- mostly printk()
    and comment fixes and unused identifier removals"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
    goldfish: goldfish_tty_probe() is not using 'i' any more
    powerpc: Fix comment in smu.h
    qla2xxx: Fix printks in ql_log message
    lib: correct link to the original source for div64_u64
    si2168, tda10071, m88ds3103: Fix firmware wording
    usb: storage: Fix printk in isd200_log_config()
    qla2xxx: Fix printk in qla25xx_setup_mode
    init/main: fix reset_device comment
    ipwireless: missing assignment
    goldfish: remove unreachable line of code
    coredump: Fix do_coredump() comment
    stacktrace.h: remove duplicate declaration task_struct
    smpboot.h: Remove unused function prototype
    treewide: Fix typo in printk messages
    treewide: Fix typo in printk messages
    mod_devicetable: fix comment for match_flags

    Linus Torvalds
     

14 Apr, 2015

2 commits

  • Pull staging driver updates from Greg KH:
    "Here's the big staging driver patchset for 4.1-rc1.

    There's a lot of patches here, the Outreachy application period
    happened during this development cycle, so that means that there was a
    lot of cleanup patches accepted. Other than the normal coding style
    and sparse fixes here, there are some driver updates and work toward
    making some of the drivers into "mergeable" shape (like the Unisys
    drivers.)

    All of these have been in linux-next for a while"

    * tag 'staging-4.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (1214 commits)
    staging: lustre: orthography & coding style
    staging: lustre: lnet: lnet: fix error return code
    staging: lustre: fix sparse warning
    Revert "Staging: sm750fb: Fix C99 Comments"
    Staging: rtl8192u: use correct array for debug output
    staging: rtl8192e: Remove dead code
    staging: rtl8192e: Comment cleanup (style/format)
    staging: rtl8192e: Fix indentation in rtllib_rx_auth_resp()
    staging: rtl8192e: Decrease nesting of rtllib_rx_auth_resp()
    staging: rtl8192e: Divide rtllib_rx_auth()
    staging: rtl8192e: Fix PRINTK_WITHOUT_KERN_LEVEL warnings
    staging: rtl8192e: Fix DO_WHILE_MACRO_WITH_TRAILING_SEMICOLON warning
    staging: rtl8192e: Fix BRACES warning
    staging: rtl8192e: Fix LINE_CONTINUATIONS warning
    staging: rtl8192e: Fix UNNECESSARY_PARENTHESES warnings
    staging: rtl8192e: remove unused EXPORT_SYMBOL_RSL macro
    staging: rtl8192e: Fix RETURN_VOID warnings
    staging: rtl8192e: Fix UNNECESSARY_ELSE warning
    staging: rtl8723au: Remove unneeded comments
    staging: rtl8723au: Use __func__ in trace logs
    ...

    Linus Torvalds
     
  • Pull x86 asm changes from Ingo Molnar:
    "There were lots of changes in this development cycle:

    - over 100 separate cleanups, restructuring changes, speedups and
    fixes in the x86 system call, irq, trap and other entry code, part
    of a heroic effort to deobfuscate a decade old spaghetti asm code
    and its C code dependencies (Denys Vlasenko, Andy Lutomirski)

    - alternatives code fixes and enhancements (Borislav Petkov)

    - simplifications and cleanups to the compat code (Brian Gerst)

    - signal handling fixes and new x86 testcases (Andy Lutomirski)

    - various other fixes and cleanups

    By their nature many of these changes are risky - we tried to test
    them well on many different x86 systems (there are no known
    regressions), and they are split up finely to help bisection - but
    there's still a fair bit of residual risk left so caveat emptor"

    * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (148 commits)
    perf/x86/64: Report regs_user->ax too in get_regs_user()
    perf/x86/64: Simplify regs_user->abi setting code in get_regs_user()
    perf/x86/64: Do report user_regs->cx while we are in syscall, in get_regs_user()
    perf/x86/64: Do not guess user_regs->cs, ss, sp in get_regs_user()
    x86/asm/entry/32: Tidy up JNZ instructions after TESTs
    x86/asm/entry/64: Reduce padding in execve stubs
    x86/asm/entry/64: Remove GET_THREAD_INFO() in ret_from_fork
    x86/asm/entry/64: Simplify jumps in ret_from_fork
    x86/asm/entry/64: Remove a redundant jump
    x86/asm/entry/64: Optimize [v]fork/clone stubs
    x86/asm/entry: Zero EXTRA_REGS for stub32_execve() too
    x86/asm/entry/64: Move stub_x32_execve() closer to stub_execveat()
    x86/asm/entry/64: Use common code for rt_sigreturn() epilogue
    x86/asm/entry/64: Add forgotten CFI annotation
    x86/asm/entry/irq: Simplify interrupt dispatch table (IDT) layout
    x86/asm/entry/64: Move opportunistic sysret code to syscall code path
    x86, selftests: Add sigreturn selftest
    x86/alternatives: Guard NOPs optimization
    x86/asm/entry: Clear EXTRA_REGS for all executable formats
    x86/signal: Remove pax argument from restore_sigcontext
    ...

    Linus Torvalds
     

13 Apr, 2015

1 commit

  • …/git/shuah/linux-kselftest

    Pull kselftest updates from Shuah Khan:
    "This is a milestone update in a sense. Several new tests and install
    and packaging support is added in this update.

    This update adds install and packaging tools developed on top of
    back-end shared logic enhancements to run and install tests. In
    addition several timer tests are added.

    - New timer tests from John Stultz

    - rtc test from Prarit Bhargava

    - Enhancements to run and install tests from Michael Ellerman

    - Install and packaging tools from Shuah Khan

    - Cross-compilation enablement from Tyler Baker

    - A couple of bug fixes"

    * tag 'linux-kselftest-4.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: (42 commits)
    ftracetest: Do not use usleep directly
    selftest/mqueue: enable cross compilation
    selftest/ipc: enable cross compilation
    selftest/memfd: include default header install path
    selftest/mount: enable cross compilation
    selftest/memfd: enable cross compilation
    kselftests: timers: Make set-timer-lat fail more gracefully for !CAP_WAKE_ALARM
    selftests: Change memory on-off-test.sh name to be unique
    selftests: change cpu on-off-test.sh name to be unique
    selftests/mount: Make git ignore all binaries in mount test suite
    kselftests: timers: Reduce default runtime on inconsistency-check and set-timer-lat
    ftracetest: Convert exit -1 to exit $FAIL
    ftracetest: Cope properly with stack tracer not being enabled
    tools, update rtctest.c to verify passage of time
    Documentation, split up rtc.txt into documentation and test file
    selftests: Add tool to generate kselftest tar archive
    selftests: Add kselftest install tool
    selftests: Set CC using CROSS_COMPILE once in lib.mk
    selftests: Add install support for the powerpc tests
    selftests/timers: Use shared logic to run and install tests
    ...

    Linus Torvalds
     

10 Apr, 2015

7 commits

  • The data_head and data_tail fields are defined as __u64 in
    linux/perf_event.h, but perf userspace uses int and unsigned int.

    Convert all references to u64 for consistency.
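
    A small sketch of why the narrower types are a problem, assuming a
    platform where unsigned int is 32 bits wide. The function name
    survives_narrowing() is hypothetical and is not part of perf:

```c
#include <stdint.h>

/*
 * Returns 1 if a copy of data_head stored in an unsigned int still equals
 * the 64-bit value published by the kernel, 0 if upper bits were lost.
 * Assumption: unsigned int is 32 bits wide.
 */
static int survives_narrowing(uint64_t data_head)
{
	unsigned int narrowed = (unsigned int)data_head;

	return (uint64_t)narrowed == data_head;
}
```

    Small counter values round-trip unchanged, but once the counter passes
    4 GiB the narrowed copy no longer matches, which is the inconsistency
    that using u64 throughout avoids.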

    Signed-off-by: David Ahern
    Acked-by: Peter Zijlstra
    Cc: Frederic Weisbecker
    Cc: Jiri Olsa
    Cc: Namhyung Kim
    Cc: Stephane Eranian
    Link: http://lkml.kernel.org/r/1428420037-26599-1-git-send-email-dsahern@gmail.com
    Signed-off-by: Arnaldo Carvalho de Melo

    David Ahern
     
  • To avoid probing in unintended binary, the orphaned -x option must be
    checked and warned.

    Without this patch, the following command sets up the probe in the kernel.

    -----
    # perf probe -a strcpy -x ./perf
    Added new event:
    probe:strcpy (on strcpy)

    You can now use it in all perf tools, such as:

    perf record -e probe:strcpy -aR sleep 1
    -----

    But in this case, it seems that the user may want to probe in the perf
    binary. With this patch, perf-probe correctly handles the orphaned -x.

    -----
    # perf probe -a strcpy -x ./perf
    Error: -x/-m must follow the probe definitions.
    ...
    -----

    Reported-by: Jiri Olsa
    Acked-by: Jiri Olsa
    Cc: David Ahern
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20150401102541.17137.75477.stgit@localhost.localdomain
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Arnaldo Carvalho de Melo

    Masami Hiramatsu
     
  • Support multiple probes on different binaries with just
    one command.

    As a result, this example sets up the probes on icmp_rcv in
    kernel, on main and set_target in perf, and on pcspkr_event
    in pcspkr.ko driver.
    -----
    # perf probe -a icmp_rcv -x ./perf -a main -a set_target \
    -m /lib/modules/4.0.0-rc5+/kernel/drivers/input/misc/pcspkr.ko \
    -a pcspkr_event
    Added new event:
    probe:icmp_rcv (on icmp_rcv)

    You can now use it in all perf tools, such as:

    perf record -e probe:icmp_rcv -aR sleep 1

    Added new event:
    probe_perf:main (on main in /home/mhiramat/ksrc/linux-3/tools/perf/perf)

    You can now use it in all perf tools, such as:

    perf record -e probe_perf:main -aR sleep 1

    Added new event:
    probe_perf:set_target (on set_target in /home/mhiramat/ksrc/linux-3/tools/perf/perf)

    You can now use it in all perf tools, such as:

    perf record -e probe_perf:set_target -aR sleep 1

    Added new event:
    probe:pcspkr_event (on pcspkr_event in pcspkr)

    You can now use it in all perf tools, such as:

    perf record -e probe:pcspkr_event -aR sleep 1
    -----

    Reported-by: Arnaldo Carvalho de Melo
    Signed-off-by: Masami Hiramatsu
    Tested-by: Arnaldo Carvalho de Melo
    Cc: David Ahern
    Cc: Jiri Olsa
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20150401102539.17137.46454.stgit@localhost.localdomain
    Signed-off-by: Arnaldo Carvalho de Melo

    Masami Hiramatsu
     
  • commit: f3b623b8490a ("perf tools: Reference count struct thread")
    appends every thread->node to dead_threads in machine__remove_thread()
    and list_del_init() this node in thread__put().

    perf_event__exit_del_thread() releases thread without using
    machine__remove_thread(), and causes a NULL pointer crash when
    list_del_init(&thread->node) is called. Fix this by using
    machine__remove_thread() instead of using thread__put() directly.

    This problem can be reproduced as follows:

    $ perf record ls
    $ perf buildid-list --with-hits
    [ 3874.195070] perf[1018]: segfault at 0 ip 00000000004b0b15 sp
    00007ffc35b44780 error 6 in perf[400000+166000]
    Segmentation fault

    After this patch:
    $ perf record ls
    $ perf buildid-list --with-hits
    bc23e7c3281e542650ba4324421d6acf78f4c23e /proc/kcore
    643324cb0e969f30c56d660f167f84a150845511 [vdso]
    0000000000000000000000000000000000000000 /bin/busybox
    ...

    Signed-off-by: He Kuang
    Tested-by: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Wang Nan
    Link: http://lkml.kernel.org/r/1428658500-6483-1-git-send-email-hekuang@huawei.com
    Signed-off-by: Arnaldo Carvalho de Melo

    He Kuang
     
  • Trying to analyze a big endian data file on little endian system fails
    with the error:

    0xa9b40 [0x70]: failed to process type: 9

    The problem is that header parsing is not done correctly because the
    file attributes are not swapped. Make it so. With this patch, perf is able
    to analyze a sparc64 data file on x86_64.
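
    The kind of swap involved can be sketched as below. The struct
    file_section layout and both function names are hypothetical stand-ins
    for illustration, not perf's actual header code:

```c
#include <stdint.h>

/* Reverse the byte order of a 64-bit value. */
static uint64_t swap64(uint64_t v)
{
	v = ((v & 0x00ff00ff00ff00ffULL) << 8) | ((v >> 8) & 0x00ff00ff00ff00ffULL);
	v = ((v & 0x0000ffff0000ffffULL) << 16) | ((v >> 16) & 0x0000ffff0000ffffULL);
	return (v << 32) | (v >> 32);
}

/* Hypothetical on-disk descriptor: where a section lives and how big it is. */
struct file_section {
	uint64_t offset;
	uint64_t size;
};

/* Convert a descriptor read from a file written with the other endianness. */
static void file_section_swap(struct file_section *sec)
{
	sec->offset = swap64(sec->offset);
	sec->size = swap64(sec->size);
}
```

    If such a descriptor is left unswapped, the reader seeks to a wrong
    offset and then misinterprets whatever it finds there, which is
    consistent with the "failed to process type" error quoted above.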

    Signed-off-by: David Ahern
    Acked-by: Jiri Olsa
    Cc: Namhyung Kim
    Link: http://lkml.kernel.org/r/1428610546-178789-1-git-send-email-david.ahern@oracle.com
    Signed-off-by: Arnaldo Carvalho de Melo

    David Ahern
     
  • When traversing /proc to synthesize the PERF_RECORD_FORK et al events we
    were bailing out on errors without calling closedir(), fix it.
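
    The general pattern behind this fix, sketched with a hypothetical
    count_entries() helper rather than perf's actual thread synthesis code:
    every exit taken after a successful opendir() goes through one label
    that calls closedir().

```c
#include <dirent.h>
#include <stddef.h>

/*
 * Count the entries of a directory, failing if there are more than
 * max_entries. Returns the count, or -1 on error.
 */
static int count_entries(const char *path, int max_entries)
{
	DIR *dir = opendir(path);
	struct dirent *ent;
	int count = 0;
	int err = 0;

	if (dir == NULL)
		return -1;              /* nothing to release yet */

	while ((ent = readdir(dir)) != NULL) {
		if (++count > max_entries) {
			err = -1;
			goto out_close;     /* error path still releases the handle */
		}
	}

out_close:
	closedir(dir);
	return err ? err : count;
}
```

    Returning directly from inside the loop would leak the directory handle
    on every failure, which adds up when walking a large /proc.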

    Reported-by: David Ahern
    Cc: Adrian Hunter
    Cc: Borislav Petkov
    Cc: Don Zickus
    Cc: Frederic Weisbecker
    Cc: Jiri Olsa
    Cc: Mike Galbraith
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Link: http://lkml.kernel.org/n/tip-vxtp593rfztgbi8noy0m967p@git.kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
     
  • Commit ca6c41c59b9 sets the ppid based on what is read from the
    /proc/pid/status file when synthesizing fork events.

    This is the correct thing to do for new processes but not threads of a
    process.

    Fix ppid for threads to be the main thread when synthesizing fork events
    (ie., assume main thread spawned all sub-threads in a process).

    Reported-by: Arnaldo Carvalho de Melo
    Signed-off-by: David Ahern
    Tested-by: Arnaldo Carvalho de Melo
    Acked-by: Don Zickus
    Link: http://lkml.kernel.org/r/1428598107-178999-1-git-send-email-david.ahern@oracle.com
    Signed-off-by: Arnaldo Carvalho de Melo

    David Ahern
     

08 Apr, 2015

21 commits

  • Adding 'I' event modifier to have complete set of modifiers for
    perf_event_attr:exclude_* bits.

    Any event specified with 'I' modifier will have the
    perf_event_attr:exclude_idle bit set.

    $ perf record -e cycles:I -vv ls 2>&1 | grep exclude_idle
    exclude_hv 0 exclude_idle 1

    Adding automated tests.

    Signed-off-by: Jiri Olsa
    Cc: Andi Kleen
    Cc: David Ahern
    Cc: Namhyung Kim
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: William Cohen
    Link: http://lkml.kernel.org/r/1428441919-23099-2-git-send-email-jolsa@kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Jiri Olsa
     
  • report__warn_kptr_restrict() calls map__kmap(kernel_map) before checking
    kernel_map against NULL.

    Which is dangerous, since map__kmap() will return an invalid, non-NULL
    address.

    It will trigger a warning message in map__kmap() after the patch "perf:
    kmaps: enforce usage of kmaps to protect futher bugs." was applied.

    This patch fixes it by adding the missing checking.

    Signed-off-by: Wang Nan
    Cc: Adrian Hunter
    Cc: Jiri Olsa
    Cc: Zefan Li
    Cc: pi3orama@163.com
    Link: http://lkml.kernel.org/r/1428490772-135393-1-git-send-email-wangnan0@huawei.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Wang Nan
     
  • Following commit:
    1a5941312414 perf: Add wakeup watermark control to the AUX area

    enlarged perf_event_attr, but did not update attr tests.

    Reported-by: Arnaldo Carvalho de Melo
    Signed-off-by: Jiri Olsa
    Cc: "H. Peter Anvin"
    Cc: Adrian Hunter
    Cc: Alexander Shishkin
    Cc: Borislav Petkov
    Cc: Frederic Weisbecker
    Cc: Kaixu Xia
    Cc: Kan Liang
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Markus T Metzger
    Cc: Mathieu Poirier
    Link: http://lkml.kernel.org/n/20150407171715.GA22603@krava.redhat.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Jiri Olsa
     
  • Commit 9b118acae310f57baee770b5db402500d8695e50 ("perf probe: Fix to
    handle aliased symbols in glibc") uses the fixed format '%lx' to
    print a u64 argument, which causes a compile error on 32-bit ARM.

    This patch replaces it with PRIx64.
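
    As a minimal sketch of the portable form (format_addr() is a
    hypothetical helper for illustration, not a function from perf), the
    PRIx64 macro from <inttypes.h> supplies the correct length modifier
    for a 64-bit value on both 32-bit and 64-bit targets:

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format a 64-bit value as hex into buf.
 * PRIx64 expands to the length modifier that matches uint64_t on the
 * target, so the same format string compiles cleanly everywhere,
 * whereas a hard-coded "%lx" only matches where long is 64 bits wide.
 * Hypothetical helper, not part of perf.
 */
static int format_addr(char *buf, size_t len, uint64_t addr)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "0x%" PRIx64, addr);
}
```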

    Signed-off-by: Wang Nan
    Acked-by: Masami Hiramatsu
    Cc: Jiri Olsa
    Cc: Namhyung Kim
    Cc: Zefan Li
    Cc: pi3orama@163.com
    Link: http://lkml.kernel.org/r/1428459274-138470-1-git-send-email-wangnan0@huawei.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Wang Nan
     
  • Currently there are 3 (that I found) different and incomplete
    implementations of printing perf_event_attr.

    This is quite silly. Merge the lot.

    While this patch does not retain the exact form, all printing that I
    found is debug output and thus it should not be critical.

    Also, I cannot find a single print_event_desc() caller.

    Pre:

    $ perf record -vv -e cycles -- sleep 1
    ------------------------------------------------------------
    perf_event_attr:
    type 0
    size 104
    config 0
    sample_period 4000
    sample_freq 4000
    sample_type 0x107
    read_format 0
    disabled 1 inherit 1
    pinned 0 exclusive 0
    exclude_user 0 exclude_kernel 0
    exclude_hv 0 exclude_idle 0
    mmap 1 comm 1
    mmap2 1 comm_exec 1
    freq 1 inherit_stat 0
    enable_on_exec 1 task 1
    watermark 0 precise_ip 0
    mmap_data 0 sample_id_all 1
    exclude_host 0 exclude_guest 1
    excl.callchain_kern 0 excl.callchain_user 0
    wakeup_events 0
    wakeup_watermark 0
    bp_type 0
    bp_addr 0
    config1 0
    bp_len 0
    config2 0
    branch_sample_type 0
    sample_regs_user 0
    sample_stack_user 0
    sample_regs_intr 0
    ------------------------------------------------------------

    $ perf evlist -vv
    cycles: sample_freq=4000, size: 104, sample_type: IP|TID|TIME|PERIOD,
    disabled: 1, inherit: 1, mmap: 1, mmap2: 1, comm: 1, comm_exec: 1,
    freq: 1, enable_on_exec: 1, task: 1, sample_id_all: 1, exclude_guest: 1

    Post:

    $ ./perf record -vv -e cycles -- sleep 1
    ------------------------------------------------------------
    perf_event_attr:
    size 112
    { sample_period, sample_freq } 4000
    sample_type IP|TID|TIME|PERIOD
    disabled 1
    inherit 1
    mmap 1
    comm 1
    freq 1
    enable_on_exec 1
    task 1
    sample_id_all 1
    exclude_guest 1
    mmap2 1
    comm_exec 1
    ------------------------------------------------------------

    $ ./perf evlist -vv
    cycles: size: 112, { sample_period, sample_freq }: 4000, sample_type:
    IP|TID|TIME|PERIOD, disabled: 1, inherit: 1, mmap: 1, comm: 1, freq:
    1, enable_on_exec: 1, task: 1, sample_id_all: 1, exclude_guest: 1,
    mmap2: 1, comm_exec: 1

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Adrian Hunter
    Acked-by: Ingo Molnar
    Acked-by: Jiri Olsa
    Cc: "H. Peter Anvin"
    Cc: Andrew Morton
    Cc: David Ahern
    Cc: John Stultz
    Cc: Linus Torvalds
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150407091150.644238729@infradead.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Peter Zijlstra
     
  • Teach perf-record about the new perf_event_attr::{use_clockid, clockid}
    fields. Add a simple parameter to set the clock (if any) to be used for
    the events to be recorded into the data file.

    Since we store the entire perf_event_attr in the EVENT_DESC section we
    also already store the used clockid in the data file.

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: David Ahern
    Cc: "H. Peter Anvin"
    Cc: Adrian Hunter
    Cc: Andrew Morton
    Cc: Jiri Olsa
    Cc: John Stultz
    Cc: Linus Torvalds
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Yunlong Song
    Link: http://lkml.kernel.org/r/20150407154851.GR23123@twins.programming.kicks-ass.net
    [ Conditionally define CLOCK_BOOTTIME, at least rhel6 doesn't have it - dsahern
    Ditto for CLOCK_MONOTONIC_RAW, sles11sp2 doesn't have it - yunlong.song ]
    Signed-off-by: Arnaldo Carvalho de Melo

    Peter Zijlstra
     
  • …d of the default value 10

    Since sched->replay_repeat is set to 10 as default, the sched->run_avg,
    sched->runavg_cpu_usage, and sched->runavg_parent_cpu_usage all use
    10 to calculate their value.

    However, replay_repeat can be changed to another value with the -r
    option, so the calculations above should use replay_repeat instead
    of the hard-coded 10 to give accurate results.
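
    A minimal sketch of such an average, assuming it is a weighted
    running mean in which each new sample carries a weight of 1/repeat
    (this assumption is consistent with the ravg column in the replay
    output shown elsewhere in this log; update_run_avg() is a
    hypothetical name, not the perf function):

```c
#include <assert.h>

/*
 * Fold one new sample into a running average that spans `repeat` runs.
 * The first sample (avg == 0) seeds the average directly; afterwards
 * the old average keeps a weight of (repeat - 1) / repeat.
 * Hypothetical helper for illustration only.
 */
static double update_run_avg(double avg, double delta, unsigned int repeat)
{
	assert(repeat > 0);
	if (avg == 0.0)
		return delta;
	return (avg * (repeat - 1) + delta) / repeat;
}
```

    With repeat fixed at 10 this reduces to (avg * 9 + delta) / 10, which
    is only correct when -r is left at its default.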

    Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
    Cc: Paul Mackerras <paulus@samba.org>
    Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: Wang Nan <wangnan0@huawei.com>
    Link: http://lkml.kernel.org/r/1427809596-29559-10-git-send-email-yunlong.song@huawei.com
    Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

    Yunlong Song
     
  • Allow perf.data to be used when it is not owned by the current user or root.

    Example:

    $ ls -al perf.data
    -rw------- 1 Yunlong.Song Yunlong.Song 5321918 Mar 25 15:14 perf.data
    $ sudo id
    uid=0(root) gid=0(root) groups=0(root),64(pkcs11)

    Before this patch:

    $ sudo perf sched replay -f
    run measurement overhead: 98 nsecs
    sleep measurement overhead: 52909 nsecs
    the run test took 1000015 nsecs
    the sleep test took 1054253 nsecs
    File perf.data not owned by current user or root (use -f to override)

    As shown above, the -f option does not work at all.

    After this patch:

    $ sudo perf sched replay -f
    run measurement overhead: 221 nsecs
    sleep measurement overhead: 40514 nsecs
    the run test took 1000003 nsecs
    the sleep test took 1056098 nsecs
    nr_run_events: 10
    nr_sleep_events: 1562
    nr_wakeup_events: 5
    task 0 ( :1: 1), nr_events: 1
    task 1 ( :2: 2), nr_events: 1
    task 2 ( :3: 3), nr_events: 1
    ...
    ...
    task 1549 ( :163132: 163132), nr_events: 1
    task 1550 ( :163540: 163540), nr_events: 1
    task 1551 ( : 0), nr_events: 10
    ------------------------------------------------------------
    #1 : 50.198, ravg: 50.20, cpu: 2335.18 / 2335.18
    #2 : 219.099, ravg: 67.09, cpu: 2835.11 / 2385.17
    #3 : 238.626, ravg: 84.24, cpu: 3278.26 / 2474.48
    #4 : 200.364, ravg: 95.85, cpu: 2977.41 / 2524.77
    #5 : 176.882, ravg: 103.96, cpu: 2801.35 / 2552.43
    #6 : 191.093, ravg: 112.67, cpu: 2813.70 / 2578.56
    #7 : 189.448, ravg: 120.35, cpu: 2809.21 / 2601.62
    #8 : 200.637, ravg: 128.38, cpu: 2849.91 / 2626.45
    #9 : 248.338, ravg: 140.37, cpu: 4380.61 / 2801.87
    #10 : 511.139, ravg: 177.45, cpu: 3077.73 / 2829.45

    As shown above, the -f option really works now.

    Besides replay, the -f option also works for latency and map.

    Signed-off-by: Yunlong Song
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Wang Nan
    Link: http://lkml.kernel.org/r/1427809596-29559-9-git-send-email-yunlong.song@huawei.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Yunlong Song
     
  • The soft maximum number of open files for a calling process is 1024,
    which is defined as INR_OPEN_CUR in include/uapi/linux/fs.h, and the
    hard maximum number of open files for a calling process is 4096, which
    is defined as INR_OPEN_MAX in include/uapi/linux/fs.h.

    Both INR_OPEN_CUR and INR_OPEN_MAX are used to limit the value of
    RLIMIT_NOFILE in include/asm-generic/resource.h.

    The soft maximum is what ultimately limits the number of files a
    process is allowed to have open.

    That is to say, a process can use at most 1024 file descriptors for its
    open files; beyond that, an EMFILE error occurs.

    This error can be fixed by increasing the soft maximum, under the
    constraint that the soft maximum cannot exceed the hard maximum, or by
    increasing both the soft and the hard maximum together, which requires
    privilege.

    For perf sched replay, it uses sys_perf_event_open to create the file
    descriptor for each of the tasks in order to handle information of perf
    events.

    That is to say, each task needs its own file descriptor. On x86_64,
    perf.data may record more than 1024 or even 4096 tasks, so there are
    not enough file descriptors available.

    As a result, an EMFILE error occurs and stops the replay. To solve
    this problem, we adaptively increase the soft and hard maximum number of
    open files with a '-f' option.
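
    Raising the limit can be sketched with getrlimit(2) and setrlimit(2).
    This is an illustration under assumptions, not the code from the
    patch: raise_nofile_limit() and its `wanted` parameter are
    hypothetical names.

```c
#include <assert.h>
#include <sys/resource.h>

/*
 * Make sure the soft RLIMIT_NOFILE is at least `wanted` descriptors.
 * If the hard limit is lower than `wanted`, try to raise it as well,
 * which normally succeeds only for a privileged process.
 * Returns 0 on success, -1 on failure with errno set by the syscall.
 * Hypothetical helper, not part of perf.
 */
static int raise_nofile_limit(rlim_t wanted)
{
	struct rlimit lim;

	assert(wanted > 0);
	if (getrlimit(RLIMIT_NOFILE, &lim) == -1)
		return -1;
	if (lim.rlim_cur >= wanted)
		return 0;		/* already large enough */

	if (lim.rlim_max != RLIM_INFINITY && lim.rlim_max < wanted)
		lim.rlim_max = wanted;	/* needs privilege */
	lim.rlim_cur = wanted;
	return setrlimit(RLIMIT_NOFILE, &lim);
}
```

    A caller would typically invoke this once after the first EMFILE,
    passing the number of descriptors still needed plus those already
    open, and then retry the failed open.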

    Example:

    Test environment: x86_64 with 160 cores

    $ cat /proc/sys/kernel/pid_max
    163840
    $ cat /proc/sys/fs/file-max
    6815744
    $ ulimit -Sn
    1024
    $ ulimit -Hn
    4096

    Before this patch:

    $ perf sched replay
    ...
    task 1549 ( :163132: 163132), nr_events: 1
    task 1550 ( :163540: 163540), nr_events: 1
    task 1551 ( : 0), nr_events: 10
    Error: sys_perf_event_open() syscall returned with -1 (Too many open
    files)

    After this patch:

    $ perf sched replay
    ...
    task 1549 ( :163132: 163132), nr_events: 1
    task 1550 ( :163540: 163540), nr_events: 1
    task 1551 ( : 0), nr_events: 10
    Error: sys_perf_event_open() syscall returned with -1 (Too many open
    files)
    Have a try with -f option

    $ perf sched replay -f
    ...
    task 1549 ( :163132: 163132), nr_events: 1
    task 1550 ( :163540: 163540), nr_events: 1
    task 1551 ( : 0), nr_events: 10
    ------------------------------------------------------------
    #1 : 54.401, ravg: 54.40, cpu: 3285.21 / 3285.21
    #2 : 199.548, ravg: 68.92, cpu: 4999.65 / 3456.66
    #3 : 170.483, ravg: 79.07, cpu: 1349.94 / 3245.99
    #4 : 192.034, ravg: 90.37, cpu: 1322.88 / 3053.67
    #5 : 182.929, ravg: 99.62, cpu: 1406.51 / 2888.96
    #6 : 152.974, ravg: 104.96, cpu: 1167.54 / 2716.82
    #7 : 155.579, ravg: 110.02, cpu: 2992.53 / 2744.39
    #8 : 130.557, ravg: 112.08, cpu: 1126.43 / 2582.59
    #9 : 138.520, ravg: 114.72, cpu: 1253.22 / 2449.65
    #10 : 134.328, ravg: 116.68, cpu: 1587.95 / 2363.48

    Signed-off-by: Yunlong Song
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Wang Nan
    Link: http://lkml.kernel.org/r/1427809596-29559-8-git-send-email-yunlong.song@huawei.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Yunlong Song
     
  • wait_for_tasks() does a sem_wait for each task, e.g.
    sem_wait(&task->work_done_sem).

    The sem_wait can continue only when work_done_sem is greater than 0;
    otherwise it blocks.

    For perf sched replay, one task may sem_post the work_done_sem of
    another task, so that task's work_done_sem is processed in a
    meaningful sequence, e.g. sem_post, sem_wait, sem_wait, sem_post...

    This sequence simulates the scheduling of the running tasks at the
    time when perf sched record ran.

    As a result, all the tasks are required and their threads must be
    successfully created.

    If any one of the tasks (task A) fails to create its thread, then
    another task (task B), whose work_done_sem needs a sem_post from the
    failed task A, will likely block in sem_wait.

    This is a deadlock, since task B's thread_func cannot continue at
    all.

    To solve this problem, perf sched replay should exit once any task fails
    to create its thread.

    Example:

    Test environment: x86_64 with 160 cores

    Before this patch:

    $ perf sched replay
    ...
    Error: sys_perf_event_open() syscall returned with -1 (Too many open
    files)
    ------------------------------------------------------------ : 0), nr_events: 10
    Error: sys_perf_event_open() syscall returned with -1 (Too many open
    files)
    $

    As shown above, perf sched replay finishes the process after printing an
    error message and does not block itself.

    Signed-off-by: Yunlong Song
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Wang Nan
    Link: http://lkml.kernel.org/r/1427809596-29559-7-git-send-email-yunlong.song@huawei.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Yunlong Song
     
  • The pr_err in self_open_counters() prints error message to stderr.
    Unlike stdout, stderr uses memory buffer on the stack of each calling
    process.

    The pr_err in self_open_counters() works in a thread called thread_func
    created in function create_tasks, which concurrently creates
    sched->nr_tasks threads.

    If the error happens and pr_err prints the error message in each of
    these threads, the stack size of the perf process (default is 8192
    kbytes) will quickly run out and the segmentation fault will happen
    then.

    To solve this problem, pr_err with self_open_counters() should be moved
    from newly created threads to the old main thread of the perf process.
    Then the pr_err can work in a stable situation without the strange
    segmentation fault problem.

    Example:

    Test environment: x86_64 with 160 cores

    Before this patch:

    $ perf sched replay
    ...
    task 1549 ( :163132: 163132), nr_events: 1
    task 1550 ( :163540: 163540), nr_events: 1
    task 1551 ( : 0), nr_events: 10
    Segmentation fault

    After this patch:

    $ perf sched replay
    ...
    task 1549 ( :163132: 163132), nr_events: 1
    task 1550 ( :163540: 163540), nr_events: 1
    task 1551 ( : 0), nr_events: 10
    ...

    As shown above, the result continues without any segmentation fault.

    Signed-off-by: Yunlong Song
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Wang Nan
    Link: http://lkml.kernel.org/r/1427809596-29559-6-git-send-email-yunlong.song@huawei.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Yunlong Song
     
  • …fferent pid_max configurations

    Although the memory of pid_to_task can be allocated via calloc according
    to the value of /proc/sys/kernel/pid_max, it cannot handle the case when
    pid_max is changed after 'perf sched record' has created its perf.data.

    If the new pid_max configured in 'perf sched replay' is smaller than the
    old pid_max configured in 'perf sched record', then it will cause the
    assertion failure problem.

    To solve this problem, we realloc the memory of pid_to_task stepwise
    once the passed-in pid parameter in register_pid is larger than the
    current pid_max.
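
    The growth step can be sketched as follows. This is a simplified
    illustration, assuming the table is grown to exactly pid + 1 slots;
    struct pid_table and pid_table_reserve() are hypothetical names that
    do not appear in perf.

```c
#include <assert.h>
#include <stdlib.h>

struct task_desc;			/* opaque here; only pointers are stored */

/* Hypothetical container for the pid -> task lookup table. */
struct pid_table {
	struct task_desc **slots;
	size_t size;			/* number of slots currently allocated */
};

/*
 * Make sure slots[pid] is addressable, growing the table when pid is
 * beyond the current size and setting every new slot to NULL.
 * Returns 0 on success, -1 if realloc fails (the old table stays valid).
 */
static int pid_table_reserve(struct pid_table *t, size_t pid)
{
	struct task_desc **p;
	size_t new_size, i;

	assert(t != NULL);
	if (pid < t->size)
		return 0;

	new_size = pid + 1;
	p = realloc(t->slots, new_size * sizeof(*p));
	if (p == NULL)
		return -1;
	for (i = t->size; i < new_size; i++)
		p[i] = NULL;
	t->slots = p;
	t->size = new_size;
	return 0;
}
```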

    Example:

    Test environment: x86_64 with 160 cores

    $ cat /proc/sys/kernel/pid_max
    163840
    $ perf sched record ls
    $ echo 5000 > /proc/sys/kernel/pid_max
    $ cat /proc/sys/kernel/pid_max
    5000

    Before this patch:

    $ perf sched replay
    run measurement overhead: 221 nsecs
    sleep measurement overhead: 55356 nsecs
    the run test took 1000011 nsecs
    the sleep test took 1060940 nsecs
    perf: builtin-sched.c:337: register_pid: Assertion `!(pid >= (unsigned
    long)pid_max)' failed.
    Aborted

    After this patch:

    $ perf sched replay
    run measurement overhead: 221 nsecs
    sleep measurement overhead: 55611 nsecs
    the run test took 1000026 nsecs
    the sleep test took 1060486 nsecs
    nr_run_events: 10
    nr_sleep_events: 1562
    nr_wakeup_events: 5
    task 0 ( :1: 1), nr_events: 1
    task 1 ( :2: 2), nr_events: 1
    task 2 ( :3: 3), nr_events: 1
    task 3 ( :5: 5), nr_events: 1
    ...

    Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
    Cc: Paul Mackerras <paulus@samba.org>
    Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: Wang Nan <wangnan0@huawei.com>
    Link: http://lkml.kernel.org/r/1427809596-29559-5-git-send-email-yunlong.song@huawei.com
    Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

    Yunlong Song
     
  • …nexpected change of pid_max

    The current memory allocation of struct task_desc *pid_to_task[MAX_PID]
    is in a permanent and preset way, and it has two problems:

    Problem 1: If the pid_max, which is the max number of pids in the
    system, is much smaller than MAX_PID (1024*1000), then it causes a waste
    of stack memory. This may happen in the case where the number of cpu
    cores is much smaller than 1000.

    Problem 2: If pid_max is changed from the default value to a value
    larger than MAX_PID, an assertion failure results. The maximum value
    of pid_max is pid_max_max (see pidmap_init defined in kernel/pid.c),
    which equals PID_MAX_LIMIT. On x86_64, PID_MAX_LIMIT is 4*1024*1024
    (defined in include/linux/threads.h). This value is much larger than
    MAX_PID, and would take up 32768 Kbytes (4*1024*1024*8/1024) for
    pid_to_task, which is much larger than the default 8192 Kbytes stack
    size of the calling process.

    Due to these two problems, we use calloc to allocate the memory of
    pid_to_task dynamically.
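
    A minimal sketch of sizing the table at run time, assuming the limit
    is read from a sysctl file such as /proc/sys/kernel/pid_max;
    parse_pid_max() and alloc_pid_table() are hypothetical helpers, not
    the functions used by the patch:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

struct task_desc;			/* opaque here; only pointers are stored */

/* Parse a positive decimal pid_max from text; returns -1 if invalid. */
static long parse_pid_max(const char *text)
{
	char *end;
	long v;

	assert(text != NULL);
	errno = 0;
	v = strtol(text, &end, 10);
	if (errno != 0 || end == text || v <= 0)
		return -1;
	return v;
}

/*
 * Read pid_max from `path` and allocate a zeroed pid -> task table on
 * the heap with that many slots. Returns NULL on any failure; on
 * success *pid_max holds the number of slots.
 */
static struct task_desc **alloc_pid_table(const char *path, long *pid_max)
{
	char line[32];
	FILE *fp;
	long n;

	assert(path != NULL && pid_max != NULL);
	fp = fopen(path, "r");
	if (fp == NULL)
		return NULL;
	if (fgets(line, sizeof(line), fp) == NULL) {
		fclose(fp);
		return NULL;
	}
	fclose(fp);

	n = parse_pid_max(line);
	if (n < 0)
		return NULL;
	*pid_max = n;
	return calloc((size_t)n, sizeof(struct task_desc *));
}
```

    Because the table lives on the heap, its size is no longer bounded
    by the stack limit of the calling process.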

    Example:

    Test environment: x86_64 with 160 cores

    $ cat /proc/sys/kernel/pid_max
    163840
    $ echo 1025000 > /proc/sys/kernel/pid_max
    $ cat /proc/sys/kernel/pid_max
    1025000

    Run some applications until the pid of some process is greater than
    the value of MAX_PID (1024*1000).

    Before this patch:

    $ perf sched replay
    run measurement overhead: 221 nsecs
    sleep measurement overhead: 55480 nsecs
    the run test took 1000008 nsecs
    the sleep test took 1063151 nsecs
    perf: builtin-sched.c:330: register_pid: Assertion `!(pid >= 1024000)'
    failed.
    Aborted

    After this patch:

    $ perf sched replay
    run measurement overhead: 221 nsecs
    sleep measurement overhead: 55435 nsecs
    the run test took 1000004 nsecs
    the sleep test took 1059312 nsecs
    nr_run_events: 10
    nr_sleep_events: 1562
    nr_wakeup_events: 5
    task 0 ( :1: 1), nr_events: 1
    task 1 ( :2: 2), nr_events: 1
    task 2 ( :3: 3), nr_events: 1
    task 3 ( :5: 5), nr_events: 1
    ...

    Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
    Cc: Paul Mackerras <paulus@samba.org>
    Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: Wang Nan <wangnan0@huawei.com>
    Link: http://lkml.kernel.org/r/1427809596-29559-4-git-send-email-yunlong.song@huawei.com
    Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

    Yunlong Song
     
  • The current MAX_PID is only 65536, which causes an assertion failure
    when there are more than 64 CPU cores on x86_64.

    This is because the pid_max value in x86_64 is at least
    PIDS_PER_CPU_DEFAULT * num_possible_cpus() (see function pidmap_init
    defined in kernel/pid.c), where PIDS_PER_CPU_DEFAULT is 1024 (defined in
    include/linux/threads.h).

    Thus MAX_PID = 65536 corresponds to 65536/1024 = 64 CPU cores. This
    is clearly not enough for x86_64, and causes an assertion failure due
    to BUG_ON(pid >= MAX_PID) in the code.

    We increase MAX_PID value from 65536 to 1024*1000, which can be used in
    x86_64 with 1000 cores.

    This number is finally decided according to the limitation of stack size
    of calling process.

    'ulimit -a' shows that the stack size of any process is 8192 Kbytes,
    which is defined in include/uapi/linux/resource.h (#define
    _STK_LIM (8*1024*1024)).

    Thus we choose a value for MAX_PID that is large enough but still
    satisfies the stack size limit, i.e., makes the perf process take up
    a memory space just smaller than 8192 Kbytes.

    We have calculated and tested that 1024*1000 is OK for MAX_PID.
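
    The arithmetic behind this choice can be checked with a short sketch.
    It assumes 8-byte pointers (as on x86_64) and counts only the
    pid_to_task array itself; pid_table_bytes(), STK_LIM and PTR_SIZE are
    names made up for the illustration.

```c
#include <assert.h>

#define STK_LIM  (8UL * 1024 * 1024)	/* default stack limit in bytes */
#define PTR_SIZE 8UL			/* pointer size assumed for x86_64 */

/* Bytes needed for a pid -> task pointer table with max_pid entries. */
static unsigned long pid_table_bytes(unsigned long max_pid)
{
	assert(max_pid <= (unsigned long)-1 / PTR_SIZE);	/* no overflow */
	return max_pid * PTR_SIZE;
}
```

    For MAX_PID = 1024*1000 the table needs 8192000 bytes, which is just
    below the 8388608-byte limit, while a table with 4*1024*1024 entries
    (PID_MAX_LIMIT on x86_64) would need 33554432 bytes and could not
    fit on the stack.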

    This means perf sched replay can now be used with at most 1000 cores in
    x86_64 without any assertion failure problem.

    Example:

    Test environment: x86_64 with 160 cores

    $ cat /proc/sys/kernel/pid_max
    163840

    Before this patch:

    $ perf sched replay
    run measurement overhead: 240 nsecs
    sleep measurement overhead: 55379 nsecs
    the run test took 1000004 nsecs
    the sleep test took 1059424 nsecs
    perf: builtin-sched.c:330: register_pid: Assertion `!(pid >= 65536)'
    failed.
    Aborted

    After this patch:

    $ perf sched replay
    run measurement overhead: 221 nsecs
    sleep measurement overhead: 55397 nsecs
    the run test took 999920 nsecs
    the sleep test took 1053313 nsecs
    nr_run_events: 10
    nr_sleep_events: 1562
    nr_wakeup_events: 5
    task 0 ( :1: 1), nr_events: 1
    task 1 ( :2: 2), nr_events: 1
    task 2 ( :3: 3), nr_events: 1
    task 3 ( :5: 5), nr_events: 1
    ...

    Signed-off-by: Yunlong Song
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Wang Nan
    Link: http://lkml.kernel.org/r/1427809596-29559-3-git-send-email-yunlong.song@huawei.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Yunlong Song
     
  • There is no struct task_task at all; it is a typo from old commits.
    Fix it to what it should be in order to avoid unnecessary
    misunderstanding.

    Signed-off-by: Yunlong Song
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Wang Nan
    Link: http://lkml.kernel.org/r/1427809596-29559-2-git-send-email-yunlong.song@huawei.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Yunlong Song
     
  • Currently perf kmem does not respect the -i option.

    Initialize file.path properly after the options get parsed.

    Signed-off-by: Jiri Olsa
    Cc: David Ahern
    Cc: Jiri Olsa
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Peter Zijlstra
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/1428298576-9785-2-git-send-email-namhyung@kernel.org
    Signed-off-by: Namhyung Kim
    Signed-off-by: Arnaldo Carvalho de Melo

    Jiri Olsa
     
  • Currently it ignores operator priority and just sets processed args as a
    right operand. But this could result in priority inversion when the
    right operand is also an operator arg with lower priority.

    For example, following print format is from new kmem events.

    "page=%p", REC->pfn != -1UL ? (((struct page *)(0xffffea0000000000UL)) + (REC->pfn)) : ((void *)0)

    But this was treated as below:

    REC->pfn != ((null - 1UL) ? ((struct page *)0xffffea0000000000UL + REC->pfn) : (void *) 0)

    In this case, the right arg was the '?' operator, which has lower
    priority. But the parser just set the whole arg as the right operand,
    making the output confusing: page was always 0 or 1, since that is
    the result of a logical operation.

    With this patch, it can handle it properly like following:

    ((REC->pfn != (null - 1UL)) ? ((struct page *)0xffffea0000000000UL + REC->pfn) : (void *) 0)
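
    The two groupings can be reproduced in plain C, where '!=' binds
    tighter than '?:'. The sketch below is an illustration only: plain
    integers stand in for the struct page pointer arithmetic, and
    page_or_zero() and misparsed() are hypothetical names.

```c
#include <assert.h>

/*
 * Correct grouping: (pfn != -1UL) ? (base + pfn) : 0
 * This is how a C compiler reads the unparenthesized expression.
 */
static unsigned long page_or_zero(unsigned long pfn, unsigned long base)
{
	assert(pfn == -1UL || base <= -1UL - pfn);	/* no wrap-around */
	return pfn != -1UL ? base + pfn : 0;
}

/*
 * Inverted grouping: the whole conditional becomes the right operand
 * of '!=', so the result is only ever 0 or 1.
 */
static unsigned long misparsed(unsigned long pfn, unsigned long base)
{
	return pfn != (-1UL ? base + pfn : 0);
}
```

    For pfn = 5 and base = 100, page_or_zero() yields 105 while
    misparsed() yields 1, which matches the "always 0 or 1" symptom
    described above.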

    Signed-off-by: Namhyung Kim
    Acked-by: Steven Rostedt
    Cc: David Ahern
    Cc: Jiri Olsa
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Peter Zijlstra
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/1428298576-9785-10-git-send-email-namhyung@kernel.org
    [ Replaced 'swap' with 'rotate' in a comment as requested by Steve and agreed by Namhyung ]
    Signed-off-by: Arnaldo Carvalho de Melo

    Namhyung Kim
     
  • This patch adds checks in places where map__kmap is used to get kmaps
    from struct kmap.

    Error messages are added at map__kmap to warn about invalid access to
    kmap (for the case of !map->dso->kernel, kmap(map) does not exist at all).

    Also, introduce map__kmaps() to warn about uninitialized kmaps.

    Reviewed-by: Ingo Molnar
    Signed-off-by: Wang Nan
    Cc: pi3orama@163.com
    Cc: Jiri Olsa
    Cc: Namhyung Kim
    Cc: Zefan Li
    Link: http://lkml.kernel.org/r/1428394966-131044-2-git-send-email-wangnan0@huawei.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Wang Nan
     
  • perf_evlist__mmap_consume() uses perf_mmap__empty() to judge whether
    perf_mmap is empty and can be released. But the result is inverted so
    fix it.

    Signed-off-by: He Kuang
    Tested-by: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Peter Zijlstra
    Cc: Wang Nan
    Link: http://lkml.kernel.org/r/1428399071-7141-1-git-send-email-hekuang@huawei.com
    Signed-off-by: Arnaldo Carvalho de Melo

    He Kuang
     
  • Conflicts:
    arch/x86/kernel/entry_64.S

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • This is my sigreturn test, added mostly unchanged from its old
    home. It exercises the sigreturn(2) syscall, specifically
    focusing on its interactions with various IRET corner cases.

    It tests for correct behavior in several areas that were
    historically dangerously buggy. For example, it exercises espfix
    on kernels of both bitnesses under various conditions, and it
    contains testcases for several now-fixed bugs in IRET error
    handling.

    If you run it on older kernels without the fixes, your system will
    crash. It probably won't eat your data in the process.

    There is no released kernel on which the sigreturn_64 test will
    pass, but it passes on tip:x86/asm.

    I plan to switch to lib.mk for Linux 4.2.

    I'm not using the ksft_ helpers at all yet. I can do that later.

    Signed-off-by: Andy Lutomirski
    Acked-by: Shuah Khan
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Michael Ellerman
    Cc: Shuah Khan
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/89d10b76b92c7202d8123654dc8d36701c017b3d.1428386971.git.luto@kernel.org
    [ Fixed empty format string GCC build warning in trivial_32bit_program.c ]
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     

07 Apr, 2015

1 commit


03 Apr, 2015

2 commits

  • usleep is only provided on distros from Red Hat, so running ftracetest
    on other distros resulted in failures due to the missing usleep.

    The reason for using [u]sleep in the test was to generate (scheduler)
    events. This can be done in various ways, like this:

    yield() { ping localhost -c 1 || sleep .001 || usleep 1 || sleep 1; }

    For more information on the history of this patch, please refer to:

    Link: http://lkml.kernel.org/r/1427329943-16896-1-git-send-email-namhyung@kernel.org

    Reported-by: Michael Ellerman
    Reported-by: Dave Jones
    Reported-by: Luis Henriques
    Suggested-by: Pádraig Brady
    Acked-by: Steven Rostedt
    Acked-by: Masami Hiramatsu
    Signed-off-by: Namhyung Kim
    Signed-off-by: Shuah Khan

    Namhyung Kim
     
  • Conflicts:
    drivers/net/usb/asix_common.c
    drivers/net/usb/sr9800.c
    drivers/net/usb/usbnet.c
    include/linux/usb/usbnet.h
    net/ipv4/tcp_ipv4.c
    net/ipv6/tcp_ipv6.c

    The TCP conflicts were overlapping changes. In 'net' we added a
    READ_ONCE() to the socket cached RX route read, whilst in 'net-next'
    Eric Dumazet touched the surrounding code dealing with how mini
    sockets are handled.

    With USB, it's a case of the same bug fix first going into net-next
    and then I cherry picked it back into net.

    Signed-off-by: David S. Miller

    David S. Miller