18 Mar, 2016

2 commits

  • This adds two command line keys:

    -c|--cgroup path|@inode Walk only pages owned by this memory cgroup
    -C|--list-cgroup Show memory cgroup inodes
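
    As a rough illustration of the "path|@inode" argument form, the sketch
    below resolves either spelling to an inode number. It assumes that a
    memory cgroup is identified by the inode of its directory (which is what
    "Show memory cgroup inodes" suggests); parse_cgroup_arg() is a
    hypothetical helper, not code taken from page-types.c.

```python
import os


def parse_cgroup_arg(arg: str) -> int:
    """Return the inode number selected by a "path|@inode" style argument.

    Hypothetical helper for illustration only.
    """
    if arg.startswith("@"):
        # "@inode" form: the inode number is given directly.
        return int(arg[1:])
    # "path" form: assume the cgroup directory's inode identifies the cgroup.
    return os.stat(arg).st_ino
```

    For example, parse_cgroup_arg("@42") yields 42, while passing a cgroup
    directory path yields that directory's st_ino.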

    [vdavydov@virtuozzo.com: opt_cgroup should be uint64_t. Fix conflicts with "tools/vm/page-types.c: support swap entry"]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Naoya Horiguchi
    Reviewed-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • /proc/pid/pagemap (pte_to_pagemap_entry() internally) already reports
    swap entries, so let's make the in-kernel utility aware of them.
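
    For readers unfamiliar with the pagemap format, here is a minimal
    decoding sketch. The bit layout follows the kernel's pagemap
    documentation (bit 63 present, bit 62 swapped, bits 0-54 holding either
    the PFN or the swap type in bits 0-4 plus the swap offset in bits 5-54);
    decode_pagemap_entry() is a hypothetical name and this is not the
    page-types.c implementation.

```python
PM_PRESENT = 1 << 63          # page is present in RAM
PM_SWAP = 1 << 62             # page is swapped out
PM_PFRAME_MASK = (1 << 55) - 1  # bits 0-54: PFN or swap type/offset
SWAP_TYPE_BITS = 5            # bits 0-4 of the payload hold the swap type


def decode_pagemap_entry(entry: int):
    """Classify one 64-bit /proc/pid/pagemap entry (illustrative sketch)."""
    payload = entry & PM_PFRAME_MASK
    if entry & PM_PRESENT:
        return ("present", payload)
    if entry & PM_SWAP:
        swap_type = payload & ((1 << SWAP_TYPE_BITS) - 1)
        swap_offset = payload >> SWAP_TYPE_BITS
        return ("swap", swap_type, swap_offset)
    return ("none",)
```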

    Signed-off-by: Naoya Horiguchi
    Cc: Vladimir Davydov
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

17 Mar, 2016

3 commits

  • Pull libnvdimm updates from Dan Williams:

    - Asynchronous address range scrub:

    Given the capacities of next generation persistent memory devices a
    scrub operation to find all poison may take 10s of seconds. We
    want this scrub work to be done asynchronously with the rest of
    system initialization, so we move it out of line from the NFIT
    probing, i.e. acpi_nfit_add().

    - Clear poison:

    ACPI 6.1 introduces the ability to send "clear error" commands to
    the ACPI0012:00 device representing the root of an "nvdimm bus".
    Similar to relocating a bad block on a disk, this support clears
    media errors in response to a write.

    - Persistent memory resource tracking:

    A persistent memory range may be designated as simply "reserved" by
    platform firmware in the efi/e820 memory map. Later when the NFIT
    driver loads it discovers that the range is "Persistent Memory".

    The NFIT bus driver inserts a resource to advertise that
    "persistent" attribute in the system resource tree for /proc/iomem
    and kernel-internal usages.

    - Miscellaneous cleanups and fixes:

    Workaround section misaligned pmem ranges when allocating a struct
    page memmap, fix handling of the read-only case in the ioctl path,
    and clean up block device major number allocation.

    * tag 'libnvdimm-for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (26 commits)
    libnvdimm, pmem: clear poison on write
    libnvdimm, pmem: fix kmap_atomic() leak in error path
    nvdimm/btt: don't allocate unused major device number
    nvdimm/blk: don't allocate unused major device number
    pmem: don't allocate unused major device number
    ACPI: Change NFIT driver to insert new resource
    resource: Export insert_resource and remove_resource
    resource: Add remove_resource interface
    resource: Change __request_region to inherit from immediate parent
    libnvdimm, pmem: fix ia64 build, use PHYS_PFN
    nfit, libnvdimm: clear poison command support
    libnvdimm, pfn: 'resource'-address and 'size' attributes for pfn devices
    libnvdimm, pmem: adjust for section collisions with 'System RAM'
    libnvdimm, pmem: fix 'pfn' support for section-misaligned namespaces
    libnvdimm: Fix security issue with DSM IOCTL.
    libnvdimm: Clean-up access mode check.
    tools/testing/nvdimm: expand ars unit testing
    nfit: disable userspace initiated ars during scrub
    nfit: scrub and register regions in a workqueue
    nfit, libnvdimm: async region scrub workqueue
    ...

    Linus Torvalds
     
  • Pull power management and ACPI updates from Rafael Wysocki:
    "This time the majority of changes go into cpufreq and they are
    significant.

    First off, the way CPU frequency updates are triggered is different
    now. Instead of having to set up and manage a deferrable timer for
    each CPU in the system to evaluate and possibly change its frequency
    periodically, cpufreq governors set up callbacks to be invoked by the
    scheduler on a regular basis (basically on utilization updates). The
    "old" governors, "ondemand" and "conservative", still do all of their
    work in process context (although that is triggered by the scheduler
    now), but intel_pstate does it all in the callback invoked by the
    scheduler with no need for any additional asynchronous processing.

    Of course, this eliminates the overhead related to the management of
    all those timers, but also it allows the cpufreq governor code to be
    simplified quite a bit. On top of that, the common code and data
    structures used by the "ondemand" and "conservative" governors are
    cleaned up and made more straightforward and some long-standing and
    quite annoying problems are addressed. In particular, the handling of
    governor sysfs attributes is modified and the related locking becomes
    more fine grained which allows some concurrency problems to be avoided
    (particularly deadlocks with the core cpufreq code).

    In principle, the new mechanism for triggering frequency updates
    allows utilization information to be passed from the scheduler to
    cpufreq. Although the current code doesn't make use of it, in the
    works is a new cpufreq governor that will make decisions based on the
    scheduler's utilization data. That should allow the scheduler and
    cpufreq to work more closely together in the long run.

    In addition to the core and governor changes, cpufreq drivers are
    updated too. Fixes and optimizations go into intel_pstate, the
    cpufreq-dt driver is updated on top of some modification in the
    Operating Performance Points (OPP) framework and there are fixes and
    other updates in the powernv cpufreq driver.

    Apart from the cpufreq updates there is some new ACPICA material,
    including a fix for a problem introduced by previous ACPICA updates,
    and some less significant changes in the ACPI code, like CPPC code
    optimizations, ACPI processor driver cleanups and support for loading
    ACPI tables from initrd.

    Also updated are the generic power domains framework, the Intel RAPL
    power capping driver and the turbostat utility and we have a bunch of
    traditional assorted fixes and cleanups.

    Specifics:

    - Redesign of cpufreq governors and the intel_pstate driver to make
    them use callbacks invoked by the scheduler to trigger CPU
    frequency evaluation instead of using per-CPU deferrable timers for
    that purpose (Rafael Wysocki).

    - Reorganization and cleanup of cpufreq governor code to make it more
    straightforward and fix some concurrency problems in it (Rafael
    Wysocki, Viresh Kumar).

    - Cleanup and improvements of locking in the cpufreq core (Viresh
    Kumar).

    - Assorted cleanups in the cpufreq core (Rafael Wysocki, Viresh
    Kumar, Eric Biggers).

    - intel_pstate driver updates including fixes, optimizations and a
    modification to make it enable hardware-coordinated P-state
    selection (HWP) by default if supported by the processor (Philippe
    Longepe, Srinivas Pandruvada, Rafael Wysocki, Viresh Kumar, Felipe
    Franciosi).

    - Operating Performance Points (OPP) framework updates to improve its
    handling of voltage regulators and device clocks and updates of the
    cpufreq-dt driver on top of that (Viresh Kumar, Jon Hunter).

    - Updates of the powernv cpufreq driver to fix initialization and
    cleanup problems in it and correct its worker thread handling with
    respect to CPU offline, new powernv_throttle tracepoint (Shilpasri
    Bhat).

    - ACPI cpufreq driver optimization and cleanup (Rafael Wysocki).

    - ACPICA updates including one fix for a regression introduced by
    previous changes in the ACPICA code (Bob Moore, Lv Zheng, David Box,
    Colin Ian King).

    - Support for installing ACPI tables from initrd (Lv Zheng).

    - Optimizations of the ACPI CPPC code (Prashanth Prakash, Ashwin
    Chaugule).

    - Support for _HID(ACPI0010) devices (ACPI processor containers) and
    ACPI processor driver cleanups (Sudeep Holla).

    - Support for ACPI-based enumeration of the AMBA bus (Graeme Gregory,
    Aleksey Makarov).

    - Modification of the ACPI PCI IRQ management code to make it treat
    255 in the Interrupt Line register as "not connected" on x86 (as
    per the specification) and avoid attempts to use that value as a
    valid interrupt vector (Chen Fan).

    - ACPI APEI fixes related to resource leaks (Josh Hunt).

    - Removal of modularity from a few ACPI drivers (BGRT, GHES,
    intel_pmic_crc) that cannot be built as modules in practice (Paul
    Gortmaker).

    - PNP framework update to make it treat ACPI_RESOURCE_TYPE_SERIAL_BUS
    as a valid resource type (Harb Abdulhamid).

    - New device ID (future AMD I2C controller) in the ACPI driver for
    AMD SoCs (APD) and in the designware I2C driver (Xiangliang Yu).

    - Assorted ACPI cleanups (Colin Ian King, Kaiyen Chang, Oleg Drokin).

    - cpuidle menu governor optimization to avoid a square root
    computation in it (Rasmus Villemoes).

    - Fix for potential use-after-free in the generic device properties
    framework (Heikki Krogerus).

    - Updates of the generic power domains (genpd) framework including
    support for multiple power states of a domain, fixes and debugfs
    output improvements (Axel Haslam, Jon Hunter, Laurent Pinchart,
    Geert Uytterhoeven).

    - Intel RAPL power capping driver updates to reduce IPI overhead in
    it (Jacob Pan).

    - System suspend/hibernation code cleanups (Eric Biggers, Saurabh
    Sengar).

    - Year 2038 fix for the process freezer (Abhilash Jindal).

    - turbostat utility updates including new features (decoding of more
    registers and CPUID fields, sub-second intervals support, GFX MHz
    and RC6 printout, --out command line option), fixes (syscall jitter
    detection and workaround, reduction of the number of syscalls
    made, fixes related to Xeon x200 processors, compiler warning
    fixes) and cleanups (Len Brown, Hubert Chrzaniuk, Chen Yu)"

    * tag 'pm+acpi-4.6-rc1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (182 commits)
    tools/power turbostat: bugfix: TDP MSRs print bits fixing
    tools/power turbostat: correct output for MSR_NHM_SNB_PKG_CST_CFG_CTL dump
    tools/power turbostat: call __cpuid() instead of __get_cpuid()
    tools/power turbostat: indicate SMX and SGX support
    tools/power turbostat: detect and work around syscall jitter
    tools/power turbostat: show GFX%rc6
    tools/power turbostat: show GFXMHz
    tools/power turbostat: show IRQs per CPU
    tools/power turbostat: make fewer systems calls
    tools/power turbostat: fix compiler warnings
    tools/power turbostat: add --out option for saving output in a file
    tools/power turbostat: re-name "%Busy" field to "Busy%"
    tools/power turbostat: Intel Xeon x200: fix turbo-ratio decoding
    tools/power turbostat: Intel Xeon x200: fix erroneous bclk value
    tools/power turbostat: allow sub-sec intervals
    ACPI / APEI: ERST: Fixed leaked resources in erst_init
    ACPI / APEI: Fix leaked resources
    intel_pstate: Do not skip samples partially
    intel_pstate: Remove freq calculation from intel_pstate_calc_busy()
    intel_pstate: Move intel_pstate_calc_busy() into get_target_pstate_use_performance()
    ...

    Linus Torvalds
     
  • Merge first patch-bomb from Andrew Morton:

    - some misc things

    - ocfs2 updates

    - about half of MM

    - checkpatch updates

    - autofs4 update

    * emailed patches from Andrew Morton : (120 commits)
    autofs4: fix string.h include in auto_dev-ioctl.h
    autofs4: use pr_xxx() macros directly for logging
    autofs4: change log print macros to not insert newline
    autofs4: make autofs log prints consistent
    autofs4: fix some white space errors
    autofs4: fix invalid ioctl return in autofs4_root_ioctl_unlocked()
    autofs4: fix coding style line length in autofs4_wait()
    autofs4: fix coding style problem in autofs4_get_set_timeout()
    autofs4: coding style fixes
    autofs: show pipe inode in mount options
    kallsyms: add support for relative offsets in kallsyms address table
    kallsyms: don't overload absolute symbol type for percpu symbols
    x86: kallsyms: disable absolute percpu symbols on !SMP
    checkpatch: fix another left brace warning
    checkpatch: improve UNSPECIFIED_INT test for bare signed/unsigned uses
    checkpatch: warn on bare unsigned or signed declarations without int
    checkpatch: exclude asm volatile from complex macro check
    mm: memcontrol: drop unnecessary lru locking from mem_cgroup_migrate()
    mm: migrate: consolidate mem_cgroup_migrate() calls
    mm/compaction: speed up pageblock_pfn_to_page() when zone is contiguous
    ...

    Linus Torvalds
     

16 Mar, 2016

5 commits

  • In tracepoints, it's possible to print gfp flags in a human-friendly
    format through a macro show_gfp_flags(), which defines a translation
    array and passes it to __print_flags(). Since the following patch will
    introduce support for gfp flags printing in printk(), it would be nice
    to reuse the array. This is not straightforward, since __print_flags()
    can't simply reference an array defined in a .c file such as mm/debug.c
    - it has to be a macro to allow the macro magic to communicate the
    format to userspace tools such as trace-cmd.

    The solution is to create a macro __def_gfpflag_names which is used both
    in show_gfp_flags(), and to define the gfpflag_names[] array in
    mm/debug.c.

    On the other hand, mm/debug.c also defines translation tables for page
    flags and vma flags, and desire was expressed (but not implemented in
    this series) to use these also from tracepoints. Thus, this patch also
    renames the events/gfpflags.h file to events/mmflags.h and moves the
    table definitions there, using the same macro approach as for gfpflags.
    This allows translating all three kinds of mm-specific flags both in
    tracepoints and printk.
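
    The idea of one translation table feeding several consumers can be
    sketched outside the kernel as well. The Python below cannot reproduce
    the C macro trick; it only shows a single (mask, name) table being
    turned into a "A|B" style string, with leftover bits printed in hex.
    The table contents and the name format_flags() are made up for
    illustration and are not the real gfp flag values.

```python
# Illustrative table only: these masks and names are NOT real kernel flags.
EXAMPLE_FLAG_NAMES = [
    (0x3, "EXAMPLE_BOTH"),  # compound mask listed before its components
    (0x1, "EXAMPLE_A"),
    (0x4, "EXAMPLE_C"),
]


def format_flags(flags: int, table=EXAMPLE_FLAG_NAMES, delim: str = "|") -> str:
    """Translate a flags word into names using a shared table."""
    parts = []
    for mask, name in table:
        # An entry matches only if all of its bits are set.
        if mask and (flags & mask) == mask:
            parts.append(name)
            flags &= ~mask  # consume the bits so they are not reported twice
    if flags:
        # Bits without a table entry are shown as a raw hex value.
        parts.append(hex(flags))
    return delim.join(parts)
```

    Because every caller reads the same table, adding a flag in one place
    updates every output path at once, which is the property the macro
    approach is after.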

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Michal Hocko
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Ingo Molnar
    Cc: Rasmus Villemoes
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • When updating tracing's show_gfp_flags() I have noticed that perf's
    gfp_compact_table is also outdated. Fill in the missing flags and place
    a note in gfp.h to increase the chance that future updates are synced.
    Convert the __GFP_X flags from "GFP_X" to "__GFP_X" strings in line with
    the previous patch.

    Signed-off-by: Vlastimil Babka
    Acked-by: David Rientjes
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Ingo Molnar
    Cc: Rasmus Villemoes
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • SLAB_DEBUG_FREE allows expensive consistency checks at free to be turned
    on or off. Expand its use to be able to turn off all consistency
    checks. This gives a nice speed up if you only want features such as
    poisoning or tracing.

    Credit to Mathias Krause for the original work which inspired this
    series

    Signed-off-by: Laura Abbott
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Kees Cook
    Cc: Mathias Krause
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
     
  • Pull RCU updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Miscellaneous fixes, cleanups, restructuring.

    - RCU torture-test updates"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    rcu: Export rcu_gp_is_normal()
    rcu: Remove rcu_user_hooks_switch
    rcu: Catch up rcu_report_qs_rdp() comment with reality
    rcu: Document unique-name limitation for DEFINE_STATIC_SRCU()
    rcu: Make rcu/tiny_plugin.h explicitly non-modular
    irq: Privatize irq_common_data::state_use_accessors
    RCU: Privatize rcu_node::lock
    sparse: Add __private to privatize members of structs
    rcu: Remove useless rcu_data_p when !PREEMPT_RCU
    rcutorture: Correct no-expedite console messages
    rcu: Set rdp->gpwrap when CPU is idle
    rcu: Stop treating in-kernel CPU-bound workloads as errors
    rcu: Update rcu_report_qs_rsp() comment
    rcu: Assign false instead of 0 for ->core_needs_qs
    rcutorture: Check for self-detected stalls
    rcutorture: Don't keep empty console.log.diags files
    rcutorture: Add checks for rcutorture writer starvation

    Linus Torvalds
     
  • Pull x86 asm updates from Ingo Molnar:
    "This is another big update. Main changes are:

    - lots of x86 system call (and other traps/exceptions) entry code
    enhancements. In particular the complex parts of the 64-bit entry
    code have been migrated to C code as well, and a number of dusty
    corners have been refreshed. (Andy Lutomirski)

    - vDSO special mapping robustification and general cleanups (Andy
    Lutomirski)

    - cpufeature refactoring, cleanups and speedups (Borislav Petkov)

    - lots of other changes ..."

    * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
    x86/cpufeature: Enable new AVX-512 features
    x86/entry/traps: Show unhandled signal for i386 in do_trap()
    x86/entry: Call enter_from_user_mode() with IRQs off
    x86/entry/32: Change INT80 to be an interrupt gate
    x86/entry: Improve system call entry comments
    x86/entry: Remove TIF_SINGLESTEP entry work
    x86/entry/32: Add and check a stack canary for the SYSENTER stack
    x86/entry/32: Simplify and fix up the SYSENTER stack #DB/NMI fixup
    x86/entry: Only allocate space for tss_struct::SYSENTER_stack if needed
    x86/entry: Vastly simplify SYSENTER TF (single-step) handling
    x86/entry/traps: Clear DR6 early in do_debug() and improve the comment
    x86/entry/traps: Clear TIF_BLOCKSTEP on all debug exceptions
    x86/entry/32: Restore FLAGS on SYSEXIT
    x86/entry/32: Filter NT and speed up AC filtering in SYSENTER
    x86/entry/compat: In SYSENTER, sink AC clearing below the existing FLAGS test
    selftests/x86: In syscall_nt, test NT|TF as well
    x86/asm-offsets: Remove PARAVIRT_enabled
    x86/entry/32: Introduce and use X86_BUG_ESPFIX instead of paravirt_enabled
    uprobes: __create_xol_area() must nullify xol_mapping.fault
    x86/cpufeature: Create a new synthetic cpu capability for machine check recovery
    ...

    Linus Torvalds
     

15 Mar, 2016

2 commits

  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Pull perf updates from Ingo Molnar:
    "Main kernel side changes:

    - Big reorganization of the x86 perf support code. The old code grew
    organically deep inside arch/x86/kernel/cpu/perf* and its naming
    became somewhat messy.

    The new location is under arch/x86/events/, using the following
    cleaner hierarchy of source code files:

    perf/x86: Move perf_event.c .................. => x86/events/core.c
    perf/x86: Move perf_event_amd.c .............. => x86/events/amd/core.c
    perf/x86: Move perf_event_amd_ibs.c .......... => x86/events/amd/ibs.c
    perf/x86: Move perf_event_amd_iommu.[ch] ..... => x86/events/amd/iommu.[ch]
    perf/x86: Move perf_event_amd_uncore.c ....... => x86/events/amd/uncore.c
    perf/x86: Move perf_event_intel_bts.c ........ => x86/events/intel/bts.c
    perf/x86: Move perf_event_intel.c ............ => x86/events/intel/core.c
    perf/x86: Move perf_event_intel_cqm.c ........ => x86/events/intel/cqm.c
    perf/x86: Move perf_event_intel_cstate.c ..... => x86/events/intel/cstate.c
    perf/x86: Move perf_event_intel_ds.c ......... => x86/events/intel/ds.c
    perf/x86: Move perf_event_intel_lbr.c ........ => x86/events/intel/lbr.c
    perf/x86: Move perf_event_intel_pt.[ch] ...... => x86/events/intel/pt.[ch]
    perf/x86: Move perf_event_intel_rapl.c ....... => x86/events/intel/rapl.c
    perf/x86: Move perf_event_intel_uncore.[ch] .. => x86/events/intel/uncore.[ch]
    perf/x86: Move perf_event_intel_uncore_nhmex.c => x86/events/intel/uncore_nmhex.c
    perf/x86: Move perf_event_intel_uncore_snb.c => x86/events/intel/uncore_snb.c
    perf/x86: Move perf_event_intel_uncore_snbep.c => x86/events/intel/uncore_snbep.c
    perf/x86: Move perf_event_knc.c .............. => x86/events/intel/knc.c
    perf/x86: Move perf_event_p4.c ............... => x86/events/intel/p4.c
    perf/x86: Move perf_event_p6.c ............... => x86/events/intel/p6.c
    perf/x86: Move perf_event_msr.c .............. => x86/events/msr.c

    (Borislav Petkov)

    - Update various x86 PMU constraint and hw support details (Stephane
    Eranian)

    - Optimize kprobes for BPF execution (Martin KaFai Lau)

    - Rewrite, refactor and fix the Intel uncore PMU driver code (Thomas
    Gleixner)

    - Rewrite, refactor and fix the Intel RAPL PMU code (Thomas Gleixner)

    - Various fixes and smaller cleanups.

    There are lots of perf tooling updates as well. A few highlights:

    perf report/top:

    - Hierarchy histogram mode for 'perf top' and 'perf report',
    showing multiple levels, one per --sort entry: (Namhyung Kim)

    On a mostly idle system:

    # perf top --hierarchy -s comm,dso

    Then expand some levels and use 'P' to take a snapshot:

    # cat perf.hist.0
    - 92.32% perf
    58.20% perf
    22.29% libc-2.22.so
    5.97% [kernel]
    4.18% libelf-0.165.so
    1.69% [unknown]
    - 4.71% qemu-system-x86
    3.10% [kernel]
    1.60% qemu-system-x86_64 (deleted)
    + 2.97% swapper
    #

    - Add 'L' hotkey to dynamically set the percent threshold for
    histogram entries and callchains, i.e. dynamically do what the
    --percent-limit command line option to 'top' and 'report' does.
    (Namhyung Kim)

    perf mem:

    - Allow specifying events via -e in 'perf mem record', also listing
    what events can be specified via 'perf mem record -e list' (Jiri
    Olsa)

    perf record:

    - Add 'perf record' --all-user/--all-kernel options, so that one
    can tell that all the events in the command line should be
    restricted to the user or kernel levels (Jiri Olsa), i.e.:

    perf record -e cycles:u,instructions:u

    is equivalent to:

    perf record --all-user -e cycles,instructions

    - Make 'perf record' collect CPU cache info in the perf.data file header:

    $ perf record usleep 1
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.017 MB perf.data (7 samples) ]
    $ perf report --header-only -I | tail -10 | head -8
    # CPU cache info:
    # L1 Data 32K [0-1]
    # L1 Instruction 32K [0-1]
    # L1 Data 32K [2-3]
    # L1 Instruction 32K [2-3]
    # L2 Unified 256K [0-1]
    # L2 Unified 256K [2-3]
    # L3 Unified 4096K [0-3]

    Will be used in 'perf c2c' and eventually in 'perf diff' to
    allow, for instance running the same workload in multiple
    machines and then when using 'diff' show the hardware difference.
    (Jiri Olsa)

    - Improved support for Java, using the JVMTI agent library to do
    jitdumps that then will be inserted in synthesized
    PERF_RECORD_MMAP2 events via 'perf inject' pointed to synthesized
    ELF files stored in ~/.debug and keyed with build-ids, to allow
    symbol resolution and even annotation with source line info, see
    the changeset comments to see how to use it (Stephane Eranian)

    perf script/trace:

    - Decode data_src values (e.g. perf.data files generated by 'perf
    mem record') in 'perf script': (Jiri Olsa)

    # perf script
    perf 693 [1] 4.088652: 1 cpu/mem-loads,ldlat=30/P: ffff88007d0b0f40 68100142 L1 hit|SNP None|TLB L1 or L2 hit|LCK No
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    - Improve support to 'data_src', 'weight' and 'addr' fields in
    'perf script' (Jiri Olsa)

    - Handle empty print fmts in 'perf script -s' i.e. when running
    python or perl scripts (Taeung Song)

    perf stat:

    - 'perf stat' now shows shadow metrics (insn per cycle, etc) in
    interval mode too. E.g:

    # perf stat -I 1000 -e instructions,cycles sleep 1
    # time counts unit events
    1.000215928 519,620 instructions # 0.69 insn per cycle
    1.000215928 752,003 cycles

    - Port 'perf kvm stat' to PowerPC (Hemant Kumar)

    - Implement CSV metrics output in 'perf stat' (Andi Kleen)

    perf BPF support:

    - Support converting data from bpf events in 'perf data' (Wang Nan)

    - Print bpf-output events in 'perf script': (Wang Nan).

    # perf record -e bpf-output/no-inherit,name=evt/ -e ./test_bpf_output_3.c/map:channel.event=evt/ usleep 1000
    # perf script
    usleep 4882 21384.532523: evt: ffffffff810e97d1 sys_nanosleep ([kernel.kallsyms])
    BPF output: 0000: 52 61 69 73 65 20 61 20 Raise a
    0008: 42 50 46 20 65 76 65 6e BPF even
    0010: 74 21 00 00 t!..
    BPF string: "Raise a BPF event!"
    #

    - Add API to set values of map entries in a BPF object, be it
    individual map slots or ranges (Wang Nan)

    - Introduce support for the 'bpf-output' event (Wang Nan)

    - Add glue to read perf events in a BPF program (Wang Nan)

    - Improve support for bpf-output events in 'perf trace' (Wang Nan)

    ... and tons of other changes as well - see the shortlog and git log
    for details!"

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (342 commits)
    perf stat: Add --metric-only support for -A
    perf stat: Implement --metric-only mode
    perf stat: Document CSV format in manpage
    perf hists browser: Check sort keys before hot key actions
    perf hists browser: Allow thread filtering for comm sort key
    perf tools: Add sort__has_comm variable
    perf tools: Recalc total periods using top-level entries in hierarchy
    perf tools: Remove nr_sort_keys field
    perf hists browser: Cleanup hist_browser__fprintf_hierarchy_entry()
    perf tools: Remove hist_entry->fmt field
    perf tools: Fix command line filters in hierarchy mode
    perf tools: Add more sort entry check functions
    perf tools: Fix hist_entry__filter() for hierarchy
    perf jitdump: Build only on supported archs
    tools lib traceevent: Add '~' operation within arg_num_eval()
    perf tools: Omit unnecessary cast in perf_pmu__parse_scale
    perf tools: Pass perf_hpp_list all the way through setup_sort_list
    perf tools: Fix perf script python database export crash
    perf jitdump: DWARF is also needed
    perf bench mem: Prepare the x86-64 build for upstream memcpy_mcsafe() changes
    ...

    Linus Torvalds
     

14 Mar, 2016

1 commit

  • Pull turbostat updates for 4.6 from Len Brown.

    * 'turbostat' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
    tools/power turbostat: bugfix: TDP MSRs print bits fixing
    tools/power turbostat: correct output for MSR_NHM_SNB_PKG_CST_CFG_CTL dump
    tools/power turbostat: call __cpuid() instead of __get_cpuid()
    tools/power turbostat: indicate SMX and SGX support
    tools/power turbostat: detect and work around syscall jitter
    tools/power turbostat: show GFX%rc6
    tools/power turbostat: show GFXMHz
    tools/power turbostat: show IRQs per CPU
    tools/power turbostat: make fewer systems calls
    tools/power turbostat: fix compiler warnings
    tools/power turbostat: add --out option for saving output in a file
    tools/power turbostat: re-name "%Busy" field to "Busy%"
    tools/power turbostat: Intel Xeon x200: fix turbo-ratio decoding
    tools/power turbostat: Intel Xeon x200: fix erroneous bclk value
    tools/power turbostat: allow sub-sec intervals
    tools/power turbostat: Decode MSR_MISC_PWR_MGMT
    tools/power turbostat: decode HWP registers
    x86 msr-index: Simplify syntax for HWP fields
    tools/power turbostat: CPUID(0x16) leaf shows base, max, and bus frequency
    tools/power turbostat: decode more CPUID fields

    Rafael J. Wysocki
     

13 Mar, 2016

15 commits

  • MSR_CONFIG_TDP_NOMINAL:
    should print all 8 bits of base_ratio (bit 0:7) 0xFF

    MSR_CONFIG_TDP_LEVEL_1:
    should print all 15 bits of PKG_MIN_PWR_LVL1 (bit 48:62) 0x7FFF
    should print all 15 bits of PKG_MAX_PWR_LVL1 (bit 32:46) 0x7FFF
    should print all 8 bits of LVL1_RATIO (bit 16:23) 0xFF
    should print all 15 bits of PKG_TDP_LVL1 (bit 0:14) 0x7FFF

    And the same modification to MSR_CONFIG_TDP_LEVEL_2.

    MSR_TURBO_ACTIVATION_RATIO:
    should print all 8 bits of MAX_NON_TURBO_RATIO (bit 0:7) 0xFF
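
    The bit ranges and masks listed above can be checked with a small
    field-extraction sketch. The helper names bits() and
    decode_config_tdp_level() are hypothetical; only the bit positions come
    from this commit message.

```python
def bits(value: int, lo: int, hi: int) -> int:
    """Extract the inclusive bit range lo..hi from value."""
    width = hi - lo + 1
    return (value >> lo) & ((1 << width) - 1)


def decode_config_tdp_level(msr: int) -> dict:
    """Split a CONFIG_TDP_LEVEL_x style value into the fields listed above."""
    return {
        "PKG_MIN_PWR": bits(msr, 48, 62),  # 15 bits, mask 0x7FFF
        "PKG_MAX_PWR": bits(msr, 32, 46),  # 15 bits, mask 0x7FFF
        "RATIO": bits(msr, 16, 23),        # 8 bits, mask 0xFF
        "PKG_TDP": bits(msr, 0, 14),       # 15 bits, mask 0x7FFF
    }
```

    A mask that is one bit too narrow (for example 0x7F instead of 0xFF)
    silently truncates large field values, which is the class of bug this
    commit addresses.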

    Signed-off-by: Chen Yu
    Signed-off-by: Len Brown

    Chen Yu
     
  • MSR_NHM_SNB_PKG_CST_CFG_CTL: 0x1e008008 (...pkg-cstate-limit=0: unlimited)
    should print as
    MSR_NHM_SNB_PKG_CST_CFG_CTL: 0x1e008008 (...pkg-cstate-limit=8: unlimited)

    Signed-off-by: Len Brown

    Len Brown
     
  • turbostat already checks whether calling each cpuid leaf is legal,
    and it doesn't look at the function return value,
    so call the simpler gcc intrinsic __cpuid() instead of __get_cpuid().

    syntax only, no functional change

    Signed-off-by: Len Brown

    Len Brown
     
  • SGX presence is related to a SKL power workaround,
    so let's show when that is enabled.

    Signed-off-by: Len Brown

    Len Brown
     
  • The accuracy of Bzy_MHz and Busy% depends on reading
    the TSC, APERF, and MPERF close together in time.

    When there is a very short measurement interval,
    or a large system is profoundly idle, the changes
    in APERF and MPERF may be very small.
    They can be small enough that an expensive interrupt
    between reading APERF and MPERF can cause the APERF/MPERF
    ratio to become inaccurate, resulting in invalid
    calculation and display of Bzy_MHz.

    A dummy read of APERF makes this problem
    much more rare. Apparently this 1st system call
    after exiting a long stretch of idle is when we
    typically see expensive timer interrupts that cause
    large jitter.

    For the cases that dummy APERF read fails to prevent,
    we compare the latency of the APERF and MPERF reads.
    If they differ by more than 2x, we re-issue them.
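
    The retry logic described above might look roughly like the following.
    This is a sketch, not turbostat's code: the reader callables, the
    injectable clock and the max_tries bound are all assumptions added so
    the example is self-contained.

```python
import time


def read_pair_with_retry(read_aperf, read_mperf,
                         clock=time.perf_counter_ns, max_tries=5):
    """Read APERF and MPERF close together, retrying on lopsided latency.

    read_aperf/read_mperf are hypothetical callables returning counter
    values; max_tries is an assumed safety bound.
    """
    read_aperf()  # dummy read to absorb the expensive first call after idle
    for _ in range(max_tries):
        t0 = clock()
        aperf = read_aperf()
        t1 = clock()
        mperf = read_mperf()
        t2 = clock()
        aperf_latency = t1 - t0
        mperf_latency = t2 - t1
        # Accept the pair unless one read took more than 2x the other.
        if (aperf_latency <= 2 * mperf_latency
                and mperf_latency <= 2 * aperf_latency):
            break
    # If every attempt was jittery, the last pair read is returned as-is.
    return aperf, mperf
```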

    Signed-off-by: Len Brown

    Len Brown
     
  • The column "GFX%rc6" shows the percentage of time the GPU
    is in the "render C6" state, rc6. Deep package C-states on several
    systems depend on the GPU being in RC6.

    This information comes from the counter
    /sys/class/drm/card0/power/rc6_residency_ms,
    as read before and after the measurement interval.
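
    As a worked example of the calculation, assuming the counter is a
    monotonically increasing residency total in milliseconds (as its name
    suggests), the percentage is the counter delta divided by the interval
    length. The function names below are hypothetical.

```python
RC6_PATH = "/sys/class/drm/card0/power/rc6_residency_ms"


def read_rc6_ms(path: str = RC6_PATH) -> int:
    """Read the RC6 residency counter (milliseconds) from sysfs."""
    with open(path) as f:
        return int(f.read().strip())


def gfx_rc6_percent(rc6_ms_before: int, rc6_ms_after: int,
                    interval_sec: float) -> float:
    """Percentage of the interval that the GPU spent in RC6."""
    residency_ms = rc6_ms_after - rc6_ms_before
    return 100.0 * residency_ms / (interval_sec * 1000.0)
```

    With 500 ms of additional residency over a 1 second interval, the
    result is 50.0.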

    Signed-off-by: Len Brown

    Len Brown
     
  • Under the column "GFXMHz", show a snapshot of this attribute:
    /sys/class/graphics/fb0/device/drm/card0/gt_cur_freq_mhz

    This is an instantaneous snapshot of what sysfs presents
    at the end of the measurement interval. turbostat does
    not average or otherwise perform any math on this value.

    Signed-off-by: Len Brown

    Len Brown
     
  • The new IRQ column shows how many interrupts have occurred on each CPU
    during the measurement interval. This information comes from
    the difference between /proc/interrupts snapshots made before
    and after the measurement interval.

    The first row, the system summary, shows the sum of the IRQs
    for all CPUs during that interval.
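
    A snapshot-and-subtract approach of this kind can be sketched as
    follows. The parsing is deliberately simple and assumes the usual
    /proc/interrupts layout (a header row of CPU names, then one row per
    interrupt source); it is not turbostat's parser, and the function names
    are hypothetical.

```python
def irqs_per_cpu(text: str) -> list:
    """Sum the per-CPU counts of one /proc/interrupts snapshot."""
    lines = text.splitlines()
    if not lines:
        return []
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = [0] * ncpu
    for line in lines[1:]:
        fields = line.split()[1:]  # drop the "NN:" or "LOC:" label
        for cpu, field in enumerate(fields[:ncpu]):
            if not field.isdigit():
                break  # reached the description columns
            totals[cpu] += int(field)
    return totals


def irq_delta(before: str, after: str) -> list:
    """Per-CPU interrupts that occurred between two snapshots."""
    return [b - a for a, b in zip(irqs_per_cpu(before), irqs_per_cpu(after))]
```

    Summing the returned list gives the value for the system summary row.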

    Signed-off-by: Len Brown

    Len Brown
     
  • skip the open(2)/close(2) on each msr read
    by keeping the /dev/cpu/*/msr files open.

    The remaining read(2) is generally far fewer cycles
    than the removed open(2) system call.
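
    The caching idea can be illustrated with a small reader class that
    opens each per-CPU file once and afterwards issues only positioned
    reads. This is a sketch under assumptions: the class name MsrReader is
    hypothetical, and it assumes each read returns an 8-byte value in
    native byte order at the given offset.

```python
import os
import struct


class MsrReader:
    """Keep one file descriptor per CPU open instead of reopening per read."""

    def __init__(self, path_template: str = "/dev/cpu/{cpu}/msr"):
        self._template = path_template
        self._fds = {}

    def read(self, cpu: int, offset: int) -> int:
        fd = self._fds.get(cpu)
        if fd is None:
            # First access for this CPU: open once and remember the fd.
            fd = os.open(self._template.format(cpu=cpu), os.O_RDONLY)
            self._fds[cpu] = fd
        data = os.pread(fd, 8, offset)  # one syscall per read from here on
        if len(data) != 8:
            raise OSError(f"short read on cpu {cpu} at offset {offset:#x}")
        return struct.unpack("=Q", data)[0]

    def close(self) -> None:
        for fd in self._fds.values():
            os.close(fd)
        self._fds.clear()
```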

    Signed-off-by: Len Brown

    Len Brown
     
  • Signed-off-by: Len Brown

    Len Brown
     
  • By default...

    Turbostat --debug configuration info goes to stderr.

    In FORK mode, turbostat statistics go to stderr.

    In PERIODIC mode, turbostat statistics go to stdout.

    These defaults do not change, but an option "--out file"
    will send all output above only to the specified file.

    Signed-off-by: Len Brown

    Len Brown
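
    The routing rules listed above can be written as two small
    selectors. This is a sketch under the assumption that "--out file"
    has already been opened as a FILE pointer (NULL when the option is
    absent); the enum and function names are hypothetical.

```c
#include <stdio.h>

enum run_mode { MODE_FORK, MODE_PERIODIC };

/* Statistics: stderr in FORK mode, stdout in PERIODIC mode by default. */
static FILE *stats_stream(enum run_mode mode, FILE *out_file)
{
	if (out_file)
		return out_file;
	return mode == MODE_FORK ? stderr : stdout;
}

/* --debug configuration info: stderr by default. */
static FILE *debug_stream(FILE *out_file)
{
	return out_file ? out_file : stderr;
}
```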
     
  • Some tools processing turbostat output
    have difficulty with items that begin with %...

    Reported-by: Jacob Pan
    Signed-off-by: Len Brown

    Len Brown
     
  • Following changes have been made:
    - changed MSR_NHM_TURBO_RATIO_LIMIT to MSR_TURBO_RATIO_LIMIT in debug print
    for consistency with Developer Manual
    - updated definition of bitfields in MSR_TURBO_RATIO_LIMIT and appropriate
    parsing code
    - added x200 to list of architectures that do not support Nehalem compatible
    definition of MSR_TURBO_RATIO_LIMIT register (x200 has the register but
    the bit definitions are custom)
    - fixed typo in code that parses MSR_TURBO_RATIO_LIMIT
    (logical instead of bitwise operator)
    - changed MSR_TURBO_RATIO_LIMIT parsing algorithm so the print out had the
    same order as implementations for other platforms

    Signed-off-by: Hubert Chrzaniuk
    Signed-off-by: Len Brown

    Hubert Chrzaniuk
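
    The "logical instead of bitwise operator" class of typo mentioned
    above is easy to illustrate. The sketch below assumes the
    Nehalem-compatible layout, where each byte of MSR_TURBO_RATIO_LIMIT
    holds one ratio; the custom x200 layout is not shown here. The
    function names are hypothetical and the buggy variant is a
    reconstruction for illustration, not the original code.

```c
#include <stdint.h>

/* Correct: bitwise AND keeps the 8-bit ratio stored in byte `group`. */
static unsigned int turbo_ratio(uint64_t msr, unsigned int group)
{
	uint64_t mask = 0xFF;

	if (group >= 8)
		return 0;
	return (unsigned int)((msr >> (8 * group)) & mask);
}

/* Buggy: logical AND collapses the result to 0 or 1. */
static unsigned int turbo_ratio_buggy(uint64_t msr, unsigned int group)
{
	uint64_t mask = 0xFF;

	if (group >= 8)
		return 0;
	return (unsigned int)((msr >> (8 * group)) && mask);
}
```

    With a register value whose low byte is 0x1E, the first function
    returns 30 while the second returns 1.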
     
  • x200 does not enable any way to programmatically obtain bus clock
    speed. Bclk for the architecture has a fixed value of 100 MHz.
    At the same time x200 cannot be included in has_snb_msrs since
    it does not support C7 idle state.

    Prior to this patch, MHz values reported on this chip
    were erroneously calculated using bclk of 133MHz,
    causing MHz values to be reported 33% higher than actual.

    Signed-off-by: Hubert Chrzaniuk
    Signed-off-by: Len Brown

    Chrzaniuk, Hubert
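
    The 33% figure above follows directly from frequency being a ratio
    multiplied by the bus clock. A small worked sketch, with
    hypothetical function names and 133.33 MHz assumed as the wrong
    bclk:

```c
/* Frequency in MHz is the ratio multiplied by the bus clock. */
static double mhz_from_ratio(unsigned int ratio, double bclk_mhz)
{
	return ratio * bclk_mhz;
}

/* How much too high a result is when the wrong bclk is used, in percent. */
static double overreport_percent(double wrong_bclk, double right_bclk)
{
	return 100.0 * (wrong_bclk / right_bclk - 1.0);
}
```

    With a ratio of 13, a 100 MHz bclk gives 1300 MHz, while 133.33 MHz
    gives roughly 1733 MHz, about 33% too high.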
     
  • turbostat -i interval_sec

    will sample and display statistics every interval_sec.
    interval_sec used to be a whole number of seconds,
    but now we accept a decimal, as small as 0.001 sec (1 ms).

    Signed-off-by: Len Brown

    Len Brown
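
    Accepting a decimal interval could look roughly like the sketch
    below, which converts the argument into a struct timespec. This is
    not turbostat's parser: the function name, the error convention,
    and the upper bound are assumptions; only the 0.001 sec minimum
    comes from the commit message.

```c
#include <math.h>
#include <stdlib.h>
#include <time.h>

#define MIN_INTERVAL_SEC 0.001
#define MAX_INTERVAL_SEC 1e6 /* hypothetical cap for this sketch */

/* Parse "interval_sec" such as "5" or "0.25"; returns 0 on success. */
static int parse_interval(const char *arg, struct timespec *ts)
{
	char *end;
	double sec = strtod(arg, &end);

	if (end == arg || *end != '\0' || !isfinite(sec))
		return -1;
	if (sec < MIN_INTERVAL_SEC || sec > MAX_INTERVAL_SEC)
		return -1;

	ts->tv_sec = (time_t)sec;
	/* Round to the nearest nanosecond to absorb binary FP error. */
	ts->tv_nsec = (long)((sec - (double)ts->tv_sec) * 1e9 + 0.5);
	if (ts->tv_nsec >= 1000000000L) {
		ts->tv_sec++;
		ts->tv_nsec -= 1000000000L;
	}
	return 0;
}
```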
     

11 Mar, 2016

12 commits

  • Add metric only support for -A too. This requires a new print function
    that prints the metrics in the right order.

    v2: Fix manpage
    v3: Simplify nrcpus computation

    Signed-off-by: Andi Kleen
    Acked-by: Jiri Olsa
    Link: http://lkml.kernel.org/r/1457049458-28956-7-git-send-email-andi@firstfloor.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Andi Kleen
     
  • Add a new mode to only print metrics. Sometimes we don't care about the
    raw values, just want the computed metrics. This allows more compact
    printing, so with -I each sample is only a single line. This also
    allows easier plotting and processing with other tools.

    The main target is with using --topdown, but it also works with -T and
    standard perf stat. A few metrics are not supported.

    To avoid having to hardcode all the metrics in the code, it uses a two
    pass approach: first compute dummy metrics and only print the headers in
    the print_metric callback. Then use the callback to print the actual
    values.

    There are some additional changes in the stat printout code to handle
    all metrics being on a single line.

    One issue is that the column code doesn't know in advance what events
    are not supported by the CPU, and it would be hard to find out as this
    could change based on dynamic conditions. That causes empty columns in
    some cases.

    The output can be fairly wide; often you may need more than 80 columns.

    Example:

    % perf stat -a -I 1000 --metric-only
    1.001452803 frontend cycles idle insn per cycle stalled cycles per insn branch-misses of all branches
    1.001452803              158.91%           0.66                    2.39                         2.92%
    2.002192321              180.63%           0.76                    2.08                         2.96%
    3.003088282              150.59%           0.62                    2.57                         2.84%
    4.004369835              196.20%           0.98                    1.62                         3.79%
    5.005227314              231.98%           0.84                    1.90                         4.71%

    v2: Lots of updates.
    v3: Use slightly narrower columns
    v4: Add comment

    Signed-off-by: Andi Kleen
    Acked-by: Jiri Olsa
    Link: http://lkml.kernel.org/r/1457049458-28956-6-git-send-email-andi@firstfloor.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Andi Kleen
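
    The two-pass approach described in the commit above can be sketched
    with a callback that is swapped between passes. Everything here is
    hypothetical (the buffer type, the callback signature, the two
    metrics chosen, and the ';' separator); it only illustrates the
    idea of running the same metric code once for headers and once for
    values, and is not perf's implementation.

```c
#include <stddef.h>
#include <stdio.h>

struct out_buf {
	char text[256];
	size_t used;
};

typedef void (*print_metric_t)(struct out_buf *out, const char *name,
			       double value);

/* Account for n characters written by snprintf, clamping on truncation. */
static void advance(struct out_buf *out, int n)
{
	size_t room = sizeof(out->text) - out->used;

	if (n > 0)
		out->used += (size_t)n < room ? (size_t)n : room - 1;
}

/* Pass 1 callback: print only the metric name, ignore the dummy value. */
static void print_header(struct out_buf *out, const char *name, double value)
{
	(void)value;
	advance(out, snprintf(out->text + out->used,
			      sizeof(out->text) - out->used, "%s;", name));
}

/* Pass 2 callback: print only the computed value. */
static void print_value(struct out_buf *out, const char *name, double value)
{
	(void)name;
	advance(out, snprintf(out->text + out->used,
			      sizeof(out->text) - out->used, "%.2f;", value));
}

/* The metric code is written once and knows nothing about the pass. */
static void compute_metrics(double cycles, double insns, double branches,
			    double misses, print_metric_t print,
			    struct out_buf *out)
{
	print(out, "insn per cycle", cycles > 0 ? insns / cycles : 0.0);
	print(out, "branch-misses of all branches",
	      branches > 0 ? 100.0 * misses / branches : 0.0);
}
```

    Calling compute_metrics() first with dummy counts and print_header,
    then with real counts and print_value, yields one header line and
    one value line without listing the metric names in the caller.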
     
  • With all the recently added fields in the perf stat CSV output we should
    finally document them in the man page. Do this here.

    v2: Fix fields in documentation (Jiri)
    v3: fix order of fields again (Jiri)
    v4: Change order again.
    v5: Document more fields (Jiri)
    v6: Move time stamp first
    v7: More fixes (Jiri)

    Signed-off-by: Andi Kleen
    Acked-by: Jiri Olsa
    Link: http://lkml.kernel.org/r/1457049458-28956-5-git-send-email-andi@firstfloor.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Andi Kleen
     
  • The context menu in TUI hists browser checks corresponding sort keys
    when creating the menu item. But the hotkey actions lack these checks, so
    they can filter using incorrect info.

    For example, default sort key of 'perf top' doesn't contain 'comm' or
    'pid' sort key so each hist entry's thread info is not reliable. Thus
    it should prohibit using thread filter on 't' key.

    Signed-off-by: Namhyung Kim
    Cc: David Ahern
    Cc: Jiri Olsa
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1457533253-21419-3-git-send-email-namhyung@kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Namhyung Kim
     
  • The commit 2eafd410e669 ("perf hists browser: Only 'Zoom into thread'
    only when sort order has 'pid'") disabled thread filtering in hist
    browser for the default sort key. However the he->thread is still valid
    even if 'pid' sort key is not given. The only thing it should not use is
    the pid (or tid) of the thread. So allow filtering by thread when the
    'comm' sort key is given, and show the pid only if the 'pid' sort key is given.

    Signed-off-by: Namhyung Kim
    Cc: David Ahern
    Cc: Jiri Olsa
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1457536490-24084-1-git-send-email-namhyung@kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Namhyung Kim
     
  • The sort__has_comm variable is to check whether the comm sort key is
    given. This is necessary to support thread filtering in the TUI hists
    browser later.

    Signed-off-by: Namhyung Kim
    Cc: David Ahern
    Cc: Jiri Olsa
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1457533253-21419-1-git-send-email-namhyung@kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Namhyung Kim
     
  • When hierarchy mode is enabled, each entry in a hierarchy level shares
    the period. IOW an upper level entry's period is the sum of its lower level
    entries. Thus perf uses only one level to calculate the total period
    of hists. It used the lowest-level (leaf) entries, but that has a problem
    when it comes to filters.

    If a filter is applied, each entry in the same level is either filtered or
    not. But upper level entries still have a period equal to the sum of their
    children, including the filtered ones. So the total of the upper level
    entries will not be the same as the sum of the lower level entries.

    This resulted in entries having more than 100% of overhead, and it can be
    reproduced using perf top with filter(s).

    Reported-and-Tested-by: Jiri Olsa
    Signed-off-by: Namhyung Kim
    Cc: Andi Kleen
    Cc: David Ahern
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Wang Nan
    Link: http://lkml.kernel.org/r/1457531222-18130-8-git-send-email-namhyung@kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Namhyung Kim
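
    A small worked sketch of how the mismatch described above produces
    an overhead above 100%. The struct and function names are
    hypothetical and this does not show the actual fix; it only
    reproduces the arithmetic for one parent with two children, one of
    which is filtered out.

```c
#include <stddef.h>
#include <stdint.h>

struct entry {
	uint64_t period;
	int filtered;
};

/* Total period counted from leaf entries that survive the filter. */
static uint64_t unfiltered_sum(const struct entry *leaves, size_t n)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < n; i++)
		if (!leaves[i].filtered)
			sum += leaves[i].period;
	return sum;
}

/* Overhead column: an entry's period relative to the total period. */
static double overhead_percent(uint64_t period, uint64_t total)
{
	return total ? 100.0 * (double)period / (double)total : 0.0;
}
```

    With children of 60 and 40 where the second is filtered, the leaf
    total is 60 but the parent still carries 100, so the parent shows
    about 166.67%. If the parent's period were recomputed from its
    unfiltered children only, it would show 100%.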
     
  • The nr_sort_keys field is to carry the number of sort entries in a
    hpp_list or hists to determine the depth of indentation of a hist entry.
    As it's only used in hierarchy mode and now we have used nr_hpp_node for
    this reason, there's no need to keep it anymore. Let's get rid of it.

    Signed-off-by: Namhyung Kim
    Tested-by: Jiri Olsa
    Cc: Andi Kleen
    Cc: David Ahern
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Wang Nan
    Link: http://lkml.kernel.org/r/1457531222-18130-7-git-send-email-namhyung@kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Namhyung Kim
     
  • The hist_browser__fprintf_hierarchy_entry() is to dump current output
    into a file so it needs to be sync-ed with the corresponding function
    hist_browser__show_hierarchy_entry(). So use hists->nr_hpp_node to
    compute the indent width and use the first fmt_node to print overhead
    columns instead of checking whether it's a sort entry (or dynamic entry).

    Signed-off-by: Namhyung Kim
    Tested-by: Jiri Olsa
    Cc: Andi Kleen
    Cc: David Ahern
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Wang Nan
    Link: http://lkml.kernel.org/r/1457531222-18130-6-git-send-email-namhyung@kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Namhyung Kim
     
  • It's not used anymore and the output format is accessed by the hpp_list
    pointer instead when hierarchy is enabled. Let's get rid of it.

    Signed-off-by: Namhyung Kim
    Tested-by: Jiri Olsa
    Cc: Andi Kleen
    Cc: David Ahern
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Wang Nan
    Link: http://lkml.kernel.org/r/1457531222-18130-5-git-send-email-namhyung@kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Namhyung Kim
     
  • When a command-line filter is applied in hierarchy mode, output is
    broken, especially when filtering on a lower level. The higher level
    entries don't show up, so it's hard to see the results.

    Also it needs to handle multi sort keys in a single hierarchy level.

    Before:

    $ perf report --hierarchy -s 'cpu,{dso,comm}' --comms swapper --stdio
    ...
    #    Overhead  CPU / Shared Object+Command
    # ...........  ...........................
    #
            13.79%    [kernel.vmlinux] swapper
         31.71%    000
            13.80%    [kernel.vmlinux] swapper
             0.43%    [e1000e] swapper
            11.89%    [kernel.vmlinux] swapper
             9.18%    [kernel.vmlinux] swapper

    After:

    #    Overhead  CPU / Shared Object+Command
    # ...........  ...............................
    #
         33.09%    003
            13.79%    [kernel.vmlinux] swapper
         31.71%    000
            13.80%    [kernel.vmlinux] swapper
             0.43%    [e1000e] swapper
         21.90%    002
            11.89%    [kernel.vmlinux] swapper
         13.30%    001
             9.18%    [kernel.vmlinux] swapper

    Signed-off-by: Namhyung Kim
    Tested-by: Arnaldo Carvalho de Melo
    Tested-by: Jiri Olsa
    Cc: Andi Kleen
    Cc: David Ahern
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Wang Nan
    Link: http://lkml.kernel.org/r/1457531222-18130-4-git-send-email-namhyung@kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Namhyung Kim
     
  • Those functions are for checking if a given perf_hpp_fmt is a
    filter-related sort entry. With hierarchy mode, it needs to check
    filters on the hist entries with its own hpp format list.

    Signed-off-by: Namhyung Kim
    Tested-by: Jiri Olsa
    Cc: Andi Kleen
    Cc: David Ahern
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Wang Nan
    Link: http://lkml.kernel.org/r/1457531222-18130-3-git-send-email-namhyung@kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Namhyung Kim