14 Dec, 2020

6 commits


03 Nov, 2020

1 commit

  • Ian reports an issue that the metric DRAM_BW_Use often remains 0.

    The metric expression for DRAM_BW_Use on CLX/SKX:

    "( 64 * ( uncore_imc@cas_count_read@ + uncore_imc@cas_count_write@ ) / 1000000000 ) / duration_time"

    The counts of uncore_imc/cas_count_read/ and uncore_imc/cas_count_write/
    are multiplied by 64 to turn a count of cache lines into bytes, and the
    result is then divided by 1000000000 to give GB.

    However, the counts of uncore_imc/cas_count_read/ and
    uncore_imc/cas_count_write/ have already been scaled.

    The scale values are from sysfs, such as
    /sys/devices/uncore_imc_0/events/cas_count_read.scale.
    It's 6.103515625e-5 (64 / 1024.0 / 1024.0).

    So if we use the original metric expression, the result is not correct.

    But the difficulty is, for SKL client, the counts are not scaled.

    The metric expression for DRAM_BW_Use on SKL:

    "64 * ( arb@event\\=0x81\\,umask\\=0x1@ + arb@event\\=0x84\\,umask\\=0x1@ ) / 1000000 / duration_time / 1000"

    root@kbl-ppc:~# perf stat -M DRAM_BW_Use -a -- sleep 1

    Performance counter stats for 'system wide':

    190 arb/event=0x84,umask=0x1/ # 1.86 DRAM_BW_Use
    29,093,178 arb/event=0x81,umask=0x1/
    1,000,703,287 ns duration_time

    1.000703287 seconds time elapsed

    The result is as expected.

    So the easy way is just to change the metric expression for CLX/SKX.
    This patch changes the metric expression to:

    "( ( ( uncore_imc@cas_count_read@ + uncore_imc@cas_count_write@ ) * 1048576 ) / 1000000000 ) / duration_time"

    1048576 = 1024 * 1024.

    Before (tested on CLX):

    root@lkp-csl-2sp5 ~# perf stat -M DRAM_BW_Use -a -- sleep 1

    Performance counter stats for 'system wide':

    765.35 MiB uncore_imc/cas_count_read/ # 0.00 DRAM_BW_Use
    5.42 MiB uncore_imc/cas_count_write/
    1001515088 ns duration_time

    1.001515088 seconds time elapsed

    After:

    root@lkp-csl-2sp5 ~# perf stat -M DRAM_BW_Use -a -- sleep 1

    Performance counter stats for 'system wide':

    767.95 MiB uncore_imc/cas_count_read/ # 0.80 DRAM_BW_Use
    5.02 MiB uncore_imc/cas_count_write/
    1001900010 ns duration_time

    1.001900010 seconds time elapsed
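
    The arithmetic can be re-checked by hand. The sketch below is not perf
    code: it re-evaluates both expressions in Python from the rounded MiB
    values printed above, with duration_time taken in seconds (which is what
    the SKL result implies). The function names are made up for illustration.

```python
# Rough check of the DRAM_BW_Use arithmetic. Not perf code; the helper
# names are hypothetical and the inputs are the rounded values shown above.

SCALE = 64 / 1024.0 / 1024.0  # sysfs cas_count_*.scale: cache lines -> MiB


def old_expr(read_mib, write_mib, duration_ns):
    """Original CLX/SKX expression, which assumes raw cache-line counts."""
    seconds = duration_ns / 1e9
    return (64 * (read_mib + write_mib) / 1000000000) / seconds


def new_expr(read_mib, write_mib, duration_ns):
    """Patched expression: convert the MiB-scaled counts back to bytes."""
    seconds = duration_ns / 1e9
    return (((read_mib + write_mib) * 1048576) / 1000000000) / seconds


before = old_expr(765.35, 5.42, 1001515088)
after = new_expr(767.95, 5.02, 1001900010)

print(f"scale  = {SCALE!r}")
print(f"before = {before:.2f}")
print(f"after  = {after:.2f}")
```

    With these rounded inputs the old expression comes out far below 0.01,
    which is why it is displayed as 0.00, and the new one lands at roughly
    0.8, in line with the value reported after the patch.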

    Fixes: 038d3b53c284 ("perf vendor events intel: Update CascadelakeX events to v1.08")
    Fixes: b5ff7f2799a4 ("perf vendor events: Update SkylakeX events to v1.21")
    Signed-off-by: Jin Yao
    Acked-by: Ian Rogers
    Cc: Alexander Shishkin
    Cc: Andi Kleen
    Cc: Jiri Olsa
    Cc: Kan Liang
    Cc: Peter Zijlstra
    Link: http://lore.kernel.org/lkml/20201023005334.7869-1-yao.jin@linux.intel.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Jin Yao
     

15 Oct, 2020

1 commit

  • The event code for events referencing std arch events is incorrectly
    evaluated in json_events().

    The issue is that je.event is evaluated properly from try_fixup(), but
    later NULLified from the real_event() call, as "event" may be NULL.

    Fix by setting "event" to the same value as je.event in try_fixup().

    Also remove support for overwriting event code for events using std arch
    events, as it is not used.

    Signed-off-by: John Garry
    Reviewed-By: Kajol Jain
    Acked-by: Jiri Olsa
    Link: https://lore.kernel.org/r/1602170368-11892-1-git-send-email-john.garry@huawei.com
    Signed-off-by: Arnaldo Carvalho de Melo

    John Garry
     

13 Oct, 2020

1 commit

  • This replaces the incorrectly spelled word "localtion" with "location"
    in some power8 PMU event descriptions.

    Fixes: 2a81fa3bb5ed ("perf vendor events: Add power8 PMU events")
    Signed-off-by: Sandipan Das
    Reviewed-by: Kajol Jain
    Cc: Jiri Olsa
    Cc: Madhavan Srinivasan
    Cc: Michael Ellerman
    Cc: Ravi Bangoria
    Cc: Sukadev Bhattiprolu
    Link: http://lore.kernel.org/lkml/20201012050205.328523-1-sandipan@linux.ibm.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Sandipan Das
     

28 Sep, 2020

2 commits

  • - Update SkylakeX events to v1.21.
    - Update SkylakeX JSON metrics from TMAM 4.0.

    Other fixes:

    - Add NO_NMI_WATCHDOG metric constraint to Backend_Bound
    - Fix misspelled error

    Signed-off-by: Jin Yao
    Reviewed-by: Andi Kleen
    Acked-by: Ian Rogers
    Cc: Alexander Shishkin
    Cc: Ingo Molnar
    Cc: Jiri Olsa
    Cc: Kan Liang
    Cc: Peter Zijlstra
    Link: https://lore.kernel.org/lkml/20200922031918.3723-1-yao.jin@linux.intel.com/
    Signed-off-by: Arnaldo Carvalho de Melo

    Jin Yao
     
  • - Update CascadelakeX events to v1.08.
    - Update CascadelakeX JSON metrics from TMAM 4.0.

    Other fixes:

    - Add NO_NMI_WATCHDOG metric constraint to Backend_Bound
    - Change 'MB/sec' to 'MB' in UNC_M_PMM_BANDWIDTH.

    Signed-off-by: Jin Yao
    Reviewed-by: Andi Kleen
    Acked-by: Ian Rogers
    Cc: Jiri Olsa
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Alexander Shishkin
    Cc: Kan Liang
    Link: https://lore.kernel.org/lkml/20200922031918.3723-1-yao.jin@linux.intel.com/
    Signed-off-by: Arnaldo Carvalho de Melo

    Jin Yao
     

18 Sep, 2020

1 commit


15 Sep, 2020

1 commit

  • The amdzen2/core.json and amdzen/core.json vendor events files have the
    occasional trailing comma. Since that goes against the JSON standard,
    let's remove them.
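
    As an illustration of why the trailing commas matter, a strict JSON
    parser rejects them. The sketch below uses Python's json module as a
    stand-in for such a parser; the event entries are invented and are not
    copied from the amdzen files.

```python
import json

# Hypothetical snippets shaped like vendor event entries; not copied from
# amdzen/core.json.
with_trailing_comma = '[{"EventName": "example_event", "EventCode": "0x00",}]'
without_trailing_comma = '[{"EventName": "example_event", "EventCode": "0x00"}]'


def is_valid_json(text):
    """Return True if text parses as JSON under Python's strict parser."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


print(is_valid_json(with_trailing_comma))
print(is_valid_json(without_trailing_comma))
```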

    Signed-off-by: Henry Burns
    Acked-by: Kim Phillips
    Acked-by: Namhyung Kim
    Cc: Alexander Shishkin
    Cc: Andi Kleen
    Cc: Jiri Olsa
    Cc: Mark Rutland
    Cc: Peter Zijlstra
    Cc: Vijay Thakkar
    Link: http://lore.kernel.org/lkml/20200915004125.971-1-henrywolfeburns@gmail.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Henry Burns
     

10 Sep, 2020

4 commits

  • This patch adds hv_24x7 core-level events to the nest_metric.json file
    and also adds a PerChip/PerCore field to metric events.

    Result:

    power9 platform:

    command:# ./perf stat --metric-only -M PowerBUS_Frequency -C 0 -I 1000
    1.000070601 1.9 2.0
    2.000253881 2.0 1.9
    3.000364810 2.0 2.0

    Signed-off-by: Kajol Jain
    Acked-by: Ian Rogers
    Acked-by: Jiri Olsa
    Cc: Alexander Shishkin
    Cc: Andi Kleen
    Cc: Jin Yao
    Cc: John Garry
    Cc: Madhavan Srinivasan
    Cc: Mark Rutland
    Cc: Namhyung Kim
    Cc: Paul Clarke
    Cc: Peter Zijlstra
    Cc: Ravi Bangoria
    Link: http://lore.kernel.org/lkml/20200907064133.75090-6-kjain@linux.ibm.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Kajol Jain
     
  • Initially, every time we wanted to add a new term such as chip, core or
    thread, we needed to create corresponding fields in the pmu_events and
    event structs.

    This patch adds an enum called 'aggr_mode_class' which stores all these
    aggregation modes, such as perchip/percore. It also adds a new field,
    'aggr_mode', to capture these terms.

    Now, if a user wants to add a new term, they just need to add it to
    the enum.
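
    The idea can be sketched as follows. The real change is in C; this
    Python stand-in only borrows the 'aggr_mode_class' name and the
    perchip/percore terms from the text above. The numeric values and the
    parse_aggr_mode() helper are assumptions made for illustration.

```python
from enum import Enum


class AggrModeClass(Enum):
    """Python stand-in for the 'aggr_mode_class' enum described above."""
    PER_CHIP = 1  # numeric values are assumptions
    PER_CORE = 2
    # A new aggregation term only needs a new member here.


def parse_aggr_mode(term):
    """Map a term such as 'PerChip' to an enum member (hypothetical helper)."""
    lookup = {m.name.replace("_", "").lower(): m for m in AggrModeClass}
    return lookup.get(term.lower())


print(parse_aggr_mode("PerChip"))
print(parse_aggr_mode("PerCore"))
```

    Because the lookup is derived from the enum members, no other structure
    has to change when a term is added.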

    Signed-off-by: Kajol Jain
    Acked-by: Jiri Olsa
    Cc: Alexander Shishkin
    Cc: Andi Kleen
    Cc: Ian Rogers
    Cc: Jin Yao
    Cc: John Garry
    Cc: Madhavan Srinivasan
    Cc: Mark Rutland
    Cc: Namhyung Kim
    Cc: Paul Clarke
    Cc: Peter Zijlstra
    Cc: Ravi Bangoria
    Link: http://lore.kernel.org/lkml/20200907064133.75090-4-kjain@linux.ibm.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Kajol Jain
     
  • This patch adds a new structure called 'json_event' to jevents.c
    to improve the callback prototypes used in the jevents files.

    Initially, whenever a user wanted to add a new field, they needed to
    update every callback function, which became more and more complex as
    the number of parameters grew.

    With this change, we just need to add it to the new 'json_event' structure.
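
    A rough Python analogue of the pattern is shown below. It is not the
    jevents.c code: the field set of JsonEvent and both function names are
    illustrative assumptions, chosen only to show a callback that takes one
    object instead of a growing parameter list.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class JsonEvent:
    """Stand-in for 'struct json_event'; the field set is illustrative."""
    name: Optional[str] = None
    event: Optional[str] = None
    desc: Optional[str] = None
    topic: Optional[str] = None
    # Adding a field here does not change any callback signature.
    aggr_mode: Optional[str] = None


def print_event(je: JsonEvent) -> str:
    """Example callback: receives one object, not a long parameter list."""
    return f"{je.name}: {je.event}"


def process_events(events, callback: Callable[[JsonEvent], str]):
    """Apply the callback to every parsed event."""
    return [callback(je) for je in events]


lines = process_events(
    [JsonEvent(name="ldrex_spec", event="event=0x6c")], print_event
)
print(lines)
```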

    Signed-off-by: Kajol Jain
    Reviewed-by: Andi Kleen
    Reviewed-by: John Garry
    Acked-by: Jiri Olsa
    Cc: Alexander Shishkin
    Cc: Ian Rogers
    Cc: Jin Yao
    Cc: Madhavan Srinivasan
    Cc: Mark Rutland
    Cc: Namhyung Kim
    Cc: Paul Clarke
    Cc: Peter Zijlstra
    Cc: Ravi Bangoria
    Link: http://lore.kernel.org/lkml/20200907064133.75090-3-kjain@linux.ibm.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Kajol Jain
     
  • This patch removes jevents.h and makes json_events function static.

    Signed-off-by: Kajol Jain
    Reviewed-by: John Garry
    Acked-by: Jiri Olsa
    Cc: Alexander Shishkin
    Cc: Andi Kleen
    Cc: Ian Rogers
    Cc: Jin Yao
    Cc: Madhavan Srinivasan
    Cc: Mark Rutland
    Cc: Namhyung Kim
    Cc: Paul Clarke
    Cc: Peter Zijlstra
    Cc: Ravi Bangoria
    Link: http://lore.kernel.org/lkml/20200907064133.75090-2-kjain@linux.ibm.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Kajol Jain
     

05 Sep, 2020

4 commits

  • This enables zen3 users by reusing mostly-compatible zen2 events
    until the official public list of zen3 events is published in a
    future PPR.

    Signed-off-by: Kim Phillips
    Acked-by: Ian Rogers
    Cc: Alexander Shishkin
    Cc: Andi Kleen
    Cc: Borislav Petkov
    Cc: Jin Yao
    Cc: Jiri Olsa
    Cc: John Garry
    Cc: Jon Grimm
    Cc: Kan Liang
    Cc: Mark Rutland
    Cc: Martin Jambor
    Cc: Martin Liška
    Cc: Michael Petlan
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Vijay Thakkar
    Cc: William Cohen
    Cc: Yunfeng Ye
    Link: http://lore.kernel.org/lkml/20200901220944.277505-4-kim.phillips@amd.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Kim Phillips
     
  • Add support for events listed in Section 2.1.15.2 "Performance
    Measurement" of "PPR for AMD Family 17h Model 31h B0 - 55803
    Rev 0.54 - Sep 12, 2019".

    perf now supports these new events (-e):

    all_dc_accesses
    all_tlbs_flushed
    l1_dtlb_misses
    l2_cache_accesses_from_dc_misses
    l2_cache_accesses_from_ic_misses
    l2_cache_hits_from_dc_misses
    l2_cache_hits_from_ic_misses
    l2_cache_misses_from_dc_misses
    l2_cache_misses_from_ic_miss
    l2_dtlb_misses
    l2_itlb_misses
    sse_avx_stalls
    uops_dispatched
    uops_retired
    l3_accesses
    l3_misses

    and these metrics (-M):

    branch_misprediction_ratio
    all_l2_cache_accesses
    all_l2_cache_hits
    all_l2_cache_misses
    ic_fetch_miss_ratio
    l2_cache_accesses_from_l2_hwpf
    l2_cache_hits_from_l2_hwpf
    l2_cache_misses_from_l2_hwpf
    l3_read_miss_latency
    l1_itlb_misses
    all_remote_links_outbound
    nps1_die_to_dram

    The nps1_die_to_dram event may need perf stat's --metric-no-group
    switch if the number of available data fabric counters is less
    than the number it uses (8).

    Committer testing:

    On an AMD Ryzen 3900x system:

    Before:

    # perf list all_dc_accesses all_tlbs_flushed l1_dtlb_misses l2_cache_accesses_from_dc_misses l2_cache_accesses_from_ic_misses l2_cache_hits_from_dc_misses l2_cache_hits_from_ic_misses l2_cache_misses_from_dc_misses l2_cache_misses_from_ic_miss l2_dtlb_misses l2_itlb_misses sse_avx_stalls uops_dispatched uops_retired l3_accesses l3_misses | grep -v "^Metric Groups:$" | grep -v "^$"
    #

    After:

    # perf list all_dc_accesses all_tlbs_flushed l1_dtlb_misses l2_cache_accesses_from_dc_misses l2_cache_accesses_from_ic_misses l2_cache_hits_from_dc_misses l2_cache_hits_from_ic_misses l2_cache_misses_from_dc_misses l2_cache_misses_from_ic_miss l2_dtlb_misses l2_itlb_misses sse_avx_stalls uops_dispatched uops_retired l3_accesses l3_misses | grep -v "^Metric Groups:$" | grep -v "^$" | grep -v "^recommended:$"
    all_dc_accesses
    [All L1 Data Cache Accesses]
    all_tlbs_flushed
    [All TLBs Flushed]
    l1_dtlb_misses
    [L1 DTLB Misses]
    l2_cache_accesses_from_dc_misses
    [L2 Cache Accesses from L1 Data Cache Misses (including prefetch)]
    l2_cache_accesses_from_ic_misses
    [L2 Cache Accesses from L1 Instruction Cache Misses (including
    prefetch)]
    l2_cache_hits_from_dc_misses
    [L2 Cache Hits from L1 Data Cache Misses]
    l2_cache_hits_from_ic_misses
    [L2 Cache Hits from L1 Instruction Cache Misses]
    l2_cache_misses_from_dc_misses
    [L2 Cache Misses from L1 Data Cache Misses]
    l2_cache_misses_from_ic_miss
    [L2 Cache Misses from L1 Instruction Cache Misses]
    l2_dtlb_misses
    [L2 DTLB Misses & Data page walks]
    l2_itlb_misses
    [L2 ITLB Misses & Instruction page walks]
    sse_avx_stalls
    [Mixed SSE/AVX Stalls]
    uops_dispatched
    [Micro-ops Dispatched]
    uops_retired
    [Micro-ops Retired]
    l3_accesses
    [L3 Accesses. Unit: amd_l3]
    l3_misses
    [L3 Misses (includes Chg2X). Unit: amd_l3]
    #

    # perf stat -a -e all_dc_accesses,all_tlbs_flushed,l1_dtlb_misses,l2_cache_accesses_from_dc_misses,l2_cache_accesses_from_ic_misses,l2_cache_hits_from_dc_misses,l2_cache_hits_from_ic_misses,l2_cache_misses_from_dc_misses,l2_cache_misses_from_ic_miss,l2_dtlb_misses,l2_itlb_misses,sse_avx_stalls,uops_dispatched,uops_retired,l3_accesses,l3_misses sleep 2

    Performance counter stats for 'system wide':

    433,439,949 all_dc_accesses (35.66%)
    443 all_tlbs_flushed (35.66%)
    2,985,885 l1_dtlb_misses (35.66%)
    18,318,019 l2_cache_accesses_from_dc_misses (35.68%)
    50,114,810 l2_cache_accesses_from_ic_misses (35.72%)
    12,423,978 l2_cache_hits_from_dc_misses (35.74%)
    40,703,103 l2_cache_hits_from_ic_misses (35.74%)
    6,698,673 l2_cache_misses_from_dc_misses (35.74%)
    12,090,892 l2_cache_misses_from_ic_miss (35.74%)
    614,267 l2_dtlb_misses (35.74%)
    216,036 l2_itlb_misses (35.74%)
    11,977 sse_avx_stalls (35.74%)
    999,276,223 uops_dispatched (35.73%)
    1,075,311,620 uops_retired (35.69%)
    1,420,763 l3_accesses
    540,164 l3_misses

    2.002344121 seconds time elapsed

    # perf stat -a -e all_dc_accesses,all_tlbs_flushed,l1_dtlb_misses,l2_cache_accesses_from_dc_misses,l2_cache_accesses_from_ic_misses sleep 2

    Performance counter stats for 'system wide':

    175,943,104 all_dc_accesses
    310 all_tlbs_flushed
    2,280,359 l1_dtlb_misses
    11,700,151 l2_cache_accesses_from_dc_misses
    25,414,963 l2_cache_accesses_from_ic_misses

    2.001957818 seconds time elapsed

    #

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
    Signed-off-by: Kim Phillips
    Acked-by: Ian Rogers
    Tested-by: Arnaldo Carvalho de Melo
    Cc: Alexander Shishkin
    Cc: Andi Kleen
    Cc: Borislav Petkov
    Cc: Jin Yao
    Cc: Jiri Olsa
    Cc: John Garry
    Cc: Jon Grimm
    Cc: Kan Liang
    Cc: Mark Rutland
    Cc: Martin Jambor
    Cc: Martin Liška
    Cc: Michael Petlan
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Vijay Thakkar
    Cc: William Cohen
    Cc: Yunfeng Ye
    Link: http://lore.kernel.org/lkml/20200901220944.277505-3-kim.phillips@amd.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Kim Phillips
     
  • The ITLB Instruction Fetch Hits event isn't documented even in later
    zen1 PPRs, but it seems to count correctly on zen1 hardware.

    Add it to zen1 group so zen1 users can use the upcoming IC Fetch Miss
    Ratio Metric.

    The IF1G, IF2M, IF4K (Instruction fetches to a 1 GB, 2 MB, and 4K page)
    unit masks are not added because unlike zen2 hardware, zen1 hardware
    counts all its unit masks with a 0 unit mask according to the old
    convention:

    zen1$ perf stat -e cpu/event=0x94/,cpu/event=0x94,umask=0xff/ sleep 1

    Performance counter stats for 'sleep 1':

    211,318 cpu/event=0x94/u
    211,318 cpu/event=0x94,umask=0xff/u

    Rome/zen2:

    zen2$ perf stat -e cpu/event=0x94/,cpu/event=0x94,umask=0xff/ sleep 1

    Performance counter stats for 'sleep 1':

    0 cpu/event=0x94/u
    190,744 cpu/event=0x94,umask=0xff/u

    Signed-off-by: Kim Phillips
    Acked-by: Ian Rogers
    Tested-by: Arnaldo Carvalho de Melo # on Zen2 only (3900x)
    Cc: Alexander Shishkin
    Cc: Andi Kleen
    Cc: Borislav Petkov
    Cc: Jin Yao
    Cc: Jiri Olsa
    Cc: John Garry
    Cc: Jon Grimm
    Cc: Kan Liang
    Cc: Mark Rutland
    Cc: Martin Jambor
    Cc: Martin Liška
    Cc: Michael Petlan
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Vijay Thakkar
    Cc: William Cohen
    Cc: Yunfeng Ye
    Link: http://lore.kernel.org/lkml/20200901220944.277505-2-kim.phillips@amd.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Kim Phillips
     
  • Later revisions of PPRs that post-date the original Family 17h events
    submission patch add these events.

    Specifically, they were not in this 2017 revision of the F17h PPR:

    Processor Programming Reference (PPR) for AMD Family 17h Model 01h, Revision B1 Processors Rev 1.14 - April 15, 2017

    But e.g., are included in this 2019 version of the PPR:

    Processor Programming Reference (PPR) for AMD Family 17h Model 18h, Revision B1 Processors Rev. 3.14 - Sep 26, 2019

    Fixes: 98c07a8f74f8 ("perf vendor events amd: perf PMU events for AMD Family 17h")
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
    Signed-off-by: Kim Phillips
    Reviewed-by: Ian Rogers
    Cc: Alexander Shishkin
    Cc: Andi Kleen
    Cc: Borislav Petkov
    Cc: Jin Yao
    Cc: Jiri Olsa
    Cc: John Garry
    Cc: Jon Grimm
    Cc: Kan Liang
    Cc: Mark Rutland
    Cc: Martin Jambor
    Cc: Martin Liška
    Cc: Michael Petlan
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: stable@vger.kernel.org
    Cc: Stephane Eranian
    Cc: Vijay Thakkar
    Cc: William Cohen
    Cc: Yunfeng Ye
    Link: http://lore.kernel.org/lkml/20200901220944.277505-1-kim.phillips@amd.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Kim Phillips
     

04 Sep, 2020

1 commit

  • The new string should have enough space for the original string and the
    back slashes IMHO.
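
    The sizing rule described here can be sketched in Python. This is not
    the jevents code, and the cpuid pattern is a made-up example; the
    sketch only shows that doubling each backslash needs room for the
    original length, one extra character per backslash, and the
    terminating NUL.

```python
def escaped_size(s):
    """Characters needed in C for s with each backslash doubled, plus NUL."""
    return len(s) + s.count("\\") + 1


def escape_backslashes(s):
    """Double every backslash, as an escaping pass would."""
    return s.replace("\\", "\\\\")


pattern = r"GenuineIntel-6-\d+"  # hypothetical cpuid pattern
escaped = escape_backslashes(pattern)

print(len(pattern), len(escaped), escaped_size(pattern))
```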

    Fixes: fbc2844e84038ce3 ("perf vendor events: Use more flexible pattern matching for CPU identification for mapfile.csv")
    Signed-off-by: Namhyung Kim
    Reviewed-by: Ian Rogers
    Cc: Alexander Shishkin
    Cc: Andi Kleen
    Cc: Jiri Olsa
    Cc: John Garry
    Cc: Kajol Jain
    Cc: Mark Rutland
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: William Cohen
    Link: http://lore.kernel.org/lkml/20200903152510.489233-1-namhyung@kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Namhyung Kim
     

14 Aug, 2020

1 commit

  • These changes take advantage of the new capability added in merge commit
    00e4db51259a5f936fec1424b884f029479d3981 "Allow using computed metrics
    in calculating other metrics".

    The net result is a simplification of the expressions for a handful of
    metrics, but no functional change.
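
    A toy model of what "using computed metrics in calculating other
    metrics" buys is given below. The event counts, metric names and the
    value() resolver are all hypothetical and are unrelated to perf's
    expression parser.

```python
# Hypothetical event counts and metric names, for illustration only.
events = {"cycles": 2000.0, "instructions": 1000.0, "stall_cycles": 500.0}

metrics = {
    "total_cpi": lambda v: v("cycles") / v("instructions"),
    "stall_cpi": lambda v: v("stall_cycles") / v("instructions"),
    # Refers to two other metrics by name instead of repeating their
    # event-level expressions.
    "useful_cpi": lambda v: v("total_cpi") - v("stall_cpi"),
}


def value(name):
    """Resolve an event count, or evaluate a metric (recursively)."""
    if name in events:
        return events[name]
    return metrics[name](value)


print(value("useful_cpi"))
```

    The last metric evaluates to the same number it would if both
    sub-expressions were written out in full, which matches the "no
    functional change" claim.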

    Signed-off-by: Paul Clarke
    Reviewed-by: Kajol Jain
    Acked-by: Ian Rogers
    Cc: Jiri Olsa
    Cc: Madhavan Srinivasan
    Link: http://lore.kernel.org/lkml/20200813222155.268183-1-pc@us.ibm.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Paul A. Clarke
     

03 Aug, 2020

1 commit


21 Jul, 2020

1 commit

  • Change the counter name DLFT_CCERROR to DLFT_CCFINISH on IBM z15.
    This counter counts completed DEFLATE instructions with exit code
    0, 1 or 2. Since exit code 0 means success and exit codes 1 and 2
    indicate errors, change the counter name to avoid confusion.
    This counter is incremented each time the DEFLATE instruction
    completes, regardless of whether an error was detected.

    Fixes: d68d5d51dc89 ("s390/cpum_cf: Add new extended counters for IBM z15")
    Fixes: e7950166e402 ("perf vendor events s390: Add new deflate counters for IBM z15")
    Cc: stable@vger.kernel.org # v5.7
    Signed-off-by: Thomas Richter
    Reviewed-by: Sumanth Korikkar
    Signed-off-by: Heiko Carstens

    Thomas Richter
     

06 Jul, 2020

1 commit

  • Added nest imc metric events.

    Signed-off-by: Kajol Jain
    Acked-by: Ian Rogers
    Cc: Alexander Shishkin
    Cc: Andi Kleen
    Cc: Anju T Sudhakar
    Cc: Jin Yao
    Cc: Jiri Olsa
    Cc: Kan Liang
    Cc: Mark Rutland
    Cc: Nageswara R Sastry
    Cc: Namhyung Kim
    Cc: Paul Clarke
    Cc: Peter Zijlstra
    Cc: Ravi Bangoria
    Cc: maddy@linux.ibm.com
    Link: http://lore.kernel.org/lkml/20200703065658.377467-1-kjain@linux.ibm.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Kajol Jain
     

30 May, 2020

1 commit

  • This header is part of the jsmn JSON parser, introduced in 867a979a83.
    Correct the SPDX tag to indicate that it is under the MIT license.

    Signed-off-by: Ed Maste
    Acked-by: Andi Kleen
    Cc: Greg Kroah-Hartman
    Link: http://lore.kernel.org/lkml/20200528170858.48457-1-emaste@freefall.freebsd.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Ed Maste
     

28 May, 2020

6 commits

  • Uses of "ICT" and "Ict" are expanded to "Instruction Completion Table".

    Signed-off-by: Paul Clarke
    Cc: Ananth N Mavinakayanahalli
    Cc: Ian Rogers
    Cc: Madhavan Srinivasan
    Cc: Michael Ellerman
    Cc: Naveen N. Rao
    Cc: Sukadev Bhattiprolu
    Link: http://lore.kernel.org/lkml/1589915886-22992-1-git-send-email-pc@us.ibm.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Paul A. Clarke
     
  • Add the following metrics to the POWER9 'cpi_breakdown' metricgroup:

    - ict_noslot_br_mpred_cpi
    - ict_noslot_br_mpred_icmiss_cpi
    - ict_noslot_cyc_other_cpi
    - ict_noslot_disp_held_cpi
    - ict_noslot_disp_held_hb_full_cpi
    - ict_noslot_disp_held_issq_cpi
    - ict_noslot_disp_held_other_cpi
    - ict_noslot_disp_held_sync_cpi
    - ict_noslot_disp_held_tbegin_cpi
    - ict_noslot_ic_l2_cpi
    - ict_noslot_ic_l3_cpi
    - ict_noslot_ic_l3miss_cpi
    - ict_noslot_ic_miss_cpi

    Signed-off-by: Paul Clarke
    Cc: Ananth N Mavinakayanahalli
    Reviewed-by: Kajol Jain
    Tested-by: Ian Rogers
    Cc: Madhavan Srinivasan
    Cc: Michael Ellerman
    Cc: Naveen N. Rao
    Cc: Sukadev Bhattiprolu
    Link: http://lore.kernel.org/lkml/1588868938-21933-3-git-send-email-pc@us.ibm.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Paul A. Clarke
     
  • Mismatched parentheses.

    Fixes: 7f3cf5ac7743 (perf vendor events power9: Cpi_breakdown & estimated_dcache_miss_cpi metrics)
    Signed-off-by: Ian Rogers
    Reviewed-by: Paul Clarke
    Acked-by: Jiri Olsa
    Cc: Adrian Hunter
    Cc: Alexander Shishkin
    Cc: Andi Kleen
    Cc: Haiyan Song
    Cc: Jin Yao
    Cc: John Garry
    Cc: Kajol Jain
    Cc: Kan Liang
    Cc: Leo Yan
    Cc: Mark Rutland
    Cc: Namhyung Kim
    Cc: Paul Clarke
    Cc: Peter Zijlstra
    Cc: Ravi Bangoria
    Cc: Song Liu
    Cc: Stephane Eranian
    Link: http://lore.kernel.org/lkml/20200501173333.227162-10-irogers@google.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Ian Rogers
     
  • Mismatched parentheses.

    Fixes: dd81eafacc52 (perf vendor events power8: Cpi_breakdown & estimated_dcache_miss_cpi metrics)
    Signed-off-by: Ian Rogers
    Reviewed-by: Paul Clarke
    Acked-by: Jiri Olsa
    Cc: Adrian Hunter
    Cc: Alexander Shishkin
    Cc: Andi Kleen
    Cc: Haiyan Song
    Cc: Jin Yao
    Cc: John Garry
    Cc: Kajol Jain
    Cc: Kan Liang
    Cc: Leo Yan
    Cc: Mark Rutland
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Ravi Bangoria
    Cc: Song Liu
    Cc: Stephane Eranian
    Link: http://lore.kernel.org/lkml/20200501173333.227162-9-irogers@google.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Ian Rogers
     
  • Remove over escaping with \\.

    Fixes: fd5500989c8f (perf vendor events intel: Update metrics from TMAM 3.5)
    Signed-off-by: Ian Rogers
    Acked-by: Jiri Olsa
    Cc: Adrian Hunter
    Cc: Alexander Shishkin
    Cc: Andi Kleen
    Cc: Haiyan Song
    Cc: Jin Yao
    Cc: John Garry
    Cc: Kajol Jain
    Cc: Kan Liang
    Cc: Leo Yan
    Cc: Mark Rutland
    Cc: Namhyung Kim
    Cc: Paul Clarke
    Cc: Peter Zijlstra
    Cc: Ravi Bangoria
    Cc: Song Liu
    Cc: Stephane Eranian
    Link: http://lore.kernel.org/lkml/20200501173333.227162-4-irogers@google.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Ian Rogers
     
  • Remove over escaping with \\.
    Remove extraneous if 1 if 0 == 1 else 0 else 0.

    Fixes: fd5500989c8f (perf vendor events intel: Update metrics from TMAM 3.5)
    Signed-off-by: Ian Rogers
    Acked-by: Jiri Olsa
    Cc: Adrian Hunter
    Cc: Alexander Shishkin
    Cc: Andi Kleen
    Cc: Haiyan Song
    Cc: Jin Yao
    Cc: John Garry
    Cc: Kajol Jain
    Cc: Kan Liang
    Cc: Leo Yan
    Cc: Mark Rutland
    Cc: Namhyung Kim
    Cc: Paul Clarke
    Cc: Peter Zijlstra
    Cc: Ravi Bangoria
    Cc: Song Liu
    Cc: Stephane Eranian
    Link: http://lore.kernel.org/lkml/20200501173333.227162-3-irogers@google.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Ian Rogers
     

30 Apr, 2020

2 commits

  • The hv_24x7 feature in IBM® POWER9™ processor-based servers provides the
    facility to continuously collect large numbers of hardware performance
    metrics efficiently and accurately.

    This patch adds hv_24x7 metric file for different Socket/chip
    resources.

    Result:

    power9 platform:

    command:# ./perf stat --metric-only -M Memory_RD_BW_Chip -C 0 -I 1000

    1.000096188 0.9 0.3
    2.000285720 0.5 0.1
    3.000424990 0.4 0.1

    command:# ./perf stat --metric-only -M PowerBUS_Frequency -C 0 -I 1000

    1.000097981 2.3 2.3
    2.000291713 2.3 2.3
    3.000421719 2.3 2.3
    4.000550912 2.3 2.3

    Signed-off-by: Kajol Jain
    Acked-by: Jiri Olsa
    Cc: Alexander Shishkin
    Cc: Andi Kleen
    Cc: Anju T Sudhakar
    Cc: Benjamin Herrenschmidt
    Cc: Greg Kroah-Hartman
    Cc: Jin Yao
    Cc: Joe Mario
    Cc: Kan Liang
    Cc: Madhavan Srinivasan
    Cc: Mamatha Inamdar
    Cc: Mark Rutland
    Cc: Michael Ellerman
    Cc: Michael Petlan
    Cc: Namhyung Kim
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Ravi Bangoria
    Cc: Sukadev Bhattiprolu
    Cc: Thomas Gleixner
    Cc: linuxppc-dev@lists.ozlabs.org
    Link: http://lore.kernel.org/lkml/20200401203340.31402-8-kjain@linux.ibm.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Kajol Jain
     
  • get_cpuid_str() is used in tools/perf/arch/xxx/util/header.c,
    fix the name in comment.

    Signed-off-by: Shaokun Zhang
    Cc: Andi Kleen
    Link: http://lore.kernel.org/lkml/1588141992-48382-1-git-send-email-zhangshaokun@hisilicon.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Shaokun Zhang
     

03 Apr, 2020

1 commit

  • The kernel utilization metric currently requires multiplexing and is
    somewhat unreliable. The problem is that it uses two instances of the
    fixed counter, so the kernel has to multiplex, which causes errors. Use
    CPU_CLK_UNHALTED.THREAD instead.

    Before:

    # perf stat -M Kernel_Utilization -- sleep 1

    Performance counter stats for 'sleep 1':

    1,419,425 cpu_clk_unhalted.ref_tsc:k
    cpu_clk_unhalted.ref_tsc (0.00%)

    After:

    # perf stat -M Kernel_Utilization -- sleep 1

    Performance counter stats for 'sleep 1':

    746,688 cpu_clk_unhalted.thread:k # 0.7 Kernel_Utilization
    1,088,348 cpu_clk_unhalted.thread
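
    The reported value can be reproduced from the two counts above. The
    ratio form used here (kernel-mode cycles over all cycles) is inferred
    from the output rather than quoted from the metric definition.

```python
# Counts copied from the "After" output above.
kernel_cycles = 746_688    # cpu_clk_unhalted.thread:k
total_cycles = 1_088_348   # cpu_clk_unhalted.thread

# Assumed form of the metric: kernel-mode cycles over all cycles.
kernel_utilization = kernel_cycles / total_cycles
print(f"{kernel_utilization:.1f}")
```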

    Signed-off-by: Jin Yao
    Reviewed-by: Andi Kleen
    Reviewed-by: Kan Liang
    Cc: Alexander Shishkin
    Cc: Jin Yao
    Cc: Jiri Olsa
    Cc: Peter Zijlstra
    Link: http://lore.kernel.org/lkml/20200309013125.7559-1-yao.jin@linux.intel.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Jin Yao
     

24 Mar, 2020

4 commits

  • With the goal of supporting pmu-events test case, introduce support for
    a test events folder.

    These test events can be used for testing generation of pmu-event tables
    and alias creation for any arch.

    When running the pmu-events test case, these test events will be used as
    the platform-agnostic events, so aliases can be created per-PMU and
    validated against known expected values.

    To support the test events, add a "testcpu" entry in pmu_events_map[].
    The pmu-events test will be able to lookup the events map for "testcpu",
    to verify the generated tables against expected values.

    The resultant generated pmu-events.c will now look like the following:

    struct pmu_event pme_ampere_emag[] = {
        {
            .name = "ldrex_spec",
            .event = "event=0x6c",
            .desc = "Exclusive operation spe...",
            .topic = "intrinsic",
            .long_desc = "Exclusive operation ...",
        },
        ...
    };

    struct pmu_event pme_test_cpu[] = {
        {
            .name = "uncore_hisi_ddrc.flux_wcmd",
            .event = "event=0x2",
            .desc = "DDRC write commands. Unit: hisi_sccl,ddrc ",
            .topic = "uncore",
            .long_desc = "DDRC write commands",
            .pmu = "hisi_sccl,ddrc",
        },
        {
            .name = "unc_cbo_xsnp_response.miss_eviction",
            .event = "umask=0x81,event=0x22",
            .desc = "Unit: uncore_cbox A cross-core snoop resulted ...",
            .topic = "uncore",
            .long_desc = "A cross-core snoop resulted from L3 ...",
            .pmu = "uncore_cbox",
        },
        {
            .name = "eist_trans",
            .event = "umask=0x0,period=200000,event=0x3a",
            .desc = "Number of Enhanced Intel SpeedStep(R) ...",
            .topic = "other",
        },
        {
            .name = 0,
        },
    };

    struct pmu_events_map pmu_events_map[] = {
        ...
        {
            .cpuid = "0x00000000500f0000",
            .version = "v1",
            .type = "core",
            .table = pme_ampere_emag
        },
        ...
        {
            .cpuid = "testcpu",
            .version = "v1",
            .type = "core",
            .table = pme_test_cpu,
        },
        {
            .cpuid = 0,
            .version = 0,
            .type = 0,
            .table = 0,
        },
    };
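
    How a test might look up the "testcpu" entry can be sketched in Python.
    This mirrors the generated tables in abbreviated form; find_table() is a
    hypothetical helper, and it uses an exact string comparison on cpuid as
    a simplification rather than whatever matching perf itself performs.

```python
# Python mirror of the generated tables above; contents are abbreviated.
pme_ampere_emag = [{"name": "ldrex_spec", "event": "event=0x6c"}]
pme_test_cpu = [
    {"name": "uncore_hisi_ddrc.flux_wcmd", "event": "event=0x2"},
    {"name": "unc_cbo_xsnp_response.miss_eviction",
     "event": "umask=0x81,event=0x22"},
    {"name": "eist_trans", "event": "umask=0x0,period=200000,event=0x3a"},
]

pmu_events_map = [
    {"cpuid": "0x00000000500f0000", "version": "v1", "type": "core",
     "table": pme_ampere_emag},
    {"cpuid": "testcpu", "version": "v1", "type": "core",
     "table": pme_test_cpu},
    # Terminating entry, like the all-zero element in the C array.
    {"cpuid": None, "version": None, "type": None, "table": None},
]


def find_table(cpuid):
    """Walk the map until the terminating entry (hypothetical helper)."""
    for entry in pmu_events_map:
        if entry["table"] is None:
            break
        if entry["cpuid"] == cpuid:
            return entry["table"]
    return None


table = find_table("testcpu")
print([event["name"] for event in table])
```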

    Signed-off-by: John Garry
    Acked-by: Jiri Olsa
    Cc: Alexander Shishkin
    Cc: Andi Kleen
    Cc: James Clark
    Cc: Joakim Zhang
    Cc: Mark Rutland
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Will Deacon
    Cc: linuxarm@huawei.com
    Link: http://lore.kernel.org/lkml/1584442939-8911-3-git-send-email-john.garry@huawei.com
    Signed-off-by: Arnaldo Carvalho de Melo

    John Garry
     
  • Add some test PMU events. The events are randomly chosen from x86 and
    arm64 JSONs. The events include CPU and uncore events.

    Signed-off-by: John Garry
    Acked-by: Jiri Olsa
    Cc: Alexander Shishkin
    Cc: Andi Kleen
    Cc: James Clark
    Cc: Joakim Zhang
    Cc: Mark Rutland
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Will Deacon
    Cc: linuxarm@huawei.com
    Link: http://lore.kernel.org/lkml/1584442939-8911-2-git-send-email-john.garry@huawei.com
    Signed-off-by: Arnaldo Carvalho de Melo

    John Garry
     
  • This patch updates the PMCs for AMD Zen1 core based processors (Family
    17h; Models 0 through 2F) to be in accordance with PMCs as
    documented in the latest versions of the AMD Processor Programming
    Reference [1], [2] and [3]. Note that some events, such as FPU pipe
    assignment are missing in [1], and therefore [3] is included for full
    coverage of events.

    PMCs added:

    fpu_pipe_assignment.dual{0|1|2|3}
    fpu_pipe_assignment.total{0|1|2|3}
    ls_mab_alloc.dc_prefetcher
    ls_mab_alloc.stores
    ls_mab_alloc.loads
    bp_dyn_ind_pred
    bp_de_redirect

    PMC removed:

    ex_ret_cond_misp

    Cumulative counts, fpu_pipe_assignment.total and
    fpu_pipe_assignment.dual, existed in v1, but v1 did not expose
    port-level counters.

    ex_ret_cond_misp has been removed because it is no longer listed in the
    latest versions of the PPR and, when tested on a Ryzen 3400G system, it
    always seems to sample zero.

    [1]: Processor Programming Reference (PPR) for AMD Family 17h Models
    01h,08h, Revision B2 Processors, 54945 Rev 3.03 - Jun 14, 2019.

    [2]: Processor Programming Reference (PPR) for AMD Family 17h Model 18h,
    Revision B1 Processors, 55570-B1 Rev 3.14 - Sep 26, 2019.

    [3]: OSRR for AMD Family 17h processors, Models 00h-2Fh, 56255 Rev 3.03 - July, 2018

    All of the PPRs can be found at:
    https://bugzilla.kernel.org/show_bug.cgi?id=206537

    Signed-off-by: Vijay Thakkar
    Acked-by: Kim Phillips
    Cc: Alexander Shishkin
    Cc: Jiri Olsa
    Cc: Jon Grimm
    Cc: Martin Liška
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: vijay thakkar
    Link: http://lore.kernel.org/lkml/20200318190002.307290-4-vijaythakkar@me.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Vijay Thakkar
     
  • This patch adds PMU events for AMD Zen2 core based processors, namely,
    Matisse (model 71h), Castle Peak (model 31h) and Rome (model 2xh), as
    documented in the AMD Processor Programming Reference for Matisse [1].
    The model number regex has been set to detect all the models under
    family 17 that do not match those of Zen1, as the range is larger for
    zen2.
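
    The matching rule described above can be sketched as follows. This is
    only an illustration: the pattern, the function name classify() and the
    returned labels are assumptions and are not taken from the patch. It
    assumes Zen1 covers models 0h through 2Fh, as stated in the Zen1 commit,
    and treats every other family 17h model as Zen2.

```python
import re

# Hypothetical pattern for illustration: hex model numbers 0h through 2Fh.
# The actual mapfile entries used by perf may be written differently.
ZEN1_MODEL = re.compile(r"[12][0-9A-F]|[0-9A-F]")


def classify(family: int, model: int) -> str:
    """Return an assumed event-table label for a family/model pair."""
    if family != 0x17:
        return "other"
    # Render the model as upper-case hex without a prefix, e.g. 0x71 -> "71".
    if ZEN1_MODEL.fullmatch(format(model, "X")):
        return "amdzen1"
    # Anything else under family 17h falls through to the Zen2 table.
    return "amdzen2"


if __name__ == "__main__":
    for model in (0x01, 0x08, 0x18, 0x31, 0x71):
        print(f"model {model:02X}h -> {classify(0x17, model)}")
```

    Matching the narrower Zen1 range first and letting everything else fall
    through keeps the Zen2 side open-ended, which is the point made above
    about the Zen2 range being larger.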

    Zen2 adds some counters that are not present in Zen1, and events for
    them have been added in this patch. Some counters that were present in
    Zen1 have also been removed for Zen2, having been confirmed to always
    sample zero on zen2. These added/removed counters have been omitted for
    brevity but can be found here:
    https://gist.github.com/thakkarV/5b12ca5fd7488eb2c42e451e40bdd5f3

    Note that PPR for Zen2 [1] does not include some counters that were
    documented in the PPR for Zen1 based processors [2]. After having tested
    these counters, some of them that still work for zen2 systems have been
    preserved in the events for zen2. The counters that are omitted in [1]
    but are still measurable and non-zero on zen2 (tested on a Ryzen 3900X
    system) are the following:

    PMC 0x000 fpu_pipe_assignment.{total|total0|total1|total2|total3}
    PMC 0x004 fp_num_mov_elim_scal_op.*
    PMC 0x046 ls_tablewalker.*
    PMC 0x062 l2_latency.l2_cycles_waiting_on_fills
    PMC 0x063 l2_wcb_req.*
    PMC 0x06D l2_fill_pending.l2_fill_busy
    PMC 0x080 ic_fw32
    PMC 0x081 ic_fw32_miss
    PMC 0x086 bp_snp_re_sync
    PMC 0x087 ic_fetch_stall.*
    PMC 0x08C ic_cache_inval.*
    PMC 0x099 bp_tlb_rel
    PMC 0x0C7 ex_ret_brn_resync
    PMC 0x28A ic_oc_mode_switch.*
    L3PMC 0x001 l3_request_g1.*
    L3PMC 0x006 l3_comb_clstr_state.*
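
    The core PMC numbers above are 12-bit event selects. As a rough sketch,
    such a code and a unit mask might be packed into a raw config value like
    this, assuming the usual AMD core PERF_CTL layout (event select bits 7:0
    in config bits 7:0, unit mask in config bits 15:8, event select bits
    11:8 in config bits 35:32). The function name and the unit mask values
    are hypothetical and are not taken from the patch. The L3PMC entries use
    a different PMU and are not covered here.

```python
def amd_core_raw_config(event_select: int, umask: int) -> int:
    """Pack a 12-bit event select and an 8-bit unit mask (assumed layout)."""
    if not 0 <= event_select <= 0xFFF:
        raise ValueError("event select must fit in 12 bits")
    if not 0 <= umask <= 0xFF:
        raise ValueError("unit mask must fit in 8 bits")
    low = event_select & 0xFF           # event select bits 7:0
    high = (event_select >> 8) & 0xF    # event select bits 11:8
    return low | (umask << 8) | (high << 32)


if __name__ == "__main__":
    # Unit masks below are placeholders chosen for illustration only.
    for name, code, umask in (
        ("fpu_pipe_assignment", 0x000, 0x0F),
        ("ic_oc_mode_switch", 0x28A, 0x01),
    ):
        print(f"{name}: config=0x{amd_core_raw_config(code, umask):x}")
```

    Codes above 0xFF, such as 0x28A, are the ones where the upper nibble
    matters, since it lands in a separate bit range of the config value.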

    [1]: Processor Programming Reference (PPR) for AMD Family 17h Model 71h,
    Revision B0 Processors, 56176 Rev 3.06 - Jul 17, 2019

    [2]: Processor Programming Reference (PPR) for AMD Family 17h Models
    01h,08h, Revision B2 Processors, 54945 Rev 3.03 - Jun 14, 2019

    All of the PPRs can be found at:

    https://bugzilla.kernel.org/show_bug.cgi?id=206537

    Here are the results of running "fpu_pipe_assignment.total" events on my
    Ryzen 3900X family 17h model 71h system:

    Before this patch:

    $> perf list *fpu_pipe_assignment*

    List of pre-defined events (to be used in -e):

    After:

    $> perf list *fpu_pipe_assignment*

    floating point:
    fpu_pipe_assignment.total
    [Total number of fp uOps]
    fpu_pipe_assignment.total0
    [Total number uOps assigned to pipe 0]
    fpu_pipe_assignment.total1
    [Total number uOps assigned to pipe 1]
    fpu_pipe_assignment.total2
    [Total number uOps assigned to pipe 2]
    fpu_pipe_assignment.total3
    [Total number uOps assigned to pipe 3]

    Metric Groups:

    $> perf stat -e fpu_pipe_assignment.total sleep 1

    Performance counter stats for 'sleep 1':

    25,883 fpu_pipe_assignment.total

    1.004145868 seconds time elapsed

    0.001805000 seconds user
    0.000000000 seconds sys

    Usage tests while running Linpack in the background:

    $> perf stat -I1000 -e fpu_pipe_assignment.total
    1.000266796 79,313,191,516 fpu_pipe_assignment.total
    2.000809630 68,091,474,430 fpu_pipe_assignment.total
    3.001028115 52,925,023,174 fpu_pipe_assignment.total
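
    To compare the intervals above, each count can be divided by the length
    of its interval. The following is a minimal sketch that parses the three
    lines shown; the helper name interval_rates() is an assumption, and the
    parser only handles this simple "timestamp count event" layout, not
    every format perf stat can print.

```python
SAMPLE = """\
1.000266796 79,313,191,516 fpu_pipe_assignment.total
2.000809630 68,091,474,430 fpu_pipe_assignment.total
3.001028115 52,925,023,174 fpu_pipe_assignment.total
"""


def interval_rates(text: str):
    """Return (event, count, events_per_second) for each interval line."""
    rows = []
    previous = 0.0
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or unrelated lines
        timestamp = float(fields[0])
        count = int(fields[1].replace(",", ""))
        # The timestamp is cumulative, so the interval length is the delta.
        rows.append((fields[2], count, count / (timestamp - previous)))
        previous = timestamp
    return rows


if __name__ == "__main__":
    for event, count, rate in interval_rates(SAMPLE):
        print(f"{event}: {count} events, {rate:.3e} events/s")
```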

    $> perf record -e fpu_pipe_assignment.total,fpu_pipe_assignment.total0 -a sleep 1
    [ perf record: Woken up 9 times to write data ]
    [ perf record: Captured and wrote 4.031 MB perf.data (64764 samples) ]

    $> perf report --stdio --no-header | head -30
    98.33% xhpl xhpl [.] dgemm_kernel
    0.28% xhpl xhpl [.] dtrsm_kernel_LT
    0.10% xhpl [kernel.kallsyms] [k] entry_SYSCALL_64
    0.08% xhpl xhpl [.] idamax_k
    0.07% baloo_file_extr liblmdb.so [.] mdb_mid2l_insert
    0.06% xhpl xhpl [.] dgemm_itcopy
    0.06% xhpl xhpl [.] dgemm_oncopy
    0.06% xhpl [kernel.kallsyms] [k] __schedule
    0.06% xhpl [kernel.kallsyms] [k] syscall_trace_enter
    0.06% xhpl [kernel.kallsyms] [k] native_sched_clock
    0.06% xhpl [kernel.kallsyms] [k] pick_next_task_fair
    0.05% xhpl xhpl [.] blas_thread_server.llvm.15009391670273914865
    0.04% xhpl [kernel.kallsyms] [k] do_syscall_64
    0.04% xhpl [kernel.kallsyms] [k] yield_task_fair
    0.04% xhpl libpthread-2.31.so [.] __pthread_mutex_unlock_usercnt
    0.03% xhpl [kernel.kallsyms] [k] cpuacct_charge
    0.03% xhpl [kernel.kallsyms] [k] syscall_return_via_sysret
    0.03% xhpl libc-2.31.so [.] __sched_yield
    0.03% xhpl [kernel.kallsyms] [k] __calc_delta

    $> perf annotate --stdio2 dgemm_kernel | egrep '^ {0,2}[0-9]+' -B2 -A2
    sub $0x60,%rsp
    mov %rbx,(%rsp)
    0.00 mov %rbp,0x8(%rsp)
    mov %r12,0x10(%rsp)
    0.00 mov %r13,0x18(%rsp)
    mov %r14,0x20(%rsp)
    mov %r15,0x28(%rsp)
    --
    mov %rdi,%r13
    mov %rsi,0x28(%rsp)
    0.00 mov %rdx,%r12
    vmovsd %xmm0,0x30(%rsp)
    shl $0x3,%r10
    mov 0x28(%rsp),%rax
    0.00 xor %rdx,%rdx
    mov $0x18,%rdi
    div %rdi
    --
    nop
    a0: mov %r12,%rax
    0.00 shl $0x3,%rax
    mov %r8,%rdi
    lea (%r8,%rax,8),%r15
    --
    mov %r12,%rax
    nop
    0.00 c0: vmovups (%rdi),%ymm1
    0.09 vmovups 0x20(%rdi),%ymm2
    0.02 vmovups (%r15),%ymm3
    0.10 vmovups %ymm1,(%rsi)
    0.07 vmovups %ymm2,0x20(%rsi)
    0.07 vmovups %ymm3,0x40(%rsi)
    0.06 add $0x40,%rdi
    add $0x40,%r15
    add $0x60,%rsi
    0.00 dec %rax
    ↑ jne c0
    mov %r9,%r15
    --
    nop
    110: lea 0x80(%rsp),%rsi
    0.01 add $0x60,%rsi
    0.03 mov %r12,%rax
    0.00 sar $0x3,%rax
    cmp $0x2,%rax
    ↓ jl d26
    prefetcht0 0x200(%rdi)
    0.01 vmovups -0x60(%rsi),%ymm1
    0.02 prefetcht0 0xa0(%rsi)
    0.00 vbroadcastsd -0x80(%rdi),%ymm0
    0.00 prefetcht0 0xe0(%rsi)
    0.03 vmovups -0x40(%rsi),%ymm2
    0.00 prefetcht0 0x120(%rsi)
    vmovups -0x20(%rsi),%ymm3
    vmulpd %ymm0,%ymm1,%ymm4
    0.01 prefetcht0 0x160(%rsi)
    vmulpd %ymm0,%ymm2,%ymm8
    0.01 vmulpd %ymm0,%ymm3,%ymm12
    0.02 prefetcht0 0x1a0(%rsi)
    0.01 vbroadcastsd -0x78(%rdi),%ymm0
    vmulpd %ymm0,%ymm1,%ymm5
    0.01 vmulpd %ymm0,%ymm2,%ymm9
    vmulpd %ymm0,%ymm3,%ymm13
    0.01 vbroadcastsd -0x70(%rdi),%ymm0
    vmulpd %ymm0,%ymm1,%ymm6
    0.00 vmulpd %ymm0,%ymm2,%ymm10
    0.00 add $0x60,%rsi

    ... snip ...

    nop
    65e0: vmovddup -0x60(%rsi),%xmm2
    0.00 vmovups -0x80(%rdi),%xmm0
    vmovups -0x70(%rdi),%xmm1
    0.00 vmovddup -0x58(%rsi),%xmm3
    vfmadd231pd %xmm0,%xmm2,%xmm4
    0.00 vfmadd231pd %xmm1,%xmm2,%xmm5
    0.00 vfmadd231pd %xmm0,%xmm3,%xmm6
    0.00 vfmadd231pd %xmm1,%xmm3,%xmm7
    0.00 add $0x10,%rsi
    add $0x20,%rdi
    0.00 dec %rax
    ↑ jne 65e0
    nop
    nop
    6620: vmovddup 0x30(%rsp),%xmm0
    0.00 vmulpd %xmm0,%xmm4,%xmm4
    0.00 vmulpd %xmm0,%xmm5,%xmm5
    vmulpd %xmm0,%xmm6,%xmm6
    vmulpd %xmm0,%xmm7,%xmm7
    vaddpd (%r15),%xmm4,%xmm4
    vaddpd 0x10(%r15),%xmm5,%xmm5
    0.00 vaddpd (%r15,%r10,1),%xmm6,%xmm6
    0.00 vaddpd 0x10(%r15,%r10,1),%xmm7,%xmm7
    0.00 vmovups %xmm4,(%r15)
    vmovups %xmm5,0x10(%r15)
    0.00 vmovups %xmm6,(%r15,%r10,1)
    vmovups %xmm7,0x10(%r15,%r10,1)
    add $0x20,%r15
    --
    lea (%r8,%rax,8),%r8
    69d8: mov 0x20(%rsp),%r14
    0.00 test $0x1,%r14
    ↓ je 6d84
    mov %r9,%r15
    --
    vbroadcastsd -0x28(%rsi),%ymm3
    vfmadd231pd (%rdi),%ymm0,%ymm4
    0.00 vfmadd231pd 0x20(%rdi),%ymm1,%ymm5
    vfmadd231pd 0x40(%rdi),%ymm2,%ymm6
    vfmadd231pd 0x60(%rdi),%ymm3,%ymm7
    --
    vmulpd %ymm0,%ymm4,%ymm4
    vaddpd (%r15),%ymm4,%ymm4
    0.00 vmovups %ymm4,(%r15)
    add $0x20,%r15
    dec %r11
    --
    mov %rbx,%rsp
    mov (%rsp),%rbx
    0.01 mov 0x8(%rsp),%rbp
    mov 0x10(%rsp),%r12
    mov 0x18(%rsp),%r13

    Signed-off-by: Vijay Thakkar
    Tested-by: Arnaldo Carvalho de Melo
    Acked-by: Kim Phillips
    Cc: Alexander Shishkin
    Cc: Jiri Olsa
    Cc: Jon Grimm
    Cc: Martin Liška
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Link: http://lore.kernel.org/lkml/20200318190002.307290-3-vijaythakkar@me.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Vijay Thakkar