13 Sep, 2010

1 commit

  • Fix a bug introduced by commit de725de and the change in the
    meaning of the return value of intel_pmu_handle_irq(). With the
    current code, when you are using the BTS, you get 'dazed by NMI'
    each time the BTS buffer fills up.

    BTS interrupts on the PMU vector, and thus as an NMI, so this
    needs to be taken into account in the function's return value.

    This version fixes the initial patch, which was missing the
    changes to perf_event_intel_ds.c.
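
    A minimal sketch of the idea, assuming intel_pmu_drain_bts_buffer()
    is changed to report whether it drained anything (the real handler
    does much more than this):

        static int intel_pmu_handle_irq(struct pt_regs *regs)
        {
                /* BTS interrupts on the PMU vector: a drain is PMI work */
                int handled = intel_pmu_drain_bts_buffer();     /* 0 or 1 */

                /* ... service overflowed counters, bumping 'handled' ... */

                return handled; /* 0 means 'unknown NMI' to the caller */
        }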

    Signed-off-by: Stephane Eranian
    Acked-by: Don Zickus
    Cc: peterz@infradead.org
    Cc: paulus@samba.org
    Cc: davem@davemloft.net
    Cc: fweisbec@gmail.com
    Cc: perfmon2-devel@lists.sf.net
    Cc: eranian@gmail.com
    Cc: robert.richter@amd.com
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     

10 Sep, 2010

25 commits

  • We ought to return -ENOENT when none of the registered PMUs
    recognise the requested event.

    This fixes a boot crash that occurs if no PMU is available
    but the NMI watchdog tries to register an event.
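
    A sketch of the intended lookup loop (list and callback names are
    simplified assumptions, not the exact kernel code):

        struct pmu *perf_init_event(struct perf_event *event)
        {
                struct pmu *pmu;

                list_for_each_entry_rcu(pmu, &pmus, entry) {
                        int ret = pmu->event_init(event);

                        if (!ret)
                                return pmu;          /* this pmu takes it */
                        if (ret != -ENOENT)
                                return ERR_PTR(ret); /* real error */
                }
                return ERR_PTR(-ENOENT); /* no PMU recognised the event */
        }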

    Reported-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Even though we call it from the inherit path, where the child is
    not yet accessible, we need to hold ctx->lock; add_event_to_ctx()
    assumes IRQs are disabled.
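
    Roughly, in the inherit path (a sketch; variable names abbreviated):

        unsigned long flags;

        raw_spin_lock_irqsave(&child_ctx->lock, flags);
        add_event_to_ctx(child_event, child_ctx);  /* assumes IRQs off */
        raw_spin_unlock_irqrestore(&child_ctx->lock, flags);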

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • asm-generic/hardirq.h needs asm/irq.h, which might include
    linux/interrupt.h, as in the sparc 32 case. At that point we
    need the generic irq_cpustat definitions, but those are only
    included later in asm-generic/hardirq.h.

    So delay the inclusion of irq.h in asm-generic/hardirq.h a bit;
    it doesn't need to be included early.

    This fixes:

    include/linux/interrupt.h: In function '__raise_softirq_irqoff':
    include/linux/interrupt.h:414: error: implicit declaration of function 'local_softirq_pending'
    include/linux/interrupt.h:414: error: lvalue required as left operand of assignment
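
    A sketch of the resulting ordering in asm-generic/hardirq.h
    (assuming the generic definitions look roughly like this):

        #include <linux/cache.h>
        #include <linux/threads.h>

        typedef struct {
                unsigned int __softirq_pending;
        } ____cacheline_aligned irq_cpustat_t;

        #include <linux/irq_cpustat.h>  /* generic irq_cpustat accessors */
        #include <asm/irq.h>            /* moved below the definitions it
                                           may indirectly need */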

    Reported-by: Ingo Molnar
    Signed-off-by: Frederic Weisbecker
    Cc: Lai Jiangshan
    Cc: Koki Sanagi
    Cc: mathieu.desnoyers@efficios.com
    Cc: rostedt@goodmis.org
    Cc: nhorman@tuxdriver.com
    Cc: scott.a.mcmillan@intel.com
    Cc: eric.dumazet@gmail.com
    Cc: kaneshige.kenji@jp.fujitsu.com
    Cc: davem@davemloft.net
    Cc: izumi.taku@jp.fujitsu.com
    Cc: kosaki.motohiro@jp.fujitsu.com
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • I missed a perf_event_ctxp user when converting it to an array. Pull this
    last user into perf_event.c as well and fix it up.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Assuming we don't mix events of different PMUs onto a single
    context (with the exception of software events inside a hardware
    group), we can now assume that all events on a particular context
    belong to the same PMU, hence we can disable the PMU for the
    entire set of context operations.

    This reduces the number of hardware writes.

    The exception for software events comes from the fact that the
    software PMU's disable is a nop.
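
    The pattern, sketched (assuming the perf_pmu_{disable,enable}
    helpers and the ctx_sched_out() signature of this series):

        perf_pmu_disable(ctx->pmu);     /* one disable for the whole op */
        ctx_sched_out(ctx, cpuctx, EVENT_ALL);
        perf_pmu_enable(ctx->pmu);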

    Signed-off-by: Peter Zijlstra
    Cc: paulus
    Cc: stephane eranian
    Cc: Robert Richter
    Cc: Frederic Weisbecker
    Cc: Lin Ming
    Cc: Yanmin
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Since software events are always schedulable, mixing them up with
    hardware events (which are not) can lead to funny scheduling
    oddities.

    Giving them their own context solves this.

    Signed-off-by: Peter Zijlstra
    Cc: paulus
    Cc: stephane eranian
    Cc: Robert Richter
    Cc: Frederic Weisbecker
    Cc: Lin Ming
    Cc: Yanmin
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Provide the infrastructure for multiple task contexts.

    A more flexible approach would have resulted in more pointer chases
    in the scheduling hot-paths. This approach has the limitation of a
    static number of task contexts.

    Since I expect most external PMUs to be system-wide, or at least
    node-wide (as per the Intel uncore unit), they won't actually
    need a task context.
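
    A sketch of the static approach, assuming two fixed context types:

        enum perf_event_task_context {
                perf_hw_context,
                perf_sw_context,
                perf_nr_task_contexts,
        };

        struct task_struct {
                /* ... */
                struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
        };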

    Signed-off-by: Peter Zijlstra
    Cc: paulus
    Cc: stephane eranian
    Cc: Robert Richter
    Cc: Frederic Weisbecker
    Cc: Lin Ming
    Cc: Yanmin
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Unify the two perf_event_context allocation sites.

    Signed-off-by: Peter Zijlstra
    Cc: paulus
    Cc: stephane eranian
    Cc: Robert Richter
    Cc: Frederic Weisbecker
    Cc: Lin Ming
    Cc: Yanmin
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Move all inherit code near each other.

    Signed-off-by: Peter Zijlstra
    Cc: paulus
    Cc: stephane eranian
    Cc: Robert Richter
    Cc: Frederic Weisbecker
    Cc: Lin Ming
    Cc: Yanmin
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Allocate per-cpu contexts per pmu.
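
    Sketched (the field name is an assumption): each pmu now owns its
    per-cpu contexts, allocated at registration time.

        struct pmu {
                struct list_head                  entry;
                struct perf_cpu_context __percpu *pmu_cpu_context;
                /* ... method pointers ... */
        };

        /* in perf_pmu_register(): */
        pmu->pmu_cpu_context = alloc_percpu(struct perf_cpu_context);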

    Signed-off-by: Peter Zijlstra
    Cc: paulus
    Cc: stephane eranian
    Cc: Robert Richter
    Cc: Frederic Weisbecker
    Cc: Lin Ming
    Cc: Yanmin
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Give each cpu-context its own timer so that it is a self-contained
    entity; this eases the way for per-pmu-per-cpu contexts and
    provides the basic infrastructure to allow different rotation
    times per pmu.

    Things to look at:
    - folding the tick and these TICK_NSEC timers
    - separate task context rotation

    Signed-off-by: Peter Zijlstra
    Cc: paulus
    Cc: stephane eranian
    Cc: Robert Richter
    Cc: Frederic Weisbecker
    Cc: Lin Ming
    Cc: Yanmin
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Separate the swevent hash-table from the cpu_context bits in
    preparation for per pmu cpu contexts.

    This keeps the swevent hash a global entity.

    Signed-off-by: Peter Zijlstra
    Cc: paulus
    Cc: stephane eranian
    Cc: Robert Richter
    Cc: Frederic Weisbecker
    Cc: Lin Ming
    Cc: Yanmin
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Separate find_get_context() from the event allocation and
    initialization so that we may make find_get_context() depend
    on the event pmu in a later patch.

    Signed-off-by: Peter Zijlstra
    Cc: paulus
    Cc: stephane eranian
    Cc: Robert Richter
    Cc: Frederic Weisbecker
    Cc: Lin Ming
    Cc: Yanmin
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Neither the overcommit nor the reservation sysfs parameter was
    actually working; remove them, as they'll only get in the way.

    Signed-off-by: Peter Zijlstra
    Cc: paulus
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Replace pmu::{enable,disable,start,stop,unthrottle} with
    pmu::{add,del,start,stop}, all of which take a flags argument.

    The new interface extends the capability to stop a counter while
    keeping it scheduled on the PMU. We replace the throttled state with
    the generic stopped state.

    This also allows us to efficiently stop/start counters over certain
    code paths (like IRQ handlers).

    It also allows scheduling a counter without it starting, allowing for
    a generic frozen state (useful for rotating stopped counters).

    The stopped state is implemented in two different ways, depending on
    how the architecture implemented the throttled state:

    1) We disable the counter:
       a) if the pmu has per-counter enable bits, we flip that;
       b) otherwise we program a NOP event, preserving the counter
          state.

    2) We store the counter state and ignore all read/overflow events.
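
    The resulting interface, sketched (flag names follow the kernel's
    perf code of this era; treat the details as a sketch):

        #define PERF_EF_START   0x01  /* start the counter when adding */
        #define PERF_EF_RELOAD  0x02  /* reload the count when starting */
        #define PERF_EF_UPDATE  0x04  /* update the count when stopping */

        struct pmu {
                /* ... */
                int  (*add)(struct perf_event *event, int flags);
                void (*del)(struct perf_event *event, int flags);
                void (*start)(struct perf_event *event, int flags);
                void (*stop)(struct perf_event *event, int flags);
        };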

    Signed-off-by: Peter Zijlstra
    Cc: paulus
    Cc: stephane eranian
    Cc: Robert Richter
    Cc: Will Deacon
    Cc: Paul Mundt
    Cc: Frederic Weisbecker
    Cc: Cyrill Gorcunov
    Cc: Lin Ming
    Cc: Yanmin
    Cc: Deng-Cheng Zhu
    Cc: David Miller
    Cc: Michael Cree
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Use hw_perf_event::period_left instead of hw_perf_event::remaining
    and win back 8 bytes.

    Signed-off-by: Peter Zijlstra
    Cc: paulus
    Cc: Frederic Weisbecker
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Provide default implementations for the pmu txn methods, this
    allows us to remove some conditional code.
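
    For pmus that don't implement transactions, the defaults are
    simply nops, along these lines (a sketch of the registration
    path):

        static void perf_pmu_nop_void(struct pmu *pmu) { }
        static int  perf_pmu_nop_int(struct pmu *pmu)  { return 0; }

        if (!pmu->start_txn) {
                pmu->start_txn  = perf_pmu_nop_void;
                pmu->commit_txn = perf_pmu_nop_int;
                pmu->cancel_txn = perf_pmu_nop_void;
        }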

    Signed-off-by: Peter Zijlstra
    Cc: paulus
    Cc: stephane eranian
    Cc: Robert Richter
    Cc: Will Deacon
    Cc: Paul Mundt
    Cc: Frederic Weisbecker
    Cc: Cyrill Gorcunov
    Cc: Lin Ming
    Cc: Yanmin
    Cc: Deng-Cheng Zhu
    Cc: David Miller
    Cc: Michael Cree
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Changes perf_disable() into perf_pmu_disable().
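
    The new form takes the pmu to operate on explicitly:

        void perf_pmu_disable(struct pmu *pmu);
        void perf_pmu_enable(struct pmu *pmu);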

    Signed-off-by: Peter Zijlstra
    Cc: paulus
    Cc: stephane eranian
    Cc: Robert Richter
    Cc: Will Deacon
    Cc: Paul Mundt
    Cc: Frederic Weisbecker
    Cc: Cyrill Gorcunov
    Cc: Lin Ming
    Cc: Yanmin
    Cc: Deng-Cheng Zhu
    Cc: David Miller
    Cc: Michael Cree
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Since the current perf_disable() usage is only an optimization,
    remove it for now. This eases the removal of the __weak
    hw_perf_enable() interface.

    Signed-off-by: Peter Zijlstra
    Cc: paulus
    Cc: stephane eranian
    Cc: Robert Richter
    Cc: Will Deacon
    Cc: Paul Mundt
    Cc: Frederic Weisbecker
    Cc: Cyrill Gorcunov
    Cc: Lin Ming
    Cc: Yanmin
    Cc: Deng-Cheng Zhu
    Cc: David Miller
    Cc: Michael Cree
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Fixup random annoying style bits.

    Signed-off-by: Peter Zijlstra
    Cc: paulus
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Simple registration interface for struct pmu; this provides the
    infrastructure for removing all the weak functions.
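
    The interface, sketched:

        int perf_pmu_register(struct pmu *pmu);
        void perf_pmu_unregister(struct pmu *pmu);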

    Signed-off-by: Peter Zijlstra
    Cc: paulus
    Cc: stephane eranian
    Cc: Robert Richter
    Cc: Will Deacon
    Cc: Paul Mundt
    Cc: Frederic Weisbecker
    Cc: Cyrill Gorcunov
    Cc: Lin Ming
    Cc: Yanmin
    Cc: Deng-Cheng Zhu
    Cc: David Miller
    Cc: Michael Cree
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • sed -ie 's/const struct pmu\>/struct pmu/g' `git grep -l "const struct pmu\>"`

    Signed-off-by: Peter Zijlstra
    Cc: paulus
    Cc: stephane eranian
    Cc: Robert Richter
    Cc: Will Deacon
    Cc: Paul Mundt
    Cc: Frederic Weisbecker
    Cc: Cyrill Gorcunov
    Cc: Lin Ming
    Cc: Yanmin
    Cc: Deng-Cheng Zhu
    Cc: David Miller
    Cc: Michael Cree
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Merge reason: Pick up pending fixes before applying dependent new changes.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Since we have UP_PREPARE, we should also have UP_CANCELED.
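
    In the hotplug notifier this pairs setup with teardown, roughly:

        switch (action & ~CPU_TASKS_FROZEN) {
        case CPU_UP_PREPARE:
                perf_event_init_cpu(cpu);
                break;
        case CPU_UP_CANCELED:   /* undo UP_PREPARE if bringup fails */
        case CPU_DOWN_PREPARE:
                perf_event_exit_cpu(cpu);
                break;
        }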

    Signed-off-by: Peter Zijlstra
    Cc: paulus
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Commit 1c024eca (perf, trace: Optimize tracepoints by using
    per-tracepoint-per-cpu hlist to track events) caused a module
    refcount leak.

    Reported-And-Tested-by: Avi Kivity
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Li Zefan
     

09 Sep, 2010

1 commit

  • Reading the file set_ftrace_filter does three things.

    1) shows whether or not filters are set for the function tracer
    2) shows what functions are set for the function tracer
    3) shows what triggers are set on any functions

    Case 3 is independent of cases 1 and 2.

    The way this file currently works is that it is a state machine,
    and as you read it, it may change state. But this assumption
    breaks when you use lseek() on the file: the state machine gets
    out of sync, and t_show() may use the wrong pointer, causing a
    kernel oops.

    Luckily, this will only kill the app that does the lseek, but the app
    dies while holding a mutex. This prevents anyone else from using the
    set_ftrace_filter file (or any other function tracing file for that matter).

    A real fix for this is to rewrite the code, but that is too much
    for an -rc release or stable. This patch simply disables llseek
    on the set_ftrace_filter file for now; we can do the proper fix
    in the next major release.
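
    The stop-gap amounts to dropping seek support from the file
    operations (a sketch; the handler names are assumptions, but
    no_llseek is the standard way to refuse seeks):

        static const struct file_operations ftrace_filter_fops = {
                .open    = ftrace_filter_open,
                .read    = seq_read,
                .write   = ftrace_filter_write,
                .llseek  = no_llseek,   /* refuse lseek() with -ESPIPE */
                .release = ftrace_regex_release,
        };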

    Reported-by: Robert Swiecki
    Cc: Chris Wright
    Cc: Tavis Ormandy
    Cc: Eugene Teo
    Cc: vendor-sec@lst.de
    Cc:
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

08 Sep, 2010

6 commits

  • Check whether the argument name is invalid (i.e. not a C-like
    symbol name). This keeps the event format simple.
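
    The check is the usual C-identifier test, along these lines
    (sketched with the standard ctype.h; the kernel has its own
    linux/ctype.h equivalents):

        #include <ctype.h>

        static int is_good_name(const char *name)
        {
                if (!isalpha((unsigned char)*name) && *name != '_')
                        return 0;
                while (*++name != '\0') {
                        if (!isalnum((unsigned char)*name) && *name != '_')
                                return 0;
                }
                return 1;
        }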

    Reported-by: Srikar Dronamraju
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Mathieu Desnoyers
    LKML-Reference:
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Arnaldo Carvalho de Melo

    Masami Hiramatsu
     
  • Set "argN" name for each argument automatically if it has no specified name.
    Since dynamic trace event(kprobe_events) accepts special characters for its
    argument, its format can show those special characters (e.g. '$', '%', '+').
    However, perf can't parse those format because of the character (especially
    '%') mess up the format. This sets "argX" name for those arguments if user
    omitted the argument names.

    E.g.
    # echo 'p do_fork %ax IP=%ip $stack' > tracing/kprobe_events
    # cat tracing/kprobe_events
    p:kprobes/p_do_fork_0 do_fork arg1=%ax IP=%ip arg3=$stack

    Reported-by: Srikar Dronamraju
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Mathieu Desnoyers
    LKML-Reference:
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Arnaldo Carvalho de Melo

    Masami Hiramatsu
     
  • Don't make argument names from raw parameters (meaning parameters
    written in kprobe-tracer syntax), because the argument syntax may
    include special characters. Just leave the name empty;
    kprobe-tracer will then assign one.

    Reported-by: Srikar Dronamraju
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Frederic Weisbecker
    LKML-Reference:
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Arnaldo Carvalho de Melo

    Masami Hiramatsu
     
  • Fix a bug so that the %return probe syntax is supported again.
    Previous commit 4235b04 had a bug which disabled the %return
    syntax in perf probe.

    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Frederic Weisbecker
    LKML-Reference:
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Arnaldo Carvalho de Melo

    Masami Hiramatsu
     
  • Fix a memory leak which happens when a field name conflicts with
    others. In the error case, free_trace_probe() will free all
    arguments up to nr_args, so increment nr_args at the beginning of
    the loop instead of at the end.

    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Mathieu Desnoyers
    LKML-Reference:
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Arnaldo Carvalho de Melo

    Masami Hiramatsu
     
  • Add a perf script which shows packet processing and processing
    time. It helps us investigate networking and network devices.

    If you want to use it, install perf and record perf.data as
    follows.

    If you set a script, perf gathers records until it ends.
    If not, you must press Ctrl-C to stop recording.

    And if you want a report from the record,

    If you use some options, you can limit the output.
    The options are below.

    tx: show only tx packets processing
    rx: show only rx packets processing
    dev=: show processing on this device
    debug: work with debug mode. It shows buffer status.

    For example, if you want to show the processing of received
    packets associated with eth4:

    106133.171439sec cpu=0
    irq_entry(+0.000msec irq=24:eth4)
    |
    softirq_entry(+0.006msec)
    |
    |---netif_receive_skb(+0.010msec skb=f2d15900 len=100)
    | |
    | skb_copy_datagram_iovec(+0.039msec 10291::10291)
    |
    napi_poll_exit(+0.022msec eth4)

    This perf script helps us to analyze the processing time of a
    transmit/receive sequence.

    Signed-off-by: Koki Sanagi
    Acked-by: David S. Miller
    Cc: Neil Horman
    Cc: Mathieu Desnoyers
    Cc: Kaneshige Kenji
    Cc: Izumo Taku
    Cc: Kosaki Motohiro
    Cc: Lai Jiangshan
    Cc: Scott Mcmillan
    Cc: Steven Rostedt
    Cc: Eric Dumazet
    Cc: Tom Zanussi
    LKML-Reference:
    Signed-off-by: Frederic Weisbecker

    Koki Sanagi
     

07 Sep, 2010

4 commits

  • This patch adds a tracepoint to consume_skb and adds
    trace_kfree_skb before __kfree_skb in skb_free_datagram_locked
    and net_tx_action. Combining these with the tracepoint on
    dev_hard_start_xmit, we can check how long it takes to free
    transmitted packets, and from that calculate how many packets the
    driver was holding at that time. This is useful when dropped
    transmitted packets are a problem.

    sshd-6828 [000] 112689.258154: consume_skb: skbaddr=f2d99bb8
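
    The hook placement, sketched (abbreviated from the skb free path):

        void consume_skb(struct sk_buff *skb)
        {
                /* ... refcount handling elided ... */
                trace_consume_skb(skb);  /* the new tracepoint */
                __kfree_skb(skb);
        }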

    Signed-off-by: Koki Sanagi
    Acked-by: David S. Miller
    Acked-by: Neil Horman
    Cc: Mathieu Desnoyers
    Cc: Kaneshige Kenji
    Cc: Izumo Taku
    Cc: Kosaki Motohiro
    Cc: Lai Jiangshan
    Cc: Scott Mcmillan
    Cc: Steven Rostedt
    Cc: Eric Dumazet
    LKML-Reference:
    Signed-off-by: Frederic Weisbecker

    Koki Sanagi
     
  • This patch adds tracepoints to dev_queue_xmit, dev_hard_start_xmit,
    netif_rx and netif_receive_skb. These tracepoints help you monitor
    a network driver's input/output.

    -0 [001] 112447.902030: netif_rx: dev=eth1 skbaddr=f3ef0900 len=84
    -0 [001] 112447.902039: netif_receive_skb: dev=eth1 skbaddr=f3ef0900 len=84
    sshd-6828 [000] 112447.903257: net_dev_queue: dev=eth4 skbaddr=f3fca538 len=226
    sshd-6828 [000] 112447.903260: net_dev_xmit: dev=eth4 skbaddr=f3fca538 len=226 rc=0

    Signed-off-by: Koki Sanagi
    Acked-by: David S. Miller
    Acked-by: Neil Horman
    Cc: Mathieu Desnoyers
    Cc: Kaneshige Kenji
    Cc: Izumo Taku
    Cc: Kosaki Motohiro
    Cc: Lai Jiangshan
    Cc: Scott Mcmillan
    Cc: Steven Rostedt
    Cc: Eric Dumazet
    LKML-Reference:
    Signed-off-by: Frederic Weisbecker

    Koki Sanagi
     
  • This patch converts trace_napi_poll from DECLARE_EVENT to
    TRACE_EVENT to improve the usability of the napi_poll tracepoint.

    -0 [001] 241302.750777: napi_poll: napi poll on napi struct f6acc480 for device eth3
    -0 [000] 241302.852389: napi_poll: napi poll on napi struct f5d0d70c for device eth1

    The original patch is below:
    http://marc.info/?l=linux-kernel&m=126021713809450&w=2

    [ sanagi.koki@jp.fujitsu.com: And add a fix by Steven Rostedt:
    http://marc.info/?l=linux-kernel&m=126150506519173&w=2 ]

    Signed-off-by: Neil Horman
    Acked-by: David S. Miller
    Acked-by: Neil Horman
    Cc: Mathieu Desnoyers
    Cc: Kaneshige Kenji
    Cc: Izumo Taku
    Cc: Kosaki Motohiro
    Cc: Lai Jiangshan
    Cc: Scott Mcmillan
    Cc: Steven Rostedt
    Cc: Eric Dumazet
    LKML-Reference:
    Signed-off-by: Koki Sanagi
    Signed-off-by: Frederic Weisbecker

    Neil Horman
     
  • Add a tracepoint for tracing when a softirq action is raised.

    This and the existing tracepoints complete softirq's set of
    tracepoints: softirq_raise, softirq_entry and softirq_exit.

    When this tracepoint is used in combination with the
    softirq_entry tracepoint, we can determine the softirq raise
    latency.
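
    The raise side then looks roughly like this, which is what makes
    the raise-to-entry latency measurable:

        void __raise_softirq_irqoff(unsigned int nr)
        {
                trace_softirq_raise(nr);        /* new: mark the raise */
                or_softirq_pending(1UL << nr);
        }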

    Signed-off-by: Lai Jiangshan
    Acked-by: Mathieu Desnoyers
    Acked-by: Neil Horman
    Cc: David Miller
    Cc: Kaneshige Kenji
    Cc: Izumo Taku
    Cc: Kosaki Motohiro
    Cc: Lai Jiangshan
    Cc: Scott Mcmillan
    Cc: Steven Rostedt
    Cc: Eric Dumazet
    LKML-Reference:
    [ factorize softirq events with DECLARE_EVENT_CLASS ]
    Signed-off-by: Koki Sanagi
    Signed-off-by: Frederic Weisbecker

    Lai Jiangshan
     

03 Sep, 2010

3 commits

  • When the PMU is enabled it is valid to have unhandled NMIs: two
    events could trigger 'simultaneously', raising two back-to-back
    NMIs. If the first NMI handles both, the latter will be empty
    and daze the CPU.

    The solution to avoid an 'unknown nmi' message in this case was
    simply to stop the NMI handler chain when the PMU is enabled, by
    stating the NMI was handled. This has the drawback that a) we
    can not detect unknown NMIs anymore, and b) subsequent NMI
    handlers are not called.

    This patch addresses this. Now we check whether an unknown NMI
    could be a PMU back-to-back NMI; otherwise we pass it on and let
    the kernel handle it as an unknown NMI.

    This is a debug log:

    cpu #6, nmi #32333, skip_nmi #32330, handled = 1, time = 1934364430
    cpu #6, nmi #32334, skip_nmi #32330, handled = 1, time = 1934704616
    cpu #6, nmi #32335, skip_nmi #32336, handled = 2, time = 1936032320
    cpu #6, nmi #32336, skip_nmi #32336, handled = 0, time = 1936034139
    cpu #6, nmi #32337, skip_nmi #32336, handled = 1, time = 1936120100
    cpu #6, nmi #32338, skip_nmi #32336, handled = 1, time = 1936404607
    cpu #6, nmi #32339, skip_nmi #32336, handled = 1, time = 1937983416
    cpu #6, nmi #32340, skip_nmi #32341, handled = 2, time = 1938201032
    cpu #6, nmi #32341, skip_nmi #32341, handled = 0, time = 1938202830
    cpu #6, nmi #32342, skip_nmi #32341, handled = 1, time = 1938443743
    cpu #6, nmi #32343, skip_nmi #32341, handled = 1, time = 1939956552
    cpu #6, nmi #32344, skip_nmi #32341, handled = 1, time = 1940073224
    cpu #6, nmi #32345, skip_nmi #32341, handled = 1, time = 1940485677
    cpu #6, nmi #32346, skip_nmi #32347, handled = 2, time = 1941947772
    cpu #6, nmi #32347, skip_nmi #32347, handled = 1, time = 1941949818
    cpu #6, nmi #32348, skip_nmi #32347, handled = 0, time = 1941951591
    Uhhuh. NMI received for unknown reason 00 on CPU 6.
    Do you have a strange power saving mode enabled?
    Dazed and confused, but trying to continue

    Deltas:

    nmi #32334 340186
    nmi #32335 1327704
    nmi #32336 1819 <<<< back-to-back nmi [1]
    nmi #32337 85961
    nmi #32338 284507
    nmi #32339 1578809
    nmi #32340 217616
    nmi #32341 1798 <<<< back-to-back nmi [2]
    nmi #32342 240913
    nmi #32343 1512809
    nmi #32344 116672
    nmi #32345 412453
    nmi #32346 1462095 <<<< 1st nmi (standard) handling 2 counters
    nmi #32347 2046 <<<< 2nd nmi (back-to-back) handling one counter
    nmi #32348 1773 <<<< 3rd nmi (back-to-back) handling no counter! [3]

    For back-to-back NMI detection there are the following rules:

    The PMU NMI handler handled more than one counter, and no
    counter was handled in the subsequent NMI (see [1] and [2]
    above).

    There is another case with two subsequent back-to-back NMIs [3]:
    the 2nd is detected as back-to-back because the first handled
    more than one counter. If the second handles one counter and the
    3rd handles nothing, we drop the 3rd NMI too, because it could be
    a back-to-back NMI.
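
    The rules condense into a small amount of state. A standalone
    model of the decision (not the kernel code, which keeps this
    per-cpu and also tracks timestamps):

        /* Highest NMI number we may still swallow as a shadow. */
        static unsigned int skip_upto;

        /* Returns non-zero if this PMI should count as handled. */
        static int pmu_nmi_decide(unsigned int this_nmi, int handled)
        {
                if (handled > 1 ||
                    (handled == 1 && this_nmi == skip_upto))
                        /* we may have raised a shadow NMI behind us */
                        skip_upto = this_nmi + 1;

                if (!handled && this_nmi <= skip_upto)
                        return 1;  /* swallow suspected back-to-back NMI */

                return handled;    /* 0 falls through to 'unknown NMI' */
        }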

    Signed-off-by: Robert Richter
    Signed-off-by: Peter Zijlstra
    [ renamed nmi variable to pmu_nmi to avoid clash with .nmi in entry.S ]
    Signed-off-by: Don Zickus
    Cc: peterz@infradead.org
    Cc: gorcunov@gmail.com
    Cc: fweisbec@gmail.com
    Cc: ying.huang@intel.com
    Cc: ming.m.lin@intel.com
    Cc: eranian@google.com
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Robert Richter
     
  • Now that we rely on the number of handled overflows, ensure all
    handle_irq implementations actually return the right number.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Don Zickus
    Cc: peterz@infradead.org
    Cc: robert.richter@amd.com
    Cc: gorcunov@gmail.com
    Cc: fweisbec@gmail.com
    Cc: ying.huang@intel.com
    Cc: ming.m.lin@intel.com
    Cc: eranian@google.com
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • During testing of a patch to stop having the perf subsystem
    swallow NMIs, it was uncovered that Nehalem boxes were randomly
    getting unknown NMIs when using the perf tool.

    Moving the ack'ing of the PMI closer to when we get the status
    allows the hardware to properly re-set the PMU bit signaling that
    another PMI was triggered during the processing of the first
    PMI. This allows the new logic for dealing with the
    shortcomings of multiple PMIs to handle the extra NMI by
    'eat'ing it later.

    Now one can wonder why we are getting a second PMI when we
    disable all the PMUs at the beginning of the NMI handler to
    prevent such a case; that I do not know. But I know the fix
    below helps deal with this quirk.

    Tested on multiple Nehalems where the problem was occurring.
    With the patch, the code now loops a second time to handle the
    second PMI (whereas before it did not).
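
    The reorder, sketched against the Intel handler's helpers:

        again:
                status = intel_pmu_get_status();
                if (!status)
                        goto done;

                intel_pmu_ack_status(status);   /* moved up: ack as soon as
                                                   we read the status, so a
                                                   second PMI can latch */

                /* ... service the overflowed counters ... */
                goto again;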

    Signed-off-by: Don Zickus
    Cc: peterz@infradead.org
    Cc: robert.richter@amd.com
    Cc: gorcunov@gmail.com
    Cc: fweisbec@gmail.com
    Cc: ying.huang@intel.com
    Cc: ming.m.lin@intel.com
    Cc: eranian@google.com
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Don Zickus