04 Jul, 2011

2 commits


03 Jul, 2011

6 commits

  • rbp is used in SAVE_ARGS_IRQ to save the old stack pointer
    in order to restore it later in ret_from_intr.

    It is convenient because we save its value in the irq regs
    and it's easily restored using the leave instruction.

    However, this is a kind of abuse of the frame pointer, whose
    role is to help unwind the kernel by chaining frames together,
    each node following the return address to the previous frame.

    But since we are breaking the frame by changing the stack
    pointer, there is no preceding return address before the new
    frame. Hence using the frame pointer to link the two stacks
    breaks stack unwinders, which find a random value instead of a
    return address there.

    There is no workaround that works in every case. The
    fixup_bp_irq_link() function dereferences that abused frame
    pointer in the case of a non-nesting interrupt (which means the
    stack changed).
    But that doesn't fix the case of interrupts that don't change the
    stack (where we still have the unconditional frame link), which
    is the case of a hardirq interrupting a softirq. We have no way
    to detect this transition, so the frame irq link is treated as a
    real frame pointer and the return address there is dereferenced,
    but it is a spurious one.

    There are two possible outcomes: either the spurious return
    address, a random stack value, luckily belongs to the kernel text,
    in which case unwinding can continue and we just get a weird entry
    in the stack trace; or it doesn't belong to the kernel text and
    unwinding stops there.

    This is the reason why stacktraces (including perf callchains) on
    irqs that interrupted softirqs don't work very well.

    To solve this, we no longer save the old stack pointer in rbp.
    Instead we save it in a scratch register that we push onto the
    new stack and pop back later on irq return.

    This preserves the whole frame chain without spurious return addresses
    in the middle and drops the need for the horrid fixup_bp_irq_link()
    workaround.

    And finally, irqs that interrupt softirqs are sanely unwound.

    Before:

    99.81% perf [kernel.kallsyms] [k] perf_pending_event
    |
    --- perf_pending_event
    irq_work_run
    smp_irq_work_interrupt
    irq_work_interrupt
    |
    |--41.60%-- __read
    | |
    | |--99.90%-- create_worker
    | | bench_sched_messaging
    | | cmd_bench
    | | run_builtin
    | | main
    | | __libc_start_main
    | --0.10%-- [...]

    After:

    1.64% swapper [kernel.kallsyms] [k] perf_pending_event
    |
    --- perf_pending_event
    irq_work_run
    smp_irq_work_interrupt
    irq_work_interrupt
    |
    |--95.00%-- arch_irq_work_raise
    | irq_work_queue
    | __perf_event_overflow
    | perf_swevent_overflow
    | perf_swevent_event
    | perf_tp_event
    | perf_trace_softirq
    | __do_softirq
    | call_softirq
    | do_softirq
    | irq_exit
    | |
    | |--73.68%-- smp_apic_timer_interrupt
    | | apic_timer_interrupt
    | | |
    | | |--96.43%-- amd_e400_idle
    | | | cpu_idle
    | | | start_secondary

    Signed-off-by: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: H. Peter Anvin
    Cc: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Jan Beulich

    Frederic Weisbecker
     
  • The unwinder backlink in the interrupt entry is useless: it is
    not actually part of the stack frame chain and is thus never
    used.

    Signed-off-by: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: H. Peter Anvin
    Cc: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Jan Beulich

    Frederic Weisbecker
     
  • Just for clarity in the code. Have a first block that handles
    the frame pointer and a separate one that handles pt_regs
    pointer and its use.

    Signed-off-by: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: H. Peter Anvin
    Cc: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Jan Beulich

    Frederic Weisbecker
     
  • The save_regs function that saves the regs on low-level irq
    entry is complicated by the fact that it switches stacks in the
    middle, and that it manipulates data allocated in the caller's
    frame, with accesses calculated directly from the callee's rsp
    value and the return address sitting in the middle of the way.

    This complicates the static stack offset calculations and
    requires more dynamic ones. It also needs a save/restore of the
    function's return address.

    To simplify and optimize this, turn save_regs() into a
    macro.

    Signed-off-by: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: H. Peter Anvin
    Cc: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Jan Beulich

    Frederic Weisbecker
     
  • When regs are passed to dump_stack(), we fetch the frame
    pointer from the regs but the stack pointer is taken from
    the current frame.

    Thus the frame and stack pointers may not come from the same
    context. For example, this can lead the unwinder to think the
    context is in an irq, due to the current value of the stack
    pointer, while the frame pointer coming from the regs points to
    a frame from somewhere else. It then tries to fix up the irq
    link but ends up dereferencing a random frame pointer that
    doesn't belong to the irq stack:

    [ 9131.706906] ------------[ cut here ]------------
    [ 9131.707003] WARNING: at arch/x86/kernel/dumpstack_64.c:129 dump_trace+0x2aa/0x330()
    [ 9131.707003] Hardware name: AMD690VM-FMH
    [ 9131.707003] Perf: bad frame pointer = 0000000000000005 in callchain
    [ 9131.707003] Modules linked in:
    [ 9131.707003] Pid: 1050, comm: perf Not tainted 3.0.0-rc3+ #181
    [ 9131.707003] Call Trace:
    [ 9131.707003] [] warn_slowpath_common+0x7a/0xb0
    [ 9131.707003] [] warn_slowpath_fmt+0x41/0x50
    [ 9131.707003] [] ? bad_to_user+0x6d/0x10be
    [ 9131.707003] [] dump_trace+0x2aa/0x330
    [ 9131.707003] [] ? native_sched_clock+0x13/0x50
    [ 9131.707003] [] perf_callchain_kernel+0x54/0x70
    [ 9131.707003] [] perf_prepare_sample+0x19f/0x2a0
    [ 9131.707003] [] __perf_event_overflow+0x16c/0x290
    [ 9131.707003] [] ? __perf_event_overflow+0x130/0x290
    [ 9131.707003] [] ? native_sched_clock+0x13/0x50
    [ 9131.707003] [] ? sched_clock+0x9/0x10
    [ 9131.707003] [] ? T.375+0x15/0x90
    [ 9131.707003] [] ? trace_hardirqs_on_caller+0x64/0x180
    [ 9131.707003] [] ? trace_hardirqs_off+0xd/0x10
    [ 9131.707003] [] perf_event_overflow+0x14/0x20
    [ 9131.707003] [] perf_swevent_hrtimer+0x11c/0x130
    [ 9131.707003] [] ? error_exit+0x51/0xb0
    [ 9131.707003] [] __run_hrtimer+0x83/0x1e0
    [ 9131.707003] [] ? perf_event_overflow+0x20/0x20
    [ 9131.707003] [] hrtimer_interrupt+0x106/0x250
    [ 9131.707003] [] ? trace_hardirqs_off_thunk+0x3a/0x3c
    [ 9131.707003] [] smp_apic_timer_interrupt+0x53/0x90
    [ 9131.707003] [] apic_timer_interrupt+0x13/0x20
    [ 9131.707003] [] ? error_exit+0x51/0xb0
    [ 9131.707003] [] ? error_exit+0x4c/0xb0
    [ 9131.707003] ---[ end trace b2560d4876709347 ]---

    Fix this by simply taking the stack pointer from regs->sp
    when regs are provided.

    Signed-off-by: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: H. Peter Anvin
    Cc: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo

    Frederic Weisbecker
     
  • In order to prepare for fetching the stack pointer from the
    regs, when possible, in dump_trace() instead of taking the local
    one, save the current stack pointer in the perf live regs
    snapshot.

    Signed-off-by: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: H. Peter Anvin
    Cc: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo

    Frederic Weisbecker
     

01 Jul, 2011

13 commits

  • The patch a8b0ca17b80e ("perf: Remove the nmi parameter from the swevent
    and overflow interface") missed a spot in the ppc hw_breakpoint code,
    fix this up so things compile again.

    Reported-by: Ingo Molnar
    Cc: Anton Blanchard
    Cc: Eric B Munson
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-09pfip95g88s70iwkxu6nnbt@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The v1 PMU does not have any fixed counters. Using the v2 constraints,
    which do have fixed counters, causes an additional choice to be present
    in the weight calculation, but not when actually scheduling the event,
    leading to the event not being scheduled at all.

    Signed-off-by: Avi Kivity
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1309362157-6596-3-git-send-email-avi@redhat.com
    Signed-off-by: Ingo Molnar

    Avi Kivity
     
  • The perf_event overflow handler does not receive any caller-derived
    argument, so many callers need to resort to looking up the perf_event
    in their local data structure. This is ugly and doesn't scale if a
    single callback services many perf_events.

    Fix by adding a context parameter to perf_event_create_kernel_counter()
    (and derived hardware breakpoints APIs) and storing it in the perf_event.
    The field can be accessed from the callback as event->overflow_handler_context.
    All callers are updated.

    Signed-off-by: Avi Kivity
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1309362157-6596-2-git-send-email-avi@redhat.com
    Signed-off-by: Ingo Molnar

    Avi Kivity
     
  • Add a NODE level to the generic cache events, which is used to
    measure local vs remote memory accesses. Like all other cache
    events, an ACCESS is HIT+MISS; if there is no way to distinguish
    between reads and writes, do reads only, etc.

    The below needs filling out for !x86 (which I filled out with
    unsupported events).

    I'm fairly sure ARM can leave it like that since it doesn't strike me as
    an architecture that even has NUMA support. SH might have something since
    it does appear to have some NUMA bits.

    Sparc64, PowerPC and MIPS certainly want a good look there since they
    clearly are NUMA capable.

    Signed-off-by: Peter Zijlstra
    Cc: David Miller
    Cc: Anton Blanchard
    Cc: David Daney
    Cc: Deng-Cheng Zhu
    Cc: Paul Mundt
    Cc: Will Deacon
    Cc: Robert Richter
    Cc: Stephane Eranian
    Link: http://lkml.kernel.org/r/1303508226.4865.8.camel@laptop
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Since the OFFCORE registers are fully symmetric, try the other one
    when the specified one is already in use.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1306141897.18455.8.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • This patch adds Intel Sandy Bridge offcore_response support by
    providing the low-level constraint table for those events.

    On Sandy Bridge, there are two offcore_response events. Each uses
    its own dedicated extra register. But those registers are NOT shared
    between sibling CPUs when HT is on unlike Nehalem/Westmere. They are
    always private to each CPU. But they still need to be controlled within
    an event group. All events within an event group must use the same
    value for the extra MSR. That's not controlled by the second patch in
    this series.

    Furthermore on Sandy Bridge, the offcore_response events have NO
    counter constraints contrary to what the official documentation
    indicates, so drop the events from the constraint table.

    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110606145712.GA7304@quad
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     
  • The validate_group() function needs to validate events with
    extra shared regs. Within an event group, only events with
    the same value for the extra reg can co-exist. This was not
    checked by validate_group() because it was missing the
    shared_regs logic.

    This patch changes the allocation of the fake cpuc used for
    validation to also point to a fake shared_regs structure, so
    that group events are properly tested.

    It modifies __intel_shared_reg_get_constraints() to use
    spin_lock_irqsave() to avoid lockdep issues.

    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110606145708.GA7279@quad
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     
  • This patch improves the code managing the extra shared registers
    used for offcore_response events on Intel Nehalem/Westmere. The
    idea is to use static allocation instead of dynamic allocation.
    This greatly simplifies the get and put constraint routines for
    those events.

    The patch also renames per_core to shared_regs because the same
    data structure gets used whether or not HT is on. When HT is
    off, those events still need coordination because they use an
    extra MSR that has to be shared within an event group.

    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110606145703.GA7258@quad
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     
  • Since only samples call perf_output_sample(), it's much saner
    (and more correct) to put the sample logic in there than in the
    perf_output_begin()/perf_output_end() pair.

    Saves a useless argument, reduces conditionals and shrinks
    struct perf_output_handle, win!

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-2crpvsx3cqu67q3zqjbnlpsc@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The nmi parameter indicated whether we could do wakeups from the
    current context; if not, we would set some state and self-IPI
    and let the resulting interrupt do the wakeup.

    For the various event classes:

    - hardware: nmi=0; PMI is in fact an NMI or we run irq_work_run
      from the PMI-tail (ARM etc.)
    - tracepoint: nmi=0; since tracepoint could be from NMI context.
    - software: nmi=[0,1]; some, like the schedule thing, cannot
      perform wakeups, and hence need 0.

    As one can see, there is very little nmi=1 usage, and the down-side of
    not using it is that on some platforms some software events can have a
    jiffy delay in wakeup (when arch_irq_work_raise isn't implemented).

    The up-side however is that we can remove the nmi parameter and save a
    bunch of conditionals in fast paths.

    Signed-off-by: Peter Zijlstra
    Cc: Michael Cree
    Cc: Will Deacon
    Cc: Deng-Cheng Zhu
    Cc: Anton Blanchard
    Cc: Eric B Munson
    Cc: Heiko Carstens
    Cc: Paul Mundt
    Cc: David S. Miller
    Cc: Frederic Weisbecker
    Cc: Jason Wessel
    Cc: Don Zickus
    Link: http://lkml.kernel.org/n/tip-agjev8eu666tvknpb3iaj0fg@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Due to restrictions and specifics of the Netburst PMU we need a
    separate event for the NMI watchdog. In particular, every
    Netburst event consumes not just a counter and a config
    register, but also an additional ESCR register.

    Since ESCR registers are grouped upon counters (i.e. if an ESCR
    is occupied by some event there is no room for another event to
    enter until it's released) we need to pick the "least" used ESCR
    (or the most available one) for nmi-watchdog purposes -- so
    MSR_P4_CRU_ESCR2/3 was chosen.

    With this patch nmi-watchdog and perf top should be able to run simultaneously.

    Signed-off-by: Cyrill Gorcunov
    CC: Lin Ming
    CC: Arnaldo Carvalho de Melo
    CC: Frederic Weisbecker
    Tested-and-reviewed-by: Don Zickus
    Tested-and-reviewed-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110623124918.GC13050@sun
    Signed-off-by: Ingo Molnar

    Cyrill Gorcunov
     
  • Commit e360adbe29 ("irq_work: Add generic hardirq context
    callbacks") fouled up the Alpha bit, not properly naming the
    arch specific function that raises the 'self-IPI'.

    Signed-off-by: Peter Zijlstra
    Cc: Michael Cree
    Cc: stable@kernel.org # 37+
    Link: http://lkml.kernel.org/n/tip-gukh0txmql2l4thgrekzzbfy@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Commit e360adbe29 ("irq_work: Add generic hardirq context
    callbacks") fouled up the ppc bit, not properly naming the
    arch specific function that raises the 'self-IPI'.

    Cc: Huang Ying
    Cc: Benjamin Herrenschmidt
    Cc: Anton Blanchard
    Cc: Eric B Munson
    Cc: stable@kernel.org # 37+
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-eg0aqien8p1aqvzu9dft6dtv@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

28 Jun, 2011

2 commits

  • To make SLUB work on UML we need this_cpu_cmpxchg from
    asm-generic/percpu.h.

    Signed-off-by: Richard Weinberger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Weinberger
     
  • commit 21a3c96 uses node_start/end_pfn(nid) for detecting the
    start/end of nodes. But it's not defined in linux/mmzone.h; it's
    defined in /arch/???/include/mmzone.h, which is included only
    under CONFIG_NEED_MULTIPLE_NODES=y.

    Then, we see
    mm/page_cgroup.c: In function 'page_cgroup_init':
    mm/page_cgroup.c:308: error: implicit declaration of function 'node_start_pfn'
    mm/page_cgroup.c:309: error: implicit declaration of function 'node_end_pfn'

    So, fixing page_cgroup.c is one idea...

    But node_start_pfn()/node_end_pfn() are very generic macros and
    should be implemented in the same manner for all archs.
    (m32r has a different implementation...)

    This patch removes the definitions of node_start/end_pfn() in
    each arch and defines a unified one in linux/mmzone.h. It is no
    longer under CONFIG_NEED_MULTIPLE_NODES.

    A result of macro expansion is here (mm/page_cgroup.c)

    for !NUMA
    start_pfn = ((&contig_page_data)->node_start_pfn);
    end_pfn = ({ pg_data_t *__pgdat = (&contig_page_data); __pgdat->node_start_pfn + __pgdat->node_spanned_pages;});

    for NUMA (x86-64)
    start_pfn = ((node_data[nid])->node_start_pfn);
    end_pfn = ({ pg_data_t *__pgdat = (node_data[nid]); __pgdat->node_start_pfn + __pgdat->node_spanned_pages;});

    Changelog:
    - fixed to avoid using "nid" twice in node_end_pfn() macro.

    Reported-and-acked-by: Randy Dunlap
    Reported-and-tested-by: Ingo Molnar
    Acked-by: Mel Gorman
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

27 Jun, 2011

2 commits

  • * 'fixes' of master.kernel.org:/home/rmk/linux-2.6-arm:
    ARM: pm: ensure ARMv7 CPUs save and restore the TLS register
    ARM: pm: proc-v7: fix missing struct processor pointers for suspend code
    ARM: 6969/1: plat-iop: fix build error
    ARM: 6961/1: zImage: Add build-time check for correctly-sized proc_type entries
    ARM: SMP: wait for CPU to be marked active
    ARM: 6963/1: Thumb-2: Relax relocation requirements for non-function symbols
    ARM: 6962/1: mach-h720x: fix build error
    ARM: 6959/1: SMP build fix for entry-macro-multi.S

    Linus Torvalds
     
  • * 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6:
    [S390] allow setting of upper 32 bit in smp_ctl_set_bit
    [S390] hwsampler: Set a sane default sampling rate
    [S390] s390: enforce HW limits for the initial sampling rate
    [S390] kvm-s390: fix kconfig dependencies

    Linus Torvalds
     

24 Jun, 2011

3 commits


22 Jun, 2011

5 commits

  • The bit shift operation in smp_ctl_set_bit does not specify the
    type of the shifted bit, so int is used by default. Therefore it
    is not possible to set bits in the upper 32 bits of the control
    register if the kernel runs in 64-bit mode. Fix this by
    specifying the type as unsigned long.

    Signed-off-by: Jan Glauber
    Signed-off-by: Martin Schwidefsky

    Jan Glauber
     
  • The sampling interval for the hardware sampler is specified in
    cycles (see SA23-2260-01, The Load-Program-Parameter and the
    CPU-Measurement Facilities). The current default value therefore
    results in millions of samples. This patch changes the default
    sampling interval to 4M cycles, which results in ~1500 samples
    per second on a z196, reducing the overhead of sampling.

    Signed-off-by: Christian Borntraeger
    Signed-off-by: Martin Schwidefsky

    Christian Borntraeger
     
  • On specific configurations with hwsampler opcontrol --start returns an
    error on "echo 1 >/dev/oprofile/enable". Turns out that the hw sampling
    interval is not checked against the hardware limits.

    Signed-off-by: Christian Borntraeger
    Signed-off-by: Martin Schwidefsky

    Christian Borntraeger
     
  • A user can create the Kconfig combination !VIRTUALIZATION,
    S390_GUEST, which results in the following warnings:

    warning: (S390_GUEST) selects VIRTIO which has unmet direct dependencies (VIRTUALIZATION)
    warning: (S390_GUEST && VIRTIO_PCI && VIRTIO_BALLOON) selects VIRTIO_RING which has unmet direct dependencies (VIRTUALIZATION && VIRTIO)
    warning: (S390_GUEST) selects VIRTIO which has unmet direct dependencies (VIRTUALIZATION)
    warning: (S390_GUEST && VIRTIO_PCI && VIRTIO_BALLOON) selects VIRTIO_RING which has unmet direct dependencies (VIRTUALIZATION && VIRTIO)

    S390_GUEST has to select VIRTUALIZATION before selecting VIRTIO and
    friends.

    Reported-by: Jan Glauber
    Signed-off-by: Christian Borntraeger
    Signed-off-by: Martin Schwidefsky

    Christian Borntraeger
     
  • MN10300's asm/uaccess.h needs to #include linux/kernel.h to get
    might_sleep(); otherwise it fails to build with an MN10300
    allyesconfig. The build fails in a few places with messages like
    the following:

    In file included from security/keys/trusted.c:14:
    include/linux/uaccess.h: In function '__copy_from_user_nocache':
    include/linux/uaccess.h:52: error: implicit declaration of function 'might_sleep'

    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds

    David Howells
     

21 Jun, 2011

6 commits


20 Jun, 2011

1 commit